Replication Guide

Weak instruments and the AK91 re-examination

Author

Kevin Hu

Published

June 4, 2026

Important: How to use this guide

Prerequisites (required). Complete the AK91 replication guide first — at least the QOB IV logic and AK91 Table V / IV specs. Tier-1 context: Readings Guide.

Software. R with renv::restore(); microdata: Rcode/raw_data.dta (same Census extract as AK91).

Four documents — one workflow.

| Step | Document | Your job |
|---|---|---|
| 1 | This guide | Understand why weak IV undermines AK91’s second stage |
| 2 | Paper (full text) | Read Sections 2–4 cited in each Act |
| 3 | R replication output | Compare coefficients and diagnostics to Tables 1–3 |
| 4 | Rcode/*.R | Run and modify the modular scripts |

This guide is narrative-first (R only). The Table 3 Monte Carlo (500 replications) takes several hours, so this guide covers its design and interpretation; the run commands are on the replication page.

1 Overview: From AK91 headline to weak-IV skepticism

Angrist and Krueger (1991) report that quarter-of-birth IV estimates of the return to schooling on Census microdata are close to OLS, a celebrated result. Bound, Jaeger, and Baker (1995) re-examine the same empirical setting and argue that this conclusion is fragile:

  1. Inconsistency: If instruments are weak, even a tiny correlation between \(Z\) and the structural error can produce a large inconsistency (Section 2.1).
  2. Finite-sample bias: Weak IV pushes 2SLS toward OLS — so “IV ≈ OLS” may reflect bias, not absence of ability bias (Section 2.2).
  3. Simulated instruments: Randomly permuting quarter of birth yields plausible-looking second stages while first-stage \(F \approx 1\) (Table 3).

Same data, different question.

| Paper | Question | Key takeaway |
|---|---|---|
| AK91 Table V | What is the causal return to schooling? | IV ≈ OLS, ~6–8% |
| Bound Table 1 col. (6) | Is IV inference credible given first-stage strength? | \(\hat\beta_{EDUC} \approx 0.060\), first-stage \(F \approx 1.61\) |
| Bound Table 3 | What if instruments carry no information? | Mean \(\hat\beta \approx 0.06\), mean \(F \approx 1\) |

Sample (all Bound tables): 5% Public-Use Sample, 1980 Census; men born 1930–1939 (\(N = 329{,}509\)) — AK91 Table V cohort.

flowchart LR
  AK91[AK91 IV headline]
  FS[First stage F and partial R2]
  SS[Second stage EDUC coef]
  Sim[Table 3 permuted QOB]
  AK91 --> FS
  FS --> SS
  FS --> Sim
  Sim --> Lesson[Report F before trusting SS]

R module map.

| File | Role |
|---|---|
| bound1995-data-prep.R | Load data; QOB/YOB/state dummies |
| bound1995-iv-diagnostics.R | First-stage \(F\), partial \(R^2\), Basmann overid \(F\) |
| bound1995-chunked-iv.R | Large IV sets (Table 2) |
| Table_I.R | Bound Table 1 |
| Table_II.R | Bound Table 2 |
| Table_III.R | Bound Table 3 simulation |

Read first: Bound et al. (1995) HTML · PDF

Suggested path (~2 hours, after AK91). Overview → Act I → Act II → Act IV (read + pilot simulation) → Act V → Capstone.


2 Act I — Two problems with weak instruments

2.1 Story beat

Section 2 states the general IV pathology before Section 3 applies it to AK91. Two distinct problems matter for teaching:

  • Population inconsistency when \(Z\) is weak and slightly invalid.
  • Finite-sample bias when \(Z\) is weak even if exclusion holds exactly.

2.2 Read in the paper

2.3 Identification / econometrics

Inconsistency (intuition). When \(plim(\hat\pi)\) is close to zero (little first-stage relevance), IV effectively divides by a near-zero quantity, so even a small \(Cov(Z, \epsilon)\) is amplified in the second stage.
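A short worked equation may make the amplification concrete. This is a sketch for the single-instrument case only, assuming a structural equation \(y = \beta x + \epsilon\) (with \(x\) standing for EDUC) and a first stage \(x = \pi Z + v\). The symbols \(y\), \(x\), \(v\), and the correlations \(\rho\) are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{plim}\hat\beta_{IV}  &= \beta + \frac{Cov(Z,\epsilon)}{Cov(Z,x)}, &
\operatorname{plim}\hat\beta_{OLS} &= \beta + \frac{Cov(x,\epsilon)}{Var(x)}, \\[6pt]
\frac{\operatorname{plim}\hat\beta_{IV}-\beta}{\operatorname{plim}\hat\beta_{OLS}-\beta}
  &= \frac{\rho_{Z\epsilon}}{\rho_{x\epsilon}\,\rho_{Zx}}.
\end{align*}
```

As an illustrative calculation with made-up values, \(\rho_{Z\epsilon} = 0.001\), \(\rho_{x\epsilon} = 0.1\), and \(\rho_{Zx} = 0.01\) give a ratio of \(0.001 / (0.1 \times 0.01) = 1\): the instrument is a hundred times "cleaner" than the regressor, yet IV is as inconsistent as OLS because the first-stage correlation is so small.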

Finite-sample bias (intuition). With weak instruments, 2SLS behaves like a weighted average of OLS and IV; as \(F \downarrow\), the weight on OLS rises. If OLS is upward biased (ability), weak IV can make 2SLS look like OLS for the wrong reason.
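The pull toward OLS can be checked with a small Monte Carlo. The sketch below uses synthetic data and base R only; the sample size, number of instruments, and all coefficients are illustrative assumptions, not values from the paper or the replication scripts. Exclusion holds exactly by construction, so any gap between 2SLS and the true coefficient is finite-sample bias.

```r
# Minimal Monte Carlo sketch: weak but valid instruments pull 2SLS toward OLS.
# All parameter values are illustrative assumptions.
set.seed(1995)
n      <- 2000   # observations per replication
K      <- 30     # number of excluded instruments
beta   <- 0.10   # true coefficient on x
rho    <- 0.5    # Corr(v, eps): the source of OLS bias
reps   <- 200
pi_vec <- rep(sqrt(0.01 / K), K)  # weak first stage: sum(pi^2) = 0.01

one_draw <- function() {
  Z   <- matrix(rnorm(n * K), n, K)
  v   <- rnorm(n)
  eps <- rho * v + sqrt(1 - rho^2) * rnorm(n)  # eps is independent of Z
  x   <- drop(Z %*% pi_vec) + v
  y   <- beta * x + eps
  fs   <- lm(x ~ Z)                            # first stage
  xhat <- fitted(fs)
  c(ols  = unname(coef(lm(y ~ x))[2]),
    tsls = cov(xhat, y) / cov(xhat, x),        # 2SLS slope
    F    = unname(summary(fs)$fstatistic[1]))  # first-stage F
}

sims <- t(replicate(reps, one_draw()))
res  <- colMeans(sims)
print(round(res, 3))
```

Under these assumptions the mean 2SLS estimate should land well above the true 0.10 and below the mean OLS estimate, with a mean first-stage \(F\) in the low single digits. Raising the entries of `pi_vec` strengthens the first stage and should move 2SLS back toward the truth.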

Link to AK91. AK91 emphasize IV ≈ OLS as evidence against large ability bias. Bound reinterpret: with \(F \approx 1.6\), IV ≈ OLS is expected under weak-IV bias, not proof of a valid causal design.

2.4 Replicate

No standalone table — this act frames Table 1.

library(ivreg)
# After bound1995-data-prep.R:
# m_iv <- ivreg(LWKLYWGE ~ EDUC + controls | QOB_YOB_dummies + controls, data = d)
# m_fs <- lm(EDUC ~ QOB_YOB_dummies + controls, data = d)
# anova(lm(EDUC ~ controls, d), m_fs)  # F on excluded instruments

2.5 Check your work

  • State in one sentence why large \(N\) does not fix weak identification.
  • Contrast exclusion violation vs weak relevance as distinct threats.

2.6 Think

  1. If the true return lies below OLS (positive ability bias), what does weak-IV bias toward OLS do to the IV estimate?
  2. Act VI of the AK91 guide previewed weak IV. What new evidence does Bound add beyond quoting a low \(F\)?

3 Act II — Table 1: AK91 Table V with diagnostics

3.1 Story beat

Section 3.2 replicates AK91 Table V, columns (5)–(8) but adds first-stage \(F\), partial \(R^2\), and Basmann overidentification \(F\) for every IV column. The pedagogical punchline: as the specification matches AK91 more closely, \(F\) collapses.

3.2 Read in the paper

3.3 Identification / econometrics

Bound Table 1 columns (3)–(6) map to AK91 Table V (5)–(8). Simpler columns (1)–(2) use only three quarter dummies as instruments.

| Bound col. | AK91 Table V col. | Controls | Excluded instruments | Paper \(F\) (approx.) |
|---|---|---|---|---|
| (1)–(2) | n/a (simpler) | Age, age\(^2\), demos | 3 QOB dummies | ~13.5 |
| (3)–(4) | (5)–(6) | YOB dummies, demos | QOB × YOB | ~4.7 |
| (5)–(6) | (7)–(8) | YOB, age in quarters, demos | QOB × YOB | ~1.6 |

Partial \(R^2\): share of \(EDUC\) variation explained by excluded instruments after controls. It is tiny in col. (6) — consistent with a weak first stage.
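The two first-stage diagnostics are tightly linked, which a few lines of base R can show. The sketch below runs on synthetic data (the variable names `educ`, `qob`, `age` and every coefficient are assumptions for illustration, not the Census extract) and does not use the helpers in bound1995-iv-diagnostics.R. It computes the \(F\) on the excluded instruments from a nested-model comparison and the partial \(R^2\) from the two residual sums of squares.

```r
# Sketch on synthetic data: first-stage F on excluded instruments and partial R^2.
set.seed(42)
n <- 5000
d <- data.frame(
  qob = factor(sample(1:4, n, replace = TRUE)),
  age = runif(n, 40, 50)
)
# Assumed data-generating process with a small quarter-of-birth effect
d$educ <- 12 + 0.05 * (as.integer(d$qob) - 1) + 0.02 * d$age + rnorm(n, sd = 3)

fs_restricted <- lm(educ ~ age + I(age^2), data = d)        # controls only
fs_full       <- lm(educ ~ age + I(age^2) + qob, data = d)  # controls + instruments

# F-test on the excluded instruments (three QOB dummies)
F_excl <- anova(fs_restricted, fs_full)$F[2]

# Partial R^2: share of residual educ variation explained by the instruments
rss_r      <- sum(resid(fs_restricted)^2)
rss_f      <- sum(resid(fs_full)^2)
partial_r2 <- (rss_r - rss_f) / rss_r

# The same F recovered from the partial R^2 (q = number of excluded instruments)
q         <- 3
F_from_r2 <- (partial_r2 / q) / ((1 - partial_r2) / df.residual(fs_full))

print(c(F_excl = F_excl, F_from_r2 = F_from_r2, partial_r2 = partial_r2))
```

Because \(F\) is a rescaling of the partial \(R^2\) by degrees of freedom, a very large \(N\) can keep \(F\) above one even when the instruments explain almost none of the variation in schooling.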

3.4 Replicate

Warning: Do not misread the IV point estimate

Col. (6) IV \(\hat\beta_{EDUC} \approx 0.060\) is below AK91’s headline IV (~8%). Bound’s point is not “returns are 6%” — it is that with \(F \approx 1.6\), 2SLS is pulled toward OLS and standard second-stage inference is unreliable.

3.5 Check your work

  • Col. (6) IV: \(\hat\beta_{EDUC} \approx 0.060\) (SE ≈0.036); first-stage \(F \approx 1.61\); partial \(R^2 \approx 0.0003\).
  • \(F\) falls moving from cols. (2) → (4) → (6): first-stage strength declines monotonically as the specification moves toward AK91’s.
  • Compare replication vs paper_targets printed when Table_I.R finishes.

3.6 Think

  1. Col. (6) second-stage \(t\)-stat may still look “significant.” Why is that misleading when \(F \approx 1.6\)?
  2. Which diagnostic would you show first in a seminar slide—\(\hat\beta_{EDUC}\) or first-stage \(F\)?

4 Act III — Table 2: State-of-birth interactions

4.1 Story beat

Bound replicate AK91 Table VII, columns (5)–(8) — adding state × quarter interactions to exploit cross-state compulsory-law variation. Precision may improve, but first-stage \(F\) stays near 2: more instruments do not rescue weak identification here.

4.2 Read in the paper

4.3 Identification / econometrics

Excluded set expands to QOB × YOB and QOB × state of birth (plus levels as in AK91). Overidentification tests (Basmann \(F\)) appear in the IV columns and often fail to reject even when the first-stage \(F\) is tiny (foreshadowing Table 3).
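To see what an overidentification test computes, the sketch below implements a Sargan-type \(n R^2\) statistic in base R on synthetic data. This is a swapped-in relative of the Basmann \(F\) that the paper reports, not the same statistic, and it is not the code in bound1995-iv-diagnostics.R. All parameter values are assumptions.

```r
# Sketch: Sargan-type overidentification statistic on synthetic data.
# Note: this is the n * R^2 form, not the Basmann F used in the paper.
set.seed(7)
n <- 3000
K <- 5                                   # excluded instruments (one endogenous regressor)
Z   <- matrix(rnorm(n * K), n, K)
v   <- rnorm(n)
eps <- 0.5 * v + rnorm(n)                # instruments are valid: eps independent of Z
x   <- drop(Z %*% rep(0.3, K)) + v
y   <- 0.1 * x + eps

# 2SLS point estimate via fitted first stage
xhat <- fitted(lm(x ~ Z))
b    <- unname(coef(lm(y ~ xhat)))

# Structural residuals use the actual x, not the fitted values
u <- y - b[1] - b[2] * x

# Regress residuals on all instruments; n * R^2 is compared with chi-square(K - 1)
sargan  <- n * summary(lm(u ~ Z))$r.squared
p_value <- pchisq(sargan, df = K - 1, lower.tail = FALSE)
print(c(sargan = sargan, p_value = p_value))
```

The statistic asks whether the instruments jointly explain the 2SLS residuals. With instruments that barely move schooling, a failure to reject says little about validity, which is the caution this act raises.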

4.4 Replicate

Warning: If Table 2 is missing on the site

The replication page loads pre-built Table_II.RData. If absent, run Table_II.R once locally; chunked IV in bound1995-chunked-iv.R handles the large instrument set.

4.5 Check your work

  • Col. (4) IV: \(\hat\beta_{EDUC} \approx 0.081\); first-stage \(F \approx 1.87\) (still far below conventional thresholds).
  • Contrast partial \(R^2\) and overid \(F\) across IV columns with Table 1.

4.6 Think

  1. When does adding interactions as instruments help first-stage strength — and when does it hurt (collinearity with age/QOB)?
  2. Bound footnote 18 notes even richer state × year × QOB instruments — what would you expect for \(F\)?

5 Act IV — Table 3: Simulated quarter of birth

5.1 Story beat

Alan Krueger’s suggestion (cited in Bound, p. 448): randomly permute quarter of birth within the sample, rebuild instruments, and re-estimate IV. Instruments are irrelevant by construction, yet second-stage output still looks respectable. Only the first stage (\(F \approx 1\)) reveals the problem.

5.2 Read in the paper

5.3 Identification / econometrics

Design (each replication):

  1. Hold sample, controls, and model fixed (four specs: Table 1 cols. (4) and (6); Table 2 cols. (2) and (4)).
  2. Shuffle QOB; rebuild QOB × YOB (and state) dummies.
  3. Estimate IV; store \(\hat\beta_{EDUC}\), SE, first-stage \(F\).

Theory: Mean \(\hat\beta\) should cluster near OLS (~0.06); mean \(F \approx 1\). Asymptotic SEs can track the simulation SD — so “reasonable” \(t\)-stats are not evidence of valid IV.
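The design above can be tried at toy scale before committing to the full run. The following is a minimal sketch on synthetic data, assuming a made-up data-generating process with an unobserved `ability` term that biases OLS upward. It uses base R rather than the repo's helper functions, and the variable names and parameter values are illustrative only.

```r
# Toy version of the permuted-QOB design on synthetic data (assumed DGP).
set.seed(448)
n         <- 5000
reps      <- 100
beta_true <- 0.05
ability   <- rnorm(n)                       # unobserved confounder
d <- data.frame(
  yob = factor(sample(1930:1939, n, replace = TRUE)),
  qob = factor(sample(1:4, n, replace = TRUE))
)
d$educ  <- 12 + ability + rnorm(n, sd = 2)
d$lwage <- beta_true * d$educ + 0.1 * ability + rnorm(n, sd = 0.5)

ols <- unname(coef(lm(lwage ~ educ + yob, data = d))["educ"])

one_rep <- function(d) {
  d$qob <- sample(d$qob)                         # step 2: shuffle QOB
  fs0 <- lm(educ ~ yob, data = d)                # controls only
  fs1 <- lm(educ ~ yob + yob:qob, data = d)      # plus 30 QOB-within-YOB dummies
  d$educ_hat <- fitted(fs1)
  c(beta = unname(coef(lm(lwage ~ educ_hat + yob, data = d))["educ_hat"]),
    F    = anova(fs0, fs1)$F[2])                 # F on excluded instruments
}

sims <- t(replicate(reps, one_rep(d)))           # step 3: store results
print(round(c(ols = ols, colMeans(sims), sd_beta = sd(sims[, "beta"])), 3))
```

In this toy setting the mean permuted-instrument estimate should sit near the (biased) OLS coefficient rather than near `beta_true`, and the mean first-stage \(F\) should be close to 1, mirroring the pattern the guide describes for Table 3.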

5.4 Replicate

Run (from repo root):

# Pilot (~1–2 h with parallel workers)
BOUND95_SIM_REPS=50 Rscript 04-topics/rep-bound1995/Rcode/Table_III.R

# Paper (500 reps; checkpoints resume if interrupted)
BOUND95_SIM_REPS=500 BOUND95_SIM_CORES=7 Rscript 04-topics/rep-bound1995/Rcode/Table_III.R

Checkpoints: Table_III_spec_*.rds in Rcode/.

# Conceptual loop (see Table_III.R for full implementation):
# for (b in seq_len(B)) {
#   d_sim <- d %>% mutate(QOB = sample(QOB))  # permute QOB across the whole sample
#   d_sim <- rebuild_qob_dummies(d_sim)
#   fit <- bound95_run_iv(d_sim, controls, excluded, included)
#   store(coef, se, first_stage_F)
# }

5.5 Replication literacy

| Issue | Lesson |
|---|---|
| Mean \(\hat\beta \approx 0.06\) with fake \(Z\) | Second stage alone is not diagnostic |
| Mean \(F \approx 1\) | First stage is diagnostic |
| Overid often fails to reject | Many weak instruments ≠ valid instruments |
| Long runtime | Report \(F\) and partial \(R^2\) in applied work to avoid needing Table 3 |

This act closes the loop with AK91 guide Act VI.

5.6 Check your work

  • Mean coefficient ≈ 0.060–0.062 (near OLS, not the AK91 IV estimate).
  • Mean first-stage \(F \approx 1\) across all four columns.
  • Table 1 specs: SD of \(\hat\beta \approx 0.038\); 5th–95th percentiles often straddle zero.

5.7 Think

  1. Why do overidentification tests fail to reject with random instruments?
  2. Draft one sentence for a referee report: “The authors should report ___ before interpreting IV.”

6 Act V — Conclusion and modern follow-ups

6.1 Story beat

Section 4 concludes that applied work must report first-stage strength — not only second-stage coefficients. Subsequent literature formalizes tests and alternatives.

6.3 Think

  1. Is \(F > 10\) sufficient for AK91-style \(YQ\) instruments? What would you add?
  2. How does Bound change how you read any paper that tabulates only 2SLS coefficients?

7 Capstone — Weak-IV diagnostic memo

Write ½–1 page in English (bullets OK):

  1. Two questions. Restate AK91’s policy estimand vs Bound’s statistical critique (they are not the same question).
  2. Table 1, col. (6). Report \(\hat\beta_{EDUC}\), first-stage \(F\), and partial \(R^2\); interpret all three together.
  3. Table 3. For one simulated spec, explain why mean \(\hat\beta\) is near OLS and mean \(F\) is near 1.
  4. Your lab policy. When \(F < 10\), what estimators/inferences will you report?
  5. Replication note. One issue from this guide (chunked IV runtime, Table_II.RData, simulation checkpoints, or AK91 vs Bound coefficient comparison).

Deliverable (suggested). PDF memo + console log from Table_I.R and a 50-rep pilot of Table_III.R.


8 Appendix — Replication map and troubleshooting

| Table | Paper | Script | Known issue |
|---|---|---|---|
| 1 | §3.2, AK91 Table V (5)–(8) | Table_I.R | \(F\) must fall across cols. (2)–(6) |
| 2 | §3.2, AK91 Table VII (5)–(8) | Table_II.R | Slow; chunked IV |
| 3 | §3.2, simulated QOB | Table_III.R | BOUND95_SIM_REPS, checkpoints |

Render this page:

quarto render 04-topics/rep-bound1995/replication-bound1995-guide.qmd

Relation to AK91 materials

| AK91 guide | Bound guide |
|---|---|
| Builds QOB natural experiment | Takes experiment as given; attacks IV inference |
| IV ≈ OLS as headline finding | IV ≈ OLS as symptom of weak IV |
| Identification memo (exclusion) | Capstone memo (first-stage reporting) |

9 References

Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” The Quarterly Journal of Economics 106 (4): 979–1014. https://doi.org/10.2307/2937954.
Bound, John, David A. Jaeger, and Regina M. Baker. 1995. “Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak.” Journal of the American Statistical Association 90 (430): 443–50.
Keane, Michael P., and Timothy Neal. 2024. “A Practical Guide to Weak Instruments.” Annual Review of Economics 16 (August): 185–212. https://doi.org/10.1146/annurev-economics-092123-111021.
Staiger, Douglas, and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instruments.” Econometrica 65 (3): 557–86. https://doi.org/10.2307/2171753.
Stock, James H., and Motohiro Yogo. 2002. “Testing for Weak Instruments in Linear IV Regression.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network.