Replication Guide

Weak instruments and the AK91 re-examination

Author

Kevin Hu

Published

June 4, 2026

How to use this guide

Prerequisites (required). Complete the AK91 replication guide first — at least the QOB IV logic and AK91 Table V / IV specs. Tier-1 context: Readings Guide.

Software. R with renv::restore(); microdata: Rcode/raw_data.dta (same Census extract as AK91).

Four documents — one workflow.

Step	Document	Your job
1	This guide	Understand why weak IV undermines AK91’s second stage
2	Paper (full text)	Read Sections 2–4 cited in each Act
3	R replication output	Compare coefficients and diagnostics to Tables 1–3
4	`Rcode/*.R`	Run and modify the modular scripts

This guide is narrative-first (R only). The Table 3 Monte Carlo (500 replications) takes several hours, so this guide concentrates on its design and interpretation; the replication page hosts the full run and its output.

1 Overview: From AK91 headline to weak-IV skepticism

Angrist and Krueger (1991) report that quarter-of-birth IV estimates of the return to schooling are close to OLS on Census microdata — a celebrated result. Bound, Jaeger, and Baker (1995) re-examine the same empirical setting and argue that conclusion is fragile:

Inconsistency: If instruments are weak, even a tiny correlation between \(Z\) and the structural errors can make IV badly inconsistent (Section 2.1).
Finite-sample bias: Weak IV pushes 2SLS toward OLS — so “IV ≈ OLS” may reflect bias, not absence of ability bias (Section 2.2).
Simulated instruments: Randomly permuting quarter of birth yields plausible-looking second stages while first-stage \(F \approx 1\) (Table 3).

Same data, different question.

Paper	Question	Key takeaway
AK91 Table V	What is the causal return to schooling?	IV ≈ OLS, ~6–8%
Bound Table 1 col. (6)	Is IV inference credible given first-stage strength?	\(\hat\beta_{EDUC} \approx 0.060\), first-stage \(F \approx 1.61\)
Bound Table 3	What if instruments carry no information?	Mean \(\hat\beta \approx 0.06\), mean \(F \approx 1\)

Sample (all Bound tables): 5% Public-Use Sample, 1980 Census; men born 1930–1939 (\(N = 329{,}509\)) — AK91 Table V cohort.

flowchart LR
  AK91[AK91 IV headline]
  FS[First stage F and partial R2]
  SS[Second stage EDUC coef]
  Sim[Table 3 permuted QOB]
  AK91 --> FS
  FS --> SS
  FS --> Sim
  Sim --> Lesson[Report F before trusting SS]

R module map.

File	Role
`bound1995-data-prep.R`	Load data; QOB/YOB/state dummies
`bound1995-iv-diagnostics.R`	First-stage \(F\), partial \(R^2\), Basmann overid \(F\)
`bound1995-chunked-iv.R`	Large IV sets (Table 2)
`Table_I.R`	Bound Table 1
`Table_II.R`	Bound Table 2
`Table_III.R`	Bound Table 3 simulation

Read first: Bound et al. (1995) HTML · PDF

Suggested path (~2 hours, after AK91). Overview → Act I → Act II → Act IV (read + pilot simulation) → Act V → Capstone.

2 Act I — Two problems with weak instruments

2.1 Story beat

Section 2 states the general IV pathology before Section 3 applies it to AK91. Two distinct problems matter for teaching:

Population inconsistency when \(Z\) is weak and slightly invalid.
Finite-sample bias when \(Z\) is weak even if exclusion holds exactly.

2.2 Read in the paper

Section 2.1 — Inconsistency
Section 2.2 — Finite-sample bias
Section 3.1 (AK91 application of inconsistency)

2.3 Identification / econometrics

Inconsistency (intuition). When \(plim(\hat\pi) \approx 0\) (very weak first-stage relevance), IV solves a nearly singular problem: the estimator divides by a near-zero covariance between \(Z\) and \(EDUC\), so even a small \(Cov(Z, \epsilon)\) is amplified.
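For a single instrument and no controls, the amplification can be written out explicitly. The expressions below are the textbook probability limits for the model \(y = \beta x + \epsilon\), with \(x\) standing in for \(EDUC\) (a symbol introduced here for brevity); they are a sketch in this guide's notation, not a transcription of the paper's equations.

\[
plim\,\hat\beta_{IV} - \beta = \frac{Cov(Z,\epsilon)}{Cov(Z,x)},
\qquad
\frac{plim\,\hat\beta_{IV} - \beta}{plim\,\hat\beta_{OLS} - \beta}
= \frac{\rho_{Z\epsilon} / \rho_{x\epsilon}}{\rho_{Zx}}
\]

As a worked case: if \(\rho_{Zx} = 0.02\), an instrument whose correlation with the error is only one-fiftieth of the regressor's (\(\rho_{Z\epsilon}/\rho_{x\epsilon} = 0.02\)) gives a ratio of \(0.02 / 0.02 = 1\), so IV is already as inconsistent as OLS.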

Finite-sample bias (intuition). With weak instruments, 2SLS behaves like a weighted average of OLS and IV; as \(F \downarrow\), the weight on OLS rises. If OLS is upward biased (ability), weak IV can make 2SLS look like OLS for the wrong reason.
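The pull toward OLS can be seen in a small Monte Carlo. The sketch below uses synthetic data only and base R (no replication files); every name and parameter value (`n`, `K`, `pi_k`, `rho`) is an assumption chosen so that the expected first-stage \(F\) is roughly 1.6. A commonly used approximation puts the 2SLS bias at about \(1/F\) of the OLS bias, so the simulated share should land well above zero.

```r
# Toy Monte Carlo: 2SLS with many weak instruments drifts toward OLS.
# All names and parameter values are illustrative, not from the replication code.
set.seed(1995)

n    <- 500              # observations per replication
K    <- 20               # number of excluded instruments
reps <- 300              # Monte Carlo replications
beta <- 0                # true coefficient on x
rho  <- 0.5              # corr(v, eps): the source of OLS bias
pi_k <- sqrt(0.6 / n)    # per-instrument strength, so E[F] is roughly 1.6

one_rep <- function() {
  Z   <- matrix(rnorm(n * K), n, K)
  v   <- rnorm(n)
  eps <- rho * v + sqrt(1 - rho^2) * rnorm(n)
  x   <- drop(Z %*% rep(pi_k, K)) + v
  y   <- beta * x + eps

  fs   <- lm.fit(cbind(1, Z), x)            # first stage with intercept
  xhat <- fs$fitted.values
  r2   <- 1 - sum(fs$residuals^2) / sum((x - mean(x))^2)
  Fst  <- (r2 / K) / ((1 - r2) / (n - K - 1))

  c(ols  = cov(x, y) / var(x),              # OLS slope
    tsls = cov(xhat, y) / cov(xhat, x),     # 2SLS slope
    F    = Fst)                             # first-stage F on the K instruments
}

sims     <- t(replicate(reps, one_rep()))
res      <- colMeans(sims)
rel_bias <- unname((res["tsls"] - beta) / (res["ols"] - beta))

print(round(res, 3))
cat("2SLS bias as a share of OLS bias:", round(rel_bias, 2), "\n")
```

Raising `pi_k` (a stronger first stage) should shrink the share toward zero, which is the sense in which the weight on OLS rises as \(F\) falls.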

Link to AK91. AK91 emphasize IV ≈ OLS as evidence against large ability bias. Bound reinterpret: with \(F \approx 1.6\), IV ≈ OLS is expected under weak-IV bias, not proof of a valid causal design.

2.4 Replicate

No standalone table — this act frames Table 1.

library(ivreg)
# After bound1995-data-prep.R. QOB_YOB_dummies and controls are placeholders
# for the actual column names built by that script.
# m_iv <- ivreg(LWKLYWGE ~ EDUC + controls | QOB_YOB_dummies + controls, data = d)
# m_fs <- lm(EDUC ~ QOB_YOB_dummies + controls, data = d)
# anova(lm(EDUC ~ controls, data = d), m_fs)  # F on excluded instruments

2.5 Check your work

State in one sentence why large \(N\) does not fix weak identification.
Contrast exclusion violation vs weak relevance as distinct threats.

2.6 Think

If the true return is below OLS (positive ability bias), what does weak-IV bias toward OLS do to the IV estimate?
The AK91 guide's Act VI previewed weak IV. What new evidence does Bound add beyond quoting a low \(F\)?

3 Act II — Table 1: AK91 Table V with diagnostics

3.1 Story beat

Section 3.2 replicates AK91 Table V, columns (5)–(8) but adds first-stage \(F\), partial \(R^2\), and Basmann overidentification \(F\) for every IV column. The pedagogical punchline: as the specification matches AK91 more closely, \(F\) collapses.

3.2 Read in the paper

Section 3.2 and embedded Table 1
Compare to AK91 Section II.A / AK91 Table V

3.3 Identification / econometrics

Bound Table 1 columns (3)–(6) map to AK91 Table V (5)–(8). Simpler columns (1)–(2) use only three quarter dummies as instruments.

Bound col.	AK91 Table V col.	Controls	Excluded instruments	Paper \(F\) (approx.)
(1)–(2)	— (simpler)	Age, age\(^2\), demos	3 QOB dummies	~13.5
(3)–(4)	(5)–(6)	YOB dummies, demos	QOB × YOB	~4.7
(5)–(6)	(7)–(8)	YOB, age in quarters, demos	QOB × YOB	~1.6

Partial \(R^2\): share of \(EDUC\) variation explained by excluded instruments after controls. It is tiny in col. (6) — consistent with a weak first stage.
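Both diagnostics come from comparing the first stage with and without the excluded instruments. The helper below is a minimal base-R sketch on toy data; `first_stage_diag` and the variables `educ`, `age`, `z1`–`z3` are hypothetical names, and the repo's `bound1995-iv-diagnostics.R` may compute these differently.

```r
# Sketch: first-stage F and partial R^2 for a set of excluded instruments.
# `first_stage_diag` and the toy data are hypothetical, not the repo's API.
first_stage_diag <- function(data, endog, excluded, controls) {
  m_r <- lm(reformulate(controls, response = endog), data = data)
  m_f <- lm(reformulate(c(controls, excluded), response = endog), data = data)

  rss_r <- sum(resid(m_r)^2)                 # controls only
  rss_f <- sum(resid(m_f)^2)                 # controls + excluded instruments
  q     <- m_r$df.residual - m_f$df.residual # number of excluded instruments

  list(
    F          = ((rss_r - rss_f) / q) / (rss_f / m_f$df.residual),
    partial_R2 = (rss_r - rss_f) / rss_r,
    df         = c(q, m_f$df.residual)
  )
}

# Toy data: one control and three weak instruments.
set.seed(1)
n <- 5000
d <- data.frame(age = rnorm(n), z1 = rnorm(n), z2 = rnorm(n), z3 = rnorm(n))
d$educ <- 12 + 0.5 * d$age + 0.03 * (d$z1 + d$z2 + d$z3) + rnorm(n)

diag_out <- first_stage_diag(d, "educ", c("z1", "z2", "z3"), "age")
str(diag_out)
```

The two numbers are linked: \(F = \frac{R^2_p / q}{(1 - R^2_p)/df}\), where \(R^2_p\) is the partial \(R^2\), \(q\) the number of excluded instruments, and \(df\) the residual degrees of freedom. A tiny partial \(R^2\) spread over many instruments can therefore leave \(F\) near 1 even with a very large sample.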

3.4 Replicate

Script: Rcode/Table_I.R
Output: Table 1

Do not misread the IV point estimate

Col. (6) IV \(\hat\beta_{EDUC} \approx 0.060\) is below AK91’s headline IV (~8%). Bound’s point is not “returns are 6%” — it is that with \(F \approx 1.6\), 2SLS is pulled toward OLS and standard second-stage inference is unreliable.

3.5 Check your work

Col. (6) IV: \(\hat\beta_{EDUC} \approx 0.060\) (SE ≈0.036); first-stage \(F \approx 1.61\); partial \(R^2 \approx 0.0003\).
\(F\) falls moving from cols. (2) → (4) → (6): first-stage strength declines monotonically as controls are added to match the AK91 specification.
Compare replication vs paper_targets printed when Table_I.R finishes.

3.6 Think

Col. (6) second-stage \(t\)-stat may still look “significant.” Why is that misleading when \(F \approx 1.6\)?
Which diagnostic would you show first in a seminar slide—\(\hat\beta_{EDUC}\) or first-stage \(F\)?

4 Act III — Table 2: State-of-birth interactions

4.1 Story beat

Bound replicate AK91 Table VII, columns (5)–(8) — adding state × quarter interactions to exploit cross-state compulsory-law variation. Precision may improve, but first-stage \(F\) stays near 2: more instruments do not rescue weak identification here.

4.2 Read in the paper

Table 2 paragraph in Section 3.2
AK91: Table VII

4.3 Identification / econometrics

Excluded set expands to QOB × YOB and QOB × state of birth (plus levels as in AK91). Overidentification tests (Basmann \(F\)) appear in the IV columns and often fail to reject even when the first-stage \(F\) is tiny (foreshadowing Table 3).
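For readers who have not computed an overidentification statistic by hand, here is a base-R sketch of one common form of the Basmann \(F\): project the 2SLS residuals on the full instrument set and compare explained to unexplained variation. The function name `basmann_F`, its arguments, and the simulated data are assumptions for illustration; the statistic reported by the replication scripts may be scaled differently.

```r
# Sketch: Basmann-type overidentification F from 2SLS residuals (base R).
# `basmann_F` and the simulated data are hypothetical illustrations.
basmann_F <- function(y, x, Z, W = NULL) {
  n     <- length(y)
  Wmat  <- cbind(rep(1, n), W)               # intercept + included controls
  Zall  <- cbind(Wmat, Z)                    # full instrument matrix
  xhat  <- lm.fit(Zall, x)$fitted.values     # first-stage fitted values
  b     <- lm.fit(cbind(Wmat, xhat), y)$coefficients
  u     <- drop(y - cbind(Wmat, x) %*% b)    # 2SLS residuals use the actual x
  u_fit <- lm.fit(Zall, u)$fitted.values     # projection of u on the instruments
  df1   <- ncol(Z) - 1                       # overidentifying restrictions (one endogenous x)
  df2   <- n - ncol(Zall)
  Fval  <- (sum(u_fit^2) / df1) / (sum((u - u_fit)^2) / df2)
  c(F = Fval, df1 = df1, df2 = df2,
    p = pf(Fval, df1, df2, lower.tail = FALSE))
}

# Instruments that are valid by construction but very weak.
set.seed(7)
n <- 2000; K <- 5
Z <- matrix(rnorm(n * K), n, K)
v <- rnorm(n)
x <- drop(Z %*% rep(0.02, K)) + v
y <- 0.06 * x + 0.5 * v + rnorm(n)

out <- basmann_F(y, x, Z)
print(round(out, 3))
```

A non-rejection here says only that the instruments agree with one another; it says nothing about whether any of them predicts \(x\), which is why the first-stage \(F\) has to be read alongside it.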

4.4 Replicate

Script: Rcode/Table_II.R — large IV models; often 30+ minutes
Output: Table 2

If Table 2 is missing on the site

The replication page loads pre-built Table_II.RData. If absent, run Table_II.R once locally; chunked IV in bound1995-chunked-iv.R handles the large instrument set.

4.5 Check your work

Col. (4) IV: \(\hat\beta_{EDUC} \approx 0.081\); first-stage \(F \approx 1.87\) (still far below conventional thresholds).
Contrast partial \(R^2\) and overid \(F\) across IV columns with Table 1.

4.6 Think

When does adding interactions as instruments help first-stage strength — and when does it hurt (collinearity with age/QOB)?
Bound footnote 18 notes even richer state × year × QOB instruments — what would you expect for \(F\)?

5 Act IV — Table 3: Simulated quarter of birth

5.1 Story beat

Alan Krueger’s suggestion (cited in Bound, p. 448): randomly permute quarter of birth within the sample, rebuild instruments, and re-estimate IV. Instruments are irrelevant by construction, yet second-stage output still looks respectable. Only the first stage (\(F \approx 1\)) reveals the problem.

5.2 Read in the paper

Table 3 discussion in Section 3.2
Section 2.2 (predictions for the simulation)

5.3 Identification / econometrics

Design (each replication):

Hold sample, controls, and model fixed (four specs: Table 1 cols. (4) and (6); Table 2 cols. (2) and (4)).
Shuffle QOB; rebuild QOB × YOB (and state) dummies.
Estimate IV; store \(\hat\beta_{EDUC}\), SE, first-stage \(F\).

Theory: Mean \(\hat\beta\) should cluster near OLS (~0.06); mean \(F \approx 1\). Asymptotic SEs can track the simulation SD — so “reasonable” \(t\)-stats are not evidence of valid IV.
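The design can be rehearsed in seconds on synthetic data before committing to the full run. The sketch below is not `Table_III.R`: the data, the helper `iv_by_cells`, and all parameter values are invented for illustration, and the OLS bias is deliberately exaggerated so the pattern is visible. It uses YOB dummies as controls and QOB × YOB cells as instruments, then shuffles QOB.

```r
# Toy permutation experiment in the spirit of Table 3 (synthetic data only).
set.seed(448)
n     <- 4000
yob   <- sample(1:10, n, replace = TRUE)
qob   <- sample(1:4,  n, replace = TRUE)
v     <- rnorm(n)
educ  <- 12 + 0.05 * yob + 0.1 * qob + v       # modest real QOB effect
lwage <- 0.06 * educ + 0.5 * v + rnorm(n)      # v in both equations: OLS biased up

# 2SLS with YOB dummies as controls and QOB x YOB cell dummies as instruments.
# With saturated cells, first-stage fitted values are just cell means.
iv_by_cells <- function(y, x, yob, qob) {
  x_t  <- x - ave(x, yob)                      # partial out YOB dummies
  y_t  <- y - ave(y, yob)
  xhat <- ave(x, yob, qob) - ave(x, yob)       # first-stage fit net of controls
  rss  <- sum((x - ave(x, yob, qob))^2)
  q    <- 30                                   # 40 cells minus 10 YOB dummies
  df2  <- length(x) - 40
  c(beta = sum(xhat * y_t) / sum(xhat * x_t),
    F    = (sum(xhat^2) / q) / (rss / df2))
}

ols <- sum((educ - ave(educ, yob)) * (lwage - ave(lwage, yob))) /
       sum((educ - ave(educ, yob))^2)

B    <- 200
sims <- t(replicate(B, iv_by_cells(lwage, educ, yob, sample(qob))))
print(round(c(ols = ols, colMeans(sims), sd_beta = sd(sims[, "beta"])), 3))
```

In this toy setup the true coefficient is 0.06 but OLS sits far above it; the permuted-instrument estimates should center on the OLS value with a mean \(F\) close to 1, mirroring the logic of the paper's table.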

5.4 Replicate

Script: Rcode/Table_III.R
Output: Table 3

Run (from repo root):

# Pilot (~1–2 h with parallel workers)
BOUND95_SIM_REPS=50 Rscript 04-topics/rep-bound1995/Rcode/Table_III.R

# Paper (500 reps; checkpoints resume if interrupted)
BOUND95_SIM_REPS=500 BOUND95_SIM_CORES=7 Rscript 04-topics/rep-bound1995/Rcode/Table_III.R

Checkpoints: Table_III_spec_*.rds in Rcode/.

# Conceptual loop (see Table_III.R for full implementation):
# for (b in seq_len(B)) {
#   d_sim <- d %>% mutate(QOB = sample(QOB))  # permute QOB across the whole sample
#   d_sim <- rebuild_qob_dummies(d_sim)
#   fit <- bound95_run_iv(d_sim, controls, excluded, included)
#   store(coef, se, first_stage_F)
# }

5.5 Replication literacy

Issue	Lesson
Mean \(\hat\beta \approx 0.06\) with fake \(Z\)	Second stage alone is not diagnostic
Mean \(F \approx 1\)	First stage is diagnostic
Overid often fails to reject	Many weak instruments ≠ valid instruments
Long runtime	Report \(F\) and partial \(R^2\) in applied work to avoid needing Table 3

This act closes the loop with AK91 guide Act VI.

5.6 Check your work

Mean coefficient ≈ 0.060–0.062 (near OLS, not AK IV).
Mean first-stage \(F \approx 1\) across all four columns.
Table 1 specs: SD of \(\hat\beta \approx 0.038\); 5th–95th percentiles often straddle zero.

5.7 Think

Why do overidentification tests fail to reject with random instruments?
Draft one sentence for a referee report: “The authors should report ___ before interpreting IV.”

6 Act V — Conclusion and modern follow-ups

6.1 Story beat

Section 4 concludes that applied work must report first-stage strength — not only second-stage coefficients. Subsequent literature formalizes tests and alternatives.

6.2 Course extensions (links only)

Weak IV session — full reading list
(Staiger and Stock 1997) — weak-IV asymptotics; TSLS can mimic OLS
(Stock and Yogo 2002) — \(F > 10\) rule and its limits
(Keane and Neal 2024) — modern reporting checklist (readings guide)

Minimum reporting package (applied PhD):

First-stage \(F\) (or Kleibergen–Paap) on excluded instruments
Partial \(R^2\) for excluded instruments
Weak-IV-robust inference when \(F\) is low: Anderson–Rubin (AR) or conditional likelihood ratio (CLR) confidence sets, with LIML as a less biased point estimator
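Of the robust options, the Anderson–Rubin set is the easiest to build by hand: for each candidate \(\beta_0\), regress \(y - \beta_0 x\) on the instruments and keep the values where the joint \(F\)-test does not reject. The base-R sketch below assumes one endogenous regressor, no controls, and homoskedastic errors; `ar_confidence_set` and the toy data are hypothetical, and a dedicated package would be the safer choice in applied work.

```r
# Sketch: Anderson-Rubin confidence set by test inversion on a grid (base R).
# With controls, they would be added to both the null and the full regression.
ar_confidence_set <- function(y, x, Z, grid, level = 0.95) {
  Zc  <- cbind(1, Z)
  df1 <- ncol(Z)
  df2 <- length(y) - ncol(Zc)
  ar_F <- vapply(grid, function(b0) {
    e    <- y - b0 * x                         # residual under H0: beta = b0
    rss1 <- sum(lm.fit(Zc, e)$residuals^2)     # instruments included
    rss0 <- sum((e - mean(e))^2)               # intercept only
    ((rss0 - rss1) / df1) / (rss1 / df2)
  }, numeric(1))
  list(grid = grid, F = ar_F, accepted = grid[ar_F <= qf(level, df1, df2)])
}

# Toy data with a moderately informative first stage.
set.seed(3)
n <- 1000
Z <- matrix(rnorm(n * 3), n, 3)
v <- rnorm(n)
x <- drop(Z %*% c(0.15, 0.10, 0.05)) + v
y <- 0.5 * x + 0.5 * v + rnorm(n)

ar <- ar_confidence_set(y, x, Z, grid = seq(-1, 2, length.out = 301))
if (length(ar$accepted) > 0) print(range(ar$accepted))
```

Under weak instruments the accepted set can be very wide, unbounded, disjoint, or empty, so a finite grid only approximates it; a wide set is itself informative about how little the instruments identify.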

6.3 Think

Is \(F > 10\) sufficient for AK91-style \(YQ\) instruments? What would you add?
How does Bound change how you read any paper that only tables 2SLS coefficients?

7 Capstone — Weak-IV diagnostic memo

Write ½–1 page in English (bullets OK):

Two questions. Restate AK91’s policy estimand vs Bound’s statistical critique (they are not the same question).
Table 1, col. (6). Report \(\hat\beta_{EDUC}\), first-stage \(F\), and partial \(R^2\); interpret all three together.
Table 3. For one simulated spec, explain why mean \(\hat\beta\) is near OLS and mean \(F\) is near 1.
Your lab policy. When \(F < 10\), what estimators/inferences will you report?
Replication note. One issue from this guide (chunked IV runtime, Table_II.RData, simulation checkpoints, or AK vs Bound coefficient comparison).

Deliverable (suggested). PDF memo + console log from Table_I.R and a 50-rep pilot of Table_III.R.

8 Appendix — Replication map and troubleshooting

Table	Paper	Script	Known issue
1	§3.2, AK91 Table V (5)–(8)	`Table_I.R`	\(F\) must fall across cols. (2)–(6)
2	§3.2, AK91 Table VII (5)–(8)	`Table_II.R`	Slow; chunked IV
3	§3.2, simulated QOB	`Table_III.R`	`BOUND95_SIM_REPS`, checkpoints

Render this page:

quarto render 04-topics/rep-bound1995/replication-bound1995-guide.qmd

Relation to AK91 materials

AK91 guide	Bound guide
Builds QOB natural experiment	Takes experiment as given; attacks IV inference
IV ≈ OLS as headline finding	IV ≈ OLS as symptom of weak IV
Identification memo (exclusion)	Capstone memo (first-stage reporting)

9 References

Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” The Quarterly Journal of Economics 106 (4): 979–1014. https://doi.org/10.2307/2937954.

Bound, John, David A. Jaeger, and Regina M. Baker. 1995. “Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak.” Journal of the American Statistical Association 90 (430): 443–50.

Keane, Michael P., and Timothy Neal. 2024. “A Practical Guide to Weak Instruments.” Annual Review of Economics 16 (August): 185–212. https://doi.org/10.1146/annurev-economics-092123-111021.

Staiger, Douglas, and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instruments.” Econometrica 65 (3): 557–86. https://doi.org/10.2307/2171753.

Stock, James H., and Motohiro Yogo. 2002. “Testing for Weak Instruments in Linear IV Regression.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network.