Key paths under replication_aer/ (bundled file inventory). MISSING items must be created locally or replaced by R equivalents.

| Path | Status | Detail |
|---|---|---|
| basicvariables.dta | **OK** | 127.4 KB |
| localvariables.dta | **OK** | 59.1 KB |
| school.dta | **OK** | 15.1 KB |
| dataset.dta | **MISSING** | Not found |
| newid.dta | **MISSING** | Not found |
| bootdata | **OK** | 253 file(s) |
| out2_ddd | **MISSING** | Not found |
| mainresults/table4a/data_with_tuition.RData | **OK** | 142.7 KB |
| mainresults/out | **OK** | 2 file(s) |
| normalmodel/out | **OK** | 1 file(s) |
| table6/sls/datab.dta | **MISSING** | Not found |
| table6/bootdata_nodrop | **OK** | 2 file(s) |
| normalmodel/mtexvb.out | **OK** | 4.4 KB |
Minimum Runnable Checklist
CHV2011 replication — data, software, and table unlock map
This checklist walks through the minimum steps needed to run the Carneiro, Heckman and Vytlacil (2011) replication in this repo. The AER package under replication_aer/ ships code and partial data, not pre-computed bootstrap outputs. Use this page to see what you have, what is missing, and which paper tables each step unlocks.
Related pages: Paper (full text) · Replication guide · R replication output · Author README.txt
Implementation: complete — all Table_*.R / Figure_*.R scripts, shared modules, and replicate-chv2011-R.qmd are wired. HTML output renders from cached Rcode/*.RData.
Paper-grade bootstrap runs (bundled data):
| Item | Status | Cached B | Notes |
|---|---|---|---|
| P1 bootdata/ | Done | 250 | R getboot.R |
| Table 3 / A-4 SEs | Done | 250 | Control derivatives ≈ paper; IV terms need geocode |
| Table 4(a) | Run complete | 1000 | HC asymptotic deg-2 ≈ 0.035; bootstrap p-values do not match paper on bundled data |
| Table A-5 | Run complete | 500×100 | RW critical ≈ 0.054; p-values fail to reject (paper rejects) |
| Table 4(b) / 5 | Done | 250 | IV/OLS magnitude reasonable; semiparametric MPRTE deviates from GAUSS |
| Table 6 | Run complete | 250 | Eight columns + bootstrap SEs; R polynomial core — point estimates differ from paper (e.g. baseline ATE 0.044 vs 0.082) |
| Figures 1–7 | Done | — | R proxies; Fig 1 uses shipped mtexvb.out |
Not done (author / publication match): geocode dataset.dta, superboot → mainresults/out/*.out, table6/bootdata_nodrop/, GAUSS cdens pipeline, npindex SLS (>24h).
1 Quick links
| Resource | Path |
|---|---|
| Bundled NLSY ingredients | replication_aer/basicvariables.dta, localvariables.dta, school.dta |
| R data prep | Rcode/chv2011-data-prep.R |
| R MTE core | Rcode/chv2011-mte-core.R |
| Table scripts | Rcode/Table_*.R |
| Author Table 4a | replication_aer/mainresults/table4a/bootstrap_MT.R |
2 Step 0 — Environment scan (auto)
Data mode detected: bundled_aligned — Merging basicvariables.dta and localvariables.dta by row order (author-aligned replication bundle; N should be 1747).
2.1 Bundled file inventory
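The inventory can also be reproduced by hand with a short shell check. The sketch below is not part of the repo: `check_paths` and the `CHV_ROOT` variable are hypothetical names, and the path list is copied from the "Key paths under replication_aer/" table. It only tests for existence and does not report sizes or file counts.

```shell
# Report OK/MISSING for each path relative to a root directory.
# check_paths and CHV_ROOT are illustrative names, not part of the repo.
check_paths() {
  cp_root="$1"
  shift
  for cp_path in "$@"; do
    if [ -e "$cp_root/$cp_path" ]; then
      printf '%s\tOK\n' "$cp_path"
    else
      printf '%s\tMISSING\n' "$cp_path"
    fi
  done
}

# Path list taken from the inventory table; root defaults to replication_aer.
check_paths "${CHV_ROOT:-replication_aer}" \
  basicvariables.dta localvariables.dta school.dta \
  dataset.dta newid.dta bootdata out2_ddd \
  mainresults/table4a/data_with_tuition.RData mainresults/out \
  normalmodel/out table6/sls/datab.dta table6/bootdata_nodrop \
  normalmodel/mtexvb.out
```

Run it from the directory that contains replication_aer/, or set CHV_ROOT to point at the bundle.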
2.2 Sample sanity check (bundled merge)
Paper Table A-3 reports \(N = 1747\) (882 / 865) and group mean log wages. Compare your merged sample:
Merged sample vs paper (Table A-3)

| Metric | Your data | Paper |
|---|---|---|
| N | 1747.0000 | 1747 |
| S = 0 | 882.0000 | 882 |
| S = 1 | 865.0000 | 865 |
| mean wage (S = 0) | 2.2090 | 2.2089 |
| mean wage (S = 1) | 2.5497 | 2.5496 |
Without geocode newid.dta, the AER bundle is merged by row order: bind_cols(basicvariables, localvariables[-newid]). Core demographics (wage, exp, cafqt, …) match Table A-3 to rounding; local instruments (pub4, tuition, lwage5, …) deviate because cor(caseid, newid) ≈ 0.88 — not 1. Full Table 3 / IV joint-test alignment requires P1 (dataset.dta via getdataset.do).
Table A-3 alignment (P0). Bundled AER files: bind_cols(basic, local[-newid]) by row order. Without geocode newid.dta, local instruments are only approximately aligned (cor(caseid, newid) = 0.881).

| Variable | Δ mean S=0 | Δ mean S=1 | Source | OK |
|---|---|---|---|---|
| wage | 0.0001 | 0.0001 | basic | TRUE |
| exp | 0.0001 | 0.0000 | basic | TRUE |
| cafqt | -0.0001 | 0.0000 | basic | TRUE |
| mhgc | 0.0001 | 0.0000 | basic | TRUE |
| numsibs | 0.0000 | 0.0001 | basic | TRUE |
| urban14 | 0.0000 | 0.0001 | basic | TRUE |
| lwage5 | 0.0276 | -0.0281 | local | TRUE |
| lurate | 0.1100 | -0.1121 | local | TRUE |
| pub4 | 0.0466 | -0.0474 | local | TRUE |
| lwage5_17 | 0.0000 | 0.0000 | local | TRUE |
| lurate_17 | 0.0398 | -0.0404 | local | TRUE |
| tuition | -0.6002 | 0.6119 | local | FALSE |
| lavlocwage17 | 0.0095 | -0.0096 | local | TRUE |
| avurate | -0.0141 | 0.0145 | local | TRUE |
Table 3 point estimates vs paper (full sample). LR joint IV p = 0.8450 (paper: 0.0001). Point estimates use full-sample logit + normalden(xbeta), matching avder_ddd.do on phat1. The paper's joint IV p = 0.0001 requires the geocode merge (dataset.dta); the bundled row merge attenuates instrument relevance.

| Variable | Paper | R | Diff | OK |
|---|---|---|---|---|
| cafqt | 0.2826 | 0.2866 | 0.0040 | TRUE |
| mhgc | 0.0441 | 0.0438 | -0.0003 | TRUE |
| numsibs | -0.0233 | -0.0268 | -0.0035 | TRUE |
| urban14 | 0.0340 | 0.0660 | 0.0320 | TRUE |
| lavlocwage17 | 0.1820 | 0.1803 | -0.0017 | TRUE |
| avurate | 0.0058 | 0.0155 | 0.0097 | TRUE |
| pub4 | 0.0529 | 0.0185 | -0.0344 | FALSE |
| lwage5_17 | -0.2687 | -0.2021 | 0.0666 | FALSE |
| lurate_17 | 0.0149 | -0.0130 | -0.0279 | TRUE |
| tuition | -0.0027 | 0.0021 | 0.0048 | TRUE |
The author README.txt requires geocode access to build newid.dta and dataset.dta. This repo includes basicvariables.dta and localvariables.dta (both \(N = 1747\)). If row-aligned merge matches the table above, you can proceed with descriptive statistics and point estimates without geocode. For publication-grade replication, still follow the official merge when geocode is available.
3 Step 1 — Software stack
Check each box before running author scripts.
| Software | Required for | Check |
|---|---|---|
| R + renv::restore() | All Rcode/ scripts; Table 4a | ☐ |
| Stata-MP (or Stata) | getdataset.do, getboot.do, getphat.do, movestay | ☐ |
| GAUSS (tgauss) | chv01b.prg, cdens*.prg (semi-parametric MTE) | ☐ |
| MATLAB | getzx.m, treatparnew.m, mte4c.m, figures | ☐ |
| Unix shell (WSL on Windows) | runboot, superboot (250 iterations) | ☐ |
| R packages: haven, tidyverse, gt, AER, car, sandwich | Course R replication | ☐ |
| R package np | Table 6(b) SLS only (>24h); bypassed, with paper coefficients cached in Table_6.R | ☑ (cached) |
Author orchestration scripts (runboot, superboot) are shell scripts. On Windows, use WSL or rewrite the loop in R (as in Rcode/chv2011-data-prep.R make_bootstrap_indices()).
4 Step 2 — Minimum data pipeline (P0 → P1)
flowchart LR
subgraph p0 [P0 Bundled]
b[basicvariables.dta]
l[localvariables.dta]
m[Row merge or newid merge]
end
subgraph p1 [P1 Official]
g[geocode newid.dta]
d[dataset.dta]
boot[bootdata phat1 to 250]
end
b --> m
l --> m
g --> d
d --> boot
m --> ok1[Table A-3 OK]
boot --> ok2[Table 3 SE and beyond]
4.1 P0 — Bundled data (no geocode)
Goal: Verify sample merge and run descriptive / choice-equation tables.
Unlocks: Table A-3 (sample statistics); Table 3 point estimates (approximate); Table 2 (variable map); formula tables 1, A-1.
Limitation: Table 3 instrument average derivatives and joint IV test (\(p = 0.0001\) in paper) require geocode-merged dataset.dta (P1).
4.2 P1 — Bootstrap samples (getboot)
Goal: Create bootdata/phat1.asc … phat250.asc (author bootstrap infrastructure).
Option A — R (recommended in this repo):
# From repo root; default B=250, seed=703296661
Rscript 04-topics/rep-chv2011/Rcode/getboot.R
# Quick test (10 samples)
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/getboot.R
# Regenerate existing files
CHV2011_BOOT_OVERWRITE=TRUE Rscript 04-topics/rep-chv2011/Rcode/getboot.R

getboot.R also writes bootdata/boot_base.rds, bootstrap_indices.rds, and boot_meta.rds for fast reuse in R. Table 3 / A-4 scripts pick up cached bootdata automatically when present (CHV2011_USE_BOOTDATA=TRUE, the default).
Option B — Stata (requires dataset.dta):
* After placing newid.dta in replication_aer/
do getdataset.do /* creates dataset.dta */
mkdir bootdata
do getboot.do /* creates bootdata/phat1.asc … phat250.asc */

bootdata/ status (P1)

| metric | value |
|---|---|
| Status | OK |
| N | 1747 |
| Bootstrap files | 250 / 250 |
| Data mode | bundled_aligned |
| Seed | 703296661 |
Unlocks: Table 3, A-4 bootstrap SEs; prerequisite for §6 pipelines.
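To confirm the "Bootstrap files" count independently of the R status table, a small shell sketch like the following counts the expected files. `count_phat` and `CHV_BOOTDIR` are hypothetical names introduced here, and the sketch assumes the files are named phat1.asc through phatB.asc as described above.

```shell
# Count how many of phat1.asc ... phatB.asc exist in a bootdata directory.
# count_phat and CHV_BOOTDIR are illustrative names, not part of the repo.
count_phat() {
  cph_dir="$1"
  cph_b="$2"
  cph_found=0
  cph_i=1
  while [ "$cph_i" -le "$cph_b" ]; do
    if [ -f "$cph_dir/phat${cph_i}.asc" ]; then
      cph_found=$((cph_found + 1))
    fi
    cph_i=$((cph_i + 1))
  done
  echo "$cph_found / $cph_b"
}

# Defaults: bundle-relative bootdata directory and B=250.
count_phat "${CHV_BOOTDIR:-replication_aer/bootdata}" "${CHV2011_BOOT:-250}"
```

Because the loop checks each index, a gap in the sequence (for example a missing phat3.asc) lowers the count instead of being hidden by extra files such as the .rds caches.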
Environment variables:
| Variable | Default | Paper value / notes |
|---|---|---|
| CHV2011_BOOT | 250 (getboot) / 50 (table scripts) | 250 |
| CHV2011_BOOT_SEED | 703296661 | (author Stata: unset) |
| CHV2011_USE_BOOTDATA | TRUE | use cached bootdata when available |
| CHV2011_BOOT_OVERWRITE | FALSE | force rewrite of .asc files |
| CHV2011_BOOT_4A | 1000 (Table_4a.R) | 1000 |
| CHV2011_BOOT_A5_OUT | 500 (chv2011-table4a-core.R) | 500 |
| CHV2011_BOOT_A5_IN | 100 | 100 |
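A quick way to see which values a run would pick up is to echo each variable with its documented default. This is a shell sketch only: `show_settings` is a hypothetical helper, the R scripts do their own lookup, and CHV2011_BOOT is shown with the getboot default of 250 (the table scripts default to 50).

```shell
# Print the effective settings, falling back to the defaults in the table.
# show_settings is an illustrative helper, not part of the repo.
show_settings() {
  echo "CHV2011_BOOT=${CHV2011_BOOT:-250}"
  echo "CHV2011_BOOT_SEED=${CHV2011_BOOT_SEED:-703296661}"
  echo "CHV2011_USE_BOOTDATA=${CHV2011_USE_BOOTDATA:-TRUE}"
  echo "CHV2011_BOOT_OVERWRITE=${CHV2011_BOOT_OVERWRITE:-FALSE}"
  echo "CHV2011_BOOT_4A=${CHV2011_BOOT_4A:-1000}"
  echo "CHV2011_BOOT_A5_OUT=${CHV2011_BOOT_A5_OUT:-500}"
  echo "CHV2011_BOOT_A5_IN=${CHV2011_BOOT_A5_IN:-100}"
}

show_settings
```

The `${VAR:-default}` form substitutes the default when the variable is unset or empty, so exporting a value before calling the helper overrides the corresponding line.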
5 Step 3 — Table 4a input (P2)
Goal: Export analysis data and run the linearity-of-\(E(Y|X,P)\) test (Table 4(a)).
Step 1 — Export data_with_tuition.RData:
Rscript 04-topics/rep-chv2011/Rcode/export_data_with_tuition.R

Alternatively, in an R session, call export_chv2011_table4a_data() from chv2011-table4a-core.R.
Writes replication_aer/mainresults/table4a/data_with_tuition.RData (object name data, sorted by caseid, includes tuition and IV interactions).
Step 2 — Run Table 4(a):
# Paper: B=1000 (slow, ~20 min); the default is listed in the environment-variable table in §4.2
CHV2011_BOOT_4A=1000 Rscript 04-topics/rep-chv2011/Rcode/Table_4a.R

This is a port of the author's bootstrap_MT.R: probit first step, HC robust Wald, pivotal bootstrap, Romano–Wolf 10% critical value.
Unlocks: Table 4(a), A-5 (R pipeline complete; bootstrap p-values approximate only).
Author bootstrap_MT.R uses a probit first step; the main choice equation in the paper is logit. Table 4(a) follows the author replication script (probit).
After paper-grade bootstrap (B=1000 / 500×100):
| Test | R (bundled) | Paper |
|---|---|---|
| Table 4(a) HC asymptotic, deg 2 | 0.035 | 0.035 |
| Table 4(a) bootstrap p, deg 2–5 | 0.144, 0.250, 0.494, 0.459 | 0.035, 0.049, 0.086, 0.122 |
| Table 4(a) RW joint (10%) | Fail to reject (crit 0.027) | Reject |
| Table A-5 bootstrap p, deg 2–5 | 0.094, 0.178, 0.280, 0.278 | 0.004, 0.006, 0.022, 0.026 |
| Table A-5 RW critical | 0.054 | ~0.052 |
| Table A-5 RW joint | Fail to reject | Reject |
HC / MT asymptotic p-values track the paper at degree 2; pivotal bootstrap p-values are systematically higher on bundled data. Matching the paper likely requires the geocode-merged dataset.dta and/or the author's double_bootstrap.R (not in the AER bundle). The implementation follows description1.tex and bootstrap_MT.R, and no clear code bug has been identified.
6 Step 4 — Semi-parametric pipeline (P3)
Goal: MTE, treatment parameters, Table 4b / 5 / figures.
6.2 Course R pipeline (implemented)
Polynomial MTE proxy aligned with Table 4(a) outcome spec (chv2011-table4a-core.R); uses cached bootdata/ when present.
# Smoke test (B=10); paper-grade: CHV2011_BOOT=250
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_4b.R
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_5.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R

On bundled row-merge data, R IV / OLS and normal-model columns align with paper magnitude (~0.07–0.12 annual returns). Semiparametric MPRTE uses the same polynomial proxy as Table 4(a), not GAUSS chv01b.prg; expect deviation from published Table 5 col 2. The Table 4(b) LATE-difference pattern is informative but not identical to mte4c.m on 250 superboot draws.
Unlocks: Table 4(b), 5, A-7, A-8; Figures 2, 3, 4, 5, 6, 7 (R proxies; author GAUSS/MATLAB for publication match).
7 Step 5 — Normal model + Figure 1 (parallel track)
7.2 Course R (uses shipped normalmodel/mtexvb.out)
Unlocks: Table 5 (normal column), A-6; Figure 1.
8 Step 6 — Table 6 robustness (P4)
Each column re-runs the full pipeline under a different spec (table6/*/).
| Column | Directory | Extra bootstrap data |
|---|---|---|
| 6(a) Baseline | mainresults/ | root bootdata/ |
| 6(a) No dropouts | table6/nodrop/ | table6/bootdata_nodrop/ (run local getboot.do) |
| 6(a) Dropout dummies | table6/dropdummies/ | root bootdata/ |
| 6(b) All X in Z | table6/allxinz/ | root bootdata/ |
| 6(b) No Z×X | table6/nointall/ | root bootdata/ |
| 6(b) Cameron–Taber | table6/notuitnounemp/ | root bootdata/ |
| 6(b) No tuition | table6/notuit/ | root bootdata/ |
| 6(b) SLS | table6/sls/ | getdatab.do → R npindex (>24h) |
Unlocks: Table 6(a), 6(b) (R pipeline + B=250 bootstrap SEs; point estimates still differ from GAUSS cdens paper numbers).
9 Table unlock map
| Phase | Author requirement | Course R status (2026-05-28) | Tables / figures |
|---|---|---|---|
| P0 | Bundled .dta | Done | 1, 2, A-1, A-2†, A-3 |
| P1 | bootdata/ (250) | Done (R getboot.R) | 3, A-4 (+ bootstrap SEs) |
| P2 | data_with_tuition.RData + B=1000 / 500×100 | Run complete‡ | 4(a), A-5 |
| P3 | mainresults/out/*.out OR R MTE core | R done (B=250); author not run | 4(b), 5, A-7, A-8, Fig 2–7 |
| P3′ | normalmodel/out/*.out | Partial (shipped mtexvb.out) | 5 col 1, A-6, Fig 1 |
| P4 | table6/* pipelines | R done (B=250); author bootdata_nodrop / GAUSS not run | 6(a), 6(b) |
† Table A-2: no author script; cafqt in bundled data is pre-corrected (Hansen–Heckman–Mullen).
‡ P2 bootstrap p-values on bundled data do not match published Table 4(a) / A-5; HC asymptotic deg-2 aligns.
10 One-shot R smoke test (course pipeline)
From repository root, after renv::restore():
# Fast smoke test (reduced bootstrap)
# POSIX shell (bash, WSL); in Windows cmd use: set CHV2011_BOOT=30
export CHV2011_BOOT=30
export CHV2011_BOOT_4A=50
Rscript 04-topics/rep-chv2011/Rcode/run_all_tables.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R
quarto render 04-topics/rep-chv2011/replicate-chv2011-R.qmd

Expected: HTML renders; tables load from Rcode/*.RData. For paper-matched SEs, set CHV2011_BOOT=250 and allow longer runtimes.
11 Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| use dataset fails | No dataset.dta | P0 row merge in R, or P1 Stata merge |
| cp ../bootdata/phat$i.asc fails | No bootstrap samples | Run getboot.do |
| load mte1.out fails in MATLAB | Never ran superboot | Complete P3 or use R MTE core |
| bootstrap_MT.R fails on load() | No data_with_tuition.RData | Step P2 export |
| IV joint test ≠ paper | Spec / merge / bootstrap | Compare logit on full \(N\); verify local IV merge |
| Table 6 SLS hangs | npindex runtime | Cache coefficients; document in output |
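Several of these symptoms come down to a missing file, so they can be caught before any script runs. The following is a rough shell sketch, not a repo tool: `preflight` and `CHV_ROOT` are hypothetical names, and the three paths and hints are taken from the inventory and the table above. It assumes every path is relative to replication_aer/.

```shell
# Print a fix hint for each missing prerequisite under a root directory.
# Returns the number of missing items as its exit status.
# preflight and CHV_ROOT are illustrative names, not part of the repo.
preflight() {
  pf_root="$1"
  pf_missing=0
  while IFS='|' read -r pf_path pf_hint; do
    [ -n "$pf_path" ] || continue
    if [ ! -e "$pf_root/$pf_path" ]; then
      echo "MISSING $pf_path -> $pf_hint"
      pf_missing=$((pf_missing + 1))
    fi
  done <<EOF
dataset.dta|P0 row merge in R, or P1 Stata merge
bootdata/phat1.asc|run getboot.R or getboot.do
mainresults/table4a/data_with_tuition.RData|run export_data_with_tuition.R
EOF
  return "$pf_missing"
}

preflight "${CHV_ROOT:-replication_aer}" || echo "Some prerequisites are missing."
```

A missing dataset.dta is expected on the bundled data, so treat that line as informational unless you are following the official geocode merge.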
12 References
- Carneiro, Heckman, and Vytlacil (2011), "Estimating Marginal Returns to Education," American Economic Review.
- Author replication: replication_aer/README.txt
- Course R output: replicate-chv2011-R.qmd