Minimum Runnable Checklist

CHV2011 replication — data, software, and table unlock map

Author

Hu Huaping

Published

May 28, 2026

Important: Purpose

This checklist walks through the minimum steps needed to run the Carneiro, Heckman and Vytlacil (2011) replication in this repo. The AER package under replication_aer/ ships code and partial data, not pre-computed bootstrap outputs. Use this page to see what you have, what is missing, and which paper tables each step unlocks.

Related pages: Paper (full text) · Replication guide · R replication output · Author README.txt

Tip: Course R pipeline status (2026-05-29)

Implementation: complete — all Table_*.R / Figure_*.R scripts, shared modules, and replicate-chv2011-R.qmd are wired. HTML output renders from cached Rcode/*.RData.

Paper-grade bootstrap runs (bundled data):

| Item | Status | Cached B | Notes |
|---|---|---|---|
| P1 bootdata/ | Done | 250 | R getboot.R |
| Table 3 / A-4 SEs | Done | 250 | Control derivatives ≈ paper; IV terms need geocode |
| Table 4(a) | Run complete | 1000 | HC asymptotic deg-2 ≈ 0.035; bootstrap p-values do not match paper on bundled data |
| Table A-5 | Run complete | 500×100 | RW critical ≈ 0.054; p-values fail to reject (paper rejects) |
| Table 4(b) / 5 | Done | 250 | IV/OLS magnitude reasonable; semiparametric MPRTE deviates from GAUSS |
| Table 6 | Run complete | 250 | Eight columns + bootstrap SEs; R polynomial core; point estimates differ from paper (e.g. baseline ATE 0.044 vs 0.082) |
| Figures 1–7 | Done | | R proxies; Fig 1 uses shipped mtexvb.out |

Not done (author / publication match): geocode dataset.dta, superbootmainresults/out/*.out, table6/bootdata_nodrop/, GAUSS cdens pipeline, npindex SLS (>24h).

2 Step 0 — Environment scan (auto)

Data mode detected: bundled_aligned — Merging basicvariables.dta and localvariables.dta by row order (author-aligned replication bundle; N should be 1747).

2.1 Bundled file inventory

Table 1: Key paths under replication_aer/. MISSING items must be created locally or replaced by R equivalents.

| Path | Status | Detail |
|---|---|---|
| basicvariables.dta | **OK** | 127.4 KB |
| localvariables.dta | **OK** | 59.1 KB |
| school.dta | **OK** | 15.1 KB |
| dataset.dta | **MISSING** | Not found |
| newid.dta | **MISSING** | Not found |
| bootdata | **OK** | 253 file(s) |
| out2_ddd | **MISSING** | Not found |
| mainresults/table4a/data_with_tuition.RData | **OK** | 142.7 KB |
| mainresults/out | **OK** | 2 file(s) |
| normalmodel/out | **OK** | 1 file(s) |
| table6/sls/datab.dta | **MISSING** | Not found |
| table6/bootdata_nodrop | **OK** | 2 file(s) |
| normalmodel/mtexvb.out | **OK** | 4.4 KB |
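The inventory above can be approximated by hand with a short shell check. This is a minimal sketch, not the scan this page actually runs: `check_bundle` is a hypothetical helper name, and it tests only a subset of the paths listed in Table 1.

```shell
# Hypothetical helper: report OK/MISSING for a few key paths from Table 1.
# Usage: check_bundle path/to/replication_aer
check_bundle() (
  # Subshell body, so the loop variables do not leak into the caller.
  root=$1
  for rel in basicvariables.dta localvariables.dta school.dta \
             dataset.dta newid.dta bootdata; do
    if [ -e "$root/$rel" ]; then
      printf '%s\tOK\n' "$rel"
    else
      printf '%s\tMISSING\n' "$rel"
    fi
  done
)
```

Extend the `for` list with the remaining Table 1 paths as needed; the function only tests for existence and does not report file sizes or file counts.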

2.2 Sample sanity check (bundled merge)

Paper Table A-3 reports \(N = 1747\) (882 / 865) and group mean log wages. Compare your merged sample:

Table 2: Merged sample vs paper (Table A-3)

| Metric | Your data | Paper |
|---|---|---|
| N | 1747.0000 | 1747 |
| S = 0 | 882.0000 | 882 |
| S = 1 | 865.0000 | 865 |
| mean wage, S=0 | 2.2090 | 2.2089 |
| mean wage, S=1 | 2.5497 | 2.5496 |

Note: Merge method (P0 conclusion)

Without geocode newid.dta, the AER bundle is merged by row order: bind_cols(basicvariables, localvariables[-newid]). Core demographics (wage, exp, cafqt, …) match Table A-3 to rounding; local instruments (pub4, tuition, lwage5, …) deviate because cor(caseid, newid) ≈ 0.88 — not 1. Full Table 3 / IV joint-test alignment requires P1 (dataset.dta via getdataset.do).
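To see why a row-order merge only approximates a key-based merge, consider a toy case. The sketch below is purely illustrative: the two small text tables stand in for the .dta files, the ids and values are invented, and `demo_merge` is a hypothetical name. It contrasts `paste` (row order) with `join` (match on the id column).

```shell
# Toy illustration only: invented data, not the replication files.
demo_merge() (
  dir=$(mktemp -d)
  # "basic": id and wage, stored in id order.
  printf '1 2.20\n2 2.55\n3 2.31\n' > "$dir/basic.txt"
  # "local": id and tuition, with rows 2 and 3 stored in the other order.
  printf '1 10\n3 30\n2 20\n' > "$dir/local.txt"
  echo "row-order merge:"
  paste -d ' ' "$dir/basic.txt" "$dir/local.txt"
  echo "key merge:"
  # join needs both inputs sorted on the key column.
  sort -k1,1 "$dir/local.txt" | join "$dir/basic.txt" -
  rm -r "$dir"
)
```

In the row-order output, id 2 is paired with the tuition value belonging to id 3. The key merge restores the correct pairing, which is the role the geocode newid.dta plays for the local instruments.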

Table 3: Table A-3 alignment (P0). Bundled AER files: bind_cols(basic, local[-newid]) by row order. Without geocode newid.dta, local instruments are only approximately aligned (cor(caseid, newid) = 0.881).

| Variable | Δ mean S=0 | Δ mean S=1 | Source | OK |
|---|---|---|---|---|
| wage | 0.0001 | 0.0001 | basic | TRUE |
| exp | 0.0001 | 0.0000 | basic | TRUE |
| cafqt | -0.0001 | 0.0000 | basic | TRUE |
| mhgc | 0.0001 | 0.0000 | basic | TRUE |
| numsibs | 0.0000 | 0.0001 | basic | TRUE |
| urban14 | 0.0000 | 0.0001 | basic | TRUE |
| lwage5 | 0.0276 | -0.0281 | local | TRUE |
| lurate | 0.1100 | -0.1121 | local | TRUE |
| pub4 | 0.0466 | -0.0474 | local | TRUE |
| lwage5_17 | 0.0000 | 0.0000 | local | TRUE |
| lurate_17 | 0.0398 | -0.0404 | local | TRUE |
| tuition | -0.6002 | 0.6119 | local | FALSE |
| lavlocwage17 | 0.0095 | -0.0096 | local | TRUE |
| avurate | -0.0141 | 0.0145 | local | TRUE |

Table 4: Table 3 point estimates vs paper (full sample). LR joint IV p = 0.8450 (paper: 0.0001). Point estimates use full-sample logit + normalden(xbeta), matching avder_ddd.do on phat1. Reproducing the paper's joint IV p-value requires the geocode merge (dataset.dta); the bundled row merge attenuates instrument relevance.

| Variable | Paper | R | Diff (R − Paper) | OK |
|---|---|---|---|---|
| cafqt | 0.2826 | 0.2866 | 0.0040 | TRUE |
| mhgc | 0.0441 | 0.0438 | -0.0003 | TRUE |
| numsibs | -0.0233 | -0.0268 | -0.0035 | TRUE |
| urban14 | 0.0340 | 0.0660 | 0.0320 | TRUE |
| lavlocwage17 | 0.1820 | 0.1803 | -0.0017 | TRUE |
| avurate | 0.0058 | 0.0155 | 0.0097 | TRUE |
| pub4 | 0.0529 | 0.0185 | -0.0344 | FALSE |
| lwage5_17 | -0.2687 | -0.2021 | 0.0666 | FALSE |
| lurate_17 | 0.0149 | -0.0130 | -0.0279 | TRUE |
| tuition | -0.0027 | 0.0021 | 0.0048 | TRUE |

Note: Bundled vs geocode merge

The author README.txt requires geocode access to build newid.dta and dataset.dta. This repo includes basicvariables.dta and localvariables.dta (both \(N = 1747\)). If row-aligned merge matches the table above, you can proceed with descriptive statistics and point estimates without geocode. For publication-grade replication, still follow the official merge when geocode is available.


3 Step 1 — Software stack

Check each box before running author scripts.

| Software | Required for | Check |
|---|---|---|
| R + renv::restore() | All Rcode/ scripts; Table 4a | |
| Stata-MP (or Stata) | getdataset.do, getboot.do, getphat.do, movestay | |
| GAUSS (tgauss) | chv01b.prg, cdens*.prg (semi-parametric MTE) | |
| MATLAB | getzx.m, treatparnew.m, mte4c.m, figures | |
| Unix shell (WSL on Windows) | runboot, superboot (250 iterations) | |
| R packages: haven, tidyverse, gt, AER, car, sandwich | Course R replication | |
| R package np | Table 6(b) SLS only (>24h); bypassed, with paper coefficients cached in Table_6.R | ☑ (cached) |

Warning: Windows note

Author orchestration scripts (runboot, superboot) are shell scripts. On Windows, use WSL or rewrite the loop in R (as in Rcode/chv2011-data-prep.R make_bootstrap_indices()).


4 Step 2 — Minimum data pipeline (P0 → P1)

flowchart LR
  subgraph p0 [P0 Bundled]
    b[basicvariables.dta]
    l[localvariables.dta]
    m[Row merge or newid merge]
  end
  subgraph p1 [P1 Official]
    g[geocode newid.dta]
    d[dataset.dta]
    boot[bootdata phat1 to 250]
  end
  b --> m
  l --> m
  g --> d
  d --> boot
  m --> ok1[Table A-3 OK]
  boot --> ok2[Table 3 SE and beyond]

4.1 P0 — Bundled data (no geocode)

Goal: Verify sample merge and run descriptive / choice-equation tables.

Unlocks: Table A-3 (sample statistics); Table 3 point estimates (approximate); Table 2 (variable map); formula tables 1, A-1.

Limitation: Table 3 instrument average derivatives and joint IV test (\(p = 0.0001\) in paper) require geocode-merged dataset.dta (P1).

4.2 P1 — Bootstrap samples (getboot)

Goal: Create bootdata/phat1.asc … phat250.asc (author bootstrap infrastructure).

Option A — R (recommended in this repo):

# From repo root; default B=250, seed=703296661
Rscript 04-topics/rep-chv2011/Rcode/getboot.R

# Quick test (10 samples)
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/getboot.R

# Regenerate existing files
CHV2011_BOOT_OVERWRITE=TRUE Rscript 04-topics/rep-chv2011/Rcode/getboot.R

Also writes bootdata/boot_base.rds, bootstrap_indices.rds, boot_meta.rds for fast R reuse. Table 3 / A-4 scripts pick up cached bootdata automatically when present (CHV2011_USE_BOOTDATA=TRUE, default).

Option B — Stata (requires dataset.dta):

* After placing newid.dta in replication_aer/
do getdataset.do          /* creates dataset.dta */
mkdir bootdata
do getboot.do             /* creates bootdata/phat1.asc … phat250.asc */
Table 5: bootdata/ status (P1)

| Metric | Value |
|---|---|
| Status | OK |
| N | 1747 |
| Bootstrap files | 250 / 250 |
| Data mode | bundled_aligned |
| Seed | 703296661 |
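The "Bootstrap files" line can be double-checked from the shell. The following is a minimal sketch under the assumption that the samples are named phat1.asc through phat<B>.asc, as in the getboot.do comment above; `count_boot_files` is a hypothetical helper, not a script in the repo.

```shell
# Hypothetical check: count phat<i>.asc files for i = 1..B in a directory.
# Usage: count_boot_files path/to/bootdata [B]   (B defaults to 250)
count_boot_files() (
  # Subshell body, so the counters do not leak into the caller.
  dir=$1
  expected=${2:-250}
  n=0
  i=1
  while [ "$i" -le "$expected" ]; do
    [ -f "$dir/phat$i.asc" ] && n=$((n + 1))
    i=$((i + 1))
  done
  echo "$n / $expected"
  # Exit status is 0 only when every expected file is present.
  [ "$n" -eq "$expected" ]
)
```

Because the exit status reflects completeness, the function can gate a later step, for example `count_boot_files bootdata 250 && echo "bootdata complete"`.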

Unlocks: Table 3, A-4 bootstrap SEs; prerequisite for §6 pipelines.

Environment variables:

| Variable | Default | Paper value / note |
|---|---|---|
| CHV2011_BOOT | 250 (getboot) / 50 (table scripts) | 250 |
| CHV2011_BOOT_SEED | 703296661 | (author Stata: unset) |
| CHV2011_USE_BOOTDATA | TRUE | use cached bootdata when available |
| CHV2011_BOOT_OVERWRITE | FALSE | force rewrite of .asc files |
| CHV2011_BOOT_4A | 1000 (Table_4a.R) | 1000 |
| CHV2011_BOOT_A5_OUT | 500 (chv2011-table4a-core.R) | 500 |
| CHV2011_BOOT_A5_IN | 100 | 100 |
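Before a long run it can help to print the settings a script would inherit. The sketch below mirrors the defaults in the table with shell parameter expansion. `print_boot_settings` is a hypothetical helper, and how the R scripts themselves treat unset or empty variables is an assumption here, not something this page documents.

```shell
# Hypothetical preflight: show each variable, falling back to the default
# from the table above when it is unset or empty.
print_boot_settings() {
  # 250 is the getboot.R default; the table scripts default to 50 instead.
  echo "CHV2011_BOOT=${CHV2011_BOOT:-250}"
  echo "CHV2011_BOOT_SEED=${CHV2011_BOOT_SEED:-703296661}"
  echo "CHV2011_USE_BOOTDATA=${CHV2011_USE_BOOTDATA:-TRUE}"
  echo "CHV2011_BOOT_OVERWRITE=${CHV2011_BOOT_OVERWRITE:-FALSE}"
  echo "CHV2011_BOOT_4A=${CHV2011_BOOT_4A:-1000}"
  echo "CHV2011_BOOT_A5_OUT=${CHV2011_BOOT_A5_OUT:-500}"
  echo "CHV2011_BOOT_A5_IN=${CHV2011_BOOT_A5_IN:-100}"
}
```

Running `print_boot_settings` in the same shell session that will launch Rscript shows whether an earlier `export` is still in effect.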

5 Step 3 — Table 4a input (P2)

Goal: Export analysis data and run the linearity-of-\(E(Y|X,P)\) test (Table 4(a)).

Step 1 — Export data_with_tuition.RData:

Rscript 04-topics/rep-chv2011/Rcode/export_data_with_tuition.R

Or in R: export_chv2011_table4a_data() from chv2011-table4a-core.R.

Writes replication_aer/mainresults/table4a/data_with_tuition.RData (object name data, sorted by caseid, includes tuition and IV interactions).

Step 2 — Run Table 4(a):

# Paper: B=1000 (slow, ~20 min); defaults are listed in the Step 2 environment-variable table
CHV2011_BOOT_4A=1000 Rscript 04-topics/rep-chv2011/Rcode/Table_4a.R

Port of author bootstrap_MT.R: probit first step, HC robust Wald, pivotal bootstrap, Romano–Wolf 10% critical value.

Unlocks: Table 4(a), A-5 (R pipeline complete; bootstrap p-values approximate only).

Tip: Probit vs logit

Author bootstrap_MT.R uses a probit first step; the main choice equation in the paper is logit. Table 4(a) follows the author replication script (probit).

Note: Alignment note (P2, bundled data)

After paper-grade bootstrap (B=1000 / 500×100):

| Test | R (bundled) | Paper |
|---|---|---|
| Table 4(a) HC asymptotic, deg 2 | 0.035 | 0.035 |
| Table 4(a) bootstrap p, deg 2–5 | 0.144, 0.250, 0.494, 0.459 | 0.035, 0.049, 0.086, 0.122 |
| Table 4(a) RW joint (10%) | Fail to reject (crit 0.027) | Reject |
| Table A-5 bootstrap p, deg 2–5 | 0.094, 0.178, 0.280, 0.278 | 0.004, 0.006, 0.022, 0.026 |
| Table A-5 RW critical | 0.054 | ~0.052 |
| Table A-5 RW joint | Fail to reject | Reject |

HC / MT asymptotic p-values track the paper at degree 2, but pivotal bootstrap p-values are systematically higher on bundled data. Closing the gap likely requires the geocode-merged dataset.dta and/or the author's double_bootstrap.R (not in the AER bundle). The implementation follows description1.tex / bootstrap_MT.R, and no clear code bug has been identified.


6 Step 4 — Semi-parametric pipeline (P3)

Goal: MTE, treatment parameters, Table 4b / 5 / figures.

6.1 Author pipeline (GAUSS / MATLAB / superboot)

6.2 Course R pipeline (implemented)

Polynomial MTE proxy aligned with Table 4(a) outcome spec (chv2011-table4a-core.R); uses cached bootdata/ when present.

# Smoke test (B=10); paper-grade: CHV2011_BOOT=250
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_4b.R
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_5.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R
Note: R vs author numbers

On bundled row-merge data, R IV / OLS and normal-model columns align with paper magnitude (~0.07–0.12 annual returns). Semiparametric MPRTE uses the same polynomial proxy as Table 4(a), not GAUSS chv01b.prg; expect deviation from published Table 5 col 2. Table 4(b) LATE-difference pattern is informative but not identical to mte4c.m on 250 superboot draws.

Unlocks: Table 4(b), 5, A-7, A-8; Figures 2, 3, 4, 5, 6, 7 (R proxies; author GAUSS/MATLAB for publication match).


7 Step 5 — Normal model + Figure 1 (parallel track)

7.1 Author pipeline

7.2 Course R (uses shipped normalmodel/mtexvb.out)

Unlocks: Table 5 (normal column), A-6; Figure 1.


8 Step 6 — Table 6 robustness (P4)

Each column re-runs the full pipeline under a different spec (table6/*/).

| Column | Directory | Extra bootstrap data |
|---|---|---|
| 6(a) Baseline | mainresults/ | root bootdata/ |
| 6(a) No dropouts | table6/nodrop/ | table6/bootdata_nodrop/ (run local getboot.do) |
| 6(a) Dropout dummies | table6/dropdummies/ | root bootdata/ |
| 6(b) All X in Z | table6/allxinz/ | root bootdata/ |
| 6(b) No Z×X | table6/nointall/ | root bootdata/ |
| 6(b) Cameron–Taber | table6/notuitnounemp/ | root bootdata/ |
| 6(b) No tuition | table6/notuit/ | root bootdata/ |
| 6(b) SLS | table6/sls/ | getdatab.do → R npindex (>24h) |

Unlocks: Table 6(a), 6(b) (R pipeline + B=250 bootstrap SEs; point estimates still differ from GAUSS cdens paper numbers).


9 Table unlock map

| Phase | Author requirement | Course R status (2026-05-28) | Tables / figures |
|---|---|---|---|
| P0 | Bundled .dta | Done | 1, 2, A-1, A-2†, A-3 |
| P1 | bootdata/ (250) | Done (R getboot.R) | 3, A-4 (+ bootstrap SEs) |
| P2 | data_with_tuition.RData + B=1000 / 500×100 | Run complete‡ | 4(a), A-5 |
| P3 | mainresults/out/*.out OR R MTE core | R done (B=250); author not run | 4(b), 5, A-7, A-8, Fig 2–7 |
| P3′ | normalmodel/out/*.out | Partial (shipped mtexvb.out) | 5 col 1, A-6, Fig 1 |
| P4 | table6/* pipelines | R done (B=250); author bootdata_nodrop / GAUSS not run | 6(a), 6(b) |

† Table A-2: no author script; cafqt in bundled data is pre-corrected (Hansen–Heckman–Mullen).

‡ P2 bootstrap p-values on bundled data do not match published Table 4(a) / A-5; HC asymptotic deg-2 aligns.


10 One-shot R smoke test (course pipeline)

From repository root, after renv::restore():

# Fast smoke test (reduced bootstrap); POSIX shell syntax
# (in Windows cmd.exe, use "set NAME=value" instead of export)
export CHV2011_BOOT=30
export CHV2011_BOOT_4A=50
Rscript 04-topics/rep-chv2011/Rcode/run_all_tables.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R
quarto render 04-topics/rep-chv2011/replicate-chv2011-R.qmd

Expected: HTML renders; tables load from Rcode/*.RData. For paper-matched SEs, set CHV2011_BOOT=250 and allow longer runtimes.
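The smoke-test sequence can also be wrapped in a function that stops at the first failing step and keeps the reduced bootstrap settings out of the calling shell. This is a minimal sketch: `smoke_test` and the `RSCRIPT` / `QUARTO` override variables are hypothetical conveniences for dry runs, not settings the repo reads.

```shell
# Hypothetical wrapper around the smoke-test commands above.
# Set RSCRIPT=echo and QUARTO=echo for a dry run that only prints arguments.
smoke_test() (
  # Subshell body: the exported values do not leak into the calling shell.
  export CHV2011_BOOT="${CHV2011_BOOT:-30}"
  export CHV2011_BOOT_4A="${CHV2011_BOOT_4A:-50}"
  rdir=04-topics/rep-chv2011/Rcode
  # Each step runs only if the previous one succeeded.
  "${RSCRIPT:-Rscript}" "$rdir/run_all_tables.R" &&
  "${RSCRIPT:-Rscript}" "$rdir/run_all_figures.R" &&
  "${QUARTO:-quarto}" render 04-topics/rep-chv2011/replicate-chv2011-R.qmd
)
```

Call `smoke_test` from the repository root. Values of CHV2011_BOOT or CHV2011_BOOT_4A already present in the environment take precedence over the reduced defaults.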


11 Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| use dataset fails | No dataset.dta | P0 row merge in R, or P1 Stata merge |
| cp ../bootdata/phat$i.asc fails | No bootstrap samples | Run getboot.do |
| load mte1.out fails in MATLAB | Never ran superboot | Complete P3 or use R MTE core |
| bootstrap_MT.R fails on load() | No data_with_tuition.RData | Step P2 export |
| IV joint test ≠ paper | Spec / merge / bootstrap | Compare logit on full \(N\); verify local IV merge |
| Table 6 SLS hangs | npindex runtime | Cache coefficients; document in output |

12 References

Carneiro, Pedro, James J. Heckman, and Edward Vytlacil. 2011. “Estimating Marginal Returns to Education.” The American Economic Review 101 (6): 2754–81. https://doi.org/10.1257/aer.101.6.2754.