Minimum Runnable Checklist

CHV2011 replication — data, software, and table unlock map

Author

Hu Huaping

Published

May 28, 2026

Purpose

This checklist walks through the minimum steps needed to run the Carneiro, Heckman and Vytlacil (2011) replication in this repo. The AER package under replication_aer/ ships code and partial data, not pre-computed bootstrap outputs. Use this page to see what you have, what is missing, and which paper tables each step unlocks.

Related pages: Paper (full text) · Replication guide · R replication output · Author README.txt

Course R pipeline status (2026-05-29)

Implementation: complete — all Table_*.R / Figure_*.R scripts, shared modules, and replicate-chv2011-R.qmd are wired. HTML output renders from cached Rcode/*.RData.

Paper-grade bootstrap runs (bundled data):

Item	Status	Cached B	Notes
P1 `bootdata/`	Done	250	R `getboot.R`
Table 3 / A-4 SEs	Done	250	Control derivatives ≈ paper; IV terms need geocode
Table 4(a)	Run complete	1000	HC asymptotic deg-2 ≈ 0.035; bootstrap p-values do not match paper on bundled data
Table A-5	Run complete	500×100	RW critical ≈ 0.054; p-values fail to reject (paper rejects)
Table 4(b) / 5	Done	250	IV/OLS magnitude reasonable; semiparametric MPRTE deviates from GAUSS
Table 6	Run complete	250	Eight columns + bootstrap SEs; R polynomial core — point estimates differ from paper (e.g. baseline ATE 0.044 vs 0.082)
Figures 1–7	Done	—	R proxies; Fig 1 uses shipped `mtexvb.out`

Not done (author / publication match): geocode dataset.dta, superboot → mainresults/out/*.out, table6/bootdata_nodrop/, GAUSS cdens pipeline, npindex SLS (>24h).

1 Quick links

Resource	Path
Bundled NLSY ingredients	`replication_aer/basicvariables.dta`, `localvariables.dta`, `school.dta`
R data prep	`Rcode/chv2011-data-prep.R`
R MTE core	`Rcode/chv2011-mte-core.R`
Table scripts	`Rcode/Table_*.R`
Author Table 4a	`replication_aer/mainresults/table4a/bootstrap_MT.R`

2 Step 0 — Environment scan (auto)

Data mode detected: bundled_aligned — Merging basicvariables.dta and localvariables.dta by row order (author-aligned replication bundle; N should be 1747).

2.1 Bundled file inventory

Table 1: Key paths under replication_aer/. MISSING items must be created locally or replaced by R equivalents.

Path	Status	Detail
basicvariables.dta	OK	127.4 KB
localvariables.dta	OK	59.1 KB
school.dta	OK	15.1 KB
dataset.dta	MISSING	Not found
newid.dta	MISSING	Not found
bootdata	OK	253 file(s)
out2_ddd	MISSING	Not found
mainresults/table4a/data_with_tuition.RData	OK	142.7 KB
mainresults/out	OK	2 file(s)
normalmodel/out	OK	1 file(s)
table6/sls/datab.dta	MISSING	Not found
table6/bootdata_nodrop	OK	2 file(s)
normalmodel/mtexvb.out	OK	4.4 KB
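
The inventory above can also be reproduced by hand from a shell. The following is a minimal sketch, not the scan the page itself runs: `chv_inventory` is a hypothetical helper that is not part of the repo, it covers only some of the paths in Table 1, and the location of replication_aer/ relative to the working directory is an assumption.

```shell
# Hypothetical helper (not in the repo): report OK / MISSING for key bundle paths.
chv_inventory() {
  root=$1
  for p in basicvariables.dta localvariables.dta school.dta \
           dataset.dta newid.dta bootdata out2_ddd; do
    if [ -e "$root/$p" ]; then
      printf '%s\tOK\n' "$p"
    else
      printf '%s\tMISSING\n' "$p"
    fi
  done
}

# Usage: pass the path to replication_aer/ (adjust to the local checkout)
chv_inventory replication_aer
```

Every path prints as MISSING when the argument does not point at the bundle, so a fully MISSING listing usually means the wrong directory rather than an empty package.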

2.2 Sample sanity check (bundled merge)

Paper Table A-3 reports $N = 1747$ (882 / 865) and group mean log wages. Compare your merged sample:

Table 2: Merged sample vs paper (Table A-3)

Metric	Your data	Paper
N	1747	1747
S = 0	882	882
S = 1	865	865
mean log wage | S=0	2.2090	2.2089
mean log wage | S=1	2.5497	2.5496

Merge method (P0 conclusion)

Without geocode newid.dta, the AER bundle is merged by row order: bind_cols(basicvariables, localvariables[-newid]). Core demographics (wage, exp, cafqt, …) match Table A-3 to rounding; local instruments (pub4, tuition, lwage5, …) deviate because cor(caseid, newid) ≈ 0.88 — not 1. Full Table 3 / IV joint-test alignment requires P1 (dataset.dta via getdataset.do).

Table 3: Table A-3 alignment (P0). Bundled AER files are merged by row order with bind_cols(basic, local[-newid]). Without geocode newid.dta, local instruments are only approximately aligned (cor(caseid, newid) = 0.881).

Variable	Δ mean S=0	Δ mean S=1	Source	OK
wage	0.0001	0.0001	basic	TRUE
exp	0.0001	0.0000	basic	TRUE
cafqt	-0.0001	0.0000	basic	TRUE
mhgc	0.0001	0.0000	basic	TRUE
numsibs	0.0000	0.0001	basic	TRUE
urban14	0.0000	0.0001	basic	TRUE
lwage5	0.0276	-0.0281	local	TRUE
lurate	0.1100	-0.1121	local	TRUE
pub4	0.0466	-0.0474	local	TRUE
lwage5_17	0.0000	0.0000	local	TRUE
lurate_17	0.0398	-0.0404	local	TRUE
tuition	-0.6002	0.6119	local	FALSE
lavlocwage17	0.0095	-0.0096	local	TRUE
avurate	-0.0141	0.0145	local	TRUE

Table 4: Paper Table 3 point estimates vs R (full sample). The LR joint test of the instruments gives p = 0.8450 on bundled data (paper: 0.0001). Point estimates use the full-sample logit with normalden(xbeta), matching avder_ddd.do on phat1. Reproducing the paper's p-value requires the geocode merge (dataset.dta); the bundled row-order merge attenuates instrument relevance.

Variable	Paper	R	Diff	OK
cafqt	0.2826	0.2866	0.0040	TRUE
mhgc	0.0441	0.0438	-0.0003	TRUE
numsibs	-0.0233	-0.0268	-0.0035	TRUE
urban14	0.0340	0.0660	0.0320	TRUE
lavlocwage17	0.1820	0.1803	-0.0017	TRUE
avurate	0.0058	0.0155	0.0097	TRUE
pub4	0.0529	0.0185	-0.0344	FALSE
lwage5_17	-0.2687	-0.2021	0.0666	FALSE
lurate_17	0.0149	-0.0130	-0.0279	TRUE
tuition	-0.0027	0.0021	0.0048	TRUE
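
As a quick arithmetic check on the table above, the sketch below recomputes the Diff column for four rows with awk. The values are copied from the table. The sign convention (Diff = R minus Paper) is inferred from the rows rather than stated by the author, and `diff_table` is a hypothetical name.

```shell
# Recompute Diff = R - Paper for rows copied from the table above.
# Input columns: variable, paper estimate, R estimate
diff_table() {
  awk '{ printf "%s %.4f\n", $1, $3 - $2 }' <<'EOF'
cafqt 0.2826 0.2866
pub4 0.0529 0.0185
lwage5_17 -0.2687 -0.2021
tuition -0.0027 0.0021
EOF
}

diff_table
# cafqt 0.0040
# pub4 -0.0344
# lwage5_17 0.0666
# tuition 0.0048
```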

Bundled vs geocode merge

The author README.txt requires geocode access to build newid.dta and dataset.dta. This repo includes basicvariables.dta and localvariables.dta (both $N = 1747$). If row-aligned merge matches the table above, you can proceed with descriptive statistics and point estimates without geocode. For publication-grade replication, still follow the official merge when geocode is available.

3 Step 1 — Software stack

Check each box before running author scripts.

Software	Required for	Check
R + `renv::restore()`	All `Rcode/` scripts; Table 4a	☐
Stata-MP (or Stata)	`getdataset.do`, `getboot.do`, `getphat.do`, `movestay`	☐
GAUSS (`tgauss`)	`chv01b.prg`, `cdens*.prg` (semi-parametric MTE)	☐
MATLAB	`getzx.m`, `treatparnew.m`, `mte4c.m`, figures	☐
Unix shell (WSL on Windows)	`runboot`, `superboot` (250 iterations)	☐
R packages: `haven`, `tidyverse`, `gt`, `AER`, `car`, `sandwich`	Course R replication	☐
R package `np`	Table 6(b) SLS only (>24h); bypassed — paper coeffs cached in `Table_6.R`	☑ (cached)
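
Part of the checklist above can be ticked off from a shell by testing whether each executable is on the PATH. This is a minimal sketch: `check_tool` is a hypothetical helper, and the executable names are assumptions that depend on the local install (Stata, for example, may be installed as stata, stata-se or stata-mp).

```shell
# Hypothetical helper: report whether an executable is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: found\n' "$1"
  else
    printf '%s: not found\n' "$1"
  fi
}

# Executable names are assumptions; adjust them to the local install.
for tool in Rscript quarto stata-mp tgauss matlab; do
  check_tool "$tool"
done
```

A "found" line only shows that the program can be launched. It does not check versions, licences, or whether `renv::restore()` has been run.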

Windows note

Author orchestration scripts (runboot, superboot) are shell scripts. On Windows, use WSL or rewrite the loop in R (as in Rcode/chv2011-data-prep.R make_bootstrap_indices()).

4 Step 2 — Minimum data pipeline (P0 → P1)

flowchart LR
  subgraph p0 [P0 Bundled]
    b[basicvariables.dta]
    l[localvariables.dta]
    m[Row merge or newid merge]
  end
  subgraph p1 [P1 Official]
    g[geocode newid.dta]
    d[dataset.dta]
    boot[bootdata phat1 to 250]
  end
  b --> m
  l --> m
  g --> d
  d --> boot
  m --> ok1[Table A-3 OK]
  boot --> ok2[Table 3 SE and beyond]

4.1 P0 — Bundled data (no geocode)

Goal: Verify sample merge and run descriptive / choice-equation tables.

Confirm basicvariables.dta and localvariables.dta exist
Merge method: row-order bind_cols (bundled_aligned mode); see alignment tables above
Sample check: $N = 1747$, wage means match paper; local IV means approximate only
From repo root: Rscript 04-topics/rep-chv2011/Rcode/Table_A3.R → Table_A3.RData
CHV2011_BOOT=250 Rscript 04-topics/rep-chv2011/Rcode/Table_3.R (control derivatives ≈ paper; IV terms still need geocode dataset.dta)
Render replicate-chv2011-R.qmd — all table/figure chunks load from Rcode/*.RData

Unlocks: Table A-3 (sample statistics); Table 3 point estimates (approximate); Table 2 (variable map); formula tables 1, A-1.

Limitation: Table 3 instrument average derivatives and joint IV test ($p = 0.0001$ in paper) require geocode-merged dataset.dta (P1).

4.2 P1 — Bootstrap samples (`getboot`)

Goal: Create bootdata/phat1.asc … phat250.asc (author bootstrap infrastructure).

Option A — R (recommended in this repo):

# From repo root; default B=250, seed=703296661
Rscript 04-topics/rep-chv2011/Rcode/getboot.R

# Quick test (10 samples)
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/getboot.R

# Regenerate existing files
CHV2011_BOOT_OVERWRITE=TRUE Rscript 04-topics/rep-chv2011/Rcode/getboot.R

Also writes bootdata/boot_base.rds, bootstrap_indices.rds, boot_meta.rds for fast R reuse. Table 3 / A-4 scripts pick up cached bootdata automatically when present (CHV2011_USE_BOOTDATA=TRUE, default).

Option B — Stata (requires dataset.dta):

* After placing newid.dta in replication_aer/
do getdataset.do          /* creates dataset.dta */
mkdir bootdata
do getboot.do             /* creates bootdata/phat1.asc … phat250.asc */

Run Option A (R getboot.R; seed 703296661, bundled_aligned data)
Verify: bootdata/ contains 250 .asc files (see status table below)
Optional: mkdir out2_ddd then do avder_ddd.do → avder.m in MATLAB (author track)
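
The file-count check in the list above can be scripted. The sketch below is one possible way to do it, assuming the bootstrap files sit directly in replication_aer/bootdata and follow the phat<i>.asc naming shown earlier; `count_bootdata` is a hypothetical helper, not a repo script.

```shell
# Hypothetical helper: compare the number of phat*.asc files with the expected B.
count_bootdata() {
  dir=$1
  expected=${2:-250}
  # ls prints nothing (error suppressed) when no file matches, giving a count of 0
  n=$(ls -1 "$dir"/phat*.asc 2>/dev/null | wc -l | tr -d ' ')
  if [ "$n" -eq "$expected" ]; then
    echo "OK: $n / $expected"
  else
    echo "INCOMPLETE: $n / $expected"
  fi
}

count_bootdata replication_aer/bootdata 250
```

The count ignores the .rds helper files that getboot.R also writes, so a complete run should report 250 even though the directory holds more files.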

Table 5: bootdata/ status (P1)

Metric	Value
Status	OK
N	1747
Bootstrap files	250 / 250
Data mode	bundled_aligned
Seed	703296661

Unlocks: Table 3, A-4 bootstrap SEs; prerequisite for §6 pipelines.

Environment variables:

Variable	Default	Paper value / purpose
`CHV2011_BOOT`	250 (getboot) / 50 (table scripts)	250
`CHV2011_BOOT_SEED`	703296661	(author Stata: unset)
`CHV2011_USE_BOOTDATA`	TRUE	use cached bootdata when available
`CHV2011_BOOT_OVERWRITE`	FALSE	force rewrite of `.asc` files
`CHV2011_BOOT_4A`	1000 (`Table_4a.R`)	1000
`CHV2011_BOOT_A5_OUT`	500 (`chv2011-table4a-core.R`)	500
`CHV2011_BOOT_A5_IN`	100	100
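
The commands on this page pass these variables inline (VAR=value command), which affects only that one command. The sketch below illustrates the difference between an inline assignment and `export` in a POSIX shell. The child command is a stand-in, not one of the repo's scripts, and its fallback of 250 only imitates the getboot default listed above; the real lookup happens inside the R scripts.

```shell
# Stand-in for a script that reads CHV2011_BOOT and falls back to 250 when unset.
child='echo "B=${CHV2011_BOOT:-250}"'

unset CHV2011_BOOT
sh -c "$child"                     # B=250  (unset, fallback applies)
CHV2011_BOOT=10 sh -c "$child"     # B=10   (inline: this one command only)
sh -c "$child"                     # B=250  (inline value did not persist)

export CHV2011_BOOT=30
sh -c "$child"                     # B=30   (exported: every later command)
unset CHV2011_BOOT                 # restore the default for later runs
```

Inline assignment is the safer habit for one-off smoke tests, because a forgotten `export` from an earlier session can silently shrink a paper-grade run.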

5 Step 3 — Table 4a input (P2)

Goal: Export analysis data and run the linearity-of-$E(Y|X,P)$ test (Table 4(a)).

Step 1 — Export data_with_tuition.RData:

Rscript 04-topics/rep-chv2011/Rcode/export_data_with_tuition.R

Or in R: export_chv2011_table4a_data() from chv2011-table4a-core.R.

Writes replication_aer/mainresults/table4a/data_with_tuition.RData (object name data, sorted by caseid, includes tuition and IV interactions).

Step 2 — Run Table 4(a):

# Paper: B=1000 (slow; the 2026-05-28 run took ~41 min); default listed in the environment-variable table above
CHV2011_BOOT_4A=1000 Rscript 04-topics/rep-chv2011/Rcode/Table_4a.R

Port of author bootstrap_MT.R: probit first step, HC robust Wald, pivotal bootstrap, Romano–Wolf 10% critical value.

Export script export_data_with_tuition.R
Core module chv2011-table4a-core.R: Murphy–Topel + HC asymptotic, pivotal bootstrap, double bootstrap (A-5)
Table_4a.R: MT asymptotic + bootstrap p-values + Romano–Wolf
Table_A5.R: double bootstrap per description1.tex (default 500 outer × 100 inner draws)
Paper-grade runs completed (2026-05-28): CHV2011_BOOT_4A=1000 (~41 min) and CHV2011_BOOT_A5_OUT=500 CHV2011_BOOT_A5_IN=100 (~5.4 h)
Numerical match to paper bootstrap p-values — not achieved on bundled row-merge data (see alignment note)

Unlocks: Table 4(a), A-5 (R pipeline complete; bootstrap p-values approximate only).

Probit vs logit

Author bootstrap_MT.R uses a probit first step; the main choice equation in the paper is logit. Table 4(a) follows the author replication script (probit).

Alignment note (P2, bundled data)

After paper-grade bootstrap (B=1000 / 500×100):

Test	R (bundled)	Paper
Table 4(a) HC asymptotic, deg 2	0.035	0.035
Table 4(a) bootstrap p, deg 2–5	0.144, 0.250, 0.494, 0.459	0.035, 0.049, 0.086, 0.122
Table 4(a) RW joint (10%)	Fail to reject (crit 0.027)	Reject
Table A-5 bootstrap p, deg 2–5	0.094, 0.178, 0.280, 0.278	0.004, 0.006, 0.022, 0.026
Table A-5 RW critical	0.054	~0.052
Table A-5 RW joint	Fail to reject	Reject

HC / MT asymptotic p-values track the paper on deg 2; pivotal bootstrap p-values are systematically higher on bundled data. Likely needs geocode-merged dataset.dta and/or author double_bootstrap.R (not in AER bundle). Implementation follows description1.tex / bootstrap_MT.R; no clear code bug identified.

6 Step 4 — Semi-parametric pipeline (P3)

Goal: MTE, treatment parameters, Table 4b / 5 / figures.

6.1 Author pipeline (GAUSS / MATLAB / superboot)

Working directory: replication_aer/mainresults/
bootdata/ populated (Step P1)
Run ./superboot (calls runboot → runboot2 → runboot3, 250× each)
Verify out/ contains mte1.out … mte250.out, treat1.out … treat250.out
MATLAB: out/mte4c.m → Table 4(b); out/treatwithiv.m → Table 5 (cols 2, IV, OLS)
./runboot0 + getolsiv.do → olsiv*.out for Table A-8
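
To verify the superboot outputs named in the list above, a loop over the expected indices can print whichever files are absent. This is a minimal sketch, assuming the working directory is replication_aer/mainresults/ and the files follow the <prefix><i>.out pattern; `missing_out` is a hypothetical helper.

```shell
# Hypothetical helper: print each expected <prefix><i>.out file that is absent.
missing_out() {
  dir=$1
  prefix=$2
  b=${3:-250}
  i=1
  while [ "$i" -le "$b" ]; do
    [ -f "$dir/$prefix$i.out" ] || echo "$prefix$i.out"
    i=$((i + 1))
  done
}

# Count the gaps for each output family (0 means the set is complete)
missing_out out mte 250 | wc -l
missing_out out treat 250 | wc -l
```

Dropping the `| wc -l` lists the missing file names, which shows which bootstrap draws need to be re-run.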

6.2 Course R pipeline (implemented)

Polynomial MTE proxy aligned with Table 4(a) outcome spec (chv2011-table4a-core.R); uses cached bootdata/ when present.

# Smoke test (B=10); paper-grade: CHV2011_BOOT=250
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_4b.R
CHV2011_BOOT=10 Rscript 04-topics/rep-chv2011/Rcode/Table_5.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R

chv2011-mte-core.R — polynomial MTE derivative, MPRTE weights, OLS/2SLS returns, bootdata-aware bootstrap
Table_4b.R — adjacent LATE difference test (Table 4(b))
Table_5.R — normal column from normalmodel/mtexvb.out; semiparametric MPRTE / IV / OLS
Figure_1.R … Figure_7.R + chv2011-figures.R
Wired in replicate-chv2011-R.qmd (tab-4b, tab-5, fig-1 … fig-7)
Paper-grade bootstrap: CHV2011_BOOT=250 for Table 4(b) and 5 (2026-05-28; ~9 min total)
Figures 1–7 cached in Rcode/Figure_*.RData; Quarto page updated
Full author match requires geocode dataset.dta + GAUSS semi-parametric MTE (superboot)

R vs author numbers

On bundled row-merge data, R IV / OLS and normal-model columns align with paper magnitude (~0.07–0.12 annual returns). Semiparametric MPRTE uses the same polynomial proxy as Table 4(a), not GAUSS chv01b.prg; expect deviation from published Table 5 col 2. Table 4(b) LATE-difference pattern is informative but not identical to mte4c.m on 250 superboot draws.

Unlocks: Table 4(b), 5, A-7, A-8; Figures 2, 3, 4, 5, 6, 7 (R proxies; author GAUSS/MATLAB for publication match).

7 Step 5 — Normal model + Figure 1 (parallel track)

7.1 Author pipeline

normalmodel/normalselb.do → mtexvb.out (re-run; shipped file used as demo)
normalmodel/normalsel_boot.do + bootstrap → normalmodel/out/treat*.out
MATLAB: dofig1.m → Figure 1; out/treatb.m → Table 5 col 1, A-6

7.2 Course R (uses shipped `normalmodel/mtexvb.out`)

Figure 1 via Figure_1.R + chv2011_fig_normal_mte()
Table 5 normal column + Table A-6 point estimates from mtexvb.out weights
Normal-column bootstrap SEs (author normalsel_boot.do track)

Unlocks: Table 5 (normal column), A-6; Figure 1.

8 Step 6 — Table 6 robustness (P4)

Each column re-runs the full pipeline under a different spec (table6/*/).

Column	Directory	Extra bootstrap data
6(a) Baseline	`mainresults/`	root `bootdata/`
6(a) No dropouts	`table6/nodrop/`	`table6/bootdata_nodrop/` (run local `getboot.do`)
6(a) Dropout dummies	`table6/dropdummies/`	root `bootdata/`
6(b) All X in Z	`table6/allxinz/`	root `bootdata/`
6(b) No Z×X	`table6/nointall/`	root `bootdata/`
6(b) Cameron–Taber	`table6/notuitnounemp/`	root `bootdata/`
6(b) No tuition	`table6/notuit/`	root `bootdata/`
6(b) SLS	`table6/sls/`	`getdatab.do` → R `npindex` (>24h)

Unlocks: Table 6(a), 6(b) (R pipeline + B=250 bootstrap SEs; point estimates still differ from GAUSS cdens paper numbers).

9 Table unlock map

Phase	Author requirement	Course R status (2026-05-28)	Tables / figures
P0	Bundled `.dta`	Done	1, 2, A-1, A-2†, A-3
P1	`bootdata/` (250)	Done (R `getboot.R`)	3, A-4 (+ bootstrap SEs)
P2	`data_with_tuition.RData` + B=1000 / 500×100	Run complete‡	4(a), A-5
P3	`mainresults/out/*.out` OR R MTE core	R done (B=250); author not run	4(b), 5, A-7, A-8, Fig 2–7
P3′	`normalmodel/out/*.out`	Partial (shipped `mtexvb.out`)	5 col 1, A-6, Fig 1
P4	`table6/*` pipelines	R done (B=250); author `bootdata_nodrop` / GAUSS not run	6(a), 6(b)

† Table A-2: no author script; cafqt in bundled data is pre-corrected (Hansen–Heckman–Mullen).

‡ P2 bootstrap p-values on bundled data do not match published Table 4(a) / A-5; HC asymptotic deg-2 aligns.

10 One-shot R smoke test (course pipeline)

From repository root, after renv::restore():

# Fast smoke test (reduced bootstrap); POSIX shell or WSL
# In Windows cmd, use `set CHV2011_BOOT=30` instead of `export`
export CHV2011_BOOT=30
export CHV2011_BOOT_4A=50
Rscript 04-topics/rep-chv2011/Rcode/run_all_tables.R
Rscript 04-topics/rep-chv2011/Rcode/run_all_figures.R
quarto render 04-topics/rep-chv2011/replicate-chv2011-R.qmd

Expected: HTML renders; tables load from Rcode/*.RData. For paper-matched SEs, set CHV2011_BOOT=250 and allow longer runtimes.
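
Before rendering, it can help to confirm that each script has produced its cache. The sketch below lists Table_*.R and Figure_*.R scripts with no same-named .RData file. It assumes every script writes a cache with the same stem, which this page confirms only for Table_A3.R and the figure scripts; `missing_caches` is a hypothetical helper.

```shell
# Hypothetical helper: list Table_*.R / Figure_*.R scripts without a same-named .RData cache.
missing_caches() {
  dir=$1
  for script in "$dir"/Table_*.R "$dir"/Figure_*.R; do
    [ -f "$script" ] || continue            # skip glob patterns that matched nothing
    cache="${script%.R}.RData"              # Table_3.R -> Table_3.RData
    [ -f "$cache" ] || echo "$(basename "$script") has no cache"
  done
}

missing_caches 04-topics/rep-chv2011/Rcode
```

Empty output means every script found has a cache; it does not show whether the cache was built with the reduced or the paper-grade bootstrap count.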

11 Troubleshooting

Symptom	Likely cause	Fix
`use dataset` fails	No `dataset.dta`	P0 row merge in R, or P1 Stata merge
`cp ../bootdata/phat$i.asc` fails	No bootstrap samples	Run `getboot.R` (P1, Option A) or `getboot.do` (Option B)
`load mte1.out` fails in MATLAB	Never ran `superboot`	Complete P3 or use R MTE core
`bootstrap_MT.R` fails on `load()`	No `data_with_tuition.RData`	Step P2 export
IV joint test ≠ paper	Spec / merge / bootstrap	Compare logit on full $N$; verify local IV merge
Table 6 SLS hangs	`npindex` runtime	Cache coefficients; document in output

12 References

Carneiro, Heckman, and Vytlacil (2011) — Estimating Marginal Returns to Education, American Economic Review.
Author replication: replication_aer/README.txt
Course R output: replicate-chv2011-R.qmd

References

Carneiro, Pedro, James J. Heckman, and Edward Vytlacil. 2011. “Estimating Marginal Returns to Education.” The American Economic Review 101 (6): 2754–81. https://doi.org/10.1257/aer.101.6.2754.