These are the instructions for replicating the results in "Estimating Marginal Returns to Education", by Pedro Carneiro, James Heckman, and Edward Vytlacil.

1. Assemble the dataset

We use data for white males from the National Longitudinal Survey of Youth of 1979 (NLSY79). The main dataset is publicly available from the Bureau of Labor Statistics: http://www.bls.gov/nls. Since the construction of several of the variables we use requires access to confidential geocode information, we cannot post the final dataset directly. Instead, we provide the ingredients required to construct it once the researcher has access to the geocode data.

There are two basic datafiles:
i) "basicvariables.dta" is a STATA dataset with basic variables from the public use files of the NLSY79 - subjects can be identified by their identifier, which we call caseid (R0001.00 in the original dataset)
ii) "localvariables.dta" is a STATA dataset with all the local variables constructed from multiple sources and merged into the NLSY79 using geocode information

These two datasets use different identifiers for each observation ("caseid" in "basicvariables.dta", "newid" in "localvariables.dta"), so they cannot be merged without a third file that maps "caseid" to "newid". We call this third file, which the researcher constructs after obtaining access to the geocodes, "newid.dta".

How can we construct "newid.dta"? First extract three variables from the geocode files (extract ALL 12686 observations, even though the paper uses only white males from the main cross-sectional sample): the individual identifier (caseid), the FIPS code for state of residence in 1979 (state; R02190.02 in NLSY), and the FIPS code for county of residence in 1979 (county; R02190.01 in NLSY). The STATA instructions to construct the identifier used in "localvariables.dta" are then the following:

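sort caseid
* order by state and county FIPS codes, breaking ties by caseid
sort state county caseid
* newid is the position of each observation in that ordering
ge newid = _n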
keep caseid newid
sort caseid
save newid, replace
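
For readers who want to check the logic outside STATA, the construction above can be sketched in Python. This is only an illustration of the sort-and-number step, not a replacement for the STATA instructions. The sample rows and the function name build_newid are hypothetical, and the sketch assumes that caseid is unique and that state and county are never missing.

```python
# Hypothetical rows extracted from the geocode files: (caseid, state, county).
# The values are invented for illustration only.
rows = [
    (1, 36, 61),
    (2, 6, 37),
    (3, 36, 47),
    (4, 6, 37),
]


def build_newid(rows):
    """Mirror `sort state county caseid` followed by `ge newid = _n`."""
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[0]))
    # STATA's _n is 1-based, so numbering starts at 1.
    mapping = {caseid: n for n, (caseid, _state, _county) in enumerate(ordered, start=1)}
    # Return (caseid, newid) pairs sorted by caseid, as in newid.dta.
    return sorted(mapping.items())


newid_table = build_newid(rows)
print(newid_table)
```

Because newid is a position in the full ordering, dropping any observation before this step would shift the identifiers, which is why all 12686 observations must be extracted.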

Finally, create the main dataset for the paper by assembling all three subdatasets using getdataset.do, which is a STATA do-file.
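
The two-step merge implied by the three files can be sketched as follows. This is not a transcription of getdataset.do. The miniature datasets, the function name assemble, and the choice to keep only records found in all three files are assumptions made for illustration.

```python
# Hypothetical miniature versions of the three files.
basic = {1: {"school": 4}, 2: {"school": 2}}   # basicvariables.dta, keyed by caseid
newid_map = {1: 2, 2: 1}                       # newid.dta, caseid -> newid
local = {1: {"pub4": 0}, 2: {"pub4": 1}}       # localvariables.dta, keyed by newid


def assemble(basic, newid_map, local):
    """Attach newid to each public-use record, then attach the local variables."""
    merged = []
    for caseid in sorted(basic):
        newid = newid_map.get(caseid)
        if newid is None or newid not in local:
            # Assumption: keep only records present in all three files.
            continue
        row = {"caseid": caseid, "newid": newid}
        row.update(basic[caseid])
        row.update(local[newid])
        merged.append(row)
    return merged


dataset = assemble(basic, newid_map, local)
print(dataset)
```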

List of variables:

caseid - ID number from NLSY
newid - ID number for merging public and local variables
state - dummy for college attendance
school - 1 if high school dropout, 2 if high school graduate, 3 if some college, 4 if college graduate
wage - log average wage measure for 1991
const - constant
exp - years of experience
expsq - years of experience squared
cafqt - AFQT adjusted for completed schooling at test date
mhgc - years of education of the mother
d57 d58 d59 d60 d61 d62 d63 - dummies for year of birth
lwage5 - log average wage in county of residence in 1991
lurate - average unemployment in state of residence in 1991
pub4 - dummy for presence of public 4 year college in county of residence at 14
lwage5_17 - log average wage in county of residence at 17
lurate_17 - average unemployment in state of residence at 17
urban14 - dummy for urban residence at 14
lavlocwage17 - log average "permanent" wage in county of residence at 17
avurate - average "permanent" unemployment in state of residence at 17
numsibs - number of siblings
tuit4c - average tuition in public 4 year colleges in county of residence at 17 (to be precise, it is the maximum tuition in counties of residence at 16, 17 and 18)

A dataset containing only caseid and school is also needed. We call that dataset "school.dta".
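
A minimal Python sketch of that extraction is given below. The input rows and the function name school_extract are hypothetical. The check on the school codes follows the coding listed above (1 to 4), and rejecting any other value is an assumption of this sketch.

```python
VALID_SCHOOL_CODES = {1, 2, 3, 4}  # coding taken from the list of variables


def school_extract(rows):
    """Keep only caseid and school, rejecting codes outside 1-4."""
    out = []
    for r in rows:
        if r["school"] not in VALID_SCHOOL_CODES:
            raise ValueError(f"unexpected school code for caseid {r['caseid']}")
        out.append({"caseid": r["caseid"], "school": r["school"]})
    return out


# Invented records standing in for the main dataset.
rows = [
    {"caseid": 1, "school": 4, "wage": 2.9},
    {"caseid": 2, "school": 2, "wage": 2.4},
]
school_rows = school_extract(rows)
print(school_rows)
```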

2. Create the bootstrap samples
getboot.do (STATA)
The bootstrap samples are stored in a directory called bootdata.
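
For orientation, a generic nonparametric bootstrap (resampling observations with replacement) can be sketched in Python as below. This is an assumption about the general technique, not a description of what getboot.do does: the number of replications, the seed, and the function name bootstrap_samples are all hypothetical, and writing the samples to the bootdata directory is left out.

```python
import random


def bootstrap_samples(rows, n_reps, seed=0):
    """Yield n_reps resamples of rows, each drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the draws can be reproduced
    n = len(rows)
    for _ in range(n_reps):
        # Draw n row positions with replacement and collect those rows.
        yield [rows[i] for i in rng.choices(range(n), k=n)]


# Invented data: ten observation identifiers, three replications.
for sample in bootstrap_samples(list(range(10)), n_reps=3, seed=1):
    print(sample)
```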

3. Table 3
avder_ddd.do (STATA)
The output is stored in a directory called out2_ddd.
avder.m (MATLAB)

4. Table 4
Table 4a is produced by an R program. First create an R version of the dataset and name it data_with_tuition.RData. The procedure is described in detail in description1.tex (a TeX file).
Go to the directory ./mainresults/table4a
./mainresults/table4a/bootstrap_MT.R (R)
The procedure for constructing Table 4b is discussed below. It is necessary to estimate other quantities first.

5. Figure 1
These programs are in the "normalmodel" directory.
./normalmodel/normalselb.do (STATA)
./normalmodel/dofig1.m (MATLAB)

6. The remaining tables and figures of the paper require substantial additional setup. Before turning to them specifically, we compute all the additional estimates they need.

6.1 Estimate the MTE in the semi-parametric model
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "mainresults" directory.
./mainresults/runboot (UNIX)
	./mainresults/getphat.do (STATA)
	./mainresults/chv01b.prg (GAUSS)
	./mainresults/getzx.m (MATLAB)
	./mainresults/getzxpol.do (STATA)
The relevant output is stored in the "out" directory (inside "mainresults").

6.2 Estimate f(P|X) in the semi-parametric model
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "mainresults" directory.
./mainresults/runboot2 (UNIX)
	./mainresults/cdens.prg (GAUSS)
	./mainresults/cdenspol.prg (GAUSS)
	./mainresults/cdenspolb.prg (GAUSS)
The relevant output is stored in the "out" directory (inside "mainresults").

6.3 Estimate Treatment Effect Parameters in the semi-parametric model
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "mainresults" directory.
./mainresults/runboot3 (UNIX)
	./mainresults/treatparnew.m (MATLAB - this calls kerndens.m, also a MATLAB program)
The relevant output is stored in the "out" directory (inside "mainresults").

6.4 Estimate MTE in the normal model
These programs are in the "normalmodel" directory.
./normalmodel/normalsel_boot.do (STATA)
The relevant output is stored in the "out" directory (inside "normalmodel").

6.5 Estimate f(P|X) in the normal model
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "normalmodel" directory.
./normalmodel/runboot2 (UNIX)
	./normalmodel/cdens.prg (GAUSS - this uses the file z.out which is in the same directory)
	./normalmodel/cdenspol.prg (GAUSS)
The relevant output is stored in the "out" directory (inside "normalmodel").

6.6 Estimate Treatment Effect Parameters in the normal model
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "normalmodel" directory.
./normalmodel/runboot3 (UNIX)
	./normalmodel/treatparnew.m (MATLAB - this calls kerndens.m, also a MATLAB program)
The relevant output is stored in the "out" directory (inside "normalmodel").

6.7 Get OLS and IV estimates
This requires running a shell script in a UNIX environment, which calls several other programs in STATA, GAUSS and MATLAB. These programs are in the "mainresults" directory.
./mainresults/runboot0 (UNIX)
	./mainresults/getolsiv.do (STATA)
The relevant output is stored in the "out" directory (inside "mainresults").
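
Before launching the long-running steps 6.1 to 6.7, it can help to confirm that every driver file is in place. The Python sketch below lists the drivers in the order given above and reports any that are missing. The paths come from this document; the names SETUP_STEPS and missing_drivers, and the idea of running such a check, are illustrative assumptions.

```python
from pathlib import Path

# Driver for each setup step, in the order listed in section 6.
SETUP_STEPS = [
    ("6.1", "mainresults/runboot"),
    ("6.2", "mainresults/runboot2"),
    ("6.3", "mainresults/runboot3"),
    ("6.4", "normalmodel/normalsel_boot.do"),
    ("6.5", "normalmodel/runboot2"),
    ("6.6", "normalmodel/runboot3"),
    ("6.7", "mainresults/runboot0"),
]


def missing_drivers(root="."):
    """Return the (step, path) pairs whose driver file is not found under root."""
    return [(step, rel) for step, rel in SETUP_STEPS
            if not (Path(root) / rel).is_file()]


# Report missing drivers relative to the current directory.
for step, rel in missing_drivers("."):
    print(f"step {step}: {rel} not found")
```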

7. Table 4b
Go to the directory "mainresults/out"
./mainresults/out/mte4c.m (MATLAB)

8. Table 5

8.1 First column
Go to the directory "normalmodel/out"
./normalmodel/out/treatb.m (MATLAB)

8.2 Second column, OLS, and IV estimates
Go to the directory "mainresults/out"
./mainresults/out/treatwithiv.m (MATLAB)

9. Figure 2
Go to the directory "mainresults/figures"
./mainresults/figures/run_fig2 (UNIX)
	./mainresults/figures/cdens_fig2.prg (GAUSS)
	./mainresults/figures/fig2.m (MATLAB)

10. Figure 3
Go to the directory "mainresults/figures"
./mainresults/figures/getphat_dist.do (STATA)
./mainresults/figures/getphatdist.m (MATLAB)

11. Figure 4
Go to the directory "mainresults/out"
./mainresults/out/mte4c.m (MATLAB)

12. Figures 5 and 7
Go to the directory "mainresults/figures"
./mainresults/figures/run_fig5 (UNIX)
	./mainresults/figures/weights.m (MATLAB, which uses kerndens.m)

13. Figure 6
Go to the directory "mainresults/figures"
./mainresults/figures/getphat_ivsupportb.do (STATA)
./mainresults/figures/supz2575c.m (MATLAB)

14. Table 6
This requires defining different samples or sets of instruments/controls and re-estimating everything. All files are in a directory called "table6".

14.1 Create new bootstrap samples (only when needed) using instructions in point 2 above. Look at the following directories:
./table6/bootdata_nodrop - sample without high school dropouts
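
The sample restriction behind the "nodrop" variant can be sketched in Python using the school coding from the list of variables (1 = high school dropout). The records and the function name drop_dropouts are hypothetical, and the sketch only illustrates the filter, not the bootstrap step that follows it.

```python
HIGH_SCHOOL_DROPOUT = 1  # school code 1 in the list of variables


def drop_dropouts(rows):
    """Keep only observations that are not high school dropouts."""
    return [r for r in rows if r["school"] != HIGH_SCHOOL_DROPOUT]


# Invented records for illustration.
rows = [
    {"caseid": 1, "school": 1},
    {"caseid": 2, "school": 2},
    {"caseid": 3, "school": 4},
]
nodrop = drop_dropouts(rows)
print(nodrop)
```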

14.2 Estimate normal model
Look at the directory ./table6/normal
./table6/normal/normalsel_nodrop.do (STATA) - model without high school dropouts
./table6/normal/normalsel_dropdummies.do (STATA) - model with dummies for dropout and some college
./table6/normal/normalsel_allxinz.do (STATA) - model with all X in Z
./table6/normal/normalsel_nointall.do (STATA) - model without interactions between X and Z
./table6/normal/normalsel_notuitnounemp.do (STATA) - Cameron and Taber
./table6/normal/normalsel_notuit.do (STATA) - model without tuition

14.3 Follow instructions in point 6 above to estimate semi-parametric model
Look at the superboot (UNIX) script and all the components it calls within the following directories:
./table6/nodrop - model without high school dropouts
./table6/dropdummies - model with dummies for dropout and some college
./table6/allxinz - model with all X variables included in Z
./table6/nointall - model with no interactions between X and Z
./table6/notuitnounemp - Cameron and Taber
./table6/notuit - model without tuition

14.4 Estimation of P using semiparametric least squares (SLS)
This requires using R, which has an SLS routine in the np package. The np package should be loaded before running the instructions in R.
Before getting to R, one needs to prepare the data for use in R. Go to the directory ./table6/sls:
./table6/sls/getdatab.do (STATA)
This program produces datab.dta, a STATA dataset that can be read into R.
Then run the following instructions in R:

library(foreign)
library(np)
mydat<-read.dta("datab.dta")
names(mydat)
attach(mydat)
bw<-npindexbw(formula=phat~exp+expsq+cafqt+cafqt2+mhgc+mhgc2+numsibs+numsibs2+urban14+lavlocwage17+avurate+lavlocwage172+avurate2+d57+d58+d59+d60+d61+d62+d63+lwage5+lurate,data=mydat)
model<-npindex(bw)
summary(model)

The npindexbw instruction takes a long time to run (more than 24 hours on a standard PC). The output of this set of instructions is a set of coefficients that can be used in a STATA program.

./table6/sls/getzxpol_ind.do (STATA)
./table6/sls/cdens.prg (GAUSS)

This GAUSS program uses z.out, which should be in the same directory.

Finally, run the following:
./table6/sls/runboot3 (UNIX)
	./table6/sls/treatparnew.m (MATLAB)

This MATLAB program uses kerndens.m and ksr_p.m.