Replication with R

Does Compulsory School Attendance Affect Schooling and Earnings?

Author

Hu Huaping

Published

May 1, 2025

Introduction

Hu Huaping replicates all eight tables in (Angrist and Krueger 1991) using R and provides some learning resources and notes. The R scripts (.R) and result files (.rds) can be found here.

Thanks to Win Supanwanid for providing the data and STATA scripts <www.github.com/winsup/angrist_krueger_1991>.

R Scripts and Results files

The data set and R scripts are provided as follows:

The result tables (table_data and gt_tbl objects) are provided as follows (you can download the files and use them in your own project):
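As a minimal sketch of how a downloaded .rds result file might be loaded into another project: the file name and the stand-in object below are hypothetical, created only so the example runs end to end. The real files would be read with the same `readRDS()` call.

```r
# Hypothetical path; replace it with the location of a downloaded .rds file
rds_path <- file.path(tempdir(), "table_data_example.rds")

# Stand-in for a downloaded result object (placeholder values, not real results)
saveRDS(data.frame(term = "Years of education", estimate = 0.08), rds_path)

# readRDS() restores the saved R object exactly as it was stored
table_data <- readRDS(rds_path)
str(table_data)

# A gt_tbl object loads the same way; printing it requires the gt package
```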

Learning Resources

  1. Econometric Replication Paper Project of “Does Compulsory School Attendance Affect Schooling and Earnings?” GitHub repo. Notes: Stata code and LaTeX PDF files; all tables and figures in the paper are also provided.

  2. Causal Inference: The Mixtape, a free online book, with its GitHub repo here. Chapter 7, Instrumental Variables. Notes: explains this example thoroughly, but does not supply the data or code.

    • (Cunningham 2021) Cunningham, S. Causal Inference: The Mixtape[M]. Yale University Press, 2021.

    • Useful companion slides for the textbook by Prof. Scott Cunningham: chapter 7, Instrumental Variables. See here.

    • Dive into advanced IV topics with Prof. Peter Hull. See here. Notes: slides and R code.

  3. R Code for Mastering ’Metrics online free book. Chapter 9: Quarter of Birth and Returns to Schooling. notes: useful Figures with R code.

    • (Angrist and Pischke 2015) Angrist, J. D., and J. S. Pischke. Mastering ’Metrics: The Path from Cause to Effect[M]. Princeton, Oxford: Princeton University Press, 2015.

  4. Chapter 13, Instrumental Variables Regression, by (Muller, Winship, and Morgan 2014) Muller, C., C. Winship, and S. Morgan. Instrumental Variables Regression[A]. The SAGE Handbook of Regression Analysis and Causal Inference[C]. SAGE Publications Ltd, 2014. Notes: another useful example, which may supply R code.

    • See the book (Best and Wolf 2015): Best, H., and C. Wolf. The SAGE Handbook of Regression Analysis and Causal Inference[M]. Los Angeles: Sage, 2015. More resources may be available on the book’s website, which requires a subscription.

Appendix Tables

Table I

Table I: The Effect of Quarter of Birth on Various Educational Outcome Variables

Quarter-of-birth effects (QTR1 to QTR3) are shown with standard errors in parentheses; F-test p-values are in brackets.

| Outcome variable | Birth cohort | Mean | QTR1 | QTR2 | QTR3 | F-test |
|---|---|---|---|---|---|---|
| Total years of education | 1930-1939 | 12.7922 | -0.1243 (0.0167) | -0.0860 (0.0168) | -0.0149 (0.0160) | 25.1183 [0.0000] |
| Total years of education | 1940-1949 | 13.5600 | -0.0855 (0.0125) | -0.0348 (0.0126) | -0.0188 (0.0126) | 17.3582 [0.0000] |
| High school graduate | 1930-1939 | 0.7741 | -0.0191 (0.0021) | -0.0198 (0.0021) | -0.0039 (0.0020) | 46.5981 [0.0000] |
| High school graduate | 1940-1949 | 0.8637 | -0.0145 (0.0014) | -0.0121 (0.0014) | -0.0020 (0.0014) | 51.1658 [0.0000] |
| Years of educ. for high school graduates | 1930-1939 | 14.0060 | -0.0296 (0.0143) | 0.0051 (0.0143) | 0.0165 (0.0136) | 3.7914 [0.0099] |
| Years of educ. for high school graduates | 1940-1949 | 14.2813 | -0.0093 (0.0110) | 0.0205 (0.0111) | 0.0079 (0.0111) | 2.6535 [0.0468] |
| College graduates | 1930-1939 | 0.2356 | -0.0050 (0.0022) | 0.0028 (0.0022) | 0.0019 (0.0021) | 4.9976 [0.0018] |
| College graduates | 1940-1949 | 0.2996 | -0.0028 (0.0019) | 0.0046 (0.0019) | -0.0000 (0.0019) | 5.1465 [0.0015] |
| Completed master’s degree | 1930-1939 | 0.0898 | -0.0010 (0.0015) | 0.0019 (0.0015) | -0.0009 (0.0014) | 1.7151 [0.1615] |
| Completed master’s degree | 1940-1949 | 0.1102 | 0.0001 (0.0013) | 0.0039 (0.0013) | 0.0010 (0.0013) | 3.8492 [0.0091] |
| Completed doctoral degree | 1930-1939 | 0.0350 | 0.0016 (0.0009) | 0.0025 (0.0009) | 0.0004 (0.0009) | 2.8831 [0.0343] |
| Completed doctoral degree | 1940-1949 | 0.0360 | -0.0018 (0.0008) | 0.0009 (0.0008) | -0.0005 (0.0008) | 4.3011 [0.0049] |

Notes 1: Take care when calculating the moving average (MA). First compute the mean of EDUC and the cell size n for each year-quarter (YQ) group, using the full sequence of YQ within the two cohorts (1930-1939 and 1940-1949). The lags and leads used in the MA must then be taken over that full YQ sequence, not over individual rows. Check the math in the paper, and read the R code to see the correct calculation of MA.

\[ \begin{aligned} M A_{c j} & =\frac{E_{-2}+E_{-1}+E_{+1}+E_{+2}}{4} & \text{(1)}\\ E_{i c j}-M A_{c j} & =\alpha+\sum_{j}^{3} \beta_{j} Q_{i c j}+\epsilon_{i c j} & \text{(2)} \end{aligned} \]

for \(i=1,2, \ldots, N_{c} ; c=1,2, \ldots, 10 ; j=1,2,3\)

  • \(E_{i c j}\) : the educational outcome variable listed in Table I: total years of education, high school graduate, years of education for high school graduates, college graduates, completed master’s degree, completed doctoral degree.
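  • \(M A_{c j}\) : the moving average of the educational outcome variable for men born in the surrounding quarters, where \(E_{-2}, E_{-1}, E_{+1}, E_{+2}\) are the average outcomes of the cohorts born two and one quarters before and after. Its purpose is to detrend the series.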
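The two equations above can be sketched in base R as follows. This is a minimal illustration on toy data, not the replication script: the column names (YOB, QOB, EDUC), the helper `shift()`, and all values are assumptions. The toy series has a linear trend plus a 0.1 dip for first-quarter births, so the detrended first-quarter mean should be about -0.1.

```r
# Toy data: 12 year-quarter cells, two men per cell (hypothetical values)
cells <- expand.grid(QOB = 1:4, YOB = 1930:1932)        # QOB varies fastest
cells$m <- 12 + 0.1 * seq_len(nrow(cells)) - 0.1 * (cells$QOB == 1)
df <- rbind(transform(cells, EDUC = m - 1), transform(cells, EDUC = m + 1))
df <- df[, c("YOB", "QOB", "EDUC")]

# Step 1: mean of EDUC per year-quarter cell, in chronological order
yq <- aggregate(EDUC ~ YOB + QOB, data = df, FUN = mean)
yq <- yq[order(yq$YOB, yq$QOB), ]

# Hypothetical helper: k > 0 looks back k cells, k < 0 looks ahead; pads with NA
shift <- function(x, k) {
  n <- length(x)
  if (k > 0) c(rep(NA, k), x[seq_len(n - k)]) else c(x[(1 - k):n], rep(NA, -k))
}

# Step 2: equation (1), taken over the full sequence of cells
e <- yq$EDUC
yq$MA <- (shift(e, 2) + shift(e, 1) + shift(e, -1) + shift(e, -2)) / 4

# Step 3: attach MA to each man and detrend
df <- merge(df, yq[, c("YOB", "QOB", "MA")], by = c("YOB", "QOB"))
df$EDUC_dt <- df$EDUC - df$MA

# Equation (2): regress the detrended outcome on quarter-of-birth dummies
fit <- lm(EDUC_dt ~ I(QOB == 1) + I(QOB == 2) + I(QOB == 3), data = df)
```

The first two and last two cells have no complete window, so their MA is NA and `lm()` drops those rows by default.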

Notes 2: Keep in mind that Table I has two kinds of dependent variable: years of education and levels of education. Both are detrended. Years of education is computed as EDUC - MA(EDUC); the mean of EDUC might be, for example, 12.05 for the 1930-1939 cohort. Levels of education are computed from education dummies, e.g. hs_grad - MA(hs_grad); the mean of hs_grad might be, for example, 0.25 for the 1930-1939 cohort.

Table II

Table II: Percent enrolled April 1, 1960 [1][2]

Columns (1) and (2) give the percent enrolled on April 1, 1960, by type of state law.

| Date of birth | (1) School-leaving age: 16 | (2) School-leaving age: 17 or 18 | Column (1) - (2) |
|---|---|---|---|
| 1. Jan 1-Mar 31, 1944 | 84.8912 (0.3827) | 85.7213 (0.7705) | -0.8301 (0.8604) |
| 2. Apr 1-Dec 31, 1944 | 85.7225 (0.2179) | 85.9915 (0.4442) | -0.2690 (0.4947) |
| 3. Within-state diff. | -0.8314 (0.4404) | -0.2702 (0.8894) | -0.5611 (0.9925) |

[1] Standard errors in parentheses; both estimates and standard errors are in percentage points.
[2] The results in this table differ from those in the paper because the author does not provide data for this section. I used only the available data to approximate the table. Results in the last column were calculated by hand, e.g. \(-0.5611=(-0.8301)-(-0.2690)\).

Notes 1: The results in this table differ from those in the paper because the author does not provide data for this section. I used only the available data (appendix2_table.dta) to approximate the table. Results in the last column were calculated by hand, e.g. \(-0.5611=(-0.8301)-(-0.2690)\).
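The hand calculation can be written out as a short R check. The estimate follows directly from the note. The standard-error formula is an assumption on my part: treating the two row differences as independent, so that their variances add, reproduces the reported 0.9925 after rounding, but the original calculation may have been done differently.

```r
# Column (1) - (2) differences and standard errors copied from Table II
diff_q1   <- -0.8301; se_q1   <- 0.8604   # Jan 1-Mar 31 row
diff_rest <- -0.2690; se_rest <- 0.4947   # Apr 1-Dec 31 row

# Difference-in-differences estimate from the note
did <- diff_q1 - diff_rest

# Assumption: the two differences are independent, so variances add
did_se <- sqrt(se_q1^2 + se_rest^2)

round(c(estimate = did, std_error = did_se), 4)
```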

Table III

Table III: Wald Estimates [1]

| Variable | (1) Born in 1st quarter of year | (2) Born in 2nd, 3rd, or 4th quarter of year | (3) Difference (std. error) (1)-(2) |
|---|---|---|---|
| PANEL A: WALD ESTIMATES FOR 1970 CENSUS - MEN BORN 1920-1929 | | | |
| \(ln\)(wkly. wage) | 5.1485 | 5.1574 | -0.0090 (0.0030) |
| Education | 11.3996 | 11.5252 | -0.1256 (0.0155) |
| Wald est. of return to education | | | 0.0715 (0.0256) |
| OLS return to education | | | 0.0801 (0.0004) |
| PANEL B: WALD ESTIMATES FOR 1980 CENSUS - MEN BORN 1930-1939 | | | |
| \(ln\)(wkly. wage) | 5.8916 | 5.9027 | -0.0111 (0.0027) |
| Education | 12.6881 | 12.7969 | -0.1088 (0.0132) |
| Wald est. of return to education | | | 0.1020 (0.0281) |
| OLS return to education | | | 0.0709 (0.0003) |

[1] The standard errors of the Wald estimates differ from those produced by Stata's nlcom command, e.g. in Panel A the R result is 0.02556816 while the Stata result is 0.0218682.

Notes 1: The standard errors of the Wald estimates in R are not consistent with those in the paper (AK1991) or with those from Stata. I think the reason is that the standard error calculation in R is not correct.

Here is the STATA command for calculating the Wald Estimates:

sureg (eq1: LWKLYWGE QTR1 ) (eq2: EDUC QTR1 ) if COHORT==3039
nlcom ratio:[eq1]_b[QTR1]/[eq2]_b[QTR1]

And here is the R code for calculating the Wald Estimates:

library(systemfit)

# Two-equation system for Panel A; both equations share the regressor QTR1.
# NOTE: systemfit() defaults to method = "OLS", under which the reported
# cross-equation covariance is zero (see below). Passing method = "SUR"
# would estimate that covariance, as Stata's sureg does.
sur_a <- systemfit(list(
  wage = LWKLYWGE ~ QTR1,
  educ = EDUC ~ QTR1
), data = panel_a)

# QTR1 coefficients in the wage and education equations
# (reported values: wage_QTR1 = -0.008978875, educ_QTR1 = -0.125555298)
b1 <- coef(sur_a)[["wage_QTR1"]]
b2 <- coef(sur_a)[["educ_QTR1"]]

# Wald estimate for Panel A: ratio of the two QTR1 coefficients
wald_a <- b1 / b2

# Variance-covariance matrix of all coefficients in the system
# (reported values: Var(b1) = 9.070617e-06, Var(b2) = 2.414637e-04,
#  Cov(b1, b2) = 0.000000e+00)
vcov_sur_a <- vcov(sur_a)
v11 <- vcov_sur_a["wage_QTR1", "wage_QTR1"]
v22 <- vcov_sur_a["educ_QTR1", "educ_QTR1"]
v12 <- vcov_sur_a["wage_QTR1", "educ_QTR1"]

# Delta-method standard error of b1 / b2, the formula behind Stata's nlcom:
# Var(b1/b2) = Var(b1)/b2^2 + b1^2*Var(b2)/b2^4 - 2*b1*Cov(b1,b2)/b2^3
wald_a_se <- sqrt(
  v11 / b2^2 +
  b1^2 * v22 / b2^4 -
  2 * b1 * v12 / b2^3
)
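One possible explanation for the gap, offered as an assumption rather than a confirmed diagnosis: the covariance between the two QTR1 coefficients is reported as zero in the R output, whereas a SUR fit estimates a nonzero cross-equation covariance. The base-R sketch below uses simulated data (every name and number is hypothetical) to show how the delta-method standard error changes once that covariance is included. It relies on the fact that when both equations share the same regressors, the SUR coefficient covariance is the residual covariance times \((X'X)^{-1}\). The last line plugs in the values quoted in the code comments above with zero covariance.

```r
set.seed(42)
n    <- 20000
qtr1 <- rbinom(n, 1, 0.25)
educ <- 12 - 1.0 * qtr1 + rnorm(n, sd = 3)        # simulated first stage
wage <- 5 + 0.08 * educ + rnorm(n, sd = 0.6)      # simulated outcome

fit_w <- lm(wage ~ qtr1)
fit_e <- lm(educ ~ qtr1)
b1  <- coef(fit_w)[[2]]                           # qtr1 in the wage equation
b2  <- coef(fit_e)[[2]]                           # qtr1 in the education equation
v11 <- vcov(fit_w)[2, 2]
v22 <- vcov(fit_e)[2, 2]

# Cross-equation covariance: residual covariance times (X'X)^{-1}[2, 2]
# (the n - k divisor is an assumption; Stata may use n, a negligible gap here)
X   <- model.matrix(fit_w)
s12 <- sum(resid(fit_w) * resid(fit_e)) / (n - ncol(X))
v12 <- s12 * solve(crossprod(X))[2, 2]

# Delta-method standard error of the ratio b1 / b2
delta_se <- function(b1, b2, v11, v22, v12) {
  sqrt(v11 / b2^2 + b1^2 * v22 / b2^4 - 2 * b1 * v12 / b2^3)
}

se_no_cov   <- delta_se(b1, b2, v11, v22, 0)      # covariance ignored
se_with_cov <- delta_se(b1, b2, v11, v22, v12)    # covariance included

# Values quoted in the code comments above, with Cov(b1, b2) = 0
se_reported <- delta_se(-0.008978875, -0.125555298,
                        9.070617e-06, 2.414637e-04, 0)
```

In this simulation the covariance term lowers the standard error, which is the same direction as the gap between the R and Stata results. `se_reported` recovers the 0.02556816 quoted in the Table III footnote.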

Table IV

Table IV: OLS and TSLS Estimates of the Return to Education for Men Born 1920-1929: 1970 Census [1][2]

| Independent Variables | (1) OLS | (2) TSLS | (3) OLS | (4) TSLS | (5) OLS | (6) TSLS | (7) OLS | (8) TSLS |
|---|---|---|---|---|---|---|---|---|
| Years of education | 0.0802*** (0.0004) | 0.0769*** (0.0150) | 0.0802*** (0.0004) | 0.1310*** (0.0334) | 0.0701*** (0.0004) | 0.0669*** (0.0151) | 0.0701*** (0.0004) | 0.1007** (0.0334) |
| Race (1 = black) | | | | | -0.2980*** (0.0043) | -0.3055*** (0.0353) | -0.2980*** (0.0043) | -0.2271** (0.0776) |
| SMSA (1 = center city) | | | | | -0.1343*** (0.0026) | -0.1362*** (0.0092) | -0.1343*** (0.0026) | -0.1163*** (0.0198) |
| Married (1 = married) | | | | | 0.2928*** (0.0037) | 0.2941*** (0.0072) | 0.2928*** (0.0037) | 0.2804*** (0.0141) |
| Age | | | 0.1446* (0.0676) | 0.1409* (0.0704) | | | 0.1162 (0.0652) | 0.1170 (0.0662) |
| Age-squared | | | -0.0015* (0.0007) | -0.0014 (0.0008) | | | -0.0013 (0.0007) | -0.0012 (0.0007) |

[1] The Yes/No rows indicating which sets of dummy variables are included are not shown in this table.
[2] \(^{*}p<0.05; ^{**}p<0.01; ^{***}p<0.001\)

Table V

Table V: OLS and TSLS Estimates of the Return to Education for Men Born 1930-1939: 1980 Census [1][2][3][4]

| Independent Variables | (1) OLS | (2) TSLS | (3) OLS | (4) TSLS | (5) OLS | (6) TSLS | (7) OLS | (8) TSLS |
|---|---|---|---|---|---|---|---|---|
| Years of education | 0.0711*** (0.0003) | 0.0891*** (0.0161) | 0.0711*** (0.0003) | 0.0760** (0.0290) | 0.0632*** (0.0003) | 0.0806*** (0.0164) | 0.0632*** (0.0003) | 0.0600* (0.0290) |
| Race (1 = black) | | | | | -0.2575*** (0.0040) | -0.2302*** (0.0261) | -0.2575*** (0.0040) | -0.2626*** (0.0458) |
| SMSA (1 = center city) | | | | | -0.1763*** (0.0029) | -0.1581*** (0.0174) | -0.1763*** (0.0029) | -0.1797*** (0.0305) |
| Married (1 = married) | | | | | 0.2479*** (0.0032) | 0.2440*** (0.0049) | 0.2479*** (0.0032) | 0.2486*** (0.0073) |
| Age | | | -0.0772 (0.0621) | -0.0801 (0.0645) | | | -0.0760 (0.0604) | -0.0741 (0.0626) |
| Age-squared | | | 0.0008 (0.0007) | 0.0008 (0.0007) | | | 0.0008 (0.0007) | 0.0007 (0.0007) |

[1] The Yes/No rows indicating which sets of dummy variables are included are not shown in this table.
[2] \(^{*}p<0.05; ^{**}p<0.01; ^{***}p<0.001\)
[3] The dependent variable is the log of weekly wage. The sample includes men born 1930-1939 in the 1980 Census.
[4] The excluded instrumental variables are not shown in the table and are the same for all TSLS models: the 27 interaction terms between quarter of birth and year of birth \(QTR_j \times YR_i\), \(i \in \{0, 1, \dots, 9\}\) and \(j \in \{1, 2, 3\}\).
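To make the OLS and TSLS columns concrete, here is a minimal base-R sketch of two-stage least squares with quarter-by-year instruments. It runs on simulated data and is not the replication code: all variable names, effect sizes, and the use of matrix formulas are assumptions, and the instrument is made deliberately strong so the toy estimate is stable. The design matrix `Z` spans the year-of-birth dummies plus every quarter-by-year cell.

```r
set.seed(123)
n       <- 20000
yob     <- sample(1930:1939, n, replace = TRUE)
qob     <- sample(1:4, n, replace = TRUE)
ability <- rnorm(n)                                  # unobserved confounder
educ    <- 12 - c(1.5, 1.0, 0.5, 0)[qob] + ability + rnorm(n, sd = 2)
lwage   <- 5 + 0.08 * educ + 0.3 * ability + rnorm(n, sd = 0.5)

yr  <- factor(yob)
qtr <- factor(qob)
X <- cbind(model.matrix(~ yr), educ = educ)          # controls + schooling (last column)
Z <- model.matrix(~ yr * qtr)                        # controls + quarter-by-year instruments
k <- ncol(X)

# First stage: project every column of X on Z
X_hat <- Z %*% solve(crossprod(Z), crossprod(Z, X))

# Second stage: regress lwage on the projected regressors
beta <- solve(crossprod(X_hat), crossprod(X_hat, lwage))

# Standard errors use residuals built from the original X, not X_hat
u      <- lwage - X %*% beta
sigma2 <- sum(u^2) / (n - k)
se     <- sqrt(diag(sigma2 * solve(crossprod(X_hat))))

tsls_educ    <- beta[k]                              # TSLS return to schooling
tsls_educ_se <- se[k]
ols_educ     <- coef(lm(lwage ~ yr + educ))[["educ"]]
```

In this simulation the true return is 0.08. OLS is biased upward by the omitted ability term, while the TSLS estimate should land near 0.08 with a much larger standard error, which is the same qualitative pattern as the OLS and TSLS columns in the tables.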

Table VI

Table VI: OLS and TSLS Estimates of the Return to Education for Men Born 1940-1949: 1980 Census [1][2][3][4]

| Independent Variables | (1) OLS | (2) TSLS | (3) OLS | (4) TSLS | (5) OLS | (6) TSLS | (7) OLS | (8) TSLS |
|---|---|---|---|---|---|---|---|---|
| Years of education | 0.0573*** (0.0003) | 0.0553*** (0.0138) | 0.0573*** (0.0003) | 0.0948*** (0.0223) | 0.0520*** (0.0003) | 0.0393** (0.0145) | 0.0521*** (0.0003) | 0.0779** (0.0239) |
| Race (1 = black) | | | | | -0.2107*** (0.0032) | -0.2266*** (0.0183) | -0.2108*** (0.0032) | -0.1786*** (0.0299) |
| SMSA (1 = center city) | | | | | -0.1418*** (0.0023) | -0.1535*** (0.0135) | -0.1419*** (0.0023) | -0.1182*** (0.0220) |
| Married (1 = married) | | | | | 0.2445*** (0.0022) | 0.2442*** (0.0022) | 0.2444*** (0.0022) | 0.2450*** (0.0023) |
| Age | | | 0.1800*** (0.0389) | 0.1325** (0.0486) | | | 0.1518*** (0.0379) | 0.1215* (0.0474) |
| Age-squared | | | -0.0023*** (0.0006) | -0.0016* (0.0007) | | | -0.0019*** (0.0005) | -0.0015* (0.0007) |

[1] The Yes/No rows indicating which sets of dummy variables are included are not shown in this table.
[2] \(^{*}p<0.05; ^{**}p<0.01; ^{***}p<0.001\)
[3] The dependent variable is the log of weekly wage. The sample includes men born 1940-1949 in the 1980 Census.
[4] The excluded instrumental variables are not shown in the table and are the same for all TSLS models: the 27 interaction terms between quarter of birth and year of birth \(QTR_j \times YR_i\), \(i \in \{0, 1, \dots, 9\}\) and \(j \in \{1, 2, 3\}\).

Table VII

Table VII: OLS and TSLS Estimates of the Return to Education for Men Born 1930-1939: 1980 Census [1][2][3][4]

| Independent Variables | (1) OLS | (2) TSLS | (3) OLS | (4) TSLS | (5) OLS | (6) TSLS | (7) OLS | (8) TSLS |
|---|---|---|---|---|---|---|---|---|
| Years of education | 0.0673*** (0.0003) | 0.0928*** (0.0093) | 0.0673*** (0.0003) | 0.0907*** (0.0107) | 0.0628*** (0.0003) | 0.0831*** (0.0095) | 0.0628*** (0.0003) | 0.0811*** (0.0109) |
| Race (1 = black) | | | | | -0.2547*** (0.0043) | -0.2333*** (0.0109) | -0.2547*** (0.0043) | -0.2354*** (0.0122) |
| SMSA (1 = center city) | | | | | -0.1705*** (0.0029) | -0.1511*** (0.0095) | -0.1705*** (0.0029) | -0.1531*** (0.0107) |
| Married (1 = married) | | | | | 0.2487*** (0.0032) | 0.2435*** (0.0040) | 0.2487*** (0.0032) | 0.2441*** (0.0042) |
| Age | | | -0.0757 (0.0617) | -0.0880 (0.0624) | | | -0.0778 (0.0603) | -0.0876 (0.0609) |
| Age-squared | | | 0.0008 (0.0007) | 0.0009 (0.0007) | | | 0.0008 (0.0007) | 0.0009 (0.0007) |

[1] The Yes/No rows indicating which sets of dummy variables are included are not shown in this table.
[2] \(^{*}p<0.05; ^{**}p<0.01; ^{***}p<0.001\)
[3] The dependent variable is the log of weekly wage. The sample includes men born 1930-1939 in the 1980 Census.
[4] The excluded instrumental variables for all TSLS models are not shown in the table and include quarter of birth-year of birth (27 interactions) and quarter of birth-state of birth (150 interactions) dummies.

Table VIII

Table VIII: OLS and TSLS Estimates of the Return to Education for Black Men Born 1930-1939: 1980 Census [1][2][3][4]

| Independent Variables | (1) OLS | (2) TSLS | (3) OLS | (4) TSLS | (5) OLS | (6) TSLS | (7) OLS | (8) TSLS |
|---|---|---|---|---|---|---|---|---|
| Years of education | 0.0672*** (0.0013) | 0.0635*** (0.0185) | 0.0671*** (0.0013) | 0.0555** (0.0199) | 0.0576*** (0.0013) | 0.0461* (0.0187) | 0.0576*** (0.0013) | 0.0391 (0.0199) |
| SMSA (1 = center city) | | | | | -0.1885*** (0.0142) | -0.2053*** (0.0308) | -0.1884*** (0.0142) | -0.2155*** (0.0324) |
| Married (1 = married) | | | | | 0.2216*** (0.0100) | 0.2272*** (0.0136) | 0.2216*** (0.0100) | 0.2307*** (0.0140) |
| Age | | | -0.3099 (0.2538) | -0.3274 (0.2560) | | | -0.2978 (0.2473) | -0.3237 (0.2497) |
| Age-squared | | | 0.0033 (0.0028) | 0.0035 (0.0028) | | | 0.0032 (0.0027) | 0.0035 (0.0028) |

[1] The Yes/No rows indicating which sets of dummy variables are included are not shown in this table.
[2] \(^{*}p<0.05; ^{**}p<0.01; ^{***}p<0.001\)
[3] The dependent variable is the log of weekly wage. The sample includes black men born 1930-1939 in the 1980 Census.
[4] The excluded instrumental variables for all TSLS models are not shown in the table and include quarter of birth-year of birth (27 interactions) and quarter of birth-state of birth (150 interactions) dummies.

Notes 1: I found that the sample data set contains only 50 states, while the paper reports 51. I think this is because filtering the data set to keep only black men excludes one of the states. So you should be careful about which set of state dummies you use.

Notes 2: I found that the Table VIII produced by Win Supanwanid's Stata scripts is not consistent with the paper (AK1991). I checked the R code and obtained the same results as the paper. I think the problem in Win Supanwanid's Table VIII comes from the state dummies, as mentioned in Notes 1.
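One way to check the state-dummy set after filtering is sketched below on toy data (the column names and values are hypothetical). In R, subsetting a data frame keeps the unused levels of a factor, so the number of levels and the number of states actually observed can disagree, and `model.matrix()` then creates an all-zero dummy column. Whether this is what happened in either replication is an assumption, not something verified here.

```r
# Toy data: three states, but no black men observed in state "C"
df <- data.frame(
  state = factor(c("A", "B", "C", "A", "B")),
  black = c(1, 1, 0, 1, 0)
)
black_df <- df[df$black == 1, ]

# Levels survive the filter even though state "C" no longer appears
levels_before   <- nlevels(black_df$state)
states_observed <- length(unique(black_df$state))

# The design matrix still has a column for "C", and it is all zeros
mm_before <- model.matrix(~ state, data = black_df)

# Dropping unused levels makes the dummy set match the filtered sample
black_df$state <- droplevels(black_df$state)
levels_after <- nlevels(black_df$state)
mm_after <- model.matrix(~ state, data = black_df)
```

Comparing `levels_before`, `states_observed`, and `ncol(mm_before)` against `ncol(mm_after)` shows whether an empty state dummy is present in the filtered sample.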

References

Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” The Quarterly Journal of Economics 106 (4): 979–1014. https://doi.org/10.2307/2937954.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2015. Mastering ’Metrics: The Path from Cause to Effect. Princeton, Oxford: Princeton University Press.
Best, Henning, and Christof Wolf, eds. 2015. The SAGE Handbook of Regression Analysis and Causal Inference. Los Angeles: Sage.
Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press.
Muller, Christopher, Christopher Winship, and Stephen Morgan. 2014. “Instrumental Variables Regression.” In The SAGE Handbook of Regression Analysis and Causal Inference. SAGE Publications Ltd.