Replication with R
Using Geographic Variation in College Proximity to Estimate the Return to Schooling
Introduction
This page replicates the empirical tables and Figure 1 in Card (1993) using the author’s proximity extract (nls.dat, \(N = 3613\) for the 1976 cross-section). Walkthrough: replication guide. The analysis follows the modular R workflow used in Bound et al. (1995) and AK91.
Data: Rcode/proximity/nls.dat with column layout in code_bk.txt. SAS reference output: read1.lst (ED76 = 0.0747, \(N = 3010\); this equals the replicated Table 2 column (2) estimate).
Sample: Wage equations use men with valid 1976 log wages (\(N = 3010\)). Table 1 column (1) (\(N = 5225\) full NLS-YM cohort) is not in the distributed data file; columns (2)–(3) are replicated.
| File | Role |
|---|---|
| card1993-data-prep.R | read_fwf, variable construction, control vectors |
| card1993-gt-quarto.R | gt table formatting |
| Table_I.R – Table_V.R | Tables 1–5 |
| Figure_1.R | Figure 1: schooling by predicted-education quartile |
Set rerun_script <- TRUE in the setup chunk to re-run all scripts and refresh .RData files.
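The toggle can be read as a cache gate. The following is a minimal sketch under stated assumptions, not the page's actual setup chunk: `run_or_load()` and the demo file names are hypothetical, and each script is assumed to end by saving its results to an .RData file.

```r
# Hypothetical helper (not from the original scripts): re-run a script only
# when asked to, or when its cached .RData file does not exist yet.
run_or_load <- function(script, cache, rerun = FALSE, envir = parent.frame()) {
  force(envir)
  if (rerun || !file.exists(cache)) {
    source(script, local = new.env())  # the script is assumed to save() to `cache`
  }
  load(cache, envir = envir)           # returns the names of the loaded objects
}

# Self-contained demo with a throwaway script in a temporary directory.
demo_dir    <- tempdir()
demo_cache  <- file.path(demo_dir, "Table_demo.RData")
demo_script <- file.path(demo_dir, "Table_demo.R")
writeLines(c(
  "replication <- data.frame(col = 1:2, ed = c(0.074, 0.075))",
  sprintf("save(replication, file = %s)", deparse(demo_cache))
), demo_script)

rerun_script <- TRUE
loaded <- run_or_load(demo_script, demo_cache, rerun = rerun_script)
print(loaded)
```

With `rerun_script <- FALSE` the helper would skip `source()` whenever the cache file already exists and only call `load()`.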
Figure 1 — College proximity and predicted schooling
Mean completed education by quartile of predicted schooling (fit on the no-nearby-college subsample) and 1966 college proximity.
load(here("04-topics/rep-card1993/Rcode/Figure_1.RData"))
print(fig_summary)
# A tibble: 8 × 5
pred_quartile nearc4 mean_ed n nearc4_lab
<int> <int> <dbl> <int> <chr>
1 1 0 10.8 358 No nearby college
2 1 1 11.7 546 Near 4-year college
3 2 0 12.5 312 No nearby college
4 2 1 12.9 591 Near 4-year college
5 3 0 13.4 254 No nearby college
6 3 1 13.9 649 Near 4-year college
7 4 0 14.9 239 No nearby college
8 4 1 15.1 664 Near 4-year college
print(paper_note)
# A tibble: 2 × 2
metric value
<chr> <dbl>
1 R2 prediction (no college subsample) 0.316
2 Lowest quartile gap (near - no) 0.935
Note: The prediction equation is fit on men who did not grow up near a 4-year college (nearc4 = 0); predicted schooling quartiles are then formed for the full sample. Replication \(R^2 \approx 0.32\) (paper fn. 16: 0.30). Lowest-quartile gap in mean education (near vs. no college) is about 0.9–1.0 years here (paper: \(\approx 1.1\) years).
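The construction described in the note can be sketched in base R. This is an illustration on simulated data, not the code in Figure_1.R: the covariates `momed` and `daded`, the data-generating process, and all object names other than `nearc4` and `ed76` are assumptions.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the extract (all parameter values are made up).
d <- data.frame(
  nearc4 = rbinom(n, 1, 0.68),
  momed  = rnorm(n, 10, 3),
  daded  = rnorm(n, 10, 3)
)
d$ed76 <- 8 + 0.2 * d$momed + 0.2 * d$daded + 0.4 * d$nearc4 + rnorm(n, 0, 2)

# Step 1: fit the prediction equation on the no-nearby-college subsample only.
fit0 <- lm(ed76 ~ momed + daded, data = subset(d, nearc4 == 0))

# Step 2: predict schooling for everyone and cut the predictions into quartiles.
d$pred_ed <- predict(fit0, newdata = d)
cuts <- quantile(d$pred_ed, probs = seq(0, 1, 0.25))
d$pred_quartile <- cut(d$pred_ed, breaks = cuts, include.lowest = TRUE, labels = FALSE)

# Step 3: mean completed education by predicted quartile and proximity.
fig_sketch <- aggregate(ed76 ~ pred_quartile + nearc4, data = d, FUN = mean)
fig_sketch <- fig_sketch[order(fig_sketch$pred_quartile, fig_sketch$nearc4), ]
print(fig_sketch)

# R-squared of the prediction equation on the subsample it was fit on.
print(summary(fit0)$r.squared)
```

Fitting on the `nearc4 == 0` group keeps proximity itself out of the prediction, so the quartiles rank men by background alone.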
Table 1 — Sample characteristics
| | (1) Overall NLS-YM¹ | (2) 1976 interview; valid education | (3) Valid wage & education |
|---|---|---|---|
| 1. Age Distribution in 1966: | | | |
| Age 14-15 (%) | – | 25.3 | 25.5 |
| Age 16-17 | – | 23.8 | 24.1 |
| Age 18-20 | – | 24.1 | 24.7 |
| Age 21-24 | – | 26.7 | 25.8 |
| 2. Regional Distribution in 1966:² | | | |
| Northeast (%) | – | 20.0 | 20.7 |
| Midwest | – | 26.3 | 26.0 |
| South | – | 41.3 | 41.4 |
| West | – | 12.5 | 11.9 |
| 3. Lived in SMSA 1966 (%) | – | 64.3 | 65.0 |
| 4. Lived Near 4-year College in 1966 (%) | – | 67.8 | 68.2 |
| 5. Family structure at Age 14: | | | |
| Mother & Father (%) | – | 79.2 | 78.9 |
| Mother Only (%) | – | 10.0 | 10.1 |
| 6. Average Parental Education | | | |
| Mother's Education (yrs) | – | 10.3 | 10.3 |
| Father's Education (yrs) | – | 10.0 | 10.0 |
| 7. Percent Black | – | 23.0 | 23.4 |
| 8. Average score on KWW Test | – | 33.5 | 33.5 |
| 9. Interviewed in 1976 (%) | – | 100.0 | 100.0 |
| 10. Mean Education in 1976 | – | 13.2 | 13.3 |
| 11. Live in south in 1976 (%) | – | 40.0 | 40.4 |
| 12. Sample size | 5225 | 3613 | 3010 |

Notes: Means are based on all available valid observations in each subsample (Card, Table 1 notes).

¹ Column (1) shows published values for the original NLS-YM cohort (N = 5225). The proximity extract (nls.dat) contains only the 1976 cross-section; column (1) cannot be fully replicated from the distributed data file.

² Northeast = reg661 + reg662; Midwest = reg663 + reg664; West = reg668 + reg669; South = south66.
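The regional grouping in the second footnote amounts to summing region dummies before taking means. A small sketch on toy data follows; the toy sample, and the definition of `south66` as reg665 + reg666 + reg667 used to build it, are assumptions made only for this illustration.

```r
set.seed(7)

# Toy data: each respondent falls in exactly one of nine 1966 regions.
region <- sample(1:9, 500, replace = TRUE)
d <- as.data.frame(sapply(1:9, function(k) as.integer(region == k)))
names(d) <- paste0("reg66", 1:9)

# Assumption for the toy data only: the South groups regions 5 to 7.
d$south66 <- d$reg665 + d$reg666 + d$reg667

# Percent of the sample in each broad region, as in panel 2 of the table.
shares <- 100 * c(
  Northeast = mean(d$reg661 + d$reg662),
  Midwest   = mean(d$reg663 + d$reg664),
  South     = mean(d$south66),
  West      = mean(d$reg668 + d$reg669)
)
print(round(shares, 1))
```

For variables with missing values, `mean(x, na.rm = TRUE)` would give the "all available valid observations" means mentioned in the notes.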
load(here("04-topics/rep-card1993/Rcode/Table_I.RData"))
print(replication)
# A tibble: 1 × 9
col2_n col3_n age_14_15_c2 age_14_15_c3 nearc4_c2 nearc4_c3 black_c2
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3613 3010 25.3 25.5 67.8 68.2 23.0
# ℹ 2 more variables: mean_ed_c2 <dbl>, mean_ed_c3 <dbl>
Table 2 — OLS log wage equations
| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| 1. Education | 0.074 (0.004) | 0.075 (0.003) | 0.073 (0.004) | 0.074 (0.004) | 0.073 (0.004) |
| 2. Experience | 0.084 (0.007) | 0.085 (0.007) | 0.085 (0.007) | 0.085 (0.007) | 0.085 (0.007) |
| 3. Experience-squared /100 | -0.224 (0.032) | -0.229 (0.032) | -0.230 (0.032) | -0.226 (0.032) | -0.229 (0.032) |
| 4. Black Indicator | -0.190 (0.018) | -0.199 (0.018) | -0.194 (0.019) | -0.194 (0.019) | -0.189 (0.019) |
| 5. Live in South | -0.125 (0.015) | -0.148 (0.026) | -0.146 (0.026) | -0.145 (0.026) | -0.146 (0.026) |
| 6. Live in SMSA | 0.161 (0.016) | 0.136 (0.020) | 0.136 (0.020) | 0.137 (0.020) | 0.138 (0.020) |
| 7. Region in 1966 (8 indicators) | no | yes | yes | yes | yes |
| 8. Live in SMSA in 1966 | no | yes | yes | yes | yes |
| 9. Parental education (years + missing indicators)¹ | no | no | yes | yes | yes |
| 10. Interacted parental education classes² | no | no | no | yes | yes |
| 11. Family structure (2 indicators)³ | no | no | no | no | yes |
| 12. R-squared | 0.291 | 0.300 | 0.301 | 0.303 | 0.304 |
| 13. F-test on family background variables (p-value) | – | – | 0.235 | 0.619 | 0.028 |

Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable: log hourly wages in 1976 (mean 6.262, SD 0.444). Experience is age76 − ed76 − 6; experience-squared enters as \(\text{experience}^2/100\) (SAS uses raw squared experience, so its coefficients on that term differ by a factor of 100).

Replication: The SAS read1.lst MODEL3 education coefficient (ED76 = 0.0747) equals the replicated column (2) estimate before rounding; the replicated column (1) estimate is 0.0740. Row 13 incremental F-test p-values for columns (4)–(5) differ from the paper (0.619 vs. 0.462; 0.028 vs. 0.165).

¹ Years of mother’s and father’s education plus indicators for imputed values (nodaded, nomomed).

² Eight interacted parental-education classes (f1–f8 from famed).

³ Lived with both parents (momdad14); lived with single mother (sinmom14).
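The scaling remark in the notes can be checked directly: dividing the squared term by 100 multiplies its coefficient by 100 and leaves every other coefficient unchanged. The sketch below uses simulated data; only the variable names `age76` and `ed76` and the experience formula come from the notes, and the wage equation is made up.

```r
set.seed(3)
n <- 1000

age76 <- sample(24:34, n, replace = TRUE)
ed76  <- sample(8:18, n, replace = TRUE)
exper <- age76 - ed76 - 6                  # potential experience, as in the notes

# Made-up wage equation, used only to have something to regress.
lwage <- 5 + 0.07 * ed76 + 0.08 * exper - 0.002 * exper^2 + rnorm(n, 0, 0.4)

fit_raw    <- lm(lwage ~ ed76 + exper + I(exper^2))        # raw squared term
fit_scaled <- lm(lwage ~ ed76 + exper + I(exper^2 / 100))  # squared term / 100

# The squared-term coefficients differ by exactly a factor of 100 ...
ratio <- unname(coef(fit_scaled)[4] / coef(fit_raw)[4])
print(ratio)

# ... while the education coefficient is identical in both fits.
print(c(raw = unname(coef(fit_raw)["ed76"]), scaled = unname(coef(fit_scaled)["ed76"])))
```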
load(here("04-topics/rep-card1993/Rcode/Table_II.RData"))
print(replication)
# A tibble: 5 × 5
col ed se r2 f
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.0740 0.00351 0.291 NA
2 2 0.0747 0.00350 0.300 NA
3 3 0.0733 0.00365 0.301 0.235
4 4 0.0737 0.00368 0.303 0.619
5 5 0.0726 0.00370 0.304 0.0282
print(paper_targets)
# A tibble: 5 × 5
col ed se r2 f
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.074 0.004 0.291 NA
2 2 0.075 0.003 0.3 NA
3 3 0.073 0.004 0.301 0.235
4 4 0.074 0.004 0.303 0.462
5 5 0.073 0.004 0.304 0.165
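The `f` column holds p-values from incremental F-tests on the family-background block. One way to obtain such a p-value is to compare nested models with `anova()`, sketched here on simulated data; this is not necessarily how Table_II.R computes it, and the data-generating process is invented. Only the variable names `momdad14` and `sinmom14` come from the table notes.

```r
set.seed(11)
n <- 1500

# Simulated family structure at age 14: the two indicators are mutually exclusive.
fam <- sample(c("both", "mother", "other"), n, replace = TRUE,
              prob = c(0.79, 0.10, 0.11))
d <- data.frame(
  ed76     = sample(8:18, n, replace = TRUE),
  exper    = sample(0:20, n, replace = TRUE),
  momdad14 = as.integer(fam == "both"),
  sinmom14 = as.integer(fam == "mother")
)
d$lwage <- 5.5 + 0.07 * d$ed76 + 0.08 * d$exper - 0.2 * d$exper^2 / 100 +
  0.05 * d$momdad14 + rnorm(n, 0, 0.4)

# Restricted model, then the same model plus the block being tested.
restricted <- lm(lwage ~ ed76 + exper + I(exper^2 / 100), data = d)
full       <- update(restricted, . ~ . + momdad14 + sinmom14)

# anova() on nested lm fits reports the incremental F statistic and its p-value.
ftest   <- anova(restricted, full)
p_value <- ftest[["Pr(>F)"]][2]
print(ftest)
```

Which variables sit in the restricted model determines what the test means, so a different choice of baseline is one possible source of p-values that differ from the published ones.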
Table 3 — Reduced form and IV
| | Reduced form: Education (1) | Reduced form: Education (2) | Reduced form: Earnings (3) | Reduced form: Earnings (4) | Structural Models of Earnings (5) | Structural Models of Earnings (6) |
|---|---|---|---|---|---|---|
| Panel A: Treat experience and experience squared as exogenous | | | | | | |
| Live Near College in 1966 | 0.320 (0.088) | 0.322 (0.083) | 0.042 (0.018) | 0.045 (0.018) | – | – |
| Education | – | – | – | – | 0.132 (0.055) | 0.140 (0.055) |
| Family Background Variables¹ | no | yes | no | yes | no | yes |
| Panel B: Treat experience and experience squared as endogenous² | | | | | | |
| Live Near College in 1966 | 0.382 (0.114) | 0.365 (0.105) | 0.047 (0.019) | 0.048 (0.019) | – | – |
| Education | – | – | – | – | 0.122 (0.046) | 0.132 (0.049) |
| Family Background Variables¹ | no | yes | no | yes | no | yes |

Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable in columns (1)–(2): completed education in 1976 (mean 13.263, SD 2.677). Dependent variable in columns (3)–(6): log hourly wages in 1976 (mean 6.262, SD 0.444). All models include black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared.

Replication: Panel A IV columns (5)–(6) are Wald ratios of the reduced-form coefficients: column (3) divided by column (1), and column (4) divided by column (2). Panel B column (6) with full family background is the baseline IV specification for Table 4 row 1.

¹ Fourteen family-background controls: parental education (years and missing indicators), eight famed interaction classes, and two family-structure indicators (card93_fb_full).

² Experience and experience-squared are endogenous; instruments are age76 and age2 (= \(\text{age76}^2/100\)). Reduced-form equations use ivreg with experience instrumented by age.
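The Wald-ratio relationship can be checked by hand from the rounded coefficients in this table's replication output: dividing the earnings reduced form by the education reduced form should recover the IV education coefficient up to rounding. The data frame below simply retypes those printed values; its name `rf` is arbitrary.

```r
# Rounded values copied from the replication output for Table 3.
rf <- data.frame(
  panel      = c("A", "A", "B", "B"),
  col        = c(1, 2, 1, 2),
  nearc_ed   = c(0.320, 0.322, 0.382, 0.365),      # proximity effect on education
  nearc_wage = c(0.0421, 0.0453, 0.0467, 0.0484),  # proximity effect on log wages
  iv_ed      = c(0.132, 0.140, 0.122, 0.132)       # reported IV return to education
)

# Wald ratio: reduced-form wage effect divided by reduced-form schooling effect.
rf$wald <- rf$nearc_wage / rf$nearc_ed
rf$diff <- rf$wald - rf$iv_ed
print(rf)
```

The remaining differences are below 0.001 in every row, which is what rounding of the inputs alone would produce.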
load(here("04-topics/rep-card1993/Rcode/Table_III.RData"))
print(replication)
# A tibble: 4 × 5
panel col nearc_ed nearc_wage iv_ed
<chr> <int> <dbl> <dbl> <dbl>
1 A 1 0.320 0.0421 0.132
2 A 2 0.322 0.0453 0.140
3 B 1 0.382 0.0467 0.122
4 B 2 0.365 0.0484 0.132
print(paper_targets)
# A tibble: 10 × 4
panel col metric value
<chr> <int> <chr> <dbl>
1 A 1 nearc_ed 0.32
2 A 2 nearc_ed 0.322
3 A 1 nearc_wage 0.042
4 A 2 nearc_wage 0.045
5 A 5 iv_ed 0.132
6 A 6 iv_ed 0.14
7 B 1 nearc_ed 0.382
8 B 2 nearc_ed 0.365
9 B 5 iv_ed 0.122
10 B 6 iv_ed 0.132
Table 4 — Robustness
| | OLS Estimate | IV Estimate¹ |
|---|---|---|
| 1. Basic Specification (N = 3010) | 0.073 (0.004) | 0.132 (0.049) |
| 2. Use 1978 Wages and Education (N = 2639 with 1978 data) | 0.069 (0.004) | 0.122 (0.062) |
| 3. Include KWW Test Score (N = 2963 with valid KWW) | 0.055 (0.004) | 0.136 (0.078) |
| 4. Include KWW; instrument KWW with IQ (N = 2040 with valid KWW and IQ)² | 0.061 (0.005) | 0.085 (0.013) |
| 5. Use Proximity to Public College as instrument for education³ | as in row 1 | 0.194 (0.059) |
| 6. Use Proximities to 2-year and 4-year colleges as instruments for education⁴ | as in row 1 | 0.117 (0.047) |
| 7. Use Subsample Age 14-19 in 1966 (N = 2037)⁵ | 0.074 (0.006) | 0.094 (0.064) |

Notes: Dependent variable in rows 1 and 3–7: log hourly wages in 1976. Row 2: log hourly wages in 1978. Estimates are coefficients on the linear education term with black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared, and fourteen family-background controls unless noted.

Replication: Row 2 uses all men with valid lwage78 (N = 2639), not restricted to the N = 3010 wage subsample. The row 4 IV standard error (0.013) is not reproduced accurately in this replication (paper target: 0.085). Rows 5–6 OLS match row 1; only IV instruments differ.

¹ Education and experience are endogenous; instruments are nearc4 (or alternatives noted below), age76, and age2, plus all exogenous controls. Row 1 matches Panel B, column (6) of Table 3.

² KWW enters the wage equation and is instrumented by IQ (iq); subsample with non-missing KWW and IQ.

³ Instrument for schooling is proximity to a public 4-year college (nearc4a).

⁴ Instruments are nearc4 (4-year) and nearc2 (2-year college proximity).

⁵ Subsample with age66 \(\leq\) 19 in 1966.
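The baseline IV setup (education, experience and experience-squared endogenous; nearc4, age76 and age2 as instruments) can be sketched with plain matrix algebra. This is an illustration on simulated data with no further controls, not the ivreg call used in Table_IV.R; the data-generating process and the object names `Z`, `X`, `Xhat`, `b_2sls` and `b_twostep` are assumptions.

```r
set.seed(42)
n <- 5000

# Simulated data: unobserved ability raises both schooling and wages.
nearc4 <- rbinom(n, 1, 0.68)
age76  <- sample(24:34, n, replace = TRUE)
abil   <- rnorm(n)
ed76   <- round(12 + 0.5 * nearc4 + abil + rnorm(n, 0, 2))
exper  <- age76 - ed76 - 6
exper2 <- exper^2 / 100
lwage  <- 5 + 0.10 * ed76 + 0.08 * exper - 0.20 * exper2 + 0.1 * abil +
  rnorm(n, 0, 0.4)

# Instruments Z and regressors X (three endogenous regressors, three instruments).
Z <- cbind(const = 1, nearc4, age76, age2 = age76^2 / 100)
X <- cbind(const = 1, ed76, exper, exper2)

# First stage: project every regressor on the instruments.
Xhat <- qr.fitted(qr(Z), X)

# 2SLS coefficients: solve (Xhat'X) b = Xhat'y.
b_2sls <- solve(crossprod(Xhat, X), crossprod(Xhat, lwage))
print(round(b_2sls, 3))

# Cross-check: OLS of the outcome on the first-stage fitted values gives the
# same point estimates (its standard errors would not be valid, though).
b_twostep <- coef(lm(lwage ~ Xhat - 1))
```

Because `exper` is built from `ed76`, treating schooling as endogenous makes experience endogenous too, which is why age and age-squared enter as instruments.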
load(here("04-topics/rep-card1993/Rcode/Table_IV.RData"))
print(replication)
# A tibble: 7 × 6
row n ols ols_se iv iv_se
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 3010 0.0726 0.00370 0.132 0.0493
2 2 2639 0.0691 0.00406 0.122 0.0616
3 3 2963 0.0554 0.00440 0.136 0.0777
4 4 2040 0.0613 0.00549 0.0848 0.0126
5 5 3010 0.0726 0.00370 0.194 0.0591
6 6 3010 0.0726 0.00370 0.117 0.0472
7 7 2037 0.0745 0.00613 0.0944 0.0645
print(paper_targets)
# A tibble: 7 × 5
row ols ols_se iv iv_se
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.073 0.006 0.132 0.049
2 2 0.066 0.006 0.117 0.061
3 3 0.055 0.004 0.136 0.078
4 4 0.061 0.005 0.089 0.085
5 5 0.073 0.006 0.194 0.059
6 6 0.073 0.006 0.117 0.047
7 7 0.076 0.006 0.094 0.064
Table 5 — Interaction IV
| | Reduced Form Models: Education (1) | Reduced Form Models: Earnings (2) | Structural Models of Earnings (3) | Structural Models of Earnings (4) |
|---|---|---|---|---|
| Live Near College in 1966 | 0.228 (0.092) | 0.031 (0.020) | 0.001 (0.030) | 0.013 (0.024) |
| Live Near College × Low Parental Education¹ | 0.436 (0.176) | 0.065 (0.038) | – | – |
| Education² | – | – | 0.149 (0.106) | 0.097 (0.047) |
| Family Background Variables³ | yes | yes | yes | yes |

Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable in column (1): completed education in 1976. Dependent variable in columns (2)–(4): log hourly wages in 1976 (mean 6.262, SD 0.444). All models include black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared, and full family background. Experience is endogenous in structural columns; age and age-squared are instruments.

Replication: Reduced-form columns (1)–(2) use OLS; estimating them via ivreg with endogenous experience, as in Table 3 Panel B, gives slightly different point estimates. The direct effect of nearc4 in columns (3)–(4) is small and imprecise.

¹ Interaction of an indicator for living near a 4-year college in 1966 with an indicator for both parents having less than high-school education (lowfam = 1 if famed = 9; nearc4_low = nearc4 \(\times\) lowfam).

² Column (3): instrument for schooling is nearc4_low; coefficient is the Wald ratio of reduced-form earnings to schooling effects (paper: 0.093). Column (4): instruments are nearc4 \(\times\) f1–nearc4 \(\times\) f8 (paper IV \(\approx\) 0.097).

³ Fourteen controls (card93_fb_full): parental education (years and missing indicators), eight famed interaction classes, and two family-structure indicators at age 14.
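Two pieces of the footnotes lend themselves to a quick check: how the interaction instrument is built, and the Wald ratio behind column (3). The four-row `toy` data frame is invented for illustration; the two reduced-form coefficients are the rounded values on nearc4_low from this table's replication output.

```r
# Building the interaction instrument on a tiny made-up example.
toy <- data.frame(
  nearc4 = c(1, 1, 0, 0),
  famed  = c(9, 3, 9, 1)    # famed = 9 marks the low-parental-education class
)
toy$lowfam     <- as.integer(toy$famed == 9)
toy$nearc4_low <- toy$nearc4 * toy$lowfam
print(toy)

# Column (3) as a Wald ratio of the two reduced-form coefficients on nearc4_low.
rf_ed     <- 0.436    # effect of nearc4_low on completed education
rf_wage   <- 0.0650   # effect of nearc4_low on log wages
wald_col3 <- rf_wage / rf_ed
print(round(wald_col3, 3))
```

Only the first toy row has `nearc4_low = 1`, since it is the only one that both lived near a college and falls in the low-parental-education class.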
load(here("04-topics/rep-card1993/Rcode/Table_V.RData"))
print(replication)
# A tibble: 8 × 4
col var estimate std.error
<dbl> <chr> <dbl> <dbl>
1 1 nearc4_ed 0.228 0.0916
2 1 nearc4_low_ed 0.436 0.176
3 2 nearc4_w 0.0312 0.0197
4 2 nearc4_low_w 0.0650 0.0379
5 3 iv_ed 0.149 0.106
6 3 nearc4_direct 0.00105 0.0297
7 4 iv_ed 0.0971 0.0468
8 4 nearc4_direct 0.0129 0.0242
print(paper_targets)
# A tibble: 8 × 4
col var value se
<int> <chr> <dbl> <dbl>
1 1 nearc4_ed 0.154 0.135
2 1 nearc4_low_ed 0.462 0.186
3 2 nearc4_w 0.029 0.024
4 2 nearc4_low_w 0.043 0.032
5 3 iv_ed 0.093 0.065
6 3 nearc4 0.015 0.029
7 4 iv_ed 0.097 0.048
8 4 nearc4 0.013 0.024