Replication with R

Using Geographic Variation in College Proximity to Estimate the Return to Schooling

Author

Hu Huaping

Published

May 1, 2025

Introduction

This page replicates the empirical tables and Figure 1 in Card (1993) using the author’s proximity extract (nls.dat, \(N = 3613\) for the 1976 cross-section). Walkthrough: replication guide. The analysis follows the modular R workflow used in Bound et al. (1995) and AK91.

Data: Rcode/proximity/nls.dat with column layout in code_bk.txt. SAS reference output: read1.lst (Table 2 col 1: ED76 = 0.0747, \(N = 3010\)).

Sample: Wage equations use men with valid 1976 log wages (\(N = 3010\)). Table 1 column (1) (\(N = 5225\) full NLS-YM cohort) is not in the distributed data file; columns (2)–(3) are replicated.

| File | Role |
|---|---|
| card1993-data-prep.R | read_fwf, variable construction, control vectors |
| card1993-gt-quarto.R | gt table formatting |
| Table_I.R–Table_V.R | Tables 1–5 |
| Figure_1.R | Figure 1: schooling by predicted-education quartile |

Set rerun_script <- TRUE in the setup chunk to re-run all scripts and refresh .RData files.

Figure 1 — College proximity and predicted schooling

Mean completed education by quartile of predicted schooling (fit on the no-nearby-college subsample) and 1966 college proximity.

Figure 1: Mean completed education by predicted-education quartile and college proximity in 1966. The prediction equation is fit to the subsample with no college nearby.
load(here("04-topics/rep-card1993/Rcode/Figure_1.RData"))
print(fig_summary)
# A tibble: 8 × 5
  pred_quartile nearc4 mean_ed     n nearc4_lab         
          <int>  <int>   <dbl> <int> <chr>              
1             1      0    10.8   358 No nearby college  
2             1      1    11.7   546 Near 4-year college
3             2      0    12.5   312 No nearby college  
4             2      1    12.9   591 Near 4-year college
5             3      0    13.4   254 No nearby college  
6             3      1    13.9   649 Near 4-year college
7             4      0    14.9   239 No nearby college  
8             4      1    15.1   664 Near 4-year college
print(paper_note)
# A tibble: 2 × 2
  metric                               value
  <chr>                                <dbl>
1 R2 prediction (no college subsample) 0.316
2 Lowest quartile gap (near - no)      0.935

Note: The prediction equation is fit on men who did not grow up near a 4-year college (nearc4 = 0); predicted-schooling quartiles are then formed for the full sample. The replication \(R^2\) is about 0.32 (paper fn. 16: 0.30). The lowest-quartile gap in mean education (near minus no nearby college) is 0.935 years here (paper: \(\approx 1.1\) years).
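The procedure in the note can be sketched in base R. This is a minimal illustration on simulated data, not the contents of Figure_1.R: the data-generating values, the choice of predictors (momed, daded), and the object names other than nearc4, ed76, and pred_quartile are assumptions made for the example.

```r
# Simulated stand-in for the proximity extract; all values are hypothetical.
set.seed(1993)
n      <- 3000
nearc4 <- rbinom(n, 1, 0.68)          # 1 = grew up near a 4-year college
momed  <- rnorm(n, 10, 3)             # mother's education (years)
daded  <- rnorm(n, 10, 3)             # father's education (years)
ed76   <- 6 + 0.25 * momed + 0.25 * daded + 0.4 * nearc4 + rnorm(n, 0, 2)
dat    <- data.frame(ed76, nearc4, momed, daded)

# Step 1: fit the prediction equation on the no-nearby-college subsample only.
fit <- lm(ed76 ~ momed + daded, data = dat[dat$nearc4 == 0, ])

# Step 2: predict schooling for everyone and form quartiles on the full sample.
dat$pred_ed <- predict(fit, newdata = dat)
breaks <- quantile(dat$pred_ed, probs = seq(0, 1, 0.25))
dat$pred_quartile <- cut(dat$pred_ed, breaks = breaks,
                         include.lowest = TRUE, labels = FALSE)

# Step 3: mean completed education by quartile and proximity (8 cells).
fig_summary <- aggregate(ed76 ~ pred_quartile + nearc4, data = dat, FUN = mean)

# Step 4: near-minus-no-college gap in the lowest predicted quartile.
q1     <- fig_summary[fig_summary$pred_quartile == 1, ]
gap_q1 <- q1$ed76[q1$nearc4 == 1] - q1$ed76[q1$nearc4 == 0]
```

Fitting on the nearc4 = 0 subsample keeps the proximity effect out of the prediction itself, so differences between the near and not-near groups within a quartile are not absorbed by the quartile assignment.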

Table 1 — Sample characteristics

Table 1: Sample characteristics for the overall sample and 1976 subsets of the National Longitudinal Survey of Young Men
| | (1) Overall NLS-YM¹ | (2) 1976 interview; valid education | (3) Valid wage & education |
|---|---|---|---|
| 1. Age Distribution in 1966: | | | |
| Age 14-15 (%) | | 25.3 | 25.5 |
| Age 16-17 | | 23.8 | 24.1 |
| Age 18-20 | | 24.1 | 24.7 |
| Age 21-24 | | 26.7 | 25.8 |
| 2. Regional Distribution in 1966:² | | | |
| Northeast (%) | | 20.0 | 20.7 |
| Midwest | | 26.3 | 26.0 |
| South | | 41.3 | 41.4 |
| West | | 12.5 | 11.9 |
| 3. Lived in SMSA 1966 (%) | | 64.3 | 65.0 |
| 4. Lived Near 4-year College in 1966 (%) | | 67.8 | 68.2 |
| 5. Family structure at Age 14: | | | |
| Mother & Father (%) | | 79.2 | 78.9 |
| Mother Only (%) | | 10.0 | 10.1 |
| 6. Average Parental Education | | | |
| Mother's Education (yrs) | | 10.3 | 10.3 |
| Father's Education (yrs) | | 10.0 | 10.0 |
| 7. Percent Black | | 23.0 | 23.4 |
| 8. Average score on KWW Test | | 33.5 | 33.5 |
| 9. Interviewed in 1976 (%) | | 100.0 | 100.0 |
| 10. Mean Education in 1976 | | 13.2 | 13.3 |
| 11. Live in south in 1976 (%) | | 40.0 | 40.4 |
| 12. Sample size | 5225 | 3613 | 3010 |
Notes: Means are based on all available valid observations in each subsample (Card, Table 1 notes).
1 Column (1) shows published values for the original NLS-YM cohort (N = 5225). The proximity extract (nls.dat) contains only the 1976 cross-section; column (1) cannot be fully replicated from the distributed data file.
2 Northeast = reg661 + reg662; Midwest = reg663 + reg664; West = reg668 + reg669; South = south66.
load(here("04-topics/rep-card1993/Rcode/Table_I.RData"))
print(replication)
# A tibble: 1 × 9
  col2_n col3_n age_14_15_c2 age_14_15_c3 nearc4_c2 nearc4_c3 black_c2
   <int>  <int>        <dbl>        <dbl>     <dbl>     <dbl>    <dbl>
1   3613   3010         25.3         25.5      67.8      68.2     23.0
# ℹ 2 more variables: mean_ed_c2 <dbl>, mean_ed_c3 <dbl>

Table 2 — OLS log wage equations

Table 2: Estimated regression models for log hourly earnings (standard errors in parentheses)
| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| 1. Education | 0.074 (0.004) | 0.075 (0.003) | 0.073 (0.004) | 0.074 (0.004) | 0.073 (0.004) |
| 2. Experience | 0.084 (0.007) | 0.085 (0.007) | 0.085 (0.007) | 0.085 (0.007) | 0.085 (0.007) |
| 3. Experience-squared /100 | -0.224 (0.032) | -0.229 (0.032) | -0.230 (0.032) | -0.226 (0.032) | -0.229 (0.032) |
| 4. Black Indicator | -0.190 (0.018) | -0.199 (0.018) | -0.194 (0.019) | -0.194 (0.019) | -0.189 (0.019) |
| 5. Live in South | -0.125 (0.015) | -0.148 (0.026) | -0.146 (0.026) | -0.145 (0.026) | -0.146 (0.026) |
| 6. Live in SMSA | 0.161 (0.016) | 0.136 (0.020) | 0.136 (0.020) | 0.137 (0.020) | 0.138 (0.020) |
| 7. Region in 1966 (8 indicators) | no | yes | yes | yes | yes |
| 8. Live in SMSA in 1966 | no | yes | yes | yes | yes |
| 9. Parental education (years + missing indicators)¹ | no | no | yes | yes | yes |
| 10. Interacted parental education classes² | no | no | no | yes | yes |
| 11. Family structure (2 indicators)³ | no | no | no | no | yes |
| 12. R-squared | 0.291 | 0.300 | 0.301 | 0.303 | 0.304 |
| 13. F-test on family background variables | | | 0.235 | 0.619 | 0.028 |
Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable: log hourly wages in 1976 (mean 6.262, SD 0.444). Experience is age76 − ed76 − 6; experience-squared enters as \(\text{experience}^2/100\) (SAS uses raw squared experience, so the coefficients differ by a factor of 100).
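Replication: The SAS read1.lst MODEL3 coefficient (ED76 = 0.0747) equals the replicated column (2) estimate; the replicated column (1) estimate is 0.0740, which rounds to the published 0.074. Row 13 incremental F-test p-values for columns (4)–(5) differ from the paper (0.619 vs. 0.462 in column (4); 0.028 vs. 0.165 in column (5)).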
1 Years of mother’s and father’s education plus indicators for imputed values (nodaded, nomomed).
2 Eight interacted parental-education classes (f1–f8 from famed).
3 Lived with both parents (momdad14); lived with single mother (sinmom14).
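The factor-of-100 relationship between the two experience-squared scalings can be checked directly. As a worked illustration, write \(\beta_2\) for the coefficient on raw squared experience and \(\gamma_2\) for the coefficient on \(\text{experience}^2/100\) (both symbols are introduced here and do not appear in the source):

\[
\beta_2 \,\text{experience}^2 = (100\,\beta_2)\,\frac{\text{experience}^2}{100}
\quad\Longrightarrow\quad
\gamma_2 = 100\,\beta_2 .
\]

Under this rescaling, the column (1) estimate \(\gamma_2 = -0.224\) corresponds to \(\beta_2 \approx -0.00224\) on raw squared experience, and the standard error scales by the same factor (0.032 becomes about 0.00032).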
load(here("04-topics/rep-card1993/Rcode/Table_II.RData"))
print(replication)
# A tibble: 5 × 5
    col     ed      se    r2       f
  <int>  <dbl>   <dbl> <dbl>   <dbl>
1     1 0.0740 0.00351 0.291 NA     
2     2 0.0747 0.00350 0.300 NA     
3     3 0.0733 0.00365 0.301  0.235 
4     4 0.0737 0.00368 0.303  0.619 
5     5 0.0726 0.00370 0.304  0.0282
print(paper_targets)
# A tibble: 5 × 5
    col    ed    se    r2      f
  <int> <dbl> <dbl> <dbl>  <dbl>
1     1 0.074 0.004 0.291 NA    
2     2 0.075 0.003 0.3   NA    
3     3 0.073 0.004 0.301  0.235
4     4 0.074 0.004 0.303  0.462
5     5 0.073 0.004 0.304  0.165

Table 3 — Reduced form and IV

Table 3: Reduced-form and structural estimates of education and earnings models (standard errors in parentheses)
| | (1) Reduced form: Education | (2) Reduced form: Education | (3) Reduced form: Earnings | (4) Reduced form: Earnings | (5) Structural: Earnings | (6) Structural: Earnings |
|---|---|---|---|---|---|---|
| Panel A: Treat experience and experience squared as exogenous | | | | | | |
| Live Near College in 1966 | 0.320 (0.088) | 0.322 (0.083) | 0.042 (0.018) | 0.045 (0.018) | | |
| Education | | | | | 0.132 (0.055) | 0.140 (0.055) |
| Family Background Variables¹ | no | yes | no | yes | no | yes |
| Panel B: Treat experience and experience squared as endogenous² | | | | | | |
| Live Near College in 1966 | 0.382 (0.114) | 0.365 (0.105) | 0.047 (0.019) | 0.048 (0.019) | | |
| Education | | | | | 0.122 (0.046) | 0.132 (0.049) |
| Family Background Variables¹ | no | yes | no | yes | no | yes |
Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable in columns (1)–(2): completed education in 1976 (mean 13.263, SD 2.677). Dependent variable in columns (3)–(6): log hourly wages in 1976 (mean 6.262, SD 0.444). All models include black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared.
Replication: Panel A IV columns (5)–(6) use Wald ratios of reduced-form coefficients. Panel B column (6) with full family background is the baseline IV specification for Table 4 row 1.
1 Fourteen family-background controls: parental education (years and missing indicators), eight famed interaction classes, and two family-structure indicators (card93_fb_full).
2 Experience and experience-squared are endogenous; instruments are age76 and age2 (= \(\text{age76}^2/100\)). Reduced-form equations use ivreg with experience instrumented by age.
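The Wald-ratio construction in the replication note can be verified by hand from the rounded Panel A reduced-form coefficients. Writing \(\hat\pi_{\text{wage}}\) and \(\hat\pi_{\text{ed}}\) for the proximity coefficients in the earnings and education reduced forms (notation introduced here for illustration):

\[
\hat\beta_{\text{IV}} = \frac{\hat\pi_{\text{wage}}}{\hat\pi_{\text{ed}}},
\qquad
\text{column (5): } \frac{0.0421}{0.320} \approx 0.132,
\qquad
\text{column (6): } \frac{0.0453}{0.322} \approx 0.141 .
\]

The column (6) ratio computed from the printed, rounded inputs is 0.141 rather than the tabulated 0.140; a discrepancy in the third decimal is what rounding of the inputs would be expected to produce.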
load(here("04-topics/rep-card1993/Rcode/Table_III.RData"))
print(replication)
# A tibble: 4 × 5
  panel   col nearc_ed nearc_wage iv_ed
  <chr> <int>    <dbl>      <dbl> <dbl>
1 A         1    0.320     0.0421 0.132
2 A         2    0.322     0.0453 0.140
3 B         1    0.382     0.0467 0.122
4 B         2    0.365     0.0484 0.132
print(paper_targets)
# A tibble: 10 × 4
   panel   col metric     value
   <chr> <int> <chr>      <dbl>
 1 A         1 nearc_ed   0.32 
 2 A         2 nearc_ed   0.322
 3 A         1 nearc_wage 0.042
 4 A         2 nearc_wage 0.045
 5 A         5 iv_ed      0.132
 6 A         6 iv_ed      0.14 
 7 B         1 nearc_ed   0.382
 8 B         2 nearc_ed   0.365
 9 B         5 iv_ed      0.122
10 B         6 iv_ed      0.132

Table 4 — Robustness

Table 4: OLS and IV estimates of the return to education under alternative specifications (standard errors in parentheses)
| | OLS Estimate | IV Estimate¹ |
|---|---|---|
| 1. Basic Specification (N = 3010) | 0.073 (0.004) | 0.132 (0.049) |
| 2. Use 1978 Wages and Education (N = 2639 with 1978 data) | 0.069 (0.004) | 0.122 (0.062) |
| 3. Include KWW Test Score (N = 2963 with valid KWW) | 0.055 (0.004) | 0.136 (0.078) |
| 4. Include KWW; instrument KWW with IQ (N = 2040 with valid KWW and IQ)² | 0.061 (0.005) | 0.085 (0.013) |
| 5. Use Proximity to Public College as instrument for education³ | as in row 1 | 0.194 (0.059) |
| 6. Use Proximities to 2-year and 4-year colleges as instruments for education⁴ | as in row 1 | 0.117 (0.047) |
| 7. Use Subsample Age 14-19 in 1966 (N = 2037)⁵ | 0.074 (0.006) | 0.094 (0.064) |
Notes: Dependent variable in rows 1 and 3–7: log hourly wages in 1976. Row 2: log hourly wages in 1978. Estimates are coefficients on the linear education term with black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared, and fourteen family-background controls unless noted.
Replication: Row 2 uses all men with valid lwage78 (N = 2639) and is not restricted to the N = 3010 wage subsample. The row 4 IV standard error in this replication (0.013) is far below the paper's value (0.085) and should be treated with caution. In rows 5–6 the OLS estimates are those of row 1; only the instruments differ.
1 Education and experience are endogenous; instruments are nearc4 (or alternatives noted below), age76, and age2, plus all exogenous controls. Row 1 matches Panel B, column (6) of Table 3.
2 KWW enters the wage equation and is instrumented by IQ (iq); subsample with non-missing KWW and IQ.
3 Instrument for schooling is proximity to a public 4-year college (nearc4a).
4 Instruments are nearc4 (4-year) and nearc2 (2-year college proximity).
5 Subsample with age66 \(\leq\) 19 in 1966.
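The IV setup described in footnote 1 (education and experience endogenous; proximity, age, and age-squared as instruments) can be sketched in base R. The source scripts use ivreg; the version below swaps in explicit matrix algebra so that it runs without extra packages. The data are simulated, and every coefficient value and the `ability` variable are assumptions made for illustration only.

```r
# Simulated data: unobserved ability raises both schooling and wages (hypothetical DGP).
set.seed(42)
n       <- 5000
ability <- rnorm(n)
nearc4  <- rbinom(n, 1, 0.68)
age76   <- sample(24:34, n, replace = TRUE)
ed76    <- 12 + 0.8 * nearc4 + 1.0 * ability + rnorm(n)
exper   <- age76 - ed76 - 6              # experience = age - education - 6
exper2  <- exper^2 / 100
age2    <- age76^2 / 100
lwage76 <- 5 + 0.10 * ed76 + 0.08 * exper - 0.2 * exper2 +
           0.3 * ability + rnorm(n, 0, 0.4)

# Regressors (education, experience, experience-squared treated as endogenous)
# and instruments (proximity, age, age-squared); the model is just identified.
X <- cbind(const = 1, ed76, exper, exper2)
Z <- cbind(const = 1, nearc4, age76, age2)

# First stage: project each regressor on the instruments.
Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))

# Second stage: 2SLS coefficients.
b_2sls <- solve(crossprod(Xhat, X), crossprod(Xhat, lwage76))

# Standard errors use residuals built from the actual (not fitted) regressors.
u       <- lwage76 - X %*% b_2sls
sigma2  <- sum(u^2) / (n - ncol(X))
se_2sls <- sqrt(diag(sigma2 * solve(crossprod(Xhat))))

# OLS on the same specification, for comparison.
b_ols <- coef(lm(lwage76 ~ ed76 + exper + exper2))
```

Computing the residuals from `X` rather than `Xhat` is the step that a naive two-step `lm()` call gets wrong: the point estimates agree, but the second-stage `lm()` standard errors do not.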
load(here("04-topics/rep-card1993/Rcode/Table_IV.RData"))
print(replication)
# A tibble: 7 × 6
    row     n    ols  ols_se     iv  iv_se
  <int> <int>  <dbl>   <dbl>  <dbl>  <dbl>
1     1  3010 0.0726 0.00370 0.132  0.0493
2     2  2639 0.0691 0.00406 0.122  0.0616
3     3  2963 0.0554 0.00440 0.136  0.0777
4     4  2040 0.0613 0.00549 0.0848 0.0126
5     5  3010 0.0726 0.00370 0.194  0.0591
6     6  3010 0.0726 0.00370 0.117  0.0472
7     7  2037 0.0745 0.00613 0.0944 0.0645
print(paper_targets)
# A tibble: 7 × 5
    row   ols ols_se    iv iv_se
  <int> <dbl>  <dbl> <dbl> <dbl>
1     1 0.073  0.006 0.132 0.049
2     2 0.066  0.006 0.117 0.061
3     3 0.055  0.004 0.136 0.078
4     4 0.061  0.005 0.089 0.085
5     5 0.073  0.006 0.194 0.059
6     6 0.073  0.006 0.117 0.047
7     7 0.076  0.006 0.094 0.064

Table 5 — Interaction IV

Table 5: IV estimates based on interaction of parental education and proximity to college (standard errors in parentheses)
| | (1) Reduced form: Education | (2) Reduced form: Earnings | (3) Structural: Earnings | (4) Structural: Earnings |
|---|---|---|---|---|
| Live Near College in 1966 | 0.228 (0.092) | 0.031 (0.020) | 0.001 (0.030) | 0.013 (0.024) |
| Live Near College × Low Parental Education¹ | 0.436 (0.176) | 0.065 (0.038) | | |
| Education² | | | 0.149 (0.106) | 0.097 (0.047) |
| Family Background Variables³ | yes | yes | yes | yes |
Notes: Standard errors in parentheses. Sample size is 3010. Dependent variable: completed education in 1976 in the reduced-form education column; log hourly wages in 1976 (mean 6.262, SD 0.444) in the earnings columns. All models include black, 1976 South/SMSA, 1966 region/SMSA, experience and experience-squared, and full family background. Experience is endogenous in structural columns; age and age-squared are instruments.
Replication: Reduced-form columns (1)–(2) use OLS; with endogenous experience, Panel B RF via ivreg gives slightly different point estimates. Direct effect of nearc4 in columns (3)–(4) is small and imprecise.
1 Interaction of an indicator for living near a 4-year college in 1966 with an indicator for both parents having less than high-school education (lowfam = 1 if famed = 9; nearc4_low = nearc4 \(\times\) lowfam).
2 Column (3): instrument for schooling is nearc4_low; coefficient is the Wald ratio of reduced-form earnings to schooling effects (paper: 0.093). Column (4): instruments are nearc4 \(\times\) f1 through nearc4 \(\times\) f8 (paper IV \(\approx\) 0.097).
3 Fourteen controls (card93_fb_full): parental education (years and missing indicators), eight famed interaction classes, and two family-structure indicators at age 14.
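The column (3) Wald ratio described in footnote 2 can be reproduced from the interaction coefficients in the two reduced-form columns. With \(\hat\pi^{\text{low}}_{\text{wage}}\) and \(\hat\pi^{\text{low}}_{\text{ed}}\) denoting the nearc4_low coefficients in the earnings and education reduced forms (notation introduced here for illustration):

\[
\hat\beta_{\text{IV}} = \frac{\hat\pi^{\text{low}}_{\text{wage}}}{\hat\pi^{\text{low}}_{\text{ed}}}
= \frac{0.0650}{0.436} \approx 0.149 ,
\]

which agrees with the replicated column (3) education coefficient of 0.149.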
load(here("04-topics/rep-card1993/Rcode/Table_V.RData"))
print(replication)
# A tibble: 8 × 4
    col var           estimate std.error
  <dbl> <chr>            <dbl>     <dbl>
1     1 nearc4_ed      0.228      0.0916
2     1 nearc4_low_ed  0.436      0.176 
3     2 nearc4_w       0.0312     0.0197
4     2 nearc4_low_w   0.0650     0.0379
5     3 iv_ed          0.149      0.106 
6     3 nearc4_direct  0.00105    0.0297
7     4 iv_ed          0.0971     0.0468
8     4 nearc4_direct  0.0129     0.0242
print(paper_targets)
# A tibble: 8 × 4
    col var           value    se
  <int> <chr>         <dbl> <dbl>
1     1 nearc4_ed     0.154 0.135
2     1 nearc4_low_ed 0.462 0.186
3     2 nearc4_w      0.029 0.024
4     2 nearc4_low_w  0.043 0.032
5     3 iv_ed         0.093 0.065
6     3 nearc4        0.015 0.029
7     4 iv_ed         0.097 0.048
8     4 nearc4        0.013 0.024