作业02:Stata代码

多元线性回归矩阵方法及其Stata编程

作业
参考
Author

胡华平

Published

October 8, 2023

这里将展示Homework Week02的Stata编程细节

配置Stata环境

Rmarkdownquarto文档下,需要配置对Stata命令实时运算的基本环境。

Show the code
```{r}
#| code-fold: show

library(rmarkdown)
library(Statamarkdown)
```
Stata found at C:/Program Files/Stata18/StataMP-64.exe
The 'stata' engine is ready to use.
Show the code
```{r}
#| code-fold: show

# stata path and engine
stataexe <- "C:/Program Files/Stata18/StataMP-64.exe"
knitr::opts_chunk$set(engine.path=list(stata=stataexe))
```

读取数据

Show the code
```{stata first-Stata, collectcode=TRUE}
set more off
import excel "D:\github\master-SEM\04-students\lab02-dataset-cps.xlsx", sheet("Sheet 1") firstrow  //导入数据

```
(12 vars, 46,943 obs)
Show the code
```{stata}
**** 查看数据情况 *******
describe     // 变量说明

summarize   // 基本统计
```
Contains data
 Observations:        46,943                  
    Variables:            12                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
obs             long    %10.0g                obs
wage            double  %10.0g                wage
lgwg            double  %10.0g                lgwg
edu             byte    %10.0g                edu
exper           byte    %10.0g                exper
exper2          double  %10.0g                exper2
female          byte    %10.0g                female
union           str6    %9s                   union
married         str6    %9s                   married
formMarried     str6    %9s                   formMarried
hispanic        byte    %10.0g                hispanic
race            str5    %9s                   race
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         obs |     46,943       23472    13551.42          1      46943
        wage |     46,943    24.79247    21.15943   .0003846   266.0557
        lgwg |     46,943    2.990225     .660089  -7.863267   5.583706
         edu |     46,943    14.34433    2.298745         12         20
       exper |     46,943    21.85787    11.48215         -4         67
-------------+---------------------------------------------------------
      exper2 |     46,943    6.096033    5.442559          0      44.89
      female |     46,943    .4340157    .4956322          0          1
       union |          0
     married |          0
 formMarried |          0
-------------+---------------------------------------------------------
    hispanic |     46,943    .1158852    .3200906          0          1
        race |          0

任务1:处理定性变量

定性变量 gender

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 gender****
tabulate female
generate gender_female = (female==1)
generate gender_male = (female==0)
```
     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     26,569       56.60       56.60
          1 |     20,374       43.40      100.00
------------+-----------------------------------
      Total |     46,943      100.00

定性变量 married

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 married****
tabulate married
generate married_female = (married=="female")
generate married_male = (married=="male")
generate married_other = (married=="other")
```
    married |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |     12,155       25.89       25.89
       male |     19,213       40.93       66.82
      other |     15,575       33.18      100.00
------------+-----------------------------------
      Total |     46,943      100.00

定性变量 union

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 union****
tabulate union
generate union_female = (union=="female")
generate union_male = (union=="male")
generate union_other = (union=="other")
```
      union |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |        423        0.90        0.90
       male |        622        1.33        2.23
      other |     45,898       97.77      100.00
------------+-----------------------------------
      Total |     46,943      100.00

定性变量 formMarried

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 formMarried****
tabulate formMarried
generate formMarried_female = (formMarried=="female")
generate formMarried_male = (formMarried=="male")
generate formMarried_other = (formMarried=="other")
```
formMarried |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |      4,067        8.66        8.66
       male |      2,615        5.57       14.23
      other |     40,261       85.77      100.00
------------+-----------------------------------
      Total |     46,943      100.00

定性变量 hispanic

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 hispanic****
tabulate hispanic
generate hispanic_y= (hispanic==1)
generate hispanic_n = (hispanic==0)
```
   hispanic |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     41,503       88.41       88.41
          1 |      5,440       11.59      100.00
------------+-----------------------------------
      Total |     46,943      100.00

定性变量 race

Show the code
```{stata, collectcode=TRUE}
**** 虚拟变量 race****
tabulate hispanic
generate race_black = (race=="black")
generate race_ai = (race=="AI")
generate race_asian = (race=="asian")
generate race_mixed = (race=="mixed")
generate race_other = (race=="other")
```
   hispanic |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     41,503       88.41       88.41
          1 |      5,440       11.59      100.00
------------+-----------------------------------
      Total |     46,943      100.00

任务3-1:OLS快速回归分析

stata回归结果:

Show the code
```{stata, collectcode=TRUE}
regress lgwg edu exper exper2 gender_female union_female union_male married_female married_male formMarried_female formMarried_male hispanic_y race_black race_ai race_asian race_mixed
```
      Source |       SS           df       MS      Number of obs   =    46,943
-------------+----------------------------------   F(15, 46927)    =   1137.67
       Model |  5454.42352        15  363.628235   Prob > F        =    0.0000
    Residual |  14999.0242    46,927  .319624613   R-squared       =    0.2667
-------------+----------------------------------   Adj R-squared   =    0.2664
       Total |  20453.4477    46,942  .435717432   Root MSE        =    .56535

------------------------------------------------------------------------------
        lgwg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         edu |   .1166983   .0011767    99.17   0.000     .1143919    .1190047
       exper |   .0331565   .0008865    37.40   0.000      .031419     .034894
      exper2 |  -.0564307   .0018151   -31.09   0.000    -.0599882   -.0528731
gender_fem~e |  -.0982573   .0120798    -8.13   0.000    -.1219339   -.0745806
union_female |   .0228563   .0277993     0.82   0.411    -.0316307    .0773433
  union_male |   .0951874   .0229579     4.15   0.000     .0501896    .1401852
married_fe~e |   .0161599   .0106713     1.51   0.130     -.004756    .0370758
married_male |   .2111205    .009854    21.42   0.000     .1918064    .2304346
formMa~emale |  -.0064213    .013036    -0.49   0.622     -.031972    .0191294
formMa~_male |   .0828954   .0143544     5.77   0.000     .0547605    .1110303
  hispanic_y |  -.1081364   .0083306   -12.98   0.000    -.1244644   -.0918084
  race_black |  -.0955306   .0088518   -10.79   0.000    -.1128803   -.0781809
     race_ai |  -.1374391    .025821    -5.32   0.000    -.1880488   -.0868295
  race_asian |  -.0384218   .0119141    -3.22   0.001    -.0617735     -.01507
  race_mixed |    -.04128   .0196722    -2.10   0.036    -.0798378   -.0027223
       _cons |   .9085073   .0201129    45.17   0.000     .8690857    .9479289
------------------------------------------------------------------------------

获取斜率系数\(\boldsymbol{\hat{\beta}}\),转换为列向量:

Show the code
```{stata}
matrix b_quick = (e(b))'  // coefficients
matrix list b_quick
```
b_quick[16,1]
                      y1
         edu    .1166983
       exper   .03315647
      exper2  -.05643068
gender_fem~e  -.09825725
union_female   .02285629
  union_male   .09518739
married_fe~e   .01615988
married_male   .21112049
formMa~emale  -.00642132
formMa~_male   .08289541
  hispanic_y   -.1081364
  race_black   -.0955306
     race_ai  -.13743914
  race_asian  -.03842176
  race_mixed  -.04128004
       _cons   .90850729

估计系数的标准差\(\boldsymbol{S_\hat{\beta}}\)

Show the code
```{stata}
scalar RSS_quick = e(rss)      // 获取残差平方和
matrix covvar_b_quick = e(V)       // 获取估计系数的方差协方差矩阵

mata: st_matrix("s_b_quick", sqrt(diagonal(st_matrix("e(V)"))))    // 估计系数的标准差, 这里使用了mata语法
matrix list s_b_quick
```
s_b_quick[16,1]
            c1
 r1  .00117674
 r2  .00088648
 r3  .00181506
 r4  .01207982
 r5  .02779927
 r6   .0229579
 r7  .01067131
 r8  .00985405
 r9  .01303598
r10  .01435442
r11  .00833056
r12  .00885184
r13  .02582104
r14  .01191406
r15  .01967218
r16  .02011292

任务4-1:矩阵分析

Show the code
```{stata, collectcode=TRUE}
generate const = (female > -0.1) // 常数列,全部取值为1

// 以下采用mata语法体系
mata: st_view(y=., ., "lgwg")  // 构造y矩阵
mata: st_view(X=., ., ("edu", "exper", "exper2", "gender_female", "union_female", "union_male", "married_female", "married_male", "formMarried_female", "formMarried_male", "hispanic_y", "race_black", "race_ai", "race_asian", "race_mixed", "const"))  // 构造X矩阵

mata: b = (cholinv(cross(X,X)))*(cross(X,y))  // 计算斜率系数
mata: b

mata: N = rows(y)  // 样本数
mata: k = cols(X)  // 变量数
mata: printf("rows = %f k = %f", N, k)
```
                   1
     +----------------+
   1 |   .1166983029  |
   2 |   .0331564661  |
   3 |  -.0564306759  |
   4 |  -.0982572522  |
   5 |   .0228562922  |
   6 |   .0951873853  |
   7 |   .0161598823  |
   8 |   .2111204886  |
   9 |    -.00642132  |
  10 |   .0828954056  |
  11 |  -.1081364005  |
  12 |  -.0955306037  |
  13 |  -.1374391407  |
  14 |  -.0384217612  |
  15 |  -.0412800403  |
  16 |   .9085072854  |
     +----------------+



rows = 46943 k = 16

任务4-2:方差协方差矩阵

Show the code
```{stata, collectcode=TRUE}
mata: e = y - X*b         // 残差向量
mata: RSS = e'e           // 残差平方和
mata: sigma2 = RSS/(N-k)  // 回归误差方差 或 残差均方和MSS_RSS
mata: printf("RSS = %f sigma2 = %f", RSS, sigma2)

mata: covvar_b = cholinv(cross(X,X))*sigma2  // 估计系数的方差协方差矩阵
mata: s_b = sqrt(diagonal(covvar_b))         // 估计系数的标准差
mata: s_b
```
RSS = 14999.0242 sigma2 = .319624613

                  1
     +---------------+
   1 |   .001176742  |
   2 |  .0008864807  |
   3 |  .0018150624  |
   4 |  .0120798163  |
   5 |  .0277992726  |
   6 |  .0229579012  |
   7 |  .0106713148  |
   8 |  .0098540481  |
   9 |  .0130359761  |
  10 |  .0143544223  |
  11 |  .0083305603  |
  12 |  .0088518353  |
  13 |  .0258210409  |
  14 |  .0119140569  |
  15 |  .0196721817  |
  16 |  .0201129195  |
     +---------------+

任务5-1:异方差一致性标准误差

stata快速命令法, HC1使用'vce(robust)'命令; HC2使用' vce(hc2)'命令;HC3使用'vce(hc3)'命令。此外,HC0往往在实证中不使用,因为它没有进行自由度调整

以下进行HC1稳健性方差估计结果:

Show the code
```{stata, collectcode=TRUE}
regress lgwg edu exper exper2 gender_female union_female union_male married_female married_male formMarried_female formMarried_male hispanic_y race_black race_ai race_asian race_mixed, vce(robust)
```
Linear regression                               Number of obs     =     46,943
                                                F(15, 46927)      =     969.84
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2667
                                                Root MSE          =     .56535

------------------------------------------------------------------------------
             |               Robust
        lgwg | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         edu |   .1166983   .0012822    91.01   0.000     .1141851    .1192115
       exper |   .0331565    .000952    34.83   0.000     .0312906    .0350223
      exper2 |  -.0564307   .0020755   -27.19   0.000    -.0604988   -.0523626
gender_fem~e |  -.0982573   .0110319    -8.91   0.000    -.1198799   -.0766346
union_female |   .0228563   .0196079     1.17   0.244    -.0155755    .0612881
  union_male |   .0951874   .0203004     4.69   0.000     .0553983    .1349765
married_fe~e |   .0161599   .0095411     1.69   0.090    -.0025409    .0348606
married_male |   .2111205   .0097006    21.76   0.000     .1921072    .2301338
formMa~emale |  -.0064213   .0118607    -0.54   0.588    -.0296685    .0168259
formMa~_male |   .0828954   .0145607     5.69   0.000     .0543562    .1114346
  hispanic_y |  -.1081364   .0081209   -13.32   0.000    -.1240535   -.0922193
  race_black |  -.0955306   .0083367   -11.46   0.000    -.1118706   -.0791906
     race_ai |  -.1374391   .0263727    -5.21   0.000    -.1891301   -.0857482
  race_asian |  -.0384218   .0133569    -2.88   0.004    -.0646014   -.0122421
  race_mixed |    -.04128   .0208531    -1.98   0.048    -.0821523   -.0004077
       _cons |   .9085073   .0212559    42.74   0.000     .8668455    .9501691
------------------------------------------------------------------------------

提取出估计系数的稳健标准差, 这里使用了mata语法

Show the code
```{stata, collectcode=TRUE}
mata: st_matrix("s_b_quick_robust", sqrt(diagonal(st_matrix("e(V)")))) 
matrix list s_b_quick_robust
```
s_b_quick_robust[16,1]
            c1
 r1  .00128222
 r2  .00095195
 r3  .00207555
 r4  .01103191
 r5  .01960791
 r6  .02030042
 r7  .00954113
 r8  .00970058
 r9  .01186074
r10   .0145607
r11   .0081209
r12  .00833668
r13  .02637275
r14  .01335685
r15  .02085306
r16  .02125586

手动编程计算法:

Show the code
```{stata, collectcode=TRUE}
mata: Xe = X:*e                // element product
mata: XXe2 = cross(Xe, Xe)     // (Xe)'*(Xe)
mata: hc0 =  cholinv(cross(X,X)) * XXe2 *  cholinv(cross(X,X))  // the HC0 formula
mata: hc1 = hc0 *N/(N-k)
mata: s_b_hc1 = sqrt(diagonal(hc1))  // 估计系数的HC1稳健标准差
mata: s_b_hc1

```
                  1
     +---------------+
   1 |  .0012822244  |
   2 |   .000951951  |
   3 |   .002075546  |
   4 |  .0110319057  |
   5 |  .0196079143  |
   6 |  .0203004152  |
   7 |  .0095411323  |
   8 |  .0097005811  |
   9 |  .0118607367  |
  10 |  .0145607043  |
  11 |  .0081208969  |
  12 |  .0083366799  |
  13 |  .0263727477  |
  14 |  .0133568523  |
  15 |  .0208530587  |
  16 |  .0212558573  |
     +---------------+

附录: do文件

Note

以下是do文件的全部代码行:

Show the code
```{stata, eval=FALSE}
set more off

import excel "D:\github\master-SEM\04-students\lab02-dataset-cps.xlsx", sheet("Sheet 1") firstrow  //导入数据


**** 查看数据情况 *******
describe
summarize

******任务1:处理定性变量******

**** 虚拟变量 gender****
tabulate female
generate gender_female = (female==1)
generate gender_male = (female==0)

**** 虚拟变量 married****
tabulate married
generate married_female = (married=="female")
generate married_male = (married=="male")
generate married_other = (married=="other")

**** 虚拟变量 union****
tabulate union
generate union_female = (union=="female")
generate union_male = (union=="male")
generate union_other = (union=="other")

**** 虚拟变量 formMarried****
tabulate formMarried
generate formMarried_female = (formMarried=="female")
generate formMarried_male = (formMarried=="male")
generate formMarried_other = (formMarried=="other")

**** 虚拟变量 hispanic****
tabulate hispanic
generate hispanic_y= (hispanic==1)
generate hispanic_n = (hispanic==0)

**** 虚拟变量 race****
tabulate hispanic
generate race_black = (race=="black")
generate race_ai = (race=="AI")
generate race_asian = (race=="asian")
generate race_mixed = (race=="mixed")
generate race_other = (race=="other")


******任务3-1:OLS快速回归分析******
regress lgwg edu exper exper2 gender_female union_female union_male married_female married_male formMarried_female formMarried_male hispanic_y race_black race_ai race_asian race_mixed
ereturn list

matrix b_quick = (e(b))'       // 获取斜率系数,转换为列向量
matrix list b_quick

scalar RSS_quick = e(rss)      // 获取残差平方和

matrix covvar_b_quick = e(V)       // 获取估计系数的方差协方差矩阵
matrix list covvar_b_quick

mata: st_matrix("s_b_quick", sqrt(diagonal(st_matrix("e(V)"))))    // 估计系数的标准差, 这里使用了mata语法
matrix list s_b_quick



******任务4-1:矩阵分析******
generate const = (female > -0.1) // 常数列,全部取值为1

// 以下采用mata语法体系
mata: st_view(y=., ., "lgwg")  // 构造y矩阵
mata: st_view(X=., ., ("edu", "exper", "exper2", "gender_female", "union_female", "union_male", "married_female", "married_male", "formMarried_female", "formMarried_male", "hispanic_y", "race_black", "race_ai", "race_asian", "race_mixed", "const"))  // 构造X矩阵

mata: b = (cholinv(cross(X,X)))*(cross(X,y))  // 计算斜率系数
mata: b

mata: N = rows(y)  // 样本数
mata: k = cols(X)  // 变量数
mata: printf("rows = %f k = %f", N, k)


******任务4-2:方差协方差矩阵******
mata: e = y - X*b         // 残差向量
mata: RSS = e'e           // 残差平方和
mata: sigma2 = RSS/(N-k)  // 回归误差方差 或 残差均方和MSS_RSS
mata: printf("RSS = %f sigma2 = %f", RSS, sigma2)

mata: covvar_b = cholinv(cross(X,X))*sigma2  // 估计系数的方差协方差矩阵
mata: s_b = sqrt(diagonal(covvar_b))         // 估计系数的标准差
mata: s_b


******任务5-1:异方差一致性标准误差******
// stata快速命令法, HC1使用', vce(robust)'命令; HC2使用', vce(hc2)'命令;HC3使用', vce(hc3)'命令。此外,HC0往往在实证中不使用,因为它没有进行自由度调整
regress lgwg edu exper exper2 gender_female union_female union_male married_female married_male formMarried_female formMarried_male hispanic_y race_black race_ai race_asian race_mixed, vce(robust)
mata: st_matrix("s_b_quick_robust", sqrt(diagonal(st_matrix("e(V)"))))    // 估计系数的稳健标准差, 这里使用了mata语法
matrix list s_b_quick_robust

// 手动编程计算法

mata: Xe = X:*e                // element product
mata: XXe2 = cross(Xe, Xe)     // (Xe)'*(Xe)
mata: hc0 =  cholinv(cross(X,X)) * XXe2 *  cholinv(cross(X,X))  // the HC0 formula
mata: hc1 = hc0 *N/(N-k)
mata: s_b_hc1 = sqrt(diagonal(hc1))  // 估计系数的HC1稳健标准差
mata: s_b_hc1
```