统计学原理

(Statistics)

第5章 相关和回归分析
05-03 OLS方法与参数估计


Hu Huaping (胡华平)

huhuaping01 at hotmail.com

经济管理学院(CEM)

第五章 相关和回归分析

5.3 OLS方法与参数估计

普通最小二乘法(OLS)

引言

我们如何估计回归函数中的系数?

总体回归: \[ \begin{cases} \begin{aligned} E(Y|X_i) &= \beta_1 +\beta_2X_i && \text{(PRF)} \\ Y_i &= \beta_1 +\beta_2X_i + u_i && \text{(PRM)} \end{aligned} \end{cases} \]

样本回归: \[ \begin{cases} \begin{aligned} \hat{Y}_i & =\hat{\beta}_1 + \hat{\beta}_2X_i && \text{(SRF)} \\ Y_i &= \hat{\beta}_1 + \hat{\beta}_2X_i +e_i && \text{(SRM)} \end{aligned} \end{cases} \]

首先需要回答的问题是:我们该如何估计出样本回归函数中的系数?事实上,估计方法多种多样:

  • 图解法:比较粗糙,但提供了基本的视觉认知

  • 普通最小二乘法(ordinary least squares, OLS):最常用的方法

  • 最大似然法(maximum likelihood, ML)

  • 矩估计法(method of moments, MM)

回顾和比较

总体回归函数PRF:

\[ \begin{aligned} E(Y|X_i) &= \beta_1 +\beta_2X_i \end{aligned} \]

总体回归模型PRM:

\[ \begin{aligned} Y_i &= \beta_1 +\beta_2X_i + u_i \end{aligned} \]

样本回归函数SRF:

\[ \begin{aligned} \hat{Y}_i =\hat{\beta}_1 + \hat{\beta}_2X_i \end{aligned} \]

样本回归模型SRM:

\[ \begin{aligned} Y_i &= \hat{\beta}_1 + \hat{\beta}_2X_i +e_i \end{aligned} \]

思考:

  • PRF无法直接观测,只能用SRF近似替代

  • 估计值与观测值之间存在偏差

  • SRF又是怎样决定的呢?

原理1/2

认识普通最小二乘法的原理:一个图示

原理2/2

OLS的基本原理:残差平方和最小化。

\[ \begin{aligned} e_i &= Y_i - \hat{Y}_i \\ &= Y_i - (\hat{\beta}_1 +\hat{\beta}_2X_i) \end{aligned} \]

\[ \begin{aligned} Q &= \sum{e_i^2} \\ &= \sum{(Y_i - \hat{Y}_i)^2} \\ &= \sum{\left( Y_i - (\hat{\beta}_1 +\hat{\beta}_2X_i) \right)^2} \\ &\equiv f(\hat{\beta}_1,\hat{\beta}_2) \end{aligned} \]

\[ \begin{aligned} Min(Q) &= Min \left ( f(\hat{\beta}_1,\hat{\beta}_2) \right) \end{aligned} \]

(示例) 普通最小二乘法(OLS)的一个数值试验

假设存在下面所示的4组观测值 \((X_i, Y_i)\)

(示例) 普通最小二乘法(OLS)的一个数值试验

假设猜想两个SRF,完成下表计算,并分析哪个SRF给出的 \((\hat{\beta}_1, \hat{\beta}_2)\) 要更好?

\[ \begin{aligned} SRF1:\hat{Y}_{1i} & = \hat{\beta}_1 +\hat{\beta}_2X_i = 1.572 + 1.357X_i \\ SRF2:\hat{Y}_{2i} & = \hat{\beta}_1 +\hat{\beta}_2X_i = 3.0 + 1.0X_i \end{aligned} \]
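上述数值试验可以用一小段 Python 脚本来演示(下面的4组观测值是笔者假设的示例数据,并非文中表格的原始数据):

```python
# 假设的 4 组观测值 (X_i, Y_i),仅作演示,非文中原表数据
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]

def rss(b1, b2):
    """给定截距 b1 与斜率 b2,计算残差平方和 Q = sum(e_i^2)"""
    return sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(X, Y))

Q1 = rss(1.572, 1.357)  # SRF1 的残差平方和
Q2 = rss(3.0, 1.0)      # SRF2 的残差平方和
print(Q1, Q2)
```

运行后可见 SRF1 的残差平方和小于 SRF2,这正是“残差平方和最小化”准则下“更好”的含义。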

参数估计

回归参数的OLS点估计

  • 最小化求解:

\[ \begin{aligned} Min(Q) = Min \left ( f(\hat{\beta}_1,\hat{\beta}_2) \right) &= Min\left(\sum{\left( Y_i - (\hat{\beta}_1 +\hat{\beta}_2X_i) \right)^2} \right) \\ &= Min \sum{\left( Y_i - \hat{\beta}_1 - \hat{\beta}_2X_i \right)^2} \end{aligned} \]

  • 方程组变形,得到正规方程组

\[ \begin{aligned} \left \{ \begin{split} \sum{\left[ \hat{\beta}_1 - (Y_i -\hat{\beta}_2X_i) \right]} &=0 \\ \sum{\left[ X_i^2\hat{\beta}_2 - (Y_i-\hat{\beta}_1 )X_i \right ] }&=0 \end{split} \right. \end{aligned} \]

\[ \begin{aligned} \left \{ \begin{split} \sum{Y_i} - n\hat{\beta}_1- (\sum{X_i})\hat{\beta}_2 &=0 \\ \sum{X_iY_i}-(\sum{X_i})\hat{\beta}_1 - (\sum{X_i^2})\hat{\beta}_2 &=0 \end{split} \right. \end{aligned} \]
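正规方程组的求解过程可以写成如下的 Python 草稿(数据为假设的示例观测值,用克莱姆法则解这个 2×2 线性方程组):

```python
# 用克莱姆法则求解正规方程组,得到 (b1, b2);数据为假设的示例观测值
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]
n = len(X)

S_x  = sum(X)                                # sum(X_i)
S_y  = sum(Y)                                # sum(Y_i)
S_xx = sum(xi * xi for xi in X)              # sum(X_i^2)
S_xy = sum(xi * yi for xi, yi in zip(X, Y))  # sum(X_i * Y_i)

# 正规方程组:
#   n*b1   + S_x*b2  = S_y
#   S_x*b1 + S_xx*b2 = S_xy
det = n * S_xx - S_x ** 2
b2 = (n * S_xy - S_x * S_y) / det
b1 = (S_y - S_x * b2) / n
print(b1, b2)
```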

回归参数的OLS点估计1/2

进而得到回归系数的计算公式1(Favorite Five,FF):

\[ \begin{aligned} \left \{ \begin{split} \hat{\beta}_2 &=\frac{n\sum{X_iY_i}-\sum{X_i}\sum{Y_i}}{n\sum{X_i^2}-\left ( \sum{X_i} \right)^2}\\ \hat{\beta}_1 &=\frac{\sum{X_i^2}\sum{Y_i}-\sum{X_i}\sum{X_iY_i}}{n\sum{X_i^2}-\left ( \sum{X_i} \right)^2} \end{split} \right. &&\text{(FF solution)} \end{aligned} \]

回归参数的OLS点估计2/2

此外我们也可以得到如下的离差公式(favorite five,ff)

\[ \begin{aligned} \left \{ \begin{split} \hat{\beta}_2 &=\frac{\sum{x_iy_i}}{\sum{x_i^2}}\\ \hat{\beta}_1 &=\bar{Y}-\hat{\beta}_2\bar{X} \end{split} \right. && \text{(ff solution)} \end{aligned} \]

其中离差计算 \(x_i=X_i-\bar{X};\ y_i=Y_i - \bar{Y}\)
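下面的示例脚本(数据为假设值)同时用 FF 公式和离差形式 ff 公式计算回归系数,验证两者在数值上一致:

```python
# 用离差形式(ff)与 FF 公式分别计算回归系数,验证两者等价
# 数据为假设的示例观测值,并非文中案例数据
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n

# 离差 x_i = X_i - Xbar, y_i = Y_i - Ybar
x = [Xi - Xbar for Xi in X]
y = [Yi - Ybar for Yi in Y]

# ff 公式(离差形式)
b2_ff = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
b1_ff = Ybar - b2_ff * Xbar

# FF 公式(原始观测值形式)
det = n * sum(Xi ** 2 for Xi in X) - sum(X) ** 2
b2_FF = (n * sum(Xi * Yi for Xi, Yi in zip(X, Y)) - sum(X) * sum(Y)) / det

print(b1_ff, b2_ff, b2_FF)  # b2_ff 与 b2_FF 应相等
```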

(测试题)

以下式子为什么是等价的?你能推导出来么?

\[ \begin{aligned} \left\{ \begin{split} \sum{x_iy_i} &= \sum{\left[ (X_i-\bar{X})(Y_i-\bar{Y})\right]} &&= \sum{X_iY_i} - \frac{1}{n}\sum{X_i}\sum{Y_i} \\ \sum{x_i^2} &= \sum{(X_i- \bar{X})^2} &&= \sum{X_i^2} -\frac{1}{n} \left( \sum{X_i} \right)^2 \end{split} \right. \end{aligned} \]
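该恒等式可以先做数值验证,再尝试代数推导。下面是一个 Python 验证草稿(数据为任意假设值):

```python
# 数值验证:sum(x_i*y_i) = sum(X_i*Y_i) - (1/n)*sum(X_i)*sum(Y_i)
#           sum(x_i^2)   = sum(X_i^2)   - (1/n)*(sum(X_i))^2
X = [2.0, 3.5, 5.0, 7.5, 9.0]   # 任意示例数据
Y = [1.0, 2.0, 2.5, 4.0, 5.5]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n

lhs_xy = sum((xi - Xbar) * (yi - Ybar) for xi, yi in zip(X, Y))
rhs_xy = sum(xi * yi for xi, yi in zip(X, Y)) - sum(X) * sum(Y) / n

lhs_xx = sum((xi - Xbar) ** 2 for xi in X)
rhs_xx = sum(xi * xi for xi in X) - sum(X) ** 2 / n

print(lhs_xy, rhs_xy, lhs_xx, rhs_xx)
```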

随机干扰项参数的OLS点估计:残差公式

PRM公式变形:

\[ \begin{alignedat}{2} &\left. \begin{split} Y_i &&= \beta_1 + &&\beta_2X_i +u_i \ && \text{(PRM)} \\ \bar{Y} &&= \beta_1 + &&\beta_2\bar{X} +\bar{u} && \\ \end{split} \right \} \Rightarrow \\ & y_i = \beta_2x_i +(u_i- \bar{u}) \end{alignedat} \]

残差公式变形:

\[ \begin{alignedat}{2} &\left. \begin{split} & e_i = y_i - \hat{\beta}_2x_i \\ & e_i = \beta_2x_i +(u_i- \bar{u}) -\hat{\beta}_2x_i \end{split} \right \} \Rightarrow \\ & e_i =-(\hat{\beta}_2- \beta_2)x_i + (u_i- \bar{u}) \end{alignedat} \]

随机干扰项参数的OLS点估计:残差平方和

求解残差平方和:

\[ \begin{aligned} \sum{e_i^2} &= (\hat{\beta}_2 - \beta_2)^2\sum{x_i^2} + \sum{(u_i-\bar{u})^2} - 2(\hat{\beta}_2 - \beta_2)\sum{x_i(u_i-\bar{u})} \end{aligned} \]

求残差平方和的期望:

\[ \begin{aligned} E(\sum{e_i^2}) &= \sum{x_i^2 E \left[ (\hat{\beta}_2 - \beta_2)^2 \right ]}+ E\left[ \sum{(u_i-\bar{u})^2} \right ]\\ &- 2E \left[ (\hat{\beta}_2 - \beta_2)\sum{x_i(u_i-\bar{u})} \right ] \\ & \equiv A + B + C \\ & = \sigma^2 + (n-1)\sigma^2 -2\sigma^2 \\ & = (n-2)\sigma^2 \end{aligned} \]

随机干扰项参数的OLS点估计:回归误差方差

回归误差方差(Variance of Regression Error):

  • 采用OLS方法下,总体回归模型PRM中随机干扰项 \(u_i\) 的总体方差 \(\sigma^2\) 的无偏估计量,记为 \(\hat{\sigma}^2\) ,满足 \(E(\hat{\sigma}^2) = \sigma^2\)

\[ \begin{aligned} \hat{\sigma}^2=\frac{\sum{e_i^2}}{n-2} \end{aligned} \]

回归误差标准差(Standard Deviation of Regression Error):有时候也记为se

\[ \begin{aligned} \hat{\sigma}=\sqrt{\frac{\sum{e_i^2}}{n-2}} \end{aligned} \]

(附录)A过程证明

\[ \begin{aligned} A & = \sum{x_i^2 E \left[ (\hat{\beta}_2 - \beta_2)^2 \right ]} \\ & = \sum{ \left[ x_i^2 \cdot var(\hat{\beta}_2) \right] } \\ & = var(\hat{\beta}_2) \cdot \sum{x_i^2} \\ & = \frac{\sigma^2}{\sum{ x_i^2}} \cdot \sum{ x_i^2} \\ & = \sigma^2 \end{aligned} \]

(附录)B过程证明

\[ \begin{aligned} B = E \left[ \sum{(u_i-\bar{u})^2} \right ] & = E(\sum{u_i^2}) - 2E \left[ \sum{(u_i\bar{u})} \right] +nE(\bar{u}^2) \\ & = n \cdot Var(u_i) - 2E \left[ \sum{ \left( u_i \cdot \frac{\sum{u_i}}{n} \right) } \right] + nE\left[ \left( \frac{\sum{u_i}}{n} \right)^2 \right] \\ & = n \sigma^2 - 2E \left[ \frac{\sum{u_i}}{n} \sum{u_i} \right] + E\left[ \frac{(\sum{u_i})^2}{n} \right]\\ & = n \sigma^2- E\left[ (\sum{u_i})^2/{n} \right] = n \sigma^2 - \frac{E(u_1^2) + E(u_2^2) + \cdots + E(u_n^2)}{n} \\ & = n \sigma^2 - \frac{n \cdot Var(u_i)}{n} = n \sigma^2 - \sigma^2 = (n-1) \sigma^2 \end{aligned} \]

(附录)C过程证明

\[ \begin{aligned} C &= - 2E \left[ (\hat{\beta}_2 - \beta_2)\sum{x_i(u_i-\bar{u})} \right ] \\ &= - 2E \left[ \frac{\sum{x_iu_i}}{\sum{x_i^2}} \left( \sum{x_iu_i}-\bar{u}\sum{x_i} \right) \right ] \\ &= - 2E \left[ \frac{ \left( \sum{x_iu_i} \right)^2}{\sum{x_i^2}} \right ] && \leftarrow \left[ \sum{x_i}=0 \right] \\ &= -2\sum{x_i^2} \cdot E \left[(\hat{\beta}_2 - \beta_2)^2 \right] = -2\sum{x_i^2} \cdot \frac{\sigma^2}{\sum{x_i^2}} = -2\sigma^2 \end{aligned} \]

  • 其中:

\[ \begin{aligned} \hat{\beta}_2 & = \sum{k_iY_i} = \sum{k_i(\beta_1 +\beta_2X_i +u_i)} = \beta_1\sum{k_i} +\beta_2 \sum{k_iX_i}+\sum{k_iu_i} = \beta_2 +\sum{k_iu_i} \\ \hat{\beta}_2 - \beta_2 & = \sum{k_iu_i} = \frac{ \sum{x_iu_i} }{\sum{x_i^2}} \end{aligned} \]

(案例)计算表FF和ff

(案例)计算回归系数

公式1: (Favorite Five,FF形式)

\[ \begin{aligned} \hat{\beta}_2 &=\frac{n\sum{X_iY_i}-\sum{X_i}\sum{Y_i}}{n\sum{X_i^2}-\left ( \sum{X_i} \right)^2}\\ &=\frac{ 13 \ast 1485.04 - 156 \ast 112.771}{ 13 \ast 2054 - 156^2} \\ &= 0.7241 \end{aligned} \]

\[ \begin{aligned} \hat{\beta}_1 &= \bar{Y} - \hat{\beta}_2 \bar{X} \\ &= 8.6747 - 0.7241 \ast 12 \\ &= -0.0145 \end{aligned} \]

(案例)计算回归系数

公式2:(离差形式,favorite five,ff形式)

\[ \begin{aligned} \hat{\beta}_2 = \frac{\sum{x_iy_i}}{\sum{x_i^2}} = \frac{ 131.786 }{ 182 } = 0.7241 \end{aligned} \]

\[ \begin{aligned} \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X} = 8.6747 - 0.7241 \ast 12 = -0.0145 \end{aligned} \]
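利用文中“教育程度案例”已给出的各汇总量,可以用几行 Python 复核回归系数(FF 与 ff 两种公式):

```python
# 文中“教育程度案例”的汇总量(取自 FF-ff 计算表)
n    = 13
S_x  = 156       # sum(X_i)
S_y  = 112.771   # sum(Y_i)
S_xy = 1485.04   # sum(X_i * Y_i)
S_xx = 2054      # sum(X_i^2)
s_xy = 131.786   # sum(x_i * y_i),离差交叉积
s_xx = 182       # sum(x_i^2),离差平方和

b2_FF = (n * S_xy - S_x * S_y) / (n * S_xx - S_x ** 2)  # FF 公式
b2_ff = s_xy / s_xx                                     # ff 公式
b1    = S_y / n - b2_ff * (S_x / n)                     # b1 = Ybar - b2*Xbar
print(round(b2_FF, 4), round(b2_ff, 4), round(b1, 4))
```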

(案例)样本回归方程SRF

\[ \begin{aligned} \hat{Y_i} &= \hat{\beta}_1 + \hat{\beta}_2X_i \\ &= -0.0145 + 0.7241X_i \end{aligned} \]

(案例)样本回归线SRL


(案例)计算得到拟合值和残差

根据以上样本回归方程,可以计算得到 \(Y_i\) 的回归拟合值 \(\hat{Y}_i\) ,以及回归残差 \(e_i\)

\[ \begin{aligned} \hat{Y}_i &=\hat{\beta}_1 +\hat{\beta}_2X_i\\ e_i &= Y_i - \hat{Y}_i \end{aligned} \]

(案例)计算回归误差方差和标准差

回归误差方差 \(\hat{\sigma}^2\)

\[ \begin{aligned} \hat{\sigma}^2= \frac{\sum{e_i^2}} {(n-2)} = \frac{ 9.693 }{ 11 } = 0.8812 \end{aligned} \]

回归误差标准差 \(\hat{\sigma}\)

\[ \begin{aligned} \hat{\sigma}=\sqrt{\frac{\sum{e_i^2}}{(n-2)}} = \sqrt{ 0.8812 } = 0.9387 \end{aligned} \]
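根据文中的残差平方和 \(\sum{e_i^2}=9.693\) 与 \(n=13\) ,可以复核回归误差方差与标准差:

```python
import math

rss_e = 9.693   # 文中的残差平方和 sum(e_i^2)
n     = 13

sigma2_hat = rss_e / (n - 2)          # 回归误差方差
sigma_hat  = math.sqrt(sigma2_hat)    # 回归误差标准差(se)
print(round(sigma2_hat, 4), round(sigma_hat, 4))
```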

“估计值”与“估计量”

理解OLS方法下的“估计值”与“估计量”

回归系数的计算公式1(Favorite Five,FF):

\[ \begin{aligned} \left \{ \begin{split} \hat{\beta}_2 &=\frac{n\sum{X_iY_i}-\sum{X_i}\sum{Y_i}}{n\sum{X_i^2}-\left ( \sum{X_i} \right)^2}\\ \hat{\beta}_1 &=\frac{\sum{X_i^2}\sum{Y_i}-\sum{X_i}\sum{X_iY_i}}{n\sum{X_i^2}-\left ( \sum{X_i} \right)^2} \end{split} \right. &&\text{(FF solution)} \end{aligned} \]

  • 如果给出的参数估计结果是由一个具体样本资料计算出来的,它是一个“估计值”,或者“点估计”,是参数估计量的一个具体数值;

  • 如果把上式看成参数估计的一个表达式,那么,则它是 \((X_i,Y_i)\) 的函数,而 \(Y_i\) 是随机变量,所以参数估计也是随机变量,在这个角度上,称之为“估计量”。

SRF和SRM的特征

OLS估计量完全由可观测的样本量(即X和Y)表达,因此很容易计算。

它们是点估计量(point estimators),即对于给定样本,每个估计量仅提供有关总体参数的一个(点)值。

一旦从样本数据得到OLS估计值,便容易画出样本回归线。

SRF和SRM的特征1/5

  • 特征1:样本回归线一定会经过样本均值点 \((\bar{X}, \bar{Y})\)

\[ \begin{aligned} \bar{Y} = \hat{\beta}_1 +\hat{\beta}_2\bar{X} \end{aligned} \]

  • 特征2: \(Y_i\)估计值( \(\hat{Y}_i\) )的均值( \(\bar{\hat{Y_i}}\) )等于Y的样本均值( \(\bar{Y}\) )

\[ \begin{aligned} \hat{Y_i} &= \hat{\beta}_1 +\hat{\beta}_2X_i \\ & =(\bar{Y} - \hat{\beta}_2\bar{X}) + \hat{\beta}_2X_i \\ & = \bar{Y} + \hat{\beta}_2(X_i - \bar{X}) \end{aligned} \]

\[ \begin{aligned} &\Rightarrow \frac{1}{n}\sum{\hat{Y_i}} = \frac{1}{n}\sum{\left[ \bar{Y} + \hat{\beta}_2(X_i - \bar{X}) \right]} \\ &\Rightarrow \bar{\hat{Y_i}} = \bar{Y} \end{aligned} \]

SRF和SRM的特征2/5

  • 特征3:残差的均值( \(\bar{e_i}\) )为零:

\[ \begin{aligned} \sum{\left[ \hat{\beta}_1 - (Y_i -\hat{\beta}_2X_i) \right]} &=0 \\ \sum{\left( Y_i- \hat{\beta}_1 - \hat{\beta}_2X_i \right)} &=0 \\ \sum{( Y_i- \hat{Y}_i )} &=0 \\ \sum{e_i} &=0 \\ \bar{e_i} &=0 \end{aligned} \]
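特征1~3都可以在任意一组数据上数值验证。下面的 Python 草稿(数据为假设的示例观测值)检查回归线经过均值点、拟合值均值等于 \(\bar{Y}\) 、残差之和为零:

```python
# 数据为假设的示例观测值
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n

# OLS 系数(离差公式)
b2 = sum((xi - Xbar) * (yi - Ybar) for xi, yi in zip(X, Y)) \
     / sum((xi - Xbar) ** 2 for xi in X)
b1 = Ybar - b2 * Xbar

Yhat = [b1 + b2 * xi for xi in X]            # 拟合值
e    = [yi - yh for yi, yh in zip(Y, Yhat)]  # 残差

print(b1 + b2 * Xbar, Ybar)   # 特征1:回归线经过 (Xbar, Ybar)
print(sum(Yhat) / n, Ybar)    # 特征2:拟合值均值等于 Ybar
print(sum(e))                 # 特征3:残差之和(均值)为零
```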

SRF和SRM的特征3/5

  • 特征4:SRM和SRF可以写成离差形式:

\[ \begin{alignedat}{999} & \left. \begin{split} Y_i && = \hat{\beta}_1 + \hat{\beta}_2X_i + e_i \\ \bar{Y} &&= \hat{\beta}_1 + \hat{\beta}_2\bar{X} \end{split} \right \} \Rightarrow \\ & Y_i - \bar{Y} =\hat{\beta}_2(X_i - \bar{X}) + e_i \Rightarrow \\ & y_i=\hat{\beta}_2x_i +e_i \ &&\text{(SRM-dev)} \end{alignedat} \]

\[ \begin{alignedat}{999} & \left. \begin{split} \hat{Y}_i && = \hat{\beta}_1 + \hat{\beta}_2X_i\\ \bar{Y} &&= \hat{\beta}_1 + \hat{\beta}_2\bar{X} \end{split} \right \} \Rightarrow \\ & \hat{Y}_i - \bar{Y} =\hat{\beta_2}(X_i - \bar{X}) \Rightarrow \\ & \hat{y}_i=\hat{\beta_2}x_i \ &&\text{(SRF-dev)} \end{alignedat} \]

SRF和SRM的特征4/5

  • 特征5:残差( \(e_i\) )和 \(Y_i\) 的拟合值( \(\hat{Y_i}\) )不相关

\[ \begin{aligned} Cov(e_i, \hat{Y_i}) &= E \left[ \left( e_i-E(e_i)\right )\cdot \left( \hat{Y_i}-E(\hat{Y_i})\right ) \right] = E(e_i \cdot \hat{y_i}) \\ & = \sum(e_i \cdot \hat{\beta}_2x_i) \\ & = \sum{ \left[ (y_i-\hat{\beta}_2x_i) \cdot \hat{\beta}_2x_i \right]} \\ & = \hat{\beta}_2\sum \left[ (y_i-\hat{\beta}_2x_i)\cdot x_i \right]\\ & = \hat{\beta}_2\sum \left( y_ix_i-\hat{\beta}_2x_i^2 \right)\\ & = \hat{\beta}_2\sum{x_iy_i}-\hat{\beta}_2^2\sum{x_i^2} && \Leftarrow \hat{\beta}_2 = \frac{\sum{x_iy_i}}{\sum{x_i^2}} \\ & = \hat{\beta}_2^2\sum{x_i^2}- \hat{\beta}_2^2\sum{x_i^2} = 0 \end{aligned} \]

  • 特征6:残差( \(e_i\) )和自变量( \(X_i\) )不相关
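特征5和特征6同样可以数值验证。下面的草稿(沿用前面假设的示例数据)计算残差与拟合值、残差与自变量的离差交叉积,结果都应为零:

```python
# 沿用假设的示例观测值
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n
b2 = sum((xi - Xbar) * (yi - Ybar) for xi, yi in zip(X, Y)) \
     / sum((xi - Xbar) ** 2 for xi in X)
b1 = Ybar - b2 * Xbar
Yhat = [b1 + b2 * xi for xi in X]
e = [yi - yh for yi, yh in zip(Y, Yhat)]

# 特征5:残差与拟合值的离差交叉积为零
cov_e_yhat = sum(ei * (yh - Ybar) for ei, yh in zip(e, Yhat))
# 特征6:残差与自变量的离差交叉积为零
cov_e_x = sum(ei * (xi - Xbar) for ei, xi in zip(e, X))
print(cov_e_yhat, cov_e_x)
```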

离差公式1/2

  • 离差定义与符号:

\[ \begin{aligned} x_i &= X_i - \bar{X} \\ y_i &= Y_i - \bar{Y} \\ \hat{y}_i &= \hat{Y}_i - \bar{\hat{Y}}_i = \hat{Y}_i - \bar{Y} \end{aligned} \]

  • PRM及其离差形式:

\[ \begin{alignedat}{999} & \left. \begin{split} Y_i && = \beta_1 + \beta_2X_i + u_i \\ \bar{Y} &&= \beta_1 + \beta_2\bar{X} + \bar{u} \end{split} \right \} \Rightarrow \\ & Y_i - \bar{Y} =\beta_2(X_i - \bar{X}) + (u_i- \bar{u}) \Rightarrow \\ & y_i=\beta_2x_i + (u_i- \bar{u}) \ &&\text{(PRM-dev)} \end{alignedat} \]

离差公式2/2

  • SRM及其离差形式: \[ \begin{alignedat}{999} & \left. \begin{split} Y_i && = \hat{\beta}_1 + \hat{\beta}_2X_i + e_i \\ \bar{Y} &&= \hat{\beta}_1 + \hat{\beta}_2\bar{X} \end{split} \right \} \Rightarrow \\ & Y_i - \bar{Y} =\hat{\beta_2}(X_i - \bar{X}) + e_i \Rightarrow \\ & y_i=\hat{\beta_2}x_i +e_i \end{alignedat} \]
  • SRF及其离差形式:

\[ \begin{alignedat}{999} & \left. \begin{split} \hat{Y}_i && = \hat{\beta}_1 + \hat{\beta}_2X_i\\ \bar{Y} &&= \hat{\beta}_1 + \hat{\beta}_2\bar{X} \end{split} \right \} \Rightarrow \\ & \hat{Y}_i - \bar{Y} =\hat{\beta_2}(X_i - \bar{X}) \Rightarrow \\ & \hat{y}_i=\hat{\beta_2}x_i \ \end{alignedat} \]

  • 残差的离差形式:

\[ \begin{aligned} y_i=\hat{\beta_2}x_i +e_i &&\text{(SRM-dev)} \ \Rightarrow \\ e_i =y_i - \hat{\beta_2}x_i \ &&\text{(residual-dev)} \end{aligned} \]

思考与讨论

内容小结

  • 普通最小二乘方法(OLS)采用“铅垂线距离平方和最小化”的思想,来拟合一条样本回归线,进而求解出模型参数估计量。

  • 大家需要很熟练地记住OLS参数估计量公式,以及它们的几大重要特征!

思考讨论

  • OLS采用的“铅垂线距离平方和最小化”这一方案,凭什么它被奉为计量分析的经典方法?你觉得还有其他可行替代方案么?

  • 回归标准误差 \(se\) 的现实含义是什么?回归参数估计与随机干扰项的方差估计有什么内在联系么?

  • OLS方法的几个特征,是不是使它“天生丽质”、“娘胎里生下来就含着金钥匙”?为什么能这么说?

估计精度

引子

我们已经使用OLS方法分别得到总体回归模型(PRM)的3个重要参数(实际不止3个)的点估计量:

\[ \begin{aligned} Y_i &= \beta_1 +\beta_2X_i + u_i \\ \hat{\beta}_2 &=\frac{\sum{x_iy_i}}{\sum{x_i^2}} ; \quad \hat{\beta}_1 =\bar{Y}-\hat{\beta}_2\bar{X} ; \quad \hat{\sigma}^2 =\frac{\sum{e_i^2}}{n-2} \end{aligned} \]

问题是:我们如何知道OLS方法得到的点估计量是否可靠?是否稳定?是否可信?

因此,我们需要找到一种表达OLS方法估计稳定性或估计精度的指标!

  • 点估计量的方差(variance)和标准差(standard deviation)就是衡量估计稳定性或估计精度的一类重要指标!

斜率系数的方差和样本方差

斜率系数( \(\hat{\beta}_2\) )的总体方差( \(\sigma^2_{\hat{\beta}_2}\) )和总体标准差( \(\sigma_{\hat{\beta}_2}\) ):

\[ \begin{aligned} Var(\hat{\beta}_2) \equiv \sigma_{\hat{\beta}_2}^2 & =\frac{\sigma^2}{\sum{x_i^2}} \\ \sigma_{\hat{\beta}_2} &=\sqrt{\frac{\sigma^2}{\sum{x_i^2}}} \end{aligned} \]

  • 其中, \(Var(u_i) \equiv \sigma^2\) 表示随机干扰项 \(u_i\) 的总体方差。

斜率系数( \(\hat{\beta}_2\) )的样本方差( \(S^2_{\hat{\beta}_2}\) )和样本标准差( \(S_{\hat{\beta}_2}\) ):

\[ \begin{aligned} S_{\hat{\beta}_2}^2 &=\frac{\hat{\sigma}^2}{\sum{x_i^2}} \\ S_{\hat{\beta}_2} &=\sqrt{\frac{\hat{\sigma}^2}{\sum{x_i^2}}} \end{aligned} \]

  • 其中, \(\hat{\sigma}^2 = \frac{\sum{e_i^2}}{n-2}\) 表示对随机干扰项( \(u_i\) )的总体方差的无偏估计量,满足 \(E(\hat{\sigma}^2) = \sigma^2\)

(附录)证明过程1

步骤1: \(\hat{\beta}_2\) 的变形:

\[ \begin{aligned} \hat{\beta}_2 &=\frac{\sum{x_iy_i}}{\sum{x_i^2}}= \frac{\sum{\left[ x_i (Y_i -\bar{Y}) \right]} }{\sum{x_i^2}} \\ & = \frac{\sum{ x_iY_i}- \sum{ x_i \bar{Y} } }{\sum{x_i^2}} \\ & = \frac{\sum{x_iY_i}- \bar{Y}\sum{x_i} }{\sum{x_i^2}} && \leftarrow \left[ \sum{x_i}=\sum{(X_i -\bar{X})} = 0 \right] \\ & = \sum{ \left(\frac{x_i}{\sum{x_i^2}} \cdot Y_i \right) } && \leftarrow \left[ k_i \equiv \frac{x_i}{\sum{x_i^2}} \right]\\ & = \sum{k_iY_i} \end{aligned} \]

  • 其中, \(k_i \equiv \frac{x_i}{\sum{x_i^2}}\)

(附录)证明过程2

步骤2:计算 \(\hat{\beta}_2\) 的总体方差( \(\sigma^2_{\hat{\beta}_2}\) ):

\[ \begin{aligned} \sigma^2_{\hat{\beta}_2} & \equiv Var(\hat{\beta}_2) = Var(\sum{k_iY_i} ) \\ & = \sum{\left( k_i^2Var(Y_i) \right)} \\ & = \sum{\left( k_i^2Var(\beta_1 +\beta_2X_i +u_i) \right)} \\ & = \sum{ \left( k_i^2Var(u_i) \right)} && \leftarrow \left[ k_i \equiv \frac{x_i}{\sum{x_i^2}} \right]\\ & = \sum{ \left( \left(\frac{x_i}{\sum{x_i^2}} \right)^2 \cdot \sigma^2 \right)} \\ & = \frac{\sigma^2}{\sum{x_i^2}} \end{aligned} \]

其中, \(Var(u_i) \equiv \sigma^2\) 表示随机干扰项 \(u_i\) 的总体方差。

截距系数的方差和样本方差

截距系数( \(\hat{\beta}_1\) )的总体方差( \(\sigma^2_{\hat{\beta}_1}\) )和总体标准差( \(\sigma_{\hat{\beta}_1}\) ):

\[ \begin{aligned} Var(\hat{\beta}_1) \equiv \sigma_{\hat{\beta}_1}^2 &=\frac{\sum{X_i^2}}{n} \cdot \frac{\sigma^2}{\sum{x_i^2}} \\ \sigma_{\hat{\beta}_1} & =\sqrt{\frac{\sum{X_i^2}}{n} \cdot \frac{\sigma^2}{\sum{x_i^2}}} \end{aligned} \]

  • 其中, \(Var(u_i) \equiv \sigma^2\) 表示随机干扰项 \((u_i)\) 的总体方差。

截距系数( \(\hat{\beta}_1\) )的样本方差( \(S^2_{\hat{\beta}_1}\) )和样本标准差( \(S_{\hat{\beta}_1}\) ):

\[ \begin{aligned} S_{\hat{\beta}_1}^2 &=\frac{\sum{X^2_i}}{n} \cdot \frac{\hat{\sigma}^2}{\sum{x_i^2}} \\ S_{\hat{\beta}_1} &=\sqrt{\frac{\sum{X^2_i}}{n} \cdot \frac{\hat{\sigma}^2}{\sum{x_i^2}}} \end{aligned} \]

  • 其中, \(\hat{\sigma}^2 = \frac{\sum{e_i^2}}{n-2}\) 表示对随机干扰项 \((u_i)\) 的总体方差的无偏估计量,满足 \(E(\hat{\sigma}^2) = \sigma^2\)

(附录)证明过程1

步骤1: \(\hat{\beta}_1\) 的变形:

\[ \begin{aligned} \hat{\beta}_1 & = \bar{Y}-\hat{\beta}_2\bar{X} && \leftarrow \left[ \hat{\beta}_2= \sum{k_iY_i} \right] \\ & = \frac{1}{n} \sum{Y_i} - \sum{\left( k_iY_i \cdot \bar{X} \right)} \\ & = \sum{\left( \left( \frac{1}{n} - k_i\bar{X} \right) \cdot Y_i \right)} && \leftarrow \left[ w_i \equiv \frac{1}{n} - k_i\bar{X} \right]\\ & = \sum{w_iY_i} \end{aligned} \]

  • 其中:令 \(w_i \equiv \frac{1}{n} - k_i\bar{X}\)

(附录)证明过程2

步骤2:计算 \(\hat{\beta}_1\) 的总体方差( \(\sigma^2_{\hat{\beta}_1}\) ):

\[ \begin{aligned} \sigma^2_{\hat{\beta}_1} & \equiv Var(\hat{\beta_1}) = Var(\sum{w_iY_i}) \\ & = \sum{\left( w_i^2Var(\beta_1 +\beta_2X_i + u_i) \right)} && \leftarrow \left[w_i \equiv \frac{1}{n} - k_i\bar{X} \right]\\ & = \sum{\left( \left( \frac{1}{n} - k_i\bar{X} \right)^2Var(u_i) \right)} \\ & = \sigma^2 \cdot \sum{ \left( \frac{1}{n^2} - \frac{2 \bar{X} k_i}{n} + k_i^2 \bar{X}^2 \right) } && \leftarrow \left[ \sum{k_i} = \sum{\left( \frac{x_i}{\sum{x_i^2}} \right)= \frac{\sum{x_i}} {\sum{x_i^2}}}=0 \right] \\ & = \sigma^2 \cdot \left( \frac{1}{n} + \bar{X}^2\sum{k_i^2} \right) && \leftarrow \left[ k_i \equiv \frac{x_i}{\sum{x_i^2}} \right]\\ & = \sigma^2 \cdot \left( \frac{1}{n} + \bar{X}^2\sum{ \left( \frac{x_i}{\sum{x_i^2}} \right) ^2} \right) \end{aligned} \]

(附录)证明过程2(续)

步骤2:计算 \(\hat{\beta}_1\) 的总体方差( \(\sigma^2_{\hat{\beta}_1}\) )(续前):

\[ \begin{aligned} & = \sigma^2 \cdot \left( \frac{1}{n} + \bar{X}^2 \frac{\sum{x_i^2}}{\left( \sum{x_i^2} \right)^2} \right) \\ & = \sigma^2 \cdot \left( \frac{1}{n} + \frac{ \bar{X}^2 } { \sum{x_i^2} } \right) \\ & = \frac{\sum{x_i^2} + n\bar{X}^2} {n\sum{x_i^2}} \cdot \sigma^2 && \leftarrow \left[ \sum{x_i^2} + n\bar{X}^2 = \sum{(X_i-\bar{X})^2} + n\bar{X}^2 = \sum{X_i^2}\right]\\ & = \frac{\sum{X_i^2}}{n} \cdot \frac{\sigma^2}{\sum{x_i^2}} \end{aligned} \]

小结与思考

现在做一个内容小结

  • 为了衡量OLS方法的点估计量是否稳定或是否可信,我们一般采用方差和标准差指标来表达。

  • 大家应熟记斜率和截距估计量的总体方差与样本方差的最终公式。

小结与思考

请大家思考如下问题:

  • 总体方差和样本方差都是确定的数么?

  • 二者分别受哪些因素的影响?二者又有什么联系?

  • 证明过程中约定的 \(k_i\) 和 \(w_i\) ,有什么特征?

\[ \begin{cases} \begin{aligned} \sum{k_i} & =0 \\ \sum{k_iX_i} & = 1 \end{aligned} \end{cases} \]

\[ \begin{cases} \begin{aligned} \sum{w_i} & =1 \\ \sum{w_iX_i} & = 0 \end{aligned} \end{cases} \]
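\(k_i\) 与 \(w_i\) 的这四条性质可以直接数值验证(数据为假设的示例观测值):

```python
# 数据为假设的示例观测值
X = [1, 4, 5, 6]
n = len(X)
Xbar = sum(X) / n
s_xx = sum((xi - Xbar) ** 2 for xi in X)   # sum(x_i^2)

k = [(xi - Xbar) / s_xx for xi in X]       # k_i = x_i / sum(x_i^2)
w = [1 / n - ki * Xbar for ki in k]        # w_i = 1/n - k_i * Xbar

print(sum(k))                                # 应为 0
print(sum(ki * xi for ki, xi in zip(k, X)))  # 应为 1
print(sum(w))                                # 应为 1
print(sum(wi * xi for wi, xi in zip(w, X)))  # 应为 0
```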

(案例)计算回归系数的样本方差

对于“教育程度案例”,利用FF-ff计算表,以及我们已算出的如下计算量:

  • 回归误差方差: \(\hat{\sigma}^2=\) 0.8812。

则可以进一步计算出,回归系数的样本方差和样本标准差分别为:

\[ \begin{aligned} S^2_{\hat{\beta}_2} &= \frac{\hat{\sigma}^2}{\sum{x_i^2}} \\ S_{\hat{\beta}_2} &= \sqrt{\frac{\hat{\sigma}^2}{\sum{x_i^2}}} = \sqrt{0.0048} = 0.0696 \end{aligned} \]

\[ \begin{aligned} S^2_{\hat{\beta}_1} &= \frac{\sum{X_i^2}} {n} \frac{\hat{\sigma}^2} {\sum{x_i^2}} \\ S_{\hat{\beta}_1} &= \sqrt{\frac{\sum{X_i^2}}{n}\frac{\hat{\sigma}^2} {\sum{x_i^2}}} = \sqrt{0.765} = 0.8746 \end{aligned} \]
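利用文中给出的 \(\hat{\sigma}^2=0.8812\) 、 \(\sum{X_i^2}=2054\) 、 \(\sum{x_i^2}=182\) ,可以复核上面的样本方差与样本标准差:

```python
import math

sigma2_hat = 0.8812   # 回归误差方差(文中计算值)
n    = 13
S_xx = 2054           # sum(X_i^2)
s_xx = 182            # sum(x_i^2)

S2_b2 = sigma2_hat / s_xx                  # 斜率系数的样本方差
S2_b1 = (S_xx / n) * sigma2_hat / s_xx     # 截距系数的样本方差
print(round(S2_b2, 4), round(math.sqrt(S2_b2), 4))
print(round(S2_b1, 4), round(math.sqrt(S2_b1), 4))
```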

区间估计

斜率系数

\[ \begin{aligned} \hat{\beta}_2 & \sim N(\mu_{\hat{\beta}_2}, \sigma^2_{\hat{\beta}_2}) && \leftarrow \left[ \mu_{\hat{\beta}_2}= \beta_2; \quad \sigma^2_{\hat{\beta}_2} = \frac{\sigma^{2}}{\sum x_{i}^{2}} \right] \end{aligned} \]

\[ \begin {aligned} &Z=\frac{\hat{\beta}_{2}-\beta_{2}}{\sqrt{\operatorname{var}\left(\hat{\beta}_{2}\right)}} =\frac{\hat{\beta}_{2}-\beta_{2}}{\sqrt{\sigma_{\hat{\beta}_{2}}^{2}}} =\frac{\hat{\beta}_{2}-\beta_{2}}{\sigma_{\hat{\beta}_{2}}} =\frac{\hat{\beta}_{2}-\beta_{2}}{\sqrt{\frac{\sigma^{2}}{\sum x_{i}^{2}}}} && \leftarrow Z \sim N(0, 1) \end {aligned} \]

\[ \begin{aligned} T&=\frac{\hat{\beta}_{2}-\beta_{2}}{\sqrt{S_{\hat{\beta}_{2}}^{2}}} =\frac{\hat{\beta}_{2}-\beta_{2}}{S_{\hat{\beta}_{2}}} && \leftarrow T \sim t(n-2) \end{aligned} \]

\[ \begin{aligned} S^2_{\hat{\beta}_2} =\frac{\hat{\sigma}^{2}}{\sum x_{i}^{2}} ; \quad \hat{\sigma}^{2}=\frac{\sum e_{i}^{2}}{n-2} \end{aligned} \]

\[ \begin{aligned} \operatorname{Pr}\left[-t_{\alpha / 2,(n-2)} \leq \mathrm{T} \leq t_{\alpha / 2,(n-2)}\right]=1-\alpha \end{aligned} \]

斜率系数

\[ \begin {aligned} \operatorname{Pr}\left[-t_{\alpha / 2,(n-2)} \leq \frac{\hat{\beta}_{2}-\beta_{2}}{S_{\hat{\beta}_{2}}} \leq t_{\alpha / 2 ,(n-2)}\right]=1-\alpha \end {aligned} \]

\[ \begin {aligned} \operatorname{Pr}\left[\hat{\beta}_{2}-t_{\alpha / 2,(n-2)} \cdot S_{\hat{\beta}_{2}} \leq \beta_{2} \leq \hat{\beta}_{2}+t_{\alpha / 2,(n-2)} \cdot S_{\hat{\beta}_{2}}\right]=1-\alpha \end {aligned} \]

因此, \(\beta_2\)\(100(1-\alpha)\%\) 置信上限和下限分别为:

\[ \hat{\beta}_{2} \pm t_{\alpha / 2} \cdot S_{\hat{\beta}_{2}} \]

\(\beta_2\) 的 \(100(1-\alpha)\%\) 置信区间为:

\[ \left[ \hat{\beta}_{2} - t_{\alpha / 2} \cdot S_{\hat{\beta}_{2}}, \quad \hat{\beta}_{2} + t_{\alpha / 2} \cdot S_{\hat{\beta}_{2}} \right] \]

截距系数

\[ \begin{aligned} \hat{\beta}_1 & \sim N(\mu_{\hat{\beta}_1}, \sigma^2_{\hat{\beta}_1}) && \leftarrow \left[ \mu_{\hat{\beta}_1}= \beta_1; \quad \sigma^2_{\hat{\beta}_1} = \frac{\sum{X_i^2}}{n} \frac{\sigma^{2}}{\sum x_{i}^{2}} \right] \end{aligned} \]

\[ \begin {aligned} &Z=\frac{\hat{\beta}_{1}-\beta_{1}}{\sqrt{\operatorname{var}\left(\hat{\beta}_{1}\right)}} =\frac{\hat{\beta}_{1}-\beta_{1}}{\sqrt{\sigma_{\hat{\beta}_{1}}^{2}}} =\frac{\hat{\beta}_{1}-\beta_{1}}{\sigma_{\hat{\beta}_{1}}} =\frac{\hat{\beta}_{1}-\beta_{1}}{\sqrt{\frac{\sum{X^2_i}}{n} \cdot \frac{\sigma^{2}}{\sum x_{i}^{2}}}} && \leftarrow Z \sim N(0, 1) \end {aligned} \]

\[ \begin{aligned} T&=\frac{\hat{\beta}_{1}-\beta_{1}}{\sqrt{S_{\hat{\beta}_{1}}^{2}}} =\frac{\hat{\beta}_{1}-\beta_{1}}{S_{\hat{\beta}_{1}}} && \leftarrow T \sim t(n-2) \end{aligned} \]

\[ \begin{aligned} S^2_{\hat{\beta}_1} =\frac{\sum{X_i^2}}{n} \cdot \frac{\hat{\sigma}^{2}}{\sum x_{i}^{2}} ; \quad \hat{\sigma}^{2}=\frac{\sum e_{i}^{2}}{n-2} \end{aligned} \]

\[ \begin{aligned} \operatorname{Pr}\left[-t_{\alpha / 2,(n-2)} \leq \mathrm{T} \leq t_{\alpha / 2,(n-2)}\right]=1-\alpha \end{aligned} \]

截距系数

\[ \begin {aligned} \operatorname{Pr}\left[-t_{\alpha / 2,(n-2)} \leq \frac{\hat{\beta}_{1}-\beta_{1}}{S_{\hat{\beta}_{1}}} \leq t_{\alpha / 2 ,(n-2)}\right]=1-\alpha \end {aligned} \]

\[ \begin {aligned} \operatorname{Pr}\left[\hat{\beta}_{1}-t_{\alpha / 2,(n-2)} \cdot S_{\hat{\beta}_{1}} \leq \beta_{1} \leq \hat{\beta}_{1}+t_{\alpha / 2,(n-2)} \cdot S_{\hat{\beta}_{1}}\right]=1-\alpha \end {aligned} \]

因此, \(\beta_1\)\(100(1-\alpha)\%\) 置信上限和下限分别为:

\[ \hat{\beta}_{1} \pm t_{\alpha / 2} \cdot S_{\hat{\beta}_{1}} \]

\(\beta_1\) 的 \(100(1-\alpha)\%\) 置信区间为:

\[ \left[ \hat{\beta}_{1} - t_{\alpha / 2} \cdot S_{\hat{\beta}_{1}}, \quad \hat{\beta}_{1} + t_{\alpha / 2} \cdot S_{\hat{\beta}_{1}} \right] \]

随机干扰项的方差

\[ \begin {aligned} \chi^{2} & =(n-2) \frac{\hat{\sigma}^{2}}{\sigma^{2}} &&\leftarrow \quad \chi^{2} \sim \chi^{2}(n-2) \end {aligned} \]

\[ \begin {aligned} \operatorname{Pr}\left(\chi_{\alpha / 2}^{2} \leq \chi^{2} \leq \chi_{1-\alpha / 2}^{2}\right)=1-\alpha \end {aligned} \]

\[ \begin {aligned} \operatorname{Pr}\left(\chi_{\alpha / 2}^{2} \leq (n-2) \frac{\hat{\sigma}^{2}}{\sigma^{2}} \leq \chi_{1-\alpha / 2}^{2}\right)=1-\alpha \end {aligned} \]

\[ \begin {aligned} \operatorname{Pr}\left[(n-2) \frac{\hat{\sigma}^{2}}{\chi_{1-\alpha/2}^{2}} \leq \sigma^{2} \leq (n-2) \frac{\hat{\sigma}^{2}}{\chi_{\alpha / 2}^{2}}\right]=1-\alpha \end {aligned} \]

因此, \(\sigma^2\) 的 \(100(1-\alpha)\%\) 置信区间为:

\[ \left[ (n-2) \frac{\hat{\sigma}^{2}}{\chi_{1-\alpha/2}^{2}}, \quad (n-2) \frac{\hat{\sigma}^{2}}{\chi_{\alpha / 2}^{2}}\right] \]

(案例)主模型

我们继续利用样本数据对教育和工资案例进行分析。

教育和工资案例的总体回归模型(PRM)如下:

\[ \begin{aligned} Wage_i & = \beta_1 + \beta_2 Edu_i +u_i \\ Y_i & = \beta_1 + \beta_2 X_i +u_i \\ \end{aligned} \]

教育和工资案例的样本回归模型(SRM)如下:

\[ \begin{aligned} Wage_i & = \hat{\beta}_1 + \hat{\beta}_2 Edu_i +e_i \\ Y_i & = \hat{\beta}_1 + \hat{\beta}_2 X_i + e_i \\ \end{aligned} \]

(案例)相关计算量

我们之前已算出“教育程度案例”中的如下计算量:

  • 回归系数: \(\hat{\beta}_1 =\) -0.0145; \(\hat{\beta}_2 =\) 0.7241; \(\hat{\sigma}^2=\) 0.8812 。

  • 回归误差方差: \(\hat{\sigma}^2=\) 0.8812。

  • 回归系数的样本方差: \(S^2_{\hat{\beta}_1} = \frac{\sum{X_i^2}}{n} \cdot \frac{\hat{\sigma}^2} {\sum{x_i^2}}=\) 0.7650; \(S^2_{\hat{\beta}_2} = \frac{\hat{\sigma}^2} {\sum{x_i^2}}=\) 0.0048;

  • 回归系数的样本标准差: \(S_{\hat{\beta}_1} =\) 0.8746; \(S_{\hat{\beta}_2} =\) 0.0696。

给定 \(\alpha=0.05,\quad (1-\alpha) 100 \%=95 \%\) ,我们可以查t分布表得到理论参照值: \(t_{\alpha / 2}(n-2)=t_{0.05 / 2}(11)=\) 2.2010

(案例)回归系数的区间估计

下面我们进一步计算回归系数的置信区间:

那么,截距参数 \(\beta_1\) 的95%置信区间为:

\[ \begin{aligned} \hat{\beta}_{1} - t_{\alpha / 2} \cdot S_{\hat{\beta}_{1}} \quad \leq & \beta_1 \leq \quad \hat{\beta}_{1} + t_{\alpha / 2} \cdot S_{\hat{\beta}_{1}} \\ -0.0145 - 2.201 \cdot 0.8746 \quad \leq & \beta_1 \leq \quad -0.0145 + 2.201 \cdot 0.8746 \\ -1.9395 \quad \leq & \beta_1 \leq \quad 1.9106 \\ \end{aligned} \]

那么,斜率参数 \(\beta_2\) 的95%置信区间为:

\[ \begin{aligned} \hat{\beta}_{2} - t_{\alpha / 2} \cdot S_{\hat{\beta}_{2}} \quad \leq & \beta_2 \leq \quad \hat{\beta}_{2} + t_{\alpha / 2} \cdot S_{\hat{\beta}_{2}} \\ 0.7241 - 2.201 \cdot 0.0696 \quad \leq & \beta_2 \leq \quad 0.7241 + 2.201 \cdot 0.0696 \\ 0.5709 \quad \leq & \beta_2 \leq \quad 0.8772 \\ \end{aligned} \]
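利用文中的系数估计值、样本标准差与 t 临界值,可以用几行 Python 复核置信区间(因中间量经过四舍五入,末位数字可能与文中结果略有出入):

```python
t_crit = 2.2010            # t_{alpha/2}(11),文中查表值
b1, S_b1 = -0.0145, 0.8746 # 截距估计值及其样本标准差
b2, S_b2 = 0.7241, 0.0696  # 斜率估计值及其样本标准差

ci_b1 = (b1 - t_crit * S_b1, b1 + t_crit * S_b1)
ci_b2 = (b2 - t_crit * S_b2, b2 + t_crit * S_b2)
print([round(v, 4) for v in ci_b1])
print([round(v, 4) for v in ci_b2])
```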

(案例)随机干扰项方差的区间估计

  • 给定 \(\alpha=0.05,\quad (1-\alpha) 100 \%=95 \%\)

  • 查卡方分布表可知:

  • \(\chi^2_{\alpha / 2}(n-2)=\chi^2_{0.05 / 2}(11)=\chi^2_{0.025}(11)=\) 3.8157

  • \(\chi^2_{1-\alpha / 2}(n-2)=\chi^2_{1-0.05 / 2}(11)=\chi^2_{0.975}(11)=\) 21.9200

我们之前已算出回归误差方差 \(\hat{\sigma}^2=\frac{\sum{e_i^2}}{n-2}=\) 0.8812 。因此可以算出 \(\sigma^2\) 的95%置信区间为:

\[ \begin{aligned} (n-2) \frac{\hat{\sigma}^{2}}{\chi_{1-\alpha / 2}^{2}} \leq \sigma^{2} \leq(n-2) \frac{\hat{\sigma}^{2}}{\chi_{\alpha / 2}^{2}}\\ 11 \times \frac{0.8812}{21.92} \leq \sigma^{2} \leq 11 \times \frac{0.8812}{3.8157}\\ 0.4422 \leq \sigma^{2} \leq 2.5403\\ \end{aligned} \]
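利用文中查得的卡方临界值,可以复核 \(\sigma^2\) 的置信区间:

```python
n          = 13
sigma2_hat = 0.8812     # 回归误差方差
chi2_lo    = 3.8157     # chi^2_{0.025}(11),文中查表值
chi2_hi    = 21.9200    # chi^2_{0.975}(11),文中查表值

lower = (n - 2) * sigma2_hat / chi2_hi
upper = (n - 2) * sigma2_hat / chi2_lo
print(round(lower, 4), round(upper, 4))
```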

本节结束