Summary on Instrumental Variables
Regression Endogeneity
So far in this class we’ve considered a “selection-on-observables” approach to identifying causal effects: the assumption that the treatment (or other regressors of interest) is conditionally uncorrelated with potential outcomes (or potential outcome trends, in a panel data setting) given some observed controls. This is a flexible framework for using regression to overcome “selection bias” or other threats to identification, as we’ve seen. But in many cases this sort of identification strategy may fail us: individuals may select into treatment on the basis of some unobservables (e.g. private information) that we can never hope to measure and control for in our regression. More generally, the economic model of interest may suffer from “omitted variables bias” by involving terms that cannot be included in a regression. 1
Broadly, the failure of regression-based identification is sometimes called endogeneity: a word economists appear to have made up for this situation. 2 In some cases this problem can be solved by instrumental variables (IVs), a statistical technique which economists appear to have also made up but that is now widely used across many disciplines. 3 The basic idea of IV is as follows: when the causal or otherwise “structural” relationship between some \(Y_{i}\) and some \(X_{i}\) is “endogenous,” we can use an “exogenous” \(Z_{i}\) that affects \(Y_{i}\) only through \(X_{i}\) to estimate the structural relationship. This definition, while compact, is very unclear without some more notation however…
Let’s (as usual) consider the returns-to-schooling example: \(Y_{i}\) denotes an individual’s adult earnings, \(X_{i}\) measures her completed schooling, and \(\varepsilon_{i}\) captures her (unobserved) ability or family characteristics. We posit a simple relationship of
\[ \begin{equation*} Y_{i}=\alpha+\beta X_{i}+\varepsilon_{i} \tag{1} \end{equation*} \]
with \(\beta\) being our usual returns-to-schooling parameter of interest. As written, equation (1) may look like a regression but of course we now know better: the model unobservable \(\varepsilon_{i}\) need not be uncorrelated with schooling \(X_{i}\). When \(\operatorname{Cov}\left(X_{i}, \varepsilon_{i}\right) \neq 0\) we cannot recover \(\beta\) by the regression of \(Y_{i}\) on \(X_{i}\); here we might say \(X_{i}\) is “endogenous,” perhaps because more advantaged people (with higher \(\varepsilon_{i}\) ) are more likely to select into high schooling levels \(X_{i}\).
To see how IV can address this endogeneity challenge, let’s suppose we have some \(Z_{i}\) that is randomly assigned across individuals but which affects schooling decisions. Concretely, let’s suppose we randomly give some students a scholarship to attend college and not others: here, \(Z_{i}=1\) if individual \(i\) is a scholarship winner and \(Z_{i}=0\) otherwise. The randomization of \(Z_{i}\) ensures it is uncorrelated with student ability or background characteristics; if it only affects earnings \(Y_{i}\) through \(X_{i}\) we say it is “excludable” from the model of interest (1) and that \(\operatorname{Cov}\left(Z_{i}, \varepsilon_{i}\right)=0\) . 4 In this case, we can use (1) to write
\[ \begin{align*} \operatorname{Cov}\left(Z_{i}, Y_{i}\right) & =\beta \operatorname{Cov}\left(Z_{i}, X_{i}\right)+\operatorname{Cov}\left(Z_{i}, \varepsilon_{i}\right) \\ & =\beta \operatorname{Cov}\left(Z_{i}, X_{i}\right) \tag{2} \end{align*} \]
Thus, provided \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\) we can identify the returns-to-schooling parameter \(\beta\) by the estimand \(\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Cov}\left(Z_{i}, X_{i}\right)=\beta\). This is the basic logic of IV; here \(Z_{i}\) is a valid instrument for \(X_{i}\) when \(\operatorname{Cov}\left(Z_{i}, \varepsilon_{i}\right)=0\), and a relevant instrument when \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\).
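To make the covariance-ratio logic in (2) concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption chosen for illustration (a true \(\beta\) of 0.10, normally distributed ability, and a made-up schooling equation); none of it comes from real data. It compares the regression slope of \(Y_{i}\) on \(X_{i}\) with the ratio \(\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Cov}\left(Z_{i}, X_{i}\right)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha, beta = 1.0, 0.10            # assumed structural parameters in (1)

eps = rng.normal(size=n)           # unobserved ability (the model residual)
z = rng.integers(0, 2, size=n)     # randomized scholarship, independent of eps
# Schooling responds to the scholarship AND to ability, so X is endogenous.
x = 12.0 + 2.0 * z + 1.5 * eps + rng.normal(size=n)
y = alpha + beta * x + eps         # equation (1)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


ols_slope = cov(x, y) / cov(x, x)  # regression of Y on X: picks up ability
iv_ratio = cov(z, y) / cov(z, x)   # sample analog of the IV estimand

print(f"OLS slope: {ols_slope:.3f}")
print(f"IV ratio:  {iv_ratio:.3f} (assumed true beta = {beta})")
```

In this design the regression slope should sit well above the assumed \(\beta\) because schooling is positively correlated with ability, while the IV ratio should land close to it.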
This simple example captures the core logic of IV, but there are many types of “endogeneity” that IV can solve. Each problem essentially boils down to finding some \(Z_{i}\) which is “valid” in the sense of being uncorrelated with a particular model unobservable and “relevant” in the sense of being correlated with a particular “endogenous variable” or treatment. Let’s walk through a few more examples to see exactly how broad this set of problems can be.
First, consider omitted variables bias (OVB): a problem you previously saw in Chapter 6. The OVB problem is one in which we are interested in a parameter \(\beta\) from the “long” regression of
\[ \begin{equation*} Y_{i}=\alpha+\beta X_{i}+\gamma W_{i}+v_{i} \tag{3} \end{equation*} \]
where \(\operatorname{Cov}\left(X_{i}, v_{i}\right)=\operatorname{Cov}\left(W_{i}, v_{i}\right)=0\). The issue is we do not observe \(W_{i}\), and when it is omitted from our regression we may obtain a biased view of \(\beta\). Indeed, the bivariate regression of \(Y_{i}\) on \(X_{i}\) gives
\[ \begin{align*} \frac{\operatorname{Cov}\left(X_{i}, Y_{i}\right)}{\operatorname{Var}\left(X_{i}\right)} & =\frac{\operatorname{Cov}\left(X_{i}, \alpha+\beta X_{i}+\gamma W_{i}+v_{i}\right)}{\operatorname{Var}\left(X_{i}\right)} \\ & =\beta+\gamma \underbrace{\frac{\operatorname{Cov}\left(X_{i}, W_{i}\right)}{\operatorname{Var}\left(X_{i}\right)}}_{\delta} \tag{4} \end{align*} \]
OVB here is the product of two terms: the “effect” of the omitted variable \(W_{i}\) on the outcome \(Y_{i}, \gamma\), and the coefficient from regressing the omitted variable on the included variable \(X_{i}, \delta\). If we knew the signs of \(\gamma\) and \(\delta\), we could sign OVB: if, for example, we knew \(X_{i}\) and \(W_{i}\) were positively correlated and that \(W_{i}\) conditionally positively correlates with \(Y_{i}\) given \(X_{i}\) then we would know both \(\gamma>0\) and \(\delta>0\), and hence that the bivariate regression overstates the parameter of interest \(\beta\). Moreover, if we can credibly argue that either \(\gamma\) or \(\delta\) is zero then we know there is no OVB: the bivariate regression identifies \(\beta\) even without observing \(W_{i}\). You will often see these sorts of arguments and discussions in papers and seminars when people think through the kinds of biases that may plague their regression estimates; they can be useful heuristics for determining whether one’s estimates are likely biased up or down.
In the OVB setting, a valid IV is one which is uncorrelated with the omitted variables: if \(\operatorname{Cov}\left(Z_{i}, W_{i}\right)= \operatorname{Cov}\left(Z_{i}, v_{i}\right)=0\) then \(\beta=\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Cov}\left(Z_{i}, X_{i}\right)\), provided \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\). The schooling/scholarship example above is basically a version of this more general setup.
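The OVB formula (4) and the IV fix can both be checked in a short simulation. The sketch below uses assumed parameter values and normal shocks (illustrative choices, not taken from any application): it compares the bivariate regression slope with \(\beta+\gamma \delta\), and then computes the IV ratio using a \(Z_{i}\) drawn independently of \(W_{i}\) and \(v_{i}\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
alpha, beta, gamma = 1.0, 0.5, 2.0     # assumed parameters in (3)

w = rng.normal(size=n)                  # omitted variable
z = rng.normal(size=n)                  # instrument, independent of w and v
x = 0.8 * w + z + rng.normal(size=n)    # regressor, correlated with w
v = rng.normal(size=n)
y = alpha + beta * x + gamma * w + v    # the "long" regression (3)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


short_slope = cov(x, y) / cov(x, x)     # bivariate regression of Y on X
delta = cov(x, w) / cov(x, x)           # regression of W on X
ovb_prediction = beta + gamma * delta   # right-hand side of (4)
iv_ratio = cov(z, y) / cov(z, x)        # IV estimand with a valid Z

print(f"short regression slope: {short_slope:.3f}")
print(f"beta + gamma * delta:   {ovb_prediction:.3f}")
print(f"IV ratio:               {iv_ratio:.3f} (assumed true beta = {beta})")
```

Since both \(\gamma\) and \(\delta\) are positive in this design, the short regression overstates \(\beta\), in line with the sign argument above.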
A second example is measurement error. Here suppose the relationship of interest \(Y_{i}=\alpha+\beta X_{i}^{*}+v_{i}\) is known to be unconfounded: i.e. we know \(\operatorname{Cov}\left(X_{i}^{*}, v_{i}\right)=0\) and that this equation is a regression, for some reason. However we do not observe the regressor of interest \(X_{i}^{*}\) but instead observe a noisy measure \(X_{i}=X_{i}^{*}+\eta_{i}\), where \(\operatorname{Cov}\left(X_{i}^{*}, \eta_{i}\right)=0\) and \(\operatorname{Cov}\left(\eta_{i}, v_{i}\right)=0\). In this case a regression of \(Y_{i}\) on \(X_{i}\) does not identify the parameter \(\beta\) :
\[ \begin{align*} \frac{\operatorname{Cov}\left(X_{i}, Y_{i}\right)}{\operatorname{Var}\left(X_{i}\right)} & =\frac{\operatorname{Cov}\left(X_{i}^{*}+\eta_{i}, \alpha+\beta X_{i}^{*}+v_{i}\right)}{\operatorname{Var}\left(X_{i}^{*}+\eta_{i}\right)} \\ & =\beta \frac{\operatorname{Var}\left(X_{i}^{*}\right)}{\operatorname{Var}\left(X_{i}^{*}\right)+\operatorname{Var}\left(\eta_{i}\right)} \tag{5} \end{align*} \]
Here we have what is sometimes called attenuation bias: the regression estimand is a scaled version of the parameter of interest \(\beta\), with a scaling factor of \(\lambda \equiv \operatorname{Var}\left(X_{i}^{*}\right) /\left(\operatorname{Var}\left(X_{i}^{*}\right)+\operatorname{Var}\left(\eta_{i}\right)\right)\) that is strictly between zero and one whenever \(\operatorname{Var}\left(\eta_{i}\right)>0\). In other words, when \(\beta\) is positive we will identify a smaller positive regression coefficient \(\beta \lambda<\beta\).
In the measurement error setting we can undo attenuation bias by knowing the “signal-to-noise ratio” \(\lambda\), which reduces to knowing the variance of \(X_{i}^{*}\) or the variance of \(\eta_{i}\) (since then we can solve out for \(\lambda\) from knowledge of \(\operatorname{Var}\left(X_{i}\right)\)). More directly, if we know \(\operatorname{Var}\left(X_{i}^{*}\right)\) we can directly estimate \(\operatorname{Cov}\left(X_{i}, Y_{i}\right) / \operatorname{Var}\left(X_{i}^{*}\right)=\beta\). We can also “bound” the degree of attenuation bias if we know something about the possible range of \(\operatorname{Var}\left(X_{i}^{*}\right)\).
One example of a valid instrument in the measurement error case is a different mismeasured \(Z_{i}=X_{i}^{*}+\xi_{i}\), satisfying \(\operatorname{Cov}\left(X_{i}^{*}, \xi_{i}\right)=\operatorname{Cov}\left(\eta_{i}, \xi_{i}\right)=\operatorname{Cov}\left(v_{i}, \xi_{i}\right)=0\). The trick here is that the measurement error in this instrument \(\xi_{i}\) is uncorrelated with the measurement error in the observed \(X_{i}=X_{i}^{*}+\eta_{i}\) (and with the outcome residual \(v_{i}\)), which might be true in some cases. Such an instrument is guaranteed to be relevant, since \(\operatorname{Cov}\left(Z_{i}, X_{i}\right)=\operatorname{Var}\left(X_{i}^{*}\right)>0\), and we again have \(\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Cov}\left(Z_{i}, X_{i}\right)=\beta\).
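A quick simulation can illustrate both the attenuation factor in (5) and the second-measurement instrument. The sketch below assumes \(\beta=2\), \(\operatorname{Var}\left(X_{i}^{*}\right)=1\), and measurement-error variances of 0.5, with all shocks drawn independently; these values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
alpha, beta = 0.0, 2.0                 # assumed parameters
var_star, var_eta = 1.0, 0.5           # assumed signal and noise variances

x_star = rng.normal(scale=np.sqrt(var_star), size=n)   # true regressor
eta = rng.normal(scale=np.sqrt(var_eta), size=n)       # error in X
xi = rng.normal(scale=np.sqrt(0.5), size=n)            # error in Z
v = rng.normal(size=n)

x = x_star + eta                       # observed noisy regressor
z = x_star + xi                        # second, independently noisy measure
y = alpha + beta * x_star + v


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


lam = var_star / (var_star + var_eta)  # attenuation factor from (5)
ols_slope = cov(x, y) / cov(x, x)      # should be close to beta * lam
iv_ratio = cov(z, y) / cov(z, x)       # should be close to beta

print(f"beta * lambda: {beta * lam:.3f}, OLS slope: {ols_slope:.3f}")
print(f"IV ratio:      {iv_ratio:.3f} (assumed true beta = {beta})")
```

The regression slope shrinks toward zero by the factor \(\lambda=2/3\) in this design, while instrumenting one noisy measure with the other recovers \(\beta\) because the two errors are independent.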
A final example of regression endogeneity is simultaneity, which has a long history with IV. The classic example of simultaneity is supply and demand: suppose we are interested in a demand elasticity \(\beta_{D}\) from the system
\[ \begin{align*} & \ln q=\alpha_{D}+\beta_{D} \ln p+v_{i} \tag{6}\\ & \ln q=\alpha_{S}+\beta_{S} \ln p+\eta_{i} \tag{7} \end{align*} \]
where \(q\) denotes the quantity of some good and \(p\) denotes its price. Here equation (6) is a demand equation: \(\beta_{D}\) tells us how consumer demand increases with the offered price (as an elasticity). Equation (7) is the corresponding supply equation: \(\beta_{S}\) tells us how producer supply increases with the market price (again as an elasticity). We write \(v_{i}\) and \(\eta_{i}\) as demand and supply “shocks” arising across different markets \(i\), normalized to \(E\left[v_{i}\right]=E\left[\eta_{i}\right]=0\). We observe the equilibrium quantities and prices ( \(Q_{i}, P_{i}\) ) which solve the system given by (6) and (7) following the realization of these shocks.
It is easy to see how the simultaneous determination of quantities and prices from this system makes regressions of \(\ln Q_{i}\) on \(\ln P_{i}\) (or vice versa) difficult to interpret. Solving out for these variables, we have
\[ \begin{align*} \ln P_{i} & =\frac{\alpha_{S}-\alpha_{D}+\eta_{i}-v_{i}}{\beta_{D}-\beta_{S}} \tag{8}\\ \ln Q_{i} & =\frac{\beta_{D} \alpha_{S}-\beta_{S} \alpha_{D}+\beta_{D} \eta_{i}-\beta_{S} v_{i}}{\beta_{D}-\beta_{S}} \tag{9} \end{align*} \]
so long as \(\beta_{D} \neq \beta_{S}\). You can see from this that a regression of \(\ln Q_{i}\) on \(\ln P_{i}\), or vice versa, fails to identify either the demand or supply elasticity but instead gives some messy formula involving both \(\beta_{D}, \beta_{S}\), and the relative variances of the shocks \(\eta_{i}\) and \(v_{i}\). This is intuitive, as the equilibrium relationship between prices and quantities is not driven by the variation along either the demand or supply curve, in general, but is instead given by the intersection of these curves as the different shocks move around the equilibrium.
A valid instrument in the simultaneous supply-and-demand case is one that isolates variation in shocks to one of the sides of the market: to identify the demand elasticity \(\beta_{D}\) we require a shock to the supply side (i.e. \(\eta_{i}\) ) and to identify a supply elasticity \(\beta_{S}\) we require a shock to the demand side (i.e. \(v_{i}\) ). This is again intuitive, as when we have isolated variation that shifts around one of the two curves (e.g. supply) we are able to trace out the other curve (e.g. demand). Formally, if we have a \(Z_{i}\) with \(\operatorname{Cov}\left(Z_{i}, \eta_{i}\right) \neq 0\) (relevance) but \(\operatorname{Cov}\left(Z_{i}, v_{i}\right)=0\) (validity), then we can see from equations (8) and (9) that
\[ \begin{align*} \frac{\operatorname{Cov}\left(Z_{i}, \ln Q_{i}\right)}{\operatorname{Cov}\left(Z_{i}, \ln P_{i}\right)} & =\frac{\operatorname{Cov}\left(Z_{i}, \beta_{D} \alpha_{S}-\beta_{S} \alpha_{D}+\beta_{D} \eta_{i}-\beta_{S} v_{i}\right)}{\operatorname{Cov}\left(Z_{i}, \alpha_{S}-\alpha_{D}+\eta_{i}-v_{i}\right)} \\ & =\beta_{D} \tag{10} \end{align*} \]
and similarly for a “demand-side” instrument that identifies the supply elasticity \(\beta_{S}\) when \(\operatorname{Cov}\left(Z_{i}, v_{i}\right) \neq 0\) and \(\operatorname{Cov}\left(Z_{i}, \eta_{i}\right)=0\).
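The supply-and-demand logic can be sketched in a simulation that builds equilibrium prices and quantities from (8) and (9). The elasticities, intercepts, and the two shifters `z_supply` and `z_demand` below are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha_d, beta_d = 5.0, -1.0     # assumed demand intercept and elasticity
alpha_s, beta_s = 1.0, 0.5      # assumed supply intercept and elasticity

z_supply = rng.normal(size=n)   # hypothetical supply shifter (e.g. a cost shock)
z_demand = rng.normal(size=n)   # hypothetical demand shifter (e.g. a taste shock)
eta = z_supply + rng.normal(size=n)   # supply shock
v = z_demand + rng.normal(size=n)     # demand shock

# Equilibrium log price and quantity, equations (8) and (9)
ln_p = (alpha_s - alpha_d + eta - v) / (beta_d - beta_s)
ln_q = (beta_d * alpha_s - beta_s * alpha_d
        + beta_d * eta - beta_s * v) / (beta_d - beta_s)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


ols_slope = cov(ln_p, ln_q) / cov(ln_p, ln_p)          # mixes both curves
iv_demand = cov(z_supply, ln_q) / cov(z_supply, ln_p)  # supply shifter -> beta_D
iv_supply = cov(z_demand, ln_q) / cov(z_demand, ln_p)  # demand shifter -> beta_S

print(f"OLS slope:             {ols_slope:.3f}")
print(f"IV with supply shock:  {iv_demand:.3f} (assumed beta_D = {beta_d})")
print(f"IV with demand shock:  {iv_supply:.3f} (assumed beta_S = {beta_s})")
```

Here the regression slope is a variance-weighted mix of the two elasticities and so falls between them, while each instrument traces out the curve on the opposite side of the market.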
In practice, of course, regression endogeneity can manifest in many ways (including combinations of the above stylized examples); the general formulation of the problem is the existence of some “structural” relationship that regression fails to identify due to the correlation between a regressor of interest and an unobserved model residual. We will next formalize the IV solution to such problems in both the simple bivariate case discussed here, and the more general case with multiple endogenous regressors, multiple instruments, and controls. The basic logic of instrument validity and relevance will continue to hold in that case.
Instrument Validity and Relevance
Let’s first define some terms in the simple (bivariate) case, where we have one outcome \(Y_{i}\), one endogenous variable \(X_{i}\), and one instrument \(Z_{i}\). The IV estimand here is
\[ \begin{equation*} \beta=\frac{\operatorname{Cov}\left(Z_{i}, Y_{i}\right)}{\operatorname{Cov}\left(Z_{i}, X_{i}\right)}=\frac{\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Var}\left(Z_{i}\right)}{\operatorname{Cov}\left(Z_{i}, X_{i}\right) / \operatorname{Var}\left(Z_{i}\right)} \tag{11} \end{equation*} \]
In the second equality we’ve simply divided the numerator and denominator of the initial definition by \(\operatorname{Var}\left(Z_{i}\right)\). In doing so we can see that the IV estimand \(\beta\) can be written as the ratio of two regression estimands: \(\beta=\rho / \pi\) where
\[ \begin{align*} Y_{i} & =\kappa+\rho Z_{i}+\nu_{i} \tag{12}\\ X_{i} & =\mu+\pi Z_{i}+\eta_{i} \tag{13} \end{align*} \]
denote bivariate regressions of \(Y_{i}\) and \(X_{i}\), respectively, on the instrument. Here, by definition of regression, \(\operatorname{Cov}\left(Z_{i}, \nu_{i}\right)=\operatorname{Cov}\left(Z_{i}, \eta_{i}\right)=0\), with \(\rho=\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Var}\left(Z_{i}\right)\) and \(\pi=\operatorname{Cov}\left(Z_{i}, X_{i}\right) / \operatorname{Var}\left(Z_{i}\right)\). We sometimes call equation (12) the “reduced form” regression and equation (13) the “first stage” regression, for reasons that will become more clear shortly.
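The identity \(\beta=\rho / \pi\) also holds exactly in any sample, which is easy to verify. The sketch below fits the reduced form and first stage by least squares on simulated data (the data-generating values are arbitrary assumptions) and compares the ratio of slopes with the ratio of sample covariances.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument
x = 0.7 * z + u + rng.normal(size=n)           # endogenous variable
y = 2.0 + 1.5 * x + 2.0 * u + rng.normal(size=n)

# np.polyfit(a, b, 1) returns [slope, intercept] of the least-squares line.
rho_hat, kappa_hat = np.polyfit(z, y, 1)       # reduced form (12)
pi_hat, mu_hat = np.polyfit(z, x, 1)           # first stage (13)

iv_from_ratio = rho_hat / pi_hat
iv_from_cov = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"reduced form / first stage: {iv_from_ratio:.6f}")
print(f"Cov(Z,Y) / Cov(Z,X):        {iv_from_cov:.6f}")
```

The two numbers agree up to floating-point error because the \(\operatorname{Var}\left(Z_{i}\right)\) terms cancel, exactly as in (11).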
An alternative but equivalent way to define this IV starts with the “second stage”
\[ \begin{equation*} Y_{i}=\alpha+\beta X_{i}+U_{i} \tag{14} \end{equation*} \]
where \((\alpha, \beta)\) (and thus \(U_{i}\) ) are such that \(\operatorname{Cov}\left(Z_{i}, U_{i}\right)=0\). This parallels our definition of population regression as the parameters (and residual) satisfying \(\operatorname{Cov}\left(X_{i}, U_{i}\right)=0\); it is a proper definition so long as \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\), since then it can be shown there are unique \((\alpha, \beta)\) satisfying \(\operatorname{Cov}\left(Z_{i}, U_{i}\right)=0\) . 5 This relevance requirement parallels the “no perfect multicollinearity” condition that uniquely defines regression coefficients. As with regression, we can always define the IV estimand \(\beta\) when \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\); there is always a residual \(U_{i}\) satisfying \(\operatorname{Cov}\left(Z_{i}, U_{i}\right)=0\), just as there was always a regression residual satisfying \(\operatorname{Cov}\left(X_{i}, U_{i}\right)=0\). The aim of identification is to make sufficient assumptions on the model such that \(Z_{i}\) is uncorrelated with the model’s residual, in which case that residual coincides with this \(U_{i}\) and \(\beta\) identifies an interesting parameter.
The more general definition of IV starts with a \(J \times 1\) vector of endogenous variables \(\boldsymbol{X}_{i}\), an \(L \times 1\) vector of instruments \(\boldsymbol{Z}_{i}\), and a \(K \times 1\) vector of controls \(\boldsymbol{W}_{i}\) (which includes a constant). Suppose, from an economic model, we arrive at a linear relationship of
\[ \begin{equation*} Y_{i}=\boldsymbol{X}_{i}^{\prime} \boldsymbol{\beta}+e_{i} \tag{15} \end{equation*} \]
where the model parameter of interest is the coefficient vector \(\boldsymbol{\beta}\). Here \(e_{i}\) is a model residual (e.g. something related to “potential outcomes”), and need not be orthogonal to \(\boldsymbol{X}_{i}\). We do, however, think the vector of instruments is orthogonal to \(e_{i}\) after controlling for \(\boldsymbol{W}_{i}\): that is, we think the coefficient on \(\boldsymbol{Z}_{i}\) from the population regression of \(e_{i}\) on \(\underline{\boldsymbol{Z}}_{i}=\left[\boldsymbol{Z}_{i}^{\prime}, \boldsymbol{W}_{i}^{\prime}\right]^{\prime}\) is the \(L \times 1\) vector of zeros. The controls here may thus account for some observed confounding between \(\boldsymbol{Z}_{i}\) and \(e_{i}\), as with the kind of “selection-on-observables” stories told before (but now with \(\boldsymbol{Z}_{i}\) instead of the actual “treatment” \(\boldsymbol{X}_{i}\); more on this below). To accommodate them, let’s imagine projecting \(e_{i}\) on the control vector to obtain
\[ \begin{equation*} e_{i}=\boldsymbol{W}_{i}^{\prime} \boldsymbol{\gamma}+\varepsilon_{i}, \tag{16} \end{equation*} \]
where \(\varepsilon_{i}\) is by construction orthogonal to \(\boldsymbol{W}_{i}\). Combining (16) and (15), we have our second stage equation: \[ \begin{equation*} Y_{i}=\boldsymbol{X}_{i}^{\prime} \boldsymbol{\beta}+\boldsymbol{W}_{i}^{\prime} \boldsymbol{\gamma}+\varepsilon_{i} . \tag{17} \end{equation*} \]
How can we use instrument exogeneity in this case? Motivated by the simple case above, let’s consider the first stage regression of the endogenous variable on the instrument and controls. Here we have one such regression for each row of \(\boldsymbol{X}_{i}\); stacking these, we have
\[ \begin{equation*} \boldsymbol{X}_{i}=\boldsymbol{\Pi} \boldsymbol{Z}_{i}+\boldsymbol{\mu} \boldsymbol{W}_{i}+\boldsymbol{\eta}_{i}, \tag{18} \end{equation*} \]
where \(\boldsymbol{\Pi}\) is a \(J \times L\) matrix of coefficients from regressing each \(\boldsymbol{X}_{i j}\) on \(\boldsymbol{Z}_{i}\) while controlling for \(\boldsymbol{W}_{i}\). By construction, \(\boldsymbol{\eta}_{i}\) is orthogonal to both \(\boldsymbol{Z}_{i}\) and \(\boldsymbol{W}_{i}\). Substituting this series of first stage regressions into the second stage (17), we obtain
\[ \begin{align*} Y_{i} & =\left(\boldsymbol{\Pi} \boldsymbol{Z}_{i}+\boldsymbol{\mu} \boldsymbol{W}_{i}+\boldsymbol{\eta}_{i}\right)^{\prime} \boldsymbol{\beta}+\boldsymbol{W}_{i}^{\prime} \boldsymbol{\gamma}+\varepsilon_{i} \\ & =\left(\boldsymbol{\Pi} \boldsymbol{Z}_{i}\right)^{\prime} \boldsymbol{\beta}+\boldsymbol{W}_{i}^{\prime}\left(\boldsymbol{\mu}^{\prime} \boldsymbol{\beta}+\boldsymbol{\gamma}\right)+\left(\boldsymbol{\eta}_{i}^{\prime} \boldsymbol{\beta}+\varepsilon_{i}\right) . \tag{19} \end{align*} \]
From this we can see that \(\boldsymbol{\beta}\) is identified, by the population regression of \(Y_{i}\) on \(\boldsymbol{\Pi} \boldsymbol{Z}_{i}\) and \(\boldsymbol{W}_{i}\), under two conditions.
The first condition is the generalized IV validity assumption, that \(E\left[\boldsymbol{\Pi} \boldsymbol{Z}_{i} \varepsilon_{i}\right]=0\). This is enough to ensure equation (19) is a regression, since we know \(\boldsymbol{\eta}_{i}\) is by construction orthogonal to both \(\boldsymbol{Z}_{i}\) and \(\boldsymbol{W}_{i}\) (so the linear combination \(\boldsymbol{\eta}_{i}^{\prime} \boldsymbol{\beta}\) is orthogonal to both the linear combination \(\boldsymbol{\Pi} Z_{i}\) and to \(\boldsymbol{W}_{i}\) ) and we know that \(\varepsilon_{i}\) is orthogonal to \(\boldsymbol{W}_{i}\) by definition. A sufficient condition for IV validity is the conditional orthogonality of \(\boldsymbol{Z}_{i}\) and \(\varepsilon_{i}\) given \(\boldsymbol{W}_{i}\), by the Frisch-Waugh-Lovell theorem. For example, if \(\varepsilon_{i}\) denoted fixed student ability, \(\boldsymbol{Z}_{i}\) were a vector of randomized scholarship offers, and \(\boldsymbol{W}_{i}\) contained information on which scholarship lotteries a student was entered into, we may expect \(E\left[\boldsymbol{\Pi} \boldsymbol{Z}_{i} \varepsilon_{i}\right]=0\).
The second condition is the generalized IV relevance condition, which here resolves to a no perfect multicollinearity assumption on \(\left[\left(\boldsymbol{\Pi} \boldsymbol{Z}_{i}\right)^{\prime}, \boldsymbol{W}_{i}^{\prime}\right]^{\prime}\). This in turn resolves to no perfect multicollinearity in \(\left[\boldsymbol{Z}_{i}^{\prime}, \boldsymbol{W}_{i}^{\prime}\right]^{\prime}\) (which allows us to define the first stage regressions) and an assumption that \(\boldsymbol{\Pi}\) is of full row rank. This rank condition fails when, for example, there are fewer instruments than endogenous variables ( \(L<J\) ), or more generally when the instruments do not generate independent variation in all of the endogenous variables.
As above we’ve derived the IV validity and relevance condition by starting with a model satisfying them, but when the relevance condition holds it can be shown there is always a second stage residual satisfying the validity condition. Namely, so long as \(\boldsymbol{Z}_{i}\) and \(\boldsymbol{W}_{i}\) are not perfectly collinear we can always define the first stage regression (18). Let \(\underline{\boldsymbol{X}}_{i}=\left[\boldsymbol{X}_{i}^{\prime}, \boldsymbol{W}_{i}^{\prime}\right]^{\prime}\) collect the endogenous variables and controls, let \(\underline{\boldsymbol{Z}}_{i}=\left[\boldsymbol{Z}_{i}^{\prime}, \boldsymbol{W}_{i}^{\prime}\right]^{\prime}\) collect the instruments and controls, and let
\[ \begin{equation*} \underline{\boldsymbol{\Pi}}=\left[\begin{array}{cc} \boldsymbol{\Pi} & \boldsymbol{\mu} \\ \boldsymbol{0} & \boldsymbol{I} \end{array}\right] \tag{20} \end{equation*} \]
collect the first-stage coefficients. Then it can be shown the IV “moment condition”
\[ \boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i} U_{i}\right]=\boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i}\left(Y_{i}-\underline{\boldsymbol{X}}_{i}^{\prime} \underline{\boldsymbol{\beta}}\right)\right]=0, \]
which imposes orthogonality of the second-stage residual \(U_{i}=Y_{i}-\boldsymbol{X}_{i}^{\prime} \boldsymbol{\beta}-\boldsymbol{W}_{i}^{\prime} \boldsymbol{\gamma}\) with the first-stage fitted values \(\boldsymbol{\Pi} \boldsymbol{Z}_{i}+\boldsymbol{\mu} \boldsymbol{W}_{i}\) and the controls \(\boldsymbol{W}_{i}\) (equivalently, with \(\boldsymbol{\Pi} \boldsymbol{Z}_{i}\) and \(\boldsymbol{W}_{i}\)), has a unique solution
\[ \begin{equation*} \underline{\boldsymbol{\beta}}=\left[\boldsymbol{\beta}^{\prime}, \boldsymbol{\gamma}^{\prime}\right]^{\prime}=\boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i} \underline{\boldsymbol{X}}_{i}^{\prime}\right]^{-1} \boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i} Y_{i}\right] . \tag{21} \end{equation*} \]
Thus, we can always define the general IV estimand \(\boldsymbol{\beta}\) when relevance holds, just as we could define the “simple” IV estimand when \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\) or define the population regression of \(Y_{i}\) on \(X_{i}\) when \(\operatorname{Var}\left(X_{i}\right)>0\). As in these cases, the identification question is whether the statistical residual \(U_{i}\), which imposes instrument validity, coincides with the residual of a particular model for how the data are generated. If it does, then we know the coefficients of the IV regression coincide with the coefficients from that “structural” second stage equation. Furthermore, we can see that the IV estimand (21) is relatively straightforward to estimate, as it is a relatively simple function of second moments; more on that soon.
You shouldn’t be too surprised if all of this is sounding familiar; the link between IV and linear regression is tight because the latter is a special case of the former. Formally, when \(\boldsymbol{X}_{i}=\boldsymbol{Z}_{i}\), the first stage regression of \(\boldsymbol{X}_{i}\) on \(\boldsymbol{Z}_{i}\) fits perfectly; the first stage matrix is then \(\underline{\boldsymbol{\Pi}}=\boldsymbol{I}\) and the IV estimand
\[ \begin{aligned} \underline{\boldsymbol{\beta}} & =\boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i} \underline{\boldsymbol{X}}_{i}^{\prime}\right]^{-1} \boldsymbol{E}\left[\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i} Y_{i}\right] \\ & =\boldsymbol{E}\left[\underline{\boldsymbol{X}}_{i} \underline{\boldsymbol{X}}_{i}^{\prime}\right]^{-1} \boldsymbol{E}\left[\underline{\boldsymbol{X}}_{i} Y_{i}\right] \end{aligned} \]
is just population regression. The IV relevance condition here is satisfied just by the lack of perfect multicollinearity in \(\underline{\boldsymbol{Z}}_{i}=\underline{\boldsymbol{X}}_{i}\) and IV validity is simply \(\boldsymbol{E}\left[\underline{\boldsymbol{X}}_{i} \varepsilon_{i}\right]=0\). Working through this special case is useful for showing how IV allows some endogenous “slippage” between \(\boldsymbol{X}_{i}\) and \(\boldsymbol{Z}_{i}\); instead of regressing \(Y_{i}\) on \(\boldsymbol{X}_{i}\), we regress on the component of \(\boldsymbol{X}_{i}\) which is predicted by the exogenous instrument (the fitted values).
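To see the general estimand (21) and its regression special case in action, here is a sketch of the sample analog with \(J=2\) endogenous variables, \(L=3\) instruments, and \(K=2\) controls. The coefficient values, the first-stage matrices, and the single confounder `u` are all assumptions made for illustration. The first stage is fit by least squares, the fitted values play the role of \(\underline{\boldsymbol{\Pi}}\, \underline{\boldsymbol{Z}}_{i}\), and the estimate is compared with a literal regression of \(Y_{i}\) on those fitted values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, L = 100_000, 3
beta = np.array([1.0, -2.0])     # assumed coefficients on the endogenous variables
gamma = np.array([0.5, 1.0])     # assumed coefficients on the controls

W = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one control
Z = rng.normal(size=(n, L))                              # instruments
u = rng.normal(size=n)                                   # unobserved confounder
Pi = np.array([[1.0, 0.5, 0.0],
               [0.0, 0.5, 1.0]])                         # assumed J x L matrix
mu = np.array([[0.0, 0.5],
               [0.0, -0.3]])                             # assumed J x K matrix
eta = np.column_stack([u, -0.5 * u]) + rng.normal(size=(n, 2))
X = Z @ Pi.T + W @ mu.T + eta                            # first stage (18), row by row
eps = 2.0 * u + rng.normal(size=n)                       # correlated with X, not Z or W
y = X @ beta + W @ gamma + eps                           # second stage (17)

Z_u = np.column_stack([Z, W])    # "underlined" Z: instruments and controls
X_u = np.column_stack([X, W])    # "underlined" X: endogenous variables and controls

# Regress every column of X_u on Z_u; the fitted rows estimate (Pi_u Z_u_i)'.
first_stage, *_ = np.linalg.lstsq(Z_u, X_u, rcond=None)
fitted = Z_u @ first_stage

# Sample analog of (21)
beta_iv = np.linalg.solve(fitted.T @ X_u, fitted.T @ y)
# The same numbers from regressing y on the fitted values and controls
beta_2sls, *_ = np.linalg.lstsq(fitted, y, rcond=None)
# Special case Z_u = X_u: the formula collapses to ordinary regression
beta_ols = np.linalg.solve(X_u.T @ X_u, X_u.T @ y)

print("IV  [beta, gamma]:", np.round(beta_iv, 3))
print("OLS [beta, gamma]:", np.round(beta_ols, 3))
```

The moment-based estimate and the regression on fitted values coincide because the first-stage residuals are orthogonal to the fitted values in sample. The regression of \(Y_{i}\) on \(\underline{\boldsymbol{X}}_{i}\) itself, by contrast, is pulled away from the assumed \(\boldsymbol{\beta}\) by the confounder.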
Of course, since IV arises just from regression the Frisch-Waugh-Lovell theorem applies. We can, for example, think of regressing \(Y_{i}\) on the residuals from projecting \(\boldsymbol{\Pi} \boldsymbol{Z}_{i}\) on \(\boldsymbol{W}_{i}\) by the first part of the FWL. This will come in handy for analyzing some IV coefficients, as before.
It may also be handy to work with the generalized IV’s reduced form and first stage expressions:
\[ \begin{align*} Y_{i} & =\boldsymbol{Z}_{i}^{\prime} \boldsymbol{\rho}+\boldsymbol{W}_{i}^{\prime} \boldsymbol{\kappa}+\nu_{i} \tag{22}\\ \boldsymbol{X}_{i} & =\boldsymbol{\Pi} \boldsymbol{Z}_{i}+\boldsymbol{\mu} \boldsymbol{W}_{i}+\boldsymbol{\eta}_{i}, \tag{23} \end{align*} \]
where, per (19), we have \(\boldsymbol{\rho}=\boldsymbol{\Pi}^{\prime} \boldsymbol{\beta}, \boldsymbol{\kappa}=\boldsymbol{\mu}^{\prime} \boldsymbol{\beta}+\boldsymbol{\gamma}\), and \(\nu_{i}=\boldsymbol{\eta}_{i}^{\prime} \boldsymbol{\beta}+U_{i}\). Consider the case where \(L= \operatorname{dim}\left(\boldsymbol{Z}_{i}\right)=\operatorname{dim}\left(\boldsymbol{X}_{i}\right)=J\); we call this case “just-identified,” in that there are just as many instruments as endogenous variables. Here \(\boldsymbol{\Pi}\) is a square matrix, which is invertible (i.e. full rank) when the relevance condition holds. Thus we can define \(\boldsymbol{\beta}=\left(\boldsymbol{\Pi}^{\prime}\right)^{-1} \boldsymbol{\rho}\) in this case, generalizing how we defined IV as the reduced form \(\rho=\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Var}\left(Z_{i}\right)\) divided by the first stage \(\pi=\operatorname{Cov}\left(Z_{i}, X_{i}\right) / \operatorname{Var}\left(Z_{i}\right)\) in the simple (bivariate) case above, which was just-identified. The FWL also of course applies here, allowing us to study \(\boldsymbol{\Pi}\) and \(\boldsymbol{\rho}\) by first residualizing out the controls.
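In the just-identified case the relationship \(\boldsymbol{\rho}=\boldsymbol{\Pi}^{\prime} \boldsymbol{\beta}\) can be inverted directly. The sketch below, with two instruments, two endogenous variables, and only a constant as a control (all parameter values are assumptions for illustration), solves for \(\boldsymbol{\beta}\) from estimated reduced-form and first-stage coefficients and compares the result with the IV estimate that uses \(\underline{\boldsymbol{Z}}_{i}\) directly as instruments.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
beta = np.array([1.0, -2.0])           # assumed structural coefficients

W = np.ones((n, 1))                    # controls: just a constant
Z = rng.normal(size=(n, 2))            # two instruments
u = rng.normal(size=n)                 # unobserved confounder
Pi = np.array([[1.0, 0.5],
               [0.2, 1.0]])            # assumed square, invertible first stage
X = Z @ Pi.T + np.column_stack([u, -0.5 * u]) + rng.normal(size=(n, 2))
y = X @ beta + 0.5 + 2.0 * u + rng.normal(size=n)

Z_u = np.column_stack([Z, W])
X_u = np.column_stack([X, W])
L = Z.shape[1]

# Reduced form (22): coefficients on Z from regressing y on Z and W
reduced_form, *_ = np.linalg.lstsq(Z_u, y, rcond=None)
rho_hat = reduced_form[:L]
# First stage (23): the top L x J block of coefficients estimates Pi'
first_stage, *_ = np.linalg.lstsq(Z_u, X, rcond=None)
Pi_hat_T = first_stage[:L, :]

beta_from_ratio = np.linalg.solve(Pi_hat_T, rho_hat)         # solves rho = Pi' beta
beta_iv = np.linalg.solve(Z_u.T @ X_u, Z_u.T @ y)[:L]        # IV with Z_u as instruments

print("(Pi')^{-1} rho:", np.round(beta_from_ratio, 4))
print("direct IV:     ", np.round(beta_iv, 4))
```

The two estimates agree up to numerical error, which is the matrix version of dividing the reduced form by the first stage in the bivariate case.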
In the just-identified case, where \(\boldsymbol{\Pi}\) is invertible, the IV validity condition becomes equivalent to the orthogonality of the instrument vector with the residual: \(E\left[\boldsymbol{\Pi} \boldsymbol{Z}_{i} \varepsilon_{i}\right]=0\) if and only if \(\boldsymbol{\Pi}^{-1} E\left[\boldsymbol{\Pi} \boldsymbol{Z}_{i} \varepsilon_{i}\right]=\boldsymbol{\Pi}^{-1} 0\) or \(E\left[\boldsymbol{Z}_{i} \varepsilon_{i}\right]=0\). In the just-identified case any full-rank linear combination of \(\boldsymbol{Z}_{i}\) is valid and gives the same IV estimand. In the “overidentified” case of \(L=\operatorname{dim}\left(\boldsymbol{Z}_{i}\right)>\operatorname{dim}\left(\boldsymbol{X}_{i}\right)=J\), where we have more instruments than endogenous variables, this is no longer true: different linear combinations of the \(\boldsymbol{Z}_{i}\) may or may not be valid and will generally yield different IV estimands. If we assume the stronger validity condition holds, that \(E\left[\boldsymbol{Z}_{i} \varepsilon_{i}\right]=0\), then we can use any \(\boldsymbol{M} \boldsymbol{Z}_{i}\) as a set of \(J\) instruments, for any full-rank \(J \times L\) matrix \(\boldsymbol{M}\). More specifically, we can consider the class of IV estimands
\[ \begin{equation*} \underline{\boldsymbol{\beta}}=\boldsymbol{E}\left[\underline{\boldsymbol{M}}\, \underline{\boldsymbol{Z}}_{i} \underline{\boldsymbol{X}}_{i}^{\prime}\right]^{-1} \boldsymbol{E}\left[\underline{\boldsymbol{M}}\, \underline{\boldsymbol{Z}}_{i} Y_{i}\right] \tag{24} \end{equation*} \]
for any \(\underline{\boldsymbol{M}}\), not just \(\underline{\boldsymbol{\Pi}}\). This class of estimands is quite large, containing some well-studied IV estimands such as the Nagar (1962), k-class, or limited information maximum likelihood (LIML) procedures that you may encounter in future econometrics classes. In this course, however, we will focus on the \(\underline{\boldsymbol{M}}=\underline{\boldsymbol{\Pi}}\) case even when overidentified.
Where do Instruments Come From?
So far we’ve talked about instruments in the abstract, and from that perspective they sound pretty magical: what are these “exogenous” \(Z_{i}\) and how are they actually used in practice? Here we will walk through two examples, from Abdulkadiroglu et al. (2016) (on charter school effectiveness) and Angrist and Krueger (1991) (on, what else, the returns to schooling).
Some of the best candidates for instruments come from true experiments, in which \(Z_{i}\) is as-good-as-randomly assigned across observations \(i\). Abdulkadiroglu et al. (2016), for example, use the random assignment of offers to attend charter middle schools as an instrument for charter school enrollment. The idea is that when students apply to an “oversubscribed” charter, with more applicants than available seats, the school runs a simple lottery to determine who is eligible to attend. Those who receive offers can decline, and go somewhere else, while other students may find their way into the charter through later admission rounds. Thus, while the randomized offers are a strong predictor of charter enrollment there is still a considerable amount of “slippage” in terms of enrollment (both conditional on application and unconditionally, as most students do not apply to enroll in a charter school). Formally, Abdulkadiroglu et al. (2016) study the effects of charter enrollment \(X_{i} \in\{0,1\}\) on subsequent test score achievement \(Y_{i}\), instrumenting by a \(Z_{i} \in\{0,1\}\) that indicates student \(i\) got an offer to attend a charter on the night of the lottery. 6 We estimate this regression in the sample of charter applicants, controlling for strata \(\boldsymbol{W}_{i}\) indicating different lottery years and schools. The upshot is we estimate very large test score effects from charter enrollment, consistent with other papers in a recent literature on charter effectiveness.
A virtue of IVs like charter enrollment offers is that they are (conditionally) randomly assigned; we thus know for sure that we can estimate their “reduced form” effects on test scores \(Y_{i}\) as well as the “first stage” effect on charter enrollment \(X_{i}\). That is, we can estimate the average effect \(E\left[Y_{i 1}-Y_{i 0}\right]\) of getting a charter offer on test scores (in the IV context this is sometimes referred to as an “intent-to-treat” effect, the idea being that \(Z_{i}\) captures the randomized “intent” to receive the endogenous enrollment treatment \(X_{i}\)) as well as \(E\left[X_{i 1}-X_{i 0}\right]\), where the subscripts 1 and 0 here index potential test scores and potential enrollment with and without an offer. A key point to recognize, however, is that random assignment is not sufficient for such \(Z_{i}\) to be a valid instrument for \(X_{i}\). For this we also need an exclusion restriction, that \(Z_{i}\) only affects \(Y_{i}\) through \(X_{i}\). In the charter school case of Abdulkadiroglu et al. (2016) this seems fairly defensible: admission offers likely only affect later achievement via enrollment decisions, having no real effects other than giving a student access to attend a charter school. This sort of logic is often found in randomized controlled trials (RCTs) with “imperfect compliance,” where an offer to participate in a program is randomized but people can opt out or in. These days most researchers understand that instrumenting by offers can identify causal program effects despite such imperfect compliance, though economists were a driving force behind making this clear to different fields. 7
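A stylized simulation shows how the intent-to-treat effect and the first stage combine under imperfect compliance. The sketch below is not based on the charter data: it assumes a constant enrollment effect of 0.3, a hypothetical latent “taste” for enrolling that depends on unobserved background, and made-up thresholds that let some non-offered students enroll anyway. It also ignores the lottery strata \(\boldsymbol{W}_{i}\) for simplicity.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
effect = 0.3                            # assumed constant effect of enrollment

u = rng.normal(size=n)                  # unobserved student background
taste = u + rng.normal(size=n)          # hypothetical latent preference to enroll
z = rng.integers(0, 2, size=n)          # randomized offer

enroll_if_offered = taste > -0.5        # some offered students decline
enroll_if_not_offered = taste > 1.5     # a few get in through later rounds
x = np.where(z == 1, enroll_if_offered, enroll_if_not_offered).astype(float)
y = effect * x + 0.5 * u + rng.normal(scale=0.5, size=n)   # test scores

# Offers are randomized, so differences in means by offer status are causal.
itt = y[z == 1].mean() - y[z == 0].mean()           # reduced form
first_stage = x[z == 1].mean() - x[z == 0].mean()   # effect of offer on enrollment
iv = itt / first_stage

# Naive comparison of enrollees and non-enrollees, confounded by background
naive = y[x == 1].mean() - y[x == 0].mean()

print(f"ITT: {itt:.3f}, first stage: {first_stage:.3f}")
print(f"IV:  {iv:.3f} (assumed effect = {effect}), naive: {naive:.3f}")
```

Because the offer moves enrollment for only part of the sample, the intent-to-treat effect is smaller than the enrollment effect; dividing by the first stage rescales it. The naive comparison overstates the effect here since students with stronger backgrounds are more likely to enroll.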
Absent literal randomization of \(Z_{i}\), researchers may still credibly argue that it is “as-good-as-randomly” assigned (perhaps conditional on some \(\boldsymbol{W}_{i}\) ). The idea here is to appeal to a kind of “natural experiment” which generates \(Z_{i}\) in a way that is plausibly unrelated to the second-stage model of interest. Angrist and Krueger (1991) give a now-famous example of such a setting when estimating the returns to schooling in the early 20th century. They leverage two institutional features of this time: compulsory schooling laws, which typically required a student to stay in school until their 16th birthday, and the fact that most schools required students to enter school in the calendar year they turned six. Consequently, students born in different quarters who plan to drop out as soon as they are able will tend to have different completed years of schooling. A student born in January, for example, will start school at six years and eight months old and at her 16th birthday will have nine years of completed schooling. A student born in December, in contrast, will start school at five years and nine months old and at her 16th birthday will have completed ten years of schooling. Angrist and Krueger (1991) thus use the “natural experiment” of quarter-of-birth as an instrument for completed years of schooling, controlling for the year- and state-of-birth (to help with the instrument’s first-stage power).
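The arithmetic behind this first stage can be checked with a short sketch. It rests on two simplifying assumptions that are not spelled out in the text: the school year begins in September, and a student leaves school in the month of her 16th birthday. The function name `completed_years_at_16` is hypothetical.

```python
def completed_years_at_16(birth_month, school_start_month=9):
    """Full years of schooling completed by the 16th birthday.

    Assumes (for illustration only) that a child enters school in
    `school_start_month` of the calendar year in which she turns six.
    """
    # Age in months at school entry: six years, adjusted for how far the
    # birthday falls before (+) or after (-) the start of the school year.
    age_at_entry_months = 6 * 12 + (school_start_month - birth_month)
    months_in_school = 16 * 12 - age_at_entry_months
    return months_in_school // 12


for month, name in [(1, "January"), (12, "December")]:
    print(name, completed_years_at_16(month))
```

Under these assumptions the January-born student has nine completed years at 16 and the December-born student has ten, so birth timing shifts schooling for students who leave as soon as the law allows.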
One’s quarter-of-birth may appear as-good-as-randomly assigned with respect to the labor market conditions (and other factors) one faces in adulthood. Even though people do not time the conception of their children by a lottery, this “natural experiment” seems fairly plausible. Again, however, we must consider not only the independent assignment of this \(Z_{i}\) across individuals but also ponder the key exclusion restriction: does quarter-of-birth only affect adult earnings through completed schooling? More recent studies have, for example, found that being older in your class in grade school can have direct effects on both mental and physical development, which may conceivably violate the exclusion restriction. It turns out that the Angrist and Krueger (1991) study may also have suffered from a different problem, related to estimation instead of identification, as we will cover in the next Chapter. Still, it is a compelling story at first glance as well as an influential early example of our modern view of such “natural experiments.”
Both of these examples highlight a useful way of thinking about instruments, by separating a statistical assumption of (as-good-as) random assignment from the more model-based exclusion restriction. We can think of other ways to ensure random assignment by leveraging what we’ve seen in “reduced form” treatment effect estimation. For example we might tell a “selection-on-observables” story that makes a given instrument \(Z_{i}\) as-good-as-randomly assigned conditional on some \(\boldsymbol{W}_{i}\), even though the ultimate treatment \(X_{i}\) is not unconfounded. With panel data, we might use a difference-in-differences approach to argue that the interaction \(Z_{i} \times \text{Post}_{t}\) satisfies a parallel trends assumption, controlling for the \(Z_{i}\) and \(\text{Post}_{t}\) main effects. Given such arguments we still need to be able to credibly argue an exclusion restriction holds, in order to relate the (say) reduced-form difference-in-differences estimates of the effect of \(Z_{i}\) on \(Y_{i}\) to first-stage difference-in-differences estimates of the effect of \(Z_{i}\) on \(X_{i}\). Again, such arguments tend to be “model-based,” requiring us to rule out stories of other direct effects of the instruments on outcomes. Abdulkadiroglu et al. (2016) pursue such an approach in their study of non-lotteried “takeover” charters, which sits alongside their lottery analysis discussed above. See the course slides for an illustration of these two approaches.
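The panel version of this idea can be sketched with a two-period simulation. All of it is assumed for illustration and none of it comes from the takeover analysis in Abdulkadiroglu et al. (2016): the data-generating process, the names `u`, `period`, and `did`, and a constant effect of 2.0. The instrument group is not randomized (it differs in the unobservable `u`), but trends are parallel and the exclusion restriction holds by construction, so the ratio of the reduced-form difference-in-differences to the first-stage difference-in-differences should be close to the assumed effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.integers(0, 2, size=n)          # instrument group, NOT randomized
u = rng.normal(size=n) + 0.5 * z        # fixed unobservable, higher when z = 1
beta = 2.0                              # assumed constant treatment effect


def period(post):
    """Simulate treatment x and outcome y in period `post` (0 or 1)."""
    # Treatment rises by an extra 0.5 for the z = 1 group after Post.
    x = 0.2 * u + 0.1 * post + 0.5 * z * post + rng.normal(scale=0.5, size=n)
    # Common time trend of 0.3; z affects y only through x (exclusion).
    y = beta * x + u + 0.3 * post + rng.normal(size=n)
    return x, y


x0, y0 = period(0)
x1, y1 = period(1)


def did(after, before, group):
    """Difference-in-differences of a variable across the two groups."""
    change = after - before
    return change[group == 1].mean() - change[group == 0].mean()


reduced_form = did(y1, y0, z)
first_stage = did(x1, x0, z)
iv_did = reduced_form / first_stage

# Comparing post-period levels instead ignores the group difference in u.
levels = (y1[z == 1].mean() - y1[z == 0].mean()) / (
    x1[z == 1].mean() - x1[z == 0].mean()
)

print(f"DiD ratio:    {iv_did:.3f}  (assumed effect: {beta})")
print(f"levels ratio: {levels:.3f}")
```

Differencing removes the fixed group gap in `u`, which is why the ratio in changes recovers the assumed effect while the ratio in levels overshoots it.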
This discussion of where IVs come from, and the example of Abdulkadiroglu et al. (2016) in particular, highlights a general tradeoff between internal and external validity of observational (possibly IV-based) analyses. In many ways the “gold-standard” for estimating causal effects is a randomized treatment; since analyzing such an RCT requires minimal assumptions (besides the existence of potential outcomes) we sometimes say that it has high internal validity. In contrast, an observational study of the treatment’s effects which makes hard-to-swallow selection-on-observables assumptions may have low internal validity (in that it requires assumptions that are not guaranteed by virtue of randomization). But far from all treatments of interest can be or are randomized, and those that are can often only be deployed on selected individuals. In the Abdulkadiroglu et al. (2016) example, we have high internal validity for estimating the effects of charter school enrollment among the students who enter the admission lottery. We may worry about the external validity (i.e. generalizability) of such studies, especially when individuals self-select into the study population (by, e.g., applying to a charter school). To probe external validity, it is often helpful to turn to more observational studies (i.e. those that make a selection-on-observables argument or rely on difference-in-differences-type identification) on a more representative population. The point is that IV can help on both fronts, but not all IVs are created equal in this regard. The exclusion restriction can fail even when the instrument is randomized in a lottery, and parallel trends can be very credible even in a non-experimental setting.
As with our discussion of population regression, we’ll next turn to the question of how we estimate IV estimands from data. The key insight, as before, is that these \(\boldsymbol{\beta}\) are also relatively simple functions of second moments; we can thus consider their sample analogues to construct an estimator of \(\boldsymbol{\beta}\), and follow similar steps as with OLS to characterize its asymptotic behavior. There are, however, a few new practical considerations with IV that we didn’t have before. These are related to the fact that we now have some “slippage” between the exogenous \(\boldsymbol{Z}_{i}\) and endogenous \(\boldsymbol{X}_{i}\), and can be tricky to deal with in practice.
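As a sketch of what such a sample analogue looks like in the simplest case (one instrument, one endogenous regressor, and no controls), recall that the IV estimand is the ratio of the reduced-form coefficient to the first-stage coefficient, which equals \(\operatorname{Cov}\left(Z_{i}, Y_{i}\right) / \operatorname{Cov}\left(Z_{i}, X_{i}\right)\). Replacing each population covariance with its sample counterpart suggests the estimator below. The notation \(\hat{\beta}\), the sample size \(N\), and the bars for sample means are assumed here for illustration.

\[ \begin{equation*} \hat{\beta}=\frac{\frac{1}{N} \sum_{i=1}^{N}\left(Z_{i}-\bar{Z}\right)\left(Y_{i}-\bar{Y}\right)}{\frac{1}{N} \sum_{i=1}^{N}\left(Z_{i}-\bar{Z}\right)\left(X_{i}-\bar{X}\right)} \end{equation*} \]

The denominator is the sample covariance between the instrument and the endogenous regressor, so the estimator is only well behaved when the first stage is not close to zero.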
Footnotes
1. Here we are again using “bias” to mean “non-identification” (i.e., that the parameter of interest is not recovered by a particular regression estimand). This should not be confused with the statistical definition of bias: that an estimator is not, in expectation, equal to an estimand of interest. The fact that economists use the same word for these very different concepts is unfortunate, but hopefully not too confusing given the context.
2. The first use of the term “endogenous” in an economics journal appears to be this 1953 poem(!) by Frederick Waugh, of FWL fame: https://pbs.twimg.com/media/EVF1qLGUMAAjOte?format=png&name=small. Waugh may have gotten this term from the natural sciences: as early as the 1920s it was sometimes used in biology to refer to substances created by or put into an organism. These days “endogeneity” is used in the statistical sense across many fields, especially epidemiology.
3. The inventor of IV appears to be Philip G. Wright who, in 1928, devised it as a solution to the classic simultaneity of supply and demand (discussed below). For more on this history, see https://scholar.harvard.edu/files/stock/files/wr_5_w.pdf.
4. In general, the random assignment of such \(Z_{i}\) will not be enough for it to be excludable from such specifications. We might imagine, for example, that random scholarship offers make students more successful in school by allowing them to not work during college and earn better grades. In that case \(Z_{i}\) might affect earnings \(Y_{i}\) not only through college attendance but by an unmeasured channel of college GPA. Here we are abstracting from such concerns to introduce the IV concept simply.
5. Here \(\alpha=\kappa-\beta \mu\) and \(U_{i}=\nu_{i}-\beta \eta_{i}\), by substituting the reduced form and first stage equations into the second stage equation. The reduced form and first stage coefficients are unique provided \(\operatorname{Var}\left(Z_{i}\right) \neq 0\), which is implied by \(\operatorname{Cov}\left(Z_{i}, X_{i}\right) \neq 0\). Further, as shown above, \(\beta=\rho / \pi\).
6. As you’ll see in the paper, and course slides, we actually use two instruments in the main specifications of the paper: an “immediate offer” to enroll on lottery night, and a “waitlist offer” to enroll later. The latter comes from the fact that we know each student’s position on the school waiting list, which is randomized on lottery night. So we can define arbitrary cutoffs on this randomized wait list and use it as an instrument too. In practice we get very similar estimates with both instruments or just the “immediate offer” IV.
7. A famous result by Imbens and Angrist (1994) formalized this approach by showing that when \(Z_{i}\) and \(X_{i}\) are binary such IV regressions identify “local average treatment effects” (LATEs), defined as the average effect of \(X_{i}\) on \(Y_{i}\) among marginal individuals (“compliers”) who are induced to take the treatment by the randomized offer. Such an interpretation requires an additional “monotonicity” condition which says the randomized offer can only shift people into taking the treatment. We may have the opportunity to say more about LATEs and related parameters in the final lectures or TA sessions.