Understanding omitted confounders, endogeneity, omitted variable bias, and related concepts

Initial thoughts

Estimating causal relationships from data is one of the fundamental endeavors of researchers. Ideally, we could conduct a controlled experiment to estimate causal relations. However, conducting a controlled experiment may be infeasible. For example, education researchers cannot randomize educational attainment, so they must learn from observational data.

In the absence of experimental data, we construct models to capture the relevant features of the causal relationship we have an interest in, using observational data. Models are successful if the features we did not include can be ignored without affecting our ability to ascertain the causal relationship we are interested in. Sometimes, however, ignoring some features of reality results in models that yield relationships that cannot be interpreted causally. In a regression framework, depending on our discipline or our research question, we give this phenomenon different names: endogeneity, omitted confounders, omitted variable bias, simultaneity bias, selection bias, etc.

Below I show how we can understand many of these problems in a unified regression framework and use simulated data to illustrate how they affect estimation and inference.

Framework

The following statements allow us to obtain a causal relationship in a regression framework.

\begin{eqnarray*}
y &=& g\left(X\right) + \varepsilon \\
E\left(\varepsilon|X\right) &=& 0
\end{eqnarray*}

In the expression above, \(y\) is the outcome vector of interest, \(X\) is a matrix of covariates, \(\varepsilon\) is a vector of unobservables, and \(g\left(X\right)\) is a vector-valued function. The statement \(E\left(\varepsilon|X\right) = 0\) implies that once we account for all the information in the covariates, what we did not include in our model, \(\varepsilon\), does not give us any information, on average. It also implies that, on average, we can infer the causal relationship between our outcome of interest and our covariates. In other words, it implies that

\begin{equation*}
E\left(y|X\right) = g\left(X\right)
\end{equation*}
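
To see why, take the conditional expectation of the first statement given \(X\) and apply the second:

\begin{eqnarray*}
E\left(y|X\right) &=& E\left(g\left(X\right) + \varepsilon|X\right) \\
&=& g\left(X\right) + E\left(\varepsilon|X\right) \\
&=& g\left(X\right)
\end{eqnarray*}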

The opposite occurs when

\begin{eqnarray*}
y &=& g\left(X\right) + \varepsilon \\
E\left(\varepsilon|X\right) &\neq& 0
\end{eqnarray*}

The expression \(E\left(\varepsilon|X\right) \neq 0\) implies that it does not suffice to control for the covariates \(X\) to obtain a causal relationship because the unobservables are not negligible when we incorporate the information of the covariates in our model.

Below I present three examples that fall into this framework. In the examples below, \(g\left(X\right)\) is linear, but the framework extends beyond linearity.

Example 1 (omitted variable bias and confounders).
The true model is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(\varepsilon| X_1, X_2\right)&=& 0
\end{eqnarray*}
However, the researcher does not include the covariate matrix $X_2$ in the model and believes that the relationship between the covariates and the outcome is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + \eta \\
E\left(\eta|X_1\right)&=& 0
\end{eqnarray*}

If $E\left(\eta|X_1\right)= 0$, the researcher will get correct inference about $\beta_1$ from linear regression. However, $E\left(\eta|X_1\right)= 0$ will only happen if $X_2$ is irrelevant once we incorporate the information of $X_1$. In other words, this happens if $E\left(X_2|X_1\right)=0$. To see this, we write

\begin{eqnarray*}
E\left(\eta|X_1\right)&=& E\left(X_2\beta_2 + \varepsilon| X_1\right) \\
&=& E\left(X_2|X_1\right)\beta_2 + E\left(\varepsilon| X_1\right) \\
&=& E\left(X_2|X_1\right)\beta_2
\end{eqnarray*}

If $E\left(\eta|X_1\right) \neq 0$, we have omitted variable bias, which in this case comes from the relationship between the included and omitted variable, that is, $E\left(X_2|X_1\right)$. Depending on your discipline, you would also refer to $X_2$ as an omitted confounder.
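
As a sketch of what the bias looks like, suppose additionally that the conditional mean of the omitted variables is linear in the included ones, $E\left(X_2|X_1\right) = X_1\delta$. The matrix $\delta$ is notation introduced here for illustration only, and the linearity is an assumption rather than part of the model above. Then

\begin{eqnarray*}
E\left(y|X_1\right) &=& X_1\beta_1 + E\left(X_2|X_1\right)\beta_2 \\
&=& X_1\left(\beta_1 + \delta\beta_2\right)
\end{eqnarray*}

Under this assumption, a regression of $y$ on $X_1$ alone estimates $\beta_1 + \delta\beta_2$ rather than $\beta_1$, so the bias is $\delta\beta_2$.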

Below I simulate data that exemplify omitted variable bias.

    clear
    capture set seed 111
    quietly set obs 20000
    local rho = .5

    // Generating correlated regressors
    generate x1 = rnormal()
    generate x2 = `rho'*x1 + rnormal()

    // Generating Model

    quietly generate y   = 1 + x1 - x2 + rnormal()

I first set the parameter rho, which governs how strongly the second regressor depends on the first. I then generate the correlated regressors x1 and x2 and, finally, the outcome variable y. Below I estimate the model excluding one of the regressors.

The estimated coefficient is .495, but we know that the true value is 1. Also, our confidence interval suggests that the true value is somewhere between .476 and .515. Estimation and inference are misleading.
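
A minimal sketch of the comparison, assuming the estimates above come from a regression of y on x1 alone (the exact command is not shown in this text). The block repeats the data-generating step so that it runs on its own, then fits the model without and with x2. Because x2 equals .5 times x1 plus noise and the coefficient on x2 is -1, the coefficient on x1 in the short regression should be close to 1 + (-1)(.5) = .5, which agrees with the .495 reported above. In the long regression it should be close to 1.

```stata
// Regenerate the simulated data (same design as above)
clear
set seed 111
quietly set obs 20000
local rho = .5
generate x1 = rnormal()
generate x2 = `rho'*x1 + rnormal()
generate y  = 1 + x1 - x2 + rnormal()

// Short regression: x2 is omitted, so the x1 coefficient
// absorbs part of the effect of x2
regress y x1

// Long regression: x2 is included, so the x1 coefficient
// should be close to its true value of 1
regress y x1 x2
```

Run it as a do-file, because the local macro rho only exists while the do-file is executing.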

Example 2 (endogeneity in a projection model).
The projection model gives us correct inference if
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(X_j'\varepsilon \right)&=& 0 \quad \text{for} \quad j \in \{1,2\}
\end{eqnarray*}

If \(E\left(X_j'\varepsilon \right) \neq 0\), we say that the covariates \(X_j\) are endogenous. By the law of iterated expectations, \(E\left(\varepsilon|X_j\right) = 0\) implies \(E\left(X_j'\varepsilon \right) = 0\). Thus, by the contrapositive, if \(E\left(X_j'\varepsilon \right) \neq 0\), we have that \(E\left(\varepsilon|X_j\right) \neq 0\). Say \(X_1\) is endogenous; then, we can write the model under endogeneity within our framework as

\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(\varepsilon| X_1 \right)&\neq& 0 \\
E\left(\varepsilon| X_2 \right)&=& 0
\end{eqnarray*}
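
The step from \(E\left(\varepsilon|X_j\right) = 0\) to \(E\left(X_j'\varepsilon \right) = 0\) invoked earlier in this example can be written out by conditioning on \(X_j\):

\begin{eqnarray*}
E\left(X_j'\varepsilon\right) &=& E\left[E\left(X_j'\varepsilon|X_j\right)\right] \\
&=& E\left[X_j'E\left(\varepsilon|X_j\right)\right] \\
&=& 0 \quad \text{if} \quad E\left(\varepsilon|X_j\right) = 0
\end{eqnarray*}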

Below I simulate data that exemplify endogeneity:

    clear
    capture set seed 111
    quietly set obs 20000

    // Generating Endogenous Components

    matrix C  = (1, .5 \ .5, 1)
    quietly drawnorm e v, corr(C)


    // Generating Regressors

    generate x1  = rnormal()
    generate x2  = v

    // Generating Model

    generate y   = 1 + x1 - x2 + e

I first generate the correlated unobservables e and v, which have correlation .5. I then set the covariate x2 equal to v, so x2 is correlated with the error term e, and generate the outcome variable y. The covariate x2 is endogenous, and its estimated coefficient should be far from the true value (in this case, -1). Below we observe exactly this:

The estimated coefficient is \(-0.498\), and our confidence interval suggests that the true value is somewhere between \(-0.510\) and \(-0.486\). Estimation and inference are misleading.
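
A minimal sketch that reproduces the setup, assuming the estimates above come from a regression of y on x1 and x2 (the exact command is not shown in this text). In this design x1 is independent of both x2 and e, and x2 has unit variance and covariance .5 with e, so the coefficient on x2 should be close to \(-1 + .5/1 = -.5\), in line with the \(-0.498\) reported above. The last command inspects the simulated error term directly, which is only possible because the data are simulated.

```stata
// Regenerate the simulated data (same design as above)
clear
set seed 111
quietly set obs 20000
matrix C = (1, .5 \ .5, 1)
quietly drawnorm e v, corr(C)
generate x1 = rnormal()
generate x2 = v
generate y  = 1 + x1 - x2 + e

// Linear regression: the x2 coefficient is pulled away from -1
// because x2 is correlated with the error term e
regress y x1 x2

// Sample correlation between the endogenous regressor and the error;
// by construction it should be near .5
correlate x2 e
```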

Example 3 (selection bias).
In this case, we only observe our outcome of interest for a subset of the population. The subset of the population we observe depends on a rule. For instance, we observe \(y\) if \(y_2\geq 0\). In this case, the conditional expectation of our outcome of interest is given by

\begin{equation*}
E\left(y|X_1, y_2 \geq 0\right) = X_1\beta + E\left(\varepsilon|X_1, y_2 \geq 0 \right)
\end{equation*}

Selection bias arises if \(E\left(\varepsilon|X_1, y_2 \geq 0 \right) \neq 0\). This implies that the selection rule is related to the unobservables in our model. If we define \(X \equiv (X_1, y_2 \geq 0)\), we can rewrite the problem in terms of our general framework:

\begin{eqnarray*}
E\left(y|X\right) &=& X_1\beta + E\left(\varepsilon|X \right) \\
E\left(\varepsilon|X\right) &\neq & 0
\end{eqnarray*}
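
To make the nonzero term concrete, here is a sketch under assumptions that go beyond the framework above. Suppose the selection variable is \(y_2 = Z\gamma + v\), and \(\left(\varepsilon, v\right)\) are jointly normal with unit variances and correlation \(\rho\), independent of \(X_1\) and \(Z\). The symbols \(Z\), \(\gamma\), \(v\), and \(\rho\) are introduced here for illustration only. Writing \(\phi\) and \(\Phi\) for the standard normal density and distribution function, we get

\begin{eqnarray*}
E\left(\varepsilon|X_1, Z, y_2 \geq 0\right) &=& \rho E\left(v|v \geq -Z\gamma\right) \\
&=& \rho \frac{\phi\left(Z\gamma\right)}{\Phi\left(Z\gamma\right)}
\end{eqnarray*}

Under these assumptions, the term is nonzero whenever \(\rho \neq 0\), that is, whenever the unobservables of the outcome and of the selection rule are correlated.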

Below I simulate data that exemplify selection on unobservables:

    clear
    capture set seed 111
    quietly set obs 20000

    // Generating Endogenous Components

    matrix C    = (1, .8 \ .8, 1)
    quietly drawnorm e v, corr(C)

    // Generating exogenous variables

    generate x1 = rbeta(2,3)
    generate x2 = rbeta(2,3)
    generate x3 = rnormal()
    generate x4 = rchi2(1)

    // Generating outcome variables

    generate y1 =  x1 - x2 + e
    generate y2 =  2 + x3 - x4 + v
    replace  y1 = . if y2<=0

I first generate the correlated unobservables e and v. I then generate the exogenous covariates x1 through x4. Finally, I generate the two outcomes y1 and y2 and set y1 to missing whenever the selection rule fails, that is, whenever y2 is not positive. If we use linear regression, we obtain

As in the previous cases, the point estimates and confidence intervals lead us to incorrect conclusions.
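
One way to see where the problem comes from is to compare the mean of the unobservable e in the full sample with its mean among the selected observations. The sketch below does this and is not the regression referred to above. It repeats the data-generating step so that it runs on its own, and it relies on inspecting e directly, which is only possible because the data are simulated. In the full sample the mean of e should be near 0. Among the selected observations it should be positive, because e is positively correlated with v and selection favors large values of v.

```stata
// Regenerate the simulated data (same design as above)
clear
set seed 111
quietly set obs 20000
matrix C = (1, .8 \ .8, 1)
quietly drawnorm e v, corr(C)
generate x1 = rbeta(2,3)
generate x2 = rbeta(2,3)
generate x3 = rnormal()
generate x4 = rchi2(1)
generate y1 = x1 - x2 + e
generate y2 = 2 + x3 - x4 + v
replace  y1 = . if y2<=0

// Mean of the unobservable over all simulated observations
summarize e

// Mean of the unobservable among selected observations only
summarize e if !missing(y1)
```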

Concluding remarks

I have presented a general regression framework to understand many of the problems that do not allow us to interpret our results causally. I also illustrated the effects of these problems on our point estimates and confidence intervals using simulated data.
