Measurement error in one independent variable in OLS with multiple regression

Asked on Cross Validated, November 16, 2021

Suppose I regress (with OLS) $y$ on $x_1$ and $x_2$. Suppose I have an i.i.d. sample of size $n$, and that $x_1$ is observed with error while $y$ and $x_2$ are observed without error. What is the probability limit of the estimated coefficient on $x_1$?

Let us suppose for tractability that the measurement error in $x_1$ is "classical". That is, the measurement error is normally distributed with mean 0 and is uncorrelated with $x_2$ and with the error term.

3 Answers

Suppose your true design matrix is $X^*=\begin{bmatrix} x_1^{*} & x_2 \end{bmatrix}$, but you observe $x_1=x_1^*+v$.

Then the OLS coefficient on $x_1$ has the following probability limit:

$$\mathbf{plim}\ \hat\beta_{x_1|x_2}=\beta \left[1-\frac{\sigma^2_v}{\sigma^2_{x_1^*}\cdot(1-R^2_{x_1^*,x_2})+\sigma^2_v} \right]=\beta \left[1-\frac{1}{1+\frac{\sigma^2_{x_1^*}}{\sigma^2_v}\cdot(1-R^2_{x_1^*,x_2})} \right]$$

where $R^2_{x_1^*,x_2}$ denotes the $R^2$ from the auxiliary regression of $x_1^*$ on $x_2$.

This means the coefficient is still attenuated, and generally more so than in the single-regressor case: the bias gets worse as collinearity with $x_2$ increases.

Here $x_2$ can contain more than one variable measured without error, so this formula is pretty general. The coefficients on the variables measured without error will generally also be inconsistent, in a direction determined by $\Sigma_{X^*X^*}$.
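As a quick numerical check, here is a minimal Monte Carlo sketch in Python (not from the original answer; the parameter values, correlation, and variable names are illustrative assumptions) comparing the OLS coefficient on a large sample with the plim given by the formula above:

```python
# Illustrative sketch: classical measurement error in x1, with x2 as a control.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000          # large n, so the OLS estimate approximates the plim
beta = 1.0             # true coefficient on x1* (assumed value)
rho = 0.6              # correlation between x1* and x2 (assumed value)
sigma2_v = 0.5         # measurement-error variance (assumed value)

# Draw (x1*, x2) jointly normal with unit variances and correlation rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
x1_star, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
y = beta * x1_star + 0.5 * x2 + rng.standard_normal(n)

# Observe x1 with classical measurement error v.
x1 = x1_star + np.sqrt(sigma2_v) * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Theoretical plim: the R^2 from regressing x1* on x2 equals rho^2 here.
plim_theory = beta * (1 - sigma2_v / (1.0 * (1 - rho**2) + sigma2_v))
print(f"OLS estimate of beta_1: {beta_hat[1]:.4f}")
print(f"Formula's plim:         {plim_theory:.4f}")
```

With these assumed values the formula gives roughly $0.56$, and the simulated coefficient should land very close to it.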

You can find this formula (without proof, but surrounded by much auxiliary wisdom) as equation (5) in Bound, J., C. Brown and N. Mathiowetz (2001), "Measurement error in survey data", Handbook of Econometrics, edition 1, volume 5, chapter 59, pages 3705-3843.

They cite these two older papers:

  • Levi, M.D. (1973), "Errors in the variables bias in the presence of correctly measured variables", Econometrica 41:985-986.
  • Garber, S., and S. Klepper (1980), "Extending the classical normal errors-in-variables model", Econometrica 48:1541-1546.

Answered by dimitriy on November 16, 2021

The solution to this problem is in Wooldridge's "Introductory Econometrics" (Chapter 9, Section "Measurement Error in an Explanatory Variable", p. 320 in the 2012 version) and in Wooldridge's "Econometric Analysis of Cross Section and Panel Data" (Section 4.4.2, p. 73 in the 2002 version). Here is the takeaway.

Consider the multiple regression model with a single explanatory variable $x^*_K$ measured with error:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x^*_K + \nu$$

And with "classical" assumptions, mainly that $nu$ is uncorrelated to $x^*_K$ and $nu$ is uncorrelated to $x_K$.

The measurement error is $e_K = x_K - x^*_K$, with $\text{E}(e_K) = 0$. The classical assumptions imply that $\nu$ is uncorrelated with $e_K$.

We want to replace $x^*_K$ with $x_K$ and see how this affects the OLS estimators, depending on the assumptions about the relationship between the measurement error $e_K$ and both $x^*_K$ and $x_K$.

The first case, which is not the OP's case but is presented briefly for the sake of completeness, is when $\text{Cov}(e_K, x_K) = 0$. Here OLS using $x_K$ instead of $x^*_K$ provides consistent estimators, even though it inflates the error variance (and thus the variance of the estimators).

The case of interest is when $\text{Cov}(e_K, x^*_K) = 0$; it is called the "classical errors-in-variables" case in the econometric literature. Here:

$$\text{Cov}(e_K, x_K) = \text{E}(e_K x_K) = \text{E}(e_K x^*_K) + \text{E}(e^2_K) = \sigma^{2}_{e_K}$$

and:

$$\text{plim}(\hat{\beta}_K) = \beta_K \left( \frac{\sigma^{2}_{r^{*}_{K}}}{\sigma^{2}_{r^{*}_{K}} + \sigma^{2}_{e_K}} \right) = \beta_K A_K$$

where $r^*_K$ is the error in the linear projection:

$$ x^*_K = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \dots + \delta_{K-1} x_{K-1} + r^*_K $$

$A_K$ is always between 0 and 1 and is called the attenuation bias: if $\beta_K$ is positive (resp. negative), $\hat{\beta}_K$ will tend to underestimate (resp. overestimate) $\beta_K$; in both cases the estimate is pulled toward zero.

In the multiple regression, it is the variance of $x^*_K$ after netting out the effects of the other explanatory variables that determines the attenuation bias, and the bias is worse the more collinear $x^*_K$ is with the other variables.
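As a worked illustration (the numbers are chosen for convenience, not taken from the text): suppose $\sigma^{2}_{x^*_K} = 1$, $\sigma^{2}_{e_K} = 0.2$, and the other regressors explain half the variance of $x^*_K$, so that $\sigma^{2}_{r^*_K} = 0.5$. Then

$$A_K = \frac{0.5}{0.5 + 0.2} \approx 0.71,$$

whereas without the other regressors ($\sigma^{2}_{r^*_K} = \sigma^{2}_{x^*_K} = 1$) the factor would be $1/(1 + 0.2) \approx 0.83$: collinearity shrinks $\sigma^{2}_{r^*_K}$ and thus worsens the attenuation.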

Consider the case where $K=1$, i.e., the simple regression model where the only explanatory variable is measured with error. In this case:

$$\text{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{ \sigma^{2}_{x^*_1} }{\sigma^{2}_{x^*_1} + \sigma^{2}_{e_1}} \right)$$

The attenuation term, always between 0 and 1, becomes closer to 1 as $\sigma^{2}_{e_1}$ shrinks relative to $\sigma^{2}_{x^*_1}$. Note that in this special case, $r^*_1 = x^*_1$.

The $\text{plim}(\hat{\beta}_j)$ for $j \neq K$ is complicated to derive in this framework, except in the case where $x^*_K$ is uncorrelated with the other $x_j$; then $x_K$ is also uncorrelated with the other $x_j$, which leads to $\text{plim}(\hat{\beta}_j)=\beta_j$.

Answered by SoufianeK on November 16, 2021

In the situation you describe, the true model is:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$

Now, you can observe $y$ and $x_2$, but you cannot observe $x_1$. However, you can observe $z = x_1 + \epsilon$.

Moreover, we assume that $\rho(\epsilon,u)=0$.

So, if we consider the simplification where $\beta_2 = 0$, it is possible to show that the OLS estimator for $\beta_1$ behaves like

$\theta_1 = \beta_1 V[x_1]/(V[x_1] + V[\epsilon])$

Then the absolute value of $\theta_1$, in expectation and/or in the plim, is lower than that of $\beta_1$, so $\theta_1$ is biased and inconsistent for $\beta_1$. This kind of bias is known as attenuation bias. The more $V[\epsilon]$ increases, the more serious the problem becomes.
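For instance (illustrative numbers, not from the original answer): with $V[x_1] = 1$ and $V[\epsilon] = 1$, we get $\theta_1 = \beta_1 \cdot 1/(1+1) = \beta_1/2$, so half of the true coefficient is lost to attenuation.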

Now, for the multivariate case, matrix notation is usually used. In vector form we get $E[\theta] \neq \beta$ and/or $\text{plim}\ \theta \neq \beta$.
Note that even if only one variable is endogenous, due to measurement error or other problems, all parameters generally become biased. The direction of the bias for any $\theta_i$ depends on the correlations among the variables and on the signs of the relevant moments. Special cases exist: for example, if the variables are all orthogonal, the bias does not spread.

In your case, with two variables ($\beta_1$ and $\beta_2$ both different from $0$), you can estimate a regression like

$y = \theta_0 + \theta_1 z + \theta_2 x_2 + v$

Here $\theta_1$ suffers from attenuation bias (relative to $\beta_1$), but $\theta_2$ is also biased (for $\beta_2$). In the special case where $z$ and $x_2$ are orthogonal, the problem remains for $\theta_1$, but $\theta_2$ becomes unbiased and consistent.
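Here is a minimal simulation sketch in Python (not from the original answer; the sample size, coefficients, and correlations are illustrative assumptions) showing that the bias spreads to $\theta_2$ when $z$ and $x_2$ are correlated but not when they are orthogonal:

```python
# Illustrative sketch: bias spillover from a mismeasured regressor.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000  # large sample, so estimates approximate their plims

def fit(rho):
    """OLS of y on (1, z, x2), where z = x1 + noise and corr(x1, x2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    y = 1.0 * x1 + 1.0 * x2 + rng.standard_normal(n)   # beta_1 = beta_2 = 1
    z = x1 + rng.standard_normal(n)                    # V[eps] = 1
    X = np.column_stack([np.ones(n), z, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("correlated (rho = 0.6):", fit(0.6)[1:].round(3))  # both coefficients biased
print("orthogonal (rho = 0.0):", fit(0.0)[1:].round(3))  # only theta_1 attenuated
```

With $\rho = 0$, the estimate of $\theta_2$ recovers $\beta_2 = 1$, while $\theta_1$ still shrinks toward $\beta_1 \cdot V[x_1]/(V[x_1]+V[\epsilon]) = 0.5$.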

Answered by markowitz on November 16, 2021
