
Endogenous controls in linear regression - Alternative approach?

Asked by sgtbp on Cross Validated, December 27, 2021

I have a cross-section of $x$, $y_1$, and $y_2$. These are individual level data used in labor economics. I have random variation in $x$ and I’m interested in the effect of $x$ on $y_1$. It is well established in earlier research that $x$ causally affects $y_2$ positively. Economic intuition says that $y_2$ will affect $y_1$ positively as well. Hence, $x$ has an effect on $y_1$ through $y_2$.

I believe estimating the regression $y_1 = a_1 + a_2 x + a_3 y_2$ is problematic, since $y_2$ is endogenous. What type of model(s) can I estimate to identify the "direct" effect of $x$ on $y_1$ (the $a_2$ parameter) while controlling for the effect $x$ has on $y_1$ through $y_2$?

2 Answers

You are interested in causal inference with linear models, hence with linear regression. In situations like this we have to deal with regression and causality together, which is a widespread and slippery problem. I tried to summarize it here: Under which assumptions a regression can be interpreted causally?

Following the approach I suggest there, we have to write down some structural causal equations that encode the causal assumptions. My questions in the comments were aimed at clarifying these. From what you said, it seems we have two structural equations:

$y_1 = \beta_1 y_2 + \epsilon_1$

$y_2 = \beta_2 x + \epsilon_2$

So there is no direct effect of $x$ on $y_1$; however, there is an indirect one. Indeed, substituting the second equation into the first, we see that $y_1 = \beta_1 \beta_2 x + \beta_1 \epsilon_2 + \epsilon_1 = \beta_3 x + \epsilon_3$,

where $\beta_3 = \beta_1 \beta_2$ represents the indirect (and total) effect of $x$ on $y_1$.

Now I add some further (causal) assumptions. In the initial two structural equations the structural errors are exogenous ($E[\epsilon_1 \mid y_2]=0$ and $E[\epsilon_2 \mid x]=0$) and independent of each other.

As a consequence, in the last structural equation the structural error $\epsilon_3$ is exogenous as well ($E[\epsilon_3 \mid x]=0$).

Then you can run the regression $y_1 = \theta_1 x + u_1$, and $\theta_1$ identifies $\beta_3$, which is what you are looking for.

Moreover, in this example, in the regression $y_1 = \theta_2 y_2 + u_2$ the coefficient $\theta_2$ identifies $\beta_1$.
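As a quick numerical check of this identification argument, here is a minimal simulation sketch in Python; the parameter values, variable names, and use of `numpy` are purely illustrative and not part of the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta1, beta2 = 0.5, 2.0            # illustrative values only

# Structural model: x -> y2 -> y1, no direct x -> y1 effect
x = rng.normal(size=n)
y2 = beta2 * x + rng.normal(size=n)        # y2 = beta2*x + eps2
y1 = beta1 * y2 + rng.normal(size=n)       # y1 = beta1*y2 + eps1

slope = lambda a, b: np.polyfit(a, b, 1)[0]   # simple OLS slope (with intercept)

theta1 = slope(x, y1)    # recovers beta3 = beta1*beta2, the total effect
theta2 = slope(y2, y1)   # recovers beta1

print(theta1, beta1 * beta2)   # both approximately 1.0
print(theta2, beta1)           # both approximately 0.5
```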

Modifying the model as suggested in the comments, we have two structural equations:

$y_1 = \beta_1 y_2 + \beta_2 x + \epsilon_1$

$y_2 = \beta_3 x + \epsilon_2$

Here $\beta_2$ is the direct effect of $x$ on $y_1$, which is what we are interested in. Moreover, there is an indirect effect too. Now we can see that

$y_1 = \beta_1 \beta_3 x + \beta_2 x + \beta_1 \epsilon_2 + \epsilon_1 = \beta_4 x + \epsilon_3$

where $\beta_4 = \beta_1 \beta_3 + \beta_2$ represents the total effect of $x$ on $y_1$

and $\epsilon_3 = \beta_1 \epsilon_2 + \epsilon_1$.

Now, as before, I add some further (causal) assumptions. In the initial two structural equations the structural errors are exogenous ($E[\epsilon_1 \mid y_2, x]=0$ and $E[\epsilon_2 \mid x]=0$) and independent of each other.

As a consequence, in the last structural equation the structural error $\epsilon_3$ is exogenous too ($E[\epsilon_3 \mid x]=0$).

Then you can run three useful regressions:

$y_1 = \theta_1 x + u_1$

$y_2 = \theta_2 x + u_2$

$y_1 = \theta_3 y_2 + \theta_4 x + u_3$

Here $\theta_1$ identifies $\beta_4$, $\theta_2$ identifies $\beta_3$, and $\theta_3$ and $\theta_4$ identify $\beta_1$ and $\beta_2$ (which is what you are looking for).

Moreover, from $\theta_1 - \theta_3 \theta_2$ (that is, $\beta_4 - \beta_1 \beta_3$) we identify $\beta_2$ again. So if the restriction $\theta_4 = \theta_1 - \theta_3 \theta_2$ does not hold, we have evidence against the SEM (the causal assumptions).
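To make this concrete, a minimal Python sketch that simulates the second SCM with illustrative parameter values and checks that the three regressions recover the structural coefficients, including the restriction $\theta_4 = \theta_1 - \theta_3 \theta_2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2, beta3 = 0.5, 1.0, 2.0   # illustrative values only

# Structural model with both a direct (x -> y1) and an indirect (x -> y2 -> y1) path
x = rng.normal(size=n)
y2 = beta3 * x + rng.normal(size=n)
y1 = beta1 * y2 + beta2 * x + rng.normal(size=n)

slope = lambda a, b: np.polyfit(a, b, 1)[0]      # OLS slope (with intercept)

theta1 = slope(x, y1)                            # total effect: beta1*beta3 + beta2
theta2 = slope(x, y2)                            # effect of x on y2: beta3
X = np.column_stack([np.ones(n), y2, x])
_, theta3, theta4 = np.linalg.lstsq(X, y1, rcond=None)[0]   # beta1 and beta2

print(theta1, beta1 * beta3 + beta2)             # both approximately 2.0
print(theta2, beta3)                             # both approximately 2.0
print(theta3, beta1, theta4, beta2)              # approx. 0.5, 0.5, 1.0, 1.0

# Over-identifying restriction: theta4 should equal theta1 - theta3*theta2
print(theta4, theta1 - theta3 * theta2)          # both approximately 1.0
```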

Moreover, we can note that, obviously, not all regressions are good. For example, if we run the regression

$y_1 = \theta_5 y_2 + u_4$

the coefficient $\theta_5$ does not identify any parameter of the SEM: $\theta_5$ is biased for $\beta_1$ ($x$ acts as an omitted/confounding variable).
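A short sketch of this omitted-variable bias, using the same illustrative data-generating process as above (regenerated here so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1, beta2, beta3 = 0.5, 1.0, 2.0   # same illustrative values as before

x = rng.normal(size=n)
y2 = beta3 * x + rng.normal(size=n)
y1 = beta1 * y2 + beta2 * x + rng.normal(size=n)

# Regressing y1 on y2 alone omits x, which drives both variables
theta5 = np.polyfit(y2, y1, 1)[0]
print(theta5, beta1)   # theta5 is roughly 0.9 here, well above beta1 = 0.5
```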

(I dropped the constant terms for simplicity.)

Finally, the initial regression you had in mind was fine: under these assumptions there is no endogeneity (though in other settings there could be). However, I suppose you had another approach to the problem in mind; I suggest this one.

Answered by markowitz on December 27, 2021

If $y_2$ can be observed, there is nothing wrong with your approach. You are interested in the effect of $x$ on $y_1$ and you control for the partial influence of $y_2$. Estimating

$$ y_1 = \alpha + \beta_1 x + \beta_2 y_2 + \varepsilon $$

will yield an unbiased estimate of the effect of $x$ on $y_1$ if everything else (captured by $\varepsilon$) does not affect $y_1$ for given values of $x$ and $y_2$. I might have misunderstood what you mean by the 'direct effect', but $\widehat{\beta}_1$ will give you the effect of $x$ on $y_1$ net of the impact from $y_2$.
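As a sketch of how this could be run in practice, assuming the data sit in a pandas DataFrame with hypothetical columns `x`, `y1`, and `y2` (the simulated data below are only placeholders for the actual cross-section):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the observed cross-section (placeholder values).
rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)
y2 = 2.0 * x + rng.normal(size=n)
y1 = 1.0 * x + 0.5 * y2 + rng.normal(size=n)
df = pd.DataFrame({"x": x, "y1": y1, "y2": y2})

# OLS of y1 on x and y2; the coefficient on x estimates the effect of x net of y2
fit = smf.ols("y1 ~ x + y2", data=df).fit()
print(fit.params[["x", "y2"]])   # roughly 1.0 and 0.5 for these toy data
```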

Answered by E. Sommer on December 27, 2021
