What is the correct way of adding bias terms in the residuals of the linear regression model?

Question

First, I fit a linear model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

Now I want to visualize $y$ after the effects of $x_1$ and $x_2$ have been removed (adjusted for). I can visualize the relationship between $y$ and $x_1$ or $x_2$ using only the residuals $\epsilon$. The problem is that I want to add the bias term back to the residuals. For example, I want to plot the adjusted $y$ as a boxplot against another independent variable (e.g. diagnostic group).

For now, I am adjusting $y$ for the effects of $x_1$ and $x_2$ as follows:

$y_a = y - \beta_1 (x_1 - \bar{x}_1) - \beta_2 (x_2 - \bar{x}_2) \qquad (1)$

Here, for a single data point, I define the effect of $x_1$ as the change in $y$ caused by the deviation of $x_1$ from its mean $\bar{x}_1$, and likewise for $x_2$.

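After a few algebraic manipulations (substituting the model for $y$ and cancelling the $\beta_1 x_1$ and $\beta_2 x_2$ terms):

$y_a = (\beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2) + \epsilon = \text{bias} + \text{residuals} \qquad (2)$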
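A quick numerical check of this algebra, as a minimal sketch: the data are simulated, and the true coefficients, sample size, and variable names are assumptions made only for illustration. It uses the fitted OLS coefficients in place of $\beta$ and confirms that equations (1) and (2) give the same adjusted values.

```python
import numpy as np

# Simulated data; all numbers here are illustrative assumptions.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(50, 10, n)
x2 = rng.normal(0, 1, n)
y = 3.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta
resid = y - X @ beta

# Equation (1): subtract the mean-centered covariate effects from y.
y_a1 = y - b1 * (x1 - x1.mean()) - b2 * (x2 - x2.mean())

# Equation (2): bias term plus residuals.
bias = b0 + b1 * x1.mean() + b2 * x2.mean()
y_a2 = bias + resid

print(np.allclose(y_a1, y_a2))
print(bias, y.mean())
```

Because OLS residuals sum to zero when the model includes an intercept, the bias term in this sketch coincides with the sample mean of $y$, so the adjusted values are just the residuals shifted back to the original scale of $y$.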

First, I am not 100% convinced that this technique is valid. However, this article https://surfer.nmr.mgh.harvard.edu/ftp/articles/buckner2004.pdf also uses it for covariate adjustment (see Equation 1 on Page 728).

Question1: Is this technique correct, and why or why not? Or, asking the same question in terms of equation (2): is it correct to add the bias term $(\beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2)$ to the residuals?

Let's assume the above adjustment technique is correct.
Now suppose $x_1$ is a categorical variable with more than two levels. How do I calculate the mean of $x_1$ ($\bar{x}_1$)?

Question2:
How do I calculate the mean of a categorical variable? Strictly speaking, it doesn't even make sense to calculate the mean, or any other numeric summary statistic, of a categorical variable. Is there a workaround for this?

Derrick Kaufman · Answer

I think I am confused by the way you are using the word bias. It seems like they adjusted the HCV for head size by fitting a linear regression model. So for people with above-average head size they artificially reduced HCV, and for people with below-average head size they increased it, in order to remove the variance caused by head size.

Then they used HCVadj as a factor to model dementia status.

The reason they adjust beforehand is that they want to compute Cohen's D, which uses the mean HCVs in the demented and non-demented populations (there is no room for head size or other external factors in that calculation). So now they have a Cohen's D adjusted for head size. I see no problem here, but I am no expert in Cohen's D.
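For reference, here is a minimal sketch of Cohen's D for two groups using the usual pooled-standard-deviation form. The function name `cohens_d` and the toy numbers are assumptions for illustration, not values or code from the article; in practice the two inputs would be the adjusted values for each group.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's D for two independent groups, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    # Pooled variance from the two unbiased sample variances.
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy values only: means 2 and 3, both sample variances equal to 1.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(d)
```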

If you have a categorical predictor, you can accomplish the same thing with unbalanced effect coding (see here: What is effect coding?). Include the 4-level class variable in the model using 4 dummy variables (x_21, x_22, ...) coded as shown on that page.

Fit the model y = B_1*(x_1 - X_bar) + B_2*x_21 + B_3*x_22 + B_4*x_23 + B_5*x_24.

Calculate the Ya's for each individual as in your equation (1) (without the intercept). Then use your Ya's to calculate your Cohen's D. If you are not calculating Cohen's D or something similar that requires two groups, then you don't need this method. Maybe you can find some other way that takes the other factors into account?
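As a rough sketch of the categorical case: this uses ordinary 0/1 dummy columns with a reference level and an intercept, not the unbalanced effect coding described above, so treat it as an alternative illustration. The data and the names `site` and `site_effect` are hypothetical. The mean of a dummy column is simply the proportion of observations at that level, so equation (1) can be applied column by column.

```python
import numpy as np

# Simulated data; the 4-level covariate "site" and all effects are assumptions.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(0, 1, n)
site = rng.integers(0, 4, n)
site_effect = np.array([0.0, 1.0, -0.5, 2.0])
y = 10 + 0.8 * x1 + site_effect[site] + rng.normal(0, 1, n)

# Dummy columns for levels 1..3; level 0 is the reference level.
D = (site[:, None] == np.arange(1, 4)).astype(float)
X = np.column_stack([np.ones(n), x1, D])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Equation (1) applied to every covariate column, dummies included.
# Each dummy column's mean is the sample proportion of that level.
C = X[:, 1:]
y_adj = y - (C - C.mean(axis=0)) @ beta[1:]

print(D.mean(axis=0))
print(np.allclose(y_adj, y.mean() + resid))
```

When every predictor in the model is adjusted this way, the result is the sample mean of y plus the residuals, and that does not depend on which coding scheme is used for the categorical variable, because the fitted values are the same under any full-rank coding.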
