TransWikia.com

Is there a mathematical proof for change being correlated with baseline value

Cross Validated Asked on February 2, 2021

It is shown in an answer here and in other places that the difference of two random variables is correlated with the baseline. Hence, the baseline should not be used as a predictor of change in regression equations. This can be checked with the R code below:

> N=200
> x1 <- rnorm(N, 50, 10)
> x2 <- rnorm(N, 50, 10)  
> change = x2 - x1
> summary(lm(change ~ x1))

Call:
lm(formula = change ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.3658  -8.5504  -0.3778   7.9728  27.5865 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 50.78524    3.67257   13.83 <0.0000000000000002 ***
x1          -1.03594    0.07241  -14.31 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.93 on 198 degrees of freedom
Multiple R-squared:  0.5083,    Adjusted R-squared:  0.5058 
F-statistic: 204.7 on 1 and 198 DF,  p-value: < 0.00000000000000022

The plot between x1 (baseline) and change shows an inverse relation:

[Figure: scatter plot of change against x1 (baseline)]

However, in many studies (especially biomedical ones), baseline is kept as a covariate with change as the outcome. This is because, intuitively, the change brought about by an effective intervention may or may not be related to the baseline level. Hence, baseline is kept in the regression equation.

I have the following questions in this regard:

  1. Is there any mathematical proof showing that changes (random or those caused by effective interventions) always correlate with baseline? Does it occur only in some circumstances or is it a universal phenomenon? Is distribution of data related to this?

  2. Also, does keeping baseline as one predictor of change affect the results for other predictors that do not interact with baseline? For example, in the regression equation change ~ baseline + age + gender, will the results for age and gender be invalid?

  3. Is there any way to correct for this effect, if there is a biological reason to think that change may be DIRECTLY related to baseline (quite common in biological systems)?

Thanks for your insight.

Edit: I probably should have labelled x1 and x2 as y1 and y2, since we are discussing the response.

Some links on this subject:

Difference between Repeated measures ANOVA, ANCOVA and Linear mixed effects model

Change Score or Regressor Variable Method – Should I regress $Y_1$ over $X$ and $Y_0$ or $(Y_1-Y_0)$ over $X$

What are the worst (commonly adopted) ideas/principles in statistics?

2 Answers

  1. Is there any mathematical proof showing that changes (random or those caused by effective interventions) always correlate with baseline? Does it occur only in some circumstances or is it a universal phenomenon? Is distribution of data related to this?

We are interested in the covariance of $X$ and $X-Y$ where $X$ and $Y$ may not be independent:

$$ \begin{align*} \text{Cov}(X,X-Y) &=\mathbb{E}[(X)(X-Y)]-\mathbb{E}[X]\mathbb{E}[X-Y] \\ &=\mathbb{E}[X^2-XY]-(\mathbb{E}[X])^2 + \mathbb{E}[X]\mathbb{E}[Y] \\ &=\mathbb{E}[X^2]-\mathbb{E}[XY]-(\mathbb{E}[X])^2 + \mathbb{E}[X]\mathbb{E}[Y] \\ &=\text{Var}(X)-\mathbb{E}[XY] + \mathbb{E}[X]\mathbb{E}[Y] \\ &=\text{Var}(X) - \text{Cov}(X,Y) \end{align*} $$

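So yes, this is a problem in general: the covariance is nonzero unless $\text{Cov}(X,Y) = \text{Var}(X)$, and when $X$ and $Y$ are independent it equals $\text{Var}(X) > 0$.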
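The identity can also be checked numerically. The sketch below is only an illustration under assumed, made-up parameters (the coefficient 0.6 and the standard deviations are not from the question); it simulates a follow-up `y` that is correlated with baseline `x` and compares both sides of the identity. Because sample covariance is bilinear, the two sides agree up to floating-point error, not just approximately.

```r
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 50, sd = 10)            # baseline
y <- 0.6 * x + rnorm(n, mean = 20, sd = 8)   # follow-up, correlated with baseline (assumed form)

lhs <- cov(x, x - y)         # sample covariance of baseline and (baseline - follow-up)
rhs <- var(x) - cov(x, y)    # right-hand side of the identity
all.equal(lhs, rhs)          # the identity also holds for sample moments

# With the question's definition of change (follow-up minus baseline)
# only the sign flips.
all.equal(cov(x, y - x), -lhs)
```

For the question's setup (independent measurements with equal variance), the same identity gives a correlation between baseline and change of $-1/\sqrt{2} \approx -0.71$, i.e. an $R^2$ of about 0.5, which is in line with the regression output in the question.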

  2. Also, does keeping baseline as one predictor of change affect the results for other predictors that do not interact with baseline? For example, in the regression equation change ~ baseline + age + gender, will the results for age and gender be invalid?

The whole analysis is invalid. The estimate for age is the expected association of age with change while keeping baseline constant. Maybe you can make sense of that, and maybe it does make sense, but you are fitting a model in which you invoke a spurious association (or distort an actual one), so don't do it.

  3. Is there any way to correct for this effect, if there is a biological reason to think that change may be DIRECTLY related to baseline (quite common in biological systems)?

Yes, this is very common, as you say. Fit a multilevel model (mixed-effects model) with 2 time points per participant (baseline and follow-up), coded as -1 and +1. If you want to allow for differential treatment effects, you can fit random slopes too.
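As a rough sketch of what such a model could look like in R, the example below reshapes simulated data into long format and fits a random-intercept model with `nlme::lme`. The choice of `nlme`, the simulated participant effect `u` and all numeric values are assumptions made for illustration; the sketch covers only the random-intercept case, not random slopes.

```r
library(nlme)  # recommended package that ships with standard R distributions

set.seed(2)
n  <- 100
u  <- rnorm(n, mean = 0, sd = 8)     # hypothetical participant-level effect
y1 <- 50 + u + rnorm(n, sd = 5)      # baseline measurement
y2 <- 53 + u + rnorm(n, sd = 5)      # follow-up measurement

# Long format: one row per participant per time point, time coded -1 / +1.
long <- data.frame(
  id   = factor(rep(seq_len(n), times = 2)),
  time = rep(c(-1, 1), each = n),
  y    = c(y1, y2)
)

# Random intercept per participant, fixed effect of time.
fit <- lme(y ~ time, random = ~ 1 | id, data = long)
fixef(fit)
```

With the -1/+1 coding and balanced data, the intercept is the overall mean of both measurements and the `time` coefficient is half the mean change, so baseline never appears on the right-hand side as a predictor of change.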

An alternative is Oldham's method, but that also has its drawbacks.

See Tu and Gilthorpe (2007), "Revisiting the relation between change and initial value: a review and evaluation", https://pubmed.ncbi.nlm.nih.gov/16526009
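Oldham's method correlates change with the average of the two measurements instead of with baseline. A minimal sketch, assuming the same independent, equal-variance setup as in the question (the sample size and seed are arbitrary):

```r
set.seed(3)
n  <- 2000
y1 <- rnorm(n, mean = 50, sd = 10)   # baseline
y2 <- rnorm(n, mean = 50, sd = 10)   # follow-up, independent of baseline, same variance

change <- y2 - y1
avg    <- (y1 + y2) / 2

r_baseline <- cor(change, y1)    # change vs. baseline: spuriously negative
r_oldham   <- cor(change, avg)   # change vs. mean of the two measurements

c(baseline = r_baseline, oldham = r_oldham)
```

Since the covariance of `change` and `avg` equals `(var(y2) - var(y1)) / 2`, the Oldham correlation is close to zero here, but it will move away from zero whenever the two variances differ for any reason, which is one way to see that the method has limitations of its own.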

Answered by Robert Long on February 2, 2021

Consider an agricultural experiment with yield as the response variable and fertilizers as the explanatory variables. In each field, one fertilizer (or none) is applied. Consider the following scenarios:

(1) There are three fertilizers, say n, p, k. For each of them we can include an effect in our linear model, and take our model as $$y_{ij} = \alpha_i + \varepsilon_{ij}.$$ Here $\alpha_i$ is interpreted as the effect of the $i$-th fertilizer.

(2) There are 2 fertilizers (say p, k), and on some of the fields no fertilizer has been applied (this is like a placebo in medical experiments). Here it is more intuitive to set the none-effect as the baseline and take the model as $$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$ where $\mu$ accounts for the none-effect, $\alpha_1 = 0$, and $\alpha_2, \alpha_3$ are interpreted as the "extra" effects of the fertilizers p, k.

Thus, when it seems appropriate to take a baseline, other effects are considered as the "extra" effect of that explanatory variable. Of course, we can take a baseline for scenario (1) as well: define $\mu$ as the overall effect and $\alpha_i$ as the extra effect of the $i$-th fertilizer.
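The two parametrizations can be compared directly in R. The sketch below uses made-up yields for scenario (2) (the group means 10, 13, 12 and the 20 fields per group are assumptions for illustration only) and fits the same data with and without a baseline level.

```r
set.seed(4)

# Hypothetical data: 20 fields per group, made-up mean yields.
fert  <- factor(rep(c("none", "p", "k"), each = 20))
fert  <- relevel(fert, ref = "none")   # use "none" as the baseline level
means <- c(none = 10, p = 13, k = 12)
yield <- unname(means[as.character(fert)]) + rnorm(60, sd = 1)

fit_baseline <- lm(yield ~ fert)       # mu + alpha_i, with alpha for "none" fixed at 0
fit_cellmean <- lm(yield ~ 0 + fert)   # one alpha_i per group, no baseline

coef(fit_baseline)   # intercept = mean of "none"; other terms = differences from it
coef(fit_cellmean)   # the three group means
```

Both fits describe the same group means; only the interpretation of the coefficients changes, which is the sense in which the other effects become "extra" effects relative to the baseline.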

In medical experiments, we sometimes come across a similar scenario. We set a baseline for the overall effect and define the coefficients as the "extra" effects. When we use such a baseline, we no longer assume that the marginal effects are independent. Rather, we assume that the overall effect and the extra effects are independent. Such modelling assumptions mainly come from field experience, not from a mathematical point of view.

For your example (mentioned in the comments below), where $y_1$ is the height at the beginning and $y_2$ is the height after 3 months, after applying fertilizer, we can indeed use $y_2 - y_1$ as our response and $y_1$ as our predictor. My point is that in most cases we would not assume $y_1$ and $y_2$ to be independent (that would be unrealistic, because $y_2$ is obtained by applying a fertilizer to a plant of height $y_1$). When $y_1$ and $y_2$ are independent, theory says that $y_2 - y_1$ and $y_1$ are negatively correlated. But that is not the case here. In fact, in many cases you will find that $y_2-y_1$ is positively correlated with $y_1$, indicating that the greater the initial height, the more the fertilizer increases it, i.e., the more effective it becomes.
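A small simulation can illustrate the positively correlated case. The growth rule below (follow-up height equal to 1.5 times the initial height plus noise) is purely a hypothetical assumption chosen so that taller plants gain more; it is not taken from the example in the comments.

```r
set.seed(5)
n  <- 500
y1 <- rnorm(n, mean = 50, sd = 10)            # height at the start
y2 <- 1.5 * y1 + rnorm(n, mean = 0, sd = 5)   # hypothetical: taller plants gain more

change <- y2 - y1
cor(change, y1)               # positive here, because cov(y1, y2) > var(y1)
coef(lm(change ~ y1))["y1"]   # slope estimates 1.5 - 1 = 0.5
```

In general the sign of the correlation between $y_2 - y_1$ and $y_1$ follows the sign of $\text{Cov}(y_1, y_2) - \text{Var}(y_1)$, so it is negative under independence and positive only when the follow-up covaries with baseline strongly enough.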

Answered by Aditya Ghosh on February 2, 2021
