
What's the MSE of $\hat{Y}$ in ordinary least squares using bias-variance decomposition?

Asked on Cross Validated on November 2, 2021

Suppose I have the following model: $$Y = \mu + \epsilon = X\beta + \epsilon,$$ where $Y$ is $n \times 1$, $X$ is $n \times p$, $\beta$ is $p \times 1$, and $\epsilon$ is $n \times 1$. I assume that the errors $\epsilon$ are independent with mean $0$ and covariance matrix $\sigma^2 I$.

In OLS, the fitted values are $\hat{Y} = HY$, where $H = X(X^TX)^{-1}X^T$ is the $n \times n$ hat matrix. I want to find the MSE of $\hat{Y}$.
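For concreteness, here is a minimal NumPy sketch of this setup (the dimensions, seed, and $\sigma$ are arbitrary illustration values, not part of the model):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 50, 3, 2.0                    # arbitrary illustration values
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    mu = X @ beta                               # mu = X beta
    Y = mu + rng.normal(scale=sigma, size=n)    # Y = mu + epsilon

    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    Y_hat = H @ Y                               # fitted values

    # H is symmetric and idempotent (H @ H = H), which is what collapses var(HY) below.
    print(np.allclose(H, H.T), np.allclose(H @ H, H))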

By the bias-variance decomposition, I know that

\begin{align*}
\operatorname{MSE}(\hat{Y}) &= \operatorname{bias}^2(\hat{Y}) + \operatorname{var}(\hat{Y})\\
&= (\operatorname{E}[HY] - \mu)^T(\operatorname{E}[HY] - \mu) + \operatorname{var}(HY)\\
&= (H\mu - \mu)^T(H\mu - \mu) + \sigma^2 H\\
&= 0 + \sigma^2 H
\end{align*}

I'm confused by the dimensions in the last step. The $\operatorname{bias}^2$ term is a scalar, but $\operatorname{var}(\hat{Y})$ is an $n \times n$ matrix. How can one add a scalar to an $n \times n$ matrix when $n \neq 1$?

One Answer

More explanation in the edit below

I think the confusion arises because of the two different meanings of the MSE:

  1. A value calculated from a sample of fitted values or predictions; this is usually what we mean when we write $\operatorname{MSE}(\hat{Y})$ in the context of OLS, since $\hat{Y}$ is the vector of fitted values.

  2. A value calculated from an estimator. It is for this meaning that we have the variance–bias decomposition. We use this meaning of the MSE in the context of OLS too, but usually for the MSE of $\hat{\beta}$, where $\hat{\beta}$ is an estimator of the parameter $\beta$. By the Gauss–Markov theorem we know that $\operatorname{Bias}(\hat{\beta}) = 0$ and thus $\operatorname{MSE}(\hat{\beta}) = \operatorname{Var}(\hat{\beta})$ by the variance–bias decomposition if we take $\hat{\beta} = (X^TX)^{-1}X^TY$.

However, we can view $\hat{Y}$ as an estimator of $X\beta$ and thus consider $\operatorname{MSE}(\hat{Y})$ in the second sense. This is really just a rephrasing of the usual OLS estimation of $\beta$: in the normal setup we estimate the parameter $\beta$ given $X$ and $Y$, while in this new setup we estimate the parameter $X\beta$ given $X$ and $Y$. Alas, the notation is now confusing, since $\hat{Y}$ suggests that we are estimating $Y$ (a random variable), which we are not doing.

To simplify the formalism, we will use the notation of the OP and define $\mu = X\beta$. Note that $\mu$ here is a fixed (non-random) vector of parameters.

We also have to clarify some definitions, since we are now dealing with a vector-valued estimator. First the variance (see this answer for some explanation):

\begin{equation*} \operatorname{Var}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] \end{equation*}

The definition of the bias doesn't change from the 1-dimensional case:

\begin{equation*} \operatorname{Bias}(\hat{Y}) = \operatorname{E}[\hat{Y}]-\mu \end{equation*}

However, we do have to find a vector-valued equivalent of the 1-dimensional expression $\operatorname{Bias}_{\mu}(\hat{Y})^2$, since this appears in the variance–bias decomposition. In the same vein as the vector-valued variance, this equivalent expression is the following:

\begin{equation*} \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \end{equation*}

Note that $\operatorname{Bias}(\hat{Y})$ is a fixed vector, so if the expression $\operatorname{E}[\hat{Y}]-\mu$ appears within the scope of an expected-value operator, we can take it out as a constant. This question is about this fact, albeit for the 1-dimensional case.

And finally the MSE itself:

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T\right] \end{equation*}

So, with all this in hand, we can now prove the variance–bias decomposition of the MSE for a vector-valued estimator, which is really just a rephrasing of the usual proof for the 1-dimensional case:

\begin{align*}
\operatorname{MSE}(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T\right] \\
&= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]+\operatorname{E}[\hat{Y}]-\mu\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]+\operatorname{E}[\hat{Y}]-\mu\right)^T\right]\\
&= \operatorname{E}\left[\left(\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)+\left(\operatorname{E}[\hat{Y}]-\mu\right)\right)\left(\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T+\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right)\right]\\
&= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T +\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right. \\
&\hphantom{xxxxxxxxxx} + \left.\left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T +\left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right] \\
&= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] + \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right] \\
&\hphantom{xxxxxxxxxx} + \operatorname{E}\left[\left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] + \operatorname{E}\left[\left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right] \\
&=\operatorname{Var}(\hat{Y}) + \operatorname{E}\left[\hat{Y}-\operatorname{E}[\hat{Y}]\right]\left(\operatorname{E}[\hat{Y}]-\mu\right)^T \\
&\hphantom{xxxxxxxxxx} + \left(\operatorname{E}[\hat{Y}]-\mu\right)\operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] + \left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T \hphantom{xx} (*)\\
&=\operatorname{Var}(\hat{Y}) + \left(\operatorname{E}[\hat{Y}]-\operatorname{E}[\hat{Y}]\right)\left(\operatorname{E}[\hat{Y}]-\mu\right)^T \\
&\hphantom{xxxxxxxxxx} + \left(\operatorname{E}[\hat{Y}]-\mu\right)\left(\operatorname{E}[\hat{Y}]-\operatorname{E}[\hat{Y}]\right)^T + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\
&=\operatorname{Var}(\hat{Y}) + 0\left(\operatorname{E}[\hat{Y}]-\mu\right)^T + \left(\operatorname{E}[\hat{Y}]-\mu\right)0^T + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\
&= \operatorname{Var}(\hat{Y}) + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T
\end{align*}

(In the step marked $(*)$ we use the fact noted above that $\operatorname{E}[\hat{Y}]-\mu$ is a fixed vector, so it can be taken out of the expectations.)
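As a quick sanity check on this identity (not part of the proof), here is a minimal Python/NumPy simulation. The shrunken fit $0.5\,\hat{Y}$ used below is a hypothetical estimator chosen only so that the bias term is visibly nonzero; the decomposition then holds for the empirical moments as well:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, sigma, reps = 20, 2, 1.5, 50_000       # arbitrary illustration values
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    mu = X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T

    # A deliberately biased (hypothetical) estimator of mu: shrink the OLS fit by 0.5.
    draws = np.empty((reps, n))
    for r in range(reps):
        Y = mu + rng.normal(scale=sigma, size=n)
        draws[r] = 0.5 * (H @ Y)

    m = draws.mean(axis=0)
    bias = m - mu                                # Bias = E[estimator] - mu
    var = (draws - m).T @ (draws - m) / reps     # Var  = E[(est - E[est])(est - E[est])^T]
    mse = (draws - mu).T @ (draws - mu) / reps   # MSE  = E[(est - mu)(est - mu)^T]

    print(np.linalg.norm(bias))                                # clearly nonzero
    print(np.max(np.abs(mse - (var + np.outer(bias, bias)))))  # ~0 (floating-point level)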

Let's now actually calculate the bias and the variance of the estimator $\hat{Y}$:

\begin{align*}
\operatorname{Bias}(\hat{Y}) &= \operatorname{E}[\hat{Y}]-\mu \\
&= \operatorname{E}[\hat{Y}-\mu] \\
&= \operatorname{E}\left[X(X^TX)^{-1}X^TY-X\beta\right] \\
&= \operatorname{E}\left[X\left((X^TX)^{-1}X^TY-\beta\right)\right] \\
&= X\operatorname{E}\left[(X^TX)^{-1}X^TY-\beta\right] \\
&= X\operatorname{E}[\hat{\beta}-\beta] \\
&= X \cdot 0 \\
&= 0
\end{align*}

The equality $\operatorname{E}[\hat{\beta}-\beta]=0$ is a consequence of the Gauss–Markov theorem. Note that $\operatorname{Bias}(\hat{Y})=0$ implies that $\operatorname{E}[\hat{Y}]=\mu$ by simple rearrangement.

We now calculate the variance:

\begin{align*}
\operatorname{Var}(\hat{Y}) &= \operatorname{E}\left[(\hat{Y}-\operatorname{E}[\hat{Y}])(\hat{Y}-\operatorname{E}[\hat{Y}])^T\right]\\
&= \operatorname{E}\left[(\hat{Y}-\mu)(\hat{Y}-\mu)^T\right]\\
&= \operatorname{E}\left[(X\hat{\beta}-X\beta)(X\hat{\beta}-X\beta)^T\right]\\
&= \operatorname{E}\left[X(\hat{\beta}-\beta)(\hat{\beta}-\beta)^TX^T\right]\\
&= X\operatorname{E}\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right]X^T\\
&= X\operatorname{E}\left[(\hat{\beta}-\operatorname{E}[\hat{\beta}])(\hat{\beta}-\operatorname{E}[\hat{\beta}])^T\right]X^T \hphantom{xx} (\text{by the Gauss–Markov theorem})\\
&= X\operatorname{Var}(\hat{\beta})X^T\\
&= X\left(\sigma^2(X^TX)^{-1}\right)X^T \hphantom{xx} (**)\\
&= \sigma^2X(X^TX)^{-1}X^T\\
&= \sigma^2H
\end{align*}

We prove the step marked $(**)$, namely that $\operatorname{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$:

\begin{align*}
\hat{\beta} - \beta &= (X^TX)^{-1}X^TY - \beta \\
&= (X^TX)^{-1}X^T(X\beta + \epsilon) - \beta \\
&= (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon - \beta \\
&= \beta + (X^TX)^{-1}X^T\epsilon - \beta \\
&= (X^TX)^{-1}X^T\epsilon
\end{align*}

Thus:

\begin{align*}
\operatorname{Var}(\hat{\beta}) &=\operatorname{E}\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right] \\
&= \operatorname{E}\left[(X^TX)^{-1}X^T\epsilon\left((X^TX)^{-1}X^T\epsilon\right)^T\right] \\
&= \operatorname{E}\left[(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-T}\right] \\
&= (X^TX)^{-1}X^T\operatorname{E}\left[\epsilon\epsilon^T\right]X(X^TX)^{-T} \\
&= (X^TX)^{-1}X^T\operatorname{E}\left[(\epsilon-0)(\epsilon-0)^T\right]X(X^TX)^{-T} \\
&= (X^TX)^{-1}X^T\operatorname{E}\left[(\epsilon-\operatorname{E}[\epsilon])(\epsilon-\operatorname{E}[\epsilon])^T\right]X(X^TX)^{-T} \\
&= (X^TX)^{-1}X^T\operatorname{Var}(\epsilon)X(X^TX)^{-T} \\
&= (X^TX)^{-1}X^T(\sigma^2I)X(X^TX)^{-T} \hphantom{xx} (\text{since the errors are uncorrelated with common variance } \sigma^2)\\
&= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-T} \\
&= \sigma^2(X^TX)^{-T} \\
&= \sigma^2\left((X^TX)^T\right)^{-1} \\
&= \sigma^2(X^TX)^{-1}
\end{align*}
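If it helps to see this numerically, here is a small Monte Carlo sketch (sizes and seed are arbitrary illustration values): the empirical covariance of $\hat{\beta}$ across replications should be close to $\sigma^2(X^TX)^{-1}$, up to simulation error.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, sigma, reps = 30, 3, 1.0, 50_000      # arbitrary illustration values
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    XtX_inv = np.linalg.inv(X.T @ X)

    betas = np.empty((reps, p))
    for r in range(reps):
        Y = X @ beta + rng.normal(scale=sigma, size=n)
        betas[r] = XtX_inv @ X.T @ Y            # OLS estimate for this replication

    emp_cov = np.cov(betas, rowvar=False)        # empirical covariance of beta-hat
    print(np.max(np.abs(emp_cov - sigma**2 * XtX_inv)))   # small Monte Carlo error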

So, putting it all together:

\begin{align*}
\operatorname{MSE}(\hat{Y}) &= \operatorname{Var}(\hat{Y}) + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\
&= \sigma^2H + 0 \cdot 0^T \\
&= \sigma^2H
\end{align*}

This is the answer that the OP calculated. :)
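The final result can also be checked by simulation; here is a minimal sketch of the full statement $\operatorname{MSE}(\hat{Y}) = \sigma^2 H$ (again, sizes and seed are arbitrary illustration values):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, sigma, reps = 25, 3, 2.0, 100_000     # arbitrary illustration values
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    mu = X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T

    Y = mu + rng.normal(scale=sigma, size=(reps, n))   # one replication per row
    Y_hat = Y @ H.T                                    # row r holds (H Y_r)^T

    bias = Y_hat.mean(axis=0) - mu                     # should be close to 0
    dev = Y_hat - mu
    mse = dev.T @ dev / reps                           # E[(Y_hat - mu)(Y_hat - mu)^T]

    print(np.max(np.abs(bias)))                        # ~0
    print(np.max(np.abs(mse - sigma**2 * H)))          # ~0, up to Monte Carlo error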


EDIT

The OP asked in the comments why we define

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T\right] \end{equation*}

and not

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)\right]. \end{equation*}

This is a good question; indeed, it's the crux of the OP's original question and I didn't address it properly. I will attempt to redress this oversight.

In the 1-dimensional case, the meaning of the definition

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\mu\right)^2\right] \end{equation*}

is unambiguous. But if $\hat{Y}-\mu$ is a vector, then we have to decide how to interpret the expression $\left(\hat{Y}-\mu\right)^2$. We have two options:

  1. $\left(\hat{Y}-\mu\right)^2 = \left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)$

  2. $\left(\hat{Y}-\mu\right)^2 = \left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T$

In my original answer I went with the second option (based on arguments given here). But what about the first option? Well, we still have the variance–bias decomposition! Let's show that. We start by defining all the relevant terms; I mark them with a superscript asterisk * in order to distinguish them from the definitions given in my original answer, but please note that this is not standard notation:

\begin{align*}
\operatorname{MSE}^*(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)\right] \\
\operatorname{Var}^*(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\right] \\
\operatorname{Bias}^*(\hat{Y}) &= \operatorname{E}[\hat{Y}]-\mu \;\left(= \operatorname{Bias}(\hat{Y})\right)\\
\operatorname{Bias}^*(\hat{Y})^2 &= \operatorname{Bias}^*(\hat{Y})^T\operatorname{Bias}^*(\hat{Y})
\end{align*}

(Note that we could multiply by the constant factor $\frac{1}{n}$, i.e. define

\begin{equation*} \operatorname{MSE}^*(\hat{Y}) = \operatorname{E}\left[\tfrac{1}{n}\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)\right]. \end{equation*}

It doesn't really matter whether we include this constant factor, since it passes straight through the expectation and simply rescales every term in the decomposition.)

With these definitions, the MSE still decomposes into the sum of the variance and the square of the bias:

\begin{equation*} \operatorname{MSE}^*(\hat{Y}) = \operatorname{Var}^*(\hat{Y}) + \operatorname{Bias}^*(\hat{Y})^2 \end{equation*}

The proof is all but identical to the one given above: One just has to move a few superscript $T$s around.
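It is perhaps worth noting how the two definitions are related in the OLS case: since $x^Tx = \operatorname{tr}\left(xx^T\right)$ for any column vector $x$ and the trace commutes with the expectation,

\begin{align*}
\operatorname{MSE}^*(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)\right] = \operatorname{tr}\left(\operatorname{E}\left[\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T\right]\right) = \operatorname{tr}\left(\operatorname{MSE}(\hat{Y})\right) \\
&= \operatorname{tr}\left(\sigma^2H\right) = \sigma^2\operatorname{tr}\left(X(X^TX)^{-1}X^T\right) = \sigma^2\operatorname{tr}\left((X^TX)^{-1}X^TX\right) = \sigma^2p,
\end{align*}

assuming $X$ has full column rank $p$ (which is implicit in the existence of $(X^TX)^{-1}$). So the scalar MSE of the fitted values is just $\sigma^2$ times the number of parameters.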

What the OP did in their original calculation was to mix the two sets of definitions when applying the variance–bias decomposition: they used the scalar squared bias $\operatorname{Bias}^*(\hat{Y})^2 = \operatorname{Bias}(\hat{Y})^T\operatorname{Bias}(\hat{Y})$ together with the matrix-valued variance $\operatorname{Var}(\hat{Y})$. This is why the dimensions didn't match.

Answered by dwolfeu on November 2, 2021
