
Compare RMSE for the same model but varying sample size

Cross Validated Asked by skoestlmeier on December 8, 2021

My empirical research is based on a variable $a_{i,t} \sim f(\mathrm{RMSE})$, i.e. it is based on the root mean squared error (RMSE) of a certain regression model $Y_{i,t} = f(X_{i,t}, \beta) + \epsilon_{i,t}$. The regression is estimated on up to $n = 40$ observations, with a minimum of 24 observations available.

Is my variable $a_{i,t}$ comparable across entities, if the underlying number of observations varies between the range $24 le n le 40$? Is $a_{i,t}$ somehow dependent on the number of observations used in the regression?


My question is not related to those (e.g. [1] or [2]) where RMSE is used to compare different regression models. Here the model is the same for all regressions; only the number of observations varies.

3 Answers

It seems like you are not using RMSE to validate your model's predictive performance; it's a useful quantity for other reasons, like theory. For some of your firms you have less data to work with, so you are concerned that you might have a higher RMSE just because you have fewer observations. But you could also have a *lower* RMSE because you overfit; if you have many terms, this is a real concern with only 24 observations. I think you can gauge how bad this problem is with some simulations. Start with the firms where you have a full history, do your analysis, and record the RMSE. Then refit your model after truncating each of those firms to a shorter history. If the RMSE changes noticeably when a firm is truncated, compared to the full-history model, you know the comparison is a bad idea. There may be selection issues with the firms that have less history, so this check is not perfect.
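The truncation check described above could be sketched as follows. This is a hypothetical single-firm setup with simulated data, not the asker's actual model: `p = 3` predictors, a full history of 40 observations, and a truncated history of 24, matching the extremes in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rmse(X, y):
    """Fit OLS by least squares and return the in-sample RMSE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2))

# Simulate one firm with a full 40-observation history
n_full, n_trunc, p = 40, 24, 3
X = np.column_stack([np.ones(n_full), rng.normal(size=(n_full, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n_full)

rmse_full = fit_rmse(X, y)                       # full history
rmse_trunc = fit_rmse(X[:n_trunc], y[:n_trunc])  # same firm, truncated to 24 obs
print(rmse_full, rmse_trunc)
```

Repeating this over all full-history firms (and over many random seeds) shows how much the RMSE moves purely because of the shorter sample.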

Answered by dimitriy on December 8, 2021

From what I understand, you are trying to compare your model's performance across different subsets of data with a varying number of observations.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \epsilon_i^2}{N}}$$

Why divide by $N$ under the square root? It lets us estimate the standard deviation $\sigma$ of the error for a typical single observation, rather than some kind of "total error". Dividing by $N$ keeps this measure of error consistent as we move from a small collection of observations to a larger one (it just becomes more accurate as the number of observations grows). Put another way, RMSE is a good answer to the question: how far off should we expect our model to be on its next prediction?
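Not part of the original answer, but the point is easy to illustrate: if the errors really come from a fixed distribution with standard deviation $\sigma$, the RMSE hovers around $\sigma$ regardless of $N$ (the true $\sigma = 2$ here is made up for the demo).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # true standard deviation of the errors

for n in (24, 40, 10_000):
    errors = rng.normal(scale=sigma, size=n)
    rmse = np.sqrt(np.mean(errors ** 2))
    # RMSE stays near sigma for every n; it just gets less noisy as n grows
    print(n, round(rmse, 3))
```

The division by $N$ is exactly what makes the estimate stable across sample sizes; without it, the quantity would grow with $N$.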

In your case, as far as I know, it's not feasible to compare RMSE across subsets of data with different sample sizes to judge model performance, if that's what you are doing.

Answered by Vivek on December 8, 2021

No.

RMSE is a simple measure of how far your data is from the regression line, $\sqrt{\frac{\sum_{i=1}^{N} \epsilon_i^2}{N}}$.

Imagine you have $p = 24$ independent predictors, so 24 columns in $X$ and 24 parameters in $\beta$. In cases where you have only 24 data points, the model can fit the data perfectly, even if the predictors are totally random, so RMSE $= 0$. Clearly, this isn't right, and is a case of overfitting. This problem is less extreme when $N \gg p$, but it doesn't go away!
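A quick demonstration of that degenerate case (a made-up example, not the asker's data): with as many random predictors as observations, least squares interpolates pure noise and the in-sample RMSE collapses to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 24
X = rng.normal(size=(n, n))  # 24 totally random predictors, 24 observations
y = rng.normal(size=n)       # response completely unrelated to X

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
print(rmse)  # essentially zero: the model interpolates the noise exactly
```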

A better approach would be to use some kind of out-of-sample prediction, but without knowing more about your problem I don't think we can say any more about that.

Answered by Eoin on December 8, 2021
