# Justification for and optimality of $R^2_{adj.}$ as a model selection criterion

In a recent thread, use of adjusted $R^2$ ($R^2_{adj.}$) is mentioned in the context of model selection, e.g.

> The adjustment was invented as a solution to problems caused by variable selection

Question: Is there any justification for using $R^2_{adj.}$ for model selection? That is, does $R^2_{adj.}$ have any optimality properties in the context of model selection?

For example, AIC is an efficient criterion and BIC is a consistent one, but $R^2_{adj.}$ does not coincide with either of them, which makes me wonder whether it can be optimal in any other sense.

Cross Validated, asked on January 7, 2022

I would propose six optimality properties.

1. Overfit Mitigation
2. Simplicity and Parsimony
3. General Shared Understanding
4. Semi-Efficient Factor Identification
5. Robustness to Sample Size Change
6. Explanatory Utility

Overfit Mitigation

What kind of model is overfit? In part, this depends on the model's use case. Suppose we are using a model to test whether a hypothesized factor-level relationship exists. In that case a model which tends to allow spurious relations is overfit.

"The use of an adjusted R2...is an attempt to account for the phenomenon of the R2 automatically and spuriously increasing when extra explanatory variables are added to the model." Wikipedia.

Simplicity and Parsimony

Parsimony is valued on normative and economic grounds. Occam's Razor is an example of a norm, and depending on what we mean by "justification," it might pass or fail.

The economic rationale for simplicity and parsimony is harder to dismiss:

1. Complex models with many factors are expensive to gather data for.
2. Complex models can be more expensive to execute.
3. Complex models are hard to communicate and reason about. This creates business and legal risk, as well as plain time spent communicating from one person to another.

Given two models with equal explanatory power (R2), then, AR2 selects the simpler, more parsimonious model.

General Shared Understanding

Justification involves shared understanding. Consider a peer-review situation. If the reviewer and the reviewed lack a shared understanding of model selection, questions or rejections may occur.

R2 is an elementary statistical concept, and even those familiar only with elementary statistics generally understand that R2 is gameable and that AR2 is preferred to R2 for the reasons above.

Sure, there may be better choices than AR2, such as AIC and BIC, but if the reviewer is unfamiliar with these, their use may not succeed as a justification. Worse, the reviewer may themselves be mistaken and demand AIC or BIC when they aren't required - that demand is itself unjustified.

My limited understanding is that AIC is now considered rather arbitrary by many - specifically the 2s in its formula. WAIC, DIC, and LOO-CV have been suggested as preferable; see here.
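
For reference, the "2s" referred to are the constants in AIC's standard definition:

$$
\mathrm{AIC} = 2k - 2\ln \hat{L},
$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. The criticism is that the weight of 2 on the parameter-count penalty comes from Akaike's asymptotic derivation as a convention, not from anything tuned to the problem at hand.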

I hope by "justified" we don't mean "no better parameter exists" because it seems to me that some better parameter might always exist unbeknownst to us, therefore this style of justification always fails. Instead "justified" ought to mean "satisfies the requirement at hand" in my view.

Semi-Efficient Factor Identification

Caveat: I made up this term and I could be using it wrong :)

Basically, if we are interested in identifying true factor relations, we should at least expect p < 0.5, i.e. the hypothesized relation is more likely present than absent. AR2 maximization roughly satisfies this: adding a factor with p >= 0.5 will reduce AR2. The match isn't exact, though - the actual cutoff is |t| = 1, which corresponds to a two-sided p of roughly 0.32 in large samples.
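
This cutoff can be checked numerically: adding one regressor increases AR2 exactly when that regressor's t-statistic exceeds 1 in absolute value. A quick simulation with numpy (a sketch; names, seed, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
y = x1 + rng.normal(size=n)

def adj_r2(X, y):
    """Adjusted R^2 for OLS with intercept."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def t_stat(X, y, j):
    """t statistic of coefficient j (0 = intercept) in OLS with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (len(y) - Xd.shape[1])
    cov = s2 * np.linalg.inv(Xd.T @ Xd)
    return beta[j] / np.sqrt(cov[j, j])

agree = 0
trials = 200
for _ in range(trials):
    z = rng.normal(size=n)                      # candidate regressor
    gained = adj_r2(np.column_stack([x1, z]), y) > adj_r2(x1[:, None], y)
    t = t_stat(np.column_stack([x1, z]), y, 2)  # column 2 is z's coefficient
    agree += bool(gained) == (abs(t) > 1.0)
print(f"AR2-gain matched |t| > 1 in {agree}/{trials} trials")
```

The two criteria are algebraically equivalent (the single-variable F statistic is t squared, and AR2 rises exactly when F > 1), so agreement should be exact up to floating-point ties.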

It's true AIC penalizes more in general but I'm not sure that's a good thing if the goal is to identify all observed features that have an identifiable relation, say at least directionally, in a given data set.

Robustness to Sample Size Change

In the comments of this post, Scortchi - Reinstate Monica notes that it "makes no sense to compare likelihoods (or therefore AICs) of models fitted on different numbers of observations." In contrast, R2 and AR2 are absolute measures that remain comparable when the number of samples changes.

This might be useful in the case of a questionnaire that includes some optional questions and partial responses. It's of course important to be mindful of issues like response bias in such cases.

Explanatory Utility

Here, we are told that "R2 and AIC are answering two different questions...R2 is saying something to the effect of how well your model explains the observed data...AIC, on the other hand, is trying to explain how well the model will predict on new data."

So if the use case is non-predictive, such as in the case of theory-driven, factor-level hypothesis testing, AIC may be considered inappropriate.

Answered by John Vandivier on January 7, 2022

I don't know whether $R^2_{\text{adj.}}$ has any optimality properties for model selection, but it is surely taught (or at least mentioned) in that context. One reason might be that most students have met $R^2$ early on, so there is then something to build on.

One example is the following exam paper from the University of Oslo (see problem 1). The text used in that course, *Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models* (second edition) by Eric Vittinghoff, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch, mentions $R^2_{\text{adj.}}$ early in its chapter 10 on variable selection (as penalizing less than AIC, for example), but neither it nor AIC is mentioned in the summary/recommendations of section 10.5.

So it is maybe mostly used didactically, as an introduction to the problems of model selection, and not because of any optimality properties.

Answered by kjetil b halvorsen on January 7, 2022

1. If you add more variables, even totally insignificant ones, R2 can only go up. This is not the case with adjusted R2. You can try running a multiple regression, then add a random variable and see what happens to R2 versus adjusted R2.
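
That experiment can be sketched in a few lines of numpy (names, seed, and data-generating process are illustrative): fit OLS by least squares, record R2 and adjusted R2, then refit after appending a pure-noise column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)           # true model uses only x1

def r2_and_adj(X, y):
    """OLS with intercept; return (R^2, adjusted R^2)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

r2_base, adj_base = r2_and_adj(x1[:, None], y)

noise = rng.normal(size=n)                  # irrelevant regressor
r2_big, adj_big = r2_and_adj(np.column_stack([x1, noise]), y)

# R^2 can never decrease when a column is added; adjusted R^2 can.
print(f"R2:  {r2_base:.4f} -> {r2_big:.4f}")
print(f"aR2: {adj_base:.4f} -> {adj_big:.4f}")
```

Because least squares can always set the new coefficient to zero, the residual sum of squares never grows, so R2 never falls; the degrees-of-freedom penalty is what lets adjusted R2 fall.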

Answered by Oren Ben-Harim on January 7, 2022
