
Regression: is it wrong to bin a continuous variable to overcome overfitting?

Cross Validated. Asked by st4co4 on December 10, 2020

Would statisticians hang me for doing the following?

I have a heterogeneous dataset of elderly subjects, so my model has 7 predictors, including 4 categorical ones, some of which have many levels. I am doing a regional analysis, which means that some regions have fewer subjects at certain reference levels of the categorical variables.

The subjects are mostly aged 70-90 years. The age variable, which ranges from 50 to 100, causes clear overfitting when I compare the model's results with the plots from my exploratory data analysis. I found that in some regions there are not enough subjects at the mean age to make meaningful predictions. When I bin the age variable into 10-year bins and use the bin with the largest number of subjects as the reference, the regression results are in line with the exploratory data analysis.

Would binning the age variable be okay if I publish both the plots of the raw data and the adjusted analysis? Both analyses confirm the main outcome: regional variability.

One Answer

Binning a continuous variable is not a good idea. You're unlikely to be physically assaulted by statisticians for doing that, but you would probably get a lot of hard stares and frowns and muttering under the breath.

There's a much better approach to deal with this type of problem, which would turn the frowns into smiles: use a mixed model. That allows you to combine information usefully among individuals in different regions without having to cover all combinations of predictors within each region. Depending on the purpose of your study that could be done with a multi-level model that treats both individuals and regions as random effects. This recent answer provides a nice description of the advantages of such modeling.
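To give a feel for what "combining information among regions" means, here is a minimal Python sketch of the simplest possible case: a random intercept per region and no other predictors. It is not the model you would actually fit. The function name `partial_pool`, the toy data, and the variance components `sigma2` and `tau2` are all made up for illustration; real mixed-model software estimates the variances from the data instead of taking them as given.

```python
from statistics import fmean


def partial_pool(region_values, sigma2, tau2):
    """Shrink each region's raw mean toward the overall mean.

    region_values: dict mapping region name -> list of outcomes
    sigma2: ASSUMED within-region (residual) variance
    tau2:   ASSUMED between-region variance of the random intercepts
    """
    means = {r: fmean(v) for r, v in region_values.items()}
    sizes = {r: len(v) for r, v in region_values.items()}

    # Overall mean: each region mean is weighted by its precision under
    # the random-intercept model, 1 / (tau2 + sigma2 / n).
    weights = {r: 1.0 / (tau2 + sigma2 / sizes[r]) for r in means}
    mu = sum(weights[r] * means[r] for r in means) / sum(weights.values())

    pooled = {}
    for r in means:
        # Shrinkage factor: close to 1 for well-sampled regions (trust the
        # region's own data), close to 0 for sparse ones (lean on the rest).
        shrink = tau2 / (tau2 + sigma2 / sizes[r])
        pooled[r] = mu + shrink * (means[r] - mu)
    return mu, pooled


# Hypothetical outcomes: one well-sampled region, one with only two subjects.
data = {"large": [10.0] * 20 + [12.0] * 20, "small": [20.0, 22.0]}
mu, pooled = partial_pool(data, sigma2=4.0, tau2=1.0)
for region, values in data.items():
    print(region, "raw mean:", fmean(values),
          "partially pooled:", round(pooled[region], 2))
```

The estimate for the sparse region is pulled strongly toward the overall mean, while the well-sampled region barely moves. That is why a mixed model can give sensible regional estimates even when a region has few subjects at some predictor combinations.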

With respect to age as a continuous predictor, you might find it useful to model with a spline that can discover nonlinear relationships between age and outcome as part of a linear modeling process. That can be incorporated within a mixed model via standard software packages.
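As a sketch of what "a spline as part of a linear modeling process" can look like, the Python snippet below builds a restricted cubic spline basis for age by hand. The restricted cubic form is one common choice, not something specified above. The function name `rcs_basis`, the example ages, and the knot locations are assumptions for illustration only; in practice knots are usually placed at quantiles of the observed ages, and standard packages build the basis for you.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots. A curve built from
    these columns is cubic between knots and linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # keeps the columns on a scale similar to x
    gap = t[-1] - t[-2]
    row = [float(x)]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap
        )
        row.append(term / scale)
    return row


# Hypothetical ages and knots, chosen only to span the 50-100 range.
ages = [55, 68, 74, 81, 86, 93]
knots = [60, 72, 80, 88]
design = [rcs_basis(a, knots) for a in ages]
for a, row in zip(ages, design):
    print(a, [round(v, 3) for v in row])
```

Each row replaces the single age value with three columns that enter the fixed-effects part of the model like any other predictors, so the model stays linear in its coefficients. With four knots, age costs three coefficients, one fewer than the four indicator terms needed for five 10-year bins, and the fitted curve has no artificial jumps at bin edges.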

Correct answer by EdM on December 10, 2020
