Cross Validated Asked on December 27, 2021

I’m working on a logistic regression model. I have 417 independent variables to try, but running a model with all of them is too much for RStudio. What would be a good criterion for discarding some variables beforehand?

Thanks for your answers! The aim of this work is to develop a predictive model to detect which customers are most likely to become uncollectible accounts in a utility company.

Sounds like regularization models could be of use to you (the LASSO especially, since your problem seems to be centred on picking variables, as opposed to ridge and elastic net).

However, these can be difficult to set up. They are included in Stata 15 if you have access to it. In R it often takes a bit more handiwork, but hopefully someone can chime in on whether and how R supports users with regularization techniques.

There are other techniques for manually picking and choosing variables based on their behaviour, but with over 400 variables (assuming you have no preconceived hypothesis about any of them), I'd say doing the work to understand regularization models is probably easier than manual selection.
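To illustrate the idea on a smaller scale (a minimal sketch on synthetic data, assuming scikit-learn is available; the data and variable counts are made up), an L1-penalized logistic regression drops most irrelevant predictors automatically:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # 500 customers, 50 candidate predictors
# Only the first two predictors actually drive the outcome
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# L1-penalized logistic regression; smaller C means a stronger penalty
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])  # indices of the variables retained
print(len(kept))
```

Tuning the penalty strength `C` by cross-validation (e.g. with `LogisticRegressionCV`) controls how many variables survive.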

Answered by Paze on December 27, 2021

As mentioned by @DemetriPananos, theoretical justification would be the best approach, especially if your goal is inference. That is, with expert knowledge of the actual data-generating process, you can look at the causal paths between the variables and from there select which variables are important, which are confounders and which are mediators. A DAG (directed acyclic graph), sometimes known as a causal diagram, can be a great aid in this process. I have personally encountered DAGs with as many as 500 variables, which were able to be brought down to fewer than 20.

Of course, this might not be practical, or feasible in your particular situation.

Other methods you could use are:

- Principal Components Analysis. PCA is a mathematical technique used for dimension reduction: it generates new uncorrelated variables (components) that are linear combinations of the original (correlated) variables, such that each component accounts for a decreasing portion of the total variance. That is, PCA computes a new set of variables and expresses the data in terms of them. Considered together, the new variables represent the same amount of information as the original variables, in the sense that we can restore the original data set from the transformed one. The total variance remains the same, but is redistributed so that the first component accounts for the maximum possible variance while being orthogonal to the remaining components; the second component accounts for the maximum possible share of the remaining variance while also staying orthogonal to the remaining components, and so on. This explanation is deliberately non-technical. PCA is therefore very useful where variables are highly correlated, and by retaining only the first few components it is possible to reduce the dimension of the dataset considerably. In R, PCA is available using the base `prcomp` function.

- Partial Least Squares. PLS is similar to PCA except that it also takes account of the correlation of each variable with the outcome. When used with a binary (or other categorical) outcome, it is known as PLS-DA (Partial Least Squares Discriminant Analysis). In R you could use the `caret` package.

It is worth noting that variables should be standardised prior to PCA or PLS, in order to avoid domination by variables that are measured on larger scales. A great example of this is analysing the results of an athletics competition - if the variables are analysed on their original scale then events such as the marathon and 10,000m will dominate the results.
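Both points can be made concrete with a toy sketch (synthetic data, numpy only; in R, `prcomp(..., scale. = TRUE)` does the standardization for you): three correlated variables measured on very different scales collapse onto essentially one component after standardization.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=200)
# Three strongly correlated variables measured on very different scales
X = np.column_stack([
    1000.0 * z + rng.normal(size=200),   # "marathon-sized" scale
    z + 0.1 * rng.normal(size=200),
    -z + 0.1 * rng.normal(size=200),
])

# Standardize each column, then do PCA via the SVD of the centred data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / np.sum(s**2)  # proportion of variance per component
print(explained)  # the first component dominates
```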

- Ridge Regression. Also known as Tikhonov regularization, this is used to deal with variables that are highly correlated, which is very common in high-dimensional datasets. Where ordinary least squares minimizes the sum of squared residuals, ridge regression adds a penalty, so that we minimize the sum of the squared residuals plus a term usually proportional to the sum (or often a weighted sum) of the squared parameters. Essentially, we "penalize" large parameter values in the quantity we are seeking to minimize. What this means in practice is that the regression estimates are shrunk towards zero. They are thus no longer unbiased estimates, but they suffer from considerably less variance than would be the case with OLS. The regularization parameter (the penalty) is often chosen by cross-validation; note, however, that the cross-validation estimator has a finite variance, which can be very large and can lead to overfitting.
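As a sketch of the shrinkage effect (synthetic data, numpy only; names are illustrative), the ridge estimate has a closed form obtained by adding the penalty to the diagonal of X'X, and its norm is always smaller than the OLS norm:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]                  # only three true effects
y = X @ beta + rng.normal(size=n)

lam = 10.0
# Ridge: minimize ||y - Xb||^2 + lam * ||b||^2, solved in closed form
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)    # ordinary least squares
print(np.linalg.norm(b_ridge), np.linalg.norm(b_ols))  # ridge norm is smaller
```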

- LASSO regression. The Least Absolute Shrinkage and Selection Operator (LASSO) is very similar to ridge regression - it is also a regularization method. The main difference is that, where ridge regression adds a penalty proportional to the *squared* parameters (also called the L2-norm), the LASSO uses the *absolute* values (the L1-norm). This means that the LASSO shrinks the less important variables' coefficients all the way to zero, thus removing some variables altogether, so it works particularly well for variable selection where we have a large number of variables. As previously mentioned, regularization introduces bias with the benefit of lower variance. There are also methods to reduce or eliminate this bias, called *debiasing*.

- Elastic net. This is a compromise between ridge regression and the LASSO and produces a model that is penalized with both the L1-norm and the L2-norm. This means that some coefficients are shrunk (as in ridge regression) and some are set to zero (as in the LASSO). In R I would suggest the `glmnet` package, which can do ridge regression, LASSO and elastic net.
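The qualitative difference between the two penalties shows up clearly on synthetic data (a sketch assuming scikit-learn is available; `glmnet` in R behaves analogously): the L1 penalty produces exact zeros, while the L2 penalty only shrinks.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]        # 3 real signals, 17 noise variables
y = X @ beta + rng.normal(size=n)

# Count coefficients set exactly to zero under each penalty
lasso_zeros = np.sum(Lasso(alpha=0.5).fit(X, y).coef_ == 0.0)
ridge_zeros = np.sum(Ridge(alpha=0.5).fit(X, y).coef_ == 0.0)
print(lasso_zeros, ridge_zeros)  # the LASSO zeroes out coefficients; ridge does not
```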

Answered by Robert Long on December 27, 2021

Theoretical justification above all else. Aside from that, LASSO or similar penalized methods would be my next suggestion.

Answered by Demetri Pananos on December 27, 2021
