Variable selection in logistic regression model

Question

I'm working on a logistic regression model with 417 candidate independent variables, but fitting a model with all of them is too much (for RStudio). What would be a good criterion for discarding some variables beforehand?

Thanks for your answers! The aim of this work is to develop a predictive model to detect which customers are more likely to become uncollectible accounts at a utility company.

Paze · Answer

Sounds like regularization models could be of use to you (Lasso especially, since your problem seems to be centered around picking variables, as opposed to Ridge and Elastic net).

However, these can be difficult to set up. They are included in Stata 15 if you have access to that. In R, it often takes a bit more manual work, but hopefully someone can chime in on whether and how R supports regularization techniques.

There are other techniques to manually pick and choose variables based on their behavior, but with over 400 variables (assuming you have no preconceived hypothesis about any of these), I'd say doing the work to understand regularization models is probably easier than manual selection.

Robert Long · Answer

As mentioned by @DemetriPananos, theoretical justification would be the best approach, especially if your goal is inference. That is, with expert knowledge of the actual data generation process, you can look at the causal paths between the variables, and from there you can work out which variables are important, which are confounders and which are mediators. A DAG (directed acyclic graph), sometimes known as a causal diagram, can be a great aid in this process. I have personally encountered DAGs with as many as 500 variables, which were able to be brought down to fewer than 20.

Of course, this might not be practical, or feasible in your particular situation.

Other methods you could use are:

Principal Components Analysis. PCA is a mathematical technique which is used for dimension reduction, by generating new uncorrelated variables (components) that are linear combinations of the original (correlated) variables, such that each component in turn accounts for a decreasing portion of the total variance. That is, PCA computes a new set of variables and expresses the data in terms of these new variables. Considered together, the new variables represent the same amount of information as the original variables, in the sense that we can restore the original data set from the transformed one. Total variance remains the same, but is redistributed so that the first component accounts for the maximum possible variance while being orthogonal to the remaining components. The second component accounts for the maximum possible share of the remaining variance while also staying orthogonal to the remaining components, and so on. This explanation is deliberately non-technical. PCA is therefore very useful where variables are highly correlated, and by retaining only the first few components, it is possible to reduce the dimension of the dataset considerably. In R, PCA is available using the base prcomp function.
Partial Least Squares. PLS is similar to PCA except that it also takes account of the correlation of each variable with the outcome. When used with a binary (or other categorical) outcome, it is known as PLS-DA (Partial Least Squares Discriminant Analysis). In R you could use the caret package.

It is worth noting that variables should be standardised prior to PCA or PLS, in order to avoid domination by variables that are measured on larger scales. A great example of this is analysing the results of an athletics competition - if the variables are analysed on their original scale then events such as the marathon and 10,000m will dominate the results.
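To make the scaling point concrete, here is a minimal sketch of PCA via the singular value decomposition, run with and without standardisation. It is written in Python with NumPy purely for illustration (in R, prcomp with its `scale.` argument plays the same role). The function name `pca` and the simulated data are assumptions made for this example, not part of the question's dataset.

```python
import numpy as np

def pca(X, n_components, scale=True):
    """PCA via SVD. Returns component scores, loadings and variance ratios.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)                 # centre every column
    if scale:
        Z = Z / X.std(axis=0, ddof=1)      # standardise to unit variance
    # Rows of Vt are the principal directions, ordered by singular value
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)        # share of total variance per component
    loadings = Vt[:n_components]
    scores = Z @ loadings.T                # data expressed in the new variables
    return scores, loadings, explained

# Simulated data (an assumption): three correlated columns measured on very
# different scales, plus one unrelated noise column.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack([
    1000 * latent + rng.normal(scale=100, size=n),
    latent + rng.normal(scale=0.1, size=n),
    0.01 * latent + rng.normal(scale=0.001, size=n),
    rng.normal(size=n),
])

scores, loadings, explained = pca(X, n_components=2, scale=True)
raw_scores, raw_loadings, raw_explained = pca(X, n_components=2, scale=False)

print("variance ratios (standardised):", np.round(explained, 3))
print("first loading (unstandardised):", np.round(raw_loadings[0], 3))
```

Without standardisation, the first component is essentially just the large-scale column. With standardisation, the three correlated columns share the first component, and the scores could then be used as predictors in the logistic regression.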

Ridge Regression. Also known as Tikhonov Regularization, this is used to deal with variables that are highly correlated, which is very common in high-dimensional datasets. Where ordinary least squares minimises the sum of squared residuals, ridge regression adds a penalty, so that we minimise the sum of the squared residuals plus a term usually proportional to the sum (or often a weighted sum) of the squared parameters. The same idea carries over to logistic regression, where the penalty is added to the negative log-likelihood. Essentially, we "penalize" large values of the parameters in the quantity we are seeking to minimise. What this means in practice is that the regression estimates are shrunk towards zero. Thus they are no longer unbiased estimates, but they suffer from considerably less variance than would be the case with OLS. The regularization parameter (the strength of the penalty) is often chosen by cross validation. However, the cross validation estimator has a finite variance, which can be very large and can lead to overfitting.
LASSO regression. Least Absolute Shrinkage and Selection Operator (LASSO) is very similar to ridge regression in that it is also a regularization method. The main difference is that, where ridge regression adds a penalty that is proportional to the sum of the squared parameters (the squared L2-norm), the LASSO uses the sum of their absolute values (the L1-norm). This means that LASSO can shrink the coefficients of the less important variables exactly to zero, thus removing some variables altogether, so it works particularly well for variable selection where we have a large number of variables. As previously mentioned, regularization introduces bias with the benefit of lower variance. There are also methods to reduce or eliminate this bias, called debiasing.
Elastic net. This is a compromise between ridge regression and LASSO and produces a model that is penalized with both the L1-norm and L2-norm. This means that some coefficients are shrunk (as in ridge regression) and some are set to zero (as in LASSO). In R I would suggest the glmnet package which can do ridge regression, LASSO and elastic net.
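The difference between the three penalties can be seen in a small sketch of penalised logistic regression fitted by proximal gradient descent. This is an illustrative Python/NumPy implementation, not what glmnet does internally. The function names, the fixed penalty `lam=0.1` and the simulated data are all assumptions for the example. In practice a package such as glmnet would also choose the penalty by cross validation. Setting `alpha=1` gives the LASSO, `alpha=0` gives ridge, and values in between give the elastic net.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def penalized_logistic(X, y, lam, alpha=1.0, n_iter=2000):
    """Minimise mean logistic loss + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2).

    Hypothetical helper for illustration. The intercept is not penalised.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    # Step size from a Lipschitz bound on the gradient of the smooth part
    lipschitz = np.linalg.norm(A, 2) ** 2 / (4.0 * n) + lam * (1.0 - alpha)
    step = 1.0 / lipschitz
    intercept, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        resid = prob - y
        grad_beta = X.T @ resid / n + lam * (1.0 - alpha) * beta  # loss + L2 part
        intercept -= step * resid.mean()
        # Gradient step, then soft-thresholding handles the L1 part
        beta = soft_threshold(beta - step * grad_beta, step * lam * alpha)
    return intercept, beta

# Simulated data (an assumption): 20 standardised predictors, only the
# first three of which affect the outcome.
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -2.0, 2.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

_, beta_lasso = penalized_logistic(X, y, lam=0.1, alpha=1.0)
_, beta_ridge = penalized_logistic(X, y, lam=0.1, alpha=0.0)

print("LASSO keeps variables:", np.flatnonzero(beta_lasso))
print("ridge non-zero coefficients:", np.count_nonzero(beta_ridge))
```

Under the LASSO penalty the coefficients of the irrelevant predictors are set exactly to zero, whereas ridge shrinks every coefficient but leaves all of them in the model. That is why the LASSO (or elastic net) is the one to reach for when the goal is to discard variables.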

Demetri Pananos · Answer

Theoretical justification beyond all else.  Aside from that, LASSO or similar penalized methods would be my next suggestion.
