Relaxed Lasso Logistic Regression: Estimating second penalty parameter

I’m trying relaxed lasso logistic regression by first using sklearn’s cross-validation to find an optimal penalty parameter (C = 1/lambda). I then use that parameter to fit statsmodels’ Logit model to the data (lambda = 1/C), and at this step I remove coefficients that are very small (absolute value < 1e-5). When I perform cross-validation again on the reduced feature set to estimate the second penalty parameter, it ends up being stronger than the first penalty. That doesn’t seem right to me: the second step should produce better estimates for features already identified as non-zero in the first reduction step, so the second lambda should be weaker.
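For concreteness, here is a minimal sketch of that two-step pipeline using only sklearn; the simulated dataset, grid size, and threshold are illustrative assumptions, not taken from the code behind the question:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Illustrative data: 5 of 20 features are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Step 1: cross-validate the first penalty (sklearn reports C = 1/lambda).
cv1 = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20, cv=5)
cv1.fit(X, y)

# Drop coefficients shrunk to (near) zero.
keep = np.abs(cv1.coef_.ravel()) > 1e-5

# Step 2: cross-validate again on the reduced feature set.
cv2 = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20, cv=5)
cv2.fit(X[:, keep], y)

# With less competition from noise features, one would usually expect a
# weaker second penalty: cv2.C_ >= cv1.C_, i.e. a smaller lambda.
print("first C:", cv1.C_, "second C:", cv2.C_)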

I’ve also noticed that many of the p-values for the coefficients in the statsmodels results are NaN, even though the model has converged. Does this happen because of high multicollinearity among my features? I had assumed the lasso would be able to handle collinearity. Or perhaps my cross-validation isn’t providing the best penalty parameter?

From Elements of Statistical Learning:

…the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent [statistical consistency means that as the sample size grows, the estimates converge to the true values]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less “competition” from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

Asked by Joanne Cheung on December 30, 2020

It looks like the summary table for Logit always gives a NaN standard error for parameters whose coefficient estimate is shrunk exactly to zero. Is this where you are seeing NaN standard errors? If so, these are not really anomalous NaNs: there simply isn't a good way to come up with a standard error for a parameter that has been shrunk to zero (or at least one has not been implemented here). I'm also not sure how the SEs for the nonzero coefficients are computed here.

Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Thu, 28 Feb 2019   Pseudo R-squ.:                  0.1600
Time:                        20:48:32   Log-Likelihood:                -582.11
converged:                       True   LL-Null:                       -692.99
LLR p-value:                 3.763e-50
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept           0        nan        nan        nan         nan         nan
x1             0.8917      0.079     11.355      0.000       0.738       1.046
x2            -0.4602      0.073     -6.284      0.000      -0.604      -0.317
x3                  0        nan        nan        nan         nan         nan
==============================================================================
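For reference, a summary like the one above can be reproduced with a script along the following lines; the data-generating model and penalty value are illustrative assumptions (chosen so the penalty zeroes out the intercept and x3), not the code behind the original table:

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
n = 1000
x = np.random.normal(size=(n, 3))
lp = x[:, 0] - x[:, 1] / 2  # the true model uses only x1 and x2
pr = 1 / (1 + np.exp(-lp))
y = (np.random.uniform(size=n) < pr).astype(int)
df = pd.DataFrame({"y": y, "x1": x[:, 0], "x2": x[:, 1], "x3": x[:, 2]})

# With an L1 penalty strong enough to shrink the intercept and x3 exactly
# to zero, their std err / z / P>|z| entries appear as nan in summary().
model = sm.Logit.from_formula("y ~ x1 + x2 + x3", data=df)
result = model.fit_regularized(alpha=10)
print(result.summary())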

Answered by Kerby Shedden on December 30, 2020

Are you using statsmodels GLM or statsmodels Logit? They fit the same model, but the scaling of the penalty parameter for regularized fitting is different: GLM's alpha is on the per-observation scale, while Logit's alpha is applied to the total log-likelihood, so GLM needs alpha/n to match. Below is an example of how to get equivalent results from these two procedures and from sklearn.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Simulate data: only x1 and x2 enter the true model.
n = 1000
x = np.random.normal(size=(n, 3))
lp = x[:, 0] - x[:, 1] / 2
pr = 1 / (1 + np.exp(-lp))
y = (np.random.uniform(size=n) < pr).astype(int)
df = pd.DataFrame({"y": y, "x1": x[:, 0], "x2": x[:, 1], "x3": x[:, 2]})

# Total L1 penalty weight.
alpha = 10

# GLM scales the penalty per observation, so divide by n.
model1 = sm.GLM.from_formula("y ~ x1 + x2 + x3",
                             family=sm.families.Binomial(), data=df)
result1 = model1.fit_regularized(alpha=alpha/n, L1_wt=1)

# Logit applies the penalty to the total log-likelihood.
model2 = sm.Logit.from_formula("y ~ x1 + x2 + x3", data=df)
result2 = model2.fit_regularized(alpha=alpha)

# sklearn uses C = 1/alpha.  An explicit constant column is added and
# fit_intercept is turned off, so the intercept is penalized as in
# statsmodels; the l1 penalty requires the liblinear (or saga) solver.
x0 = np.concatenate((np.ones((n, 1)), x), axis=1)
model3 = LogisticRegression(C=1/alpha, penalty="l1", solver="liblinear",
                            fit_intercept=False)
result3 = model3.fit(x0, y)

print(result1.params, result2.params, result3.coef_)
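All three fits minimize the same penalized objective up to scaling, so the printed coefficient vectors should agree up to small optimizer differences, with the intercept and x3 typically at or near zero in this simulation.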

Answered by Kerby Shedden on December 30, 2020
