Relaxed Lasso Logistic Regression: Estimating second penalty parameter

Question

I'm trying a relaxed lasso logistic regression by first using sklearn's cross validation to find an optimal penalty parameter (C = 1/lambda). Then, I use that parameter to fit statsmodel's logit model to the data (lambda = 1/C). At this step, I removed coefficients that are really small (< 1e-5). When I performed cross validation again on the reduced feature set to estimate my second penalty parameter, it ends up being stronger than the first penalty. This doesn't seem right to me, since the second step should be to find better estimates for features that you've already identified as non-zero from the first feature reduction step (so the lambda should be weaker).

I've also noticed that a lot of p values for the coefficients in the statsmodel results are NaN, even though the model has converged. Does this happen because there is high multicollinearity among my features? I had assumed that the Lasso would be able to handle collinearity. Perhaps my cross validation isn't providing the best penalty parameter?

From Elements of Statistical Learning:

...the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since Statistical consistency means as the sample size grows, the estimates converge to the true values. The variables in the second step have less “competition” from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

Kerby Shedden · Answer

It looks like the summary table for logit always gives a NaN standard error for parameters whose coefficient estimate is shrunk to zero.  Is this where you are seeing NaN standard errors?  If so, these are not really NaN's, it's just that there isn't a good way to come up with a standard error for parameters that are shrunk to zero (or at least one has not been implemented here).  I'm also not sure how the SE's for the nonzero coefficients are computed here.

Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Thu, 28 Feb 2019   Pseudo R-squ.:                  0.1600
Time:                        20:48:32   Log-Likelihood:                -582.11
converged:                       True   LL-Null:                       -692.99
                                        LLR p-value:                 3.763e-50
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept           0        nan        nan        nan         nan         nan
x1             0.8917      0.079     11.355      0.000       0.738       1.046
x2            -0.4602      0.073     -6.284      0.000      -0.604      -0.317
x3                  0        nan        nan        nan         nan         nan
==============================================================================

Kerby Shedden · Answer

Are you using statsmodels GLM or statsmodels Logit?  They are equivalent, but the scaling of the penalty parameter for regularized fitting is different.  Below is an example of how to get equivalent results from these two procedures and from sklearn.

import numpy as np                                                                                                                     
import pandas as pd                                                                                                                    
import statsmodels.api as sm                                                                                                           
from sklearn.linear_model import LogisticRegression

n = 1000                                                                                                                               
x = np.random.normal(size=(n, 3))                                                                                                      
lp = x[:, 0] - x[:, 1] / 2                                                                                                             
pr = 1 / (1 + np.exp(-lp))                                                                                                             
y = (np.random.uniform(size=n) < pr).astype(np.int)                                                                                    
df = pd.DataFrame({"y": y, "x1": x[:, 0], "x2": x[:, 1], "x3": x[:, 2]})

alpha = 10

model1 = sm.GLM.from_formula("y ~ x1 + x2 + x3", family=sm.families.Binomial(),                                                        
            data=df)                                                                                                                   
result1 = model1.fit_regularized(alpha=alpha/n, L1_wt=1)

model2 = sm.Logit.from_formula("y ~ x1 + x2 + x3", data=df)                                                                            
result2 = model2.fit_regularized(alpha=alpha)

x0 = np.concatenate((np.ones((n, 1)), x), axis=1)                                                                                      
model3 = LogisticRegression(C=1/alpha, penalty='l1')                                                                                   
result3 = model3.fit(x0, y)

print(result1.params, result2.params, result3.coef_)

Relaxed Lasso Logistic Regression: Estimating second penalty parameter

2 Answers

Add your own answers!

Ask a Question