Why does the Lasso provide Variable Selection?

Cross Validated · Asked by Zhi Zhao on November 29, 2021

I’ve been reading Elements of Statistical Learning, and I would like to know why the Lasso provides variable selection and ridge regression doesn’t.

Both methods minimize the residual sum of squares and have a constraint on the possible values of the parameters $\beta$. For the Lasso, the constraint is $||\beta||_1 \le t$, whereas for ridge it is $||\beta||_2 \le t$, for some $t$.

I’ve seen the diamond vs ellipse picture in the book and I have some intuition as for why the Lasso can hit the corners of the constrained region, which implies that one of the coefficients is set to zero. However, my intuition is rather weak, and I’m not convinced. It should be easy to see, but I don’t know why this is true.

So I guess I’m looking for a mathematical justification, or an intuitive explanation of why the contours of the residual sum of squares are likely to hit the corners of the $||\beta||_1$ constrained region (whereas this situation is unlikely if the constraint is $||\beta||_2$).

4 Answers

I recently created a blog post comparing ridge and lasso using a toy data frame of shark attacks. It helped me understand the behavior of the two algorithms, especially when correlated variables are present. Take a look, and also see this SO question for an explanation of the shrinkage toward zero.

Answered by Atakan on November 29, 2021

I think there are excellent answers already, but just to add some intuition concerning the geometric interpretation:

"The lasso performs $L1$ shrinkage, so that there are "corners" in the constraint, which in two dimensions corresponds to a diamond. If the sum of squares "hits'' one of these corners, then the coefficient corresponding to the axis is shrunk to zero.

As $p$ increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set equal to zero. Hence, the lasso performs shrinkage and (effectively) subset selection.

In contrast with subset selection, ridge performs a soft thresholding: as the smoothing parameter is varied, the sample path of the estimates moves continuously to zero."

Source: https://onlinecourses.science.psu.edu/stat857/book/export/html/137

The effect can be nicely visualized in the plots below, where the colored lines are the paths of the regression coefficients shrinking towards zero.

[Figure: ridge regression coefficient paths as the penalty increases]

"Ridge regression shrinks all regression coefficients towards zero; the lasso tends to give a set of zero regression coefficients and leads to a sparse solution."

[Figure: lasso coefficient paths, with coefficients reaching exactly zero]

Source: https://onlinecourses.science.psu.edu/stat857/node/158
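
To reproduce the flavour of these shrinkage paths, here is a minimal sketch (assuming scikit-learn and numpy, and using a made-up synthetic data set rather than the one from the linked course): it fits ridge and the lasso over a grid of penalty values and counts how many coefficients are exactly zero. The lasso's count grows as the penalty increases, while ridge's stays at zero.

```python
# Sketch: compare how many coefficients ridge and lasso set exactly to zero
# as the penalty grows (synthetic data, invented for illustration).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # two truly irrelevant features
y = X @ true_beta + rng.normal(scale=0.5, size=n)

for alpha in np.logspace(-2, 1, 8):
    lasso_coef = Lasso(alpha=alpha).fit(X, y).coef_
    ridge_coef = Ridge(alpha=alpha).fit(X, y).coef_
    # As alpha grows the lasso zeroes out coefficients; ridge only shrinks them.
    print(f"alpha={alpha:7.3f}  lasso zeros={np.sum(lasso_coef == 0.0)}  "
          f"ridge zeros={np.sum(ridge_coef == 0.0)}")
```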

Answered by vonjd on November 29, 2021

Suppose we have a data set with y = 1 and x = [1/10 1/10] (one data point, two features). One solution is to put all the weight on a single feature; another is to weight both features equally. That is, we can pick either w = [10 0] or w = [5 5].

Note that both weight vectors have the same L1 penalty, but the more spread-out one has a lower penalty under the L2 norm.
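
A quick numerical check of this toy example (a minimal sketch, assuming numpy) makes the asymmetry explicit: both weight vectors fit the single data point exactly and tie on the L1 penalty, but the spread-out one wins under the L2 penalty.

```python
# Toy example from above: y = 1, x = [0.1, 0.1], two candidate weight vectors.
import numpy as np

x = np.array([0.1, 0.1])
candidates = {"spread w=[5, 5]": np.array([5.0, 5.0]),
              "sparse w=[10, 0]": np.array([10.0, 0.0])}

for name, w in candidates.items():
    print(name,
          "| fit:", x @ w,                   # both reproduce y = 1 exactly
          "| L1 penalty:", np.abs(w).sum(),  # both equal 10
          "| L2 penalty:", (w ** 2).sum())   # 50 for [5, 5] vs 100 for [10, 0]
```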

Answered by blarg on November 29, 2021

Let's consider a very simple model: $y = \beta x + e$, with an L1 penalty on $\hat{\beta}$ and a least-squares loss function on $\hat{e}$. We can expand the expression to be minimized as:

$\min_{\hat{\beta}} \; y^Ty - 2y^Tx\hat{\beta} + \hat{\beta}x^Tx\hat{\beta} + 2\lambda|\hat{\beta}|$

Keep in mind this is a univariate example, with $\beta$ and $x$ being scalars, to show how LASSO can send a coefficient to zero. This can be generalized to the multivariate case.

Let us assume the least-squares solution is some $\hat{\beta} > 0$, which is equivalent to assuming that $y^Tx > 0$, and see what happens when we add the L1 penalty. With $\hat{\beta} > 0$, $|\hat{\beta}| = \hat{\beta}$, so the penalty term is equal to $2\lambda\hat{\beta}$. The derivative of the objective function w.r.t. $\hat{\beta}$ is:

$-2y^Tx + 2x^Tx\hat{\beta} + 2\lambda$

which evidently has solution $\hat{\beta} = (y^Tx - \lambda)/(x^Tx)$.

Obviously by increasing $\lambda$ we can drive $\hat{\beta}$ to zero (at $\lambda = y^Tx$). However, once $\hat{\beta} = 0$, increasing $\lambda$ won't drive it negative, because, writing loosely, the instant $\hat{\beta}$ becomes negative, the derivative of the objective function changes to:

$-2y^Tx + 2x^Tx\hat{\beta} - 2\lambda$

where the flip in the sign of $\lambda$ is due to the absolute-value nature of the penalty term; when $\hat{\beta}$ becomes negative, the penalty term becomes equal to $-2\lambda\hat{\beta}$, and taking the derivative w.r.t. $\hat{\beta}$ gives $-2\lambda$. This leads to the solution $\hat{\beta} = (y^Tx + \lambda)/(x^Tx)$, which is obviously inconsistent with $\hat{\beta} < 0$ (given that the least-squares solution is $> 0$, which implies $y^Tx > 0$, and $\lambda > 0$). Moving $\hat{\beta}$ from $0$ to anything $< 0$ would therefore increase both the L1 penalty and the squared-error term (since we would be moving farther from the least-squares solution), so we don't; we just stick at $\hat{\beta} = 0$.

It should be intuitively clear the same logic applies, with appropriate sign changes, for a least-squares solution with $\hat{\beta} < 0$.

With the ridge (squared) penalty $\lambda\hat{\beta}^2$, however, the derivative becomes:

$-2y^Tx + 2x^Tx\hat{\beta} + 2\lambda\hat{\beta}$

which evidently has solution $\hat{\beta} = y^Tx/(x^Tx + \lambda)$. Obviously no increase in $\lambda$ will drive this all the way to zero. So the L2 penalty can't act as a variable selection tool without some mild ad-hockery such as "set the parameter estimate equal to zero if it is less than $\epsilon$".

Obviously things can change when you move to multivariate models; for example, moving one parameter estimate around might force another one to change sign. But the general principle is the same: the L2 penalty function can't get you all the way to zero because, writing very heuristically, it in effect adds to the "denominator" of the expression for $\hat{\beta}$, whereas the L1 penalty function can, because it in effect adds to the "numerator".
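
To see the univariate closed forms above in action, here is a minimal numerical sketch (assuming numpy and a made-up data set with $y^Tx > 0$): the lasso solution hits exactly zero once $\lambda \ge y^Tx$, while the ridge solution only shrinks toward zero.

```python
# Univariate closed forms from the derivation above (synthetic data, y'x > 0):
#   lasso: beta_hat = (y'x - lambda) / x'x, stuck at zero once lambda >= y'x
#   ridge: beta_hat = y'x / (x'x + lambda), never exactly zero for finite lambda
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)   # least-squares solution is positive

yx, xx = y @ x, x @ x
for lam in [0.0, 0.5 * yx, 1.0 * yx, 2.0 * yx]:
    beta_lasso = max(yx - lam, 0.0) / xx
    beta_ridge = yx / (xx + lam)
    print(f"lambda = {lam:8.2f}   lasso beta = {beta_lasso:.4f}   ridge beta = {beta_ridge:.4f}")
```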

Answered by jbowman on November 29, 2021
