Statistical Learning — Lasso

Stanley G
4 min read · Jan 17, 2022

In the statistical learning course, the instructors introduced Lasso regression, a linear regression method that performs shrinkage on the parameters of the linear model. The Lasso (Least Absolute Shrinkage and Selection Operator) can be used to combat collinearity and overfitting, and to perform variable selection, which facilitates interpretability. It is especially useful when the number of predictors p is much larger than the number of samples n. When p >> n, there is no unique solution minimizing the residual sum of squares, so the Lasso forces some parameters to be exactly 0. In other words, the Lasso favours sparsity (favours it, but does not guarantee it).
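
As a quick illustration, here is a minimal sketch (assuming scikit-learn and synthetic data; the dimensions and the penalty value are arbitrary) of the Lasso keeping only a handful of non-zero coefficients when p >> n:

```python
# Minimal sketch: Lasso in a p >> n setting with only a few informative predictors.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                               # far more predictors than samples
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 predictors actually matter
y = X @ true_beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1)                     # alpha plays the role of lambda
lasso.fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of", p)
```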

As I mentioned, this post is part of my Statistical Learning supplementary materials series, where I share my notes. Please feel free to comment!

First of all, in the residual sum of squares (RSS) setting, the Lasso penalizes the parameters of a linear model with an L1 norm, so we minimize the usual RSS plus the induced penalty as an unconstrained minimization problem:
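
\min_{\beta} \; \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2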

Using the Lagrangian method, we can transform it into a constrained minimization problem. The Lagrangian method solves a constrained optimization problem by directly solving the Lagrange first-order necessary conditions; in its basic form, it only works for optimization problems with equality constraints.

Solving a constrained problem of the form "minimize f(x) subject to g(x) = 0" is equivalent to solving the Lagrange first-order necessary conditions of the Lagrangian L(x, lambda) = f(x) + lambda g(x):
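
\nabla_x L(x, \lambda) = \nabla f(x) + \lambda\, \nabla g(x) = 0, \qquad \frac{\partial L}{\partial \lambda} = g(x) = 0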

First note that minimizing a function f is equivalent to minimizing f - s for any constant s.

To minimize our unconstrained optimization problem, we take the first derivative of RSS + penalty with respect to beta and set it to 0 (for beta_j ≠ 0 the derivative of |beta_j| is sign(beta_j); at 0 a subgradient takes its place):
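
\frac{\partial}{\partial \beta_j} \Big( \mathrm{RSS}(\beta) + \lambda \sum_{k=1}^{p} |\beta_k| \Big) = -2 \sum_{i=1}^{n} x_{ij} \Big( y_i - \beta_0 - \sum_{k=1}^{p} x_{ik} \beta_k \Big) + \lambda\, \mathrm{sign}(\beta_j) = 0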

Often s is treated as a "budget" term which allocates the room we have to adjust the parameters beta. Lambda and s work together in controlling the shrinkage of the parameters. Imagine we achieve a particular shrinkage effect on the parameters using some s* and lambda*. If we use a smaller budget s < s*, we are allowing less room for the parameters beta to grow, which corresponds to a larger penalty lambda > lambda* in the unconstrained form. Similarly, if we use a larger budget s > s*, we are allowing more room for beta, which corresponds to a smaller lambda < lambda*. In practice, it is easier to just tune lambda, as s is unknown to us.

Hence, we have the following formulation from the Lagrangian method:
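
\frac{\partial}{\partial \beta}\, \mathrm{RSS}(\beta) + \lambda\, \frac{\partial}{\partial \beta} \sum_{j=1}^{p} |\beta_j| = 0, \qquad \sum_{j=1}^{p} |\beta_j| = s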

As mentioned, the second equation in the system is just the budget that we set for the parameters beta. To relax the constraint, we can change the equality constraint into an inequality constraint and obtain the alternative formulation of the Lasso posted in many machine learning forums:
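
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s'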

s’ = s+t for some non-negative t. t is a slack variable.

So, here come a few questions:

  1. Does the Lasso always set some coefficients to 0? Not necessarily: as lambda approaches 0, the problem approaches an ordinary least squares problem and no coefficient needs to be exactly 0 (see the sketch after this list).
  2. Why does the Lasso favour sparsity? RSS is a convex (quadratic) function and the L1 constraint region is a convex set, so the optimization problem is a quadratic programming problem. The L1 constraint region is a polytope whose corners have some coordinates exactly equal to 0: in 2D it has 4 corners, in 3D it has 6, and in p dimensions it has 2p. In high-dimensional space, the RSS contours are very likely to first touch the constraint region at one of these corners (or on a low-dimensional face), which sets some coefficients exactly to 0 and hence yields sparsity.
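
Below is a rough sketch of the first point (again assuming scikit-learn; the data and the grid of penalty values are arbitrary): as lambda (called alpha in scikit-learn) shrinks towards 0, the Lasso fit approaches the ordinary least squares fit and no coefficients are forced to 0, while a larger lambda zeroes out more of them.

```python
# Rough illustration: small alpha ~ ordinary least squares, large alpha ~ many exact zeros.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(scale=0.3, size=n)

ols = LinearRegression().fit(X, y)
for alpha in [1.0, 0.1, 0.01, 0.0001]:
    lasso = Lasso(alpha=alpha, max_iter=50000).fit(X, y)
    zeros = np.sum(lasso.coef_ == 0)
    gap = np.max(np.abs(lasso.coef_ - ols.coef_))
    print(f"alpha={alpha:<7} zero coefficients={zeros:<3} max |lasso - ols|={gap:.4f}")
```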

References:

  1. https://en.wikipedia.org/wiki/Lagrange_multiplier
  2. https://hastie.su.domains/ElemStatLearn/

