LASSO

  • The LASSO adds a penalty to the objective

\[ {\cal P}(\beta) = \lambda \|\beta\|_1 = \sum_{j=1}^p \lambda_j |\beta_j|. \]
  • The indices for which \(\lambda_j=0\) are unpenalized.
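
As a small illustration (a sketch; the numbers are made up), the penalty is just a weighted \(\ell_1\) norm, and a zero weight leaves that coordinate unpenalized:

weighted_l1 = function(beta, lam) {
    # sum_j lambda_j * |beta_j|; lambda_j = 0 means beta_j is unpenalized
    sum(lam * abs(beta))
}
weighted_l1(c(1.5, -2, 0.3), c(1, 1, 0))  # = 1.5 + 2 = 3.5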


The LASSO picture


Why LASSO?

  • There are several sparsity inducing penalties out there: SCAD, MCP, etc.

  • But we know less about their “solutions” than we do about the LASSO’s, because the LASSO is a convex problem.

  1. Solvable even in very high dimensions. (“same” for SCAD, MCP, etc.)

  2. KKT conditions give an exact description of the solution however we decide to solve it (not same for SCAD, MCP, etc), at least under uniqueness.

Will the LASSO solution be unique?

In broad generality, yes: see Uniqueness of the LASSO.


Solving the LASSO

  • Many algorithms out there. Many of you will end up taking EE364 (Convex Optimization) where you’ll see things like this in more detail.

  • We’ll consider two simple algorithms: coordinate descent, proximal gradient descent.

  • Everyone has their own way of talking about LASSO. These notes are just my way.


Coordinate descent

  • If the penalty \({\cal P}\) were smooth, then we could use something like Newton-Raphson as we did for fitting the GLM.

  • The penalty is differentiable everywhere except points where one of the \(\beta_j\)’s is 0: \(\implies\) not smooth.

  • However, the penalty is separable meaning

\[ {\cal P}(\beta) = \sum_{j=1}^p p_j(\beta_j). \]
  • Block separability allows for a partition \(\{1,\dots,p\}= \cup_{g \in G} g\); the group LASSO is an example of this.

  • When the penalty is separable, the problem can be solved by successively solving low-dimensional problems.


Basic iteration

  • Each iteration \(t\) chooses a coordinate \(j\) to update. Let the current position be \(\hat{\beta}^{(t)}\); define

\[ \hat{\beta}^{(t+1)}_j = \text{argmin}_{\beta_j \in \mathbb{R}} \left[\beta_j \mapsto \Lambda (X_{-j}\hat{\beta}^{(t)}_{-j} + X_j \beta_j) - \beta_j(X_j'Y) + p_j(\beta_j) \right] \]
  • Many space and time-saving tricks can be exploited here.

  • When solution is sparse, this can find the solution quickly!

  • A descent method: objective is monotone decreasing across iterations.

  • For a convex loss and separable penalty, this converges to a minimizer (Tseng).

  • References: glmnet.


Coordinate descent and Gibbs sampler

  • Suppose we have a prior with independent coordinates with negative log density \({\cal P}(\beta)\).

  • The Gibbs sampler considers the same “objective” that a coordinate descent algorithm does.

  • Instead of minimizing this function, it draws from a density proportional to its negative exponential.

  • Each iteration \(t\) chooses a coordinate \(j\) to update. Let the current position be \(\hat{\beta}^{(t)}\); draw \(\hat{\beta}^{(t+1)}_j\) from a density proportional to:

\[ \beta_j \mapsto \exp\left(-\left[\Lambda (X_{-j}\hat{\beta}^{(t)}_{-j} + X_j \beta_j) - \beta_j(X_j'Y) + p_j(\beta_j)\right] \right) \]

Gibbs: sample

Coordinate Descent: optimize


Coordinate descent

coordinate_descent = function() {
    beta = rep(0, num_features)
    for (i in 1:enough_iterations) {
        for (j in 1:num_features) {
            # coordinate objective above, viewed as a function of beta[j]
            cur_obj = objective(X, Y, penalty, "binomial", beta, j)
            beta[j] = argmin(cur_obj)   # optimize the coordinate
        }
    }
    beta
}
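
For concreteness, here is a runnable version of the same loop for the Gaussian (squared-error) loss, where each coordinate update has a closed-form soft-thresholding solution. This is a minimal sketch under that swapped-in loss, not glmnet's algorithm; lam and nsteps are arbitrary choices.

soft_threshold = function(z, thresh) sign(z) * pmax(abs(z) - thresh, 0)

lasso_cd_gaussian = function(X, Y, lam, nsteps=100) {
    # minimize (1/2n) ||Y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent
    n = nrow(X)
    p = ncol(X)
    beta = rep(0, p)
    scale_j = colSums(X^2) / n                           # curvature of each coordinate objective
    for (it in 1:nsteps) {
        for (j in 1:p) {
            r_j = Y - X[, -j, drop=FALSE] %*% beta[-j]   # partial residual
            z_j = sum(X[, j] * r_j) / n
            beta[j] = soft_threshold(z_j, lam) / scale_j[j]
        }
    }
    beta
}

This should agree, up to solver tolerance, with coef(glmnet(X, Y, standardize=FALSE, intercept=FALSE), s=lam).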

Gibbs sampler

gibbs_sampler = function() {
    beta = rep(0, num_features)
    for (i in 1:enough_iterations) {
        for (j in 1:num_features) {
            # same coordinate objective, with the prior's negative log-density as the penalty
            cur_post = objective(X, Y, prior, "binomial", beta, j)
            beta[j] = sample_univariate(exp(-cur_post))   # sample from exp(-objective) instead of optimizing
        }
    }
    beta
}
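
To make the sample-versus-optimize parallel concrete, here is a runnable Gibbs sketch for a Gaussian likelihood with an independent Gaussian (ridge-type) prior, where every full conditional is itself Gaussian. This is a hedged illustration of the analogy, not a Bayesian LASSO sampler; tau2 and nsteps are arbitrary.

gibbs_ridge_gaussian = function(X, Y, tau2=1, nsteps=1000) {
    # Y ~ N(X beta, I), beta_j ~ N(0, tau2) independently
    # (the negative log-prior is a ridge penalty)
    p = ncol(X)
    beta = rep(0, p)
    draws = matrix(NA, nsteps, p)
    for (it in 1:nsteps) {
        for (j in 1:p) {
            r_j = Y - X[, -j, drop=FALSE] %*% beta[-j]    # same partial residual as coordinate descent
            prec_j = sum(X[, j]^2) + 1 / tau2             # conditional precision
            mean_j = sum(X[, j] * r_j) / prec_j           # conditional mean = coordinate minimizer
            beta[j] = rnorm(1, mean_j, sqrt(1 / prec_j))  # draw instead of setting beta[j] = mean_j
        }
        draws[it, ] = beta
    }
    draws
}

Replacing the rnorm draw by mean_j recovers coordinate descent for the ridge-penalized problem.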

Logistic regression with offset

  • Each minimization is a logistic regression with a single feature \(X_j\) and offset \(X_{-j}\hat{\beta}^{(t)}_{-j}\).

  • Objective with offset vector \(Z\)

\[ \beta \mapsto -(X\beta + Z)'Y + \sum_{i=1}^n \log(1 + \exp(X_i'\beta + Z_i)) \]
  • Stationarity conditions

\[ X' \left(Y - \frac{\exp(X\widehat{\beta} + Z)}{1 + \exp(X\widehat{\beta} + Z)} \right) = 0 \]
  • Generally speaking, the offset \(Z\) introduces an affine predictor \(\beta \mapsto X\beta + Z\).


Example with offset

X_sim = matrix(rnorm(200), 100, 2)
Z_sim = rnorm(100)
Y_sim = rbinom(100, 1, 0.5)
M = glm(Y_sim ~ X_sim - 1, offset=Z_sim, family=binomial);
affine_pred = X_sim %*% coef(M) + Z_sim
pi_hat = exp(affine_pred) / (1 + exp(affine_pred))
t(X_sim) %*% (Y_sim - pi_hat)
A matrix: 2 × 1 of type dbl
3.587408e-15
-7.105427e-15

  • At iteration \(t\), for each \(j\), if

\[\begin{split} \begin{aligned} \lambda &> \left|\frac{\partial}{\partial \beta_j} \left[\Lambda(X_{-j} \widehat{\beta}_{-j} + X_j \beta_j) - \beta_j \cdot (X_j'Y)\right] \biggl|_{\beta_j=0} \right| \\ &= \left| X_j'\left( \pi(\hat{\beta}_{-j}, 0_j) - Y \right) \right| \end{aligned} \end{split}\]

then

\[ 0 = \text{argmin}_{\beta_j \in \mathbb{R}} \left[\beta_j \mapsto \Lambda (X_{-j}\hat{\beta}^{(t)}_{-j} + X_j \beta_j) - \beta_j(X_j'Y) + p_j(\beta_j) \right] \]
  • Easy to check whether a coefficient should be set to (or stay at) 0…


LASSO solution path for SPAM data

library(glmnet)
library(ElemStatLearn)
data(spam)
X_spam = scale(model.matrix(glm(spam ~ .,
                                family=binomial,
                                data=spam))[,-1]) # drop intercept
Y_spam = spam$spam == 'spam'
G_spam = glmnet(X_spam, Y_spam, family='binomial')
plot(G_spam)
Loading required package: Matrix
Loaded glmnet 4.0-2
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
[plot: ../_images/LASSO_I_3_3.png]

Cross-validation to choose \(\lambda\)

cvG_spam = cv.glmnet(X_spam, Y_spam, family='binomial')
plot(cvG_spam)
[plot: ../_images/LASSO_I_5_0.png]

Re-solve at \(\lambda_{\min}\)

beta_CV_spam = coef(G_spam, 
                    s=cvG_spam$lambda.min, 
                    exact=TRUE, 
                    x=X_spam, 
                    y=Y_spam)
sum(beta_CV_spam[-1] != 0) # not very sparse in this case...
53

Predictions at new X

newX = X_spam[1:30,]
newY_hat = predict(G_spam, 
	           newX,
                   s=cvG_spam$lambda.min, 
                   type="response",
                   exact=TRUE, 
                   x=X_spam, 
                   y=Y_spam)
newY_hat[1:5]
  1. 0.573368457876324
  2. 0.979598740492402
  3. 0.999971117964625
  4. 0.747691413387677
  5. 0.747601439683215

LASSO solution path for South African heart data

data(SAheart)
X_heart = scale(model.matrix(glm(chd ~ ., family=binomial, data=SAheart))[,-1]) # drop intercept
Y_heart = SAheart$chd
G_heart = glmnet(X_heart, Y_heart, family='binomial')
plot(G_heart)
[plot: ../_images/LASSO_I_11_0.png]

Cross-validation to choose \(\lambda\)

cvG_heart = cv.glmnet(X_heart, Y_heart, family='binomial')
plot(cvG_heart)
[plot: ../_images/LASSO_I_13_0.png]

Re-solve at \(\lambda_{1SE}\)

beta_1se_heart = coef(G_heart, 
                      s=cvG_heart$lambda.1se, 
                      exact=TRUE, 
                      x=X_heart, 
                      y=Y_heart)
sum(beta_1se_heart[-1] != 0) # eliminates a few 
5

Proximal gradient descent

  • Updates all coordinates each step.

  • This method takes advantage of the fact that

\[ \text{argmin}_{\beta} \frac{L}{2}\|Z-\beta\|^2_2 + {\cal P}(\beta) \]

is soft-thresholding; most importantly, it is fast to compute.

  • Solution is

\[ \hat{\beta}_i = \max(|Z_i| - \lambda_i / L, 0) \cdot \text{sign}(Z_i) \]
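
A direct transcription of this closed form (a small sketch; the function name is mine):

prox_l1 = function(Z, lam, L=1) {
    # argmin_beta (L/2) ||Z - beta||^2_2 + sum_i lam_i |beta_i|: soft-thresholding at lam / L
    sign(Z) * pmax(abs(Z) - lam / L, 0)
}
prox_l1(c(3, -0.5, 1.2), lam=1)   # returns 2, 0, 0.2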

Basic update

  • Given current state \(\hat{\beta}^{(t)}\) define

\[ Q^L(\beta;\hat{\beta}^{(t)}) = \ell(\hat{\beta}^{(t)}) + \nabla \ell(\hat{\beta}^{(t)})'(\beta - \hat{\beta}^{(t)}) + \frac{L}{2} \|\beta - \hat{\beta}^{(t)}\|^2_2 + {\cal P}(\beta) \]
  • The next iterate is defined as

\[ \hat{\beta}^{(t+1)} = \text{argmin}_{\beta} Q^L(\beta; \hat{\beta}^{(t)}) = \text{argmin}_{\beta} \frac{L}{2} \|\beta - \hat{\beta}^{(t)} + L^{-1} \nabla \ell(\hat{\beta}^{(t)})\|^2_2 + {\cal P}(\beta). \]
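
Putting the gradient step and the soft-thresholding step together for the Gaussian loss gives the familiar ISTA-style iteration. The following is a minimal sketch under that loss, not a production solver; L is taken as the largest eigenvalue of \(X'X/n\), and lam, nsteps are arbitrary.

prox_grad_lasso = function(X, Y, lam, nsteps=500) {
    # minimize (1/2n) ||Y - X beta||^2 + lam * ||beta||_1 by proximal gradient descent
    n = nrow(X)
    L = max(eigen(crossprod(X) / n, symmetric=TRUE, only.values=TRUE)$values)
    beta = rep(0, ncol(X))
    for (it in 1:nsteps) {
        grad = -crossprod(X, Y - X %*% beta) / n          # gradient of the smooth part
        Z = beta - grad / L                               # gradient step
        beta = drop(sign(Z) * pmax(abs(Z) - lam / L, 0))  # proximal (soft-thresholding) step
    }
    beta
}

With this choice of L each step is a majorization step, so the objective decreases monotonically (see the MM discussion below).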

Comments

  • Quadratic approximation similar to Newton-Raphson.

  • If \(L\) is chosen so that \(L \cdot I \geq \nabla^2 \ell(\beta)\) for all \(\beta\), then this is an example of a Majorization-Minimization (MM) method.

  • The stepsize \(L\) only needs to be chosen so that each step is a descent step; backtracking is typically used.

  • There are several accelerated proximal gradient descent algorithms out there, some local to us: NESTA, TFOCS.

  • Proximal algorithm is key ingredient to all of these methods.


Majorization-Minimization

  • Suppose we want to solve

\[ \text{minimize}_{\beta} f(\beta) \]
  • We assume we have a function \(Q(\beta;\beta')\) such that

\[\begin{split} \begin{aligned} Q(\beta;\beta) &= f(\beta) \\ Q(\beta;\beta') & \geq f(\beta) \qquad \forall \beta, \beta' \end{aligned} \end{split}\]
  • Set

\[ \hat{\beta}^{(t+1)} = \text{argmin}_{\beta} Q(\beta;\beta^{(t)}) \]
  • Then, \(f(\hat{\beta}^{(t+1)}) \leq f(\hat{\beta}^{(t)})\). Yields a descent method!
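
The descent property follows from two uses of the majorization property and the tangency condition:

\[ f(\hat{\beta}^{(t+1)}) \leq Q(\hat{\beta}^{(t+1)};\hat{\beta}^{(t)}) \leq Q(\hat{\beta}^{(t)};\hat{\beta}^{(t)}) = f(\hat{\beta}^{(t)}), \]

the middle inequality holding because \(\hat{\beta}^{(t+1)}\) minimizes \(Q(\cdot;\hat{\beta}^{(t)})\).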


Gradient descent and Langevin sampler

  • Just as coordinate descent “looks” like Gibbs sampling, there is an analogous MCMC sampler for gradient descent (at least for smooth penalties…)

  • Langevin sampling

\[ \hat{\beta}^{(t+1)} = \hat{\beta}^{(t)} - \alpha \nabla \left[ \ell(\hat{\beta}^{(t)}) + {\cal P}(\hat{\beta}^{(t)}) \right] + \sqrt{2 \alpha} \cdot \xi^{(t)}\]

with \(\xi^{(t)} \sim N(0,I)\).
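
As a concrete sketch for a smooth penalty, here is unadjusted Langevin for ridge-penalized logistic regression; lam, alpha and nsteps are arbitrary choices, and in practice one often adds a Metropolis correction (MALA).

langevin_ridge_logistic = function(X, Y, lam=1, alpha=1e-3, nsteps=2000) {
    # target density proportional to exp(-[ell(beta) + (lam/2) ||beta||^2]),
    # with ell the logistic negative log-likelihood
    p = ncol(X)
    beta = rep(0, p)
    draws = matrix(NA, nsteps, p)
    for (it in 1:nsteps) {
        pi_hat = 1 / (1 + exp(-X %*% beta))
        grad = -crossprod(X, Y - pi_hat) + lam * beta         # gradient of ell + penalty
        beta = drop(beta - alpha * grad + sqrt(2 * alpha) * rnorm(p))
        draws[it, ] = beta
    }
    draws
}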


Proximal map

  • The map

\[ Z \mapsto \text{argmin}_{\beta} \frac{1}{2}\|\beta-Z\|^2_2 + {\cal P}(\beta) \]

is called the proximal operator of \({\cal P}\).

  • Proximal gradient descent methods rely on the proximal map of \({\cal P}\) being fast to evaluate.

  • Each step above evaluates the proximal map of \(L^{-1} \cdot {\cal P}\).

  • Reference: Parikh and Boyd


Solving the LASSO

LASSO in bound form

\[ \text{minimize}_{\beta} \frac{1}{2} \|Y - X\beta\|^2_2 \]

subject to \({\cal P}(\beta) \leq C\).

  • Proximal gradient descent replaces soft-thresholding with the projection map

\[ Z \mapsto \text{argmin}_{\beta:\|\beta\|_1 \leq C} \frac{L}{2} \|\beta-Z\|^2_2. \]
  • Solving the LASSO in this form will repeatedly project

\[ \hat{\beta}^{(t)} - \frac{1}{L} \nabla \ell(\hat{\beta}^{(t)}) \]

onto the convex set \(\left\{\beta: \|\beta\|_1 \leq C \right\}.\)
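
One standard way to compute this projection is by sorting and then soft-thresholding at a data-dependent level (a sketch following the usual construction; the function name is mine):

project_l1_ball = function(Z, C) {
    # Euclidean projection of Z onto {beta : ||beta||_1 <= C}
    if (sum(abs(Z)) <= C) return(Z)
    u = sort(abs(Z), decreasing=TRUE)
    cs = cumsum(u)
    rho = max(which(u - (cs - C) / seq_along(u) > 0))  # number of coordinates kept
    theta = (cs[rho] - C) / rho                        # threshold putting the result on the boundary
    sign(Z) * pmax(abs(Z) - theta, 0)
}
project_l1_ball(c(3, -1, 0.5), C=2)   # returns 2, 0, 0 (l1 norm exactly 2)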


KKT conditions

  • How will we know we have solved

\[ \text{minimize}_{\beta} \ell(\beta) + {\cal P}(\beta) \ ? \]
  • KKT conditions read

\[ \nabla \ell (\hat{\beta}) + \hat{u} = 0 \iff X'Y = X' \nabla \Lambda(X\hat{\beta}) + \hat{u} \]

where

\[ \hat{u} \in \partial {\cal P}(\hat{\beta}). \]
  • The set \(\partial {\cal P}(\beta)\) is called the subdifferential of \({\cal P}\) at \(\beta\).

  • If \({\cal P}\) is differentiable at \(\beta\), then the subdifferential is the singleton \(\{\nabla {\cal P}(\beta)\}\).

  • Can also deduce stationarity conditions from proximal gradient algorithm.


Stationarity from proximal gradient

  • If \(\hat{\beta}\) is a stationary point of proximal gradient descent with step size \(L^{-1}\), then

\[ \hat{\beta} = S_{\lambda L^{-1}}\left(\hat{\beta} - L^{-1} \nabla \ell(\hat{\beta})\right). \]
  • If \(\hat{\beta}_i=0\) then

\[ |L^{-1}\nabla \ell(\hat{\beta})_i| \leq \lambda L^{-1} \ \implies \ |\nabla \ell(\hat{\beta})_i| \leq \lambda \]
  • If \(\hat{\beta}_i \neq 0\) then

\[ \hat{\beta}_i - L^{-1} \nabla \ell(\hat{\beta})_i = \hat{\beta}_i + \text{sign}(\hat{\beta}_i) \cdot \lambda L^{-1} \ \implies \ -\nabla\ell(\hat{\beta})_i = \lambda \cdot \text{sign}(\hat{\beta}_i) \]

  • Very briefly, what is the subdifferential for our \({\cal P}\)?

  • Note that

\[ {\cal P}(\beta) = \sup_{u: |u_i| \leq \lambda_i} u'\beta. \]
  • For penalties of the form

\[ {\cal P}(\beta) = \sup_{u \in K} u'\beta, \]

the subdifferential is

\[ u \in \partial {\cal P}(\beta) \iff \beta \in N_u(K) \]

with \(N_u(K)\) the set of normal directions at \(u \in K\).

  • Also characterized as

\[ \left\{u \in K: u'\beta = {\cal P}(\beta) \right\}. \]

Normal cones of a cube


  • Suppose \(\hat{\beta}_{-E}=0, \min_{j \in E}|\hat{\beta}_j|>0\) and \(\text{sign}(\hat{\beta}_E)=s_E\).

  • Then,

\[ \hat{u} \in \partial{\cal P}(\hat{\beta}) \iff \|\hat{u}_{-E}/\lambda_{-E}\|_{\infty} \leq 1, \; \hat{u}_E = \lambda_E s_E. \]
  • In other words: active signs are fixed, inactive score is not too big.

X = matrix(rnorm(300), 100, 3)
beta = rep(0, 3)
beta[1:2] = 3
Y = X %*% beta + rnorm(100)
G = glmnet(X, Y, standardize=FALSE, intercept=FALSE)
lam = G$lam[10]
lam
1.44969589216051

beta_hat = coef(G, s=lam, exact=TRUE, x=X, y=Y)[2:4]
beta_hat
  1. 1.75879739460179
  2. 1.60004492874385
  3. 0

score = (t(X) %*% (X %*% beta_hat - Y) / 100)
score
lam
A matrix: 3 × 1 of type dbl
-1.44969577
-1.44969589
0.01360631
1.44969589216051

c(beta_hat[1:2], score[3])
  1. 1.75879739460179
  2. 1.60004492874385
  3. 0.0136063095635466

Let’s perturb \(Y\) a little and see what happens.

Y_star = Y + 0.1 * rnorm(100)
G_star = glmnet(X, Y_star, intercept=FALSE, standardize=FALSE)
beta_star = coef(G_star, s=lam, exact=TRUE, x=X, y=Y_star)[2:4]
beta_star
  1. 1.75907492754101
  2. 1.60473445347705
  3. 0
score_star = t(X) %*% (X %*% beta_star - Y_star) / 100
score_star
A matrix: 3 × 1 of type dbl
-1.44969498
-1.44969589
0.01240643
c(beta_star[1:2], score_star[3])
c(beta_hat[1:2], score[3])
  1. 1.75907492754101
  2. 1.60473445347705
  3. 0.0124064281382372
  1. 1.75879739460179
  2. 1.60004492874385
  3. 0.0136063095635466

Take-away

  • The pair \((\hat{\beta}_E, \hat{u}_{-E})\) parameterizes the sufficient statistic \(X'Y\).


KKT conditions: deeper dive

Restricted problem

  • Suppose variables \(E\) are selected with signs \(s_E\).

  • KKT equations yield

\[\begin{split} \begin{aligned} \nabla \ell(\hat{\beta})_E &= - \lambda_E s_E \\ \|\nabla \ell(\hat{\beta})_{-E}/\lambda_{-E}\|_{\infty} &\leq 1 \\ \end{aligned} \end{split}\]
  • Having fixed \(E\) we can define restricted objective

\[ \beta_E \overset{\ell^E}{\mapsto} \ell(\beta_E, 0) \]
  • First equation reads

\[ \nabla \ell^E(\hat{\beta}_E) = -\lambda_E s_E \]

with constraint \(\text{sign}(\hat{\beta}_E) = s_E\).


Gaussian case

  • Let’s take one Newton-Raphson step from \(\hat{\beta}_E\):

\[ \bar{\beta}_E = \hat{\beta}_E + (X_E'X_E)^{-1} \lambda_E s_E \]

or

\[ \hat{\beta}_E = \bar{\beta}_E - (X_E'X_E)^{-1} \lambda_E s_E \]
  • In the Gaussian case this reads (see the numerical check below)

\[ \text{sign}\left((X_E'X_E)^{-1}(X_E'Y - \lambda_E s_E)\right) = s_E. \]

  • Allowing some \(\lambda_i\)’s to be zero changes things a little: we only have such sign conditions for the coordinates with \(\lambda_i > 0\).
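
This relation can be checked numerically with glmnet (a sketch; glmnet's Gaussian objective carries a 1/n factor, so its lambda is multiplied by n = 100 when matching the display above):

set.seed(10)
X_g = matrix(rnorm(300), 100, 3)
Y_g = X_g %*% c(3, 3, 0) + rnorm(100)
G_g = glmnet(X_g, Y_g, standardize=FALSE, intercept=FALSE)
lam_g = G_g$lambda[10]
beta_g = coef(G_g, s=lam_g, exact=TRUE, x=X_g, y=Y_g)[2:4]
E = which(beta_g != 0)
s_E = sign(beta_g[E])
X_E = X_g[, E, drop=FALSE]
# glmnet's solution vs (X_E'X_E)^{-1}(X_E'Y - n * lambda * s_E); the columns should agree
cbind(beta_g[E],
      solve(crossprod(X_E), crossprod(X_E, Y_g) - 100 * lam_g * s_E))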


Inactive block

  • Let’s look at our other set of KKT equations.

\[\begin{split} \begin{aligned} \nabla \ell(\hat{\beta}_E, 0)_{-E} &\approx \nabla \ell(\bar{\beta}_E, 0)_{-E} + \nabla^2 \ell(\bar{\beta}_E,0)[-E,E](\hat{\beta}_E - \bar{\beta}_E) \\ & \approx \nabla \ell(\bar{\beta}_E, 0)_{-E} - \nabla^2 \ell(\bar{\beta}_E,0)[-E,E] \, \nabla^2 \ell^E(\bar{\beta}_E)^{-1} \lambda_E s_E \end{aligned} \end{split}\]
  • We know that

\[ \|\nabla \ell(\hat{\beta}_E, 0)_{-E}/\lambda_{-E}\|_{\infty} \leq 1. \]
  • For any fixed \(E\), \(\nabla \ell(\bar{\beta}_E, 0)_{-E}\) will be asymptotically Gaussian under standard conditions. It will have mean 0 if the model \(E\) is correctly specified.

  • The vector

\[ \nabla^2 \ell(\bar{\beta}_E,0)[-E,E] \, \nabla^2 \ell^E(\bar{\beta}_E)^{-1} \lambda_E s_E \]

will behave as a fixed vector (with signs fixed).

Gaussian case

\[ \|(X_{-E}'(Y - P_EY) + X_{-E}'X_E(X_E'X_E)^{-1} \lambda_Es_E)/\lambda_{-E} \|_{\infty} \leq 1. \]

Model selection consistency

  • Now suppose that the model is correct with parameter \(\beta^*\) such that \(\beta^*_{-A}=0\).

  • When will the LASSO recover \(A\) with correct signs \(\text{sign}(\beta^*_A)\)? (For simplicity let’s now set all \(\lambda_i\)’s equal).

  • References (among others): Zhao and Yu, Candes and Plan


  • Our look at the active block of the KKT conditions says that if we solve the restricted MLE on \(A\) and then shrink, we must still end up with the same signs.

  • Up to small remainder this event is

\[ \text{sign}\left(\bar{\beta}_A - \lambda \nabla^2 \ell^A(\bar{\beta}_A)^{-1} s^*_A\right) = s^*_A \]

with \(s^*_A=\text{sign}(\beta^*_A)\).

  • If the entries of \(\beta_A^*\) are far enough from 0 (a beta-min condition), then subtracting the second term will not change the signs of \(\bar{\beta}_A\).


  • Let’s look at the inactive block. We saw that if the active set is going to be \(A\) and signs \(s_A^*\) then

\[ \|(\nabla \ell(\bar{\beta}_A, 0)_{-A} - I_A \lambda s_A^*) / \lambda\|_{\infty} = \|\nabla \ell(\bar{\beta}_A, 0)_{-A}/\lambda - I_A s_A^*\|_{\infty} \leq 1. \]
  • For this to happen, essentially we need the irrepresentable condition (Zhao and Yu)

\[ \max_{\|u_A\|_{\infty}\leq 1} \|I_A u_A\|_{\infty} \leq 1 \]
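
In the Gaussian case \(I_A\) is, up to the approximation above, \(X_{-A}'X_A(X_A'X_A)^{-1}\), and the maximum over \(\|u_A\|_{\infty}\leq 1\) is the largest row-wise \(\ell_1\) norm of that matrix, so the condition can be checked directly. A sketch with a made-up independent design:

irrepresentable_check = function(X, A) {
    # max_{||u_A||_inf <= 1} || X_{-A}' X_A (X_A'X_A)^{-1} u_A ||_inf
    # = largest row-wise l1 norm of X_{-A}' X_A (X_A'X_A)^{-1}
    X_A = X[, A, drop=FALSE]
    X_I = X[, -A, drop=FALSE]
    M = crossprod(X_I, X_A) %*% solve(crossprod(X_A))
    max(rowSums(abs(M)))
}
set.seed(3)
X_ic = matrix(rnorm(1000), 200, 5)
irrepresentable_check(X_ic, A=c(1, 2))   # typically well below 1 for independent columns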

How large should \(\lambda\) be?

  • The irrepresentable condition suggests \(\lambda\) should be such that

\[ \|\nabla \ell(\bar{\beta}_A, 0)_{-A}/\lambda\|_{\infty} < 1. \]

  • A typical choice

\[ \lambda \approx C \cdot E[\|\nabla \ell(\beta_A^*,0)\|_{\infty}], \qquad C > 1. \]
  • Such a choice often guarantees the LASSO is consistent. (Large literature on this, c.f. Negahban et al.)
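
In the Gaussian case with noise level sigma, \(\nabla \ell(\beta^*) = -X'\epsilon\), so this expectation is easy to estimate by simulation. A sketch; C, sigma and the design are arbitrary, and glmnet's lambda corresponds to this value divided by n:

set.seed(7)
n = 100; p = 20; sigma = 1
X_l = matrix(rnorm(n * p), n, p)
# Monte Carlo estimate of E || X' epsilon ||_inf under the model
score_sup = replicate(2000, max(abs(crossprod(X_l, sigma * rnorm(n)))))
C = 1.1
C * mean(score_sup)   # candidate lambda on this (unscaled-loss) scale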