Penalized regression

  • The objective function for a GLM (as a minimization) can be written as

\[ \Lambda(X\beta) - \beta'(X'Y) = - \log L(\beta|Y) = \ell(\beta|Y). \]
  • With some abuse of notation

\[ \Lambda(\alpha) = \sum_{i=1}^n \Lambda(\alpha_i) \]

and \(\Lambda\) is the log MGF (cumulant generating function) of the \(Y_i\)'s, viewed as a function of the natural parameter.

Penalized regression

  • Objective:

\[ \Lambda(X\beta) - \beta'(X'Y) + {\cal P}(\beta) \]

Three examples we’ve seen (sketched in R below):

  • Gaussian: \(\Lambda(\alpha) = \alpha^2/2\) (up to the scale factor \(\sigma^2\) in the objective);

  • Logistic: \(\Lambda(\alpha) = \log(1 + e^{\alpha})\);

  • Poisson log-linear: \(\Lambda(\alpha) = e^{\alpha}\).
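As a minimal R sketch (the function names are ours, for illustration only), the three cumulant functions and the resulting objective \(\Lambda(X\beta) - \beta'(X'Y)\) can be written directly:

# Cumulant (log MGF) functions for the three canonical families
Lambda_gaussian = function(alpha) alpha^2 / 2
Lambda_logistic = function(alpha) log(1 + exp(alpha))
Lambda_poisson  = function(alpha) exp(alpha)

# Objective Lambda(X beta) - beta'(X'Y), up to terms not involving beta
glm_objective = function(beta, X, Y, Lambda) {
  eta = X %*% beta                     # natural parameter for each observation
  sum(Lambda(eta)) - sum(beta * crossprod(X, Y))
}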


Ridge regression

  • Ridge regression adds a quadratic penalty

\[ {\cal P}(\beta) = \frac{\lambda}{2} \|\beta\|_2^2 = \frac{\lambda}{2} \sum_{j=1}^p \beta_j^2. \]
  • Estimator

\[ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \left\{ \Lambda(X\beta) - \beta'(X'Y) + \frac{\lambda}{2} \|\beta\|^2_2 \right\} \]
  • Features are usually centered and scaled so coefficients are unitless.
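In the Gaussian case the penalized objective is quadratic, so the minimizer has the closed form \((X'X + \lambda I)^{-1} X'Y\). A small simulated sketch (the data and names here are made up; note also that glmnet scales its loss by \(1/n\), so its \(\lambda\) is on a different scale than the one used here):

set.seed(1)
n = 100; p = 10; lam = 2
X = scale(matrix(rnorm(n * p), n, p))   # centered and scaled features
Y = X %*% rnorm(p) + rnorm(n)
# Closed-form ridge estimate (X'X + lambda I)^{-1} X'Y
beta_ridge = solve(crossprod(X) + lam * diag(p), crossprod(X, Y))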

“Bayesian” interpretation

  • If we were to put a \(N(0, \lambda^{-1} I)\) prior on \(\beta\), the negative log posterior would be (up to an additive constant depending only on \((X,Y)\))

\[ -(X\beta)'Y + \Lambda(X\beta) + \frac{\lambda}{2} \|\beta\|^2_2 \]
  • The ridge estimator is therefore the MAP (maximum a posteriori) estimator, i.e. the posterior mode.
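A quick numerical check of this interpretation in the Gaussian case (a sketch with simulated data; the names are ours): the posterior mode found by a generic optimizer agrees with the closed-form ridge estimate.

set.seed(2)
n = 60; p = 5; lam = 1.5
X = matrix(rnorm(n * p), n, p)
Y = X %*% rnorm(p) + rnorm(n)
# Negative log posterior: Lambda(X beta) - beta'(X'Y) + (lambda/2)||beta||^2
neg_log_post = function(b) 0.5 * sum((X %*% b)^2) - sum(b * crossprod(X, Y)) + 0.5 * lam * sum(b^2)
beta_map   = optim(rep(0, p), neg_log_post, method = "BFGS")$par
beta_ridge = drop(solve(crossprod(X) + lam * diag(p), crossprod(X, Y)))
max(abs(beta_map - beta_ridge))         # ~ 0 up to optimizer tolerance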


Easy to add weights

\[ {\cal P}(\beta) = \frac{1}{2} \sum_{j=1}^p \lambda_j \beta_j^2 \]

where unpenalized variables have \(\lambda_j=0\).
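In glmnet, per-coefficient weights of this kind are expressed through the penalty.factor argument: a factor of 0 leaves that coefficient unpenalized (the factors are rescaled internally by glmnet). A sketch, reusing X and Y from the simulated example above:

library(glmnet)
w = c(0, rep(1, p - 1))                 # no penalty on the first feature
G_w = glmnet(X, Y, alpha = 0, penalty.factor = w)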

General quadratics also fine

\[ {\cal P}(\beta) = \frac{1}{2}\beta'Q\beta \]
  • Corresponds to a \(N(0, Q^{-1})\) prior on \(\beta\).

Overall objective

\[ \beta \mapsto \Lambda(X\beta) - \beta'(X'Y) + {\cal P}(\beta) = \ell(\beta) + \frac{1}{2} \beta'Q\beta \]

Solving for the ridge estimator

  • From the unpenalized case we know the second-order Taylor approximation of the (negative) log-likelihood \(\ell(\beta)\).

  • Differentiating the penalty:

\[\begin{split} \begin{aligned} \nabla {\cal P}(\beta) &= Q \beta \\ \nabla^2 {\cal P}(\beta) &= Q \end{aligned} \end{split}\]

Newton-Raphson / Fisher scoring

  • For canonical models

\[ \hat{\beta}^{(t)} = \hat{\beta}^{(t-1)} - (X'W^{(t-1)}X + Q)^{-1} \left(X'(E_{\hat{\beta}^{(t-1)}}(Y) - Y) + Q \hat{\beta}^{(t-1)}\right) \]
  • General form

\[ \hat{\beta}^{(t)} = \hat{\beta}^{(t-1)} - (X'W^{(t-1)}X + Q)^{-1} \left(X'W^{(t-1)}g'(\hat{\mu}^{(t-1)})(\hat{\mu}^{(t-1)} - Y) + Q \hat{\beta}^{(t-1)}\right) \]
  • Not the only way to solve this problem…
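For illustration, here is a minimal R sketch of this update for logistic ridge with \(Q = \lambda I\) (the names are ours, and this is not how glmnet solves the problem — it uses cyclical coordinate descent on a penalized weighted least squares approximation):

ridge_logistic_newton = function(X, y, lambda, n_iter = 25) {
  p = ncol(X)
  Q = lambda * diag(p)
  beta = rep(0, p)
  for (t in 1:n_iter) {
    eta = drop(X %*% beta)
    mu  = 1 / (1 + exp(-eta))           # E(Y) under current beta
    W   = mu * (1 - mu)                 # canonical-link weights
    grad = crossprod(X, mu - y) + Q %*% beta
    hess = crossprod(X, W * X) + Q
    beta = beta - drop(solve(hess, grad))
  }
  beta
}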


Why do we regularize?

  • Consider Gaussian case with \(Q=\lambda I\)

\[ \hat{\beta} = (X'X+\lambda I)^{-1}X'Y \sim N\left((X'X+\lambda I)^{-1}X'X\beta, \sigma^2 (X'X+\lambda I)^{-1}X'X(X'X+\lambda I)^{-1}\right) \]
  • Writing \(X=UDV'\) we see

\[\begin{split} \begin{aligned} (X'X+\lambda I)^{-1}X'X &= \sum_{i=1}^p \frac{d^2_i}{d^2_i+\lambda} V_iV_i' \\ (X'X+\lambda I)^{-1}X'X(X'X+\lambda I)^{-1} &= \sum_{i=1}^p \frac{d^2_i}{(d^2_i+\lambda)^2} V_iV_i' \\ \end{aligned} \end{split}\]
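These identities are easy to check numerically (a sketch with a simulated design; the names are ours):

set.seed(3)
n = 50; p = 5; lam = 3
X = matrix(rnorm(n * p), n, p)
s = svd(X)                              # X = U D V'
A = solve(crossprod(X) + lam * diag(p)) %*% crossprod(X)
B = s$v %*% diag(s$d^2 / (s$d^2 + lam)) %*% t(s$v)
max(abs(A - B))                         # ~ 0 up to rounding error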

Bias

  • Writing \(\beta=\sum_j \alpha_j V_j\):

\[ \|E[\hat{\beta}]-\beta\|^2_2 = \lambda^2 \sum_{i=1}^p \frac{\alpha_i^2}{(d_i^2 + \lambda)^2} \]

Variance

\[ \text{Tr}(\text{Var}(\hat{\beta})) = \sigma^2 \sum_{i=1}^p \frac{d^2_i}{(d^2_i+\lambda)^2} \leq \sigma^2 \sum_{i=1}^p \frac{1}{d_i^2} \]
  • The RHS is the trace of \(\text{Var}((X'X)^{-1}X'Y)\), i.e. the variance of the OLS estimator.

  • Ridge always has smaller variance (for any \(\lambda > 0\)).


MSE

\[ MSE(\hat{\beta}) = E[\|\hat{\beta}-\beta\|^2_2] = \|E[\hat{\beta}]-\beta\|^2_2 + \text{Tr}(\text{Var}(\hat{\beta})) \]
  • Assuming \(\beta \neq 0\),

\[\frac{d}{d\lambda}\left(\|E[\hat{\beta}]-\beta\|^2_2 + \text{Tr}(\text{Var}(\hat{\beta}))\right)\biggr|_{\lambda=0} < 0\]

  • Near \(\lambda = 0\) the squared bias is \(O(\lambda^2)\) while the variance decreases at a linear rate in \(\lambda\), so a small amount of ridge penalty lowers the MSE relative to least squares.
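The trade-off is easy to visualize from the spectral formulas above by plotting the squared bias, the variance, and their sum as functions of \(\lambda\) (a sketch; \(X\) and \(\beta\) are simulated):

set.seed(4)
n = 100; p = 20; sigma = 1
X = matrix(rnorm(n * p), n, p)
s = svd(X); d2 = s$d^2
beta = rnorm(p)
alpha = drop(crossprod(s$v, beta))      # coordinates of beta in the V basis
lambdas = seq(0, 20, length.out = 200)
bias2 = sapply(lambdas, function(l) sum(l^2 * alpha^2 / (d2 + l)^2))
vars  = sapply(lambdas, function(l) sigma^2 * sum(d2 / (d2 + l)^2))
plot(lambdas, bias2 + vars, type = "l", ylim = range(c(0, bias2 + vars)),
     xlab = expression(lambda), ylab = "MSE and its components")
lines(lambdas, bias2, lty = 2)          # squared bias grows with lambda
lines(lambdas, vars,  lty = 3)          # variance shrinks with lambda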

Ridge solution path for SPAM data

library(glmnet)
library(ElemStatLearn)
data(spam)
X_spam = scale(model.matrix(glm(spam ~ ., 
                                family=binomial, 
                                data=spam))[,-1]) # drop intercept
Y_spam = spam$spam == 'spam'
G_spam = glmnet(X_spam,
                Y_spam,
                alpha=0, # this makes it ridge
                family='binomial')
plot(G_spam)
Loading required package: Matrix
Loaded glmnet 4.0-2
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
[Figure ../_images/Ridge_1_3.png: ridge solution path for the SPAM data]

Cross-validation to choose \(\lambda\)

cvG_spam = cv.glmnet(X_spam,
                     Y_spam,
                     alpha=0,
                     family='binomial')
plot(cvG_spam)
[Figure ../_images/Ridge_3_0.png: cross-validation curve for the SPAM data]

Re-solve at \(\lambda_{\min}\)

beta_CV_spam = coef(G_spam, 
                    s=cvG_spam$lambda.min, 
                    exact=TRUE, 
                    x=X_spam, 
                    y=Y_spam,
                    alpha=0)

Ridge solution path for South African heart data

data(SAheart)
X_heart = scale(model.matrix(glm(chd ~ ., family=binomial, data=SAheart))[,-1]) # drop intercept
Y_heart = SAheart$chd
G_heart = glmnet(X_heart,
                 Y_heart,
                 alpha=0,
                 family='binomial')
plot(G_heart)
[Figure ../_images/Ridge_7_0.png: ridge solution path for the South African heart data]

Cross-validation to choose \(\lambda\)

cvG_heart = cv.glmnet(X_heart,
                      Y_heart,
                      alpha=0,
                      family='binomial')
plot(cvG_heart)
[Figure ../_images/Ridge_9_0.png: cross-validation curve for the South African heart data]

Re-solve at \(\lambda_{1SE}\)

beta_1se_heart = coef(G_heart, 
                      s=cvG_heart$lambda.1se, 
                      exact=TRUE, 
                      alpha=0,
                      x=X_heart, 
                      y=Y_heart)