Penalized#

Bias-variance tradeoff#

  • One goal of a regression analysis is to “build” a model that predicts well; the AIC / \(C_p\) and cross-validation selection criteria are based on this goal.

  • This is slightly different than the goal of making inferences about \(\beta\) that we’ve focused on so far.

  • What does “predict well” mean?

\[\begin{split} \begin{aligned} MSE_{pop}({\cal M}) &= E[MSE({\cal M})] \\ &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}}(X_{new}))^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}\end{split}\]
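
One way to see this decomposition (using that \(Y_{new}\) is independent of \(\widehat{Y}_{new,{\cal M}}\) and that \(E[W^2] = {\text{Var}}(W) + E[W]^2\)):

\[\begin{split} \begin{aligned} {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) &= {\text{Var}}(Y_{new} - \widehat{Y}_{new,{\cal M}}) + \left({\mathbb{E}}\left[Y_{new} - \widehat{Y}_{new,{\cal M}}\right]\right)^2 \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}\end{split}\]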
  • Can we take an estimator for a model \({\cal M}\) and make it better in terms of \(MSE\)?

Shrinkage estimators: one sample problem#

  • Let’s restrict attention to the intercept-only model Y ~ 1:

\[ Y_i = \mu + \epsilon_i \]
  • Instead of \(\hat{\mu}=\bar{Y}\) as an estimate of \(\mu\) we’ll use

\[ \hat{\mu}_{\alpha}(Y) = \alpha \cdot \bar{Y}. \]
  • We shrink the sample mean towards 0!

A little simulation#

  1. Generate \(Y_{100 \times 1} \sim N(\mu \cdot 1, 5^2 I_{100 \times 100})\), with \(\mu=0.5\).

  2. For \(0 \leq \alpha \leq 1\), set \(\hat{\mu}_{\alpha} = \alpha \cdot \bar{Y}.\)

  3. Compute \(MSE(\hat{\mu}_{\alpha}) = \frac{1}{100}\sum_{i=1}^{100} (\hat{\mu}_{\alpha} - 0.5)^2\)

  4. Repeat 1000 times, plot average of \(MSE(\hat{\mu}_{\alpha})\).

For what value of \(\alpha\) is \(\hat{\mu}_{\alpha}\) unbiased?

Is this the best estimate of \(\mu\) in terms of MSE?

mu = 0.5
sigma = 5
nsample = 100
ntrial = 1000

MSE = function(mu.hat, mu) {
  # average squared error of the estimate mu.hat of the mean vector mu
  return(sum((mu.hat - mu)^2) / length(mu))
}
alpha = seq(0, 1, length=20)
mse = numeric(length(alpha))
# theoretical bias and variance of alpha * Ybar
bias = (1 - alpha) * mu
variance = alpha^2 * sigma^2 / nsample

for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu   # one sample of size nsample
  for (j in 1:length(alpha)) {
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample))
  }
}
mse = mse / ntrial   # average over the ntrial replications

Mean-squared error#

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage factor,', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
     cex.lab=1.2)
[Figure: simulated \(\text{E[MSE]}\) of \(\hat{\mu}_{\alpha}\) as a function of the shrinkage factor \(\alpha\).]

Bias and variance#

plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage factor,', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
     cex.lab=1.2)
lines(alpha, bias^2, col='green', lwd=2)
lines(alpha, variance, col='blue', lwd=2)
[Figure: \(\text{Bias}^2\) (green), \(\text{Variance}\) (blue), and \(\text{E[MSE]}\) (red) as functions of the shrinkage factor \(\alpha\).]

Shrinkage & Penalties#

  • Shrinkage can be thought of as “constrained” or “penalized” minimization.

  • Constrained form:

\[\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to $\mu^2 \leq C$}\]
  • Lagrange multiplier form: equivalent to (for some \(\lambda=\lambda_C\))

\[\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 + \lambda \cdot \mu^2\]
  • As we vary \(\lambda\) we solve all versions of the constrained form.

Solving for \(\widehat{\mu}_{\lambda}\)#

Math aside#

  • Differentiating: \(- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0\)

  • Solving \(\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.\)

  • As \(\lambda \rightarrow 0\), \(\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.\)

  • As \(\lambda \rightarrow \infty\), \(\widehat{\mu}_{\lambda} \rightarrow 0.\)
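
As a quick numerical sanity check (a sketch, not part of the original notes), we can verify this closed form for one simulated sample and one value of \(\lambda\) with R’s optimize:

set.seed(10)
Y = rnorm(100) * 5 + 0.5
lambda = 30
# penalized criterion from above
penalized = function(mu) { sum((Y - mu)^2) + lambda * mu^2 }
optimize(penalized, interval=c(-10, 10))$minimum    # numerical minimizer
length(Y) / (length(Y) + lambda) * mean(Y)          # closed form n/(n+lambda) * Ybar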

We see that \(\widehat{\mu}_{\lambda} = \bar{Y} \cdot \left(\frac{n}{n+\lambda}\right).\)#

In other words, considering all penalized estimators traces out the MSE curve…#

Reparametrization#

# invert alpha = n / (n + lambda), i.e. lambda = n / alpha - n
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,',  lambda)), 
     ylab=expression(paste('MSE(', lambda, ')')))
lines(lam, bias^2, col='green', lwd=2)
lines(lam, variance, col='blue', lwd=2)
[Figure: \(\text{Bias}^2\) (green), \(\text{Variance}\) (blue), and \(\text{E[MSE]}\) (red) as functions of the penalty parameter \(\lambda\).]

How much to shrink?#

  • In our one-sample example,

\[\begin{split}\begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \cdot \bar{Y}) + \text{Bias}(\alpha \cdot \bar{Y})^2 + \text{Var}(Y_{new}) \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 + \text{Var}(Y_{new}) \end{aligned}\end{split}\]
  • Differentiating and solving:

\[\begin{split}\begin{aligned} 0 &= -2 \mu^2(1 - \alpha^*) + 2 \frac{\alpha^* \sigma^2}{n} \\ \alpha^* & = \frac{\mu^2}{\mu^2+\sigma^2/n} = \frac{(\mu/(\sigma/\sqrt{n}))^2}{(\mu/(\sigma/\sqrt{n}))^2 + 1} \\ &= \frac{0.5^2}{0.5^2+25/100} = 0.5 \end{aligned}\end{split}\]
  • In practice we might hope to estimate MSE with cross-validation.


plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
[Figure: simulated \(\text{E[MSE]}\) as a function of \(\alpha\), with the theoretical optimum marked by a dashed vertical line.]
  • Let’s see how our theoretical choice of \(\alpha\) matches the MSE curve estimated from our simulation with samples of size 100.

Penalties & Priors#

  • Minimizing \(\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \cdot \mu^2\) is similar to computing the “MLE” of \(\mu\) if the likelihood were proportional to

\[\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \cdot \mu^2\right) \right).\]
  • If \(\lambda=m\), an integer, then \(\widehat{\mu}_{\lambda}\) is the sample mean of \((Y_1, \dots, Y_n,0 ,\dots, 0) \in \mathbb{R}^{n+m}\) (see the quick check after this list).

  • This is equivalent to adding some data with \(Y=0\) (and some rows to \(X\)).

  • To a Bayesian, this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.
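
A quick numerical check of the \(\lambda = m\) point above (a sketch, not part of the original notes): appending \(m\) zeros to the sample and taking the ordinary mean reproduces \(\widehat{\mu}_{\lambda}\).

set.seed(11)
Z = rnorm(100) * 5 + 0.5
m = 25                      # integer penalty, lambda = m
mean(c(Z, rep(0, m)))       # sample mean of the augmented data
sum(Z) / (length(Z) + m)    # mu.hat_lambda = sum(Y_i) / (n + lambda)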

AIC as penalized regression#

  • Model selection with \(C_p\) (or AIC with \(\sigma^2\) assumed known) is a version of penalized regression.

  • The best subsets version of AIC (which is not exactly equivalent to `step`) is

\[ \hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0 \]

where

\[ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} \]

is called the \(\ell_0\) norm.

  • The \(\ell_0\) penalty can be thought of as a measure of the complexity of the model. Most other penalties are similar measures of complexity.
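
To make the \(\ell_0\) criterion concrete, here is a small best-subsets sketch (not from the original notes) on simulated data with three candidate predictors, taking \(\sigma^2\) as known and equal to 1 and not counting the intercept in \(\|\beta\|_0\):

set.seed(12)
n = 100
X = matrix(rnorm(n * 3), n, 3)
colnames(X) = c("X1", "X2", "X3")
Y = 2 * X[, 1] + rnorm(n)
sigma2 = 1
# enumerate all 2^3 subsets and compute RSS / sigma^2 + 2 * (size of subset)
subsets = expand.grid(X1=c(FALSE, TRUE), X2=c(FALSE, TRUE), X3=c(FALSE, TRUE))
crit = apply(subsets, 1, function(keep) {
    keep = as.logical(keep)
    fit = if (any(keep)) lm(Y ~ X[, keep, drop=FALSE]) else lm(Y ~ 1)
    sum(resid(fit)^2) / sigma2 + 2 * sum(keep)
})
cbind(subsets, crit)   # the smallest criterion will typically select only X1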

Ridge regression#

  • Assume that the columns \((X_j)_{1 \leq j \leq p}\) have mean zero and standard deviation 1, and that \(Y\) has mean zero.

  • This is called the standardized model.

  • The ridge estimator is

\[\begin{split} \begin{aligned} \hat{\beta}_{\lambda} &= \text{argmin}_{\beta} \frac{1}{2n}\|Y-X\beta\|^2_2 + \frac{\lambda}{2} \|\beta\|^2_2 \\ &= \text{argmin}_{\beta} MSE_{\lambda}(\beta) \end{aligned} \end{split}\]
  • Corresponds (through Lagrange multiplier) to a quadratic constraint on \({\beta_{}}\)’s.

  • This is the natural generalization of the penalized version of our shrinkage estimator.

Solving the normal equations#

Math aside#

  • Normal equations

\[\frac{\partial}{\partial {\beta_{l}}} MSE_{\lambda}({\beta_{}}) = - \frac{1}{n} (Y - X{\beta_{}})^TX_l + \lambda {\beta_{l}}\]
  • \[ \implies - \frac{1}{n}(Y - X{\widehat{\beta}_{\lambda}})^T X_l + \lambda {\widehat{\beta}_{l,\lambda}} = 0, \qquad 1 \leq l \leq p\]

  • In matrix form

\[-\frac{X^TY}{n} + \left(\frac{X^TX}{n} + \lambda I\right) {\widehat{\beta}_{\lambda}} = 0.\]
  • Or

\[{\widehat{\beta}_{\lambda}} = \left(\frac{X^TX}{n} + \lambda I\right)^{-1} \frac{X^TY}{n}.\]
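
As a small sketch (not part of the original notes), we can compute \(\widehat{\beta}_{\lambda}\) directly from this closed form on simulated standardized data and check that the gradient of \(MSE_{\lambda}\) vanishes there:

set.seed(13)
n = 50
p = 3
X = scale(matrix(rnorm(n * p), n, p))   # columns with mean 0, SD 1
Y = X %*% c(2, 0, -1) + rnorm(n)
Y = Y - mean(Y)                         # centered response
lambda = 0.5
# beta.hat_lambda = (X'X/n + lambda I)^{-1} (X'Y/n)
beta.ridge = solve(crossprod(X) / n + lambda * diag(p), crossprod(X, Y) / n)
beta.ridge
# gradient of MSE_lambda at beta.ridge -- should be numerically zero
-crossprod(X, Y - X %*% beta.ridge) / n + lambda * beta.ridge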

Ridge regression in R#

library(lars)
data(diabetes)
library(MASS)
diabetes.ridge = lm.ridge(diabetes$y ~ diabetes$x, 
                          lambda=seq(0, 100, 0.5))
plot(diabetes.ridge, lwd=3)
[Figure: ridge coefficient paths for the diabetes data as \(\lambda\) varies.]

Choosing \(\lambda\)#

  • If we knew \(E[MSE_{\lambda}]\) as a function of \(\lambda\) then we would simply choose the \(\lambda\) that minimizes it.

  • To do this, we need to estimate it.

  • A popular method is cross-validation: estimate \(MSE_{\lambda}\) as a function of \(\lambda\) by breaking the data up into smaller groups and using part of the data to predict the rest.

  • We saw this in diagnostics (Cook’s distance measured the fit with and without each point in the data set) and model selection.

\(K\)-fold cross-validation for penalized model#

  • Fix a model (i.e. fix \(\lambda\)). Break data set into \(K\) approximately equal sized groups \((G_1, \dots, G_K)\).

for (i in 1:K) {
   # use all groups except `G_i` to fit the model,
   # then predict the outcome in group `G_i` as `Yhat_i`
}
# estimate CV = sum([sum((Y[G_i]-Yhat_i)^2) for i in 1:K]) / n

# K-fold CV estimate of MSE for the shrinkage estimator alpha * mean(Z),
# returning the CV estimate and a rough standard error
CV = function(Z, alpha, K=5) {
    cve = numeric(K)
    n = length(Z)
    for (i in 1:K) {
        # indices of the i-th (contiguous) fold
        g = seq(as.integer((i-1)*n/K) + 1, as.integer(i*n/K))
        # estimate mu on the other folds, then compute error on the held-out fold
        mu.hat = mean(Z[-g]) * alpha
        cve[i] = sum((Z[g] - mu.hat)^2)
    }
    return(c(sum(cve) / n, sd(cve) / sqrt(n)))
}
alpha = seq(0.0, 1, length=20)
avg.cv = numeric(length(alpha))
for (i in 1:ntrial) {
     Z = rnorm(nsample) * sigma + mu
     for (j in 1:length(alpha)) {
         current_cv = CV(Z, alpha[j])
         avg.cv[j] = avg.cv[j] + current_cv[1]
     }
}
avg.cv = avg.cv / ntrial
plot(alpha, avg.cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[avg.cv == min(avg.cv)]), col='red', lty=2)
[Figure: average 5-fold CV curve over many simulated samples, with the theoretical \(\alpha^*\) (blue) and the CV-minimizing \(\alpha\) (red) marked by dashed lines.]
  • Let’s see how the parameter chosen by 5-fold CV compares to our theoretical choice.

  • The curve above shows what would happen if we could repeat this and average over many samples. In reality, we only get one sample.

# CV curve from a single sample of size 100
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}

# repeat with a single larger sample of size 1000
cv = numeric(length(alpha))
cv.sd = numeric(length(alpha))
nsample = 1000
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1]
    cv.sd[j] = current_cv[2]
}
plot(alpha, cv, type='l', lwd=2, col='green',
     xlab='Shrinkage parameter, alpha', ylab='CV(alpha)', xlim=c(0,1))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)
[Figure: CV curve from a single sample, with the theoretical \(\alpha^*\) (blue) and the CV-minimizing \(\alpha\) (red) marked by dashed lines.]

Ridge regression#

par(cex.lab=1.5)
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', 
     ylab='GCV', type='l', lwd=3, col='orange')
select(diabetes.ridge)
modified HKB estimator is 5.462251 
modified L-W estimator is 7.641667 
smallest value of GCV  at 3 
[Figure: GCV as a function of \(\lambda\) for ridge regression on the diabetes data.]
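
A small usage note (not part of the original notes): the \(\lambda\) minimizing GCV can be read off the fitted object directly and should agree with the select() output above.

diabetes.ridge$lambda[which.min(diabetes.ridge$GCV)]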

LASSO#

  • Another popular penalized regression technique (again, we work with the standardized model).

  • The LASSO estimate is

\[ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2n}\|Y-X\beta\|^2_2 + \lambda \|\beta\|_1\]
  • In the objective above,

\[ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| \]

is the \(\ell_1\) norm.

  • Corresponds (through Lagrange multiplier) to an \(\ell_1\) constraint on \({\beta_{}}\)’s.

LASSO#

  • In theory and practice, it works well when many \({\beta_{j}}\)’s are 0 and, unlike ridge, it gives “sparse” solutions.

  • It is a (computable) approximation to the best subsets AIC model.

  • It is computable because the minimization problem is a convex problem.
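
To connect sparsity to the one-sample shrinkage example (a sketch, not part of the original notes): with an \(\ell_1\) penalty the solution is a soft-thresholded sample mean, which is exactly zero once \(\lambda \geq |\bar{Y}|\).

set.seed(14)
Y = rnorm(100) * 5 + 0.5
lambda = 0.3
# l1-penalized one-sample criterion, scaled as in the LASSO definition above
lasso.obj = function(mu) { sum((Y - mu)^2) / (2 * length(Y)) + lambda * abs(mu) }
optimize(lasso.obj, interval=c(-10, 10))$minimum
sign(mean(Y)) * max(abs(mean(Y)) - lambda, 0)   # soft-thresholded sample mean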

LASSO#

library(lars)
data(diabetes)
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')
plot(diabetes.lasso)
[Figure: LASSO coefficient paths for the diabetes data, as computed by lars.]

Cross-validation for the LASSO#

The lars package has a built-in function to estimate CV.

par(cex.lab=1.5)
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')
[Figure: 10-fold cross-validation error as a function of the LASSO fraction for the diabetes data.]
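
If we capture the return value of cv.lars (a sketch, not from the original notes; it assumes the returned list has $index and $cv components, as in current versions of lars), we can read off the fraction minimizing the estimated CV error:

cv.out = cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')
cv.out$index[which.min(cv.out$cv)]   # fraction of the l1 norm minimizing CV error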

glmnet#

library(glmnet)
G = glmnet(diabetes$x, diabetes$y)
plot(G)
[Figure: glmnet coefficient paths for the diabetes data.]

plot(cv.glmnet(diabetes$x, diabetes$y))
cv.glmnet(diabetes$x, diabetes$y)$lambda.1se
5.8326421644991
[Figure: cv.glmnet cross-validation curve for the diabetes data.]
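
A usage sketch (not part of the original notes): storing a single cv.glmnet fit avoids refitting with different random folds, and coef extracts the coefficients at the CV-selected values of \(\lambda\). Exact numbers vary with the fold assignment.

cvG = cv.glmnet(diabetes$x, diabetes$y)
cvG$lambda.min                  # lambda minimizing the CV estimate of error
cvG$lambda.1se                  # largest lambda within one standard error of the minimum
coef(cvG, s='lambda.1se')       # sparse coefficient vector at lambda.1se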

HIV example#

Load the data#

url_X = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/NRTI_X.csv'
url_Y = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/NRTI_Y.txt'
X_HIV = read.table(url_X, header=FALSE, sep=',')
Y_HIV = read.table(url_Y, header=FALSE, sep=',')
Y_HIV = as.matrix(Y_HIV)[,1]
X_HIV = as.matrix(X_HIV)

Plot the solution path#

library(glmnet)
G = glmnet(X_HIV, Y_HIV)
plot(G)
[Figure: LASSO coefficient paths for the HIV data (glmnet).]

CV = cv.glmnet(X_HIV, Y_HIV)
plot(CV)
[Figure: cross-validation curve for the LASSO on the HIV data.]
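
As a final sketch (not part of the original notes), we can ask which predictors the CV-selected LASSO model keeps; the exact set depends on the random folds.

beta.hat = as.matrix(coef(CV, s='lambda.1se'))   # coefficients at lambda.1se
sum(beta.hat != 0)                               # number of nonzero entries (including the intercept)
rownames(beta.hat)[beta.hat[, 1] != 0]           # names of the selected columns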