Bias-variance tradeoff

  • One goal of a regression analysis is to "build" a model that predicts well.

  • This is slightly different than the goal of making inferences about $\beta$ that we've focused on so far.

  • What does "predict well" mean? $$ \begin{aligned} MSE_{pop}({{\cal M}}) &= {\mathbb{E}}\left((Y_{new} - \widehat{Y}_{new,{\cal M}})^2\right) \\ &= {\text{Var}}(Y_{new}) + {\text{Var}}(\widehat{Y}_{new,{\cal M}}) + \\ & \qquad \quad \text{Bias}(\widehat{Y}_{new,{\cal M}})^2. \end{aligned}$$

  • With $\sigma^2$ known, the quantity $C_p({\cal M})$ is an unbiased estimator of this quantity.

In [1]:
options(repr.plot.width=5, repr.plot.height=5)

Shrinkage estimators

  1. Generate $Y_{100 \times 1} \sim N(\mu \cdot 1, 5^2 I_{100 \times 100})$, with $\mu=0.5$.
  2. For $0 \leq \alpha \leq 1$, set $\hat{Y}(\alpha) = \alpha \bar{Y}.$
  3. Compute $MSE(\hat{Y}(\alpha)) = \sum_{i=1}^{100} (\hat{Y}_{\alpha} - 0.5)^2$
  4. Repeat 500 times, plot average of $MSE(\hat{Y}(\alpha))$.

For what value of $\alpha$ is $\hat{Y}(\alpha)$ unbiased?

Is this the best estimate of $\mu$ in terms of MSE?

In [2]:
nsample = 100
ntrial = 1000
mu = 0.5
sigma = 5
MSE = function(mu.hat, mu) {
  return(sum((mu.hat - mu)^2) / length(mu))
In [3]:
alpha = seq(0.0,1,length=20)
mse = numeric(length(alpha))
bias = (1 - alpha) * mu
variance = alpha^2 * 25 / 100
for (i in 1:ntrial) {
  Z = rnorm(nsample) * sigma + mu
  for (j in 1:length(alpha)) {
    mse[j] = mse[j] + MSE(alpha[j] * mean(Z) * rep(1, nsample), mu * rep(1, nsample)) / ntrial
In [4]:
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
In [5]:
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')), 
lines(alpha, bias^2, col='green', lwd=2)
lines(alpha, variance, col='blue', lwd=2)

Shrinkage & Penalties

  • Shrinkage can be thought of as "constrained" or "penalized" minimization.

  • Constrained form: $$\text{minimize}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to $\mu^2 \leq C$}$$

  • Lagrange multiplier form: equivalent to $$\widehat{\mu}_{\lambda} = \text{argmin}_{\mu} \sum_{i=1}^n (Y_i - \mu)^2 + \lambda \cdot \mu^2$$ for some $\lambda=\lambda_C$.

  • As we vary $\lambda$ we solve all versions of the constrained form.

Solving for $\widehat{\mu}_{\lambda}$

  • Differentiating: $- 2 \sum_{i=1}^n (Y_i - \widehat{\mu}_{\lambda}) + 2 \lambda \widehat{\mu}_{\lambda} = 0$
  • Solving $\widehat{\mu}_{\lambda} = \frac{\sum_{i=1}^n Y_i}{n + \lambda} = \frac{n}{n+\lambda} \overline{Y}.$
  • As $\lambda \rightarrow 0$, $\widehat{\mu}_{\lambda} \rightarrow {\overline{Y}}.$
  • As $\lambda \rightarrow \infty$ $\widehat{\mu}_{\lambda} \rightarrow 0.$

We see that $\widehat{\mu}_{\lambda} = \bar{Y} \cdot \left(\frac{n}{n+\lambda}\right).$

In other words, considering all penalized estimators traces out the MSE curve above.

In [6]:
lam = nsample / alpha - nsample
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,',  lambda)), 
     ylab=expression(paste('MSE(', lambda, ')')))
In [7]:
plot(lam, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Penalty parameter,',  lambda)), 
     ylab=expression(paste('MSE(', lambda, ')')))
lines(lam, bias^2, col='green', lwd=2)
lines(lam, variance, col='blue', lwd=2)

How much to shrink?

  • In our one-sample example,
  • $$\begin{aligned} MSE_{pop}(\alpha) &= {\text{Var}}( \alpha \bar{Y}) + \text{Bias}(\alpha \bar{Y})^2 \\ &= \frac{\alpha^2 \sigma^2}{n} + \mu^2 (1 - \alpha)^2 \end{aligned}$$
  • Differentiating and solving: $$\begin{aligned} 0 &= -2 \mu^2(1 - \alpha^*) + 2 \frac{\alpha^* \sigma^2}{n} \\ \alpha^* & = \frac{\mu^2}{\mu^2+\sigma^2/n} \\ &= \frac{0.5^2}{0.5^2+1/10} = 0.71 \end{aligned}$$

We see that the optimal $\alpha$ depends on the unknown $SNR=\mu/(\sigma/\sqrt{n})$.

In practice we might hope to estimate MSE with cross-validation.

Let's see how our theoretical choice matches the MSE on our 100 sample.

In [8]:
plot(alpha, mse, type='l', lwd=2, col='red', ylim=c(0, max(mse)),
     xlab=expression(paste('Shrinkage parameter ', alpha)), 
     ylab=expression(paste('MSE(', alpha, ')')))
abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)

Penalties & Priors

  • Minimizing $\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$ is similar to computing "MLE" of $\mu$ if the likelihood was proportional to $$\exp \left(-\frac{1}{2\sigma^2}\left( \|Y-\mu\|^2_2 + \lambda \mu^2\right) \right).$$

  • If $\lambda=m$, an integer, then $\widehat{\mu}_{\lambda}$ is the sample mean of $(Y_1, \dots, Y_n,0 ,\dots, 0) \in \mathbb{R}^{n+m}$.

  • This is equivalent to adding some data with $Y=0$.

  • To a Bayesian, this extra data is a prior distribution and we are computing the so-called MAP or posterior mode.

AIC as penalized regression

  • Model selection with $C_p$ (or AIC with $\sigma^2$ assumed known) is a version of penalized regression.

  • The best subsets version of AIC (which is not exactly equivalent to step) $$ \hat{\beta}_{AIC} = \text{argmin}_{\beta} \frac{1}{\sigma^2}\|Y-X\beta\|^2_2 + 2 \|\beta\|_0 $$ where $$ \|\beta\|_0 = \#\left\{j : \beta_j \neq 0 \right\} $$ is called the $\ell_0$ norm.

  • The $\ell_0$ penalty can be thought of as a measure of complexity of the model. Most penalties are similar versions of complexity.

Penalized regression in general

  • Not all biased models are better – we need a way to find "good" biased models.

  • Inference ($F$, $\chi^2$ tests, etc) is not quite exact for biased models. Though, there has been some recent work to address the issue of post-selection inference, at least for some penalized regression problems.

  • Heuristically, "large $\beta$" (measured by some norm) is interpreted as "complex model". Goal is really to penalize "complex" models, i.e. Occam’s razor.

  • If truth really is complex, this may not work! (But, it will then be hard to build a good model anyways ... (statistical lore))

Ridge regression

  • Assume that columns $(X_j)_{1 \leq j \leq p}$ have zero mean, and SD 1 and $Y$ has zero mean.

  • This is called the standardized model.

  • The ridge estimator is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2}\|Y-X\beta\|^2_2 + \frac{\lambda}{2} \|\beta\|^2_2$$

  • Corresponds (through Lagrange multiplier) to a quadratic constraint on ${\beta_{}}$’s.

  • This is the natural generalization of the penalized version of our shrinkage estimator.

Solving the normal equations

  • Normal equations $$\frac{\partial}{\partial {\beta_{l}}} SSE_{\lambda}({\beta_{}}) = - (Y - X{\beta_{}})^TX_l + \lambda {\beta_{l}}$$
  • $$- (Y - X{\widehat{\beta}_{\lambda}})^T X_l + \lambda {\widehat{\beta}_{l,\lambda}} = 0, \qquad 1 \leq l \leq p$$
  • In matrix form $$-X^TY + (X^TX + \lambda I) {\widehat{\beta}_{\lambda}} = 0.$$
  • Or $${\widehat{\beta}_{\lambda}} = (X^TX + \lambda I)^{-1} X^TY.$$

Ridge regression

In [9]:
diabetes.ridge = lm.ridge(diabetes$y ~ diabetes$x, lambda=seq(0,100,0.05))
plot(diabetes.ridge, lwd=3)
Loaded lars 1.2

Choosing $\lambda$

  • If we knew $MSE$ as a function of $\lambda$ then we would simply choose the $\lambda$ that minimizes $MSE$.
  • To do this, we need to estimate $MSE$.
  • A popular method is cross-validation as a function of $\lambda$. Breaks the data up into smaller groups and uses part of the data to predict the rest.
  • We saw this in diagnostics (Cook’s distance measured the fit with and without each point in the data set) and model selection.

$K$-fold cross-validation for penalized model

  • Fix a model (i.e. fix $\lambda$). Break data set into $K$ approximately equal sized groups $(G_1, \dots, G_K)$.
  • for (i in 1:K) Use all groups except $G_i$ to fit model, predict outcome in group $G_i$ based on this model $\widehat{Y}_{j(i),\lambda}, j \in G_i$.
  • Estimate $CV(\lambda) = \frac{1}{n}\sum_{i=1}^K \sum_{j \in G_i} (Y_j - \widehat{Y}_{j(i),\lambda})^2.$

Here is a function to estimate the CV for our one parameter example. In practice, we only have one sample to form the CV curve. In this example below, I will compute the average CV error for 500 trials to show that it is roughly comparable in shape to the MSE curve.

In [10]:
CV = function(Z, alpha, K=5) {
    cve = numeric(K)
    l = length(Z)
    for (i in 1:K) {
        g = c(((i-1)*l/K+1):(i*l/K))
        mu.hat = mean(Z[-g]) * alpha
        cve[i] = sum((Z[g]-mu.hat)^2)
    return(c(sum(cve) / l, sd(cve)))

Let's see how the parameter chosen by 5-fold CV compares to our theoretical choice.

In [11]:
alpha = seq(0.0,1,length=20)
mse = numeric(length(alpha)) = numeric(length(alpha)) 
for (i in 1:ntrial) {
     Z = rnorm(nsample) * sigma + mu
     for (j in 1:length(alpha)) {
         current_cv = CV(Z, alpha[j])[j] =[j] + current_cv[1]
} = / ntrial
In [12]:
plot(alpha,, type='l', lwd=2, col='green',
      xlab='Shrinkage parameter, alpha', ylab='Average CV(alpha)')
 abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
 abline(v=min(alpha[ == min(]), col='red', lty=2)

The curve above shows what would happen if we could repeat this and averaeg over many samples.

Let's see what one curve looks like on our sample. This is the result we might get in practice on a given data set.

In [13]:
cv = numeric(length(alpha)) = numeric(length(alpha))
Z = rnorm(nsample) * sigma + mu
for (j in 1:length(alpha)) {
    current_cv = CV(Z, alpha[j])
    cv[j] = current_cv[1][j] = current_cv[2]
In [14]:
plot(alpha, cv, type='l', lwd=2, col='green',
    xlab='Shrinkage parameter, alpha', ylab='CV(alpha)')
 abline(v=mu^2/(mu^2+sigma^2/nsample), col='blue', lty=2)
 abline(v=min(alpha[cv == min(cv)]), col='red', lty=2)

Generalized Cross Validation

  • A computational shortcut for $n$-fold cross-validation (also known as leave-one out cross-validation).
  • Let $S_{\lambda} = X(X^TX + \lambda I)^{-1} X^T$ be the matrix in ridge regression that computes $\hat{Y}_{\lambda}$
  • Then $GCV(\lambda) = \frac{\|Y - S_{\lambda}Y\|^2}{n - {\text{Tr}}(S_{\lambda})}.$
  • The quantity ${\text{Tr}}(S_{\lambda})$ can be thought of as the effective degrees of freedom for this choice of $\lambda$.

Ridge regression

In [15]:
plot(diabetes.ridge$lambda, diabetes.ridge$GCV, xlab='Lambda', 
     ylab='GCV', type='l', lwd=3, col='orange')
modified HKB estimator is 5.462251 
modified L-W estimator is 7.641667 
smallest value of GCV  at 3.25 


  • Another popular penalized regression technique.
  • Use the standardized model.
  • The LASSO estimate is $$ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2}\|Y-X\beta\|^2_2 + \lambda \|\beta\|_1$$ where $$ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| $$ is the $\ell_1$ norm.

  • Corresponds (through Lagrange multiplier) to an $\ell^1$ constraint on ${\beta_{}}$’s.


  • In theory and practice, it works well when many ${\beta_{j}}$’s are 0 and gives "sparse" solutions unlike ridge.

  • It is a (computable) approximation to the best subsets AIC model.

  • It is computable because the minimization problem is a convex problem.

Why do we get sparse solutions with the LASSO?


In [16]:
diabetes.lasso = lars(diabetes$x, diabetes$y, type='lasso')

Cross-validation for the LASSO

The lars package has a built in function to estimate CV.

In [17]:
cv.lars(diabetes$x, diabetes$y, K=10, type='lasso')


In [18]:
G = glmnet(diabetes$x, diabetes$y)
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5

In [19]:
In [20]:
plot(cv.glmnet(diabetes$x, diabetes$y))

Elastic Net

  • Mix between LASSO and ridge regression.
  • The ENET estimator is $$ \hat{\beta}_{\lambda_1, \lambda_2} = \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda_1 \|\beta\|_1 + \frac{\lambda_2}{2} \|\beta\|^2_2.$$

Inference after LASSO

In [21]:
beta.hat = coef(G, s=G$lam[20], exact=TRUE)[-1]
sigma.hat = summary(lm(y ~ x, data=diabetes))$sigma
fixedLassoInf(diabetes$x, diabetes$y, beta.hat, G$lam[20], sigma=sigma.hat)
Loading required package: intervals

Attaching package: ‘intervals’

The following object is masked from ‘package:Matrix’:


Loading required package: survival
fixedLassoInf(x = diabetes$x, y = diabetes$y, beta = beta.hat, 
    lambda = G$lam[20], sigma = sigma.hat)

Standard deviation of noise (specified or estimated) sigma = 54.154

Testing results at lambda = 7.710, with alpha = 0.100

 Var     Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   3  555.279   8.602   0.000   448.527  661.923       0.049      0.049
   4  269.676   4.408   0.000   167.896  371.368       0.048      0.048
   7 -193.954  -3.194   0.001  -294.281  -91.072       0.049      0.049
   9  484.979   7.417   0.000   376.958  593.125       0.049      0.049

Note: coefficients shown are partial regression coefficients