{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Model selection and regularization'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's recap what we have learned about regression methods\n",
    "so far:\n",
    "\n",
    "- In linear regression, adding predictors always decreases the training error or RSS.\n",
    "\n",
    "- However, adding predictors does not necessarily improve the test error.\n",
    "\n",
    "- Selecting significant predictors is hard when $n$ is not much larger than $p$. \n",
    "\n",
    "- When $n<p$, there is no least squares solution:\n",
    "$$\\hat \\beta = \\underbrace{(\\mathbf{X}^T\\mathbf{X})}_{\\text{Singular}}^{-1}\\mathbf{X}^T y.$$\n",
    "\n",
    "- We should think about selecting fewer predictors.\n",
    "\n",
    "# Best subset selection\n",
    "\n",
    "- Simple idea: let's compare all models with $k$ predictors.\n",
    "\n",
    "- There are ${p \\choose k} = p!/\\left[k!(p-k)!\\right]$ possible models.\n",
    "\n",
    "- For every possible $k$, choose the model with the smallest RSS. \n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.1.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
    "## Best subset with `regsubsets`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ISLR) # where Credit is stored\n",
    "library(leaps) # where regsubsets is found\n",
    "summary(regsubsets(Balance ~ ., data=Credit))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Best model with 4 variables includes: `Cards, Income, Student, Limit`.\n",
    "\n",
    "## Choosing $k$ \n",
    "\n",
    "Naturally, $\\text{RSS}$ and \n",
    "$$\n",
    "R^2 =  1-\\frac{\\text{RSS}}{\\text{TSS}}\n",
    "$$\n",
    "improve as we increase $k$.\n",
    "\n",
    "To optimize $k$, we want to minimize the test error, not the training error. \n",
    "\n",
    "We could use **cross-validation**, or alternative estimates of test error:\n",
    "\n",
     "1. **Akaike Information Criterion (AIC)** (closely related to Mallows' $C_p$), given an estimate $\\hat\\sigma^2$ of the irreducible error:\n",
    "\n",
    "$$\\frac{1}{n}(\\text{RSS}+2k\\hat\\sigma^2)$$\n",
    "\n",
    "2. **Bayesian Information Criterion (BIC):**\n",
    "\n",
     "$$\\frac{1}{n}(\\text{RSS}+\\log(n)k\\hat\\sigma^2)$$\n",
    "\n",
    "3. **Adjusted $R^2$**:\n",
    "\n",
    "$$R^2_a = 1-\\frac{\\text{RSS}/(n-k-1)}{\\text{TSS}/(n-1)}$$\n",
    "\n",
     "How do they compare to cross-validation?\n",
    "\n",
    "- They are much less expensive to compute.\n",
    "\n",
     "- They are motivated by asymptotic arguments and rely on model assumptions (e.g., normality of the errors).\n",
    "\n",
     "- Equivalent concepts exist for other models (e.g., logistic regression).\n",
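     "\n",
     "These estimates can be read directly off a `regsubsets` fit: its `summary()` exposes `$cp`, `$bic` and `$adjr2` (a sketch reusing `Credit` and `leaps` from above; `nvmax = 11`, allowing subsets up to all predictors, is an assumption):\n",
     "\n",
     "```r\n",
     "rs = summary(regsubsets(Balance ~ ., data=Credit, nvmax=11))\n",
     "# one row per subset size k: Mallows' Cp (~ AIC), BIC, adjusted R^2\n",
     "cbind(k=seq_along(rs$cp), Cp=rs$cp, BIC=rs$bic, adjR2=rs$adjr2)\n",
     "which.min(rs$bic)   # k selected by BIC\n",
     "which.max(rs$adjr2) # k selected by adjusted R^2\n",
     "```\n",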
    "\n",
    "## Example: best subset selection for the Credit dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.2.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Best subset selection for the Credit dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.3.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "*Recall:* In $K$-fold cross validation, we can estimate a standard error or accuracy for our test error estimate. Then, we can\n",
    "apply the *1SE rule*.\n",
    "\n",
    "# Stepwise selection methods\n",
    "\n",
    "Best subset selection has 2 problems:\n",
    "\n",
    "1. It is often very expensive computationally. We have to fit $2^p$ models!\n",
    "\n",
     "2. When, for a fixed $k$, there are many candidate models, we increase our chances of overfitting. The model selected has *high variance*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to mitigate these problems, we can restrict our search space for the best model.\n",
    "This reduces the variance of the selected model at the expense of an increase in bias.\n",
    "\n",
    "## Forward selection\n",
    "\n",
    "<ol>\n",
    "<li>Let ${\\cal M}_0$ denote the *null* model, which contains no predictors.\n",
    "<li>For $k=0, \\dots, p-1$:\n",
    "<ol>\n",
    "<li style=\"list-style-type:lower-alpha\">Consider all $p-k$ models that augment the predictors\n",
    "in ${\\cal M}_k$ with one additional predictor.\n",
    "<li style=\"list-style-type:lower-alpha\">Choose the best among these $p-k$ models, and call it\n",
     "${\\cal M}_{k+1}$. Here *best* is defined as having smallest RSS or largest $R^2$.\n",
    "</ol>\n",
    "<li>Select a single best model from among ${\\cal M}_0, \\dots, {\\cal M}_p$ using\n",
    "cross-validated prediction error, AIC, BIC or adjusted $R^2$.\n",
    "</ol>\n",
    "\n",
     "## Forward selection using `step`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "M.full = lm(Balance ~ ., data=Credit)\n",
    "M.null = lm(Balance ~ 1, data=Credit)\n",
    "step(M.null, scope=list(upper=M.full), direction='forward', trace=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- First four variables are `Rating, Income, Student, Limit`.\n",
    "\n",
    "- **Different from best subsets of size 4.**\n",
    "\n",
    "## Backward selection\n",
    "\n",
    "<ol>\n",
    "<li>Let ${\\cal M}_p$ denote the *full* model, which contains all $p$ predictors\n",
    "<li>For $k=p,p-1,\\dots,1$:\n",
    "<ol>\n",
    "<li style=\"list-style-type:lower-alpha\">Consider all $k$ models that contain\n",
    "all but one of the predictors in ${\\cal M}_k$ for a total of $k-1$ predictors.\n",
    "<li style=\"list-style-type:lower-alpha\">Choose the best among these $k$ models, and call it\n",
     "${\\cal M}_{k-1}$. Here *best* is defined as having smallest RSS or largest $R^2$.\n",
    "</ol>\n",
    "<li>Select a single best model from among ${\\cal M}_0, \\dots, {\\cal M}_p$ using\n",
    "cross-validated prediction error, AIC, BIC or adjusted $R^2$.\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Backward selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step(M.full, scope=list(lower=M.null), direction='backward', trace=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forward selection with BIC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
     "step(M.null, scope=list(upper=M.full), direction='forward', trace=0, k=log(nrow(Credit))) # k=log(n) gives BIC; k defaults to 2, i.e., AIC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forward vs. backward selection\n",
    "\n",
    "<ul>\n",
    "<li>You cannot apply backward selection when $p>n$.\n",
    "<li>Although we might like them to, they need not produce the same sequence of models.\n",
     "<li>**Example:** $X_1,X_2 \\sim \\mathcal{N}(0,1)$ independent and set\n",
    "$$\n",
    "\\begin{aligned}\n",
    "X_3 &= X_1 + 3 X_2 \\\\\n",
    "Y &= X_1 + 2X_2 + \\epsilon\n",
    "\\end{aligned}\n",
    "$$\n",
    "<li>Regress $Y$ onto $X_1,X_2,X_3$.\n",
    "<ul>\n",
     "<li>Forward: $\\{X_3\\} \\to \\mathbf{\\{X_3,X_2\\} \\text{ or } \\{X_3,X_1\\}} \\to \\{X_3,X_2,X_1\\}$\n",
    "<li>Backward: $\\{X_1,X_2,X_3\\} \\to \\mathbf{ \\{X_1,X_2\\}} \\to \\{X_2\\}$\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Forward vs. backward selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 5000\n",
    "X1 = rnorm(n)\n",
    "X2 = rnorm(n)\n",
    "X3 = X1 + 3 * X2 \n",
    "Y = X1 + 2 * X2 + rnorm(n)\n",
    "D = data.frame(X1, X2, X3, Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step(lm(Y ~ 1, data=D), list(upper=~ X1 + X2 + X3), direction='forward')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Backward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step(lm(Y ~ X1 + X2 + X3, data=D), list(lower=~ 1), direction='backward')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other stepwise selection strategies\n",
    "\n",
    "- **Mixed stepwise selection (`direction='both'`):** Do forward selection, but at every step, remove any variables that are no longer *necessary*.\n",
    "\n",
    "- **Forward stagewise selection:** Roughly speaking, don't add in the variable *fully* at each step...\n",
    "\n",
    "- $\\dots$\n",
    "\n",
    "# Shrinkage methods\n",
    "\n",
    "**A mainstay of modern statistics!**\n",
    "\n",
    "The idea is to perform a linear regression, while *regularizing* or *shrinking* the coefficients $\\hat \\beta$ toward 0. \n",
    "\n",
    "## Why would shrunk coefficients be better?\n",
    "\n",
    "- This introduces *bias*, but may significantly decrease the *variance* of the estimates. If the latter effect is larger, this would decrease the test error.\n",
    "\n",
    "- Extreme example: set $\\hat{\\beta}$ to 0 -- variance is 0!\n",
    "\n",
    "- There are Bayesian motivations to do this: priors concentrated near 0 tend to shrink the parameters'\n",
    "posterior distribution. \n",
    "\n",
    "## Ridge regression\n",
    "\n",
    "Ridge regression solves the following optimization:\n",
    "\n",
    "$$\\min_{\\beta} \\;\\;\\; \\color{Blue}{ \\sum_{i=1}^n \\left(y_i -\\beta_0 -\\sum_{j=1}^p \\beta_jx_{i,j}\\right)^2} + \\color{Red}{\\lambda \\sum_{j=1}^p \\beta_j^2}$$\n",
    "\n",
    "- <font color=\"blue\">The RSS of the model at $\\beta$.</font>\n",
    "\n",
    "- <font color=\"red\">The squared $\\ell_2$ norm of $\\beta$, or $\\|\\beta\\|_2^2.$</font>\n",
    "\n",
    "- The parameter $\\lambda$ is a tuning parameter. It modulates the importance of fit vs. shrinkage.\n",
    "\n",
     "- We compute $\\hat\\beta^R_\\lambda$ over a grid of values of $\\lambda$ and then choose $\\lambda$ by cross-validation.\n",
     "Fortunately, computing the whole path is no more expensive than running a single least-squares regression.\n",
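     "\n",
     "A sketch of this workflow with the `glmnet` package (an assumption: it is not used elsewhere in these slides). `cv.glmnet` with `alpha=0` fits the whole ridge path and cross-validates $\\lambda$ in one call:\n",
     "\n",
     "```r\n",
     "library(glmnet) # assumed installed\n",
     "x = model.matrix(Balance ~ ., data=Credit)[, -1] # predictors, intercept column dropped\n",
     "y = Credit$Balance\n",
     "cv.ridge = cv.glmnet(x, y, alpha=0) # alpha=0: ridge penalty\n",
     "cv.ridge$lambda.min                 # lambda minimizing the CV error\n",
     "coef(cv.ridge, s='lambda.min')      # coefficients at that lambda\n",
     "```\n",
     "\n",
     "Note that `glmnet` standardizes the predictors by default, which matters for the scaling issue discussed next.\n",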
    "\n",
    "## Ridge regression\n",
    "\n",
    "<ul>\n",
    "<li>In least-squares linear regression, scaling the variables has no effect on the fit of the model:\n",
     "$$Y = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_p X_p.$$\n",
    "<li>\n",
    "<font color=\"red\"> Multiplying $X_1$ by $c$ can be compensated by dividing $\\hat \\beta_1$ by $c$: after scaling we have the same RSS.</font>\n",
    "<li>In ridge regression, this is not true. \n",
    "<li>In practice, what do we do?\n",
    "<ul>\n",
    "<li>Scale each variable such that it has sample variance 1 before running the regression.\n",
    "<li>This prevents penalizing some coefficients more than others.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
     "## Ridge regression of `Balance` in the `Credit` dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.4.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Ridge regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.5.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
    "- In a simulation study, we compute squared bias, <font color=\"#008080\">variance</font>, and <font color=\"#FF00FF\">test error</font> as a function of $\\lambda$.\n",
    "\n",
    "- In practice, cross validation would yield an estimate of the test error but bias and variance are unobservable.\n",
    "\n",
    "## Selecting $\\lambda$ by cross-validation for ridge regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.12.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Lasso regression\n",
    "\n",
    "Lasso regression solves the following optimization:\n",
    "\n",
    "$$\\min_{\\beta} \\;\\;\\; \\color{Blue}{ \\sum_{i=1}^n \\left(y_i -\\beta_0 -\\sum_{j=1}^p \\beta_jx_{i,j}\\right)^2} + \\color{Red}{\\lambda \\sum_{j=1}^p |\\beta_j|}$$\n",
    "\n",
    "- <font color=\"blue\">RSS of the model at $\\beta$.</font>\n",
    "\n",
    "- <font color=\"red\">$\\ell_1$ norm of $\\beta$, or $\\|\\beta\\|_1.$</font>\n",
    "\n",
    "- The parameter $\\lambda$ is a tuning parameter. It modulates the importance of fit vs. shrinkage.\n",
    "\n",
    "### Why would we use the Lasso instead of Ridge regression?\n",
    "\n",
    "- Ridge regression shrinks all the coefficients to a non-zero value.\n",
    "\n",
    "- The Lasso shrinks some of the coefficients all the way to zero.\n",
    "\n",
     "- The Lasso is a **convex** alternative (relaxation) to best subset selection (and to its approximation, stepwise selection)!\n",
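     "\n",
     "A sketch of this sparsity with `glmnet` (assumed installed): `alpha=1` gives the Lasso, and several entries of the coefficient vector come out exactly zero:\n",
     "\n",
     "```r\n",
     "library(glmnet) # assumed installed\n",
     "x = model.matrix(Balance ~ ., data=Credit)[, -1]\n",
     "y = Credit$Balance\n",
     "cv.lasso = cv.glmnet(x, y, alpha=1) # alpha=1: lasso penalty\n",
     "coef(cv.lasso, s='lambda.1se')      # sparse: some entries are exactly 0\n",
     "```\n",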
    "\n",
     "## Ridge regression of `Balance` in the `Credit` dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.4.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "<font color=\"red\">A lot of pesky small coefficients throughout the regularization path</font>\n",
    "\n",
     "## Lasso regression of `Balance` in the `Credit` dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.6.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "<font color=\"red\">Those coefficients are shrunk to zero</font>\n",
    "\n",
    "## An alternative formulation for regularization\n",
    "\n",
    "<ul>\n",
    "<li>**Ridge:** for every $\\lambda$, there is an $s$ such that $\\hat \\beta^R_\\lambda$ solves:\n",
    "$$\\underset{\\beta}{\\text{minimize}} ~~\\left\\{\\sum_{i=1}^n \\left(y_i -\\beta_0 -\\sum_{j=1}^p \\beta_j  x_{i,j}\\right)^2 \\right\\} ~~~ \\text{subject to}~~~ \\sum_{j=1}^p \\beta_j^2<s.$$\n",
    "<li>**Lasso:**  for every $\\lambda$, there is an $s$ such that $\\hat \\beta^L_\\lambda$ solves:\n",
    "$$\\underset{\\beta}{\\text{minimize}} ~~\\left\\{\\sum_{i=1}^n \\left(y_i -\\beta_0 -\\sum_{j=1}^p \\beta_j  x_{i,j}\\right)^2 \\right\\} ~~~ \\text{subject to}~~~ \\sum_{j=1}^p |\\beta_j|<s.$$\n",
    "<li>**Best subset:**\n",
    "$$\\underset{\\beta}{\\text{minimize}} ~~\\left\\{\\sum_{i=1}^n \\left(y_i -\\beta_0 -\\sum_{j=1}^p \\beta_j  x_{i,j}\\right)^2 \\right\\} ~~~ \\text{s.t.}~~~ \\sum_{j=1}^p \\mathbf{1}(\\beta_j\\neq 0)<s.$$\n",
    "</ul>\n",
    "\n",
    "## Visualizing Ridge and the Lasso with 2 predictors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.7.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
     "$\\color{BlueGreen}{\\blacklozenge}$: $\\sum_{j=1}^p |\\beta_j|<s$ (lasso) $\\qquad$ $\\color{BlueGreen}{\\bullet}$: $\\sum_{j=1}^p \\beta_j^2<s$ (ridge).\n",
     "Best subset with $s=1$ is the union of the axes...\n",
    "\n",
    "## When is the Lasso better than Ridge?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.8.png\" height=\"400\">\n",
    "</div>\n",
    "<ul>\n",
    "<li>**Example 1:** Most of the coefficients are non-zero.\n",
     "<li>Bias, <font color=\"#008080\">variance</font>, <font color=\"#800080\">MSE</font>\n",
    "<li>The Lasso (---), Ridge ($\\cdots$). \n",
    "<li>The bias is about the same for both methods.\n",
     "<li>The variance of Ridge regression is smaller, and so is the MSE. \n",
    "</ul>\n",
    "\n",
    "## When is the Lasso better than Ridge?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.9.png\" height=\"400\">\n",
    "</div>\n",
    "<ul>\n",
    "<li>**Example 2:** Only 2 coefficients are non-zero.\n",
     "<li>Bias, <font color=\"#008080\">variance</font>, <font color=\"#800080\">MSE</font>\n",
    "<li>The Lasso (---), Ridge ($\\cdots$). \n",
    "<li> The bias, variance, and MSE are lower for the Lasso.\n",
    "</ul>\n",
    "\n",
    "## Selecting $\\lambda$ by cross-validation for lasso regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.13.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## A very special case: ridge\n",
    "\n",
    "- Suppose $n=p$ and our matrix of predictors is $\\mathbf{X} = I$\n",
    "\n",
    "- Then, the objective function in Ridge regression can be simplified:\n",
    "$$\\sum_{j=1}^p (y_j-\\beta_j)^2 + \\lambda\\sum_{j=1}^p \\beta_j^2$$\n",
    "\n",
    "- We can minimize the terms that involve each $\\beta_j$ separately:\n",
    "\n",
    "$$(y_j-\\beta_j)^2 + \\lambda \\beta_j^2.$$\n",
    "\n",
    "- It is easy to show that \n",
    "\n",
    "$$\\hat \\beta^R_j = \\frac{y_j}{1+\\lambda}.$$\n",
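     "\n",
     "A quick numerical sanity check of this closed form (the values $y_j=3$, $\\lambda=2$ are arbitrary):\n",
     "\n",
     "```r\n",
     "y = 3; lambda = 2\n",
     "# minimize the one-dimensional ridge objective numerically\n",
     "optimize(function(b) (y - b)^2 + lambda * b^2, c(-10, 10))$minimum # approx 1\n",
     "y / (1 + lambda) # closed form: 1\n",
     "```\n",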
    "\n",
    "## A very special case: LASSO\n",
    "\n",
    "- Similar story for the Lasso; the objective function is:\n",
    "\n",
    "$$\\sum_{j=1}^p (y_j-\\beta_j)^2 + \\lambda\\sum_{j=1}^p |\\beta_j|$$\n",
    "\n",
    "- We can minimize the terms that involve each $\\beta_j$ separately:\n",
    "$$(y_j-\\beta_j)^2 + \\lambda |\\beta_j|.$$\n",
    "\n",
    "- It is easy to show that \n",
    "$$\\hat \\beta^L_j = \n",
    "\\begin{cases}\n",
    "y_j -\\lambda/2 & \\text{if }y_j>\\lambda/2; \\\\\n",
    "y_j +\\lambda/2 & \\text{if }y_j<-\\lambda/2; \\\\\n",
     "0 & \\text{if }|y_j|\\le\\lambda/2. \n",
    "\\end{cases}\n",
    "$$\n",
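     "\n",
     "The three cases combine into the *soft-thresholding* operator; a sketch with an arbitrary $\\lambda=2$:\n",
     "\n",
     "```r\n",
     "soft = function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)\n",
     "soft(c(3, 0.4, -3), lambda=2) # 2, 0, -2: shrink, zero out, shrink\n",
     "```\n",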
    "\n",
    "## Lasso and Ridge coefficients as a function of $y$\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.10.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Bayesian interpretation\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.11.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- **Ridge:** $\\hat\\beta^R$ is the posterior mean, with a Normal prior on $\\beta$.\n",
    "\n",
    "- **Lasso:** $\\hat\\beta^L$ is the posterior mode, with a Laplace prior on $\\beta$.\n",
    "\n",
    "# Dimensionality reduction\n",
    "\n",
    "### Regularization methods \n",
    "\n",
    "<ul>\n",
    "<li>Variable selection:\n",
    "<ul>\n",
    "<li>Best subset selection\n",
    "<li>Forward and backward stepwise selection\n",
    "</ul>\n",
    "<li>Shrinkage\n",
    "<ul>\n",
    "<li>Ridge regression\n",
    "<li>The Lasso (a form of variable selection)\n",
    "</ul>\n",
    "<li>Dimensionality reduction:\n",
    "<ul>\n",
    "<li>**Idea:** Define a small set of $M$ predictors which *summarize* the information in all $p$ predictors.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "**Recall:** The loadings $\\phi_{11}, \\dots, \\phi_{p1}$ for the first principal component define the directions of greatest variance in the space of variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "princomp(USArrests, cor=TRUE)$loadings[,1] # cor=TRUE makes sure to scale variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Interpretation:* The first principal component measures the overall rate of crime.\n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "**Recall:** The scores $z_{11}, \\dots, z_{n1}$ for the first principal component define the deviation of the samples along this direction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "princomp(USArrests, cor=TRUE)$scores[,1] # cor=TRUE makes sure to scale variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Interpretation:* The scores for the first principal component measure the overall rate of crime in each state.\n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "<ul>\n",
    "<li>**Idea:**\n",
    "<ul>\n",
    "<li>Replace the original predictors, $X_1, X_2, \\dots, X_p$, with the first $M$ score vectors $Z_1, Z_2, \\dots, Z_M$.\n",
    "<li>Perform least squares regression, to obtain coefficients $\\theta_0,\\theta_1,\\dots,\\theta_M$.\n",
    "</ul>\n",
    "<li>The model with these derived features is:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "y_i &= \\theta_0 + \\theta_1 z_{i1} + \\theta_2 z_{i2} + \\dots + \\theta_M z_{iM} + \\varepsilon_i\\\\\n",
    " & = \\theta_0 + \\theta_1 \\sum_{j=1}^p \\phi_{j1} x_{ij} + \\theta_2 \\sum_{j=1}^p \\phi_{j2} x_{ij} + \\dots + \\theta_M \\sum_{j=1}^p \\phi_{jM} x_{ij}  + \\varepsilon_i \\\\\n",
    " & = \\theta_0 + \\left[\\sum_{m=1}^M \\theta_m \\phi_{1m}\\right] x_{i1} + \\dots + \\left[\\sum_{m=1}^M \\theta_m \\phi_{pm}\\right] x_{ip}  + \\varepsilon_i\n",
    "\\end{aligned}\n",
    "$$\n",
    "<li>Equivalent to a linear regression onto $X_1,\\dots,X_p$, with coefficients:\n",
    "$$\\beta_j = \\sum_{m=1}^M \\theta_m \\phi_{jm}$$\n",
    "<li>This constraint in the form of $\\beta_j$ introduces *bias*, but it can lower the *variance* of the model.\n",
    "</ul>"
   ]
  },
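   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "In R, PCR can be run with the `pls` package (a sketch; `pls` is an assumption, not used elsewhere in these slides). `scale=TRUE` standardizes the predictors and `validation='CV'` cross-validates over the number of components:\n",
     "\n",
     "```r\n",
     "library(pls) # assumed installed\n",
     "set.seed(1)\n",
     "pcr.fit = pcr(Balance ~ ., data=Credit, scale=TRUE, validation='CV')\n",
     "validationplot(pcr.fit, val.type='MSEP') # CV error vs. number of components\n",
     "```"
    ]
   },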
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Application to the `Credit` dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.20.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- A model with 11 components is equivalent to least-squares regression\n",
    "\n",
    "- Best CV error is achieved with 10 components (almost no dimensionality reduction)\n",
    "\n",
    "## Application to the `Credit` dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.20.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The left panel shows the coefficients $\\beta_j$ estimated for each $M$.\n",
    "\n",
    "- <font color=\"red\">The coefficients shrink as we decrease $M$!</font>\n",
    "\n",
    "## Relationship between PCR and Ridge regression\n",
    "\n",
    "<ul>\n",
    "<li>**Least squares regression:**  want to minimize\n",
    "$$RSS  = (y-\\mathbf{X}\\beta)^T(y-\\mathbf{X}\\beta)$$\n",
    "<li>**Score equation:**\n",
    "$$\\frac{\\partial RSS}{\\partial \\beta}  = -2\\mathbf{X}^T(y-\\mathbf{X}\\beta) = 0$$\n",
    "<li>**Solution:**\n",
    "$$\\implies \\hat \\beta = (\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T y$$\n",
    "<li>Compute singular value decomposition: $\\mathbf{X} = U D^{1/2} V^T$, \n",
    "where $D^{1/2} = \\text{diag}(\\sqrt{d_1},\\dots,\\sqrt{d_p})$.\n",
    "<li>Then\n",
    "$$(\\mathbf{X}^T\\mathbf{X})^{-1} = V D^{-1} V^T$$\n",
    "where $D^{-1} = \\text{diag}(1/d_1,1/d_2,\\dots,1/d_p)$. \n",
    "</ul>\n",
    "\n",
    "## Relationship between PCR and Ridge regression\n",
    "\n",
    "<ul>\n",
    "<li>**Ridge regression:** want to minimize\n",
    "$$RSS + \\lambda \\|\\beta\\|_2^2 = (y-\\mathbf{X}\\beta)^T(y-\\mathbf{X}\\beta) + \\lambda \\beta^T\\beta$$\n",
    "<li>**Score equation:**\n",
    "$$\\frac{\\partial (RSS + \\lambda \\|\\beta\\|_2^2)}{\\partial \\beta}  = -2\\mathbf{X}^T(y-\\mathbf{X}\\beta) + 2\\lambda\\beta= 0$$\n",
    "<li>**Solution:**\n",
    "$$\\implies \\hat \\beta^R_\\lambda = (\\mathbf{X}^T\\mathbf{X} + \\lambda I)^{-1}\\mathbf{X}^T y$$\n",
    "<li>Note $$(\\mathbf{X}^T\\mathbf{X}+ \\lambda I)^{-1} = V D_\\lambda ^{-1} V^T$$\n",
    "where $D_\\lambda ^{-1} = \\text{diag}(1/(d_1+\\lambda),1/(d_2+\\lambda),\\dots,1/(d_p+\\lambda))$. \n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Relationship between PCR and Ridge regression\n",
    "\n",
    "<ul>\n",
    "<li>**Predictions of least squares regression:**\n",
    "$$\\hat y = \\mathbf{X}\\hat \\beta = \\sum_{j=1}^p u_j u_j^T y, \\qquad u_j \\text{ is the } j\\text{th column of } U$$\n",
    "<li>**Predictions of Ridge regression:**\n",
    "$$\\hat y = \\mathbf{X}\\hat \\beta^R_\\lambda = \\sum_{j=1}^p u_j \\frac{d_j}{d_j + \\lambda} u_j^T y$$\n",
    "The projection of $y$ onto a principal component is shrunk toward zero. The smaller the principal component, the larger the shrinkage.\n",
    "<li>**Predictions of PCR:**\n",
    "$$\\hat y = \\mathbf{X}\\hat \\beta^\\text{PC} = \\sum_{j=1}^p u_j \\mathbf{1}(j\\leq M) u_j^T y$$\n",
    "The projections onto small principal components are shrunk to zero. \n",
    "</ul>\n",
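     "\n",
     "The two shrinkage patterns can be compared numerically; a sketch with made-up singular values $d_j$ and arbitrary $\\lambda=1$, $M=2$:\n",
     "\n",
     "```r\n",
     "d = c(10, 5, 1, 0.1) # hypothetical singular values, largest first\n",
     "lambda = 1; M = 2\n",
     "rbind(ridge = d / (d + lambda),              # smooth shrinkage of each u_j^T y\n",
     "      pcr   = as.numeric(seq_along(d) <= M)) # hard cutoff after M components\n",
     "```\n",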
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.18.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "<ul>\n",
    "<li>In each case $n=50$, $p=45$. \n",
    "<li>Left: response is a function of all the predictors (*dense*). \n",
    "<li>Right: response is a function of 2 predictors (*sparse* - good for Lasso).\n",
    "</ul>\n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.8.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Ridge and Lasso on *dense* dataset\n",
    "\n",
    "- The Lasso (---), Ridge ($\\cdots$). \n",
    "\n",
    "- PCR seems at least comparable to optimal LASSO and ridge. \n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.9.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Ridge and Lasso on *sparse* dataset\n",
    "\n",
    "- The Lasso (---), Ridge ($\\cdots$). \n",
    "\n",
    "- Lasso clearly better than PCR here.\n",
    "\n",
    "## Principal Component Regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.19.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "<ul>\n",
    "<li>Again, $n=50$, $p=45$.\n",
    "<li>The response is a function of the first 5 principal components. \n",
    "</ul>\n",
    "\n",
    "## Partial least squares\n",
    "\n",
    "- Principal components regression derives $Z_1,\\dots,Z_M$ using *only* the predictors $X_1,\\dots,X_p$. \n",
    "\n",
    "- In partial least squares, we use the response $Y$ as well. (To be fair, best subsets and stepwise do as well.)\n",
    "\n",
    "## Algorithm:\n",
    "\n",
    "<ol>\n",
    "<li>Define $Z_1 = \\sum_{j=1}^p \\phi_{j1} X_j$, where $\\phi_{j1}$ is the coefficient of a simple linear regression of $Y$ onto $X_j$.\n",
    "<li>Let $X_j^{(2)}$ be the residual of regressing $X_j$ onto $Z_1$. \n",
    "<li>Define  $Z_2 = \\sum_{j=1}^p \\phi_{j2} X_j^{(2)}$, where  $\\phi_{j2}$ is the coefficient of a simple linear regression of $Y$ onto $X_j^{(2)}$.\n",
    "<li>Let $X_j^{(3)}$ be the residual of regressing $X_j^{(2)}$ onto $Z_2$. \n",
    "<li>...\n",
    "</ol>\n",
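     "\n",
     "Steps 1 and 2 can be sketched from scratch on simulated data (all names here are illustrative; in practice one would use a package such as `pls`):\n",
     "\n",
     "```r\n",
     "set.seed(1)\n",
     "n = 100\n",
     "X = scale(matrix(rnorm(n * 3), n, 3)) # three standardized predictors\n",
     "Y = X[, 1] + rnorm(n)\n",
     "phi1 = apply(X, 2, function(xj) coef(lm(Y ~ xj))[2]) # slopes of Y ~ X_j\n",
     "Z1 = X %*% phi1                                      # first PLS direction\n",
     "X2 = apply(X, 2, function(xj) resid(lm(xj ~ Z1)))    # deflated predictors\n",
     "round(cor(X2, Z1), 10) # each column is now uncorrelated with Z1\n",
     "```\n",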
    "\n",
    "## Partial least squares\n",
    "\n",
    "<ul>\n",
     "<li>At each step, we look for a linear combination of the predictors that is highly correlated with the response (the combination with the highest\n",
     "correlation is the least squares fit).\n",
     "<li>After each step, we transform the predictors so that they are *uncorrelated* with the linear combination chosen.\n",
    "<li>Compared to PCR, partial least squares has less bias and more variance (a stronger tendency to overfit).\n",
    "</ul>\n",
    "\n",
    "## High-dimensional regression\n",
    "\n",
    "<ul>\n",
    "<li>Most of the methods we've discussed work best when $n$ is much larger than $p$. \n",
    "<li>However, the case $p\\gg n$ is now common, due to experimental advances and cheaper computers:\n",
    "<ol>\n",
    "<li>**Medicine:** Instead of regressing heart disease onto just a few clinical observations (blood pressure, salt consumption, age), we use in addition 500,000 single nucleotide polymorphisms. \n",
     "<li>**Marketing:** Using search terms to understand online shopping patterns. A *bag of words* model defines one feature for every possible search term, which counts the number of times the term appears in a person's search. There can be as many features as words in the dictionary.\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Some problems\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.22.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- When $n=p$, we can find a fit that goes through every point.\n",
    "\n",
    "- We could use regularization methods, such as variable selection, ridge regression and the lasso.\n",
    "\n",
    "## Some problems\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.23.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
     "- Measures of training error (e.g., RSS and $R^2$) become useless: they suggest a perfect fit.\n",
     "\n",
     "- Furthermore, it becomes hard to estimate the noise variance $\\sigma^2$.\n",
     "\n",
     "- Consequently, measures of model fit such as $C_p$, AIC, and BIC fail.\n",
    "\n",
    "## Some problems\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter6/6.24.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
     "- In each case, only 20 predictors are associated with the response.\n",
    "\n",
    "- Plots show the test error of the Lasso.\n",
    "\n",
    "- **Message:** Adding predictors that are uncorrelated with the response hurts the performance of the regression!\n",
    "\n",
    "## Interpreting coefficients when $p>n$\n",
    "\n",
     "- When $p>n$, every predictor is a linear combination of the other predictors, i.e., there is an extreme level of multicollinearity.\n",
    "\n",
    "- The Lasso and Ridge regression will choose one set of coefficients. \n",
    "\n",
    "- The coefficients selected $\\{i\\;;\\; |\\hat\\beta_i| >\\delta \\}$ are not guaranteed to be identical to $\\{i\\;;\\; |\\beta_i| >\\delta \\}$. There can be many sets of predictors (possibly non-overlapping) which yield apparently good models. \n",
    "\n",
    "- **Message:** Don't overstate the importance of the predictors selected.\n",
    "\n",
    "## Interpreting inference when $p>n$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- When $p>n$, LASSO might select a sparse model.\n",
    "\n",
    "- Running `lm` on selected variables on *training data* is **bad.**\n",
    "\n",
    "- Running `lm` on selected variables on independent *validation data* is **OK-ish** -- is this `lm` a good model?\n",
    "\n",
    "- **Message:** Don't use inferential methods developed for least squares regression for things like\n",
    "LASSO, forward stepwise, etc.\n",
    "\n",
    "- Can we do better? Yes, but it's complicated and a little above our level here."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,Rmd,md:myst"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
