{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Resampling methods'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thinking about the *true* loss function is important\n",
    "\n",
    "- Most of the {\\bf regression} methods we've studied aim to minimize the RSS, while {\\bf classification} methods aim to minimize the 0-1 loss.\n",
    "\n",
    "- In classification, we often care about certain kinds of error more than others; i.e. the natural loss function is not the 0-1 loss.\n",
    "\n",
    "- Even if we use a method which minimizes a certain kind of training error, we can *tune* it to optimize our true loss function.\n",
    "\n",
    "- Example: in the `default` study we could find the threshold that brings the False negative rate  below an acceptable level.\n",
    "\n",
    "# Validation \n",
    "\n",
    "<ul>\n",
    "<li>**Problem:** Choose a supervised method that minimizes the test error.\n",
    "<li>In addition, *tune* the parameters of each method: maybe\n",
    "<ul>\n",
    "<li>$k$ in $k$-nearest neighbors.\n",
    "<li>The number of variables to include in forward or backward selection.\n",
    "<li>The order of a polynomial  in polynomial regression.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Validation set approach\n",
    "\n",
    "Use of a **validation set** is one way to approximate the test error:\n",
    "\n",
    "- Divide the data into two parts.\n",
    "- Train each model with one part.\n",
    "- Compute the error on the remaining *validation* data.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.1.png\" height=\"200\">\n",
    "</div>\n",
    "\n",
    "## Example: choosing order of polynomial\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.2.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Polynomial regression to estimate `mpg` from `horsepower` in the Auto data.\n",
    "\n",
    "- **Problem:** Every split yields a different estimate of the error.\n",
    "\n",
    "# Leave one out cross-validation (LOOCV)\n",
    "\n",
    "<ol>\n",
    "<li>For every $i=1,\\dots,n$:\n",
    "<ol>\n",
    "<li>train the model on every point except $i$,  \n",
    "<li>compute the test error on the held out point.\n",
    "</ol>\n",
    "<li>Average the test errors.\n",
    "</ol>\n",
    "\n",
    "## Regression\n",
    "\n",
    "- Overall error: $$\\text{CV}_{(n)} = \\frac{1}{n}\\sum_{i=1}^n (y_i - \\color{Red}{\\hat y_i^{(-i)}})^2$$\n",
    "\n",
    "- Notation <font color=\"red\">\\hat y_i^{(-i)}: prediction for the $i$ sample when learning without using the $i$th sample.</font>\n",
    "\n",
    "## Example: ?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.3.png\" height=\"600\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Algorithm 5.2?: LOOCV\n",
    "\n",
    "<ol>\n",
    "<li>For every $i=1,\\dots,n$:\n",
    "<ol>\n",
    "<li>train the model on every point except $i$,  \n",
    "<li>compute the test error on the held out point.\n",
    "</ol>\n",
    "<li>Average the test errors.\n",
    "</ol>\n",
    "\n",
    "## Classification\n",
    "\n",
    "- Overall error: $$\\text{CV}_{(n)} = \\frac{1}{n}\\sum_{i=1}^n \\mathbf{1}(y_i \\neq \\color{Red}{\\hat y_i^{(-i)}})$$\n",
    "\n",
    "- Here, <font color=\"red\">\\hat y_i^{(-i)} is predicted label for the $i$ sample when learning without using the $i$th sample.</font>"
   ]
  },
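  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of the classification version via `boot::cv.glm` with a 0-1 cost function, assuming the `Default` data from the `ISLR` package (LOOCV refits the model $n$ times, so this is slow for $n = 10{,}000$):\n",
    "\n",
    "```r\n",
    "library(ISLR); library(boot)\n",
    "fit = glm(default ~ balance, data = Default, family = binomial)\n",
    "# cost turns cv.glm's default squared error into 0-1 loss at threshold 0.5\n",
    "cost = function(y, p) mean((p > 0.5) != y)\n",
    "cv.glm(Default, fit, cost)$delta[1]   # LOOCV misclassification rate\n",
    "```"
   ]
  },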
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Shortcut for linear regression\n",
    "\n",
    "- Computing $\\text{CV}_{(n)}$ can be computationally expensive, since it involves fitting the model $n$ times. \n",
    "\n",
    "- For linear regression, there is a shortcut:\n",
    "\n",
    "$$\\text{CV}_{(n)} = \\frac{1}{n} \\sum_{i=1}^n \\left(\\frac{y_i-\\hat y_i}{1-h_{ii}}\\right)^2$$\n",
    "\n",
    "- Above, $h_{ii}$ is the leverage statistic.\n",
    "\n",
    "- Approximate versions sometimes used for logistic regression...\n",
    "\n",
    "# $K$-fold cross-validation\n",
    "\n",
    "## Algorithm 5.3? $K$-fold CV\n",
    "\n",
    "<ol>\n",
    "<li>Split the data into $K$ subsets or *folds*.\n",
    "<li>For every $i=1,\\dots,K$:\n",
    "<ol>\n",
    "<li>train the model on every fold except the $i$th fold,  \n",
    "<li>compute the test error on the $i$th fold.\n",
    "</ol>\n",
    "<li>Average the test errors.\n",
    "</ol>"
   ]
  },
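  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hand-rolled sketch of this algorithm for the `Auto` regression (`ISLR` package assumed; in practice `boot::cv.glm(..., K = 10)` does the same thing):\n",
    "\n",
    "```r\n",
    "library(ISLR)\n",
    "set.seed(2)\n",
    "K = 10\n",
    "folds = sample(rep(1:K, length.out = nrow(Auto)))   # random fold labels\n",
    "errs = sapply(1:K, function(k) {\n",
    "  fit = lm(mpg ~ poly(horsepower, 2), data = Auto[folds != k, ])\n",
    "  mean((Auto$mpg[folds == k] - predict(fit, Auto[folds == k, ]))^2)\n",
    "})\n",
    "mean(errs)   # the K-fold CV estimate of the test MSE\n",
    "```"
   ]
  },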
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example: ?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.5.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## LOOCV vs. $K$-fold cross-validation\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.4.png\" height=\"400\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- $K$-fold CV depends on the chosen split (somewhat).\n",
    "\n",
    "- In $K$-fold CV, we train the model on less data than what is available. This introduces *bias* into the estimates of test error.\n",
    "\n",
    "- In LOOCV, the training samples highly resemble each other. This increases the *variance* of the test error estimate.\n",
    "\n",
    "- $n$-fold CV is equivalent LOOCV.\n",
    "\n",
    "## Choosing an optimal model\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.6.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "Even if the error estimates are off, choosing the model with the minimum cross validation error often leads to a method with near minimum test error.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.7.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "In a classification problem, things look similar.\n",
    "\n",
    "<font color=\"#800080\">-- -- --  Bayes boundary</font>,------ Logistic regression with polynomial predictors of increasing degree.\n",
    "\n",
    "## Choosing an optimal model\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.8.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "In a classification problem, things look similar.\n",
    "\n",
    "## The one standard error (1SE) rule of thumb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td><img src=\"figs/esl7.3.png\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "\n",
    "- Forward stepwise selection \n",
    "\n",
    "- <font color=\"#0000FF\">10-fold cross validation</font>, <font color=\"#FFFF00\">True test error</font>\n",
    "\n",
    "- <ol>\n",
    "<li>A number of models with $10\\le p\\le 15$ have almost the same CV error.\n",
    "<li>The vertical bars represent 1 standard error in the test error from the 10 folds. \n",
    "<li>**1-SE rule of thumb:** Choose the simplest model whose CV error is no more than one standard error above the model with the lowest CV error. \n",
    "</ol>\n",
    "\\ei\n",
    "\\ec\n",
    "\\ecc\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## The wrong way to do cross validation\n",
    "\n",
    "<ul>\n",
    "<li>*Reading:* Section 7.10.2 of The Elements of Statistical Learning.\n",
    "<li>We want to classify 200 individuals according to whether they have cancer or not.\n",
    "<li>We use logistic regression onto 1000 measurements of gene expression. \n",
    "<li>**Proposed strategy:**\n",
    "<ol>\n",
    "<li>Using all the data, select the 20 most significant genes using $z$-tests. \n",
    "<li>Estimate the test error of logistic regression with these 20 predictors via 10-fold cross validation.\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## The wrong way to do cross validation\n",
    "\n",
    "<ul>\n",
    "<li> To see how that works, let's use the following simulated data:\n",
    "<ol>\n",
    "<li>Each gene expression is standard normal and independent of all others. \n",
    "<li>The response (cancer or not) is sampled from a coin flip --- no correlation to any of the \"genes\".\n",
    "</ol>\n",
    "<li><font color=\"red\">Q: What should the misclassification rate be for any classification method using these predictors?</font>\n",
    "<li><font color=\"red\">A: Roughly 50%.</font>\n",
    "</ul>\n",
    "\n",
    "## The wrong way to do cross validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<ul>\n",
    "<li>We run this simulation, and obtain a CV error rate of 3%!\n",
    "<li>Why?\n",
    "<ul>\n",
    "<li>Since we only have 200 individuals in total, among 1000 variables, at least some will be correlated with the response.\n",
    "<li>We had run variable selection using \\emph{all the data}, so the variables we select have some correlation with the response in every subset or fold in the cross validation.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## The right way to do cross validation\n",
    "\n",
    "<ol>\n",
    "<li>Divide the data into 10 folds.\n",
    "<li>For $i=1,\\dots,10$:\n",
    "<ol>\n",
    "<li>Using every fold except $i$, perform the variable selection and fit the model with the selected variables.\n",
    "<li>Compute the error on fold $i$.\n",
    "</ol>\n",
    "<li>Average the 10 test errors obtained.\n",
    "</ol>\n",
    "\n",
    "- In our simulation, this produces an error estimate of close to 50%.\n",
    "\n",
    "- **Moral of the story:** Every aspect of the learning method that involves using the data --- variable selection, for example --- must be cross-validated.\n",
    "\n",
    "# Bootstrap\n",
    "\n",
    "- Another resampling technique often seen in practice.\n",
    "\n",
    "## Cross-validation vs. the Bootstrap\n",
    "\n",
    "- **Cross-validation:** provides <font color=\"blue\">estimates</font> of the (test) <font color=\"red\">error</font>\n",
    "\n",
    "- **The Bootstrap:** provides the (standard) <font color=\"red\">error</font> of <font color=\"blue\">estimates</font>\n",
    "\n",
    "## Bootstrap\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/efron.jpg\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "- One of the most important techniques in all of Statistics.\n",
    "\n",
    "- Computer intensive method. \n",
    "\n",
    "- Popularized by Brad Efron <font color=\"red\">$\\leftarrow$ Stanford pride!</font>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Standard errors in linear regression from a sample of size $n$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell",
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
    "Advertising = read.csv('http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv')\n",
    "M.sales = lm(sales ~ TV, data=Advertising)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary(M.sales)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classical way to compute Standard Errors\n",
    "\n",
    "<ul>\n",
    "<li>**Example:** Estimate the variance of a sample $x_1,x_2,\\dots,x_n$:\n",
    "<li>Unbiased estimate of $\\sigma^2$: $$\\hat \\sigma^2 = \\frac{1}{n-1}\\sum_{i=1}^n (x_i-\\overline x)^2.$$\n",
    "<li>What is the Standard Error of $\\hat \\sigma^2$?\n",
    "<ul>\n",
    "<li>Assume that $x_1,\\dots,x_n$ are normally distributed with common\n",
    "mean $\\mu$ and variance $\\sigma^2$.\n",
    "<li>Then $\\hat \\sigma^2(n-1)$ has a $\\chi$-squared distribution with $n-1$ degrees of freedom.\n",
    "<li>For large $n$, $\\hat{\\sigma}^2$ is normally distributed around $\\sigma^2$.\n",
    "<li>The SD of this *sampling distribution* is the Standard Error.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Limitations of the classical approach\n",
    "\n",
    "<ul>\n",
    "<li>This approach has served statisticians well for many years; however, what happens if:\n",
    "<ul>\n",
    "<li>The distributional assumption --- for example, $x_1,\\dots,x_n$ being normal --- breaks down?\n",
    "<li>The estimator does not have a simple form and its sampling distribution cannot be derived analytically?\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Example: Investing in two assets\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.9-1.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Suppose that $X$ and $Y$ are the returns of two assets.\n",
    "\n",
    "- These returns are observed every day: $(x_1,y_1),\\dots,(x_n,y_n)$.\n",
    "\n",
    "## Example. Investing in two assets\n",
    "\n",
    "- We have a fixed amount of money to invest and we will invest a fraction $\\alpha$ on $X$ and a fraction $(1-\\alpha)$ on $Y$.\n",
    "\n",
    "- Therefore, our return will be\n",
    "\n",
    "$$\\alpha X + (1-\\alpha) Y.$$\n",
    "\n",
    "- Our goal will be to minimize the variance of our return as a function of $\\alpha$.\n",
    "\n",
    "- One can show that the optimal $\\alpha$ is:\n",
    "$$\\alpha = \\frac{\\sigma_Y^2 - \\text{Cov}(X,Y)}{\\sigma_X^2 + \\sigma_Y^2 -2\\text{Cov}(X,Y)}.$$\n",
    "\n",
    "- **Proposal:** Use an estimate:\n",
    "$$\\hat \\alpha = \\frac{\\hat \\sigma_Y^2 - \\hat{ \\text{Cov}}(X,Y)}{\\hat \\sigma_X^2 + \\hat \\sigma_Y^2 -2\\hat{ \\text{Cov}}(X,Y)}.$$\n",
    "\n",
    "## Example: Investing in two assets\n",
    "\n",
    "- Suppose we compute the estimate $\\hat\\alpha = 0.6$ using the samples $(x_1,y_1),\\dots,(x_n,y_n)$.\n",
    "\n",
    "- How sure can we be of this value?  (*A little vague of a question.*)\n",
    "\n",
    "- If we had sampled the observations in a different 100 days, would we get a wildly different $\\hat \\alpha$? (*A more precise question.*)\n",
    "\\ei\n",
    "\n",
    "## Resampling the data from the true distribution\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.9.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- In this thought experiment, we know the actual joint distribution $P(X,Y)$, so we can resample the $n$ observations to our hearts' content.  \n",
    "\n",
    "## Computing the standard error of $\\hat \\alpha$\n",
    "\n",
    "- We will use $S$ samples to estimate the standard error of $\\hat{\\alpha}$.\n",
    "\n",
    "- For each sampling of the data, for $1 \\leq s \\leq S$\n",
    "\n",
    "$$(x_1^{(s)},\\dots,x_n^{(s)})$$\n",
    "\n",
    "we can compute a value of the estimate $\\hat \\alpha^{(1)},\\hat \\alpha^{(2)},\\dots$.\n",
    "\n",
    "- The Standard Error of $\\hat \\alpha$ is approximated by the standard deviation of these values.\n",
    "\n",
    "## In reality, we only have $n$ samples\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter5/5.9-1.png\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "\n",
    "- However, these samples can be used to approximate the joint distribution of $X$ and $Y$.\n",
    "\n",
    "- **The Bootstrap:** Sample from the *empirical distribution*:\n",
    "\n",
    "$$\\hat P(X,Y) = \\frac{1}{n}\\sum_{i=1}^{n} \\delta(x_i,y_i).$$\n",
    "\n",
    "- Equivalently, resample the data by drawing $n$ samples *with replacement* from the actual observations.\n",
    "\n",
    "- *Why it works:* variances computed under the empirical distribution are good approximations\n",
    "of variances computed under the true distribution (in many cases).\n",
    "\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## A schematic of the Bootstrap\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.11.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Comparing Bootstrap sampling to sampling from the true distribution\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter5/5.10.png\" height=\"600\">\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,Rmd,md:myst"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
