{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Linear regression'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.1.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Chapter 3 of **ISLR**\n",
    "\n",
    "- Simplest example of a supervised method\n",
    "\n",
    "# Simple linear regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter3/3.1.png\" height=\"500\">\n",
    "</td>\n",
    "<td style=\"vertical-align:top\">\n",
    "<ul>\n",
    "<li>Model:\n",
    "$$y_i = \\beta_0 + \\beta_1 x_i +\\varepsilon_i$$\n",
     "<li>Errors: $$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\quad \text{i.i.d.}$$\n",
    "<li>Fit:\n",
    "the estimates $\\hat\\beta_0$ and $\\hat\\beta_1$ are chosen to minimize the residual sum of squares (RSS):\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\text{RSS}(\\beta_0,\\beta_1) &= \\sum_{i=1}^n (y_i -\\hat y_i(\\beta_0, \\beta_1))^2 \\\\\n",
    "& = \\sum_{i=1}^n (y_i - \\beta_0- \\beta_1 x_i)^2. \n",
    "\\end{aligned}\n",
    "$$\n",
    "</ul>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Sample code: advertising data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "Advertising = read.csv('http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv')\n",
    "M.sales = lm(sales ~ TV, data=Advertising)\n",
    "M.sales"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Estimates $\\hat\\beta_0$ and $\\hat\\beta_1$\n",
    "\n",
    "A little calculus shows that the minimizers of the RSS are:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat \\beta_1 & = \\frac{\\sum_{i=1}^n (x_i-\\overline x)(y_i-\\overline y)}{\\sum_{i=1}^n (x_i-\\overline x)^2} \\\\\n",
    "\\hat \\beta_0 & = \\overline y- \\hat\\beta_1\\overline x.\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
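  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- As a quick sanity check (a sketch on simulated data, not an ISLR example), the closed-form estimates can be computed directly and compared with `lm`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "x = rnorm(100)\n",
    "y = 2 + 3 * x + rnorm(100)\n",
    "beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)\n",
    "beta0.hat = mean(y) - beta1.hat * mean(x)\n",
    "c(beta0.hat, beta1.hat)\n",
    "coef(lm(y ~ x))  # agrees with the formulas above"
   ]
  },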
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Assessing the accuracy of $\\hat \\beta_0$ and $\\hat\\beta_1$\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter3/3.3.png\" height=\"500\">\n",
    "</td>\n",
    "<td style=\"vertical-align:top\">\n",
    "**Based on our model:**\n",
    "<ul>\n",
    "<li>The Standard Errors for the parameters are:\n",
    "<ul>\n",
    "<li>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\text{SE}(\\hat\\beta_0)^2 &= \\sigma^2\\left[\\frac{1}{n}+\\frac{\\overline x^2}{\\sum_{i=1}^n(x_i-\\overline x)^2}\\right] \\\\\n",
    "\\text{SE}(\\hat\\beta_1)^2 &= \\frac{\\sigma^2}{\\sum_{i=1}^n(x_i-\\overline x)^2}.\n",
    "\\end{aligned}$$\n",
    "</ul>\n",
    "</ul>\n",
    "<ul>\n",
    "<li>95\\% confidence intervals:\n",
    "<ul>\n",
    "<li>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat\\beta_0 &\\pm 2\\cdot\\text{SE}(\\hat\\beta_0) \\\\\n",
    "\\hat\\beta_1 &\\pm 2\\cdot\\text{SE}(\\hat\\beta_1)\n",
    "\\end{aligned}\n",
    "$$\n",
    "</ul>\n",
    "</ul>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Hypothesis test\n",
    "\n",
    "<ul>\n",
    "<li>Null hypothesis $H_0$: There is no relationship between $X$ and $Y$.\n",
    "<li>Alternative hypothesis $H_a$: There is some relationship between $X$ and $Y$.\n",
    "<li>**Based on our model:** this translates to\n",
    "<ul>\n",
    "<li>$H_0$: $\\beta_1=0$.\n",
    "<li>$H_a$: $\\beta_1\\neq 0$.\n",
    "</ul>\n",
    "<li>Test statistic: $\\quad t = \\frac{\\hat\\beta_1 -0}{\\text{SE}(\\hat\\beta_1)}.$\n",
     "<li>Under the null hypothesis, this has a $t$-distribution with $n-2$ degrees of freedom.\n",
     "</ul>\n",
     "\n",
     "## Sample output: advertising data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "summary(M.sales)"
   ]
  },
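  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- The standard-error formulas can be checked numerically (a sketch on simulated data; `confint` gives the exact $t$-based interval, close to the $\\pm 2\\cdot\\text{SE}$ rule):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "n = 100\n",
    "x = rnorm(n)\n",
    "y = 2 + 3 * x + rnorm(n)\n",
    "fit = lm(y ~ x)\n",
    "sigma.hat = summary(fit)$sigma                # estimate of sigma\n",
    "se.beta1 = sigma.hat / sqrt(sum((x - mean(x))^2))\n",
    "c(se.beta1, summary(fit)$coefficients[2, 2])  # formula vs. lm: same value\n",
    "coef(fit)[2] + c(-2, 2) * se.beta1            # approximate 95% CI\n",
    "confint(fit)[2, ]                             # exact t-based interval"
   ]
  },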
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Interpreting the hypothesis test\n",
    "\n",
    "-  If we reject the null hypothesis, can we assume there is an *exact*  linear relationship?\n",
    "\n",
    "- **No.** A quadratic relationship may be a better fit, for example. This test assumes the simple\n",
     "linear regression model is correct, which precludes a quadratic relationship.\n",
    "\n",
    "- If we don't reject the null hypothesis, can we assume there is no relationship between $X$ and $Y$?\n",
    "\n",
    "- **No.** This test is based on the model\n",
    "we posited above and is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.\n",
    "\n",
    "# Multiple linear regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter3/3.4.png\" height=\"500\">\n",
    "</td>\n",
    "<td style=\"vertical-align:top\">\n",
    "<ul>\n",
    "<li>Model:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "y_i &= \\beta_0 + \\beta_1 x_{i1}+\\dots+\\beta_p x_{ip}+\\varepsilon_i \\\\\n",
    "Y &=  \\beta_0 + \\beta_1 X_{1}+\\dots+\\beta_p X_{p}+\\varepsilon \\\\\n",
    "\\end{aligned}\n",
    "$$\n",
     "<li>Errors: $$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\quad \text{i.i.d.}$$\n",
    "<li>Matrix notation\n",
    "$$E(\\mathbf{Y}) = \\mathbf{X}\\beta$$\n",
    "with\n",
    "$\\beta=(\\beta_0,\\dots,\\beta_p)$ and $\\mathbf{X}$ is our usual data matrix with an extra column of ones on the left to account for the intercept.\n",
    "</ul>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Multiple linear regression answers several questions\n",
    "\n",
    "- Is at least one of the variables $X_i$ useful for predicting the outcome $Y$?\n",
    "\n",
    "- Which subset of the predictors is most important?\n",
    "\n",
    "- How good is a linear model for these data?\n",
    "\n",
    "- Given a set of predictor values, what is a likely value for $Y$, and how accurate is this prediction?\n",
    "\n",
    "## The estimates $\\hat\\beta$\n",
    "\n",
    "Our goal again is to minimize the RSS:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\text{RSS}(\\beta) &= \\sum_{i=1}^n (y_i -\\hat y_i(\\beta))^2 \\\\\n",
    "& = \\sum_{i=1}^n (y_i - \\beta_0- \\beta_1 x_{i,1}-\\dots-\\beta_p x_{i,p})^2 \\\\\n",
    "&= \\|Y-X\\beta\\|^2_2\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "- One can show that this is minimized by the vector $\\hat\\beta$:\n",
    "$$\\hat\\beta = (\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{y}.$$\n",
    "\n",
    "- We usually write $RSS=RSS(\\hat{\\beta})$ for the *minimized* RSS.\n",
    "\n",
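     "- A quick check (a sketch on simulated data, not an ISLR example): the closed-form $\\hat\\beta$ agrees with `lm`.\n",
     "\n",
     "```r\n",
     "set.seed(1)\n",
     "n = 100\n",
     "X = cbind(1, matrix(rnorm(n * 3), n, 3))  # data matrix with intercept column\n",
     "y = drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)\n",
     "beta.hat = solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y\n",
     "cbind(beta.hat, coef(lm(y ~ X[, -1])))    # the two columns agree\n",
     "```\n",
     "\n",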
    "## Which variables are important?\n",
    "\n",
    "- Consider the hypothesis: $H_0:$ the last $q$ predictors have no relation with $Y$.\n",
    "\n",
    "- **Based on our model:** $H_0:\\beta_{p-q+1}=\\beta_{p-q+2}=\\dots=\\beta_p=0.$\n",
    "\n",
    "- Let $\\text{RSS}_0$ be the minimized residual sum of squares for the model which excludes these variables.\n",
    "\n",
    "- The $F$-statistic is defined by:\n",
    "$$F = \\frac{(\\text{RSS}_0-\\text{RSS})/q}{\\text{RSS}/(n-p-1)}.$$\n",
    "\n",
    "- Under the null hypothesis (of our model), this has an $F$-distribution.\n",
    "\n",
    "- Example: If $q=p$, we test whether any of the variables is important.\n",
    "$$\\text{RSS}_0 = \\sum_{i=1}^n(y_i-\\overline y)^2 $$ \n",
    "\n",
    "## Which variables are important?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "library(MASS) # where Boston data is stored\n",
    "M.Boston = lm(medv ~ ., data=Boston)\n",
    "summary(M.Boston)"
   ]
  },
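  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- The $F$-test for dropping a group of predictors can be carried out with `anova` (a sketch; the choice of `indus` and `age` here is only for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "reduced = lm(medv ~ . - indus - age, data=Boston)  # drop q = 2 predictors\n",
    "anova(reduced, M.Boston)                           # F-statistic and p-value"
   ]
  },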
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Which variables are important?\n",
    "\n",
     "- The $t$-statistic associated with the $i$th predictor is (up to sign) the square root of the $F$-statistic for the null hypothesis that sets only $\\beta_i=0$.\n",
    "\n",
    "- A low $p$-value indicates that the predictor is important.\n",
    "\n",
     "- <font color=\"red\">Warning:</font> If there are many predictors, some of the $t$-tests will have low p-values purely by chance,\n",
     "even when the model has no explanatory power."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## *How many* variables are important?\n",
    "\n",
    "<ul>\n",
    "<li>When we select a subset of the predictors, we have $2^p$ choices. \n",
    "<li> A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.\n",
    "<ul>\n",
    "<li>Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.\n",
    "<li>Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.\n",
    "<li>Mixed selection: Starting from some model,\n",
    "include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.\n",
    "</ul>\n",
    "<li><font color=\"red\">Choosing one model in the range produced is a form of *tuning*.</font> This tuning can invalidate some of our\n",
    "methods like hypothesis tests and confidence intervals...\n",
    "</ul>"
   ]
  },
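  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- One hedged way to run such a search in R is `step`, which does stepwise selection by AIC rather than by raw RSS or p-values (a sketch on the Boston data from `MASS`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "library(MASS)  # Boston data\n",
    "null.model = lm(medv ~ 1, data=Boston)\n",
    "full.model = lm(medv ~ ., data=Boston)\n",
    "fwd = step(null.model, scope=formula(full.model),\n",
    "           direction=\"forward\", trace=0)\n",
    "formula(fwd)  # the selected model"
   ]
  },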
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How good are the predictions?\n",
    "\n",
    "- The function `predict` in R outputs predictions from a linear model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "predict(M.sales, data.frame(TV=c(50,150,250)), interval='confidence', level=0.95)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Prediction intervals reflect uncertainty on $\\hat\\beta$ and the irreducible error $\\varepsilon$ as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "predict(M.sales, data.frame(TV=c(50,150,250)), interval='prediction', level=0.95)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- These functions rely on our *linear regression model*\n",
    "$$\n",
     "Y = X\varepsilon_placeholder\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Dealing with categorical or qualitative predictors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.6.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
     "<ul>\n",
     "<li>Example: the Credit dataset, where we try to predict $Y$=`Balance`.\n",
     "</ul>\n",
     "\n",
    "## Dealing with categorical or qualitative predictors\n",
    "\n",
    "For each qualitative predictor, e.g. `ethnicity`:\n",
    "\n",
    "<ul>\n",
    "<li> Choose a baseline category, e.g. `African American`\n",
    "<li>For every other category, define a new predictor:\n",
    "<ul>\n",
    "<li> $X_\\text{Asian}$ is 1 if the person is Asian and 0 otherwise.\n",
     "<li> $X_\text{Caucasian}$ is 1 if the person is Caucasian and 0 otherwise.\n",
    "</ul>\n",
    "<li>The model will be:\n",
    "$$Y = \\beta_0 + \\beta_1 X_1 +\\dots +\\beta_7 X_7 + \\color{Red}{\\beta_\\text{Asian}} X_\\text{Asian} + \\beta_\\text{Caucasian} X_\\text{Caucasian} +\\varepsilon.$$\n",
    "<li>The parameter $\\color{Red}{\\beta_\\text{Asian}}$ is the relative effect on `balance` (our $Y$) for being Asian compared to the baseline category.\n",
    "</ul>"
   ]
  },
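  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- R builds these 0/1 indicator columns automatically from a `factor`; `model.matrix` shows the coding (a tiny made-up example, not the Credit data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "eth = factor(c(\"African American\", \"Asian\", \"Caucasian\", \"Asian\"))\n",
    "model.matrix(~ eth)  # first level is the baseline; the rest get dummy columns"
   ]
  },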
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Dealing with categorical or qualitative predictors\n",
    "\n",
    "1. The model fit and predictions are independent of the choice of the baseline category.\n",
    "\n",
    "2. However, hypothesis tests derived from these variables are affected by the choice.\n",
    "\n",
    "3. Solution: To check whether `ethnicity` is important, use an $F$-test for the hypothesis $\\beta_\\text{Asian}=\\beta_\\text{Caucasian}=0$\n",
    "by dropping `Ethnicity` from the model. This does not depend on the coding.\n",
    "\n",
     "4. Note that there are other ways to encode qualitative predictors that produce the same fit $\hat f$, but the coefficients have different interpretations."
   ]
  },
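  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- The coding-independent $F$-test compares models with and without the factor via `anova` (a sketch on simulated data standing in for the Credit example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "n = 120\n",
    "eth = factor(sample(c(\"AA\", \"Asian\", \"Caucasian\"), n, replace=TRUE))\n",
    "income = rnorm(n, 50, 10)\n",
    "balance = 100 + 2 * income + rnorm(n, sd=20)  # ethnicity truly irrelevant here\n",
    "anova(lm(balance ~ income), lm(balance ~ income + eth))  # F-test for eth"
   ]
  },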
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Recap\n",
    "\n",
    "So far, we have:\n",
    "\n",
    "1. Defined Multiple Linear Regression\n",
    "\n",
    "2. Discussed how to test the importance of variables.\n",
    "\n",
    "3. Described one approach to choose a subset of variables.\n",
    "\n",
    "4. Explained how to code qualitative variables.\n",
    "\n",
    "5. Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How good is the fit?\n",
    "\n",
    "- To assess the fit, we focus on the residuals\n",
    "$$\n",
    "e = Y - \\hat{Y}\n",
    "$$\n",
    "\n",
    "- The RSS always decreases as we add more variables.\n",
    "\n",
     "- The residual standard error (RSE) corrects for this by accounting for the number of fitted parameters:\n",
    "$$\\text{RSE} = \\sqrt{\\frac{1}{n-p-1}\\text{RSS}}.$$"
   ]
  },
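  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- The RSE can be computed from the residuals and compared with what `summary` reports (a sketch on the Boston data from `MASS`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "library(MASS)\n",
    "fit = lm(medv ~ ., data=Boston)\n",
    "n = nrow(Boston)\n",
    "p = length(coef(fit)) - 1\n",
    "sqrt(sum(resid(fit)^2) / (n - p - 1))  # RSE by hand\n",
    "summary(fit)$sigma                     # same value"
   ]
  },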
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How good is the fit?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.5.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
     "- Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Potential issues in linear regression\n",
    "\n",
    "1. Interactions between predictors\n",
    "\n",
    "2. Non-linear relationships\n",
    "\n",
    "3. Correlation of error terms\n",
    "\n",
    "4. Non-constant variance of error (heteroskedasticity)\n",
    "\n",
    "5. Outliers\n",
    "\n",
    "6. High leverage points\n",
    "\n",
    "7. Collinearity\n",
    "\n",
    "## Interactions between predictors\n",
    "\n",
    "- Linear regression has an *additive* assumption:\n",
    "$$\\mathtt{sales} = \\beta_0 + \\beta_1\\times\\mathtt{tv}+ \\beta_2\\times\\mathtt{radio}+\\varepsilon$$\n",
    "\n",
     "- i.e. an increase of 100 USD in TV ads is associated with a fixed average increase of $100\beta_1$ USD in sales, regardless of how much you spend on radio ads.\n",
    "\n",
    "## Interactions between predictors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.5.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- If we visualize the residuals, we see they are not evenly scattered around the plane.  This could be caused by an interaction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Interactions between predictors\n",
    "\n",
    "- One way to deal with this is to include multiplicative variables in the model:\n",
    "\n",
    "$$\\mathtt{sales} = \\beta_0 + \\beta_1\\times\\mathtt{tv}+ \\beta_2\\times\\mathtt{radio}+\\color{Red}{\\beta_3\\times(\\mathtt{tv}\\cdot\\mathtt{radio})}+\\varepsilon$$\n",
    "\n",
    "- <font color=\"red\">The *interaction variable* `tv * radio` is high when both `tv` and `radio` are high. </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Interactions between predictors\n",
    "\n",
    "- R makes it easy to include interaction variables in the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "summary(lm(sales ~ TV + radio + radio:TV, data=Advertising))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Non-linearities\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.8.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Example: Auto dataset.\n",
    "\n",
    "- A scatterplot between a predictor and the response may reveal a non-linear relationship.\n",
    "\n",
    "## Non-linearities\n",
    "\n",
    "- **Solution:** include polynomial terms in the model.\n",
    "\n",
    "$$\\mathtt{MPG} = \\beta_0 + \\beta_1\\times\\mathtt{horsepower}\n",
    "+ \\beta_2 \\times\\mathtt{horsepower}^2 +\n",
    "\\beta_3 \\times\\mathtt{horsepower}^3 +\n",
    "\\beta_4 \\times\\mathtt{horsepower}^4 + \\dots + \\varepsilon\n",
    "$$\n",
    "\n",
    "- Could use other functions besides polynomials...\n",
    "\n",
    "$$\\mathtt{MPG} = \\beta_0 + \\beta_1\\times h_1(\\mathtt{horsepower})\n",
    "+ \\beta_2 \\times h_2(\\mathtt{horsepower}) +\n",
    "\\beta_3 \\times h_3(\\mathtt{horsepower}) +\n",
    "\\beta_4 \\times h_4(\\mathtt{horsepower}) + \\dots + \\varepsilon\n",
    "$$"
   ]
  },
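  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- In R, polynomial terms can be added with `poly` (a sketch; assumes the `ISLR` package, which contains the Auto data, is installed):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "library(ISLR)  # Auto data\n",
    "fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto)\n",
    "summary(fit2)$coefficients  # the quadratic term is highly significant"
   ]
  },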
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Non-linearities\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.9.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Correlation of error terms\n",
    "\n",
    "- We assumed that the errors for each sample are independent:\n",
    "\n",
     "$$y_i = f(x_i) + \varepsilon_i \quad;\quad \varepsilon_i \sim \mathcal{N}(0,\sigma^2) \text{ i.i.d.}$$\n",
    "\n",
    "- What if this breaks down?\n",
    "\n",
    "- The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests...\n",
    "\n",
    "- *Example*: Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of $\\sqrt{2}$. \n",
    "\n",
    "## Correlation of error terms\n",
    "\n",
     "When could this happen in real life?\n",
    "\n",
    "- *Time series:* Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.\n",
    "\n",
    "- *Spatial data:* Each sample corresponds to a different location in space. \n",
    "\n",
    "- *Grouped data:* Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family,\n",
    "their shared environment could make them deviate from $f(x)$ in similar ways."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Correlation of error terms\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.10.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "Simulations of time series with increasing correlations between $\\varepsilon_i$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Non-constant variance of error (heteroskedasticity)\n",
    "\n",
    "- The variance of the error depends on some characteristics of the input features.\n",
    "\n",
    "- To diagnose this, we can plot residuals vs. fitted values:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.11.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Non-constant variance of error (heteroskedasticity)\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.11.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.\n",
    "\n",
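     "A sketch on simulated data with multiplicative noise (not an ISLR example):\n",
     "\n",
     "```r\n",
     "set.seed(1)\n",
     "x = runif(200, 1, 10)\n",
     "y = exp(0.5 + 0.1 * x + rnorm(200, sd=0.3))  # multiplicative noise\n",
     "plot(fitted(lm(y ~ x)), resid(lm(y ~ x)))    # funnel shape\n",
     "fit.log = lm(log(y) ~ x)\n",
     "plot(fitted(fit.log), resid(fit.log))        # roughly even scatter\n",
     "```\n",
     "\n",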
    "## Outliers\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.12.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
     "- Outliers are points whose residuals are unusually large under the model.\n",
    "\n",
    "- While they may not affect the fit, they might affect our assessment of model quality.\n",
    "\n",
    "## Outliers\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.12.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "Possible solutions:\n",
    "<ul>\n",
    "<li>If we believe an outlier is due to an error in data collection, we can remove it.\n",
    "<li>An outlier might be evidence of a missing predictor, or the need to specify a more complex model.\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## High leverage points\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.13.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Some samples with extreme inputs have an outsized effect on $\\hat \\beta$.\n",
    "\n",
    "- This can be measured with the **leverage statistic** or **self influence**:\n",
    "\n",
    "$$h_{ii} =  \\frac{\\partial \\hat y_i}{\\partial y_i} =  (\\underbrace{\\mathbf X (\\mathbf X^T \\mathbf X)^{-1} \\mathbf X^T}_{\\text{Hat matrix}})_{i,i} \\in [1/n,1].$$"
   ]
  },
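  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Leverages are returned by `hatvalues` (a sketch on simulated data with one extreme input):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "x = c(rnorm(50), 8)  # one extreme input value\n",
    "y = 1 + 2 * x + rnorm(51)\n",
    "h = hatvalues(lm(y ~ x))\n",
    "which.max(h)  # the high-leverage point\n",
    "sum(h)        # trace of the hat matrix = number of coefficients"
   ]
  },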
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Studentized residuals\n",
    "\n",
     "1. The residual $e_i = y_i - \hat y_i$ is an estimate of the noise $\varepsilon_i$.\n",
     "\n",
     "2. The standard error of $e_i$ is $\sigma \sqrt{1-h_{ii}}$.\n",
     "\n",
     "3. A **studentized residual** is $e_i$ divided by its standard error (with an appropriate\n",
     "estimate of $\sigma$).\n",
     "\n",
     "4. When the model is correct, it follows a Student $t$-distribution with $n-p-2$ degrees of freedom."
   ]
  },
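  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Studentized residuals are available via `rstudent`, which uses the external estimate of $\\sigma$ (a sketch on simulated data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "x = rnorm(60)\n",
    "y = 1 + 2 * x + rnorm(60)\n",
    "r = rstudent(lm(y ~ x))  # externally studentized residuals\n",
    "summary(abs(r))          # values above about 3 would flag outliers"
   ]
  },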
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Collinearity\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.14.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
    "- Two predictors are collinear if one explains the other well:\n",
    "\n",
    "$$\\mathtt{limit} \\approx a\\times\\mathtt{rating}+b$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Collinearity\n",
    "\n",
     "- **Problem:** The coefficients become *unidentifiable*.\n",
    "\n",
    "- Consider the extreme case of using two identical predictors `limit`:\n",
    "$$\n",
    "\\begin{aligned}\n",
     "\mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \varepsilon \\\n",
     "& = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \varepsilon\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
     "- For every $(\beta_0,\beta_1,\beta_2)$, the fit is just as good as\n",
     "at $(\beta_0,\beta_1+100,\beta_2-100)$.\n",
    "\n",
    "## Collinearity\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.15.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Collinearity\n",
    "\n",
     "- If two variables are collinear, we can easily diagnose this using their correlation.\n",
     "\n",
     "- A group of $q$ variables is *multicollinear* if these variables \"contain less information\" than $q$ independent variables.\n",
     "\n",
     "- Pairwise correlations may not reveal multicollinear variables.\n",
     "\n",
     "- The Variance Inflation Factor (VIF) measures how well a variable can be predicted from the other variables, a proxy for\n",
     "how *redundant* it is:\n",
    "\n",
     "$$\text{VIF}(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}},$$\n",
     "\n",
     "- Above, $R^2_{X_j|X_{-j}}$ is the $R^2$ statistic for the multiple linear regression of the predictor $X_j$ onto the remaining predictors.\n",
    "\n",
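     "- The VIF can be computed directly from this definition (a sketch; the `car` package provides a `vif` function that does the same):\n",
     "\n",
     "```r\n",
     "library(MASS)  # Boston data\n",
     "r2 = summary(lm(tax ~ . - medv, data=Boston))$r.squared\n",
     "1 / (1 - r2)   # VIF for the predictor tax\n",
     "```\n",
     "\n",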
    "## Comparison to $K$-nearest neighbors\n",
    "\n",
    "- **Linear regression:** prototypical parametric method. <font color=\"red\">Easy for inference.</font>\n",
    "\n",
    "- **KNN regression:** prototypical nonparametric method. <font color=\"red\">Inference less clear.</font>\n",
    "\n",
    "### KNN estimator\n",
    "\n",
    "$$\\hat f(x) = \\frac{1}{K} \\sum_{i\\in N_K(x)} y_i$$"
   ]
  },
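  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- The KNN estimator above can be written in a few lines of base R (a sketch for one predictor; packages such as `FNN` provide `knn.reg` for real use):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "knn.predict = function(x0, x, y, K=9) {\n",
    "  nbrs = order(abs(x - x0))[1:K]  # indices of the K nearest training points\n",
    "  mean(y[nbrs])                   # average their responses\n",
    "}\n",
    "set.seed(1)\n",
    "x = rnorm(100)\n",
    "y = 2 + 3 * x + rnorm(100)\n",
    "sapply(c(-1, 0, 1), knn.predict, x=x, y=y, K=9)  # predictions at x0 = -1, 0, 1"
   ]
  },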
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# $K$-nearest neighbors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.16.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- $K=1$ on left, $K=9$ on right."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Comparing linear regression to $K$-nearest neighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- **Linear regression:** prototypical parametric method. <font color=\"red\">Easy for inference.</font>\n",
    "\n",
    "- **KNN regression:** prototypical nonparametric method. <font color=\"red\">Inference less clear.</font>\n",
    "\n",
    "### Long story short:\n",
    "\n",
     "-  KNN is only better when the function $f$ is far from linear (in which case the linear model is *misspecified*).\n",
    "\n",
    "-  When $n$ is not much larger than $p$, even if $f$ is nonlinear, Linear Regression can outperform KNN.\n",
    "\n",
    "- KNN has smaller bias, but this comes at a price of higher variance.\n",
    "\n",
    "## KNN estimates for a simulation from a linear model\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.17.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- $K=1$ on left, $K=9$ on right."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Linear models dominate KNN\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.18.png\" height=\"600\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Increasing deviations from linearity\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.19.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## When there are more predictors than observations, Linear Regression dominates\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.20.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
     "- When $p\gg n$, each sample has no *near* neighbors; this is known as the *curse of dimensionality*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## When there are more predictors than observations, Linear Regression dominates\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter3/3.20.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- The variance of KNN regression is very large."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,md:myst,Rmd"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
