{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Model selection and regularization'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<script type=\"text/x-mathjax-config\">\n",
    "MathJax.Hub.Register.StartupHook(\"TeX Jax Ready\",function () {\n",
    "  MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{\n",
    "    cancel: [\"Extension\",\"cancel\"],\n",
    "    bcancel: [\"Extension\",\"cancel\"],\n",
    "    xcancel: [\"Extension\",\"cancel\"],\n",
    "    cancelto: [\"Extension\",\"cancel\"]\n",
    "  });\n",
    "});\n",
    "</script>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.3.png\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "<ol>\n",
    "<li>Find a partition of the space of predictors.\n",
    "<li>Predict a constant in each set of the partition.\n",
    "<li>The partition is defined by splitting the range of one predictor at a time.\n",
    "<li><font color=\"red\">Note: $\\to$ not all partitions are possible.</font>\n",
    "</ol>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### Example: Predicting a baseball player's salary\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.1.png\" height=\"500\">\n",
    "</td>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.2.png\" height=\"500\">\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "- The prediction for a point in $R_i$ is the average of the training points in $R_i$.\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### How is a regression tree built?\n",
    "\n",
    "<ul>\n",
    "<li>Start with a single region $R_1$, and iterate:\n",
    "<ol>\n",
    "<li>Select a region $R_k$, a predictor $X_j$, and a splitting point $s$, such that splitting $R_k$ with the criterion $X_j<s$ produces the largest decrease in RSS:\n",
    "$$\\sum_{m=1}^{|T|} \\sum_{x_i\\in R_m} (y_i-\\bar y_{R_m})^2$$\n",
    "<li>Redefine the regions with this additional split.\n",
    "</ol>\n",
    "<li>Terminate when there are, say, 5 observations or fewer in each region.\n",
    "<li>This grows the tree from the root towards the leaves.\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### How is a regression tree built?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.4.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### How do we control overfitting?\n",
    "\n",
    "<ul>\n",
    "<li>**Idea 1:** Find the optimal subtree by cross validation.\n",
    "<li><font color=\"red\">Hmm.... there are too many possibilities -- harder than best subsets!</font>\n",
    "<li>**Idea 2:** Stop growing the tree when the RSS doesn't drop by more than a threshold with any new cut.\n",
    "<li><font color=\"red\">Hmm... in our greedy algorithm, it is possible to find good cuts after bad ones.</font>\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### How do we control overfitting?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution:** Prune a large tree from the leaves to the root. \n",
    "<ul>\n",
    "<li>**Weakest link pruning:**\n",
    "<ul>\n",
    "<li>Starting with $T_0$, substitute a subtree with a leaf to obtain $T_1$, by minimizing:\n",
    "$$\\frac{RSS(T_1)-RSS(T_0)}{|T_0|- |T_1|}.$$\n",
    "<li>Iterate this pruning to obtain a sequence $T_0,T_1,T_2,\\dots,T_m$ where $T_m$ is the null tree.\n",
    "</ul>\n",
    "<li>The sequence of trees can be used to solve the minimization for **cost complexity pruning** (over subtrees $T \\subset T_0$)\n",
    "$$\n",
    "\\text{minimize}_T \\left(\\sum_{m=1}^T \\sum_{x_i \\in R_m}(y_i - \\hat{y}_{R_m})^2 + \\alpha |T|\\right)\n",
    "$$\n",
    "<li>Choose $\\alpha$ by cross validation. \n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### Predicting baseball salaries\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "**Unpruned tree**\n",
    "</td>\n",
    "<td>\n",
    "**Error rates**\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.4.png\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.5.png\" height=\"600\">\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Regression trees\n",
    "\n",
    "#### Predicting baseball salaries\n",
    "\n",
    "<div align=\"center\">\n",
    "<table>\n",
    "<tr>\n",
    "<td>\n",
    "**Optimal tree**\n",
    "</td>\n",
    "<td>\n",
    "**Error rates**\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.1.png\" height=\"600\">\n",
    "</td>\n",
    "<td>\n",
    "<img src=\"figs/Chapter8/8.5.png\" height=\"600\">\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Classification trees\n",
    "\n",
    "<ul>\n",
    "<li>They work much like regression trees. Both are examples of *decision trees*.\n",
    "<li>We predict the response by **majority vote**, i.e. pick the most common class in every region.\n",
    "<li>Instead of trying to minimize the RSS:\n",
    "$$\\cancel{\\sum_{m=1}^{|T|} \\sum_{x_i\\in R_m} (y_i-\\bar y_{R_m})^2}$$\n",
    "we minimize a classification loss function. \n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Classification trees\n",
    "\n",
    "#### Classification losses\n",
    "\n",
    "<ul>\n",
    "<li>The 0-1 loss or misclassification rate:\n",
    "$$\\sum_{m=1}^{|T|} \\sum_{x_i\\in R_m} \\mathbf{1}(y_i \\neq \\hat y_{R_m})$$\n",
    "<li>The Gini index:\n",
    "$$\\sum_{m=1}^{|T|} q_m \\sum_{k=1}^K \\hat p_{mk}(1-\\hat p_{mk}),$$\n",
    "where $\\hat p_{m,k}$ is the proportion of class $k$ within $R_m$, and $q_m$ is the proportion of samples in $R_m$.\n",
    "<li>The cross-entropy:\n",
    "$$- \\sum_{m=1}^{|T|} q_m  \\sum_{k=1}^K \\hat p_{mk}\\log(\\hat p_{mk}).$$\n",
    "</ul>\n",
    "\n",
    "<ul>\n",
    "<li>The Gini index and cross-entropy are better measures of the purity of a region, i.e. they are low when the region is mostly one category.\n",
    "<li>**Motivation for the Gini index:**\n",
    "If instead of predicting the most likely class, we predict a random sample from the distribution $(\\hat p_{1,m},\\hat p_{2,m},\\dots,\\hat p_{K,m})$, the Gini index is the expected misclassification rate. \n",
    "<li>It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.\n",
    "</ul>\n",
    "\n",
    "#### Heart dataset\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.6.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Decision trees\n",
    "\n",
    "#### How do we deal with categorical predictors?\n",
    "\n",
    "<ul>\n",
    "<li>If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point $s$, as for a numerical variable.\n",
    "<li>If there are more than 2 categories:\n",
    "<ul>\n",
    "<li>Order the categories according to the average of the response:\n",
    "$$\\mathtt{ChestPain:a} > \\mathtt{ChestPain:c} > \\mathtt{ChestPain:b}$$\n",
    "<li>Treat as a numerical variable with this ordering, and choose a splitting point $s$.\n",
    "</ul>\n",
    "<li>One can show that this is the optimal way of partitioning.\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Decision trees\n",
    "\n",
    "#### How do we deal with missing data?\n",
    "\n",
    "<ul>\n",
    "<li>Suppose we can assign every sample to a leaf $R_i$ despite the missing data.\n",
    "<li>When choosing a new split with variable $X_j$ (growing the tree):\\\\\n",
    "<ul>\n",
    "<li>Only consider the samples which have the variable $X_j$.\n",
    "<li>In addition to choosing the best split, choose a second best split using a different variable, and a third best, ...\n",
    "</ul> \n",
    "<li>To propagate a sample down the tree, if it is missing a variable to make a decision,\n",
    "try the second best decision, or the third best: **surrogate splitting**\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Decision trees\n",
    "\n",
    "#### Some advantages\n",
    "\n",
    "<ul>\n",
    "<li>Very easy to interpret!\n",
    "<li>Closer to human decision-making.\n",
    "<li>Easy to visualize graphically (for shallow ones)\n",
    "<li>They easily handle qualitative predictors and missing data.\n",
    "</ul>\n",
    "\n",
    "<font color=\"red\">Downside: they don't necessarily fit that well!</font>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "<ul>\n",
    "<li>Bagging = <font color=\"red\">B</font>ootstrap <font color=\"red\">Agg</font>regat<font color=\"red\">ing</font>\n",
    "<li>In the Bootstrap, we replicate our dataset by sampling with replacement:\n",
    "<ul>\n",
    "<li>Original dataset: `x=c(x1, x2, ..., x100)`\n",
    "<li>Bootstrap samples:  `boot1=sample(x, 100, replace=True)`, ..., `bootB=sample(x, 100, replace=True)`\n",
    "</ul>\n",
    "<li>We used these samples to get the Standard Error of a parameter estimate:\n",
    "$$\\text{SE}(\\hat\\beta_1) \\approx \\frac{1}{B}\\sum_{b=1}^B \\hat\\beta_1^{(b)}$$\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "<ul>\n",
    "<li> In **bagging** we average the predictions of a model fit to many Bootstrap samples.\n",
    "<li>*Example.* Bagging the Lasso\n",
    "<ul>\n",
    "<li>Let $\\hat y^{L,b}$ be the prediction of the Lasso applied to the $b$th bootstrap sample.\n",
    "<li>Bagging prediction:\n",
    "$$\\hat y^\\text{bag} = \\frac{1}{B} \\sum_{b=1}^B \\hat y^{L,b}.$$\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### When does Bagging make sense?\n",
    "\n",
    "- When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.\n",
    "\n",
    "<ul>\n",
    "<li>When $n$ is large, the empirical distribution is similar to the true distribution of the samples.\n",
    "<li>Bootstrap samples are similar to independent realizations of the data. They are actually conditionally independent, given the data.\n",
    "<li>Bagging smooths out an estimator which can reduce variance.\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Bagging decision trees\n",
    "\n",
    "<ul>\n",
    "<li>**Disadvantage:** Every time we fit a decision tree to a Bootstrap sample, we get a different tree $T^b$.\\\\[2mm]\n",
    "<li><font color=\"red\">$\\to$ Loss of interpretability</font>\n",
    "<li>**Variable importance:**\n",
    "<ul>\n",
    "<li>For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in $T^b$.\n",
    "<li>Average this total over each Boostrap estimate $T^1,\\dots,T^B$.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.9.png\" height=\"400\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Out-of-bag (OOB) error\n",
    "\n",
    "<ul>\n",
    "<li>To estimate the test error of a bagging estimate, we could use cross-validation.\n",
    "<li>Each time we draw a Bootstrap sample, we only use ~63% of the observations.\n",
    "<li>**Idea:** use the rest of the observations as a test set.\n",
    "<li>**OOB error:**\n",
    "<ul>\n",
    "<li>For each sample $x_i$, find the prediction $\\hat y_{i}^b$ for all bootstrap samples $b$ which do not contain $x_i$. There should be around $0.37B$ of them.\n",
    "Average these predictions to obtain $\\hat y_{i}^\\text{oob}$.\n",
    "<li>Compute the error $(y_i-\\hat y_{i}^\\text{oob})^2$.\n",
    "<li>Average the errors over all observations $i=1,\\dots,n$.\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Out-of-bag (OOB) error\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.8.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The test error decreases as we increase $B$ (dashed line is the error for a plain decision tree).\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Random Forests\n",
    "\n",
    "<ul><li>\n",
    "Bagging has a problem:\n",
    "<font color=\"red\">$\\to$ The trees produced by different Bootstrap samples can be very similar</font>\n",
    "<li>**Random Forests**:\n",
    "<ul>\n",
    "<li>We fit a decision tree to different Bootstrap samples.\n",
    "<li>When growing the tree, we select a random sample of $m<p$ predictors to consider in each step.\n",
    "<li>This will lead to trees that are less correlated from each sample.\n",
    "<li>Finally, average the prediction of each tree. \n",
    "</ul>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Random Forests vs. Bagging\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.8.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Bagging\n",
    "\n",
    "#### Random Forests, choosing $m$\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.10.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The optimal $m$ is usually around $\\sqrt p$, but this can be used as a tuning parameter.\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Boosting\n",
    "\n",
    "- Another *ensemble* method (i.e. uses a collection of learners)\n",
    "\n",
    "- Instead of randomizing each learner, each learner fits to the *residual* (not that unlike backfitting)\n",
    "\n",
    "#### Boosting regression trees\n",
    "\n",
    "<ol>\n",
    "<li>Set $\\hat f(x) = 0$, and $r_i=y_i$ for $i=1,\\dots,n$.\n",
    "<li>For $b=1,\\dots,B$, iterate:\n",
    "<ol>\n",
    "<li>Fit a regression tree $\\hat f^b$ with $d$ splits to the response $r_1,\\dots,r_n$.\n",
    "<li>Update the prediction to:\n",
    "$$\\hat f(x) \\leftarrow \\hat f(x) + \\lambda \\hat f^b(x).$$\n",
    "<li>Update the residuals,\n",
    "$$ r_i \\leftarrow r_i - \\lambda \\hat f^b(x_i).$$\n",
    "</ol>\n",
    "<li>Output the final model:\n",
    "$$\\hat f(x) = \\sum_{b=1}^B \\lambda \\hat f^b(x).$$\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Boosting classification trees\n",
    "\n",
    "- Can be done with appropriately defined *residual* for classification\n",
    "based on *offset* in log-odds.\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Boosting\n",
    "\n",
    "#### Some intuition\n",
    "\n",
    "- <font color=\"red\">Boosting learns *slowly*</font>\n",
    "\n",
    "- We first use the samples that are easiest to predict, then slowly down weigh these cases, moving on to harder samples.  \n",
    "\n",
    "\\end{frame}\n",
    "\n",
    "## Tree-based methods\n",
    "\n",
    "### Boosting vs. random forests\n",
    "\n",
    "#### Random Forests, choosing $m$\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter8/8.11.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The parameter $\\lambda=0.01$ in each case.\n",
    "\n",
    "- We can tune the model by CV using $\\lambda, d, B.$"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,Rmd,md:myst"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
