{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Nonlinear methods'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Non-linear regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.1.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
    "**Problem:** How do we model a non-linear relationship?\n",
    "\n",
    "- **Left:** Regression of `wage` onto `age`.\n",
    "\n",
    "- **Right:** Logistic regression for classes `wage>250` and `wage<250`\n",
    "\n",
    "## Basis functions\n",
    "\n",
    "**Strategy:**\n",
    "\n",
    "<ul>\n",
    "<li>Define a model:\n",
    "$$Y = \\beta_0 + \\beta_1 f_1(X) + \\beta_2 f_2(X) + \\dots + \\beta_d f_d(X) + \\epsilon.$$\n",
    "<li>Fit this model through least-squares regression: $f_j$'s are nonlinear, model is linear!\n",
    "<li>Some options for $f_1,\\dots,f_d$:\n",
    "<ol>\n",
    "<li>Polynomials, $f_i(x) = x^i$.\n",
    "<li>Indicator functions, $f_i(x) = \\mathbf{1}(c_i \\leq x < c_{i+1})$.\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Piecewise constant\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.2.png\" height=\"600\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Piecewise polynomials\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.3.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Cubic splines\n",
    "\n",
    "<ul>\n",
    "<li>Define a set of knots $\\xi_1< \\xi_2 < \\dots<\\xi_K$.\n",
    "<li>We want the function $f$ in $Y= f(X) + \\epsilon$ to:\n",
    "<ol>\n",
    "<li>Be a cubic polynomial between every pair of knots $\\xi_i,\\xi_{i+1}$.\n",
    "<li>Be continuous at each knot.\n",
    "<li>Have continuous first and second derivatives at each knot.\n",
    "</ol>\n",
    "<li>It turns out, we can write $f$ in terms of $K+3$ basis functions:\n",
    "$$f(X) = \\beta_0 + \\beta_1 X + \\beta_2 X^2 + \\beta_3 X^3 + \\beta_4 h(X,\\xi_1) + \\dots + \\beta_{K+3} h(X,\\xi_K)$$\n",
    "<li>Above, \n",
    "$$ h(x,\\xi) = \\begin{cases} (x-\\xi)^3 \\quad \\text{if }x>\\xi \\\\ 0 \\quad \\text{otherwise} \\end{cases} $$\n",
    "</ul>\n",
    "\n",
    "## Natural cubic splines\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.4.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Spline which is linear instead of cubic for $X<\\xi_1$, $X>\\xi_K$.\n",
    "\n",
    "- The predictions are more stable for extreme values of $X$.\n",
    "\n",
    "## Choosing the number and locations of knots\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.6.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The locations of the knots are typically quantiles of $X$.\n",
    "\n",
    "- The number of knots, $K$, is chosen by cross validation:\n",
    "\n",
    "## Natural cubic splines vs. polynomial regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.7.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Splines can fit complex functions with few parameters.\n",
    "\n",
    "- Polynomials require high degree terms to be flexible.\n",
    "\n",
    "- High-degree polynomials can be unstable at the edges.\n",
    "\n",
    "## Smoothing splines\n",
    "\n",
    "Find the function $f$ which minimizes\n",
    "\n",
    "$$\\color{Blue}{\\sum_{i=1}^n (y_i - f(x_i))^2} + \\color{Red}{ \\lambda \\int f''(x)^2 dx}$$\n",
    "\n",
    "- <font color=\"blue\">The RSS when using $f$ to predict.</font>\n",
    "\n",
    "- <font color=\"red\">A penalty for the roughness of the function.</font>\n",
    "\n",
    "### Facts\n",
    "\n",
    "- The minimizer $\\hat f$ is a natural cubic spline, with knots at each sample point $x_1,\\dots,x_n$.\n",
    "\n",
    "- Obtaining $\\hat f$ is similar to a Ridge regression.\n",
    "\n",
    "## Advanced: deriving a smoothing spline\n",
    "\n",
    "<ol>\n",
    "<li>Show that if you fix the values $f(x_1),\\dots,f(x_2)$, the roughness \n",
    "$$\\int f''(x)^2 dx$$\n",
    "is minimized by a natural cubic spline.\n",
    "<li>Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions.\n",
    "$$f(x) = \\beta_0 + \\beta_1 f_1(x) + \\dots \\beta_{n+3} f_{n+3}(x)$$\n",
    "<li>Letting $\\mathbf N$ be a matrix with $\\mathbf N(i,j) = f_j(x_i)$, we can write the objective function:\n",
    "$$ (y - \\mathbf N\\beta)^T(y - \\mathbf N\\beta) + \\lambda \\beta^T \\Omega_{\\mathbf N}\\beta, $$\n",
    "where $\\Omega_{\\mathbf N}(i,j) = \\int N_i''(t) N_j''(t) dt$.\n",
    "</ol>\n",
    "\n",
    "## Advanced: deriving a smoothing spline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<ol start=\"3\">\n",
    "<li> By simple calculus, the coefficients $\\hat \\beta$ which minimize\n",
    "$$ (y - \\mathbf N\\beta)^T(y - \\mathbf N\\beta) + \\lambda \\beta^T \\Omega_{\\mathbf N}\\beta, $$\n",
    "are $\\hat \\beta = (\\mathbf N^T \\mathbf N + \\lambda \\Omega_{\\mathbf N})^{-1} \\mathbf N^T y$.\n",
    "<li>Note that the predicted values are a linear function of the observed values:\n",
    "$$\\hat y = \\underbrace{\\mathbf N  (\\mathbf N^T \\mathbf N + \\lambda \\Omega_{\\mathbf N})^{-1} \\mathbf N^T}_{\\mathbf S_\\lambda} y$$\n",
    "<li>The **degrees of freedom** for a smoothing spline are:\n",
    "$$\\text{Trace}(\\mathbf S_\\lambda)= \\mathbf S_\\lambda(1,1) + \\mathbf S_\\lambda(2,2) + \\cdots + \\mathbf S_\\lambda(n,n) $$\n",
    "</ol>\n",
    "\n",
    "## Natural cubic splines vs. Smoothing splines\n",
    "\n",
    "<div align=\"center\">\n",
    "<table border=\"1px\">\n",
    "<tr>\n",
    "<td width=\"400\">**Natural cubic splines**</td>\n",
    "<td width=\"400\">**Smoothing splines**</td>\n",
    "</tr>\n",
    "<td>\n",
    "<ul>\n",
    "<li>Fix the locations of $K$ knots at quantiles of $X$.\n",
    "<li>Number of knots $K<n$.\n",
    "<li>Find the natural cubic spline $\\hat f$ which minimizes the RSS:\n",
    "$$\\sum_{i=1}^n (y_i - f(x_i) )^2$$\n",
    "<li>Choose $K$ by cross validation.\n",
    "</ul>\n",
    "</td>\n",
    "<td>\n",
    "<ul>\n",
    "<li>Put $n$ knots at $x_1,\\dots,x_n$.\n",
    "<li>We could find a cubic spline which makes the RSS $=0$ $\\longrightarrow$ {\\color{Red}Overfitting!}\n",
    "<li>Instead, we obtain the fitted values $\\hat f(x_1),\\dots,\\hat f(x_n)$ through an algorithm similar to Ridge regression.\n",
    "<li>The function $\\hat f$ is the only natural cubic spline that has these fitted values.\n",
    "</ul>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Choosing the regularization parameter $\\lambda$\n",
    "\n",
    "- We typically choose $\\lambda$ through cross validation. \n",
    "\n",
    "- Fortunately, we can solve the problem for any $\\lambda$ with the same complexity of diagonalizing an $n\\times n$ matrix.\n",
    "\n",
    "- There is a shortcut for LOOCV:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "RSS_\\text{loocv}(\\lambda) &= \\sum_{i=1}^n (y_i - \\hat f_\\lambda^{(-i)}(x_i))^2 \\\\\n",
    "&= \\sum_{i=1}^n \\left[\\frac{y_i-\\hat f_\\lambda(x_i)}{1-\\mathbf S_\\lambda(i,i)}\\right]^2\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "## Choosing the regularization parameter $\\lambda$\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.8.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "# Local linear regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.9.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- Sample points nearer $x$ are weighted higher in corresponding regression.\n",
    "\n",
    "## Algorithm\n",
    "\n",
    "To predict the regression function $f$ at an input $x$:\n",
    "<ol>\n",
    "<li>Assign a weight $K_i(x)$ to the training point $x_i$, such that:\n",
    "<ul>\n",
    "<li>$K_i(x)=0$ unless $x_i$ is one of the $k$ nearest neighbors of $x$ (not strictly necessary).\n",
    "<li>$K_i(x)$ decreases when the distance $d(x,x_i)$ increases.\n",
    "</ul>\n",
    "<li>Perform a **weighted least squares** regression; i.e. find $(\\beta_0,\\beta_1)$ which minimize\n",
    "$$\\hat{\\beta}(x) = \\text{argmin}_{(\\beta_0, \\beta_1)} \\sum_{i=i}^n K_i(x) (y_i - \\beta_0 -\\beta_1 x_i)^2.$$\n",
    "<li>Predict $\\hat f(x) = \\hat \\beta_0(x) + \\hat \\beta_1(x) x$. \n",
    "</ol>\n",
    "\n",
    "## Generalized nearest neighbors\n",
    "\n",
    "<ol>\n",
    "<li>Set $K_i(x)=1$ if $x_i$ is one of $x$'s $k$ nearest neighbors.\n",
    "<li>Perform a ``regression'' with only an intercept; i.e. find $\\beta_0$ which minimizes\n",
    "$$\\hat{\\beta}_0(x) = \\text{argmin}_{\\beta_0} \\sum_{i=i}^n K_i(x)(y_i - \\beta_0)^2.$$\n",
    "<li>Predict $\\hat f(x) = \\hat \\beta_0(x)$.\n",
    "<li>Common choice that is smoother than nearest neighbors\n",
    "$$K_i(x) = \\exp(-\\|x-x_i\\|^2/2\\lambda)$$\n",
    "</ol>\n",
    "\n",
    "## Local linear regression\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.10.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "- The *span* $k/n$, is chosen by cross-validation.\n",
    "\n",
    "# Generalized Additive Models (GAMs)\n",
    "\n",
    "- Extension of non-linear models to multiple predictors:\n",
    "\n",
    "$$\\;\\;\\mathtt{wage} =  \\beta_0 + \\beta_1\\times \\mathtt{year} + \\beta_2\\times \\mathtt{age} + \\beta_3 \\times\\mathtt{education} +\\epsilon$$\n",
    "\n",
    "$$\\longrightarrow \\;\\;\\mathtt{wage} =  \\beta_0 + f_1(\\mathtt{year}) + f_2(\\mathtt{age}) + f_3(\\mathtt{education}) +\\epsilon$$\n",
    "\n",
    "- The functions $f_1,\\dots,f_p$ can be polynomials, natural splines, smoothing splines, local regressions...\n",
    "\n",
    "## Fitting a GAM\n",
    "\n",
    "$$\\mathtt{wage} =  \\beta_0 + f_1(\\mathtt{year}) + f_2(\\mathtt{age}) + f_3(\\mathtt{education}) +\\epsilon$$\n",
    "\n",
    "<ul>\n",
    "<li>If the functions $f_1$ have a basis representation, we can simply use least squares:\n",
    "<ul>\n",
    "<li>Natural cubic splines\n",
    "<li>Polynomials\n",
    "<li>Step functions\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Fitting a GAM\n",
    "\n",
    "<ul>\n",
    "<li> Otherwise, we can use **backfitting**:\n",
    "<ol>\n",
    "<li>Keep $\\beta_0,f_2,\\dots,f_p$ fixed, and fit $f_1$ using the partial residuals:\n",
    "$$y_i - \\beta_0 - f_2( x_{i2}) -\\dots  - f_p( x_{ip}),$$\n",
    "as the response.\n",
    "<li>Keep $\\beta_0,f_1,f_3,\\dots,f_p$ fixed, and fit $f_2$ using the partial residuals:\n",
    "$$y_i - \\beta_0 - f_1( x_{i1}) - f_3( x_{i3})  -\\dots  - f_p( x_{ip}),$$\n",
    "as the response. \n",
    "<li>...\n",
    "<li>Iterate\n",
    "</ol>\n",
    "<li>This works for smoothing splines and local regression.\n",
    "<li>**For smoothing splines this is a descent method, descending on convex loss ...**\n",
    "</ul>\n",
    "\n",
    "## Backfitting: coordinate descent\n",
    "\n",
    "<ul>\n",
    "<li>Also works for linear regression... \n",
    "<ol>\n",
    "<li>Initialize $\\hat{\\beta}^{(0)} = 0$ and, (say). \n",
    "<li>Given $\\hat{\\beta}^{(T-1)}$, choose a coordinate $0 \\leq k(T) \\leq p$ and find\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{\\alpha}(T) &= \\text{argmin}_{\\alpha} \\sum_{i=1}^n\\left(Y_i - \\hat{\\beta}^{(T-1)}_0 - \\sum_{j:j \\neq k(T)} X_{ij} \\hat{\\beta}^{(T-1)}_j - \\alpha X_{ik(T)}\\right)^2 \\\\\n",
    "&= \\frac{\\sum_{i=1}^n X_{ik(T)}\\left(Y_i - \\hat{\\beta}^{(T-1)}_0 - \\sum_{j: j \\neq k(T)} X_{ij} \\hat{\\beta}^{(T-1)}_j\\right)}\n",
    "{\\sum_{i=1}^n X_{ik(T)}^2}\n",
    "\\end{aligned}\n",
    "$$\n",
    "<li>Set $\\hat{\\beta}^{(T)}=\\hat{\\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\\hat{\\alpha}(T)$.\n",
    "<li>Iterate\n",
    "</ol>\n",
    "</ul>"
   ]
  },
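  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal R sketch of this loop for least squares, cycling through the coordinates in order rather than choosing $k(T)$ freely:\n",
    "\n",
    "```r\n",
    "# Cyclic coordinate descent for least squares; X is n x p, y has length n\n",
    "coord_descent <- function(X, y, n_iter = 100) {\n",
    "  p <- ncol(X)\n",
    "  beta <- rep(0, p)\n",
    "  beta0 <- mean(y)\n",
    "  for (t in seq_len(n_iter)) {\n",
    "    for (k in seq_len(p)) {\n",
    "      # partial residual with the k-th coordinate removed\n",
    "      r <- y - beta0 - X[, -k, drop = FALSE] %*% beta[-k]\n",
    "      beta[k] <- sum(X[, k] * r) / sum(X[, k]^2)  # closed-form 1-D minimizer\n",
    "    }\n",
    "    beta0 <- mean(y - X %*% beta)  # refresh the intercept\n",
    "  }\n",
    "  c(beta0, beta)\n",
    "}\n",
    "```"
   ]
  },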
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Backfitting: coordinate descent and LASSO\n",
    "\n",
    "<ul>\n",
    "<li>Also works for LASSO... \n",
    "<ol>\n",
    "<li>Initialize $\\hat{\\beta}^{(0)} = 0$ and, (say). \n",
    "<li>Given $\\hat{\\beta}^{(T-1)}$, choose a coordinate $0 \\leq k(T) \\leq p$ and find\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{\\alpha}_{\\lambda}(T) &= \\text{argmin}_{\\alpha} \\sum_{i=1}^n\\left(r^{(T-1)}_{ik(T)} - \\alpha X_{ik(T)}\\right)^2 \\\\\n",
    "& \\qquad + \\lambda \\sum_{j: j \\neq k(T)} |\\hat{\\beta}^{(T-1)}_j| + \\lambda |\\alpha|\n",
    "\\end{aligned}\n",
    "$$\n",
    "with $r^{(T-1)}_j$ the $j$-th partial residual at iteration $T$\n",
    "$$\n",
    "r^{(T-1)}_j = Y -  \\hat{\\beta}^{(T-1)}_0 - \\sum_{l:l \\neq j} X_l \\hat{\\beta}^{(T-1)}_l.\n",
    "$$\n",
    "Solution is a simple soft-thresholded version of previous $\\hat{\\alpha}(T)$ -- **Very fast! Used in `glmnet`**\n",
    "<li>Set $\\hat{\\beta}^{(T)}=\\hat{\\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\\hat{\\alpha}_{\\lambda}(T)$.\n",
    "<li>Iterate...\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Backfitting: GAM\n",
    "\n",
    "<ul>\n",
    "<li>Let's look at basis functions\n",
    "<ol>\n",
    "<li>Initialize $\\hat{\\beta}^{(0)} = 0$ and, (say). \n",
    "<li>Given $\\hat{\\beta}^{(T-1)}$, choose a coordinate $0 \\leq k(T) \\leq p$ and find\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{\\alpha}(T) &= \\text{argmin}_{\\alpha \\in \\mathbb{R}^{n_{k(T)}}} \\\\\n",
    "& \\qquad \\sum_{i=1}^n\\biggl(Y_i - \\hat{\\beta}^{(T-1)}_0 - \\sum_{j:j \\neq k(T)} \\sum_{l=1}^{n_j} f_{lj}(X_{ij}) \\hat{\\beta}^{(T-1)}_{lj}  \\\\\n",
    "& \\qquad \\quad - \\sum_{l=1}^{n_{k(T)}} \\alpha_{l} f_{lk(T)}(X_{ik(T)})\\biggr)^2 \n",
    "\\end{aligned}\n",
    "$$\n",
    "<li>Set $\\hat{\\beta}^{(T)}=\\hat{\\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\\hat{\\alpha}(T)$.\n",
    "**Blockwise coordinate descent!**\n",
    "<li>Iterate...\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Properties\n",
    "\n",
    "<ul>\n",
    "<li>GAMs are a step from linear regression toward a fully nonparametric method.\n",
    "<li>The only constraint is additivity. This can be partially addressed by adding key interaction variables $X_iX_j$\n",
    "(or tensor product of basis functions -- e.g. polynomials of two variables).\n",
    "<li>We can report degrees of freedom for many non-linear functions. \n",
    "<li>As in linear regression, we can examine the significance of each of the variables. \n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example: Regression for `Wage`\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.11.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- `year`: natural spline with df=4.\n",
    "\n",
    "- `age`: natural spline with df=5.\n",
    "\n",
    "- `education`: factor.\n",
    "\n",
    "## Example: Regression for `Wage`\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.12.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- `year`: smoothing spline with df=4.\n",
    "\n",
    "- `age`: smoothing spline with df=5.\n",
    "\n",
    "- `education`: step function\n",
    "\n",
    "## Classification\n",
    "\n",
    "We can model the log-odds in a classification problem using a GAM:\n",
    "\n",
    "$$\\log \\frac{P(Y=1\\mid X)}{P(Y=0\\mid X)} = \\beta_0 + f_1(X_1) + \\dots + f_p(X_p).$$\n",
    "\n",
    "Again fit by backfitting ...\n",
    "\n",
    "## Backfitting: GAM with logistic loss\n",
    "\n",
    "<ul>\n",
    "<li>Also works for logistic... \n",
    "<ol>\n",
    "<li>Initialize $\\hat{\\beta}^{(0)} = 0$ and, (say). \n",
    "<li>Given $\\hat{\\beta}^{(T-1)}$, choose a coordinate $0 \\leq k(T) \\leq p$ with $\\ell$ logistic loss, find\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{\\alpha}(T) &= \\text{argmin}_{\\alpha \\in \\mathbb{R}^{n_{k(T)}}} \\\\\n",
    "& \\qquad \\sum_{i=1}^n \\ell\\biggl(Y_i, \\hat{\\beta}^{(T-1)}_0 + \\sum_{j:j \\neq k(T)} \\sum_{l=1}^{n_j} f_{lj}(X_{ij}) \\hat{\\beta}^{(T-1)}_{lj} \\\\\n",
    "& \\qquad \\quad  + \\sum_{l=1}^{n_{k(T)}} \\alpha_{l} f_{lk(T)}(X_{ik(T)}) \\biggr)\n",
    "\\end{aligned}\n",
    "$$\n",
    "<li>Set $\\hat{\\beta}^{(T)}=\\hat{\\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\\hat{\\alpha}(T)$.\n",
    "<li>Works for losses that have a *linear predictor*.\n",
    "<li>For GAMs, the linear predictor is\n",
    "$$\n",
    "\\beta_0 + f_1(X_1) + \\dots + f_p(X_p)\n",
    "$$\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Example: Classification for `Wage>250`\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.13.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- `year`: linear\n",
    "\n",
    "- `age`: smoothing spline with df=5\n",
    "\n",
    "- `education`: step function\n",
    "\n",
    "## Example: Classification for `Wage>250`\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter7/7.14.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Same model excluding cases `education == \"<HS\"`"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,md:myst,Rmd"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
