{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: 'Support vector machines'\n",
    "author: \"Sergio Bacallado, Jonathan Taylor\"\n",
    "subtitle: \"[web.stanford.edu/class/stats202](http://web.stanford.edu/class/stats202)\"\n",
    "date: \"Autumn 2020\"\n",
    "output:\n",
    "  slidy_presentation:\n",
    "    css: styles.css\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "- Linear classifiers can be described geometrically by *separating hyperplanes*.\n",
    "\n",
    "- An affine function\n",
    "$$\n",
    "x \\mapsto \\beta_0 + \\sum_{j=1}^p x_j \\beta_j\n",
    "$$\n",
    "determines a hyperplane\n",
    "$$\n",
    "H = \\left\\{x: \\sum_{j=1}^p x_j \\beta_j + \\beta_0 = 0 \\right\\}.\n",
    "$$\n",
    "and two half-spaces\n",
    "$$\n",
    "\\left\\{x: \\sum_{j=1}^p x_j \\beta_j + \\beta_0 > 0 \\right\\}, \\qquad \n",
    "\\left\\{x: \\sum_{j=1}^p x_j \\beta_j + \\beta_0 < 0 \\right\\}.\n",
    "$$\n",
    "\n",
    "- The vector $N=(\\beta_1, \\dots, \\beta_p)$ is the *normal* vector of the hyperplane $H$.\n",
    "\n",
    "- For a given $H$, by scaling, we can always choose $N$ so that $\\|N\\|_2=1$ (we must also scale\n",
    "$\\beta_0$ to keep $H$ the same).\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Hyperplanes and normal vectors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.1.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Function is $x \\mapsto 1 + 2 x_1 + 3 x_2$\n",
    "\n",
    "- <font color=\"blue\">$\\{x: 1+2x_1+3x_2>0\\}$</font>\n",
    "\n",
    "- <font color=\"purple\">$\\{x: 1+2x_1+3x_2>0\\}$</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### Hyperplanes and normal vectors\n",
    "\n",
    "<ul>\n",
    "<li>If the hyperplane goes through the origin $(\\beta_0=0)$, the deviation between a point $(x_1,\\dots,x_p)$ and the hyperplane is the dot product:\n",
    "$$x\\cdot \\beta = x_1\\beta_1 + \\dots + x_p\\beta_p.$$\n",
    "<li>The sign of the dot product tells us on which side of the hyperplane the point lies.\n",
    "<li>If the hyperplane goes through a point $-\\beta_0 \\beta$, i.e. it is displaced from the origin by $-\\beta_0$ along the normal vector $(\\beta_1, \\dots, \\beta_p)$, the deviation of a point $(x_1,\\dots,x_p)$ from the hyperplane is:\n",
    "$$ \\beta_0 + x_1\\beta_1 + \\dots + x_p\\beta_p.$$\n",
    "<li>The sign tells us on which side of the hyperplane the point lies.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Maximal margin classifier\n",
    "\n",
    "<ul>\n",
    "<li>Suppose we have a classification problem with response $Y=-1$ or $Y=1$.\n",
    "<li>If the classes can be separated, most likely, there will be an infinite number of hyperplanes separating the classes.\n",
    "</ul>\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.2.png\" height=\"500\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Maximal margin classifier\n",
    "\n",
    "**Idea:**\n",
    "\n",
    "<ul>\n",
    "<li>Draw the largest possible empty margin around the hyperplane.\n",
    "<li>Out of all possible hyperplanes that separate the 2 classes, choose the one with the widest margin.\n",
    "</ul>\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.3.png\" height=\"500\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### Maximal margin classifier\n",
    "\n",
    "- This can be written as an optimization problem:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "&\\max_{\\beta_0,\\beta_1,\\dots,\\beta_p}\\;\\; M\\\\\n",
    "&\\text{subject to } \\sum_{j=1}^p \\beta_j^2 = 1,\\\\\n",
    "&\\underbrace{y_i(\\beta_0 +\\beta_1x_{i1}+\\dots+\\beta_p x_{ip})}_\\text{How far is $x_i$ from the hyperplane} \\geq M \\quad \\text{ for all }i=1,\\dots,n.\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "- $M$ is simply the width of the margin in either direction.\n",
    "\n",
    "- Ultimately, we will see that this problem can be transformed to something similar to logistic regression...\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Finding the maximal margin classifier\n",
    "\n",
    "- We can reformulate the problem by defining a vector $w=(w_1,\\dots,w_p) = \\beta/M$:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta_0,w}\\;\\; \\frac{1}{2}\\|w\\|^2 \\\\\n",
    "&\\text{subject to } \\\\\n",
    "&y_i(\\beta_0 +w\\cdot x_i) \\geq 1 \\quad \\text{ for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "- This is a quadratic optimization problem. Having found\n",
    "$(\\hat{\\beta}_0, \\hat{w})$ we can recover\n",
    "$\\hat{\\beta}=\\hat{w}/\\|\\hat{w}\\|_2, M=1/\\|\\hat{w}\\|_2$.\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Finding the maximal margin classifier\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta_0,w}\\;\\; \\frac{1}{2}\\|w\\|^2 \\\\\n",
    "&\\text{subject to } \\\\\n",
    "&y_i(\\beta_0 +w\\cdot x_i) \\geq 1 \\quad \\text{ for all }i=1,\\dots,n.\n",
    "\\end{align*}\\\\[3mm] \n",
    "$$\n",
    "\n",
    "Introducing Karush-Kuhn-Tucker (KKT) multipliers, $\\alpha_1,\\dots,\\alpha_n$, this is equivalent to:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\; \\min_{\\beta_0,w}\\;\\; \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i [y_i(\\beta_0 +w\\cdot x_i)-1]\\\\\n",
    "&\\text{subject to } \\alpha_i \\geq 0. \\\\\n",
    "\\end{align*}\\\\[3mm] \n",
    "$$\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Finding the maximal margin classifier\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\; \\min_{\\beta_0,w}\\;\\; \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i [y_i(\\beta_0 +w\\cdot x_i)-1]\\\\\n",
    "&\\text{subject to } \\alpha_i \\geq 0.\n",
    "\\end{align*}\\\\[3mm] \n",
    "$$\n",
    "\n",
    "<ul>\n",
    "<li>Setting the partial derivatives with respect to $w$ and $\\beta_0$ to 0, we get:\n",
    "$$\\hat w = \\sum_{i=1}^n \\alpha_i y_i x_i, \\quad \\sum_{i=1}^n\\alpha_iy_i = 0$$\n",
    "<li>Furthermore, one of the KKT conditions yields $\\alpha_i>0$ if and only if $y_i(\\beta_0 +w\\cdot x_i)=1$, that is, if $x_i$ falls on the margin.\n",
    "</ul> \n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vectors\n",
    "\n",
    "The vectors that fall on the margin and determine the solution are called **support vectors**\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.3.png\" height=\"500\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### Finding the maximal margin classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\; \\min_{\\beta_0,w}\\;\\; \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i [y_i(\\beta_0 +w\\cdot x_i)-1]\\\\\n",
    "&\\text{subject to }\\;\\; \\alpha_i \\geq 0.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "- The solution is $\\hat w = \\sum_{i=1}^n \\alpha_i y_i x_i$, and $\\sum_{i=1}^n \\alpha_iy_i=0$ so we can plug this in above to obtain the dual problem:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\;\\; \\sum_{i=1}^n \\alpha_i - \\frac{1}{2}\\sum_{i=1}^n\\sum_{i'=1}^n \\alpha_i\\alpha_{i'}y_i y_{i'} \\color{Blue}{(x_i\\cdot x_{i'})}\\\\\n",
    "&\\text{subject to }\\;\\; \\alpha_i \\geq 0, \\quad \\sum_{i}\\alpha_i y_i = 0.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Maximal margin classifier\n",
    "\n",
    "#### Summary\n",
    "\n",
    "- We've reduced the problem of finding $w$, which describes the hyperplane and the size of the margin, to finding a set of coefficients\n",
    "$\\alpha_1,\\dots,\\alpha_n$ through:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\;\\; \\sum_{i=1}^n \\alpha_i- \\frac{1}{2}\\sum_{i=1}^n\\sum_{i'=1}^n \\alpha_i\\alpha_{i'}y_i y_{i'} \\color{Blue}{(x_i\\cdot x_{i'})}\\\\\n",
    "&\\text{subject to }\\;\\; \\alpha_i \\geq 0, \\quad \\sum_{i}\\alpha_i y_i = 0.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "- This only depends on the training sample inputs through the inner products $x_i\\cdot x_j$ for every pair $i,j$. \n",
    "\n",
    "- Whatever $p$ is, this is an optimization problem in $n$ variables. <font color=\"red\">$\\to$ could be $p \\gg n$</font>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "**Problem:** It is not always possible to separate the points using a hyperplane.\n",
    "\n",
    "**Support vector classifier:**\n",
    "<ul>\n",
    "<li>Relaxation of the maximal margin classifier.\n",
    "<li>Allows a number of points to be on the wrong side of the margin or even the hyperplane\n",
    "by allowing *slack* $\\epsilon_i$ for each case.\n",
    "</ul>\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.6.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "- A new optimization problem:\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_{\\beta_0,\\beta,\\epsilon}\\;\\; M\\\\\n",
    "&\\text{subject to } \\sum_{j=1}^p \\beta_j^2 = 1,\\\\\n",
    "&\\underbrace{y_i(\\beta_0 +\\beta_1x_{i1}+\\dots+\\beta_p x_{ip})}_\\text{How far is $x_i$ from the hyperplane} \\geq M(1-\\epsilon_i) \\quad \\text{ for all }i=1,\\dots,n\\\\\n",
    "&\\epsilon_i \\geq 0 \\;\\text{for all }i=1,\\dots,n, \\quad \\sum_{i=1}^n \\epsilon_i \\leq C.\n",
    "\\end{align*}\\\\ \n",
    "$$\n",
    "\n",
    "- $M$ is the width of the margin in either direction\n",
    "\n",
    "- $\\epsilon=(\\epsilon_1,\\dots,\\epsilon_n)$ are called *slack* variables\n",
    "\n",
    "- $C$ is called the *budget*\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Tuning the budget, $C$ (high to low)\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.7.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### If the budget is too low, we tend to overfit\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.5.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- Maximal margin classifier, $C=0$.\n",
    "\n",
    "- Adding one observation dramatically changes the classifier.\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Finding the support vector classifier\n",
    "\n",
    "We can reformulate the problem by defining a vector $w=(w_1,\\dots,w_p) = \\beta/M$:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta_0,w,\\epsilon}\\;\\; \\frac{1}{2}\\|w\\|^2 + D\\sum_{i=1}^n \\epsilon_i\\\\\n",
    "&\\text{subject to } \\\\\n",
    "&y_i(\\beta_0 +w\\cdot x_i) \\geq (1-\\epsilon_i) \\quad \\text{ for all }i=1,\\dots,n,\\\\\n",
    "&\\epsilon_i \\geq 0 \\quad \\text{for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "The penalty $D\\geq 0$ serves a function similar to the budget $C$, but is inversely related to it.\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Finding the support vector classifier\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta_0,w,\\epsilon}\\;\\; \\frac{1}{2}\\|w\\|^2 + D\\sum_{i=1}^n \\epsilon_i \\\\\n",
    "&\\text{subject to } \\\\\n",
    "&y_i(\\beta_0 +w\\cdot x_i) \\geq (1-\\epsilon_i) \\quad \\text{ for all }i=1,\\dots,n.\\\\\n",
    "&\\epsilon_i \\geq 0 \\quad \\text{for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "Introducing KKT multipliers, $\\alpha_i$ and $\\mu_i$, this is equivalent to:\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_{\\alpha,\\mu}\\; \\min_{\\beta_0,w,\\epsilon}\\;\\; \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i [y_i(\\beta_0 +w\\cdot x_i)-1+\\epsilon_i] + \\sum_{i=1}^n (D-\\mu_i)\\epsilon_i\\\\\n",
    "&\\text{subject to } \\;\\;\\alpha_i \\geq  0, \\mu_i\\geq 0 ,  \\quad \\text{ for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Finding the support vector classifier\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_{\\alpha,\\mu}\\; \\min_{\\beta_0,w,\\epsilon}\\;\\; \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i [y_i(\\beta_0 +w\\cdot x_i)-1+\\epsilon_i] + \\sum_{i=1}^n (D-\\mu_i)\\epsilon_i\\\\\n",
    "&\\text{subject to } \\;\\;\\alpha_i \\geq  0, \\mu_i\\geq 0 ,  \\quad \\text{ for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "<ul>\n",
    "<li>\n",
    "Setting the derivatives with respect to $w$, $\\beta_0$, and $\\epsilon$ to 0, we obtain:\n",
    "$$\\hat w = \\sum_{i=1}^n \\alpha_i y_i x_i, \\quad \\sum_{i=1}^n\\alpha_i y_i = 0, \\quad \\mu_i=D-\\alpha_i$$\n",
    "<li>\n",
    "Furthermore, the KKT conditions yield $\\alpha_i>0$ if and only if $y_i(\\beta_0 +w\\cdot x_i)\\leq 1$, that is, if $x_i$ falls on the wrong side of the margin.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Support vectors\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.7.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### The problem only depends on $x_i\\cdot x_{i'}$\n",
    "\n",
    "- As with the maximal margin classifier, the problem can be reduced to finding $\\alpha_1,\\dots,\\alpha_n$:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\max_\\alpha\\;\\; \\sum_{i=1}^n \\alpha_i - \\frac{1}{2}\\sum_{i=1}^n\\sum_{i'=1}^n \\alpha_i\\alpha_{i'}y_i y_{i'} \\color{Blue}{(x_i\\cdot x_{i'})}\\\\\n",
    "&\\text{subject to }\\;\\; 0\\leq \\alpha_i  \\leq D  \\;\\text{ for all }i=1,\\dots,n, \\\\\n",
    "&\\quad \\sum_{i}\\alpha_i y_i = 0.\n",
    "\\end{align*}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- As for maximal\n",
    "margin classifier, this only depends on the training sample inputs through the inner products $x_i\\cdot x_j$ for every pair $i,j$. \n",
    "\n",
    "- Why care?\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### Key fact about the support vector classifier\n",
    "\n",
    "- To **find the hyperplane** all we need to know is the dot product between any pair of input vectors:\n",
    "\n",
    "$$K(x_i,x_k) = (x_i\\cdot x_k) =\\langle x_i, x_k \\rangle = \\sum_{j=1}^p x_{ij}x_{kj}$$\\\\\n",
    "\n",
    "- The matrix $\\mathbf{K}_{ij}=K(x_i,x_k)$ is called the *kernel* or *Gram* matrix.\n",
    "\n",
    "- To **make predictions at $x$** all we need to know is $\\langle x_i, x \\rangle$ for each\n",
    "point $x_i$ (actually only the support vectors).\n",
    "\n",
    "- This actually opens up the possibility of (richer) *nonlinear* classifiers: using basis functions\n",
    "can use\n",
    "$$\n",
    "K_f(x_i, x_j) = \\langle f(x_i), f(x_j) \\rangle, \\qquad K_f(x, x_i) = \\langle f(x), f(x_i) \\rangle\n",
    "$$\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### How to create non-linear boundaries?\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.8.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "- The support vector classifier can only produce a linear boundary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### How to create non-linear boundaries?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<ul>\n",
    "<li> In **logistic regression**, we dealt with this problem by adding transformations of the predictors.\n",
    "<li>The original decision boundary is a hyperplane:\n",
    "$$\\log \\left[ \\frac{P(Y=1|X)}{P(Y=0|X)} \\right] = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 = 0.$$\n",
    "<li>With a quadratic predictor, we get a quadratic boundary:\n",
    "$$\\log \\left[ \\frac{P(Y=1|X)}{P(Y=0|X)} \\right] = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_1^2= 0.$$\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### How to create non-linear boundaries?\n",
    "\n",
    "<ul>\n",
    "<li> With a **support vector classifier** we can apply a similar trick. \n",
    "<li>The original decision boundary is the hyperplane defined by:\n",
    "$$ \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 = 0.$$\n",
    "<li>If we expand the set of predictors to the 3D space $(X_1,X_2,X_1^2)$, the decision boundary becomes:\n",
    "$$ \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_1^2= 0.$$\n",
    "<li>This is in fact a linear boundary in the 3D space. However, we can classify a point knowing just $(X_1,X_2)$.\n",
    "The boundary in this projection is quadratic in $X_1$. \n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Support vector classifier\n",
    "\n",
    "#### How to create non-linear boundaries?\n",
    "\n",
    "<ul>\n",
    "<li>**Idea:** Add polynomial terms up to degree $d$:\n",
    "$$Z = (X_1, X_1^2, \\dots, X_1^d, X_2, X_2^2, \\dots, X_2^d, \\dots, X_p, X_p^2, \\dots, X_p^d).$$ \n",
    "<li>Does this make the computation more expensive?\n",
    "<li>Recall that all we need to compute is the dot product:\n",
    "$$x_i \\cdot x_{k} = \\langle x_i,x_k\\rangle = \\sum_{j=1}^p x_{ij}x_{kj}.$$\n",
    "<li>With the expanded set of predictors, we need:\n",
    "$$z_i \\cdot z_{k} = \\langle z_i,z_k\\rangle = \\sum_{j=1}^p\\sum_{\\ell=1}^d x^\\ell_{ij}x^\\ell_{kj}.$$\n",
    "<li>Modulo computing these dot products, the order of computations\n",
    "is the same as with linear features. *Not quite true for logistic regression as described above.*\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels and the kernel trick"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The **kernel matrix** defined by $K(x_i,x_k) = \\langle z_i, z_k \\rangle$ for a set of linearly\n",
    "independent vectors $z_1,\\dots,z_n$ is always symmetric and *positive semi-definite*, i.e. has no negative eigenvalues. \n",
    "\n",
    "- **Theorem:** If $K$ is a positive definite $n\\times n$ matrix, there\n",
    "exist vectors $(z_1,\\dots,z_n)$ in some space $\\mathbf{Z}$, such that\n",
    "$K(x_i,x_k) = \\langle z_i,z_k\\rangle$.\n",
    "\n",
    "- We just need $\\mathbf{K}_{ij}=K(x_i,x_j)$ to classify all of our learning examples, but to classify\n",
    "we need to know $x \\mapsto K(x, x_i)$.\n",
    "\n",
    "- The kernel $K:\\mathbb{R}^p \\times \\mathbb{R}^p \\rightarrow \\mathbb{R}$ should be such that for any choice\n",
    "of $\\{x_1, \\dots, x_n\\}$ the matrix $\\mathbf{K}$\n",
    "is positive semidefinite.\n",
    "\n",
    "- Such kernel functions are associated to *reproducing kernel Hilbert spaces (RKHS)*.\n",
    "\n",
    "- Support vector classifiers using this kernel trick are *support vector machines*...\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels and the kernel trick\n",
    "\n",
    "- With a kernel function $K$, the problem can be reduced to the optimization:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\hat \\alpha\\;=\\;&\\arg\\max_\\alpha\\;\\; \\sum_{i=1}^n \\alpha_i - \\frac{1}{2}\\sum_{i=1}^n\\sum_{i'=1}^n \\alpha_i\\alpha_{i'}y_i y_{i'} \\mathbf{K}_{ii'} \\\\\n",
    "&\\text{subject to }\\; 0\\leq \\alpha_i  \\leq D  \\;\\text{ for all }i=1,\\dots,n, \\\\\n",
    "&\\quad \\sum_{i=1}^n \\alpha_i y_i = 0.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "<ul>\n",
    "<li>This is the dual problem of a *different* optimization problem than we started with.\n",
    "<li>Predictions can be computed similarly to original kernel $K(x,y)=x \\cdot y$. (Details omitted.)\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels and the kernel trick\n",
    "\n",
    "<div align=\"center\">\n",
    "<table border=\"1px\">\n",
    "<tr>\n",
    "<td>**Expand the set of predictors:**\n",
    "</td>\n",
    "<td>**Define a kernel:**\n",
    "</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>\n",
    "<ul>\n",
    "<li>Find a mapping $\\Phi$ which expands the original set of predictors $X_1,\\dots,X_p$. For example, \n",
    "$$\\Phi(X) = (X_1,X_2,X_1^2)$$\n",
    "<li>For each pair of samples, compute:\n",
    "$$\\mathbf{K}_{ik} = \\langle\\Phi(x_i),\\Phi(x_k)\\rangle.$$\n",
    "<li>Predictions use $\\langle\\Phi(x), \\Phi(x_i) \\rangle, 1 \\leq i \\leq n$.\n",
    "</ul>\n",
    "</td>\n",
    "<td>\n",
    "<ul>\n",
    "<li>Prove that a function $R(\\cdot,\\cdot)$ is positive definite. For example:\n",
    "$$R(x_i,x_k) = \\left( 1+ \\langle x_i,x_k\\rangle\\right)^2.$$\n",
    "<li> For each pair of samples, compute:\n",
    "$$\\mathbf{K}_{ik} = f(x_i,x_k).$$\n",
    "<li><font color=\"red\">Often much easier!</font>\n",
    "<li>Predictions use $R(x, x_i), 1 \\leq i \\leq n$.\n",
    "</ul>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### The polynomial kernel with $d=2$:\n",
    "\n",
    "$$\\mathbf{K}_{ik} = R(x_i,x_k) =  \\left( 1+ \\langle x_i,x_k\\rangle\\right)^2$$\n",
    "\n",
    "This is equivalent to the expansion:\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\Phi(x) =& (\\sqrt{2}x_1,\\dots,\\sqrt{2}x_p, \\\\\n",
    "& x_1^2,\\dots,x_p^2, \\\\\n",
    "& \\sqrt{2}x_1x_2,\\sqrt{2}x_1x_3, \\dots, \\sqrt{2}x_{p-1}x_p)\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "<ul>\n",
    "<li>For a single $x$, computing $K(x,x_k)$ directly is $O(p)$.\n",
    "<li>For a single $x$ computing the kernel using the expansion is $O(p^2)$.\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support vector machines\n",
    "\n",
    "### How are kernels defined?\n",
    "\n",
    "<ul>\n",
    "<li> Proving that a bilinear function $R(\\cdot,\\cdot)$ is positive definite (PD) is not always easy.\n",
    "<li><2-> However, we can easily define PD kernels by combining those we are familiar with:\n",
    "<ul>\n",
    "<li> Sums and products of PD kernels are PD.\n",
    "</ul>\n",
    "<li>Intuitively, a kernel $K(x_i,x_k)$ defines a \\emph{similarity} between the samples $x_i$ and $x_k$.\n",
    "<li>This intuition can guide our choice in different problems.\n",
    "</ul>\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.9.png\" height=\"400\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Common kernels\n",
    "\n",
    "<ul>\n",
    "<li>The polynomial kernel:\n",
    "$$K(x_i,x_k) = \\left( 1+ \\langle x_i,x_k\\rangle\\right)^d$$\n",
    "<li>The radial basis kernel:\n",
    "$$K(x_i,x_k) = \\exp\\big(-\\gamma\\underbrace{\\sum_{j=1}^p (x_{ip}-x_{kp})^2}_\\text{Euclidean  $d(x_i,x_k)$}\\big)$$\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Summary so far\n",
    "\n",
    "<ul>\n",
    "<li>As noted above **support vector machine** is a support vector classifier applied on an expanded set of predictors, e.g.\n",
    "$$\\Phi: (X_1,X_2) \\to (X_1,X_2,X_1X_2, X_1^2,X_2^2).$$ \n",
    "<li>We expand the vector of predictors for each sample $x_i$ and then perform the algorithm.\n",
    "<li>We only need to know the dot products:\n",
    "$$ \\langle \\Phi(x_i),\\Phi(x_k) \\rangle \\equiv K(x_i,x_k)$$\n",
    "for every pair of samples $(x_i,x_k)$ as well as how to compute $\\langle \\Phi(x), \\Phi(x_i) \\rangle$ for new $x$'s.\n",
    "<li> Often, the dot product:\n",
    "$$\\langle \\Phi(x_i),\\Phi(x_k) \\rangle \\equiv K(x_i, x_k) $$\n",
    "is a simple function $f(x_i,x_k)$ of the original vectors. Even if the mapping $\\Phi$ significantly expands the space of predictors. \n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Summary so far\n",
    "\n",
    "<ul>\n",
    "<li>**Example 1:** Polynomial kernel\n",
    "<ul>\n",
    "<li>$$K(x_i,x_k) = (1+ \\langle x_i,x_k \\rangle)^2.$$\n",
    "<li> With two predictors, this corresponds to the mapping:\n",
    "$$\\Phi: (X_1,X_2) \\to (\\sqrt{2}X_1,\\sqrt{2}X_2,\\sqrt{2}X_1X_2, X_1^2,X_2^2).$$ \n",
    "</ul>\n",
    "<li>**Example 2:** RBF kernel\n",
    "$$K(x_i,x_k) = \\exp(-\\gamma d( x_i,x_k )^2),$$\n",
    "where $d$ is the Euclidean distance between $x_i$ and $x_k$.\n",
    "<ul>\n",
    "<li>In this case, the mapping $\\Phi$ is an expansion into an infinite number of transformations!\n",
    "<li><font color=\"red\">We can apply the method even if we don't know what these transformations explicitly are.</font>\n",
    "</ul>\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels for non-standard data types\n",
    "\n",
    "<ul>\n",
    "<li>We can define families of kernels (with tuning parameters), which capture similarity between non-standard kinds of data:\n",
    "<ol>\n",
    "<li>Text, strings\n",
    "<li>Images\n",
    "<li>Graphs\n",
    "<li>Histograms\n",
    "</ol>\n",
    "<li>Sometimes we know the mapping $\\Phi$, but there are algorithms that are fast for computing $K(x_i,x_k)$ without doing the expansion explicitly.\n",
    "<li>Other times, the expansion $\\Phi$ is infinite-dimensional or simply not known.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels for non-standard data types\n",
    "\n",
    "#### Example: strings\n",
    "\n",
    "Suppose we want to compare two strings in a finite alphabet:\n",
    "\n",
    "$$x_1 = ACCTATGCCATA$$\n",
    "$$x_2 = AGCTAAGCATAC$$\n",
    "\n",
    "<ul>\n",
    "<li>**Stringdot kernel:**\n",
    "For each word $u$ of length $p$, we define a feature:\n",
    "$$\\Phi_u(x_i) = \\text{# of times that }u\\text{ appears in }x_i$$\n",
    "<li>Naive algorithm would require looping over each sequence, for every subsequence $u$ of length $p$.\n",
    "<li>This would be $O(n^2)$ steps, where $n$ is the length of the sequences.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Kernels for non-standard data types\n",
    "\n",
    "#### Example: strings\n",
    "\n",
    "Suppose we want to compare two strings in a finite alphabet:\n",
    "$$x_1 = ACCTATGCCATA$$\n",
    "$$x_2 = AGCTAAGCATAC$$\n",
    "\n",
    "<ul>\n",
    "<li> **Gap weight kernel:** \n",
    "For each word $u$ of length $p$, we define a feature:\n",
    "$$\\Phi_u(x_i) = \\sum_{v: \\text{a subsequence of  $x_i$ containing $u$}} \\lambda^{\\text{length}(v)} $$\n",
    "with $0<\\lambda\\leq 1$.\n",
    "<li>The number of features can be huge! However, this can be computed in $\\mathcal O(Mp\\log n )$ steps where $M$ is the number of matches.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Applying SVMs with more than 2 classes\n",
    "\n",
    "<ul>\n",
    "<li> SVMs don't generalize nicely to the case of more than 2 classes.\n",
    "<li>Two main approaches:\n",
    "<ol>\n",
    "<li>**One vs. one:** Construct $n \\choose 2$ SVMs comparing every pair\n",
    "of classes. Apply all SVMs to a test observation and classify to the\n",
    "class that wins the most one-on-one challenges.\n",
    "<li>**One vs. all:** For each class $k$, construct an SVM\n",
    "$\\beta^{(k)}$ coding class $k$ as $1$ and all other classes as\n",
    "$-1$. Assign a test observation to the class $k^*$, such that the\n",
    "distance from $x_i$ to the hyperplane defined by $\\beta^{(k^*)}$ is\n",
    "largest (the distance is negative if the sample is misclassified).\n",
    "</ol>\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Relationship of SVM to logistic regression\n",
    "\n",
    "Recall the Lagrange form of the problem.\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta_0,w,\\epsilon}\\;\\; \\frac{1}{2}\\|w\\|^2 + D\\sum_{i=1}^n \\epsilon_i\\\\\n",
    "&\\text{subject to } \\\\\n",
    "&y_i(\\beta_0 +w\\cdot x_i) \\geq (1-\\epsilon_i) \\quad \\text{ for all }i=1,\\dots,n,\\\\\n",
    "&\\epsilon_i \\geq 0 \\quad \\text{for all }i=1,\\dots,n.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Relationship of SVM to logistic regression\n",
    "\n",
    "<ul>\n",
    "<li>Set $D=1/\\lambda$ and minimize over $\\epsilon_i$ explicitly.\n",
    "<li>If $1 - y_i(\\beta+0 + w \\cdot x_i) \\leq 0$ we can take $\\hat{\\epsilon}_i=0$.\n",
    "Otherwise, we take \n",
    "$$\\hat{\\epsilon}_i=1 - y_i(\\beta_0 + w \\cdot x_i).$$\n",
    "<li>Or, $$\\hat{\\epsilon}_i = \\max(1 - y_i(\\beta_0+w \\cdot x_i), 0).$$\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Relationship of SVM to logistic regression\n",
    "\n",
    "<ul>\n",
    "<li>Plugging this into the objective (and replacing $w$ with $\\beta$) yields\n",
    "$$\n",
    "\\begin{align*}\n",
    "&\\min_{\\beta}\\;\\; \\sum_{i=1}^n \\max(1 - y_i(\\beta_0 + \\sum_{j=1}^p \\beta_j x_{ij}, 0) + \\frac{\\lambda}{2}\\sum_{j=1}^p\\|\\beta_j\\|^2 + \\\\\n",
    "\\end{align*}\n",
    "$$\n",
    "<li>This loss is a function of $y_i(\\beta_0 + \\sum_{j=1}^p \\beta_j x_{ij}$ and a ridge penalty.\n",
    "<li>Loss for logistic regression is also a function of $y_i(\\beta_0 + \\sum_{j=1}^p \\beta_j x_{ij}$.\n",
    "<li>Large $\\lambda$ $\\iff$ small $D$ $\\iff$ large $C$.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### Relationship of SVM to logistic regression\n",
    "\n",
    "#### Comparing the losses\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"figs/Chapter9/9.12.png\" height=\"600\">\n",
    "</div>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### The kernel trick can be applied beyond SVMs\n",
    "\n",
    "**Kernels and dot products**: \n",
    "<ul>\n",
    "<li>Associated to a positive definite $R$ is a dot product. For $x$ in the feature space $\\mathbb{R}^p$, define $R_x:\\mathbb{R}^p \\rightarrow \\mathbb{R}$\n",
    "by\n",
    "$$\n",
    "R_x(x_0) = R(x,x_0)\n",
    "$$\n",
    "<li>The kernel defines a dot product on linear combinations of the $R_x$'s for different $x$'s:\n",
    "$$\\left \\langle \\sum_j c_j R_{x_j}, \\sum_i d_i R_{y_i} \\right \\rangle_R = \\sum_{i,j} c_j d_i R(x_j,y_i)\n",
    "$$\n",
    "and hence a length\n",
    "$$\n",
    "\\|\\sum_j c_j R_{x_j}\\|^2_R = \\sum_{i,j} c_i c_j R(x_i, x_j)\n",
    "$$\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### The kernel trick can be applied beyond SVMs\n",
    "\n",
    "**Kernel regression:**\n",
    "<ul>\n",
    "<li>For tuning parameter $\\lambda$ define\n",
    "$$\n",
    "\\hat{f}_{\\lambda} = \\text{argmin}_f \\sum_{i=1}^n (Y_i - f(X_i))^2 + \\lambda \\|f\\|^2_K\n",
    "$$\n",
    "<li>Remarkably, it is known that\n",
    "$$\n",
    "\\hat{f} = \\sum_{i=1}^n \\hat{\\alpha}_i K_{X_i}.\n",
    "$$\n",
    "for some $\\hat{\\alpha}$.\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### The kernel trick can be applied beyond SVMs\n",
    "\n",
    "**Kernel regression:**\n",
    "\n",
    "<ul>\n",
    "<li>Problem reduces to finding\n",
    "$$\n",
    "\\hat{\\alpha} = \\text{argmin}_{\\alpha} \\sum_{i=1}^n (Y_i - \\sum_{j=1}^n \\alpha_j K(X_i, X_j))^2 + \\lambda \\sum_{l, r=1}^n \\alpha_l \\alpha_r K(X_i, X_j)\n",
    "$$\n",
    "<li>Finding $\\hat{\\alpha}$ is just like ridge regression! \n",
    "<li>Similar phenomenon for logistic regression...\n",
    "<li>Just like smoothing splines, we solved a problem over an big space of functions!\n",
    "<li>Smoothing splines\n",
    "are a special case of the kernel trick...\n",
    "<li>Like support vector machines, optimization is over an $n$-dimensional vector, not a $p$-dimensional vector as in linear regression...\n",
    "</ul>\n",
    "\n",
    "## Support vector machines\n",
    "\n",
    "### The kernel trick can be applied beyond SVMs\n",
    "\n",
    "**Kernel PCA:**\n",
    "<ul>\n",
    "<li>Suppose we want to do PCA with an expanded set of predictors, defined by the mapping $\\Phi$.\n",
    "<li>First principal component is\n",
    "$$\n",
    "\\hat{f}_1 = \\text{argmax}_{f: \\|f\\|_K \\leq 1} \\hat{\\text{Var}}(f(X)).\n",
    "$$\n",
    "<li>Even if $\\Phi$ expands the predictors to a very high dimensional space, we can do PCA!\n",
    "<li>The cost only depends on the number of observations $n$.\n",
    "<li>**Long story short:** Kernel trick allows transformation of linear methods into\n",
    "nonlinear ones by *implicit* featurization.\n",
    "</ul>"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "all,-slideshow",
   "formats": "ipynb,Rmd,md:myst"
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
