Support vector machines#

  • Linear classifiers can be described geometrically by separating hyperplanes.

  • An affine function \( x \mapsto \beta_0 + \sum_{j=1}^p x_j \beta_j \) determines a hyperplane $$ H = \left\{x: \sum_{j=1}^p x_j \beta_j + \beta_0 = 0 \right\} $$ and two half-spaces $$ \left\{x: \sum_{j=1}^p x_j \beta_j + \beta_0 > 0 \right\}, \qquad \left\{x: \sum_{j=1}^p x_j \beta_j + \beta_0 < 0 \right\}. $$

  • The vector \(N=(\beta_1, \dots, \beta_p)\) is the normal vector of the hyperplane \(H\).

  • For a given \(H\), by scaling, we can always choose \(N\) so that \(\|N\|_2=1\) (we must also scale \(\beta_0\) to keep \(H\) the same).

Support vector machines#

Hyperplanes and normal vectors#

  • Function is \(x \mapsto 1 + 2 x_1 + 3 x_2\)

  • \(\{x: 1+2x_1+3x_2>0\}\)

  • \(\{x: 1+2x_1+3x_2<0\}\)

Support vector machines#

Hyperplanes and normal vectors#

  • If the hyperplane goes through the origin $(\beta_0=0)$ and $\|\beta\|_2=1$, the deviation between a point $(x_1,\dots,x_p)$ and the hyperplane is the dot product: $$x\cdot \beta = x_1\beta_1 + \dots + x_p\beta_p.$$
  • The sign of the dot product tells us on which side of the hyperplane the point lies.
  • If the hyperplane instead passes through the point $-\beta_0 \beta$, i.e. it is displaced from the origin by $-\beta_0$ along the unit normal vector $\beta = (\beta_1, \dots, \beta_p)$, the deviation of a point $(x_1,\dots,x_p)$ from the hyperplane is: $$ \beta_0 + x_1\beta_1 + \dots + x_p\beta_p.$$
  • The sign tells us on which side of the hyperplane the point lies.
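
For concreteness, here is a minimal numpy sketch using the example coefficients from the earlier slide; the sample points are made up:

```python
import numpy as np

# Hyperplane from the earlier example: 1 + 2*x1 + 3*x2 = 0
beta0 = 1.0
beta = np.array([2.0, 3.0])

# A few made-up points (rows are observations)
X = np.array([[ 1.0,  1.0],
              [-2.0,  0.0],
              [ 0.0, -1.0]])

deviation = beta0 + X @ beta                    # beta0 + x . beta for each point
side = np.sign(deviation)                       # +1 / -1: which half-space the point is in
signed_dist = deviation / np.linalg.norm(beta)  # true signed distance requires a unit normal

print(deviation)     # [ 6. -3. -2.]
print(side)          # [ 1. -1. -1.]
print(signed_dist)
```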

Support vector machines#

Maximal margin classifier#

  • Suppose we have a classification problem with response $Y=-1$ or $Y=1$.
  • If the classes can be separated by a hyperplane, there will typically be infinitely many hyperplanes that separate them.

Support vector machines#

Maximal margin classifier#

Idea:

  • Draw the largest possible empty margin around the hyperplane.
  • Out of all possible hyperplanes that separate the 2 classes, choose the one with the widest margin.

Support vector machines#

Maximal margin classifier#

  • This can be written as an optimization problem:

\[\begin{split} \begin{aligned} &\max_{\beta_0,\beta_1,\dots,\beta_p}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &\underbrace{y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})}_\text{How far is $x_i$ from the hyperplane} \geq M \quad \text{ for all }i=1,\dots,n. \end{aligned} \end{split}\]
  • \(M\) is simply the width of the margin in either direction.

  • Ultimately, we will see that this problem can be transformed to something similar to logistic regression…

Support vector machines#

Finding the maximal margin classifier#

  • We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\):

\[\begin{split} \begin{align*} &\min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 \\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq 1 \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • This is a quadratic optimization problem. Having found \((\hat{\beta}_0, \hat{w})\) we can recover \(\hat{\beta}=\hat{w}/\|\hat{w}\|_2, M=1/\|\hat{w}\|_2\).
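
A minimal sketch of solving this quadratic program directly, assuming the cvxpy package is available; the separable toy data are made up:

```python
import numpy as np
import cvxpy as cp

# Toy separable data: two clouds with labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
               rng.normal(+2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

n, p = X.shape
w = cp.Variable(p)
beta0 = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (beta0 + w . x_i) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + beta0) >= 1]
cp.Problem(objective, constraints).solve()

w_hat = w.value
beta_hat = w_hat / np.linalg.norm(w_hat)   # unit normal of the separating hyperplane
M_hat = 1 / np.linalg.norm(w_hat)          # margin width
print(beta_hat, beta0.value, M_hat)
```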

Support vector machines#

Finding the maximal margin classifier#

\[\begin{split} \begin{align*} &\min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 \\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq 1 \quad \text{ for all }i=1,\dots,n. \end{align*}\\[3mm] \end{split}\]

Introducing Karush-Kuhn-Tucker (KKT) multipliers, \(\alpha_1,\dots,\alpha_n\), this is equivalent to:

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to } \alpha_i \geq 0. \\ \end{align*}\\[3mm] \end{split}\]

Support vector machines#

Finding the maximal margin classifier#

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to } \alpha_i \geq 0. \end{align*}\\[3mm] \end{split}\]
  • Setting the partial derivatives with respect to $w$ and $\beta_0$ to 0, we get: $$\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n\alpha_iy_i = 0$$
  • Furthermore, complementary slackness (one of the KKT conditions) yields $\alpha_i>0$ only if $y_i(\beta_0 +w\cdot x_i)=1$, that is, only if $x_i$ falls exactly on the margin.

Support vector machines#

Support vectors#

The vectors that fall on the margin and determine the solution are called *support vectors*.

Support vector machines#

Finding the maximal margin classifier#

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to }\;\; \alpha_i \geq 0. \end{align*} \end{split}\]
  • The solution is \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i\), and \(\sum_{i=1}^n \alpha_iy_i=0\) so we can plug this in above to obtain the dual problem:

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; \alpha_i \geq 0, \quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]

Support vector machines#

Maximal margin classifier#

Summary#

  • We’ve reduced the problem of finding \(w\), which describes the hyperplane and the size of the margin, to finding a set of coefficients \(\alpha_1,\dots,\alpha_n\) through:

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i- \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; \alpha_i \geq 0, \quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]
  • This only depends on the training sample inputs through the inner products \(x_i\cdot x_j\) for every pair \(i,j\).

  • Whatever \(p\) is, this is an optimization problem in \(n\) variables, which is a real advantage when \(p \gg n\) (the sketch below illustrates fitting from the \(n\times n\) matrix of inner products alone).
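
A small sketch of this point, assuming scikit-learn: with `kernel='precomputed'` the classifier is fit from the \(n\times n\) Gram matrix alone, never touching the \(p\)-dimensional inputs. The hard margin is approximated here by a very large cost parameter, and the toy data are made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 5)),
               rng.normal(+2, 0.5, size=(20, 5))])
y = np.array([-1] * 20 + [1] * 20)

gram = X @ X.T                          # n x n matrix of inner products x_i . x_i'

# kernel='precomputed': fitting only ever sees the Gram matrix;
# C is very large to approximate the hard (maximal) margin.
svm = SVC(kernel="precomputed", C=1e6).fit(gram, y)

# Predicting at new points only needs their inner products with the training x_i
X_new = rng.normal(0, 2, size=(3, 5))
print(svm.predict(X_new @ X.T))
```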

Support vector machines#

Support vector classifier#

Problem: It is not always possible to separate the points using a hyperplane.

Support vector classifier:

  • Relaxation of the maximal margin classifier.
  • Allows a number of points to be on the wrong side of the margin or even the hyperplane by allowing *slack* $\epsilon_i$ for each case.

Support vector machines#

Support vector classifier#

  • A new optimization problem: $$ \begin{align*} &\max_{\beta_0,\beta,\epsilon}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &\underbrace{y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})}_\text{How far is $x_i$ from the hyperplane} \geq M(1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \;\text{for all }i=1,\dots,n, \quad \sum_{i=1}^n \epsilon_i \leq C. \end{align*} $$

  • \(M\) is the width of the margin in either direction

  • \(\epsilon=(\epsilon_1,\dots,\epsilon_n)\) are called slack variables

  • \(C\) is called the budget

Support vector machines#

Support vector classifier#

Tuning the budget, \(C\) (high to low)#

Support vector machines#

Support vector classifier#

If the budget is too low, we tend to overfit#

  • Maximal margin classifier, \(C=0\).

  • Adding one observation dramatically changes the classifier.

Support vector machines#

Support vector classifier#

Finding the support vector classifier#

We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\):

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i\\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

The penalty \(D\geq 0\) serves a function similar to the budget \(C\), but is inversely related to it.
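
For reference, software typically solves this penalized form; a short sketch assuming scikit-learn, whose `C` argument plays the role of the penalty \(D\) here (confusingly, not the budget \(C\) of the earlier slides, so a large `C` corresponds to a small budget). The overlapping toy data are made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, size=(30, 2)),   # overlapping classes,
               rng.normal(+1, 1.0, size=(30, 2))])  # not linearly separable
y = np.array([-1] * 30 + [1] * 30)

for D in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=D).fit(X, y)        # sklearn's C == penalty D above
    n_sv = clf.support_.size                         # points on or inside the margin
    print(f"D={D:>6}: {n_sv} support vectors")
```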

Support vector machines#

Support vector classifier#

Finding the support vector classifier#

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i \\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n.\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

Introducing KKT multipliers, \(\alpha_i\) and \(\mu_i\), this is equivalent to: $$ \begin{align*} &\max_{\alpha,\mu}\; \min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1+\epsilon_i] + \sum_{i=1}^n (D-\mu_i)\epsilon_i\\ &\text{subject to } \;\;\alpha_i \geq 0, \mu_i\geq 0 , \quad \text{ for all }i=1,\dots,n. \end{align*} $$

Support vector machines#

Support vector classifier#

Finding the support vector classifier#

\[\begin{split} \begin{align*} &\max_{\alpha,\mu}\; \min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1+\epsilon_i] + \sum_{i=1}^n (D-\mu_i)\epsilon_i\\ &\text{subject to } \;\;\alpha_i \geq 0, \mu_i\geq 0 , \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • Setting the derivatives with respect to $w$, $\beta_0$, and $\epsilon$ to 0, we obtain: $$\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n\alpha_i y_i = 0, \quad \mu_i=D-\alpha_i$$
  • Furthermore, the KKT conditions yield $\alpha_i>0$ only if $y_i(\beta_0 +w\cdot x_i)\leq 1$, that is, only if $x_i$ falls on the margin or on the wrong side of it.

Support vector machines#

Support vector classifier#

Support vectors#

Support vector machines#

Support vector classifier#

The problem only depends on \(x_i\cdot x_{i'}\)#

  • As with the maximal margin classifier, the problem can be reduced to finding \(\alpha_1,\dots,\alpha_n\):

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; 0\leq \alpha_i \leq D \;\text{ for all }i=1,\dots,n, \\ &\quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]
  • As for maximal margin classifier, this only depends on the training sample inputs through the inner products \(x_i\cdot x_j\) for every pair \(i,j\).

  • Why care?

Support vector machines#

Support vector classifier#

Key fact about the support vector classifier#

  • To find the hyperplane all we need to know is the dot product between any pair of input vectors:

\[K(x_i,x_k) = (x_i\cdot x_k) = \langle x_i, x_k \rangle = \sum_{j=1}^p x_{ij}x_{kj}\]

  • The matrix $\mathbf{K}_{ik}=K(x_i,x_k)$ is called the *kernel* or *Gram* matrix.

  • To **make predictions at $x$** all we need to know is $\langle x_i, x \rangle$ for each point $x_i$ (actually only for the support vectors).

  • This opens up the possibility of (richer) *nonlinear* classifiers: using basis functions $f$ we can instead use $$K_f(x_i, x_j) = \langle f(x_i), f(x_j) \rangle, \qquad K_f(x, x_i) = \langle f(x), f(x_i) \rangle.$$

Support vector machines#

Support vector classifier#

How to create non-linear boundaries?#

  • The support vector classifier can only produce a linear boundary.

Support vector machines#

Support vector classifier#

How to create non-linear boundaries?#

  • In **logistic regression**, we dealt with this problem by adding transformations of the predictors.
  • The original decision boundary is a hyperplane: $$\log \left[ \frac{P(Y=1|X)}{P(Y=0|X)} \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.$$
  • With a quadratic predictor, we get a quadratic boundary: $$\log \left[ \frac{P(Y=1|X)}{P(Y=0|X)} \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2= 0.$$

Support vector machines#

Support vector classifier#

How to create non-linear boundaries?#

  • With a **support vector classifier** we can apply a similar trick.
  • The original decision boundary is the hyperplane defined by: $$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.$$
  • If we expand the set of predictors to the 3D space $(X_1,X_2,X_1^2)$, the decision boundary becomes: $$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2= 0.$$
  • This is in fact a linear boundary in the 3D space. However, we can classify a point knowing just $(X_1,X_2)$. The boundary in this projection is quadratic in $X_1$.
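
A minimal sketch of this expansion trick, assuming scikit-learn; the toy data and the single added feature \(X_1^2\) are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data whose true boundary is quadratic in X1: y = sign(X2 - X1^2)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 1] > X[:, 0] ** 2, 1, -1)

# Expand (X1, X2) -> (X1, X2, X1^2) and fit an ordinary *linear* SVC in 3D
Z = np.column_stack([X, X[:, 0] ** 2])
clf = SVC(kernel="linear", C=1.0).fit(Z, y)

# The fitted boundary beta0 + b1*X1 + b2*X2 + b3*X1^2 = 0 is a hyperplane in Z,
# but projects to a quadratic curve in the original (X1, X2) plane.
print(clf.intercept_, clf.coef_)
print("training accuracy:", clf.score(Z, y))
```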

Support vector machines#

Support vector classifier#

How to create non-linear boundaries?#

  • **Idea:** Add polynomial terms up to degree $d$: $$Z = (X_1, X_1^2, \dots, X_1^d, X_2, X_2^2, \dots, X_2^d, \dots, X_p, X_p^2, \dots, X_p^d).$$
  • Does this make the computation more expensive?
  • Recall that all we need to compute is the dot product: $$x_i \cdot x_{k} = \langle x_i,x_k\rangle = \sum_{j=1}^p x_{ij}x_{kj}.$$
  • With the expanded set of predictors, we need: $$z_i \cdot z_{k} = \langle z_i,z_k\rangle = \sum_{j=1}^p\sum_{\ell=1}^d x^\ell_{ij}x^\ell_{kj}.$$
  • Modulo computing these dot products, the order of computation is the same as with linear features. (This is not quite true for logistic regression with expanded predictors as described above.)

Support vector machines#

Kernels and the kernel trick#

  • The kernel matrix defined by \(K(x_i,x_k) = \langle z_i, z_k \rangle\) for a set of linearly independent vectors \(z_1,\dots,z_n\) is always symmetric and positive semi-definite, i.e. has no negative eigenvalues.

  • Theorem: If \(\mathbf{K}\) is a positive definite \(n\times n\) matrix, there exist vectors \(z_1,\dots,z_n\) in some space \(\mathbf{Z}\) such that \(\mathbf{K}_{ik} = \langle z_i,z_k\rangle\).

  • We only need \(\mathbf{K}_{ij}=K(x_i,x_j)\) to fit the classifier on the training examples, but to classify a new point \(x\) we also need to know \(x \mapsto K(x, x_i)\).

  • The kernel \(K:\mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}\) should be such that for any choice of \(\{x_1, \dots, x_n\}\) the matrix \(\mathbf{K}\) is positive semidefinite.

  • Such kernel functions are associated to reproducing kernel Hilbert spaces (RKHS).

  • Support vector classifiers using this kernel trick are called **support vector machines**.

Support vector machines#

Kernels and the kernel trick#

  • With a kernel function \(K\), the problem can be reduced to the optimization:

\[\begin{split} \begin{align*} \hat \alpha\;=\;&\arg\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \mathbf{K}_{ii'} \\ &\text{subject to }\; 0\leq \alpha_i \leq D \;\text{ for all }i=1,\dots,n, \\ &\quad \sum_{i=1}^n \alpha_i y_i = 0. \end{align*} \end{split}\]
  • This is the dual problem of a *different* optimization problem than we started with.
  • Predictions can be computed just as with the original linear kernel $K(x,y)=x \cdot y$. (Details omitted.)

Support vector machines#

Kernels and the kernel trick#

**Expand the set of predictors:**

  • Find a mapping $\Phi$ which expands the original set of predictors $X_1,\dots,X_p$. For example, $$\Phi(X) = (X_1,X_2,X_1^2).$$
  • For each pair of samples, compute: $$\mathbf{K}_{ik} = \langle\Phi(x_i),\Phi(x_k)\rangle.$$
  • Predictions use $\langle\Phi(x), \Phi(x_i) \rangle, 1 \leq i \leq n$.

**Define a kernel:**

  • Prove that a function $R(\cdot,\cdot)$ is positive definite. For example: $$R(x_i,x_k) = \left( 1+ \langle x_i,x_k\rangle\right)^2.$$
  • For each pair of samples, compute: $$\mathbf{K}_{ik} = R(x_i,x_k).$$
  • Predictions use $R(x, x_i), 1 \leq i \leq n$.
  • Often much easier!

Support vector machines#

The polynomial kernel with \(d=2\):#

\[\mathbf{K}_{ik} = R(x_i,x_k) = \left( 1+ \langle x_i,x_k\rangle\right)^2\]

This is equivalent to the expansion: $$ \begin{align*} \Phi(x) =& (1, \sqrt{2}x_1,\dots,\sqrt{2}x_p, \\ & x_1^2,\dots,x_p^2, \\ & \sqrt{2}x_1x_2,\sqrt{2}x_1x_3, \dots, \sqrt{2}x_{p-1}x_p). \end{align*} $$

  • For a single $x$, computing $K(x,x_k)$ directly is $O(p)$.
  • For a single $x$ computing the kernel using the expansion is $O(p^2)$.
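
A quick numerical check of this equivalence (a sketch only; note that the constant feature 1 is included in \(\Phi\) so that the expansion reproduces \((1+\langle x, x_k\rangle)^2\) exactly):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit d=2 polynomial feature map: O(p^2) features."""
    cross = [np.sqrt(2) * x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, xk = rng.normal(size=5), rng.normal(size=5)

direct = (1 + x @ xk) ** 2           # O(p) kernel evaluation
expanded = phi(x) @ phi(xk)          # O(p^2) via the explicit expansion
print(np.isclose(direct, expanded))  # True
```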

Support vector machines#

How are kernels defined?#

  • Proving that a function $R(\cdot,\cdot)$ is positive definite (PD) is not always easy.
  • However, we can easily define PD kernels by combining those we are familiar with:
    • Sums and products of PD kernels are PD.
  • Intuitively, a kernel $K(x_i,x_k)$ defines a *similarity* between the samples $x_i$ and $x_k$.
  • This intuition can guide our choice in different problems.
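
A small numerical illustration of the closure property above, assuming scikit-learn's pairwise kernel helpers (an eigenvalue check on made-up data, not a proof):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

K1 = rbf_kernel(X, gamma=0.5)                  # a PD kernel matrix
K2 = polynomial_kernel(X, degree=2, coef0=1)   # another PD kernel matrix

for K in (K1 + K2, K1 * K2):                   # sum and elementwise product
    eigvals = np.linalg.eigvalsh(K)
    print("min eigenvalue:", eigvals.min())    # >= 0 up to numerical error
```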

Support vector machines#

Common kernels#

  • The polynomial kernel: $$K(x_i,x_k) = \left( 1+ \langle x_i,x_k\rangle\right)^d$$
  • The radial basis kernel: $$K(x_i,x_k) = \exp\big(-\gamma\underbrace{\sum_{j=1}^p (x_{ij}-x_{kj})^2}_\text{squared Euclidean $d(x_i,x_k)^2$}\big)$$
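
Both kernels are available in standard SVM software; a sketch assuming scikit-learn, whose polynomial kernel is \((\gamma \langle x_i,x_k\rangle + c_0)^d\), so \(\gamma=1, c_0=1\) matches the formula above. The circular toy data are made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, 1, -1)   # circular true boundary

poly_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
rbf_svm  = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("poly accuracy:", poly_svm.score(X, y))
print("rbf  accuracy:", rbf_svm.score(X, y))
```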

Support vector machines#

Summary so far#

  • As noted above **support vector machine** is a support vector classifier applied on an expanded set of predictors, e.g. $$\Phi: (X_1,X_2) \to (X_1,X_2,X_1X_2, X_1^2,X_2^2).$$
  • We expand the vector of predictors for each sample $x_i$ and then perform the algorithm.
  • We only need to know the dot products: $$ \langle \Phi(x_i),\Phi(x_k) \rangle \equiv K(x_i,x_k)$$ for every pair of samples $(x_i,x_k)$ as well as how to compute $\langle \Phi(x), \Phi(x_i) \rangle$ for new $x$'s.
  • Often, the dot product $$\langle \Phi(x_i),\Phi(x_k) \rangle \equiv K(x_i, x_k) $$ is a simple function of the original vectors $x_i, x_k$, even if the mapping $\Phi$ significantly expands the space of predictors.

Support vector machines#

Summary so far#

  • **Example 1:** Polynomial kernel
    • $$K(x_i,x_k) = (1+ \langle x_i,x_k \rangle)^2.$$
    • With two predictors, this corresponds to the mapping: $$\Phi: (X_1,X_2) \to (1, \sqrt{2}X_1,\sqrt{2}X_2,\sqrt{2}X_1X_2, X_1^2,X_2^2).$$
  • **Example 2:** RBF kernel $$K(x_i,x_k) = \exp(-\gamma d( x_i,x_k )^2),$$ where $d$ is the Euclidean distance between $x_i$ and $x_k$.
    • In this case, the mapping $\Phi$ is an expansion into an infinite number of transformations!
    • We can apply the method even if we don't know what these transformations explicitly are.

Support vector machines#

Kernels for non-standard data types#

  • We can define families of kernels (with tuning parameters), which capture similarity between non-standard kinds of data:
    1. Text, strings
    2. Images
    3. Graphs
    4. Histograms
  • Sometimes we know the mapping $\Phi$, but there are algorithms that are fast for computing $K(x_i,x_k)$ without doing the expansion explicitly.
  • Other times, the expansion $\Phi$ is infinite-dimensional or simply not known.

Support vector machines#

Kernels for non-standard data types#

Example: strings#

Suppose we want to compare two strings in a finite alphabet:

\[x_1 = ACCTATGCCATA\]
\[x_2 = AGCTAAGCATAC\]
  • **Stringdot kernel:** For each word $u$ of length $p$, we define a feature: $$\Phi_u(x_i) = \text{# of times that }u\text{ appears in }x_i$$
  • A naive algorithm would require looping over each sequence, for every subsequence $u$ of length $p$.
  • This would be $O(n^2)$ steps, where $n$ is the length of the sequences.
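
A naive sketch of this counting feature map and the resulting kernel value for the two example strings (illustrative only, not an optimized implementation):

```python
from collections import Counter

def spectrum_features(s, p):
    """Phi_u(s) = number of times each word u of length p appears in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s1, s2, p):
    f1, f2 = spectrum_features(s1, p), spectrum_features(s2, p)
    return sum(f1[u] * f2[u] for u in f1)   # dot product over shared words

x1 = "ACCTATGCCATA"
x2 = "AGCTAAGCATAC"
print(spectrum_kernel(x1, x2, p=3))
```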

Support vector machines#

Kernels for non-standard data types#

Example: strings#

Suppose we want to compare two strings in a finite alphabet:

\[x_1 = ACCTATGCCATA\]
\[x_2 = AGCTAAGCATAC\]

  • **Gap weight kernel:** For each word $u$ of length $p$, we define a feature: $$\Phi_u(x_i) = \sum_{v: \text{a subsequence of $x_i$ containing $u$}} \lambda^{\text{length}(v)} $$ with $0<\lambda\leq 1$.
  • The number of features can be huge! However, this can be computed in $\mathcal O(Mp\log n )$ steps where $M$ is the number of matches.

Support vector machines#

Applying SVMs with more than 2 classes#

  • SVMs don't generalize nicely to the case of more than 2 classes.
  • Two main approaches:
    1. **One vs. one:** Construct $\binom{K}{2}$ SVMs, one for every pair of classes (where $K$ is the number of classes). Apply all SVMs to a test observation and classify to the class that wins the most one-on-one challenges.
    2. **One vs. all:** For each class $k$, construct an SVM $\beta^{(k)}$ coding class $k$ as $1$ and all other classes as $-1$. Assign a test observation to the class $k^*$, such that the distance from $x_i$ to the hyperplane defined by $\beta^{(k^*)}$ is largest (the distance is negative if the sample is misclassified).
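
A sketch of both strategies assuming scikit-learn (its `SVC` fits one-vs-one binary SVMs internally, and `OneVsRestClassifier` implements one-vs-all); the iris data are used only as a convenient 3-class example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # 3 classes

ovo = SVC(kernel="linear", C=1.0)                        # one-vs-one: fits C(3,2)=3 binary SVMs
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))   # one-vs-all: 3 binary SVMs

print("one-vs-one accuracy:", ovo.fit(X, y).score(X, y))
print("one-vs-all accuracy:", ovr.fit(X, y).score(X, y))
```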

Support vector machines#

Relationship of SVM to logistic regression#

Recall the Lagrange form of the problem.

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i\\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

Support vector machines#

Relationship of SVM to logistic regression#

  • Set $D=1/\lambda$ and minimize over $\epsilon_i$ explicitly.
  • If $1 - y_i(\beta_0 + w \cdot x_i) \leq 0$ we can take $\hat{\epsilon}_i=0$. Otherwise, we take $$\hat{\epsilon}_i=1 - y_i(\beta_0 + w \cdot x_i).$$
  • Or, $$\hat{\epsilon}_i = \max(1 - y_i(\beta_0+w \cdot x_i), 0).$$

Support vector machines#

Relationship of SVM to logistic regression#

  • Plugging this into the objective (and replacing $w$ with $\beta$) yields $$ \begin{align*} &\min_{\beta_0,\beta}\;\; \sum_{i=1}^n \max\Big(1 - y_i\big(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}\big),\, 0\Big) + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2. \end{align*} $$
  • This loss is a function of $y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij})$, plus a ridge penalty.
  • The loss for logistic regression is also a function of $y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij})$.
  • Large $\lambda$ $\iff$ small $D$ $\iff$ large $C$.
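
A minimal sketch computing this hinge-plus-ridge objective next to the analogous logistic-loss objective (made-up data and coefficients; both losses depend on the data only through the margins \(y_i(\beta_0 + x_i\cdot\beta)\)):

```python
import numpy as np

def svm_objective(beta0, beta, X, y, lam):
    """Hinge loss + ridge penalty, as in the formula above."""
    margins = y * (beta0 + X @ beta)              # m_i = y_i (beta0 + x_i . beta)
    hinge = np.maximum(1 - margins, 0).sum()
    return hinge + 0.5 * lam * np.sum(beta ** 2)

def logistic_objective(beta0, beta, X, y, lam):
    """Logistic loss + the same ridge penalty, for comparison."""
    margins = y * (beta0 + X @ beta)
    return np.log1p(np.exp(-margins)).sum() + 0.5 * lam * np.sum(beta ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50) > 0, 1, -1)

beta0, beta = 0.1, np.array([0.8, -1.5, 0.3])
print(svm_objective(beta0, beta, X, y, lam=1.0))
print(logistic_objective(beta0, beta, X, y, lam=1.0))
```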

Support vector machines#

Relationship of SVM to logistic regression#

Comparing the losses#

Support vector machines#

The kernel trick can be applied beyond SVMs#

Kernels and dot products:

  • Associated to a positive definite $R$ is a dot product. For $x$ in the feature space $\mathbb{R}^p$, define $R_x:\mathbb{R}^p \rightarrow \mathbb{R}$ by $$ R_x(x_0) = R(x,x_0) $$
  • The kernel defines a dot product on linear combinations of the $R_x$'s for different $x$'s: $$\left \langle \sum_j c_j R_{x_j}, \sum_i d_i R_{y_i} \right \rangle_R = \sum_{i,j} c_j d_i R(x_j,y_i) $$ and hence a length $$ \|\sum_j c_j R_{x_j}\|^2_R = \sum_{i,j} c_i c_j R(x_i, x_j) $$

Support vector machines#

The kernel trick can be applied beyond SVMs#

Kernel regression:

  • For tuning parameter $\lambda$ define $$ \hat{f}_{\lambda} = \text{argmin}_f \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|^2_K $$
  • Remarkably, it is known that $$ \hat{f} = \sum_{i=1}^n \hat{\alpha}_i K_{X_i}. $$ for some $\hat{\alpha}$.

Support vector machines#

The kernel trick can be applied beyond SVMs#

Kernel regression:

  • Problem reduces to finding $$ \hat{\alpha} = \text{argmin}_{\alpha} \sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j K(X_i, X_j)\Big)^2 + \lambda \sum_{l, r=1}^n \alpha_l \alpha_r K(X_l, X_r) $$
  • Finding $\hat{\alpha}$ is just like ridge regression!
  • Similar phenomenon for logistic regression...
  • Just like smoothing splines, we solved a problem over a big space of functions!
  • Smoothing splines are a special case of the kernel trick...
  • Like support vector machines, optimization is over an $n$-dimensional vector, not a $p$-dimensional vector as in linear regression...
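
A sketch of this kernel regression assuming scikit-learn's `KernelRidge` (its `alpha` plays the role of \(\lambda\) above; the sine-curve data are made up):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
Y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=100)

# f_hat(x) = sum_i alpha_hat_i K(x, X_i); fitting only needs the n x n kernel matrix
kr = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.5).fit(X, Y)

X_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
print(kr.predict(X_grid))
print("dual coefficients:", kr.dual_coef_.shape)   # one alpha_hat per training sample
```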

Support vector machines#

The kernel trick can be applied beyond SVMs#

Kernel PCA:

  • Suppose we want to do PCA with an expanded set of predictors, defined by the mapping $\Phi$.
  • First principal component is $$ \hat{f}_1 = \text{argmax}_{f: \|f\|_K \leq 1} \hat{\text{Var}}(f(X)). $$
  • Even if $\Phi$ expands the predictors to a very high dimensional space, we can do PCA!
  • The cost only depends on the number of observations $n$.
  • **Long story short:** Kernel trick allows transformation of linear methods into nonlinear ones by *implicit* featurization.
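
A closing sketch assuming scikit-learn's `KernelPCA`: although the implicit RBF feature map is infinite-dimensional, the computation only involves the \(n \times n\) kernel matrix (the concentric-ring data are made up):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
# Two concentric rings: linear PCA cannot separate them, kernel PCA can
theta = rng.uniform(0, 2 * np.pi, size=200)
radius = np.repeat([1.0, 3.0], 100) + rng.normal(scale=0.1, size=200)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0).fit(X)
scores = kpca.transform(X)
print(scores.shape)        # (200, 2): the first two kernel principal components
```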