Maximal margin classifier#

Fig 9.2
  • Suppose we have a classification problem with response \(Y=-1\) or \(Y=1\).

  • If the classes can be separated by a hyperplane, there will typically be infinitely many hyperplanes that separate them.


  • Draw the largest possible empty margin around the hyperplane.

  • Out of all possible hyperplanes that separate the 2 classes, choose the one with the widest margin.

Fig 9.3

Optimization problem#

  • This can be written as an optimization problem:

\[\begin{split} \begin{aligned} &\max_{\beta_0,\beta_1,\dots,\beta_p}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip}) \geq M \quad \text{ for all }i=1,\dots,n. \end{aligned} \end{split}\]
  • The quantity \(y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})\) is the distance of \(x_i\) from the hyperplane (positive when \(x_i\) is on the correct side), because \(\sum_{j=1}^p \beta_j^2 = 1\).

  • \(M\) is simply the width of the margin in either direction.

  • Ultimately, we will see that this problem can be transformed into something similar to logistic regression…
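As a quick numerical illustration (not part of the derivation), the sketch below fits a linear support vector classifier with a very large cost parameter \(C\) to simulated separable data; on separable data this approximates the hard-margin (maximal margin) classifier. scikit-learn and the toy data are assumptions here, not part of the original setup.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

rng = np.random.default_rng(0)
# toy linearly separable data: two Gaussian clouds, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# a very large C makes the soft-margin SVC behave like the hard-margin
# (maximal margin) classifier on separable data
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w_hat = clf.coef_.ravel()          # normal vector of the fitted hyperplane
M_hat = 1 / np.linalg.norm(w_hat)  # margin width (justified in the reformulation below)
print("margin M:", M_hat)
```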


Reformulation#

  • We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\):

\[\begin{split} \begin{align*} &\min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 \\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq 1 \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • This is a quadratic optimization problem. Having found \((\hat{\beta}_0, \hat{w})\) we can recover \(\hat{\beta}=\hat{w}/\|\hat{w}\|_2, M=1/\|\hat{w}\|_2\).
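Since this is a standard quadratic program, it can be handed to a generic convex solver. The following sketch solves the reformulated problem directly and recovers \(\hat\beta\) and \(M\) as described above; cvxpy and the toy data are assumptions, and any QP solver would do.

```python
import numpy as np
import cvxpy as cp  # assumed available

rng = np.random.default_rng(0)
# toy separable data: two Gaussian clouds, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(X.shape[1])
beta0 = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (beta_0 + w . x_i) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + beta0) >= 1])
problem.solve()

beta_hat = w.value / np.linalg.norm(w.value)  # unit-norm hyperplane coefficients
M_hat = 1 / np.linalg.norm(w.value)           # width of the margin
print(beta_hat, M_hat)
```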


Introducing Karush-Kuhn-Tucker (KKT) multipliers, \(\alpha_1,\dots,\alpha_n\), this is equivalent to:

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to } \alpha_i \geq 0. \\ \end{align*}\\[3mm] \end{split}\]

  • Setting the partial derivatives with respect to \(w\) and \(\beta_0\) to 0, we get: \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n\alpha_iy_i = 0\).

  • Furthermore, the complementary slackness (KKT) condition \(\alpha_i\,[y_i(\beta_0 +w\cdot x_i)-1]=0\) implies that \(\alpha_i>0\) only if \(y_i(\beta_0 +w\cdot x_i)=1\), that is, only if \(x_i\) falls on the margin (a numerical check follows below).
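The check below is a sketch under the same assumptions as before (scikit-learn, toy data, very large \(C\) as a stand-in for the hard margin): the observations the solver reports as having \(\alpha_i>0\) are, up to numerical tolerance, exactly those with \(y_i(\hat\beta_0+\hat w\cdot x_i)\approx 1\).

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

rng = np.random.default_rng(1)
# toy separable data, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e10).fit(X, y)   # large C ~ hard margin

f = clf.decision_function(X)                   # beta_0 + w . x_i for each i
on_margin = np.isclose(y * f, 1.0, atol=1e-3)  # is the constraint active?

# indices with alpha_i > 0 (sklearn's support set) should match,
# up to numerical tolerance, the indices of points lying on the margin
print(np.sort(clf.support_))
print(np.where(on_margin)[0])
```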


Support vectors#

The vectors \(x_i\) that fall on the margin, and therefore determine the solution, are called support vectors.

Fig 9.3

Summary#

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to }\;\; \alpha_i \geq 0. \end{align*} \end{split}\]
  • The inner minimization gives \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i\) and \(\sum_{i=1}^n \alpha_iy_i=0\), so we can plug these in above to obtain the dual problem:

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; \alpha_i \geq 0, \quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]
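To see where this comes from, substitute \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i\) into the Lagrangian; the \(\beta_0\) term drops out because \(\sum_{i=1}^n \alpha_i y_i = 0\):

\[\begin{split} \begin{align*} &\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^n \alpha_i\Big[y_i\Big(\beta_0 + \sum_{i'=1}^n \alpha_{i'} y_{i'} (x_{i'}\cdot x_i)\Big)-1\Big]\\ &\qquad= \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'}\,(x_i\cdot x_{i'}). \end{align*} \end{split}\]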

  • We’ve reduced the problem of finding \(w\), which describes the hyperplane and the size of the margin, to finding the coefficients \(\alpha_1,\dots,\alpha_n\) through the dual problem above.

  • The dual depends on the training inputs only through the inner products \(x_i\cdot x_{i'}\) for every pair \(i,i'\).

  • Whatever \(p\) is, this is an optimization problem in the \(n\) variables \(\alpha_1,\dots,\alpha_n\), which is useful when \(p \gg n\) (see the sketch below).
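As a closing sketch (cvxpy and the toy data are again assumptions), the dual can be solved directly in the \(n\) variables \(\alpha_1,\dots,\alpha_n\), after which \(\hat w=\sum_i \alpha_i y_i x_i\) and \(\hat\beta_0\) are recovered from the support vectors.

```python
import numpy as np
import cvxpy as cp  # assumed available

rng = np.random.default_rng(0)
# toy separable data (n = 40, p = 2), labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

alpha = cp.Variable(len(y))
# the double sum  sum_{i,i'} alpha_i alpha_{i'} y_i y_{i'} (x_i . x_{i'})
# equals ||sum_i alpha_i y_i x_i||^2; it depends on X only through the
# inner products, but is written as a sum of squares to keep the problem convex
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w_hat = X.T @ (a * y)                     # w-hat = sum_i alpha_i y_i x_i
sv = np.where(a > 1e-6)[0]                # support vectors: alpha_i > 0
beta0_hat = y[sv[0]] - w_hat @ X[sv[0]]   # from y_s (beta_0 + w . x_s) = 1
print("support vectors:", sv)
print("margin M:", 1 / np.linalg.norm(w_hat))
```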