Maximal margin classifier#

Fig 9.2
  • Suppose we have a classification problem with response \(Y=-1\) or \(Y=1\).

  • If the classes can be separated by a hyperplane, there will typically be infinitely many hyperplanes that separate them.


  • Draw the largest possible empty margin around the hyperplane.

  • Out of all possible hyperplanes that separate the 2 classes, choose the one with the widest margin.

Fig 9.3

Optimization problem#

  • This can be written as an optimization problem:

\[\begin{split} \begin{aligned} &\max_{\beta_0,\beta_1,\dots,\beta_p}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip}) \geq M \quad \text{ for all }i=1,\dots,n. \end{aligned} \end{split}\]
  • The quantity \(y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})\) is the distance of \(x_i\) from the hyperplane (positive when \(x_i\) is on the correct side), because \(\sum_{j=1}^p \beta_j^2 = 1\).

  • \(M\) is simply the width of the margin in either direction.

  • Ultimately, we will see that this problem can be transformed into something similar to logistic regression…
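As a quick numerical illustration (not part of the derivation), the sketch below fits a linear support vector classifier with a very large cost parameter \(C\) to simulated separable data; on separable data this approximates the hard-margin (maximal margin) classifier. scikit-learn and the toy data are assumptions here, not part of the original setup.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

rng = np.random.default_rng(0)
# toy linearly separable data: two Gaussian clouds, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# a very large C makes the soft-margin SVC behave like the hard-margin
# (maximal margin) classifier on separable data
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w_hat = clf.coef_.ravel()          # normal vector of the fitted hyperplane
M_hat = 1 / np.linalg.norm(w_hat)  # margin width (justified in the reformulation below)
print("margin M:", M_hat)
```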


Reformulation#

  • We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\):

\[\begin{split} \begin{align*} &\min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 \\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq 1 \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • This is a quadratic optimization problem. Having found \((\hat{\beta}_0, \hat{w})\) we can recover \(\hat{\beta}=\hat{w}/\|\hat{w}\|_2, M=1/\|\hat{w}\|_2\).
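Since this is a standard quadratic program, it can be handed to a generic convex solver. The following sketch solves the reformulated problem directly and recovers \(\hat\beta\) and \(M\) as described above; cvxpy and the toy data are assumptions, and any QP solver would do.

```python
import numpy as np
import cvxpy as cp  # assumed available

rng = np.random.default_rng(0)
# toy separable data: two Gaussian clouds, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(X.shape[1])
beta0 = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (beta_0 + w . x_i) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + beta0) >= 1])
problem.solve()

beta_hat = w.value / np.linalg.norm(w.value)  # unit-norm hyperplane coefficients
M_hat = 1 / np.linalg.norm(w.value)           # width of the margin
print(beta_hat, M_hat)
```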


Introducing Karush-Kuhn-Tucker (KKT) multipliers, \(\alpha_1,\dots,\alpha_n\), this is equivalent to:

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to } \alpha_i \geq 0. \\ \end{align*}\\[3mm] \end{split}\]

  • Setting the partial derivatives with respect to \(w\) and \(\beta_0\) to 0, we get: \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n\alpha_iy_i = 0\).

  • Furthermore, the complementary slackness (KKT) condition \(\alpha_i\,[y_i(\beta_0 +w\cdot x_i)-1]=0\) implies that \(\alpha_i>0\) only if \(y_i(\beta_0 +w\cdot x_i)=1\), that is, only if \(x_i\) falls on the margin (a numerical check follows below).
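The check below is a sketch under the same assumptions as before (scikit-learn, toy data, very large \(C\) as a stand-in for the hard margin): the observations the solver reports as having \(\alpha_i>0\) are, up to numerical tolerance, exactly those with \(y_i(\hat\beta_0+\hat w\cdot x_i)\approx 1\).

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

rng = np.random.default_rng(1)
# toy separable data, labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e10).fit(X, y)   # large C ~ hard margin

f = clf.decision_function(X)                   # beta_0 + w . x_i for each i
on_margin = np.isclose(y * f, 1.0, atol=1e-3)  # is the constraint active?

# indices with alpha_i > 0 (sklearn's support set) should match,
# up to numerical tolerance, the indices of points lying on the margin
print(np.sort(clf.support_))
print(np.where(on_margin)[0])
```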


Support vectors#

The vectors \(x_i\) that fall on the margin, and therefore determine the solution, are called support vectors.

Fig 9.3

Summary#

\[\begin{split} \begin{align*} &\max_\alpha\; \min_{\beta_0,w}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1]\\ &\text{subject to }\;\; \alpha_i \geq 0. \end{align*} \end{split}\]
  • The inner minimization gives \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i\) and \(\sum_{i=1}^n \alpha_iy_i=0\), so we can plug these in above to obtain the dual problem:

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; \alpha_i \geq 0, \quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]
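To see where this comes from, substitute \(\hat w = \sum_{i=1}^n \alpha_i y_i x_i\) into the Lagrangian; the \(\beta_0\) term drops out because \(\sum_{i=1}^n \alpha_i y_i = 0\):

\[\begin{split} \begin{align*} &\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^n \alpha_i\Big[y_i\Big(\beta_0 + \sum_{i'=1}^n \alpha_{i'} y_{i'} (x_{i'}\cdot x_i)\Big)-1\Big]\\ &\qquad= \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'}\,(x_i\cdot x_{i'}). \end{align*} \end{split}\]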

  • We’ve reduced the problem of finding \(w\), which describes the hyperplane and the size of the margin, to finding the coefficients \(\alpha_1,\dots,\alpha_n\) through the dual problem above.

  • The dual depends on the training inputs only through the inner products \(x_i\cdot x_{i'}\) for every pair \(i,i'\).

  • Whatever \(p\) is, this is an optimization problem in the \(n\) variables \(\alpha_1,\dots,\alpha_n\), which is useful when \(p \gg n\) (see the sketch below).
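As a closing sketch (cvxpy and the toy data are again assumptions), the dual can be solved directly in the \(n\) variables \(\alpha_1,\dots,\alpha_n\), after which \(\hat w=\sum_i \alpha_i y_i x_i\) and \(\hat\beta_0\) are recovered from the support vectors.

```python
import numpy as np
import cvxpy as cp  # assumed available

rng = np.random.default_rng(0)
# toy separable data (n = 40, p = 2), labels y in {-1, +1}
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

alpha = cp.Variable(len(y))
# the double sum  sum_{i,i'} alpha_i alpha_{i'} y_i y_{i'} (x_i . x_{i'})
# equals ||sum_i alpha_i y_i x_i||^2; it depends on X only through the
# inner products, but is written as a sum of squares to keep the problem convex
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w_hat = X.T @ (a * y)                     # w-hat = sum_i alpha_i y_i x_i
sv = np.where(a > 1e-6)[0]                # support vectors: alpha_i > 0
beta0_hat = y[sv[0]] - w_hat @ X[sv[0]]   # from y_s (beta_0 + w . x_s) = 1
print("support vectors:", sv)
print("margin M:", 1 / np.linalg.norm(w_hat))
```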