Supervised learning with a qualitative or categorical response.

Basic approach#

Classification is just as common as regression, if not more common:

  1. Medical diagnosis: Given a patient's symptoms, predict which of 3 possible conditions the patient has.

  2. Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client’s history, etc.

  3. Web searching: Based on a user’s history, location, and the string of a web search, predict which link a person is likely to click.

  4. Online advertising: Predict whether a user will click on an ad or not.

Review: Bayes classifier#

  • Suppose \(P(Y\mid X)\) is known. Then, given an input \(x_0\), we predict the response

\[\hat y_0 = \text{argmax}_{\;y}\; P(Y=y \mid X=x_0).\]
  • The Bayes classifier minimizes the expected 0-1 loss:

\[ E\left[ \frac{1}{m} \sum_{i=1}^m \mathbf{1}(\hat y_i \neq y_i) \right]\]
  • This minimum 0-1 loss (the best we can hope for) is the Bayes error rate.
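As a quick illustration (a made-up one-dimensional example, not from the text): if \(P(Y=1\mid X=x)\) is known exactly, the Bayes classifier simply predicts the more probable class, and its error rate can be approximated by simulation:

# Hypothetical example: X ~ Uniform(0,1) and P(Y=1 | X=x) = x
set.seed(1)
n = 100000
x = runif(n)
y = rbinom(n, 1, x)            # responses drawn from the true model
yhat = ifelse(x > 0.5, 1, 0)   # Bayes classifier: predict the more likely class
mean(yhat != y)                # approximates the Bayes error rate (1/4 here)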

Basic strategy: estimate \(P(Y\mid X)\)#

  • If we have a good estimate of the conditional probability \(\hat P(Y\mid X)\), we can use the classifier:
    \[\hat y_0 = \text{argmax}_{\;y}\; \hat P(Y=y \mid X=x_0).\]
  • Suppose \(Y\) is a binary variable. Could we use a linear model?
    \[P(Y=1 \mid X) = \beta_0 + \beta_1X_1 + \dots+ \beta_pX_p \]
  • Problems:
    • This would allow probabilities \(<0\) and \(>1\).
    • Difficult to extend to more than 2 categories.

Logistic regression#

  • We model the conditional probabilities as:

\[ \begin{aligned} P(Y=1 \mid X) &= \frac{e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}} \\ P(Y=0 \mid X) &= \frac{1}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}. \end{aligned} \]

This is the same as using a linear model for the log odds:

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p.\]
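To see how this keeps probabilities in \([0,1]\), here is a minimal sketch with made-up coefficients (\(\beta_0=-1\), \(\beta_1=2\)), using base R's plogis, which computes the logistic function:

beta0 = -1; beta1 = 2                # hypothetical coefficients
x = seq(-5, 5, by = 0.1)
p = plogis(beta0 + beta1 * x)        # e^(b0 + b1 x) / (1 + e^(b0 + b1 x))
range(p)                             # strictly between 0 and 1
plot(x, p, type = "l", ylab = "P(Y = 1 | X = x)")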

Fitting logistic regression#

  • The training data is a list of pairs \((y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)\).

  • We don’t observe the left-hand side of the model

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p,\]
  • \(\implies\) We cannot use a least squares fit.

Likelihood#

  • Solution: The likelihood is the probability of the training data, for a fixed set of coefficients \(\beta_0,\dots,\beta_p\):

\[\prod_{i=1}^n P(Y=y_i \mid X=x_i) \]
  • We can rewrite this as:

\[ \underbrace{\prod_{i; y_i=1}\frac{e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}{1+e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}}_\text{Probability of $Y$ = 1 given $X$'s} \underbrace{\prod_{j; y_j=0}\frac{1}{1+e^{\beta_0 + \beta_1 x_{j1} +\dots+\beta_p x_{jp}}}}_\text{Probability of $Y$ = 0 given $X$'s} \]
  • Choose estimates \(\hat \beta_0, \dots,\hat \beta_p\) which maximize the likelihood.

  • Solved with numerical methods (e.g. Newton’s algorithm).
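As a rough sketch of what such a fit does (using the general-purpose optimizer optim rather than the Newton/IRLS updates glm actually uses), we can minimize the negative log-likelihood directly on simulated data:

set.seed(1)
n = 500
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + 1.5 * x))    # true coefficients: -0.5 and 1.5

# negative log-likelihood of the logistic model (one predictor)
negloglik = function(beta) {
  eta = beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
optim(c(0, 0), negloglik)$par   # close to coef(glm(y ~ x, family=binomial))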

Logistic regression in R#

library(ISLR)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              family=binomial, data=Smarket)
summary(glm.fit)
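To turn the fit into class predictions (a standard follow-up; the threshold 0.5 is our choice, not part of the model):

glm.probs = predict(glm.fit, type = "response")   # P(Direction = "Up" | X)
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")  # classify at threshold 0.5
mean(glm.pred == Smarket$Direction)               # training accuracy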

Inference for logistic regression#

  1. We can estimate the standard error of each coefficient.

  2. The \(z\)-statistic is the equivalent of the \(t\)-statistic in linear regression:

\[z = \frac{\hat \beta_j}{\text{SE}(\hat\beta_j)}.\]
  3. The \(p\)-values are tests of the null hypothesis \(\beta_j=0\) (Wald test).

  4. Other possible hypothesis tests: likelihood ratio test (chi-square distribution).
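Both can be carried out in R; a minimal sketch for the glm.fit above (coef(summary()) gives the Wald table, and anova() with test="Chisq" performs the likelihood ratio test against a smaller nested model):

coef(summary(glm.fit))   # Estimate, Std. Error, z value, Pr(>|z|)

# likelihood ratio test against the intercept-only model
glm.null = glm(Direction ~ 1, family = binomial, data = Smarket)
anova(glm.null, glm.fit, test = "Chisq")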

Example: Predicting credit card default#

Predictors:

  • student: 1 if student, 0 otherwise

  • balance: credit card balance

  • income: person’s income.

Confounding#

In this dataset, there is confounding, but little collinearity.

  • Students tend to have higher balances, so student partially explains balance (but not very well).

  • People with a high balance are more likely to default.

  • Among people with a given balance, students are less likely to default.

Results: predicting credit card default#

Using only balance#

library(ISLR) # where Default data is stored
summary(glm(default ~ balance,
        family=binomial, data=Default))

Using only student#

summary(glm(default ~ student,
        family=binomial, data=Default))

Using both balance and student#

summary(glm(default ~ balance + student,
        family=binomial, data=Default))

Using all 3 predictors#

summary(glm(default ~ balance + income + student,
        family=binomial, data=Default))

Multinomial logistic regression#

  • Extension of logistic regression to more than 2 categories

  • Suppose \(Y\) takes values in \(\{1,2,\dots,K\}\), then we can use a linear model for the log odds against a baseline category (e.g. 1): for \(j \neq 1\)

\[\log\left[\frac{P(Y=j \mid X)}{P(Y=1 \mid X)}\right] = \beta_{0,j} + \beta_{1,j} X_1 +\dots+\beta_{p,j} X_p\]
  • In this case \(\beta \in \mathbb{R}^{(p+1) \times (K-1)}\) is a matrix of coefficients (one column per non-baseline category, including the intercepts).
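In R, one common way to fit this model is multinom from the nnet package (assumed installed); a minimal sketch on the built-in iris data:

library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
coef(fit)   # one row of coefficients per non-baseline class (setosa is the baseline)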

Some potential problems#

  • The coefficients become unstable when there is collinearity. Furthermore, this affects the convergence of the fitting algorithm.

  • When the classes are well separated, the coefficients become unstable. This is always the case when \(p\geq n-1\). In this case, prediction error is low, but \(\hat{\beta}\) is very variable.

Linear Discriminant Analysis (LDA)#

  • **Strategy:** Instead of estimating \(P(Y\mid X)\) directly, we could estimate:
    1. \(\hat P(X \mid Y)\): Given the response, what is the distribution of the inputs?
    2. \(\hat P(Y)\): How likely each of the categories is.
  • Then, we use Bayes' rule to obtain the estimate:
    \[ \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned}\]
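For instance, with one predictor, two classes, and known normal class-conditional densities (all numbers made up), Bayes' rule gives the posterior directly:

# hypothetical priors and class-conditional densities
pi1 = 0.7; pi0 = 0.3                          # P(Y=1), P(Y=0)
f1 = function(x) dnorm(x, mean = 2, sd = 1)   # density of X given Y = 1
f0 = function(x) dnorm(x, mean = 0, sd = 1)   # density of X given Y = 0

x0 = 1
f1(x0) * pi1 / (f1(x0) * pi1 + f0(x0) * pi0)  # P(Y = 1 | X = x0)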

LDA: multivariate normal with equal covariance#

  • LDA is the special case of the above strategy when \(P(X \mid Y=k) = N(\mu_k, \mathbf\Sigma)\).

  • That is, within each class the features have multivariate normal distribution with center depending on the class and common covariance \(\mathbf\Sigma\).

  • The probabilities \(P(Y=k)\) are estimated by the fraction of training samples of class \(k\).

Decision boundaries#

LDA has (piecewise) linear decision boundaries#

Suppose that:

  1. We know \(P(Y=k) = \pi_k\) exactly.

  2. \(P(X=x | Y=k)\) is multivariate normal with density:

\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  3. Above: \(\mu_k\) is the mean of the inputs for category \(k\), and \(\mathbf\Sigma\) is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?

LDA has linear decision boundaries#

  • By Bayes rule, the probability of category \(k\), given the input \(x\) is:

\[P(Y=k \mid X=x) = \frac{f_k(x) \pi_k}{P(X=x)}\]
  • The denominator does not depend on the response \(k\), so we can write it as a constant:

\[P(Y=k \mid X=x) = C \times f_k(x) \pi_k\]
  • Now, expanding \(f_k(x)\):

\[P(Y=k \mid X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Let’s absorb everything that does not depend on \(k\) into a constant \(C'\):

\[P(Y=k \mid X=x) = C'\pi_k e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Take the logarithm of both sides:
    \[\log P(Y=k \mid X=x) = \color{Red}{\log C'} + \color{blue}{\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]

  • The term \(\log C'\) is the same for every category \(k\).

  • We want to find the maximum of this expression over \(k\).

LDA has linear decision boundaries#

  • Goal is to maximize the following over \(k\):
    \[ \begin{aligned} &\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)\\ =&\log \pi_k - \frac{1}{2}\left[x^T \mathbf{\Sigma}^{-1}x + \mu_k^T \mathbf{\Sigma}^{-1}\mu_k\right] + x^T \mathbf{\Sigma}^{-1}\mu_k \\ =& C'' + \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k \end{aligned} \]
    (Here \(C''\) absorbs the term \(-\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x\), which does not depend on \(k\).)

  • We define the objectives (called discriminant functions):
    \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k\]
    At an input \(x\), we predict the response with the highest \(\delta_k(x)\).

LDA has linear decision boundaries#

  • What is the decision boundary? It is the set of points \(x\) at which 2 classes do just as well (i.e. the discriminant functions of the two classes agree at \(x\)):
    \[ \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell \end{aligned} \]

  • This is a linear equation in \(x\).

Decision boundaries revisited#

Estimating \(\pi_k\)#

\[\hat \pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]

In English: the fraction of training samples of class \(k\).

Estimating the parameters of \(f_k(x)\)#

  • Estimate the center of each class, \(\mu_k\):
    \[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]
  • Estimate the common covariance matrix \(\mathbf{\Sigma}\):
    • One predictor (\(p=1\)):
      \[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\]
    • Many predictors (\(p>1\)): Compute the vectors of deviations \((x_1 -\hat \mu_{y_1}),(x_2 -\hat \mu_{y_2}),\dots,(x_n -\hat \mu_{y_n})\) and use their (unbiased) sample covariance matrix as the estimate \(\hat{\mathbf\Sigma}\).
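As a sketch, these estimates are easy to compute by hand, e.g. on the Default data with the single predictor balance (the \(p=1\) case):

library(ISLR)                           # for the Default data
y = Default$default
x = Default$balance

pi.hat = table(y) / length(y)           # estimated class probabilities
mu.hat = tapply(x, y, mean)             # per-class means

# pooled within-class variance (the p = 1 formula above)
n = length(x); K = nlevels(y)
dev = x - mu.hat[as.character(y)]       # deviation of each x_i from its class mean
sigma2.hat = sum(dev^2) / (n - K)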

Estimating the parameters of \(f_k(x)\)#

  • For an input \(x\), predict the class with the largest:

\[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k + x^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k\]
  • The decision boundaries are defined by \(\left\{x: \hat\delta_k(x) = \hat\delta_{\ell}(x) \right\}\), \(1 \leq k,\ell \leq K\).

Quadratic discriminant analysis (QDA)#

  • The assumption that the inputs of every class have the same covariance \(\mathbf{\Sigma}\) can be quite restrictive.

QDA: multivariate normal with differing covariance#

  • In quadratic discriminant analysis we estimate a mean \(\hat\mu_k\) and a covariance matrix \(\hat{\mathbf \Sigma}_k\) for each class separately.

  • Given an input, it is easy to derive an objective function:
    \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}_k^{-1}\mu_k + x^T \mathbf{\Sigma}_k^{-1}\mu_k - \frac{1}{2}x^T \mathbf{\Sigma}_k^{-1}x -\frac{1}{2}\log |\mathbf{\Sigma}_k|\]

  • This objective is now quadratic in \(x\), so the decision boundaries are zero sets of quadratic functions.
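In R, qda from the MASS package fits this model with the same interface as lda; a minimal sketch on the Default data:

library(MASS)
library(ISLR)
qda.fit = qda(default ~ balance + student, data = Default)
qda.pred = predict(qda.fit, Default)
table(qda.pred$class, Default$default)  # confusion matrix, as for LDA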

QDA decision boundaries#

Figure: Bayes boundary (– – –), LDA (\(\cdot\cdot\cdot\)), QDA (——).

Evaluating a classification method#

  • We have talked about the 0-1 loss:

\[\frac{1}{m}\sum_{i=1}^m \mathbf{1}(y_i \neq \hat y_i).\]
  • It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn’t tell you anything about this.

  • A much more informative summary of the error is a confusion matrix:

Evaluating a classification method#

library(MASS) # where the `lda` function lives
lda.fit = predict(lda(default ~ balance + student, data=Default))
table(lda.fit$class, Default$default)
  1. The error rate among people who do not default (false positive rate) is very low.

  2. However, the rate of false negatives is 76%.

  3. It is possible that false negatives are a bigger source of concern!

  4. One possible solution: Change the threshold

Evaluating a classification method#

new.class = rep("No", length(Default$default))
new.class[lda.fit$posterior[,"Yes"] > 0.2] = "Yes"
table(new.class, Default$default)
  • Predict Yes if \(P(\mathtt{default}=\text{Yes} \mid X) > \color{Red}{0.2}\).

  • Changing the threshold to 0.2 makes it easier to predict Yes.

  • Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.

Evaluating a classification method#

Let’s visualize the dependence of the error on the threshold:

– – – False negative rate (error for defaulting customers), \(\cdot\cdot\cdot\) False positive rate (error for non-defaulting customers), ——– Overall error rate.

The ROC curve#

  • Displays the performance of the method for any choice of threshold.
  • The area under the curve (AUC) measures the quality of the classifier:
    • 0.5 is the AUC for a random classifier
    • The closer the AUC is to 1, the better.
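A minimal sketch for the LDA posterior probabilities computed earlier (lda.fit), assuming the pROC package is installed (other packages, e.g. ROCR, work similarly):

library(pROC)
roc.obj = roc(Default$default, lda.fit$posterior[, "Yes"])
plot(roc.obj)   # ROC curve traced out over all thresholds
auc(roc.obj)    # area under the curve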

Comparing classification methods through simulation#

  • Simulate data from several different known distributions with $2$ predictors and a binary response variable.
  • Compare the test error (0-1 loss) for the following methods:
    • KNN-1
    • KNN-CV ("optimally tuned" KNN)
    • Logistic regression
    • Linear discriminant analysis (LDA)
    • Quadratic discriminant analysis (QDA)
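A minimal sketch of one such comparison (roughly scenario 1 below, with hypothetical sample sizes and mean shift; knn comes from the class package):

library(MASS); library(class)
set.seed(1)
sim.data = function(n) {
  y = rbinom(n, 1, 0.5)
  x = matrix(rnorm(2 * n), n, 2) + 1.5 * y   # class 1 shifted; no correlation
  data.frame(X1 = x[, 1], X2 = x[, 2], y = factor(y))
}
train = sim.data(200); test = sim.data(2000)

# test 0-1 loss for each method
mean(knn(train[, 1:2], test[, 1:2], train$y, k = 1) != test$y)        # KNN-1
mean(predict(lda(y ~ X1 + X2, data = train), test)$class != test$y)   # LDA
mean(predict(qda(y ~ X1 + X2, data = train), test)$class != test$y)   # QDA
glm.p = predict(glm(y ~ X1 + X2, family = binomial, data = train),
                test, type = "response")
mean(ifelse(glm.p > 0.5, "1", "0") != test$y)                         # logistic regression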

Scenario 1#

  • \(X_1,X_2\) normal with identical variance.

  • No correlation in either class.

Scenario 2#

  • \(X_1,X_2\) normal with identical variance.

  • Correlation is -0.5 in both classes.

Scenario 3#

  • \(X_1,X_2\) follow a Student's \(t\) distribution.

  • No correlation in either class.

Results for first 3 scenarios#

Scenario 4#

  • \(X_1, X_2\) normal with identical variance.

  • First class has correlation 0.5, second class has correlation -0.5.

Scenario 5#

  • \(X_1, X_2\) normal with identical variance.

  • Response \(Y\) was sampled from:
    \[ P(Y=1 \mid X) = \frac{e^{\beta_0+\beta_1 X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}{1+e^{\beta_0+\beta_1X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}. \]

  • The true decision boundary is quadratic, but this is not a QDA model. (Why?)

Scenario 6#

  • \(X_1, X_2\) normal with identical variance.

  • Response \(Y\) was sampled from:
    \[ P(Y=1 \mid X) = \frac{e^{f_\text{nonlinear}(X_1,X_2)}}{1+e^{f_\text{nonlinear}(X_1,X_2)}}. \]

  • The true decision boundary is very rough.

Results for scenarios 4-6#