Supervised learning with a qualitative or categorical response.

Basic approach#

Classification is just as common as regression, if not more common:

  1. Medical diagnosis: Given a patient's symptoms, predict which of 3 possible conditions the patient has.

  2. Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client’s history, etc.

  3. Web searching: Based on a user’s history, location, and the string of a web search, predict which link a person is likely to click.

  4. Online advertising: Predict whether a user will click on an ad or not.

Review: Bayes classifier#

  • Suppose \(P(Y\mid X)\) is known. Then, given an input \(x_0\), we predict the response

\[\hat y_0 = \text{argmax}_{\;y}\; P(Y=y \mid X=x_0).\]
  • The Bayes classifier minimizes the expected 0-1 loss:

\[ E\left[ \frac{1}{m} \sum_{i=1}^m \mathbf{1}(\hat y_i \neq y_i) \right]\]
  • This minimum 0-1 loss (the best we can hope for) is the Bayes error rate.
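As a quick illustration (a made-up one-dimensional example, not from the text): if \(P(Y=1\mid X=x)\) is known exactly, the Bayes classifier simply predicts the more probable class, and its error rate can be approximated by simulation:

# Hypothetical example: X ~ Uniform(0,1) and P(Y=1 | X=x) = x
set.seed(1)
n = 100000
x = runif(n)
y = rbinom(n, 1, x)            # responses drawn from the true model
yhat = ifelse(x > 0.5, 1, 0)   # Bayes classifier: predict the more likely class
mean(yhat != y)                # approximates the Bayes error rate (1/4 here)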

Basic strategy: estimate \(P(Y\mid X)\)#

  • If we have a good estimate of the conditional probability \(\hat P(Y\mid X)\), we can use the classifier:
    \[\hat y_0 = \text{argmax}_{\;y}\; \hat P(Y=y \mid X=x_0).\]
  • Suppose \(Y\) is a binary variable. Could we use a linear model?
    \[P(Y=1 \mid X) = \beta_0 + \beta_1X_1 + \dots+ \beta_pX_p \]
  • Problems:
    • This would allow probabilities \(<0\) and \(>1\).
    • Difficult to extend to more than 2 categories.

Logistic regression#

  • We model the conditional probabilities as:

\[ \begin{aligned} P(Y=1 \mid X) &= \frac{e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}} \\ P(Y=0 \mid X) &= \frac{1}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}. \end{aligned} \]

This is the same as using a linear model for the log odds:

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p.\]
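To see how this keeps probabilities in \([0,1]\), here is a minimal sketch with made-up coefficients (\(\beta_0=-1\), \(\beta_1=2\)), using base R's plogis, which computes the logistic function:

beta0 = -1; beta1 = 2                # hypothetical coefficients
x = seq(-5, 5, by = 0.1)
p = plogis(beta0 + beta1 * x)        # e^(b0 + b1 x) / (1 + e^(b0 + b1 x))
range(p)                             # strictly between 0 and 1
plot(x, p, type = "l", ylab = "P(Y = 1 | X = x)")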

Fitting logistic regression#

  • The training data is a list of pairs \((y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)\).

  • We don’t observe the left-hand side of the model

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p,\]
  • \(\implies\) We cannot use a least squares fit.

Likelihood#

  • Solution: The likelihood is the probability of the training data, for a fixed set of coefficients \(\beta_0,\dots,\beta_p\):

\[\prod_{i=1}^n P(Y=y_i \mid X=x_i) \]
  • We can rewrite this as:

\[ \underbrace{\prod_{i; y_i=1}\frac{e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}{1+e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}}_\text{Probability of $Y$ = 1 given $X$'s} \underbrace{\prod_{j; y_j=0}\frac{1}{1+e^{\beta_0 + \beta_1 x_{j1} +\dots+\beta_p x_{jp}}}}_\text{Probability of $Y$ = 0 given $X$'s} \]
  • Choose estimates \(\hat \beta_0, \dots,\hat \beta_p\) which maximize the likelihood.

  • Solved with numerical methods (e.g. Newton’s algorithm).
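As a rough sketch of what such a fit does (using the general-purpose optimizer optim rather than the Newton/IRLS updates glm actually uses), we can minimize the negative log-likelihood directly on simulated data:

set.seed(1)
n = 500
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + 1.5 * x))    # true coefficients: -0.5 and 1.5

# negative log-likelihood of the logistic model (one predictor)
negloglik = function(beta) {
  eta = beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
optim(c(0, 0), negloglik)$par   # close to coef(glm(y ~ x, family=binomial))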

Logistic regression in R#

library(ISLR)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              family=binomial, data=Smarket)
summary(glm.fit)
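To turn the fit into class predictions (a standard follow-up; the threshold 0.5 is our choice, not part of the model):

glm.probs = predict(glm.fit, type = "response")   # P(Direction = "Up" | X)
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")  # classify at threshold 0.5
mean(glm.pred == Smarket$Direction)               # training accuracy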

Inference for logistic regression#

  1. We can estimate the standard error of each coefficient.

  2. The \(z\)-statistic is the equivalent of the \(t\)-statistic in linear regression:

\[z = \frac{\hat \beta_j}{\text{SE}(\hat\beta_j)}.\]
  3. The \(p\)-values are tests of the null hypothesis \(\beta_j=0\) (Wald test).

  4. Other possible hypothesis tests: likelihood ratio test (chi-square distribution).
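Both can be carried out in R; a minimal sketch for the glm.fit above (coef(summary()) gives the Wald table, and anova() with test="Chisq" performs the likelihood ratio test against a smaller nested model):

coef(summary(glm.fit))   # Estimate, Std. Error, z value, Pr(>|z|)

# likelihood ratio test against the intercept-only model
glm.null = glm(Direction ~ 1, family = binomial, data = Smarket)
anova(glm.null, glm.fit, test = "Chisq")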

Example: Predicting credit card default#

Predictors:

  • student: 1 if student, 0 otherwise

  • balance: credit card balance

  • income: person’s income.

Confounding#

In this dataset, there is confounding, but little collinearity.

  • Students tend to have higher balances, so student partially explains balance (but not very well).

  • People with a high balance are more likely to default.

  • Among people with a given balance, students are less likely to default.

Results: predicting credit card default#

Using only balance#

library(ISLR) # where Default data is stored
summary(glm(default ~ balance,
        family=binomial, data=Default))

Using only student#

summary(glm(default ~ student,
        family=binomial, data=Default))

Using both balance and student#

summary(glm(default ~ balance + student,
        family=binomial, data=Default))

Using all 3 predictors#

summary(glm(default ~ balance + income + student,
        family=binomial, data=Default))

Multinomial logistic regression#

  • Extension of logistic regression to more than 2 categories

  • Suppose \(Y\) takes values in \(\{1,2,\dots,K\}\), then we can use a linear model for the log odds against a baseline category (e.g. 1): for \(j \neq 1\)

\[\log\left[\frac{P(Y=j \mid X)}{P(Y=1 \mid X)}\right] = \beta_{0,j} + \beta_{1,j} X_1 +\dots+\beta_{p,j} X_p\]
  • In this case \(\beta \in \mathbb{R}^{(p+1) \times (K-1)}\) is a matrix of coefficients (one column per non-baseline category, including the intercepts).
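In R, one common way to fit this model is multinom from the nnet package (assumed installed); a minimal sketch on the built-in iris data:

library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
coef(fit)   # one row of coefficients per non-baseline class (setosa is the baseline)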

Some potential problems#

  • The coefficients become unstable when there is collinearity. Furthermore, this affects the convergence of the fitting algorithm.

  • When the classes are well separated, the coefficients become unstable. This is always the case when \(p\geq n-1\). In this case, prediction error is low, but \(\hat{\beta}\) is very variable.

Linear Discriminant Analysis (LDA)#

  • **Strategy:** Instead of estimating \(P(Y\mid X)\) directly, we could estimate:
    1. \(\hat P(X \mid Y)\): Given the response, what is the distribution of the inputs?
    2. \(\hat P(Y)\): How likely each of the categories is.
  • Then, we use Bayes' rule to obtain the estimate:
    \[ \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned}\]
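For instance, with one predictor, two classes, and known normal class-conditional densities (all numbers made up), Bayes' rule gives the posterior directly:

# hypothetical priors and class-conditional densities
pi1 = 0.7; pi0 = 0.3                          # P(Y=1), P(Y=0)
f1 = function(x) dnorm(x, mean = 2, sd = 1)   # density of X given Y = 1
f0 = function(x) dnorm(x, mean = 0, sd = 1)   # density of X given Y = 0

x0 = 1
f1(x0) * pi1 / (f1(x0) * pi1 + f0(x0) * pi0)  # P(Y = 1 | X = x0)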

LDA: multivariate normal with equal covariance#

  • LDA is the special case of the above strategy when \(P(X \mid Y=k) = N(\mu_k, \mathbf\Sigma)\).

  • That is, within each class the features have multivariate normal distribution with center depending on the class and common covariance \(\mathbf\Sigma\).

  • The probabilities \(P(Y=k)\) are estimated by the fraction of training samples of class \(k\).

Decision boundaries#

LDA has (piecewise) linear decision boundaries#

Suppose that:

  1. We know \(P(Y=k) = \pi_k\) exactly.

  2. \(P(X=x | Y=k)\) is multivariate normal with density:

\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  3. Above: \(\mu_k\) is the mean of the inputs for category \(k\), and \(\mathbf\Sigma\) is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?

LDA has linear decision boundaries#

  • By Bayes rule, the probability of category \(k\), given the input \(x\) is:

\[P(Y=k \mid X=x) = \frac{f_k(x) \pi_k}{P(X=x)}\]
  • The denominator does not depend on the response \(k\), so we can write it as a constant:

\[P(Y=k \mid X=x) = C \times f_k(x) \pi_k\]
  • Now, expanding \(f_k(x)\):

\[P(Y=k \mid X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Let’s absorb everything that does not depend on \(k\) into a constant \(C'\):

\[P(Y=k \mid X=x) = C'\pi_k e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Take the logarithm of both sides:
    \[\log P(Y=k \mid X=x) = \color{Red}{\log C'} + \color{blue}{\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]

  • The term \(\log C'\) is the same for every category \(k\).

  • We want to find the maximum of this expression over \(k\).

LDA has linear decision boundaries#

  • Goal is to maximize the following over \(k\):
    \[ \begin{aligned} &\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)\\ =&\log \pi_k - \frac{1}{2}\left[x^T \mathbf{\Sigma}^{-1}x + \mu_k^T \mathbf{\Sigma}^{-1}\mu_k\right] + x^T \mathbf{\Sigma}^{-1}\mu_k \\ =& C'' + \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k \end{aligned} \]
    (Here \(C''\) absorbs the term \(-\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x\), which does not depend on \(k\).)

  • We define the objectives (called discriminant functions):
    \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k\]
    At an input \(x\), we predict the response with the highest \(\delta_k(x)\).

LDA has linear decision boundaries#

  • What is the decision boundary? It is the set of points \(x\) at which 2 classes do just as well (i.e. the discriminant functions of the two classes agree at \(x\)):
    \[ \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell \end{aligned} \]

  • This is a linear equation in \(x\).

Decision boundaries revisited#

Estimating \(\pi_k\)#

\[\hat \pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]

In English: the fraction of training samples of class \(k\).

Estimating the parameters of \(f_k(x)\)#

  • Estimate the center of each class, \(\mu_k\):
    \[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]
  • Estimate the common covariance matrix \(\mathbf{\Sigma}\):
    • One predictor (\(p=1\)):
      \[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\]
    • Many predictors (\(p>1\)): Compute the vectors of deviations \((x_1 -\hat \mu_{y_1}),(x_2 -\hat \mu_{y_2}),\dots,(x_n -\hat \mu_{y_n})\) and use their (unbiased) sample covariance matrix as the estimate \(\hat{\mathbf\Sigma}\).
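As a sketch, these estimates are easy to compute by hand, e.g. on the Default data with the single predictor balance (the \(p=1\) case):

library(ISLR)                           # for the Default data
y = Default$default
x = Default$balance

pi.hat = table(y) / length(y)           # estimated class probabilities
mu.hat = tapply(x, y, mean)             # per-class means

# pooled within-class variance (the p = 1 formula above)
n = length(x); K = nlevels(y)
dev = x - mu.hat[as.character(y)]       # deviation of each x_i from its class mean
sigma2.hat = sum(dev^2) / (n - K)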

Estimating the parameters of \(f_k(x)\)#

  • For an input \(x\), predict the class with the largest:

\[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k + x^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k\]
  • The decision boundaries are defined by \(\left\{x: \hat\delta_k(x) = \hat\delta_{\ell}(x) \right\}\), \(1 \leq k,\ell \leq K\).

Quadratic discriminant analysis (QDA)#

  • The assumption that the inputs of every class have the same covariance \(\mathbf{\Sigma}\) can be quite restrictive.

QDA: multivariate normal with differing covariance#

  • In quadratic discriminant analysis we estimate a mean \(\hat\mu_k\) and a covariance matrix \(\hat{\mathbf \Sigma}_k\) for each class separately.

  • Given an input, it is easy to derive an objective function:
    \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}_k^{-1}\mu_k + x^T \mathbf{\Sigma}_k^{-1}\mu_k - \frac{1}{2}x^T \mathbf{\Sigma}_k^{-1}x -\frac{1}{2}\log |\mathbf{\Sigma}_k|\]

  • This objective is now quadratic in \(x\), so the decision boundaries are zero sets of quadratic functions.
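In R, qda from the MASS package fits this model with the same interface as lda; a minimal sketch on the Default data:

library(MASS)
library(ISLR)
qda.fit = qda(default ~ balance + student, data = Default)
qda.pred = predict(qda.fit, Default)
table(qda.pred$class, Default$default)  # confusion matrix, as for LDA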

QDA decision boundaries#

Figure: Bayes boundary (– – –), LDA (\(\cdot\cdot\cdot\)), QDA (——).

Evaluating a classification method#

  • We have talked about the 0-1 loss:

\[\frac{1}{m}\sum_{i=1}^m \mathbf{1}(y_i \neq \hat y_i).\]
  • It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn’t tell you anything about this.

  • A much more informative summary of the error is a confusion matrix:

Evaluating a classification method#

library(MASS) # where the `lda` function lives
lda.fit = predict(lda(default ~ balance + student, data=Default))
table(lda.fit$class, Default$default)
  1. The error rate among people who do not default (false positive rate) is very low.

  2. However, the rate of false negatives is 76%.

  3. It is possible that false negatives are a bigger source of concern!

  4. One possible solution: Change the threshold

Evaluating a classification method#

new.class = rep("No", length(Default$default))
new.class[lda.fit$posterior[,"Yes"] > 0.2] = "Yes"
table(new.class, Default$default)
  • Predict Yes if \(P(\mathtt{default}=\text{Yes} \mid X) > \color{Red}{0.2}\).

  • Changing the threshold to 0.2 makes it easier to predict Yes.

  • Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.

Evaluating a classification method#

Let’s visualize the dependence of the error on the threshold:

– – – False negative rate (error for defaulting customers), \(\cdot\cdot\cdot\) False positive rate (error for non-defaulting customers), ——– Overall error rate.

The ROC curve#

  • Displays the performance of the method for any choice of threshold.
  • The area under the curve (AUC) measures the quality of the classifier:
    • 0.5 is the AUC for a random classifier
    • The closer the AUC is to 1, the better.
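A minimal sketch for the LDA posterior probabilities computed earlier (lda.fit), assuming the pROC package is installed (other packages, e.g. ROCR, work similarly):

library(pROC)
roc.obj = roc(Default$default, lda.fit$posterior[, "Yes"])
plot(roc.obj)   # ROC curve traced out over all thresholds
auc(roc.obj)    # area under the curve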

Comparing classification methods through simulation#

  • Simulate data from several different known distributions with $2$ predictors and a binary response variable.
  • Compare the test error (0-1 loss) for the following methods:
    • KNN-1
    • KNN-CV ("optimally tuned" KNN)
    • Logistic regression
    • Linear discriminant analysis (LDA)
    • Quadratic discriminant analysis (QDA)
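A minimal sketch of one such comparison (roughly scenario 1 below, with hypothetical sample sizes and mean shift; knn comes from the class package):

library(MASS); library(class)
set.seed(1)
sim.data = function(n) {
  y = rbinom(n, 1, 0.5)
  x = matrix(rnorm(2 * n), n, 2) + 1.5 * y   # class 1 shifted; no correlation
  data.frame(X1 = x[, 1], X2 = x[, 2], y = factor(y))
}
train = sim.data(200); test = sim.data(2000)

# test 0-1 loss for each method
mean(knn(train[, 1:2], test[, 1:2], train$y, k = 1) != test$y)        # KNN-1
mean(predict(lda(y ~ X1 + X2, data = train), test)$class != test$y)   # LDA
mean(predict(qda(y ~ X1 + X2, data = train), test)$class != test$y)   # QDA
glm.p = predict(glm(y ~ X1 + X2, family = binomial, data = train),
                test, type = "response")
mean(ifelse(glm.p > 0.5, "1", "0") != test$y)                         # logistic regression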

Scenario 1#

  • \(X_1,X_2\) normal with identical variance.

  • No correlation in either class.

Scenario 2#

  • \(X_1,X_2\) normal with identical variance.

  • Correlation is -0.5 in both classes.

Scenario 3#

  • \(X_1,X_2\) follow a Student's \(t\) distribution.

  • No correlation in either class.

Results for first 3 scenarios#

Scenario 4#

  • \(X_1, X_2\) normal with identical variance.

  • First class has correlation 0.5, second class has correlation -0.5.

Scenario 5#

  • \(X_1, X_2\) normal with identical variance.

  • Response \(Y\) was sampled from:
    \[ P(Y=1 \mid X) = \frac{e^{\beta_0+\beta_1 X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}{1+e^{\beta_0+\beta_1X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}. \]

  • The true decision boundary is quadratic, but this is not a QDA model. (Why?)

Scenario 6#

  • \(X_1, X_2\) normal with identical variance.

  • Response \(Y\) was sampled from:
    \[ P(Y=1 \mid X) = \frac{e^{f_\text{nonlinear}(X_1,X_2)}}{1+e^{f_\text{nonlinear}(X_1,X_2)}}. \]

  • The true decision boundary is very rough.

Results for scenarios 4-6#