Basic approach#
Classification: supervised learning with a qualitative or categorical response.
Classification is just as common as regression, if not more so. Some examples:
- Medical diagnosis: given the symptoms a patient shows, predict which of 3 conditions they have.
- Online banking: determine whether a transaction is fraudulent or not, on the basis of the IP address, the client’s history, etc.
- Web searching: based on a user’s history, location, and the text of a web search, predict which link the user is likely to click.
- Online advertising: predict whether a user will click on an ad or not.
Review: Bayes classifier#
Suppose \(P(Y\mid X)\) is known. Then, given an input \(x_0\), we predict the response
\[y_0 = \text{argmax}_{\;y}\; P(Y=y \mid X=x_0).\]
The Bayes classifier minimizes the expected 0-1 loss, i.e. the probability of a misclassification:
\[\mathbb{E}\left[\mathbf{1}(Y \neq \hat y(X))\right] = P(Y \neq \hat y(X)).\]
This minimum 0-1 loss (the best we can hope for) is the Bayes error rate.
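As a tiny illustration (the conditional probabilities below are made up), the Bayes classifier simply picks the most probable category at the given input:
p.cond = c(cond.A = 0.2, cond.B = 0.5, cond.C = 0.3)  # assumed known P(Y = k | X = x0)
names(which.max(p.cond))   # Bayes prediction at x0: "cond.B"
1 - max(p.cond)            # conditional Bayes error at x0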
Basic strategy: estimate \(P(Y\mid X)\)#
- If we have a good estimate for the conditional probability $\hat P(Y\mid X)$, we can use the classifier:
\[\hat y_0 = \text{argmax}_{\;y}\; \hat P(Y=y \mid X=x_0).\]
- Suppose $Y$ is a binary variable. Could we use a linear model?
\[P(Y=1 \mid X) = \beta_0 + \beta_1X_1 + \dots+ \beta_pX_p \]
- Problems:
- This would allow probabilities $<0$ and $>1$.
- Difficult to extend to more than 2 categories.
Logistic regression#
We model the conditional probability as:
\[P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}.\]
This is the same as using a linear model for the log odds:
\[\log\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)} = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p.\]
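A quick sketch (with made-up coefficients) of how the logistic transform maps the linear log odds into a valid probability in \((0,1)\):
beta0 = -10; beta1 = 0.005            # hypothetical values, for illustration only
x = c(500, 2000, 3500)
log.odds = beta0 + beta1 * x          # the linear predictor: any real number
exp(log.odds) / (1 + exp(log.odds))   # probabilities in (0, 1); same as plogis(log.odds)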
Fitting logistic regression#
The training data is a list of pairs \((y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)\).
We don’t observe the probability on the left-hand side of the model, only the class labels \(y_i\).
\(\implies\) We cannot use a least squares fit.
Likelihood#
Solution: The likelihood is the probability of the training data, for a fixed set of coefficients \(\beta_0,\dots,\beta_p\):
\[L(\beta_0,\dots,\beta_p) = \prod_{i=1}^n P(Y=y_i \mid X=x_i).\]
Writing \(p(x_i) = P(Y=1 \mid X=x_i)\), we can rewrite this as:
\[L(\beta_0,\dots,\beta_p) = \prod_{i:\,y_i=1} p(x_i) \prod_{i:\,y_i=0} \left(1 - p(x_i)\right).\]
Choose estimates \(\hat \beta_0, \dots,\hat \beta_p\) which maximize the likelihood.
The maximization is solved with numerical methods (e.g. Newton’s method).
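As a sanity check, here is a minimal sketch (using the Default data from ISLR, with a standardized predictor to help the optimizer) showing that numerically maximizing the log-likelihood approximately recovers the glm estimates:
library(ISLR)
y = as.numeric(Default$default == "Yes")
x = as.numeric(scale(Default$balance))    # standardized to aid convergence
negloglik = function(b) {
  eta = b[1] + b[2] * x
  -sum(y * eta - log(1 + exp(eta)))       # minus the Bernoulli log-likelihood
}
optim(c(0, 0), negloglik)$par             # numerical maximum likelihood
coef(glm(y ~ x, family=binomial))         # should be nearly identical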
Logistic regression in R#
library(ISLR)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
family=binomial, data=Smarket)
summary(glm.fit)
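To turn the fitted model into a classifier, we can threshold the fitted probabilities at 0.5 (Direction has levels Down/Up, so glm models \(P(\mathtt{Up} \mid X)\)):
glm.probs = predict(glm.fit, type="response")    # fitted P(Direction = Up | X)
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Smarket$Direction)               # confusion matrix on training data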
Inference for logistic regression#
We can estimate the standard error of each coefficient.
The \(z\)-statistic is the analogue of the \(t\)-statistic in linear regression:
\[z_j = \frac{\hat\beta_j}{\text{SE}(\hat\beta_j)}.\]
The \(p\)-values are tests of the null hypothesis \(\beta_j=0\) (Wald test).
Other possible hypothesis tests: likelihood ratio test (chi-square distribution).
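These quantities can be read off, or recomputed by hand, from the coefficient table of the fit above:
est = coef(summary(glm.fit))[, "Estimate"]
se  = coef(summary(glm.fit))[, "Std. Error"]
est / se                   # the "z value" column of summary(glm.fit)
2 * pnorm(-abs(est / se))  # the Wald p-values, "Pr(>|z|)"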
Example: Predicting credit card default#
Predictors:
- student: 1 if student, 0 otherwise
- balance: credit card balance
- income: person’s income
Confounding#
In this dataset, there is confounding, but little collinearity.
Students tend to have higher balances, as the check after this list illustrates. So:
- balance is explained by student, but not very well.
- People with a high balance are more likely to default.
- Among people with a given balance, students are less likely to default.
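These claims are easy to check directly in R; a minimal sketch:
library(ISLR)   # Default data
boxplot(balance ~ student, data=Default)   # students tend to carry higher balances
# raw default rates by student status: students look worse marginally,
# because of their higher balances
tapply(Default$default == "Yes", Default$student, mean)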
Results: predicting credit card default#
Using only balance#
library(ISLR) # where Default data is stored
summary(glm(default ~ balance,
family=binomial, data=Default))
Using only student#
summary(glm(default ~ student,
family=binomial, data=Default))
Using both balance and student#
summary(glm(default ~ balance + student,
family=binomial, data=Default))
Using all 3 predictors#
summary(glm(default ~ balance + income + student,
family=binomial, data=Default))
Multinomial logistic regression#
Extension of logistic regression to more than 2 categories.
Suppose \(Y\) takes values in \(\{1,2,\dots,K\}\); then we can use a linear model for the log odds against a baseline category (e.g. 1): for \(j \neq 1\),
\[\log\frac{P(Y=j \mid X)}{P(Y=1 \mid X)} = \beta_{0,j} + \beta_{1,j}X_1 + \dots + \beta_{p,j}X_p.\]
In this case \(\beta \in \mathbb{R}^{p \times (K-1)}\) is a matrix of coefficients.
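In R, multinomial logistic regression can be fit with the multinom function from the nnet package; a minimal sketch using the built-in iris data (chosen only because it has a 3-category response):
library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data=iris)
coef(fit)   # one row of coefficients per non-baseline category (K - 1 rows)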
Some potential problems#
The coefficients become unstable when there is collinearity. Furthermore, this affects the convergence of the fitting algorithm.
When the classes are well separated, the coefficients become unstable. This is always the case when \(p\geq n-1\). In this case, prediction error is low, but \(\hat{\beta}\) is very variable.
Linear Discriminant Analysis (LDA)#
- **Strategy:** Instead of estimating $P(Y\mid X)$ directly, we could estimate:
- $\hat P(X \mid Y)$: the distribution of the inputs, given the response.
- $\hat P(Y)$: how likely each of the categories is.
- Then, we use *Bayes’ rule* to obtain the estimate: $$ \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned}$$
LDA: multivariate normal with equal covariance#
LDA is the special case of the above strategy when \(P(X \mid Y=k) = N(\mu_k, \mathbf\Sigma)\).
That is, within each class the features have multivariate normal distribution with center depending on the class and common covariance \(\mathbf\Sigma\).
The probabilities \(P(Y=k)\) are estimated by the fraction of training samples of class \(k\).
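The lda function in the MASS package reports exactly these estimates; a quick sketch:
library(MASS)   # lda
library(ISLR)   # Default data
lda.fit = lda(default ~ balance, data=Default)
lda.fit$prior   # estimated P(Y = k): the class proportions
lda.fit$means   # estimated class centers mu_k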
Decision boundaries#
LDA has (piecewise) linear decision boundaries#
Suppose that:
We know \(P(Y=k) = \pi_k\) exactly.
\(P(X=x \mid Y=k)\) is multivariate normal with density:
\[f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\mathbf\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
Above, \(\mu_k\) is the mean of the inputs for category \(k\), and \(\mathbf\Sigma\) is the covariance matrix (common to all categories).
Then, what is the Bayes classifier?
LDA has linear decision boundaries#
By Bayes’ rule, the probability of category \(k\), given the input \(x\), is:
\[P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{P(X=x)}\]
The denominator does not depend on the response \(k\), so we can write it as a constant:
\[P(Y=k \mid X=x) = C\, \pi_k f_k(x)\]
Now, expanding \(f_k(x)\):
\[P(Y=k \mid X=x) = C\, \pi_k \frac{1}{(2\pi)^{p/2}\,|\mathbf\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
Let’s absorb everything that does not depend on \(k\) into a constant \(C'\):
\[P(Y=k \mid X=x) = C' \pi_k\, e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
Take the logarithm of both sides:
\[\log P(Y=k \mid X=x) = \color{Red}{\log C'} + \color{blue}{\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]
The term \(\color{Red}{\log C'}\) is the same for every category \(k\), so it can be ignored.
We want to find the maximum of this expression over \(k\).
LDA has linear decision boundaries#
The goal is to maximize the following over \(k\):
\[ \begin{aligned} &\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)\\ =&\log \pi_k - \frac{1}{2}\left[x^T \mathbf{\Sigma}^{-1}x + \mu_k^T \mathbf{\Sigma}^{-1}\mu_k\right] + x^T \mathbf{\Sigma}^{-1}\mu_k \\ =& C'' + \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k \end{aligned} \]
Here \(C'' = -\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x\) does not depend on \(k\).
We define the objectives (called discriminant functions):
\[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k\]
At an input \(x\), we predict the response with the highest \(\delta_k(x)\).
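For one predictor (\(p=1\)), the discriminant functions are simple enough to compute by hand; a sketch on the Default data (the variable names are illustrative):
library(ISLR)
x = Default$balance; y = Default$default
pi.hat = table(y) / length(y)                        # estimated class proportions
mu.hat = tapply(x, y, mean)                          # estimated class means
s2.hat = sum((x - mu.hat[as.integer(y)])^2) / (length(x) - 2)  # pooled variance (n - K)
delta = function(x0, k)                              # discriminant function for p = 1
  unname(log(pi.hat[k]) - mu.hat[k]^2 / (2 * s2.hat) + x0 * mu.hat[k] / s2.hat)
# predict the class with the largest discriminant at balance = 1500:
names(which.max(c(No = delta(1500, "No"), Yes = delta(1500, "Yes"))))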
LDA has linear decision boundaries#
What is the decision boundary? It is the set of points \(x\) at which two classes do just as well (i.e. the discriminant functions of the two classes agree at \(x\)):
\[ \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell \end{aligned} \]
This is a linear equation in \(x\).
Decision boundaries revisited#
Estimating \(\pi_k\)#
\[\hat\pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]
In English: the fraction of training samples of class \(k\).
Estimating the parameters of \(f_k(x)\)#
- Estimate the center of each class $\mu_k$:
\[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]
- Estimate the common covariance matrix $\mathbf{\Sigma}$:
- One predictor ($p=1$):
\[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\]
- Many predictors ($p>1$): Compute the vectors of deviations $(x_1 -\hat \mu_{y_1}),(x_2 -\hat \mu_{y_2}),\dots,(x_n -\hat \mu_{y_n})$ and use an unbiased estimate of its covariance matrix, $\mathbf\Sigma$.
Estimating the parameters of \(f_k(x)\)#
For an input \(x\), predict the class with the largest:
\[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \hat{\mathbf{\Sigma}}^{-1}\hat\mu_k + x^T \hat{\mathbf{\Sigma}}^{-1}\hat\mu_k\]
The decision boundaries are defined by \(\left\{x: \delta_k(x) = \delta_{\ell}(x) \right\}, 1 \leq k,\ell \leq K\).
Quadratic discriminant analysis (QDA)#
The assumption that the inputs of every class have the same covariance \(\mathbf{\Sigma}\) can be quite restrictive:
QDA: multivariate normal with differing covariance#
In quadratic discriminant analysis we estimate a mean \(\hat\mu_k\) and a covariance matrix \(\hat{\mathbf \Sigma}_k\) for each class separately.
Given an input, it is easy to derive an objective function:
\[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}_k^{-1}\mu_k + x^T \mathbf{\Sigma}_k^{-1}\mu_k - \frac{1}{2}x^T \mathbf{\Sigma}_k^{-1}x -\frac{1}{2}\log |\mathbf{\Sigma}_k|\]
This objective is now quadratic in \(x\), and so the decision boundaries are zeros of quadratic functions.
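In R, QDA is available as the qda function in the MASS package, with the same interface as lda; a minimal sketch:
library(MASS); library(ISLR)
qda.fit = qda(default ~ balance + income, data=Default)
table(predict(qda.fit)$class, Default$default)   # training confusion matrix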
QDA decision boundaries#
Bayes boundary (– – –), LDA (\(\cdot\cdot\cdot\)), QDA (——–).
Evaluating a classification method#
We have talked about the 0-1 loss:
\[\frac{1}{n}\sum_{i=1}^n \mathbf{1}(y_i \neq \hat y_i).\]
It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn’t tell you anything about this.
A much more informative summary of the error is a confusion matrix:
Evaluating a classification method#
library(MASS) # where the `lda` function lives
lda.fit = predict(lda(default ~ balance + student, data=Default))
table(lda.fit$class, Default$default)
The error rate among people who do not default (false positive rate) is very low.
However, the rate of false negatives is 76%.
It is possible that false negatives are a bigger source of concern!
One possible solution: change the threshold.
Evaluating a classification method#
new.class = rep("No", length(Default$default))
new.class[lda.fit$posterior[,"Yes"] > 0.2] = "Yes"
table(new.class, Default$default)
- Predicted Yes if \(P(\mathtt{default}=\text{yes} \mid X) > \color{Red}{0.2}\).
- Changing the threshold to 0.2 makes it easier to classify to Yes.
- Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
Evaluating a classification method#
Let’s visualize the dependence of the error on the threshold:
– – – False negative rate (error for defaulting customers), \(\cdot\cdot\cdot\) False positive rate (error for non-defaulting customers), ——– Overall error rate.
The ROC curve#
The ROC curve traces the true positive rate against the false positive rate as the classification threshold varies; the closer the curve hugs the top-left corner, the better the classifier.
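A sketch of tracing the ROC curve by hand, sweeping the threshold over the posterior probabilities from the LDA fit above (no extra packages needed):
p.yes = lda.fit$posterior[, "Yes"]       # posterior P(default = Yes | X)
ths = seq(0, 1, by=0.01)
tpr = sapply(ths, function(t) mean(p.yes[Default$default == "Yes"] > t))
fpr = sapply(ths, function(t) mean(p.yes[Default$default == "No"] > t))
plot(fpr, tpr, type="l", xlab="False positive rate", ylab="True positive rate")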
Comparing classification methods through simulation#
- Simulate data from several different known distributions with $2$ predictors and a binary response variable.
- Compare the test error (0-1 loss) for the following methods:
- KNN-1
- KNN-CV ("optimally tuned" KNN)
- Logistic regression
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
Scenario 1#
\(X_1,X_2\) normal with identical variance.
No correlation in either class.
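A minimal sketch of how such a scenario might be simulated and compared (the sample sizes and mean shift are made up):
library(MASS)
set.seed(1)
sim = function(n) {   # two bivariate normal classes, equal covariance
  y = factor(rep(0:1, each = n/2))
  X = matrix(rnorm(n * 2), n, 2) + (as.integer(y) - 1)   # class 1 shifted by (1, 1)
  data.frame(X1 = X[,1], X2 = X[,2], y = y)
}
train = sim(100); test = sim(5000)
lda.pred = predict(lda(y ~ X1 + X2, data=train), test)$class
glm.prob = predict(glm(y ~ X1 + X2, data=train, family=binomial),
                   test, type="response")
glm.pred = ifelse(glm.prob > 0.5, 1, 0)
c(LDA = mean(lda.pred != test$y), logistic = mean(glm.pred != test$y))  # test 0-1 loss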
Scenario 2#
\(X_1,X_2\) normal with identical variance.
Correlation is -0.5 in both classes.
Scenario 3#
\(X_1,X_2\) follow a Student’s \(t\) distribution.
No correlation in either class.
Results for first 3 scenarios#
Scenario 4#
\(X_1, X_2\) normal with identical variance.
First class has correlation 0.5, second class has correlation -0.5.
Scenario 5#
\(X_1, X_2\) normal with identical variance.
Response \(Y\) was sampled from:
\[P(Y=1 \mid X) = \frac{e^{\beta_0+\beta_1 X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}{1+e^{\beta_0+\beta_1X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}.\]
The true decision boundary is quadratic, but this is not a QDA model. (Why?)
Scenario 6#
\(X_1, X_2\) normal with identical variance.
Response \(Y\) was sampled from:
\[P(Y=1 \mid X) = \frac{e^{f_\text{nonlinear}(X_1,X_2)}}{1+e^{f_\text{nonlinear}(X_1,X_2)}}.\]
The true decision boundary is very rough.
Results for scenarios 4-6#