Discriminant analysis

Discriminant analysis#

Download#

Classification / Discriminant Analysis#

Setup for *DA#

Given $K$ classes in $\mathbb{R}^p$, represented as densities $f_i(x), 1 \leq i \leq K$ we want to classify $x \in \mathbb{R}^p$.
In other words, partition $\mathbb{R}^p$ (or other sample space) into subsets $\Pi_i, 1 \leq i \leq k$ based on the densities $f_i(x)$.
We’ll focus on classical methods: LDA and generalizations.
LDA is a misnomer: it is a generative classifier rather than discriminative classifier.

Why is this multivariate analysis?#

Approximations to LDA (e.g. Fisher’s linear discriminant) are connected to CCA and reduced rank regression.

Maximum likelihood rule#

Classify $x$ to class for which density / likelihood is highest

\[ x \in \Pi_i \iff i = \text{argmax}_{j} f_j(x)\]

Bayes rule#

If prior class probabilities $(\pi_1, \dots, \pi_K)$ are available, a more sensible rule is

\[ x \in \Pi_i \iff i= \text{argmax}_{j} \pi_j f_j(x).\]

Optimality of Bayes rule#

Of all discriminant rules $\delta: \mathbb{R}^p \to {\cal S}_K$, the simplex in $\mathbb{R}^k$ (i.e. rules which assign points $x \in \mathbb{R}^p$ a probability on $\{1,\dots,K\}$) none has higher probability of correct assignment than the Bayes rule.

A simple discriminant function#

Suppose the sample space is all $p$-tuples of integers that sum to $n$.
Two classes $f_1 = \text{Multinom}(n, \alpha)$, $f_2 = \text{Multinom}(n, \beta)$.
ML rule boils down to

\[ x \in \Pi_1 \iff \sum_{i=1}^p x_i \log \frac{\alpha_i}{\beta_i} > 0 \]

The function

\[ h_{12}(x) = \sum_{i=1}^p x_i \log \frac{\alpha_i}{\beta_i} \]

is called a discriminant function between classes 1 & 2 (though it is derived from a generative model…)

Discriminant functions#

Set

\[ h_{ij}(x) = \log \frac{f_i(x)}{f_j(x)}.\]

ML rule can be summarized as

\[ x \in \Pi_i \iff h_{ij}(x) > 0 \ \forall j\]

Bayes rule can be summarized as

\[ x \in \Pi_i \iff h_{ij}(x) + \log \frac{\pi_i}{\pi_j} > 0 \ \forall j \]

Example: Gaussian in $\mathbb{R}$#

Let $f_1 = N(\mu_1, \sigma^2_1)$, $f_2 = N(\mu_2, \sigma^2_2)$.
Discriminant function for Bayes rule:

\[\begin{split} \begin{aligned} h_{12}(x) &= - \frac{x^2}{2} \left(\frac{1}{\sigma^2_1} - \frac{1}{\sigma^2_2} \right) - x \left( \frac{\mu_1}{\sigma^2_1} - \frac{\mu_2}{\sigma^2_2}\right) + \frac{1}{2} \left(\frac{\mu_2^2}{\sigma^2_2} - \frac{\mu_1^2}{\sigma_1^2} \right) + \log \frac{\sigma_2}{\sigma_1} \\ & \qquad + \log \frac{\pi_1}{\pi_2} \end{aligned}\end{split}\]

Note that $h_{12}$ is quadratic in $x$, unless $\sigma_1=\sigma_2$.
LDA (Linear Discriminant Analysis): $\sigma_1=\sigma_2$
QDA (Quadratic Discriminant Analysis): $\sigma_1 \neq \sigma_2$.

Example: Gaussian in $\mathbb{R}^p$#

In general, ML rule classifies $x$ by minimizing Mahalanobis distance (after adjusting for $\Sigma_i$)

\[ x \in \Pi_i \iff i = \text{argmin}_{j}\Delta_{\Sigma_j}(x, \mu_j) + \log \det(\Sigma_j),\]

If $\Sigma_i=\Sigma$ for all $i$, the ML (LDA) rule classifies by minimizing Mahalanobis distance.
Bayesian rule (with $\Sigma_i=\Sigma$) classifies by

\[ x \in \Pi_i \iff i = \text{argmin}_{j}\Delta_{\Sigma}(x, \mu_j) - 2 \log \pi_j \]

so classes with smaller prior probabilities get penalized.

Sample ML and Bayesian rules#

For each class, estimate $(\widehat{\mu}_i, \widehat{\Sigma}_i, \widehat{\pi}_i)$ with $\widehat{\pi}_i = n_i/n$.

QDA (sample version)#

Classify according to

\[ x \in \widehat{\Pi}_i \iff i = \text{argmin}_{j} \Delta_{\widehat{\Sigma}_j}(x, \widehat{\mu}_j) + \log \det(\widehat{\Sigma}_j) - 2 \log \widehat{\pi}_j.\]

LDA (sample version)#

First, estimate a pooled covariance matrix

\[ \widehat{\Sigma}_P = \frac{1}{n - K} \sum_{i=1}^K (n_i - 1) \widehat{\Sigma}_i \]

Classify according to $$ x \in \widehat{\Pi}_i \iff i = \text{argmin}_{j} \Delta_{\widehat{\Sigma}_P}(x, \widehat{\mu}_j) - 2 \log \widehat{\pi}_j.$$

Effective dimension for LDA#

Suppose that $p > K$, and let $$ L = \mu_1 \oplus \text{span}\{\mu_i - \mu_1, 2 \leq i\leq K\} $$
It is clear that all of the action happens along affine subspace $L \subset \mathbb{R}^p$ of dimension at most $(K-1)$.
Suggests we can / should reduce dimension …

Fisher’s linear discriminant#

Given a data matrix ${X}_{n \times p}$ and labels $L \in \{1,\dots,K\}^n$ consider a linear combination $Xv$
The SSE of $Xv$ can be decomposed as

\[\begin{split} \begin{aligned} v'\widehat{\Sigma}_T v &= \frac{1}{n}\sum_{i=1}^n ((Xv)_i - \overline{Xv})^2 \\ &= \frac{1}{n} \biggl(\sum_{k=1}^K n_k (\overline{Xv}_k - \overline{Xv})^2 \\ & \qquad \quad + \frac{1}{n} \sum_{k=1}^K \sum_{j:L_i=k} ((Xv)_{j} - \overline{Xv}_k)^2 \biggr)\\ &\overset{def}{=} v'\widehat{\Sigma}_B v + v'\widehat{\Sigma}_Wv \end{aligned} \end{split}\]

Fisher’s intuition#

Directions $v$ with high between variance should be important for classification.
Candidate for a good direction

\[ \widehat{v} = \text{argmax}_{v: v'\widehat{\Sigma}_Wv = 1} v'\widehat{\Sigma}_Bv \]

i.e. maximize between groups variance subject to within group variance of 1.

Characterization of discriminants#

Yields a generalized eigenvalue problem

\[ \widehat{\Sigma}_B v = \lambda \widehat{\Sigma}_W v \]

Can construct up to $K-1$ different directions (subject $\widehat{v}_i'\widehat{\Sigma}_W \widehat{v}_j = \delta_{ij}$).
Define the Fisher discriminant scores

\[ V_j = X \widehat{v}_j, 1 \leq j \leq K-1 \]

Stack as columns to form a new data matrix ${V}_{n \times (K-1)}$.

First two sample discriminants for `iris`#

Classification with derived features#

The pooled covariance matrix of the $V_i$’s will be $I$ so LDA is just classifying to the nearest centroid (by Euclidean distance)

\[ \bar{V}_i = \text{mean of ${V}$ in class $i$}. \]

Centroids form a (simple) Voronoi diagram on $\mathbb{R}^{K-1}$

Source: Wikipedia

Fisher’s linear discriminant & CCA#

Consider the one-hot matrix

\[ Y_{il} = 1_{\{L_l=i\}}, 1 \leq l \leq n, 1 \leq i \leq K-1 \]

We can put the data matrices ${Y}, {X}$ through CCA yielding $K-1$ pairs $(\widehat{a}_i, \widehat{b}_i), 1 \leq i \leq K-1$ of canonical directions.
It turns out that $\widehat{b}_i = \widehat{v}_i$ (up to a scalar multiple)…

Eigenproblem for CCA#

\[ \lambda \hat{\Sigma}_Tv = \hat{\Sigma}_B v \iff \lambda (\hat{\Sigma}_B + \hat{\Sigma}_W)v = \hat{\Sigma}_B v \]

Compare to eigenproblem for Fisher’s discriminant#

\[ \lambda \hat{\Sigma}_Wv = \hat{\Sigma}_B v \]

Dimension reduction#

Suppose some of the $\mu_i$’s are collinear $$ \text{dim}(L) < K-1 $$
Then, some of the Fisher scores will have little information.
We can discard some of the scores and then classify according to the nearest centroid of reduced space.
Really, to classify, we just need good centroids and some covariance for use in Mahalanobis distance.

Glass identification dataset #

Cross-validation to pick `n_components`?#

sklearn ignores n_components in fit and predict… see documentation

Solution / hack: use `LDA` as a transform#

Generalizations#

Usual generalizations apply:
1. create new features
2. regularize covariance estimates
3. impose structure on discriminant functions

Nonlinear decision boundaries#

If we transform $X \in \mathbb{R}^p$ to $f(X) \in \mathbb{R}^q$, LDA on $f(X)$ will produce boundaries that are linear in the components of $f(X)$.
For example, suppose $p=2$, take $f(x)=(x_1, x_2, x_1^2, x_2^2, x_1x_2)$.
This will produce discriminant functions $$ f_{ij}(x) = a_{ij,1} x_1 + a_{ij,2} x_2 + a_{ij,3} x_1^2 + a_{ij,4} x_2^2 + a_{ij,5} x_1 x_1 + c_{ij} $$
Cheap way to get quadratic boundaries, not exactly QDA.

Using quadratic functions for LDA#

library(mda)
data(iris)
data(glass)
glass.quad = mda::fda(Type ~ ., method=polyreg, degree=2, data=glass)
plot(glass.quad)
confusion(glass.quad)

Loading required package: class

Loaded mda 0.5-5

         true
predicted  1  2  3  5  6  7
        1 53 15  3  0  0  0
        2 14 60  2  0  0  0
        3  3  1 12  0  0  0
        5  0  0  0 13  0  0
        6  0  0  0  0  9  0
        7  0  0  0  0  0 29

../_images/36339b870dedd24e5b6d2a6b11a4bf3914e78aede0e81fee60c1817d687f159c.png

Using `sklearn`#

Do quadratic features help?#

More general expansions#

Why limit ourselves to quadratic?
We could take a large basis $f(x)=(f_1(x), \dots, f_m(x))$ and perform LDA on $f({X})$ with labels $L$.
If we take all $K-1$ fisher scores, the number of coefficients we need to estimate is $(K-1) * m$ – this grows quickly.
We may also have to invert large matrices… regularize?

Does covariance regularization help?#

How about regularization & quadratic features?#

Mixture discriminant analysis#

LDA / Fisher’s method presumes that a single centroid (and common spread) is sufficient to describe a class.
Within class densities are $N(\mu_j,\Sigma)$…
A richer model could use a mixture for within class density:

\[ F_j = \sum_{l=1}^{n_j} \pi_{jl} N(\mu_{jl}, \Sigma_{jl}) \]

Or use common $\Sigma_{jl}=\Sigma$…
Can run EM within each class to fit a mixture. Given the mixture, the Bayes or ML rule is easy…
ESL describes this as a compromise between LDA and nearest-neighbor classifier.

3 components per class#

#| echo: true
glass.lda = mda::fda(Type ~ ., data=glass)
glass.mda = mda::mda(Type ~ ., data=glass, subclasses=3)
glass.mda
accuracy = function(C_) { sum(diag(C_))/sum(C_) }
accuracy(confusion(glass.mda))
accuracy(confusion(glass.lda))

Call:
mda::mda(formula = Type ~ ., data = glass, subclasses = 3)

Dimension: 9 

Percent Between-Group Variance Explained:
    v1     v2     v3     v4     v5     v6     v7     v8     v9 
 72.43  84.70  92.40  95.99  97.80  98.88  99.43  99.74 100.00 

Degrees of Freedom (per dimension): 10 

Training Misclassification Error: 0.25234 ( N = 214 )

Deviance: 338.763 

0.747663551401869

0.672897196261682

Combining dimension reduction & mixtures#

First, project to lower dimension and then fit mixture model

#| echo: true
glass.mda4 = mda(Type ~ ., data=glass, subclasses=3, dimension=4)
glass.mda4
accuracy(confusion(glass.mda4))

Call:
mda(formula = Type ~ ., data = glass, subclasses = 3, dimension = 4)

Dimension: 4 

Percent Between-Group Variance Explained:
   v1    v2    v3    v4 
78.76 88.59 94.82 97.59 

Degrees of Freedom (per dimension): 10 

Training Misclassification Error: 0.25701 ( N = 214 )

Deviance: 335.744 

0.742990654205608

Does dimension reduction have any effect for MDA?#

#| echo: true
set.seed(1)
samp = sample(1:214, 100, replace=FALSE)
glass.test = glass[samp,]
glass.train = glass[-samp,]
glass.mda = mda(Type ~ ., data=glass.train, subclasses=3)
glass.mda4 = mda(Type ~ ., data=glass.train, subclasses=3, dimension=4)
accuracy(confusion(glass.mda, glass.test))
accuracy(confusion(glass.mda4, glass.test))

0.62

0.55

Naive Bayes#

As in CCA, we must implicitly invert $\hat{\Sigma}_W$ to compute Fisher’s score (or to fit LDA).
We can simplify the model by assuming that $\Sigma_j$ is diagonal in QDA – this leads to Naive Bayes

\[ f(X|Y=l) = \prod_{j=1}^p f_{j}(X_j|Y=l) \]

For continuous features, default is

\[ f_j(x|Y=l) = N(\mu_j, \sigma^2_j) \]

Naive Bayes as covariance regularization#

Assuming a diagonal covariance is form of regularization on the covariance…
ESL also defines Diagonal LDA – use common diagonal covariance.

Other regularized covariance estimates#

\[ \widehat{\Sigma}_j(\gamma) = \gamma \widehat{\Sigma}_j + (1 - \gamma) \widehat{\Sigma}_{\text{pooled}} \]

or

\[ \widehat{\Sigma}_j(\gamma) = \gamma \widehat{\Sigma}_j + (1 - \gamma) \widehat{\sigma}^2 I \]

Choice of tuning parameters#

Since we are classifying, we can use cross-validation…

Shrunken centroids#

One can also regularize the centroid estimates.
Neareast Shrunken Centroids combines diagonal LDA with regularization.
For each feature $l$, soft-threshold $\bar{X}_{j,l} - \bar{X}_l$ (or more precisely the $Z$-scores).
Forces many features to share common centroid value $\hat{\mu}_{j,l}$ – means they effectively do not contribute to classification rule

\[ x \mapsto \text{argmin}_l \frac{1}{2} \sum_{j=1}^p \frac{ \left(x_j - \hat{\mu}^{\lambda}_{j,l}\right)^2 }{ \widehat{\sigma}^2_j} - \log \pi_l \]

Nearest shrunken centroids#

(Source: ESL)