Unsupervised Learning

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Principal Components Analysis

Some facts

  1. Perhaps the most popular unsupervised learning procedure.

  2. Invented by Karl Pearson (1901).

  3. Developed by Harold Hotelling (1933). \(\leftarrow\) Stanford pride!

  4. What does it do? It provides a way to visualize high-dimensional data by summarizing the most important information.

What is PCA good for?

[Fig 10.1: all-pairs scatterplot of the `USArrests` data, `pairs(USArrests)`, alongside the biplot of the first two principal components.]

What is the first principal component?

It is the line which passes closest to the cloud of samples, in terms of average squared Euclidean distance.

What does this look like with 3 variables?

The first two principal components span a plane which is closest to the data.

[Fig 10.2a, Fig 10.2b]

A second interpretation

The projection onto the first principal component is the one with the highest variance.

How do we say this in math?

\[ \begin{aligned} &\max_{\phi_{11},\dots,\phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p x_{ij}\phi_{j1} \right)^2 \right\} \\ &\text{subject to } \sum_{j=1}^p \phi_{j1}^2 =1. \end{aligned} \]

The \(i\)th score of the first principal component is \(z_{i1} = \sum_{j=1}^p x_{ij} \phi_{j1}\); stacking these scores over the \(n\) samples gives the vector \(X\phi_1\).

\[ Z_1 = X\hat{\phi}_1 \]

Second principal component

\[ \begin{aligned} &\max_{\phi_{12},\dots,\phi_{p2}} \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j2}x_{ij} \right)^2 \right\} \\ &\text{subject to } \sum_{j=1}^p \phi_{j2}^2 =1 \;\;\text{ and }\;\; \sum_{j=1}^p \hat{\phi}_{j1}\phi_{j2} = 0. \end{aligned} \]

\[ Z_2 = X\hat{\phi}_2 \]

Solving the optimization

This optimization problem is fundamental in linear algebra. It is solved by either:

  1. The singular value decomposition (SVD) of (the centered) \(\mathbf{X}\): \[\mathbf{X} = \mathbf{U\Sigma\Phi}^T,\] where the \(i\)th column of \(\mathbf{\Phi}\) is the \(i\)th principal component loading vector \(\hat{\phi}_i\), and the \(i\)th column of \(\mathbf{U\Sigma}\) is the \(i\)th vector of scores \((z_{1i},\dots,z_{ni})\). Here \(\mathbf{U}^T\mathbf{U}=\mathbf{I}\) and \(\mathbf{\Phi}^T\mathbf{\Phi}=\mathbf{I}_{p \times p}\).

  2. The eigendecomposition of \(\mathbf{X}^T\mathbf{X}\):

\[\mathbf{X}^T\mathbf{X} = \mathbf{\Phi\Sigma}^2\mathbf{\Phi}^T.\]
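A minimal numpy sketch of the SVD route (synthetic data and variable names are ours, not from the slides): it centers \(X\), takes the SVD, and checks that the scores \(Z = X\Phi\) coincide with \(U\Sigma\) and that the loadings are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # synthetic data: 100 samples, 3 variables
X = X - X.mean(axis=0)          # PCA works on centered columns

# thin SVD: X = U @ diag(s) @ Phi.T; the columns of Phi are the loadings
U, s, PhiT = np.linalg.svd(X, full_matrices=False)
Phi = PhiT.T

Z = X @ Phi                     # score vectors: column m is Z_m = X @ phi_hat_m

# the scores coincide with U @ diag(s), and the loadings are orthonormal
assert np.allclose(Z, U * s)
assert np.allclose(Phi.T @ Phi, np.eye(3))
```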

Biplot

Scaling the variables

Example: scaled vs. unscaled PCA
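The effect of scaling can be mimicked with synthetic data standing in for a data set like USArrests, where one variable (a raw count) lives on a much larger scale than another (a rate). This is our illustration, not the slide's figure: without scaling, the high-variance column dominates the first loading vector.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic stand-in: column 0 is small-scale, column 1 is large-scale
X = np.column_stack([rng.normal(0.0, 1.0, 200),
                     rng.normal(0.0, 100.0, 200)])

def first_loading(M):
    """First principal component loading vector of a data matrix."""
    M = M - M.mean(axis=0)
    _, _, Vh = np.linalg.svd(M, full_matrices=False)
    return Vh[0]

unscaled = first_loading(X)                  # dominated by the large-scale column
scaled = first_loading(X / X.std(axis=0))    # each column on equal footing
```

With the raw data, `unscaled` is essentially the unit vector pointing along the large-scale column; after dividing each column by its standard deviation, both variables can contribute to the first component.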

How many principal components are enough?

The proportion of variance explained

The total variance of the scores equals the total variance of the (centered) variables:

\[\sum_{m=1}^p \frac{1}{n}\sum_{i=1}^n z_{im}^2 = \sum_{k=1}^p \text{Var}(x_{k}).\]

Scree plot

The variance explained by the \(m\)th principal component is

\[ \frac{1}{n}Z_m^TZ_m = \frac{1}{n}\sum_{i=1}^n z_{im}^2 =\frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \hat{\phi}_{jm}x_{ij}\right)^2. \]
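In terms of the SVD, the variance of the \(m\)th score vector is \(\sigma_m^2/n\), so the proportion of variance explained (PVE) can be read directly off the singular values. A short numpy sketch on synthetic data (setup assumed, not course code):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))  # synthetic data
X = X - X.mean(axis=0)                                   # center the columns

_, s, _ = np.linalg.svd(X, full_matrices=False)
pve = s**2 / (s**2).sum()     # proportion of variance explained per component

# sanity checks: PVEs decrease and sum to one, and the total variance of
# the scores equals the total variance of the centered variables
assert np.all(np.diff(pve) <= 1e-12)
assert np.isclose(pve.sum(), 1.0)
assert np.isclose((s**2).sum() / len(X), X.var(axis=0).sum())
```

Plotting `pve` (or its cumulative sum) against the component index gives the scree plot used to judge how many components are enough.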

Clustering

\(K\)-means clustering

  1. \(K\) is the number of clusters and must be fixed in advance.

  2. The goal is to make the samples within each cluster as similar as possible, i.e. to minimize the total within-cluster variation:

\[\min_{C_1,\dots,C_K} \sum_{\ell=1}^K W(C_\ell) \quad;\quad W(C_\ell) = \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}).\]

\(K\)-means clustering algorithm

\[\overline{x}_{\ell} = \frac{1}{|\hat{C}_\ell|}\sum_{i\in \hat{C}_\ell} x_{i} \quad \text{for }\ell=1,\dots,K.\]

\[ \hat{C}_{\ell} = \left\{i: \text{Distance}(x_i, \overline{x}_{\ell}) \leq \text{Distance}(x_i, \overline{x}_k) \text{ for all } 1 \leq k \leq K \right\}. \]
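The two alternating steps above (update the centroids, then reassign each sample to its nearest centroid) can be sketched as a minimal Lloyd's-algorithm implementation. The `kmeans` helper, the synthetic two-blob data, and the `n_init` default are our illustration, not code from the course:

```python
import numpy as np

def kmeans(X, K, n_init=10, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; keeps the best of n_init random starts."""
    rng = np.random.default_rng(seed)
    best_obj, best_labels = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), K, replace=False)]  # random initial centroids
        for _ in range(n_iter):
            # assignment step: each sample goes to its nearest centroid
            d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)
            labels = d2.argmin(1)
            # update step: each centroid moves to the mean of its cluster
            new = np.array([X[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
            if np.allclose(new, centers):
                break
            centers = new
        obj = d2.min(1).sum()   # within-cluster sum of squared distances
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels, best_obj

# two well-separated blobs of 50 samples each
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, obj = kmeans(X, K=2)
```

Keeping the best of several random initializations is exactly the practice recommended later in the slides, since each run only reaches a local optimum.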

Properties of \(K\)-means clustering

\[\min_{C_1,\dots,C_K} \sum_{\ell=1}^K W(C_\ell)\quad;\quad W(C_\ell) = \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}).\]

\[ \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}) = 2\sum_{i\in C_\ell} \text{Distance}^2(x_{i},\overline x_{\ell}) \]

The right-hand side can only decrease at each iteration, so the objective is non-increasing and the algorithm converges to a local optimum (not necessarily the global one).
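The identity above can be checked numerically. A small numpy sketch on synthetic data (all names ours); the left side sums squared distances over all ordered pairs within the cluster:

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.normal(size=(20, 3))          # one cluster of 20 points in 3 dimensions
xbar = C.mean(axis=0)                 # the cluster centroid

# left side: (1/|C|) times the sum of squared distances over ordered pairs
lhs = ((C[:, None, :] - C[None, :, :])**2).sum(-1).sum() / len(C)
# right side: twice the sum of squared distances to the centroid
rhs = 2 * ((C - xbar)**2).sum()
assert np.isclose(lhs, rhs)
```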

Example: \(K\)-means output with different initializations

In practice, we start from many random initializations and choose the output which minimizes the objective function.

Hierarchical clustering

Many algorithms for hierarchical clustering are agglomerative.
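Agglomerative clustering starts with every sample in its own cluster and repeatedly fuses the two closest clusters. A naive numpy sketch (synthetic two-blob data; the function and its complete/single linkage options are our illustration, not course code):

```python
import numpy as np

def agglomerative(X, linkage="complete"):
    """Naive agglomerative clustering: start with singletons and repeatedly
    fuse the two closest clusters; returns the merge history (id, id, height)."""
    clusters = {i: [i] for i in range(len(X))}
    d = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(-1))  # pairwise distances
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = (np.inf, None, None)
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ka, kb = keys[a], keys[b]
                between = d[np.ix_(clusters[ka], clusters[kb])]
                # complete linkage uses the largest pairwise distance between
                # clusters; single linkage would use the smallest
                dist = between.max() if linkage == "complete" else between.min()
                if dist < best[0]:
                    best = (dist, ka, kb)
        dist, ka, kb = best
        merges.append((ka, kb, dist))
        clusters[ka] = clusters[ka] + clusters.pop(kb)
    return merges

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])
merges = agglomerative(X)
```

The sequence of merge heights is what a dendrogram plots; with complete linkage the heights are non-decreasing, and the final (largest) fusion joins the two blobs.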

Distance between clusters

Clustering is riddled with questions and choices

  1. Is clustering appropriate? i.e. Could a sample belong to more than one cluster?
     - Mixture models, soft clustering, topic models.

  2. How many clusters are appropriate?
     - There are formal methods based on gap statistics, mixture models, etc.

  3. Are the clusters robust?
     - Run the clustering on different random subsets of the data. Is the structure preserved?
     - Try different clustering algorithms. Are the conclusions consistent?

Most important: temper your conclusions.

Clustering is riddled with questions and choices

  1. Should we scale the variables before clustering? This changes the notion of distance.
     - Variables with larger variance have a larger effect on the Euclidean distance between two samples.

  2. Does Euclidean distance capture dissimilarity between samples?

Choosing an appropriate distance

  1. Euclidean distance would cluster together all customers who purchase few things (orange and purple).

  2. Perhaps we want to cluster customers who purchase similar things (orange and teal).

  3. In this case the correlation distance may be a more appropriate measure of dissimilarity between samples.
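A small numerical illustration of the three points above, with hypothetical purchase profiles standing in for the orange, teal, and purple customers (the profiles and names are ours):

```python
import numpy as np

# hypothetical shopping profiles: quantities of 6 items bought by each customer
light = np.array([1, 2, 1, 2, 1, 2])        # buys little ("orange")
heavy = np.array([10, 20, 10, 20, 10, 20])  # same pattern, larger volume ("teal")
other = np.array([2, 1, 2, 1, 2, 1])        # small volume, opposite pattern ("purple")

def corr_dist(a, b):
    # correlation distance: 1 minus the sample correlation across the variables
    return 1 - np.corrcoef(a, b)[0, 1]

# Euclidean distance pairs the two low-volume customers...
assert np.linalg.norm(light - other) < np.linalg.norm(light - heavy)
# ...while correlation distance pairs the customers with the same pattern
assert corr_dist(light, heavy) < corr_dist(light, other)
```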