CME/STATS 195

Lecture 8: Unsupervised Learning

Evan Rosenman

April 30, 2019

Contents

  • Course wrap-up

  • What is unsupervised learning?

  • Dimensionality reduction with PCA

  • Cluster Analysis:
    • k-means Clustering
    • Hierarchical Clustering

Course wrap-up

Our journey

How to learn more

Where to find out more about the topics of this class:

How to learn more advanced topics on R:

Unsupervised Learning

Unsupervised Learning

  • There is only \(X\) and no \(Y\), i.e. there are no special variables such as response or output variables, and no prespecified labels for the observations.

  • Deals with the task of inferring latent (hidden) patterns and structure from unlabeled data.

  • The goal is to understand the relationships between features or among observations.

  • Unsupervised learning encompasses:
    • dimensionality reduction, manifold learning
      e.g. PCA, MDS, Isomap, Diffusion Map, t-SNE, Autoencoder
    • clustering e.g. k-means, hierarchical clustering, mixture models
    • anomaly detection
    • latent variable models


  • It can handle tasks such as:
    • image segmentation,
    • image clustering / automatic labeling,
    • visualization of high dimensional data e.g. gene expression,
    • finding cell subtypes.

Dimensionality Reduction

Dimensionality Reduction

  • Many modern datasets are high-dimensional (many columns)
    • e.g. genetic sequencing, medical records, user internet activity data etc.
  • Dimensionality reduction (DR) or feature extraction methods can reduce the number of variables.

Dimensionality Reduction

  • The methods can be used to:
    • compress the data
    • remove redundant features and noise
  • Counter-intuitively, training on a reduced feature set can increase the accuracy of learning methods by avoiding over-fitting and the curse of dimensionality.
  • Common methods for dimensionality reduction include: PCA, CA, ICA, MDS, Isomap, Laplacian Eigenmaps, t-SNE, and autoencoders.

Principal Component Analysis (PCA)

Source: ESL Chapter 14

Maximal Variance Projection


  • For \(X\in \mathbb{R}^{n \times p}\), \(X^{\star} = (X - \bar X)\) is a centered data matrix.
  • PCA is an eigenvalue decomposition of the sample covariance matrix:

\[C = \frac{1}{n-1} (X^{\star})^T X^{\star} = \frac{1}{n-1} V \Sigma^2 V^T\]

where \(V\) is an orthogonal matrix and \(\Sigma\) is a diagonal matrix.

Key idea: the principal axes (the loadings) are the eigenvectors of the sample covariance matrix \(C\), and the associated eigenvalues give the variances of the corresponding principal components.
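As a quick check (a minimal sketch in base R, using the USArrests data introduced later in this lecture), the output of prcomp() matches an explicit eigendecomposition of the sample covariance matrix:

Xs  <- scale(USArrests)            # center and scale the data
eig <- eigen(cov(Xs))              # eigendecomposition of the sample covariance matrix
pca <- prcomp(USArrests, scale = TRUE)
eig$vectors                        # columns match pca$rotation, up to sign
sqrt(eig$values)                   # match pca$sdev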

Dimensionality reduction with PCA

  • PCA finds a set of \(p\) uncorrelated directions (components) that are linear combinations of the original \(p\) variables.
  • Each component sequentially explains as much of the remaining variation in the data as possible.
  • Reduction occurs when only the top \(k < p\) components are kept and used to represent the original \(p\)-dimensional data (see the sketch below).
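A minimal sketch of this reduction (again using the built-in USArrests data, introduced on the next slide), keeping the top \(k = 2\) components:

pca <- prcomp(USArrests, scale = TRUE)
k <- 2
scores_k <- pca$x[, 1:k]                         # n x k matrix of component scores
# equivalently, project the scaled data onto the first k principal axes:
head(scale(USArrests) %*% pca$rotation[, 1:k])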

The US crime rates dataset

The built-in USArrests dataset includes information on violent crime rates in the US in 1973.

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Mean and standard deviation of the crime rates across all states:

apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, sd)
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

PCA in R

  • In R, the function prcomp() can be used to perform PCA.
  • Like many common tasks in R, PCA has several implementations (e.g. princomp()), but prcomp() has some speed advantages.
pca.res <- prcomp(USArrests, scale = TRUE)
  • The output of prcomp() is a list containing:
names(pca.res)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

  • Think of PCA as switching to a new coordinate system. We need both:
    • the coordinates of every data point in the new coordinate system
    • a description of the new coordinate system in terms of the original coordinate system

The elements of prcomp output are:

  • The principal component scores matrix, containing the projected sample coordinates.
head(pca.res$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164

These are the sample coordinates in the PCA projection space.

  • The loadings matrix, which contains the eigenvectors themselves (these directions define the new coordinate system)

The loadings or principal axes give the weights of the variables in each of the principal components.

pca.res$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432



Key idea: the components often have intuitive explanations!

  • PC1 places similar weights on the Assault, Murder, and Rape variables, and much smaller weight on UrbanPop; PC1 is roughly an overall measure of crime.
  • The second loading puts most of its weight on UrbanPop, so PC2 measures the level of urbanization.
  • The crime-related variables are correlated with each other, and therefore appear close to each other on the biplot.
  • UrbanPop is largely uncorrelated with the crime rates, and so it lies further away on the plot.

  • The standard deviations of the principal components (the square roots of the eigenvalues of the sample covariance matrix \(C\))
pca.res$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
  • The means of the features, used for centering (shifting) the data:
pca.res$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
  • The standard deviations of the features, used for scaling:
pca.res$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
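Putting these pieces together (a brief sketch): the scores, rotation, center, and scale are enough to reconstruct the original data exactly.

X_rec <- pca.res$x %*% t(pca.res$rotation)       # back to centered and scaled coordinates
X_rec <- sweep(X_rec, 2, pca.res$scale, "*")     # undo the scaling
X_rec <- sweep(X_rec, 2, pca.res$center, "+")    # undo the centering
all.equal(X_rec, as.matrix(USArrests), check.attributes = FALSE)  # TRUE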

Scree plot

  • A scree plot can be used to choose how many components to retain.

  • Look for “elbows” in the scree plots

  • Discard the components whose eigenvalues (equivalently, whose proportion of variance explained) drop off sharply.

# PCA eigenvalues/variances:
round((pr.var <- pca.res$sdev^2), 4)
## [1] 2.4802 0.9898 0.3566 0.1734
plot(pca.res)

Percent of variance explained

We can also look at the cumulative percentage of variance explained:

pct.var.explained <- 100*cumsum(pca.res$sdev^2)/sum(pca.res$sdev^2)
names(pct.var.explained) <- seq_along(pct.var.explained)
barplot(pct.var.explained, xlab = "# of Components", 
        ylab = "% of Variance Explained")

Samples Plot

Each principal component loading vector and score vector is unique only up to a sign flip, so another software package could return a sign-flipped version of this plot instead:

library(factoextra)  # provides the fviz_* plotting helpers used below
fviz_pca_ind(pca.res) + coord_fixed() + theme(text = element_text(size = 20))

Features Plot

fviz_pca_var(pca.res) + coord_fixed() +
    theme(text = element_text(size = 20))

fviz_contrib(pca.res, choice = "var", axes = 1) +
    theme(text = element_text(size = 20))

Exercise


Recall the mtcars dataset we worked with before, which comprises fuel consumption and other aspects of design and performance for 32 cars from 1974. The dataset has 11 variables (dimensions).

  1. Use prcomp() to compute a PCA for mtcars. Remember to set the scale parameter, as the variables are in different units and have different ranges.

  2. Generate a scree plot and note how many dimensions you should retain.

  3. Compute the percentage of variance explained by each of the principal components.

Cluster Analysis

Cluster Analysis

  • Clustering is an exploratory technique which can discover hidden groupings in the data

  • Groupings are determined from the data itself, without any prior knowledge about labels or classes.

  • There are many clustering methods available, and many of them have an R implementation on CRAN.

  • To cluster the data we need a measure of similarity or dissimilarity between a pair of observations, e.g. a Euclidean distance.

k-means

  • k-means is a simple and fast iterative method for clustering data into \(k\) distinct non-overlapping groups.

  • The algorithm minimizes the variation within each cluster.

  • Algorithm (a minimal R sketch follows this outline):
    • Choose \(k\) points at random to serve as centroids
    • Repeat until no change
      • Assign each point to the cluster defined by the centroid to which it is closest
      • Once each point is assigned, compute a new centroid for each cluster by taking the mean across points in the cluster
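A minimal, illustrative sketch of these steps in R (this is not the algorithm used internally by kmeans(), and the function name simple_kmeans is hypothetical):

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # k random points as centroids
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance of every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_assignment <- max.col(-d)                       # index of the nearest centroid
    if (identical(new_assignment, assignment)) break    # stop when nothing changes
    assignment <- new_assignment
    # recompute each centroid as the mean of its assigned points
    # (the empty-cluster edge case is ignored for simplicity)
    for (j in seq_len(k)) centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}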

Source: link

k-means drawbacks

  • The number of clusters \(k\) must be prespecified (before clustering).

  • The method is stochastic, and involves random initialization of cluster centers.

  • This means that each time the algorithm is run, the results obtained can be different (a common mitigation is sketched below).
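A common mitigation (a sketch using the built-in kmeans()) is the nstart argument, which runs several random initializations and keeps the solution with the smallest total within-cluster sum of squares:

set.seed(1)
km <- kmeans(scale(USArrests), centers = 4, nstart = 20)  # best of 20 random starts
km$tot.withinss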

The number of clusters, \(k\), should be chosen using statistics such as the following (see the sketch after this list):

  • Gap Statistic link

  • Silhouette statistic link

  • Calinski-Harabasz index link
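For example, a sketch using the factoextra package (also used for the PCA plots above), whose fviz_nbclust() function computes the average silhouette width and the gap statistic:

library(factoextra)
# average silhouette width over a range of k; pick the k that maximizes it
fviz_nbclust(scale(USArrests), kmeans, method = "silhouette")
# gap statistic; pick the k suggested by the gap criterion
fviz_nbclust(scale(USArrests), kmeans, method = "gap_stat")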

Image segmentation

  • One application of k-means clustering is image segmentation.

  • Here we use a picture of a field of tulips in the Netherlands downloaded from here.

Importing image to R

  • First, we download the image:
library(jpeg)
url <- "http://www.infohostels.com/immagini/news/2179.jpg"
dFile <- download.file(url, "./Lecture8-figure/Image.jpg")
img <- readJPEG("./Lecture8-figure/Image.jpg") 
(imgDm <- dim(img))
## [1] 480 960   3
  • The image is a 3D array, so we convert it to a data frame.

  • Each row of the data frame should correspond to a single pixel.

  • The columns should include the pixel location (x and y) and the pixel intensity in red, green, and blue (R, G, B).

# Assign RGB channels to data frame
imgRGB <- data.frame(
  x = rep(1:imgDm[2], each = imgDm[1]),
  y = rep(imgDm[1]:1, imgDm[2]),
  R = as.vector(img[,,1]),
  G = as.vector(img[,,2]),
  B = as.vector(img[,,3])
)

k-means in R

  • Each pixel is a datapoint in 3D specifying the intensity in each of the three “R”, “G”, “B” channels, which determines the pixel’s color.
head(imgRGB, 3)
##   x   y R         G         B
## 1 1 480 0 0.3686275 0.6980392
## 2 1 479 0 0.3686275 0.6980392
## 3 1 478 0 0.3725490 0.7019608
  • We use k-means to cluster the pixels into \(k\) color groups (clusters).
  • k-means can be performed in R with the built-in kmeans() function.
# Set seed since k-means involves a random initialization
set.seed(43658)
k <- 2
kmeans.2clust <- kmeans(imgRGB[, c("R", "G", "B")], centers = k)
names(kmeans.2clust)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

# k cluster centers
kmeans.2clust$centers
##           R         G         B
## 1 0.5682233 0.3251528 0.1452832
## 2 0.6597320 0.6828609 0.7591578
# The centers correspond to the following colors:
rgb(kmeans.2clust$centers)
## [1] "#915325" "#A8AEC2"
# Cluster assignment of the first 10 pixels
head(kmeans.2clust$cluster, 10)
##  [1] 2 2 2 2 2 2 2 2 2 2
# Convert cluster assignment labels to cluster colors
kmeans.2colors <- rgb(kmeans.2clust$centers[kmeans.2clust$cluster, ])
head(kmeans.2colors, 10)
##  [1] "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2"
##  [8] "#A8AEC2" "#A8AEC2" "#A8AEC2"

ggplot(data = imgRGB, aes(x = x, y = y)) + 
  geom_point(colour = kmeans.2colors) +
  labs(title = paste("k-Means Clustering with", k, "clusters (colors)")) +
  xlab("x") + ylab("y") + theme_bw()

Now add more colors by increasing the number of clusters to 6:

set.seed(348675)
kmeans.6clust <- kmeans(imgRGB[, c("R", "G", "B")], centers = 6)
kmeans.6colors <- rgb(kmeans.6clust$centers[kmeans.6clust$cluster, ])
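The segmented image can then be plotted exactly as before (mirroring the 2-cluster plot above), now with six colors:

ggplot(data = imgRGB, aes(x = x, y = y)) + 
  geom_point(colour = kmeans.6colors) +
  labs(title = "k-Means Clustering with 6 clusters (colors)") +
  xlab("x") + ylab("y") + theme_bw()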

Hierarchical clustering

Alexander Calder’s mobile

  • If it is difficult to choose the number of clusters ahead of time (or you simply don’t want to), you can use hierarchical clustering.

  • Hierarchical clustering can be performed using agglomerative (bottom-up) or divisive (top-down) approach.

  • The method requires a choice of a pairwise distance metric and a rule of how to merge or divide clusters.

  • The output of the method can be represented graphically as a tree, called a dendrogram.

  • The tree allows you to evaluate where the cutoff for grouping should occur.

Hierarchical clustering algorithm

Source: ISL

Results for hierarchical clustering differ depending on the choice of:

  • A distance metric used for pairs of observations, e.g. Euclidean (L2), Manhattan (L1), Jaccard (Binary), etc

  • The rule used for merging clusters that have already been formed, e.g. single (minimum), complete (maximum), centroid, or average linkage (illustrated in the sketch below).


Different ways to compute dissimilarity between 2 clusters:
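For example (a sketch on the scaled USArrests data), the same distance matrix combined with different linkage rules can produce noticeably different dendrograms:

d <- dist(scale(USArrests))                    # Euclidean distances on scaled data
hc_complete <- hclust(d, method = "complete")  # complete (maximum) linkage
hc_single   <- hclust(d, method = "single")    # single (minimum) linkage
hc_average  <- hclust(d, method = "average")   # average linkage
par(mfrow = c(1, 3))
plot(hc_complete); plot(hc_single); plot(hc_average)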

Iris dataset

  • We will use Fisher’s Iris dataset, containing measurements on 150 irises.
  • Hierarchical clustering will group the flowers into clusters; we will see that these groups roughly correspond to the flower species.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Hierarchical clustering in R

  • Built-in function hclust() performs hierarchical clustering.
  • We will use only the petal dimensions (2 columns) to compute the distances between flowers.
# We use the Euclidean distance for the dissimilarities between flowers
distMat <- dist(iris[, 3:4])


# We use the "complete" linkage method for computing the cluster distances.
clusters <- hclust(distMat, method = "complete")

plot(clusters, cex = 0.7)

The dendrogram suggests that a reasonable choice of the number of clusters is either 3 or 4.

plot(clusters, cex = 0.7)
abline(h = 2, col = "blue")  # candidate cut heights
abline(h = 3, col = "blue")

  • We pick 3 clusters.
  • To get the assignments for 3 clusters, we cut the tree with the cutree() function.
(clusterCut <- cutree(clusters, 3))
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 2 2 3 2 3 3 3 3 2 3 3 2 3 2 3
##  [71] 2 3 2 2 3 3 2 2 2 3 3 3 3 2 2 2 2 3 3 3 3 2 3 3 3 3 3 3 3 3 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
table(clusterCut, iris$Species)
##           
## clusterCut setosa versicolor virginica
##          1     50          0         0
##          2      0         21        50
##          3      0         29         0

plot(clusters, labels = clusterCut, cex = 0.9)
rect.hclust(clusters, k = 3, border=c("red", "blue", "green"))

  • From the table above we see that setosa and virginica were correctly assigned to separate groups (clusters 1 and 2, respectively).
  • However, the method had difficulty separating the versicolor flowers into their own cluster: they are split between clusters 2 and 3.

  • 2D plot of the iris dataset using petal dimensions as coordinates.
  • The cluster assignments partition the flowers into species with high accuracy.
ggplot(iris, aes(Petal.Length, Petal.Width)) + theme_bw() +
  geom_text(aes(label = clusterCut), vjust = -1) + 
  geom_point(aes(color = Species)) + coord_fixed(1.5)

Exercise

We will generate synthetic data to use for k-means clustering.

set.seed(489576)
N <- 1000
C1 <- data.frame(cluster = "C1", x = rnorm(n = N, mean = 1), y = rnorm(n = N, mean = 1))
C2 <- data.frame(cluster = "C2", x = rnorm(n = N, mean = -2), y = rnorm(n = N, mean = -5))
C3 <- data.frame(cluster = "C3", x = rnorm(n = N, mean = 5), y = rnorm(n = N, mean = 1))
DF <- rbind(C1, C2, C3)
ggplot(DF, aes(x, y, color = cluster)) + geom_point()

  1. Apply k-means with k = 3 (as you know the true number of clusters). Print the cluster centers.

  2. Print a confusion matrix (table) comparing the k-means cluster assignments with the true cluster labels.

  3. Generate a scatter plot of points, now colored by the cluster assignment.

  4. Now pretend that you don’t know the real number of clusters. Use k = 4 and re-run k-means. Plot the results and see what happens.