Course wrap-up

What is unsupervised learning?

Dimensionality reduction with PCA

- Cluster Analysis:
  - k-means Clustering
  - Hierarchical Clustering

Where to find out more about the topics of this class:

- R for Data Science, by Hadley Wickham: (http://r4ds.had.co.nz)
- The tidyverse: (https://www.tidyverse.org)
- RStudio: (https://www.rstudio.com/)
- R Markdown: (http://rmarkdown.rstudio.com/)
- Many online tutorials and forums (e.g. Data Carpentry and DataCamp)

How to learn more advanced topics on R:

- Take “Stat 290: Computing for Data Science”
- Read “Advanced R”, by Hadley Wickham: (http://adv-r.had.co.nz/)
- Read “R packages”, by Hadley Wickham: (http://rpkgs.had.co.nz/)

- **There is only \(X\) and no \(Y\)**, i.e. there are no special variables such as response or output variables, and no prespecified labels for the observations.
- Unsupervised learning deals with the task of **inferring latent (hidden) patterns and structures in unlabeled data**.
- The goal is to understand the **relationships between features or among observations**.

- Unsupervised learning encompasses:
  - dimensionality reduction, manifold learning (*e.g. PCA, MDS, Isomap, Diffusion Map, t-SNE, Autoencoder*)
  - clustering (*e.g. k-means, hierarchical clustering, mixture models*)
  - anomaly detection
  - latent variable models

- It can handle tasks such as:
  - image segmentation,
  - image clustering / automatic labeling,
  - visualization of high-dimensional data (*e.g. gene expression*),
  - finding cell subtypes.

- Many **modern datasets are high-dimensional** (many columns), e.g. genetic sequencing, medical records, user internet activity data, etc.

- Dimensionality reduction (DR) or feature extraction methods **can reduce the number of variables.**

- The methods can be used to:
- compress the data
- remove redundant features and noise

- Counter-intuitively, training on the reduced feature set can increase the accuracy of learning methods by avoiding over-fitting and the curse of dimensionality.
- Common methods for dimensionality reduction include: PCA, CA, ICA, MDS, Isomaps, Laplacian Eigenmaps, tSNE, Autoencoder.

Source: ESL Chapter 14

- For \(X\in \mathbb{R}^{n \times p}\), let \(X^{\star} = X - \bar X\) denote the column-centered data matrix.
- PCA is **an eigenvalue decomposition of the sample covariance matrix**:

\[C = \frac{1}{n-1} (X^{\star})^T X^{\star} = \frac{1}{n-1} V \Sigma^2 V^T\]

where \(V\) is an orthogonal matrix and \(\Sigma\) is a diagonal matrix.

*Key idea:* the principal axes (the columns of \(V\)) are the eigenvectors of the sample covariance matrix \(C\); they are also known as the **loadings**. The associated eigenvalues give the variance explained by each component.

- PCA finds **a set of \(p\) uncorrelated** directions (components) that are **linear combinations of the original \(p\) variables**.
- Each successive component explains as much as possible of the variation remaining in the data.
- Reduction occurs when **the top \(k < p\) components are kept and used to represent** the original \(p\)-dimensional data, as written out in matrix form below.
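In matrix form (a small added sketch for concreteness; \(V_k\) denotes the first \(k\) columns of \(V\), a notation not used elsewhere in these notes), the reduced representation and its rank-\(k\) reconstruction are:

\[Z_k = X^{\star} V_k \in \mathbb{R}^{n \times k}, \qquad \hat X^{\star} = Z_k V_k^{T} = X^{\star} V_k V_k^{T}.\]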

The built-in `USArrests` dataset includes information on violent crime rates (arrests per 100,000 residents) in the US in 1973.

`head(USArrests)`

```
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
```

Mean and standard deviation of the crime rates across all states:

`apply(USArrests, 2, mean)`

```
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
```

`apply(USArrests, 2, sd)`

```
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
```

- In R, the function `prcomp()` can be used to perform PCA.
- As with many common tasks, there are several other implementations (e.g. `princomp()`), but `prcomp()` has some speed advantages.

`pca.res <- prcomp(USArrests, scale = TRUE)`

- The output of `prcomp()` is a list containing:

`names(pca.res)`

`## [1] "sdev" "rotation" "center" "scale" "x"`

- Think of PCA as switching to a new coordinate system. We need both:
  - the coordinates for every data point in the new coordinate system
  - a description of the new coordinate system in terms of the original coordinate system

The elements of the `prcomp()` output are:

- `x`: the principal components (scores) matrix, with the projected sample coordinates.

`head(pca.res$x)`

```
## PC1 PC2 PC3 PC4
## Alabama -0.9756604 1.1220012 -0.43980366 0.154696581
## Alaska -1.9305379 1.0624269 2.01950027 -0.434175454
## Arizona -1.7454429 -0.7384595 0.05423025 -0.826264240
## Arkansas 0.1399989 1.1085423 0.11342217 -0.180973554
## California -2.4986128 -1.5274267 0.59254100 -0.338559240
## Colorado -1.4993407 -0.9776297 1.08400162 0.001450164
```

These are the sample coordinates in the PCA projection space.
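As a quick sanity check (a small sketch, not part of the original notes), the scores are simply the centered and scaled data projected onto the loadings:

```
# Scores = (centered, scaled) data multiplied by the rotation (loadings) matrix
X.std <- scale(USArrests, center = pca.res$center, scale = pca.res$scale)
all.equal(unname(X.std %*% pca.res$rotation), unname(pca.res$x))  # should be TRUE
```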

- The **loadings** matrix (`rotation`), which contains the eigenvectors themselves; these directions define the new coordinate system. The loadings, or principal axes, give the weights of the variables in each of the principal components.

`pca.res$rotation`

```
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
```


**Key idea:** the components often have intuitive explanations!

- PC1 places similar weights on the `Assault`, `Murder`, and `Rape` variables, and a much smaller weight on `UrbanPop`. **PC1 is an overall measure of crime.**
- The second loading puts most of its weight on `UrbanPop`. Thus, **PC2 measures the level of urbanization.**
- The crime-related variables are **correlated** with each other, and therefore are close to each other on the biplot. `UrbanPop` is largely **independent** of the crime rates, and so it is further away on the plot.

- The standard deviations of the principal components (**the square roots of the eigenvalues** of the sample covariance matrix \(C\)):

`pca.res$sdev`

`## [1] 1.5748783 0.9948694 0.5971291 0.4164494`
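Equivalently (a small check, not part of the original notes), the squared standard deviations match the eigenvalues of the correlation matrix of the data, i.e. the covariance matrix of the scaled data:

```
# sdev^2 are the eigenvalues of the sample covariance matrix of the scaled data
all.equal(pca.res$sdev^2, eigen(cor(USArrests))$values)  # should be TRUE
```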

- The centers of the features, used for shifting:

`pca.res$center`

```
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
```

- The standard deviations of the features, used for scaling:

`pca.res$scale`

```
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
```

A scree plot can be used to choose how many components to retain.

- Look for **“elbows”** in the scree plot.
- Discard the dimensions whose eigenvalues, or equivalently **the proportion of variance explained**, drop off significantly.

```
# PCA eigenvalues/variances:
round((pr.var <- pca.res$sdev^2), 4)
```

`## [1] 2.4802 0.9898 0.3566 0.1734`

`plot(pca.res)`

We can also look at the cumulative variance plot:

```
pct.var.explained <- 100 * cumsum(pca.res$sdev^2) / sum(pca.res$sdev^2)
names(pct.var.explained) <- seq_along(pct.var.explained)
barplot(pct.var.explained, xlab = "# of Components",
        ylab = "% of Variance Explained")
```

Each principal component loading and score vector is **unique, up to a sign flip**, so different software (or even a different run) could return a mirrored version of this plot instead. The `fviz_*` plots below come from the **factoextra** package:

`fviz_pca_ind(pca.res) + coord_fixed() + theme(text = element_text(size = 20))`

```
fviz_pca_var(pca.res) + coord_fixed() +
theme(text = element_text(size = 20))
```

```
fviz_contrib(pca.res, choice = "var", axes = 1) +
theme(text = element_text(size = 20))
```

Recall the `mtcars` dataset we worked with before, which comprises fuel consumption and other aspects of design and performance for 32 cars from 1974. The dataset has 11 dimensions.

Use `prcomp()` to compute a PCA for `mtcars`. Remember to set the scale argument, as the variables are in different units and have different ranges.

Generate a scree plot and note how many dimensions you should retain.

Compute the percentage of variance explained by each of the principal components.
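A possible solution sketch (object names are illustrative, not part of the exercise statement):

```
# PCA of mtcars; scale = TRUE because the variables have different units
mtcars.pca <- prcomp(mtcars, scale = TRUE)
# Scree plot of the component variances
plot(mtcars.pca)
# Percentage of variance explained by each principal component
round(100 * mtcars.pca$sdev^2 / sum(mtcars.pca$sdev^2), 2)
```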

- **Clustering is an exploratory technique** which can **discover hidden groupings** in the data.
- Groupings are determined from the data itself, **without any prior knowledge about labels or classes**.
- There are many clustering methods available; a lot of them have an R implementation on CRAN.
- To cluster the data we need a **measure of similarity** or **dissimilarity** between a pair of observations, e.g. the Euclidean distance.

- k-means is a simple and fast **iterative method** for clustering data into \(k\) distinct, non-overlapping groups.
- The algorithm minimizes the variation within each cluster.

- Algorithm:
  - Choose \(k\) points at random to serve as *centroids*.
  - Repeat until no change:
    - Assign each point to the cluster defined by the centroid to which it is closest.
    - Once each point is assigned, compute a *new centroid* for each cluster by taking the mean across points in the cluster.
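For intuition only, here is a minimal base-R sketch of this algorithm (`simple_kmeans` is a made-up name; the sketch is unoptimized and does not handle empty clusters). In practice, use the built-in `kmeans()` shown later.

```
# Minimal k-means sketch (illustrative only)
simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  assignment <- rep(0L, nrow(X))
  for (i in seq_len(max.iter)) {
    # Distance from every point to every centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new.assignment <- max.col(-d)                    # index of the closest centroid
    if (all(new.assignment == assignment)) break     # converged: assignments unchanged
    assignment <- new.assignment
    # Recompute each centroid as the mean of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers)
}
```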


- **The number of clusters \(k\) must be prespecified** (before clustering).
- The method is stochastic, and involves **random initialization of cluster centers**. This means that each time the algorithm is run, the results obtained can be different.

The number of clusters, \(k\), should be chosen using statistics such as the total within-cluster sum of squares (the “elbow” method), the gap statistic, or the average silhouette width.
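For example, a common heuristic is to plot the total within-cluster sum of squares against \(k\) and look for an elbow (a sketch, using the scaled `USArrests` data from earlier purely for illustration):

```
# "Elbow" heuristic: total within-cluster sum of squares for a range of k
X <- scale(USArrests)
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```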

One application of k-means clustering is **image segmentation**. Here we use a picture of a field of tulips in the Netherlands, downloaded from the URL shown below.

- First, we download the image:

```
library(jpeg)
url <- "http://www.infohostels.com/immagini/news/2179.jpg"
dFile <- download.file(url, "./Lecture8-figure/Image.jpg")
img <- readJPEG("./Lecture8-figure/Image.jpg")
(imgDm <- dim(img))
```

`## [1] 480 960 3`

The image is a 3D array, so we convert it to a data frame. Each row of the data frame should correspond to a single pixel. The columns should include the pixel location (`x` and `y`), and the pixel intensity in red, green, and blue (`R`, `G`, `B`).

```
# Assign RGB channels to data frame
imgRGB <- data.frame(
  x = rep(1:imgDm[2], each = imgDm[1]),
  y = rep(imgDm[1]:1, imgDm[2]),
  R = as.vector(img[,,1]),
  G = as.vector(img[,,2]),
  B = as.vector(img[,,3])
)
```

- Each pixel is a datapoint in 3D specifying the intensity in each of the three “R”, “G”, “B” channels, which determines the pixel’s color.

`head(imgRGB, 3)`

```
## x y R G B
## 1 1 480 0 0.3686275 0.6980392
## 2 1 479 0 0.3686275 0.6980392
## 3 1 478 0 0.3725490 0.7019608
```

- We use k-means to cluster the pixels into \(k\) color groups (clusters).
- k-means can be performed in R with the built-in `kmeans()` function.

```
# Set seed since k-means involves a random initialization
set.seed(43658)
k <- 2
kmeans.2clust <- kmeans(imgRGB[, c("R", "G", "B")], centers = k)
names(kmeans.2clust)
```

```
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
```

```
# k cluster centers
kmeans.2clust$centers
```

```
## R G B
## 1 0.5682233 0.3251528 0.1452832
## 2 0.6597320 0.6828609 0.7591578
```

```
# The centers correspond to the following colors:
rgb(kmeans.2clust$centers)
```

`## [1] "#915325" "#A8AEC2"`

```
# Cluster assignment of the first 10 pixels
head(kmeans.2clust$cluster, 10)
```

`## [1] 2 2 2 2 2 2 2 2 2 2`

```
# Convert cluster assignment labels to cluster colors
kmeans.2colors <- rgb(kmeans.2clust$centers[kmeans.2clust$cluster, ])
head(kmeans.2colors, 10)
```

```
## [1] "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2" "#A8AEC2"
## [8] "#A8AEC2" "#A8AEC2" "#A8AEC2"
```

```
ggplot(data = imgRGB, aes(x = x, y = y)) +
  geom_point(colour = kmeans.2colors) +
  labs(title = paste("k-Means Clustering with", k, "clusters (colors)")) +
  xlab("x") + ylab("y") + theme_bw()
```

Now add more colors by increasing the number of clusters to 6:

```
set.seed(348675)
kmeans.6clust <- kmeans(imgRGB[, c("R", "G", "B")], centers = 6)
kmeans.6colors <- rgb(kmeans.6clust$centers[kmeans.6clust$cluster, ])
```
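The 6-color segmentation can then be plotted the same way as the 2-cluster result above (a sketch mirroring the earlier plotting code, not part of the original notes):

```
ggplot(data = imgRGB, aes(x = x, y = y)) +
  geom_point(colour = kmeans.6colors) +
  labs(title = "k-Means Clustering with 6 clusters (colors)") +
  theme_bw()
```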

Alexander Calder’s mobile

- If it’s difficult to choose the number of clusters ahead of time (or if you simply don’t want to), you can use **hierarchical clustering**.
- Hierarchical clustering can be performed using an agglomerative (bottom-up) or a divisive (top-down) approach.
- The method requires a choice of a pairwise distance metric and a rule for how to merge or divide clusters.
- The output of the method can be represented as a graphical tree-based representation of the data, called a **dendrogram**. The tree allows you to evaluate where the cutoff for grouping should occur.

Source: ISL

Results of hierarchical clustering differ depending on the choice of:

- the distance metric used for pairs of observations, e.g. Euclidean (L2), Manhattan (L1), Jaccard (binary), etc.
- the rule used for grouping clusters that are already generated, e.g. single (minimum), complete (maximum), centroid, or average linkage.

There are different ways to compute the dissimilarity between two clusters (linkage rules).
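To see that the linkage rule matters, here is a small sketch comparing single and complete linkage on the scaled `USArrests` data (an added illustration, not part of the original example):

```
# Same distance matrix, two different linkage rules
d <- dist(scale(USArrests))
par(mfrow = c(1, 2))
plot(hclust(d, method = "single"),   main = "Single linkage",   cex = 0.6)
plot(hclust(d, method = "complete"), main = "Complete linkage", cex = 0.6)
par(mfrow = c(1, 1))
```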

- We will use Fisher’s iris dataset, containing measurements on 150 irises.
- Hierarchical clustering will calculate a grouping of the flowers. We will see that these groups roughly correspond to the flower species.

`head(iris)`

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

- The built-in function `hclust()` performs hierarchical clustering.
- We will use only the petal dimensions (2 columns) to compute the distances between flowers.

```
# We use the Euclidean distance for the dissimilarities between flowers
distMat <- dist(iris[, 3:4])
# We use the "complete" linkage method for computing the cluster distances.
clusters <- hclust(distMat, method = "complete")
```

`plot(clusters, cex = 0.7)`

The dendrogram suggests that a reasonable choice of the number of clusters is either 3 or 4.

```
plot(clusters, cex = 0.7)
abline(a = 2, b = 0, col = "blue")
abline(a = 3, b = 0, col = "blue")
```

- We pick 3 clusters.
- To get the cluster assignments from cutting the tree into 3 clusters, we can use the `cutree()` function.

`(clusterCut <- cutree(clusters, 3))`

```
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 2 2 3 2 3 3 3 3 2 3 3 2 3 2 3
## [71] 2 3 2 2 3 3 2 2 2 3 3 3 3 2 2 2 2 3 3 3 3 2 3 3 3 3 3 3 3 3 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
```


```
plot(clusters, labels = clusterCut, cex = 0.9)
rect.hclust(clusters, k = 3, border=c("red", "blue", "green"))
```

`table(clusterCut, iris$Species)`

```
##
## clusterCut setosa versicolor virginica
## 1 50 0 0
## 2 0 21 50
## 3 0 29 0
```

- From the table we see that setosa and virginica were correctly assigned to separate groups.
- However, the method had difficulty separating the versicolor flowers into their own cluster.

- 2D plot of the iris dataset using petal dimensions as coordinates.
- The cluster assignments partition the flowers into species with high accuracy.

```
ggplot(iris, aes(Petal.Length, Petal.Width)) + theme_bw() +
geom_text(aes(label = clusterCut), vjust = -1) +
geom_point(aes(color = Species)) + coord_fixed(1.5)
```

We will generate synthetic data to use for k-means clustering.

```
set.seed(489576)
N <- 1000
C1 <- data.frame(cluster = "C1", x = rnorm(n = N, mean = 1), y = rnorm(n = N, mean = 1))
C2 <- data.frame(cluster = "C2", x = rnorm(n = N, mean = -2), y = rnorm(n = N, mean = -5))
C3 <- data.frame(cluster = "C3", x = rnorm(n = N, mean = 5), y = rnorm(n = N, mean = 1))
DF <- rbind(C1, C2, C3)
```

`ggplot(DF, aes(x, y, color = cluster)) + geom_point()`

Apply k-means with k = 3 (as you know the true number of clusters). Print the cluster centers.

Print a confusion matrix (table) to compare the k-means cluster assignment with the true cluster labels.

Generate a scatter plot of points, now colored by the cluster assignment.

Now pretend that you don’t know the real number of clusters. Use k = 4 and recompute the k-means clustering. Plot the results and see what happened.
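A possible solution sketch (object names are illustrative; the exact results depend on the random initialization):

```
set.seed(1)
km3 <- kmeans(DF[, c("x", "y")], centers = 3, nstart = 20)
km3$centers                          # cluster centers
table(km3$cluster, DF$cluster)       # confusion table vs. the true labels
ggplot(DF, aes(x, y, color = factor(km3$cluster))) + geom_point()

# Pretending k is unknown: with k = 4, one true cluster is typically split in two
km4 <- kmeans(DF[, c("x", "y")], centers = 4, nstart = 20)
ggplot(DF, aes(x, y, color = factor(km4$cluster))) + geom_point()
```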