Goal

In this lab we will learn the basics of Multivariate Analysis and PCA using a few simple examples.

Work through this lab by running all the R code on your computer and making sure that you understand the input and the output. Make alterations where you see fit. We encourage you to work through this lab with a partner.

This lab is based on two chapters of Modern Statistics for Modern Biology by Susan Holmes and Wolfgang Huber (Holmes and Huber 2019), which are freely available online: Chapter 7 and Chapter 9.

knitr::opts_chunk$set(echo = TRUE)
pkgs_needed = c("tidyverse", "GGally", "factoextra", "ade4", "pheatmap")
setdiff(pkgs_needed, installed.packages())
# If this is not empty, you may need to execute the following line:
# BiocManager::install(setdiff(pkgs_needed, installed.packages()))
library("tidyverse")

Obtain Data

In this lab we will be working with the following datasets:

#turtles = read.table(url("https://web.stanford.edu/class/bios221/data/PaintedTurtles.txt"),
#                 header=TRUE)
turtles<-read.table("../../data/PaintedTurtles.txt",header=TRUE)
head(turtles)
##   sex length width height
## 1   f     98    81     38
## 2   f    103    84     38
## 3   f    103    86     42
## 4   f    105    86     40
## 5   f    109    88     44
## 6   f    123    92     50
#download.file(url = "https://web.stanford.edu/class/bios221/data/athletes.RData",
#              destfile = "athletes.RData",mode = "wb")
load("athletes.RData")
athletes[1:3,]
##    m100 long weight highj  m400  m110  disc pole javel  m1500
## 1 11.25 7.43  15.48  2.27 48.90 15.13 49.28  4.7 61.32 268.95
## 2 10.87 7.45  14.97  1.97 47.71 14.46 44.36  5.1 61.76 273.02
## 3 11.18 7.44  14.20  1.97 48.29 14.81 43.66  5.2 64.16 263.20

Low dimensional data summaries and preparation

It is instructive to first consider 2-dimensional summaries of the datasets:

library("GGally")
ggpairs(turtles[,-1], axisLabels="none")

Can you do the same for the athletes data?

#TODO
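
One possible answer (a sketch, using the same call as for the turtles):

library("GGally")
ggpairs(athletes, axisLabels = "none")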

Correlations can be displayed on a color scale by a simple call to the pheatmap function:

library("pheatmap")
pheatmap(cor(athletes), cellwidth = 10, cellheight = 10)

Preprocessing the data

Our first task in data analysis is to transform the data by standardizing it to a common scale. This rescaling is done with the scale function, which makes every column have mean 0 and variance 1.
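
To see what scale does, here is a quick check on a single column (a sketch): scale centers by the column mean and divides by the sample standard deviation.

x = turtles$length
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))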

scaledTurtles=data.frame(scale(turtles[,-1]),sex=turtles[,1])
ggplot(scaledTurtles,aes(x=width,y=height, group =sex)) +
  geom_point(aes(color=sex))

Can you compute the standard deviation and mean of each column in the turtles data frame? Can you do the same on the scaled dataset, i.e., on scaledTurtles?
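
A compact way to check all columns at once (a sketch using apply over the numeric columns):

apply(turtles[, -1], 2, mean)
apply(turtles[, -1], 2, sd)
apply(scaledTurtles[, 1:3], 2, mean)
apply(scaledTurtles[, 1:3], 2, sd)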

Question: What was the mean of turtles’ lengths before standardizing?

Answer:

mean(turtles$length)
## [1] 124.6875

Question: What was the standard deviation of turtles’ widths before standardizing?

Answer:

sd(turtles$width)
## [1] 12.67584

Question: What was the standard deviation of turtles’ widths after standardizing?

Answer:

sd(scaledTurtles$width)
## [1] 1

Dimension reduction

library("ggplot2")
athletes = scale(athletes)
n = nrow(athletes)
athletes = data.frame(athletes)
p = ggplot(athletes, aes(x = weight,y=  disc)) +
  geom_point(size = 2, shape=21)
p + geom_point(aes(y = rep(0, n)), colour="red") +
  geom_segment(aes(xend = weight, yend = rep(0,n)), linetype = "dashed")

Now try to do the following:

Question: Calculate the variance of the red points in the above figure. (points are projected onto the weight-axis).

Answer:

var(athletes$weight) # the red points keep the weight coordinate, which was scaled to variance 1
## [1] 1

Summarize 2D-data by a line

We regress disc on weight with the lm function (linear model) to find the regression line; its intercept (a) is the first coefficient in the output of lm and its slope (b) is the second:

ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
  geom_point(size = 2, shape = 21)
reg1 = lm(disc ~ weight, data = athletes)
a1 = reg1$coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope
pline1 = ath_gg + geom_abline(intercept = a1, slope = b1,
    col = "blue", lwd = 1.5)
pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
    colour = "red", arrow = arrow(length = unit(0.15, "cm")))

Can you regress weight on disc and generate a similar plot?

reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept
b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
    col = "darkgreen", lwd = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc),
    colour = "orange", arrow = arrow(length = unit(0.15, "cm")))

Can you create a plot that shows all points, as well as both regression lines, i.e., a plot that shows both the line you get from lm(disc ~ weight) and the one from lm(weight ~ disc)?

pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted), colour = "blue", alpha = 0.35) +
  geom_abline(intercept = -a2/b2, slope = 1/b2, col = "darkgreen", lwd = 1.5, alpha = 0.8) +
  geom_segment(aes(xend = reg2$fitted, yend = disc), colour = "orange", alpha = 0.35) +
   coord_fixed()

A line that minimizes distances in both directions

Now we will plot the line chosen to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the principal component line.

X = cbind(athletes$disc, athletes$weight)
svda = svd(X)
pc = X %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2]) - bp * mean(pc[, 1])

p + geom_segment(xend = pc[,1], yend = pc[,2]) + 
  geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5) + 
  coord_fixed()

Can you create a plot that includes both the line from the plot above, plus the two regression lines lm(disc ~ weight) and lm(weight ~ disc)?

pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted), colour = "blue", alpha = 0.35) +
  geom_abline(intercept = -a2/b2, slope = 1/b2, col = "darkgreen", lwd = 1.5, alpha = 0.8) +
  geom_segment(aes(xend = reg2$fitted, yend = disc), colour = "orange", alpha = 0.35) +
  geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5, alpha = 0.8) +
  geom_segment(xend = pc[, 1], yend = pc[, 2], colour = "purple", alpha = 0.35) + coord_fixed()

If we rotate the (discus, weight) plane with this change of coordinates, making the purple line the horizontal \(x\) axis, we obtain what is known as the first principal plane:

ppdf = data.frame(PC1n = -svda$u[,1]*svda$d[1], PC2n=svda$u[,2] * svda$d[2])

ggplot(ppdf,aes(x = PC1n, y=PC2n)) + geom_point() + ylab("PC2 ") +
  geom_hline(yintercept = 0, color = "purple", lwd = 1.5, alpha = 0.5) +
  geom_point(aes(x = PC1n, y = 0),color = "red") + xlab("PC1 ")+
  xlim(-3.5, 2.7)+ylim(-2, 2) + coord_fixed() +
  geom_segment(aes(xend = PC1n,yend = 0), color = "red")

Question: What are the sums of squares of the red segments equal to?

Answer:

sum(ppdf$PC2n^2)
## [1] 6.196729
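
As a check (a sketch): the red segments are the orthogonal residuals from the principal component line, so their summed squares equal the squared second singular value; both expressions below give the same number as above.

svda$d[2]^2
sum((X - pc)^2)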

Question: What is the variance of this new set of red points?

Answer:

var(ppdf$PC1n)
## [1] 1.806352

Question: What is the sum of the variances of ppdf$PC1n and ppdf$PC2n?

Answer:

var(ppdf$PC1n) + var(ppdf$PC2n)
## [1] 2
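
This is no coincidence: rotating to the principal components preserves the total variance, which for two standardized variables is 1 + 1 = 2. A quick check (a sketch):

var(athletes$disc) + var(athletes$weight)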

We could have gotten the same results using the princomp command as follows:

pca_athletes = princomp(X)

Now compare (note that the loadings are only defined up to sign, but the lines they define are the same):

svda$v
##            [,1]       [,2]
## [1,] -0.7071068  0.7071068
## [2,] -0.7071068 -0.7071068
pca_athletes$loadings
## 
## Loadings:
##      Comp.1 Comp.2
## [1,]  0.707  0.707
## [2,]  0.707 -0.707
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
head(pca_athletes$scores)
##           Comp.1      Comp.2
## [1,]  2.11505770  0.51860275
## [2,]  0.90889250 -0.14608045
## [3,]  0.36703787  0.12959656
## [4,]  1.01909156 -0.08896787
## [5,] -0.78018106  0.34139129
## [6,] -0.07617437  0.34465656
head(svda$u %*% diag(svda$d))
##             [,1]        [,2]
## [1,] -2.11505770  0.51860275
## [2,] -0.90889250 -0.14608045
## [3,] -0.36703787  0.12959656
## [4,] -1.01909156 -0.08896787
## [5,]  0.78018106  0.34139129
## [6,]  0.07617437  0.34465656

Question: Which field in pca_athletes contains approximately the same object as c(sd(ppdf$PC1n), sd(ppdf$PC2n))?

Answer:

pca_athletes$sdev
##    Comp.1    Comp.2 
## 1.3234856 0.4333355

Question: Unfortunately the results stored in the above field do not perfectly match c(sd(ppdf$PC1n), sd(ppdf$PC2n)). By which correction factor do you need to multiply them to make the results match?

Answer: The factor is \(\sqrt{33/32}\).

sqrt(33/32)*pca_athletes$sdev
##    Comp.1    Comp.2 
## 1.3440060 0.4400543

Note that the above is the same as:

c(sd(ppdf$PC1n), sd(ppdf$PC2n))
## [1] 1.3440060 0.4400543

The difference arises because princomp divides by \(n\) when computing variances, whereas sd and var use the unbiased divisor \(n-1\).
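
To verify this (a sketch; the helper sd_n is hypothetical, defined only for this check), recompute the standard deviations with divisor n: they should then match pca_athletes$sdev directly.

sd_n = function(x) sqrt(sum((x - mean(x))^2) / length(x))  # divisor n instead of n - 1
c(sd_n(ppdf$PC1n), sd_n(ppdf$PC2n))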

Turtle PCA

Now let’s continue inspecting the turtles data.

turtles3var = turtles[, -1]
apply(turtles3var, 2, mean)
##    length     width    height 
## 124.68750  95.43750  46.33333

We start by looking at the variances of the three measurements in the unstandardized case:

apply(turtles3var, 2, var)
##    length     width    height 
## 419.49601 160.67686  70.43972

Next we see that basically all 3 variables are very strongly correlated:

turtlesc = scale(turtles3var)
cor(turtlesc)
##           length     width    height
## length 1.0000000 0.9783116 0.9646946
## width  0.9783116 1.0000000 0.9605705
## height 0.9646946 0.9605705 1.0000000

Because of the strong correlations, we would expect that the data matrix can be well approximated by a rank 1 matrix. Let’s do the PCA:

library("factoextra")
pca1 = princomp(turtlesc)
# or alternatively:
#pca1 = ade4::dudi.pca(turtlesc, scannf = FALSE)
pca1
## Call:
## princomp(x = turtlesc)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3 
## 1.6954576 0.2048201 0.1448180 
## 
##  3  variables and  48 observations.
fviz_eig(pca1, geom = "bar", width = 0.4)

The screeplot shows the eigenvalues for the standardized data: one very large component and two very small ones. In this case the data are (almost) one-dimensional.
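
One way to make “almost one-dimensional” concrete (a sketch): the best rank-1 approximation of the scaled matrix, obtained from its SVD, captures almost all of the total sum of squares. Compare the result with the percentage of variance explained computed below.

s = svd(turtlesc)
X1 = s$d[1] * s$u[, 1] %*% t(s$v[, 1])        # best rank-1 approximation (Eckart-Young)
1 - sum((turtlesc - X1)^2) / sum(turtlesc^2)  # fraction of the total sum of squares captured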

Question: What is the percentage of variance explained by the first PC?

Answer:

100*pca1$sdev[1]^2/sum(pca1$sdev^2)
##   Comp.1 
## 97.85792
fviz_pca_biplot(pca1, label = "var", col.ind = turtles[,1]) 

Add ellipses for female and male groups to the plot above.

fviz_pca_biplot(pca1, label = "var", col.ind = turtles[,1], addEllipses=TRUE) 

Back to the athletes

Now let us try to interpret another scree plot with more dimensions.

library("ade4")
pca.ath = dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
##  [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
##  [8] 0.3067981 0.2669494 0.1018542
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")
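
A quick numerical companion to the screeplot (a sketch): the cumulative proportion of variance captured by the leading components.

round(cumsum(pca.ath$eig) / sum(pca.ath$eig), 2)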

The screeplot shows a clear drop after the second eigenvalue, which indicates that a rank-2 approximation will be good. Let’s interpret the first two axes by looking at how the original (old) variables project onto the two new ones.

fviz_pca_var(pca.ath, col.circle = "black") + ggtitle("")

Note

It can seem paradoxical that the running variables (those whose names start with m) are opposed to the others.

Question: Why does this occur?

Answer: The running events are recorded as times, where smaller is better, whereas the jumps and throws are recorded as distances or heights, where larger is better; good performances therefore correlate negatively across the two groups of variables.

We can make the variables align, and make the left direction on PC1 an axis of overall athletic ability, by changing the signs of the running variables:

athletes[, c(1, 5, 6, 10)] = -athletes[, c(1, 5, 6, 10)]
cor(athletes) %>% round(1)
##        m100 long weight highj m400 m110 disc pole javel m1500
## m100    1.0  0.5    0.2   0.1  0.6  0.6  0.0  0.4   0.1   0.3
## long    0.5  1.0    0.1   0.3  0.5  0.5  0.0  0.3   0.2   0.4
## weight  0.2  0.1    1.0   0.1 -0.1  0.3  0.8  0.5   0.6  -0.3
## highj   0.1  0.3    0.1   1.0  0.1  0.3  0.1  0.2   0.1   0.1
## m400    0.6  0.5   -0.1   0.1  1.0  0.5 -0.1  0.3  -0.1   0.6
## m110    0.6  0.5    0.3   0.3  0.5  1.0  0.1  0.5   0.1   0.1
## disc    0.0  0.0    0.8   0.1 -0.1  0.1  1.0  0.3   0.4  -0.4
## pole    0.4  0.3    0.5   0.2  0.3  0.5  0.3  1.0   0.3   0.0
## javel   0.1  0.2    0.6   0.1 -0.1  0.1  0.4  0.3   1.0  -0.1
## m1500   0.3  0.4   -0.3   0.1  0.6  0.1 -0.4  0.0  -0.1   1.0
pcan.ath = dudi.pca(athletes, nf = 2, scannf = FALSE)
pcan.ath$eig
##  [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
##  [8] 0.3067981 0.2669494 0.1018542
fviz_pca_var(pcan.ath, col.circle="black") + ggtitle("")

fviz_pca_ind(pcan.ath) + ggtitle("") + ylim(c(-2.5,5.7))

What do you notice about the numbers?

Contiguous or supplementary information

Many programs and workflows for biological sequence analysis or assays separate the environmental and contextual information, which they call metadata, from the assay data or sequence reads. We discourage such practice as the exact connections between the samples and covariates are important. A lost connection between the assays and covariates makes later analyses impossible. Covariates such as clinical history, time, batch or location are important and should be considered components of the data.

Extra information about sample batches, dates of measurement or differences in protocols is often called metadata; this can be a misnomer if it implies that such data are somehow less important. This information is real data that needs to be integrated into the analyses. We typically store it in a data.frame or a similar R class and tightly link it to the primary assay data.
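
A minimal sketch of this practice (the objects below are hypothetical toys, just to illustrate keeping assay data and covariates linked):

# toy assay matrix: features (e.g. taxa) in rows, samples in columns
assay_mat = matrix(rnorm(20), nrow = 5,
                   dimnames = list(paste0("taxon", 1:5), paste0("sample", 1:4)))
# covariates: one row per sample, in the same order as the assay columns
sample_info = data.frame(sample = colnames(assay_mat),
                         day    = factor(c(1, 1, 2, 2)))
# keep them together in one object so the connection cannot be lost
experiment = list(assay = assay_mat, covariates = sample_info)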

Known batches in data

Here we show an example of an analysis that was done by Holmes et al. (2011) on bacterial abundance data from Phylochip microarrays. The experiment was designed to detect differences between a group of healthy rats and a group with irritable bowel disease. This example shows a case where nuisance batch effects become apparent in the analysis of experimental data. It also illustrates that good data analysis is sequential: it is better to analyze data as they are collected, adjusting for severe problems in the experimental design as they occur, instead of having to deal with them post mortem.

When data collection started on this project, days 1 and 2 were delivered and we made the plot that appears in the figure below. This showed a definite day effect. When investigating the source of this effect, we found that both the protocol and the array were different on days 1 and 2. This leads to uncertainty about the source of the variation; we call this confounding of effects.

This section uses a package from Bioconductor called “sva”; you need to install it first (uncomment the corresponding line in the following code, and load the data from the website if you do not have a local copy).

#IBDchip = readRDS(url("https://web.stanford.edu/class/bios221/Pune/data/vsn28Exprd.rds"))
IBDchip = readRDS("../../data/vsn28Exprd.rds")
#BiocManager::install("sva")
library("sva")

Question What class is IBDchip? Look at the last row of the matrix; what do you notice?

Answer

class(IBDchip)
## [1] "matrix"
dim(IBDchip)
## [1] 8635   28
tail(IBDchip[,1:3])
##                                  20CF     20DF     20MF
## bm-026.1.sig_st              7.299308 7.275802 7.383103
## bm-125.1.sig_st              8.538857 8.998562 9.296096
## bru.tab.d.HIII.Con32.sig_st  6.802736 6.777566 6.859950
## bru.tab.d.HIII.Con323.sig_st 6.463604 6.501139 6.611851
## bru.tab.d.HIII.Con5.sig_st   5.739235 5.666060 5.831079
## day                          2.000000 2.000000 2.000000
summary(IBDchip[nrow(IBDchip),])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.857   2.000   3.000

The data are normalized abundance measurements of 8634 taxa measured on 28 samples. We use a rank-threshold transformation: within each sample the values are replaced by their abundance ranks, and all ranks below a threshold of 3000 are tied, so that the least abundant taxa all receive the same score. We separate out the assay data from the day variable, which should be treated as a factor:

assayIBD = IBDchip[-nrow(IBDchip), ]
day = factor(IBDchip[nrow(IBDchip), ])

Figure 1: The screeplot shows us that the phenomenon can be usefully represented in two dimensions.

Instead of using the continuous, normalized data, we perform a more robust analysis by replacing the values with their ranks. The lowest ranks are tied at a threshold chosen to reflect the number of taxa thought to be truly present:

rankthreshPCA = function(x, threshold = 3000) {
  # replace the values in each sample (column) by their abundance ranks
  ranksM = apply(x, 2, rank)
  # tie all ranks below the threshold: the low-abundance taxa get the same score
  ranksM[ranksM < threshold] = threshold
  # shift and flip the scale (the tied taxa become 0); the sign does not affect the PCA
  ranksM = threshold - ranksM
  # PCA of the samples (transposed so samples are rows), keeping 2 components
  dudi.pca(t(ranksM), scannf = FALSE, nf = 2)
}
pcaDay12 = rankthreshPCA(assayIBD[,day!=3])
day12 = day[day!=3]
fviz(pcaDay12, element="ind", axes=c(1,2), geom=c("point","text"),
  habillage = day12, repel = TRUE, palette = "Dark2",
  addEllipses = TRUE, ellipse.type = "convex") + ggtitle("") +
  coord_fixed()

We have used colors to identify the different days and have kept the sample labels as well. We have also added convex hulls for each day. The group mean is identified as the point with the larger symbol (circle, triangle or square).

Figure 2: PCA of the day 1 and day 2 samples, colored by day, with convex hulls and group means.

Question Why do we use a threshold for the ranks?

Answer Low abundances at the noise level occur for species that are not really present, and these make up more than half of the taxa. A large jump in rank for such observations could easily occur without any meaningful reason, so we create a large number of ties among the low-abundance taxa.

Question What can you say about the PCA plot?

Answer The figure shows that the samples arrange themselves naturally into two different groups according to the day on which they were processed. After discovering this effect, we delved into the differences that could explain these distinct clusters. There were two different protocols used (protocol 1 on day 1, protocol 2 on day 2) and, unfortunately, two different provenances for the arrays used on those two days (array 1 on day 1, array 2 on day 2).

A third set of data, consisting of four samples, had to be collected to deconvolve the confounding effect: on day 3, array 2 was used with protocol 1.

The figure below shows the PCA plot with all the samples; it is created by the following:

pcaDay123 = rankthreshPCA(assayIBD)
fviz(pcaDay123, element="ind", axes=c(1,2), geom=c("point","text"),
  habillage = day, repel=TRUE, palette = "Dark2",
  addEllipses = TRUE, ellipse.type = "convex") + ggtitle("") +
  coord_fixed()

When comparing the three-day analysis to that of the first two days, we notice an inversion of signs in the coordinates on the second axis; this has no biological relevance. The important finding is that group 3 overlaps heavily with group 1, indicating that it was the protocol change on day 2 which created the variability.

Figure 3: PCA plot including the day 3 samples; group 3 overlaps heavily with group 1.

In which situation would it be preferable to make confidence ellipses around the group means using the following code?

fviz_pca_ind(pcaDay123, habillage = day, labelsize = 3,
  palette = "Dark2", addEllipses = TRUE, ellipse.level = 0.69)

fviz_eig(pcaDay123, bar_width = 0.6) + ggtitle("")

Figure 4: The eigenvalue screeplot in the case of 3 groups is extremely similar to the one for two groups shown in Figure 1.

Through this visualization we were able to uncover a flaw in the original experimental design. The first two batches, shown in green and brown, were both balanced with regard to IBS and healthy rats, yet they show very different levels of variability and very different overall multivariate coordinates. In fact, there are two confounded effects: both the arrays and the protocols were different on those two days. We had to run a third batch of experiments on day 3, represented in purple; this used the protocol from day 1 and the arrays from day 2. The third group faithfully overlaps with batch 1, telling us that the change in protocol was responsible for the variability.

Removing batch effects

Through the combination of the continuous measurements from assayIBD and the supplementary batch number as a factor, the PCA map has provided an invaluable investigation tool.

This is a good example of the use of supplementary points.

The mean-barycenter points are created by using the group-means of points in each of the three groups and serve as extra markers on the plot.
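
These barycenters can be computed directly from the sample coordinates (a sketch, using the row coordinates that dudi.pca stores in $li):

aggregate(pcaDay123$li, by = list(day = day), FUN = mean)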

We can decide to re-align the three groups by subtracting the group means so that all the batches are centered on the origin. A slightly more effective way is to use the ComBat function available in the sva package. This function uses a similar but slightly more sophisticated method (an empirical Bayes mixture approach, Leek et al. 2010). We can see its effect on the data by redoing our robust PCA; the result is shown in the figure below.
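
For intuition, centering each batch by hand could be sketched as follows (the object centeredIBD is hypothetical and not used in the rest of the lab; ComBat, used below, does something similar but shrinks the batch adjustments with an empirical Bayes model):

centeredIBD = assayIBD
for (d in levels(day)) {
  idx = which(day == d)
  # subtract each taxon's within-batch mean from the samples of that batch
  centeredIBD[, idx] = assayIBD[, idx] - rowMeans(assayIBD[, idx])
}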

model0 = model.matrix(~1, day)
combatIBD = ComBat(dat = assayIBD, batch = day, mod = model0)
## Standardizing Data across genes
pcaDayBatRM = rankthreshPCA(combatIBD)
fviz(pcaDayBatRM, element = "ind", geom = c("point", "text"),
  habillage = day, repel=TRUE, palette = "Dark2", addEllipses = TRUE,
  ellipse.type = "convex", axes =c(1,2)) + coord_fixed() + ggtitle("")

Figure 5: The modified data with the batch effects removed now show the three batch groups heavily overlapping and centered almost at the origin.

Also see a quick overview of the important steps:

10 tips for dimension reduction

References

Holmes, Susan, Alexander V Alekseyenko, Alden Timme, Tyrrell Nelson, Pankaj Jay Pasricha, and Alfred Spormann. 2011. “Visualization and Statistical Comparisons of Microbial Communities Using R Packages on Phylochip Data.” In Pacific Symposium on Biocomputing, 142–53. World Scientific.

Holmes, Susan, and Wolfgang Huber. 2019. Modern Statistics for Modern Biology. Cambridge University Press.

Leek, Jeffrey T, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry. 2010. “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data.” Nature Reviews Genetics 11 (10): 733–39.