# 7  Multivariate Analysis

Many datasets consist of several variables measured on the same set of subjects: patients, samples, or organisms. For instance, we may have biometric characteristics such as height, weight, age as well as clinical variables such as blood pressure, blood sugar, heart rate, and genetic data for, say, a thousand patients. The raison d’être for multivariate analysis is the investigation of connections or associations between the different variables measured. Usually the data are reported in a tabular data structure with one row for each subject and one column for each variable. In the following, we will focus on the special case where each of the variables is numeric, so we can represent the data structure as a matrix in R.

If the columns of the matrix are all independent of each other (unrelated), we can simply study each column separately and do standard “univariate” statistics on them one by one; there would be no benefit in studying them as a matrix.

More often, there will be patterns and dependencies. For instance, in the biology of cells, we know that the proliferation rate will influence the expression of many genes simultaneously. Studying the expression of 25,000 gene (columns) on many samples (rows) of patient-derived cells, we notice that many of the genes act together, either that they are positively correlated or that they are anti-correlated. We would miss a lot of important information if we were to only study each gene separately. Important connections between genes are detectable only if we consider the data as a whole: each row representing the many measurements made on the same observational unit. However, having 25,000 dimensions of variation to consider at once is daunting; we will show how to reduce our data to a smaller number of most important dimensions1 without losing too much information.

• 1 We will elaborate this idea of dimension reduction in much more detail below. For the time being, remember we live in a four-dimensional world.

• This chapter presents many examples of multivariate data matrices that we encounter in high-throughput experiments, as well as some more elementary examples that we hope will enhance your intuition. We will focus in this chapter on Principal Component Analysis, abbreviated as PCA, a dimension reduction method. We will provide geometric explanations of the algorithm as well as visualizations that help interprete the output of PCA analyses.

## 7.1 Goals for this chapter

In this chapter we will:

• See examples of matrices that come up in the study of biological data.

• Perform dimension reduction to understand correlations between variables.

• Preprocess, rescale and center the data before starting a multivariate analysis.

• Build new variables, called principal components (PC), that are more useful than the original measurements.

• See what is “under the hood” of PCA: the singular value decomposition of a matrix.

• Visualize what this decomposition achieves and learn how to choose the number of principal components.

• Run through a complete PCA analysis from start to finish.

• Project factor covariates onto the PCA map to enable a more useful interpretation of the results.

## 7.2 What are the data? Matrices and their motivation

First, let’s look at a set of examples of rectangular matrices used to represent tables of measurements. In each matrix, the rows and columns represent specific entities.

Turtles: A simple data set that will help us understand the basic principles is a matrix of three dimensions of biometric measurements on painted turtles .

turtles = read.table("../data/PaintedTurtles.txt", header = TRUE)
turtles[1:4, ]
  sex length width height
1   f     98    81     38
2   f    103    84     38
3   f    103    86     42
4   f    105    86     40

The last three columns are length measurements (in millimetres), whereas the first column is a factor variable that tells us the sex of each animal.

Athletes: This matrix is an interesting example from the sports world. It reports the performances for 33 athletes in the ten disciplines of the decathlon: m100, m400 and m1500 are times in seconds for the 100 meters, 400 meters, and 1500 meters respectively; m110 is the time to finish the 110 meters hurdles; pole is the pole-vault height, and high and long are the results of the high and long jumps, all in meters; weight, disc, and javel are the lengths in meters the athletes were able to throw the weight, discus and javelin. Here are these variables for the first three athletes:

data("olympic", package = "ade4")
athletes = setNames(olympictab, c("m100", "long", "weight", "high", "m400", "m110", "disc", "pole", "javel", "m1500")) athletes[1:3, ]  m100 long weight high m400 m110 disc pole javel m1500 1 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95 2 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02 3 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20 Cell Types: Holmes et al. (2005) studied gene expression profiles of sorted T-cell populations from different subjects. The columns are a subset of gene expression measurements, they correspond to 156 genes that show differential expression between cell types. load("../data/Msig3transp.RData") round(Msig3transp,2)[1:5, 1:6]  X3968 X14831 X13492 X5108 X16348 X585 HEA26_EFFE_1 -2.61 -1.19 -0.06 -0.15 0.52 -0.02 HEA26_MEM_1 -2.26 -0.47 0.28 0.54 -0.37 0.11 HEA26_NAI_1 -0.27 0.82 0.81 0.72 -0.90 0.75 MEL36_EFFE_1 -2.24 -1.08 -0.24 -0.18 0.64 0.01 MEL36_MEM_1 -2.68 -0.15 0.25 0.95 -0.20 0.17 Bacterial Species Abundances: Matrices of counts are used in microbial ecology studies (as we saw in Chapter 4). Here the columns represent different species (or operational taxonomic units, OTUs) of bacteria, which are identified by numerical tags. The rows are labeled according to the samples in which they were measured, and the (integer) numbers represent the number of times of each of the OTUs was observed in each of the samples. data("GlobalPatterns", package = "phyloseq") GPOTUs = as.matrix(t(phyloseq::otu_table(GlobalPatterns))) GPOTUs[1:4, 6:13] OTU Table: [4 taxa and 8 samples] taxa are rows 246140 143239 244960 255340 144887 141782 215972 31759 CL3 0 7 0 153 3 9 0 0 CC1 0 1 0 194 5 35 3 1 SV1 0 0 0 0 0 0 0 0 M31Fcsw 0 0 0 0 0 0 0 0 mRNA reads: RNA-Seq transcriptome data report the number of sequence reads matching each gene2 in each of several biological samples. We will study this type of data in detail in Chapter 8 • 2 Or sub-gene structures, such as exons. • library("SummarizedExperiment") data("airway", package = "airway") assay(airway)[1:3, 1:4]  SRR1039508 SRR1039509 SRR1039512 SRR1039513 ENSG00000000003 679 448 873 408 ENSG00000000005 0 0 0 0 ENSG00000000419 467 515 621 365 It is customary in the RNA-Seq field—and so it is for the airway data above—to report the genes in the rows and the samples in the columns. Compared to the other matrices we look at here, this is transposed: rows and columns are swapped. Such different conventions easily lead to errors, so they are worthwhile paying attention to3. Proteomic profiles: Here, the columns are aligned mass spectroscopy peaks or molecules identified through their $$m/z$$-ratios; the entries in the matrix are the measured intensities4. • 3 The Bioconductor project tries to help users and developers to avoid such ambiguities by defining data containers in which such conventions are explicitly fixed. In Chapter 8, we will see the example of the SummarizedExperiment class. • 4 More details can be found, e.g., on Wikipedia. • metab = t(as.matrix(read.csv("../data/metabolites.csv", row.names = 1))) metab[1:4, 1:4]  146.0985388 148.7053275 310.1505057 132.4512963 KOGCHUM1 29932.36 17055.70 1132.82 785.5129 KOGCHUM2 94067.61 74631.69 28240.85 5232.0499 KOGCHUM3 146411.33 147788.71 64950.49 10283.0037 WTGCHUM1 229912.57 384932.56 220730.39 26115.2007 When a peak was not detected for a particular $$m/z$$ score in the mass spectrometry run, a zero was recorded in metab. Similarly, zeros in GPOTUs or in the airway object occur when there were no matching sequence reads detected. Tabulate the frequencies of zeros in these data matrices. Question 7.1 1. What are the columns of these data matrices usually called? 2. In each of these examples, what are the rows of the matrix? 3. What does a cell in a matrix represent? 4. If the data matrix is called athletes and you want to see the value of the third variable for the fifth athlete, what do you type into R? ### 7.2.1 Low-dimensional data summaries and preparation If we are studying only one variable, i.e., just the third column of the turtles matrix5, we say we are looking at one-dimensional data. Such a vector, say all the turtle weights, can be visualized by plots such as those that we saw in Section 3.6, e.g., a histogram. If we compute a one number summary, say mean or median, we have made a zero-dimensional summary of our one-dimensional data. This is already an example of dimension reduction. • 5 The third column of a matrix $$X$$ is denoted mathematically by $${\mathbf x}_{\cdot 3}$$ or accessed in R using X[, 3]. • In Chapter 3 we studied two-dimensional scatterplots. We saw that if there are too many observations, it can be beneficial to group the data into (hexagonal) bins: these are two-dimensional histograms. When considering two variables ($$x$$ and $$y$$) measured together on a set of observations, the correlation coefficient measures how the variables co-vary. This is a single number summary of two-dimensional data. Its formula involves the summaries $$\bar{x}$$ and $$\bar{y}$$: $\hat{\rho}= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum_{j=1}^n (y_j-\bar{y})^2}} \tag{7.1}$ In R, we use the cor function to calculate its value. Applied to a matrix this function computes all the two way correlations between continuous variables. In Chapter 9 we will see how to analyse multivariate categorical data. Question 7.2 Compute the matrix of all correlations between the measurements from the turtles data. What do you notice ? We take out the categorical variable and compute the matrix. cor(turtles[, -1])  length width height length 1.0000000 0.9783116 0.9646946 width 0.9783116 1.0000000 0.9605705 height 0.9646946 0.9605705 1.0000000 We see that this square matrix is symmetric and the values are all close to 1. The diagonal values are always 1. It is always beneficial to start a multidimensional analysis by checking these simple one-dimensional and two-dimensional summary statistics using visual displays such as those we look at in the next two questions. Question 7.3 1. Produce all pairwise scatterplots, as well as the one-dimensional histograms on the diagonal, for the turtles data. Use the package GGally. 2. Guess the underlying or “true dimension” of these data? library("ggplot2") library("dplyr") library("GGally") ggpairs(turtles[, -1], axisLabels = "none") From Figure 7.2, it looks like all three of the variables are highly correlated and mostly reflect the same “underlying” variable, which we might interpreted as the size of the turtle. Question 7.4 Compute all pairwise correlations of the variables in the athletes data and display the matrix in a heatmap. What do you notice? library("pheatmap") pheatmap(cor(athletes), cell.width = 10, cell.height = 10) Figure 7.3 shows how the 10 variables cluster into groups: running, throwing and jumping. ### 7.2.2 Preprocessing the data In many cases, different variables are measured in different units, so they have different baselines and different scales6. These are not directly comparable in their original form. • 6 Common measures of scale are the range and the standard deviation. For instance, the times for the 110 metres vary between 14.18 and 16.2, with a standard deviation of 0.51, whereas the times to complete the 1500 metres vary between 256.64 and 303.17, with a standard deviation of 13.66; more than an order of magnitude larger. Moreover, the athletes data also contain measurements in different units (seconds, metres), whose choice is arbitrary (lengths could also be recorded in centimetres or feet, times in milliseconds). • For PCA and many other methods, we therefore need to transform the numeric values to some common scale in order to make comparisons meaningful. Centering means subtracting the mean, so that the mean of the centered data is at the origin. Scaling or standardizing then means dividing by the standard deviation, so that the new standard deviation is $$1$$. In fact, we have already encountered these operations when computing the correlation coefficient (Equation 7.1): the correlation coefficient is simply the vector product of the centered and scaled variables. To perform these operations, there is the R function scale, whose default behavior when given a matrix or a data frame is to make every column have a mean of zero and a standard deviation of $$1$$. Question 7.5 1. Compute the means and standard deviations of the turtle data, then use the scale function to center and standardize the continuous variables. Call this scaledTurtles, then verify the new values for mean and standard deviation of scaledTurtles. 2. Make a scatterplot of the scaled and centered width and height variables of the turtle data and color the points by their sex. apply(turtles[,-1], 2, sd)  length width height 20.481602 12.675838 8.392837  apply(turtles[,-1], 2, mean)  length width height 124.68750 95.43750 46.33333  scaledTurtles = scale(turtles[, -1]) apply(scaledTurtles, 2, mean)  length width height -1.432050e-18 1.940383e-17 -2.870967e-16  apply(scaledTurtles, 2, sd) length width height 1 1 1  data.frame(scaledTurtles, sex = turtles[, 1]) %>% ggplot(aes(x = width, y = height, group = sex)) + geom_point(aes(color = sex)) + coord_fixed() We have already encountered other data transformation choices in Chapters 4 and 5, where we used the log and asinh functions. The aim of these transformations is (usually) variance stabilization, i.e., to make the variances of replicate measurements of one and the same variable in different parts of its dynamic range more similar. In contrast, the standardizing transformation described above aims to make the scale (as measured by mean and standard deviation) of different variables the same. Sometimes it is preferable to leave variables at different scales because they are truly of different importance. If their original scale is relevant, then we can (should) leave the data as is. In other cases, the variables have different precisions known a priori. We will see in Chapter 9 that there are several ways of weighting such variables. After preprocessing the data, we are ready to undertake data simplification through dimension reduction. Useful books with relevant chapters are Flury (1997) for an introductory account and Mardia, Kent, and Bibby (1979) for a detailed mathematical approach. ## 7.3 Dimension reduction We will explain dimension reduction from several different perspectives. It was invented in 1901 by Karl Pearson as a way to reduce a two-variable scatterplot to a single coordinate. It was used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects ; thus providing overall scores that summarize many tested variables at once. This idea of principal scores inspired the name Principal Component Analysis (abbreviated PCA). PCA is called an unsupervised learning technique because, as in clustering, it treats all variables as having the same status. We are not trying to predict or explain one particular variable’s value from the others; rather, we are trying to find a mathematical model for an underlying structure for all the variables. PCA is primarily an exploratory technique that produces maps that show the relations between variables and between observations in a useful way. We first provide a flavor of what this multivariate analysis does to the data. There is an elegant mathematical formulation of these methods through linear algebra, although here we will try to minimize its use and focus on visualization and data examples. We use geometrical projections that take points in higher-dimensional spaces and projects them down onto lower dimensions. Figure 7.5 shows the projection of the point $$A$$ onto the line generated by the vector $${\mathbf v}$$. PCA is a linear technique, meaning that we look for linear relations between variables and that we will use new variables that are linear functions of the original ones ($$f(ax+by)=af(x)+b(y)$$). The linearity constraints makes computations particularly easy. We will see non-linear techniques in Chapter 9. ### 7.3.1 Lower-dimensional projections Here we show one way of projecting two-dimensional data onto a line using the athletes data. The code below provides the preprocessing and plotting steps that were used to generate Figure 7.6: athletes = data.frame(scale(athletes)) ath_gg = ggplot(athletes, aes(x = weight, y = disc)) + geom_point(size = 2, shape = 21) ath_gg + geom_point(aes(y = 0), colour = "red") + geom_segment(aes(xend = weight, yend = 0), linetype = "dashed") 1. Calculate the variance of the red points in Figure 7.6. 2. Make a plot showing projection lines onto the $$y$$ axis and projected points. 3. Compute the variance of the points projected onto the vertical $$y$$ axis. ### 7.3.2 How do we summarize two-dimensional data by a line? In general, we lose information about the points when we project from two dimensions (a plane) to one (a line). If we do it just by using the original coordinates, as we did on the weight variable in Figure 7.6, we lose all the information about the disc variable. Our goal is to keep as much information as we can about both variables. There are actually many ways of projecting the point cloud onto a line. One is to use what are known as regression lines. Let’s look at these lines and how they are constructed in R. #### Regressing one variable on the other If you have seen linear regression, you already know how to compute lines that summarize scatterplots; linear regression is a supervised method that gives preference minimizing the residual sum of squares in one direction: that of the response variable. #### Regression of the disc variable on weight. In Figure 7.7, we use the lm (linear model) function to find the regression line. Its slope and intercept are given by the values in the coefficients slot of the resulting object reg1. reg1 = lm(disc ~ weight, data = athletes) a1 = reg1coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope pline1 = ath_gg + geom_abline(intercept = a1, slope = b1, col = "blue", linewidth = 1.5) pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
colour = "red", arrow = arrow(length = unit(0.15, "cm")))

#### Regression of weight on discus.

Figure 7.8 shows the line produced when reversing the roles of the two variables; weight becomes the response variable.

reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
col = "darkgreen", linewidth = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc), colour = "orange", arrow = arrow(length = unit(0.15, "cm"))) Each of the regression lines in Figures 7.7 and 7.8 gives us an approximate linear relationship between disc and weight. However, the relationship differs depending on which of the variables we choose to be the predictor and which the response. Question 7.6 How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, weight and disc. Pythagoras’ theorem tells us that the squared length of the hypotenuse of a right-angled triangle is equal to the sum of the squared lengths of the other two sides, which we apply as follows: var(athletes$weight) + var(reg1$fitted) [1] 1.650204 The variances of the points along the original axes weight and disc are 1, since we scaled the variables. #### A line that minimizes distances in both directions Figure 7.9 shows the line chosen to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the principal component line. All our three ways of fitting a line (Figures 7.77.9) together in one plot are shown in Figure 7.10. xy = cbind(athletes$disc, athletes$weight) svda = svd(xy) pc = xy %*% svda$v[, 1] %*% t(svda$v[, 1]) bp = svda$v[2, 1] / svdav[1, 1] ap = mean(pc[, 2]) - bp * mean(pc[, 1]) ath_gg + geom_segment(xend = pc[, 1], yend = pc[, 2]) + geom_abline(intercept = ap, slope = bp, col = "purple", linewidth = 1.5) Question 7.7 1. What is particular about the slope of the purple line? 2. Redo the plots on the original (unscaled) variables. What happens? The lines computed here depend on the choice of units. Because we have made the standard deviations equal to one for both variables, the PCA line is the diagonal that cuts exactly in the middle of both regression lines. Since the data were centered by subtracting their means, the line passes through the origin $$(0,0)$$. Question 7.8 Compute the variance of the points on the purple line. We have computed the coordinates of the points when we made the plot, these are in the pc vector: apply(pc, 2, var) [1] 0.9031761 0.9031761 sum(apply(pc, 2, var)) [1] 1.806352 We see that the variance along this axis is larger than the other variances we calculated in Question 7.6. Pythagoras’ theorem tells us two interesting things here: • If we are minimizing in both horizontal and vertical directions we are in fact minimizing the orthogonal projections onto the line from each point. • The total variability of the points is measured by the sum of squares of the projection of the points onto the center of gravity, which is the origin (0,0) if the data are centered. This is called the total variance or the inertia of the point cloud. This inertia can be decomposed into the sum of the squares of the projections onto the line plus the variances along that line. For a fixed variance, minimizing the projection distances also maximizes the variance along that line. Often we define the first principal component as the line with maximum variance. ## 7.4 The new linear combinations The PC line we found in the previous section could be written Image credit: Sara Holmes $PC = \frac{1}{2} \mbox{disc} + \frac{1}{2} \mbox{weight}. \tag{7.2}$ Principal components are linear combinations of the variables that were originally measured, they provide a new coordinate system. To understand what a linear combination really is, we can take an analogy. When making a healthy juice mix, you will follow a recipe like \begin{align} V &= 2 \times \text{Beet} + 1 \times \text{Carrot} \\ &+ \tfrac{1}{2} \text{Gala} + \tfrac{1}{2} \text{GrannySmith} \\ &+ 0.02 \times \text{Ginger} + 0.25 \times \text{Lemon}. \end{align} This recipe is a linear combination of individual juice types (the original variables). The result is a new variable, $$V$$, and the coefficients $$(2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)$$ are called the loadings. Question 7.9 How would you compute the calories in a glass of juice? ### 7.4.1 Optimal lines A linear combination of variables defines a line in higher dimensions in the same way we constructed lines in the scatterplot plane of two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data, there is however a best’ line for our purpose. The total variance of all the points in all the variables can de decomposed. In PCA, we use the fact that the total sums of squares of the distances between the points and any line can be decomposed into the distance to the line and the variance along the line. We saw that the principal component minimizes the distance to the line, and it also maximizes the variance of the projections along the line. Why is maximizing the variance along a line a good idea? Let’s look at another example of a projection from three dimensions into two. In fact, human vision depends on such dimension reduction: Question 7.10 In Figure 7.11, there is a two-dimensional projection of a three-dimensional object. What is the object? Question 7.11 Which of the two projections, Figure 7.11 or 7.13, do you find more informative, and why? One can argue that the projection that maximizes the area of the shadow shows more information’. ## 7.5 The PCA workflow PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction and then iterating to find the next best orthogonal axis, and so on. In fact, we do not have to run iterations, all the axes can be found in one linear algebra operation called the Singular Value Decomposition (we will delve more deeply into the details below). In the diagram in Figure 7.12, we see that first the means and variances are computed and the choice of whether to work directly with the covariance matrix or with the correlation matrix has to be made. The next step is the choice of $$k$$, the number of components we deem relevant to the data. We say that $$k$$ is the rank of the approximation. The best choice of $$k$$ is a difficult question, and we discuss on how to approach it below. The choice of $$k$$ requires looking at a plot of the variances explained by the successive principal components. Once we have chosen $$k$$, we can proceed to the projections of the data in the new $$k$$-dimensional subspace. The end results of the PCA workflow are useful maps of both the variables and the samples. Understanding how these maps are constructed will maximize the information we can gather from them. ## 7.6 The inner workings of PCA: rank reduction This is a small section for those whose background in linear algebra is but a faint memory. It tries to give some intuition to the singular value decomposition method underlying PCA, without too much notation. The singular value decomposition of a matrix finds horizontal and vertical vectors (called the singular vectors) and normalizing values (called singular values). As before, we start by giving the forward-generative explanation before doing the actual reverse engineering that is used in creating the decomposition. To calibrate the meaning of each step, we will start with an artificial example before moving to the complexity of real data. ### 7.6.1 Rank-one matrices A simple generative model demonstrates the meaning of the rank of a matrix and explains how we find it in practice. Suppose we have two vectors, $$u$$ (a one-column matrix), and $$v^t=t(v)$$ (a one-row matrix–the transpose of a one-column matrix $$v$$). For instance, $$u =\left( \begin{smallmatrix} 1\\2\\3\\4 \end{smallmatrix} \right)$$ and $$v =\left( \begin{smallmatrix} 2\\4\\8 \end{smallmatrix} \right)$$. The transpose of $$v$$ is written $$v^t = t(v) = (2\; 4\; 8)$$. We multiply a copy of $$u$$ by each of the elements of $$v^t$$ in turn as follows: Step 0: X 2 4 8 1 2 3 4 Step 1: X 2 4 8 1 2 2 4 3 6 4 8 Step 2: X 2 4 8 1 2 4 2 4 8 3 6 12 4 8 16 Step 3: X 2 4 8 1 2 4 8 2 4 8 16 3 6 12 24 4 8 16 32 Thus, the $$(2,3)$$ entry of the matrix $$X$$, written $$x_{2,3}$$, is obtained by multiplying $$u_2$$ by $$v_3$$. We can write this $X=\begin{pmatrix} 2&4&8\\ 4&8&16\\ 6 &12&24\\ 8&16&32\\ \end{pmatrix} = u * t(v)= u * v^t \tag{7.3}$ The matrix $$X$$ we obtain here is said to be of rank 1, because both $$u$$ and $$v$$ have one column. Question 7.12 Why can we say that writing $$X = u*v^t$$ is more economical than spelling out the full matrix $$X$$? $$X$$ has $$4\times3=12$$ elements, while in terms of $$u$$ and $$v$$ it can be expressed by only $$4+3=7$$ numbers. The compression is even more impressive when $$u$$ or $$v$$ are longer. On the other hand, suppose that we want to reverse the process and simplify another matrix $$X$$ given below with 3 rows and 4 columns (12 numbers). Can we always express it in a similar way as a product of vectors without loss of information? In the diagrams shown in Figures 7.14 and 7.15, the colored boxes have areas proportional to the numbers in the cells of the matrix (7.4). Question 7.13 Here is a matrix $$X$$ we want to decompose. $\begin{array}{rrrrr} \hline {\large X} & x_{.1} & x_{.2} & x_{.3} & x_{.4} \\ \hline x_{1.} & 780 & 936 & 1300 & 728\\ x_{2.} & 75 & 90 & 125 & 70\\ x_{3.} & 540 & 648 & 900 & 504\\ \hline \end{array} \tag{7.4}$ $$X$$ has been redrawn as series of rectangles in Figure 7.14. What numbers could we put in the white $$u$$ and $$v$$ boxes so that the values of the sides of the rectangle give the numbers as their product? A matrix with the special property of being perfectly “rectangular” like $$X$$ is said to be of rank 1. We can represent the numbers in $$X$$ by the areas of rectangles, where the sides of rectangles are given by the values in the side vectors ($$u$$ and $$v$$). We see in Figure 7.15 that the decomposition of $$X$$ is not unique: there are several candidate choices for the vectors $$u$$ and $$v$$. We will make the choice unique by requiring that the sum of the squares of each vector’s elements add to 1 (we say the vectors $$v$$ and $$u$$ have norm 1). Then we have to keep track of one extra number by which to multiply each of the products, and which represents the “overall scale” of $$X$$. This is the value we have put in the upper left hand corner. It is called the singular value $$s_1$$. In the R code below, we start by supposing we know the values in u, v and s1; later we will see a function that finds them for us. Let’s check the multiplication and norm properties in R: X = matrix(c(780, 75, 540, 936, 90, 648, 1300, 125, 900, 728, 70, 504), nrow = 3) u = c(0.8196, 0.0788, 0.5674) v = c(0.4053, 0.4863, 0.6754, 0.3782) s1 = 2348.2 sum(u^2) [1] 1 sum(v^2) [1] 1 s1 * u %*% t(v)  [,1] [,2] [,3] [,4] [1,] 780 936 1300 728 [2,] 75 90 125 70 [3,] 540 648 900 504 X - s1 * u %*% t(v)  [,1] [,2] [,3] [,4] [1,] -0.03419 0.0745 0.1355 0.1221 [2,] 0.00403 0.0159 0.0252 0.0186 [3,] -0.00903 0.0691 0.1182 0.0982 Question 7.14 Try svd(X) in R. Look at the components of the output of the svd function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of s1 = 2348.2 come from? svd(X)u[, 1]
svd(X)$v[, 1] sum(svd(X)$u[, 1]^2)
sum(svd(X)$v[, 1]^2) svd(X)$d

In fact, in this particular case we were lucky: we see that the second and third singular values are 0 (up to the numeric precision we care about). That is why we say that $$X$$ is of rank 1. For a more general matrix $$X$$, it is rare to be able to write $$X$$ exactly as this type of two-vector product. The next subsection shows how we can decompose $$X$$ when it is not of rank 1: we will just need more pieces.

### 7.6.2 How do we find such a decomposition in a unique way?

In the above decomposition, there were three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value. These can be found using the singular value decomposition function (svd). For instance:

Xtwo = matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,
18, 28, 52, 36, 18, 10.5, 64.5, 36), ncol = 4, byrow = TRUE)
USV = svd(Xtwo)

Question 7.15 Look at the USV object, the result of calling the svd function. What are its components?

names(USV)
[1] "d" "u" "v"
USV$d [1] 1.350624e+02 2.805191e+01 3.111680e-15 2.290270e-15 So 135.1 is the first singular value USV$d[1].

Question 7.16 Check how each successive pair of singular vectors improves our approximation to Xtwo. What do you notice about the third and fourth singular values?

Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1]) Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1]) -
USV$d[2] * USV$u[, 2] %*% t(USV$v[, 2]) The third and fourth singular values are so small that they do not improve the approximation (within rounding errors), so we can conclude that Xtwo is of rank 2. Again, there are many ways to write a rank two matrix such as Xtwo as a sum of rank one matrices: in order to ensure uniqueness, we impose yet another7 condition on the singular vectors. The output vectors of the singular decomposition do not only have their norms equal to 1, each column vector in the $$U$$ matrix is orthogonal to all the previous ones. We write $$u_{\cdot 1} \perp u_{\cdot 2}$$, this means that the sum of the products of the values in the same positions is $$0$$: $$\sum_i u_{i1} u_{i2} = 0$$. Ditto for the $$V$$ matrix. • 7 Above, we had chosen the norm of the vectors to be 1. • Check the orthonormality by computing the cross product of the $$U$$ and $$V$$ matrices: t(USV$u) %*% USV$u t(USV$v) %*% USV$v Let’s submit our scaledTurtles matrix to a singular value decomposition. turtles.svd = svd(scaledTurtles) turtles.svd$d
[1] 11.746475  1.419035  1.003329
turtles.svd$v  [,1] [,2] [,3] [1,] 0.5787981 0.3250273 0.74789704 [2,] 0.5779840 0.4834699 -0.65741263 [3,] 0.5752628 -0.8127817 -0.09197088 dim(turtles.svd$u)
[1] 48  3

Question 7.17 What can you conclude about the turtles matrix from the svd output?

The first column of turtles.svd$v shows that the coefficients for the three variables are practically equal. Other noticeable “coincidences” include: sum(turtles.svd$v[,1]^2)
[1] 1
sum(turtles.svd$d^2) / 47 [1] 3 We see that the coefficients are in fact $$\sqrt{1/3}$$ and the sum of squares of the singular values is equal to $$(n-1)\times p$$. ### 7.6.3 Singular value decomposition $$X$$ is decomposed additively into rank-one pieces. Each of the $$u$$ vectors is combined into the $$U$$ matrix, and each of $$v$$ vectors into $$V$$. The Singular Value Decomposition is ${\mathbf X} = U S V^t, V^t V={\mathbb I}, U^t U={\mathbb I}, \tag{7.5}$ where $$S$$ is the diagonal matrix of singular values, $$V^t$$ is the transpose of $$V$$, and $${\mathbb I}$$ is the Identity matrix. Expression 7.5 can be written elementwise as $X_{ij} = u_{i1}s_1v_{1j} + u_{i2}s_2v_{2j} + u_{i3}s_3v_{3j} +... + u_{ir}s_rv_{rj},$ $$U$$ and $$V$$ are said to be orthonormal8, because their self-crossproducts are the identity matrix. • 8 Nothing to do with the normal distribution, it stands for orthogonal and having norm 1. • ### 7.6.4 Principal components The singular vectors from the singular value decomposition (provided by the svd function in R) contain the coefficients to put in front of the original variables to make the more informative ones we call the principal components. We write this as: $Z_1=v_{11} X_{\cdot 1} +v_{21} X_{\cdot 2} + v_{31} X_{\cdot 3}+ \cdots + v_{p1} X_{\cdot p}.$ If usv = svd(X), then $$(v_{11},v_{21},v_{31},...)$$ are given by the first column of usv$v; similarly for $$Z_2$$ with the second column of usv$v, and so son. $$p$$ is the number of columns of $$X$$ and the number of rows of $$V$$. These new variables $$Z_1, Z_2, Z_3, ...$$ have variances that decrease in size: $$s_1^2 \geq s_2^2 \geq s_3^2 \geq ...$$. Question 7.18 Compute the first principal component for the turtles data by multiplying by the first singular value d[1] by u[,1]. What is another way of computing it ? We show this using the code: US = turtles.svd$u[,1, drop = FALSE] %*% turtles.svd$d[1] XV = scaledTurtles %*% turtles.svd$v[, 1, drop = FALSE]
max(abs(US-XV))

We can also see using matrix algebra that $$XV$$ and $$US$$ are the same. Remember that $$V$$ is orthogonal, so $$V^t V={\mathbb I}$$ and $$XV = USV^tV=US\,{\mathbb I}$$.

Note: The drop = FALSE argument in the first line of the below code makes sure that the selected matrix column retains matrix / array class attributes and thus is eligible for the matrix multiplication operator. Alternatively, you could use the regular multiplication operator *. In the second line, the drop = FALSE is not strictly necessary, but we have it there for symmetry.

Here are two useful facts, first in words, then with the mathematical shorthand.

The number of principal components $$k$$ is always chosen to be fewer than the number of original variables or the number of observations. We are “lowering” the dimension of the problem:

$k\leq \min(n,p).$

The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components:

$\max_{aX \perp bX}\mbox{var}(\mbox{Proj}_{aX} (X)), \qquad \mbox{where } bX=\mbox{previous components}$

## 7.7 Plotting the observations in the principal plane

We revisit our two-variable athletes data with the discus and the weight variables. In Section 7.3.2, we computed the first principal component and represented it as the purple line in Figure 7.10. We showed that $$Z_1$$ was the linear combination given by the diagonal. As the coefficients have to have their sum of squares add to $$1$$, we have that $Z_1=-0.707*\text{athletes\disc}- 0.707*\text{athletes\weight}.$

This is the same as if the two coordinates were $$c_1=0.7071$$ and $$c_2=0.7071$$.

Question 7.19 What part of the output of the svd functions leads us to the first PC coefficients, also known as the PC loadings ?

Note that we use svda which was the svd applied to the two variables discus and weight.
svda$v[,1] [1] -0.7071068 -0.7071068 If we rotate the (discus, weight) plane by making the purple line the horizontal $$x$$ axis, we obtain what is know as the first principal plane. ppdf = tibble(PC1n = -svda$u[, 1] * svda$d[1], PC2n = svda$u[, 2] * svda$d[2]) gg = ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + geom_hline(yintercept = 0, color = "purple", linewidth = 1.5, alpha = 0.5) + xlab("PC1 ")+ ylab("PC2") + xlim(-3.5, 2.7) + ylim(-2, 2) + coord_fixed() gg + geom_point(aes(x = PC1n, y = 0), color = "red") + geom_segment(aes(xend = PC1n, yend = 0), color = "red") gg + geom_point(aes(x = 0, y = PC2n), color = "blue") + geom_segment(aes(yend = PC2n, xend = 0), color = "blue") + geom_vline(xintercept = 0, color = "skyblue", linewidth = 1.5, alpha = 0.5)  Question 7.20 1. What is the mean of the sums of squares of the red segments in Figure 7.16 equal to? 2. How does this compare to the variance of the red points? 3. Compute the ratio of the standard deviation of the blue segments to the red segments in Figure 7.16. Compare this to the ratio of singular values 1 and 2. 1. The sum of squares of the red segments corresponds to the square of the second singular value: sum(ppdf$PC2n^2) 
[1] 6.196729
svda$d[2]^2 [1] 6.196729 Since the mean of the red segments is zero, the above quantities are also proportional to the variance: mean(ppdf$PC2n) 
[1] 5.451106e-16
var(ppdf$PC2n) * (nrow(ppdf)-1) [1] 6.196729 1. The variance of the red points is var(ppdf$PC1n), which is larger than what we calculated in a) by design of the first PC.
var(ppdf$PC1n)  [1] 1.806352 var(ppdf$PC2n) 
[1] 0.1936478
1. We take the ratios of the standard deviations explained by the points on the vertical and horizontal axes by computing:
sd(ppdf$PC1n) / sd(ppdf$PC2n)
[1] 3.054182
svda$d[1] / svda$d[2]
[1] 3.054182

Use prcomp to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.

### 7.7.1 PCA of the turtles data

We now want to do a complete PCA analysis on the turtles data. Remember, we already looked at the summary statistics for the one- and two-dimensional data. Now we are going to answer the question about the “true” dimensionality of these rescaled data.

In the following code, we use the function princomp. Its return value is a list of all the important pieces of information needed to plot and interpret a PCA.

cor(scaledTurtles)
          length     width    height
length 1.0000000 0.9783116 0.9646946
width  0.9783116 1.0000000 0.9605705
height 0.9646946 0.9605705 1.0000000
pcaturtles = princomp(scaledTurtles)
pcaturtles
Call:
princomp(x = scaledTurtles)

Standard deviations:
Comp.1    Comp.2    Comp.3
1.6954576 0.2048201 0.1448180

3  variables and  48 observations.
library("factoextra")
fviz_eig(pcaturtles, geom = "bar", bar_width = 0.4) + ggtitle("")

Question 7.21 Many PCA functions have been created by different teams who worked in different areas at different times. This can lead to confusion, especially because they have different naming conventions. Let’s compare three of them; run the following lines of code and look at the resulting objects:

svd(scaledTurtles)$v[, 1] prcomp(turtles[, -1])$rotation[, 1]
princomp(scaledTurtles)$loadings[, 1] library("ade4") dudi.pca(turtles[, -1], nf = 2, scannf = FALSE)$c1[, 1]

What happens when you disable the scaling in the prcomp and princomp functions?

In what follows, we always suppose that the matrix $$X$$ represents the centered and scaled matrix.

Question 7.22 The coordinates of the observations in the new variables from the prcomp function (call it res) are in the scores slot of the result. Take a look at PC1 for the turtles and compare it to res$scores. Compare the standard deviation sd1 to that in the res object and to the standard deviation of the scores. res = princomp(scaledTurtles) PC1 = scaledTurtles %*% res$loadings[,1]
sd1 = sqrt(mean(res$scores[, 1]^2)) Question 7.23 Check the orthogonality of the res$scores matrix. Why can’t we say that it is orthonormal?

Now we are going to combine both the PC scores ($$US$$) and the loadings-coefficients ($$V$$). The plots with both the samples and the variables represented are called biplots. This can be done in one line using the following factoextra package function.

fviz_pca_biplot(pcaturtles, label = "var", habillage = turtles[, 1]) +
ggtitle("")

Question 7.24 Is it possible to have a PCA plot with the PC1 as the horizontal axis whose height is longer than its width?

The variance of points in the PC1 direction is $$\lambda_1=s_1^2$$ which is always larger than $$\lambda_2=s_2^2$$, so the PCA plot will always be wider than high.

Question 7.25 Looking at Figure 7.18: a) Did the males or female turtles tend to be larger?
b) What do the arrows tell us about the correlations?

Question 7.26 Compare the variance of each new coordinate to the eigenvalues returned by the PCA dudi.pca function.

pcadudit = dudi.pca(scaledTurtles, nf = 2, scannf = FALSE)
apply(pcadudit$li, 2, function(x) sum(x^2)/48)  Axis1 Axis2 2.93573765 0.04284387  pcadudit$eig
[1] 2.93573765 0.04284387 0.02141848

Now we look at the relationships between the variables, both old and new by drawing what is known as the correlation circle. The aspect ratio is 1 here and the variables are represented by arrows as shown in Figure 7.19. The lengths of the arrows indicate the quality of the projections onto the first principal plane:

fviz_pca_var(pcaturtles, col.circle = "black") + ggtitle("") +
xlim(c(-1.2, 1.2)) + ylim(c(-1.2, 1.2))

Question 7.27 Explain the relationships between the number of rows of our turtles data matrix and the following numbers:

svd(scaledTurtles)$d/pcaturtles$sdev
sqrt(47)

When computing the variance covariance matrix, many implementations use $$1/(n-1)$$ as the denominator. Here, $$n=48$$ so the sum of the variances are off by a factor of 48/47.

These data are a good example of how sometimes almost all the variation in the data can be captured in a lower-dimensional space: here, three-dimensional data can be essentially replaced by a line. Keep in mind: $$X^tC=VSU^tUS=VS^2.$$ The principal components are the columns of the matrix $$C=US$$. The $$p$$ columns of $$U$$ (the matrix given as USV$u in the output from the svd function above) are rescaled to have norms $$(s_1^2,s_2^2,...,s_p^2)$$. Each column has a different variance it is responsible for explaining. Notice that these will be decreasing numbers. If we only want the first one then it is just $$c_1=s_1 u_1$$. Notice that $$||c_1||^2=s_1^tu_1 u_1^t s_1= s_1^2 u_1^tu_1=s_1^2=\lambda_1$$ If the matrix $$X$$ comes from the study of $$n$$ different samples or specimens, then the principal components provides new coordinates for these $$n$$ points as in Figure 7.16. These are sometimes called the scores in the results of PCA functions. Before we go into more detailed examples, let’s summarize what SVD and PCA provide: • Each principal component has a variance measured by the corresponding eigenvalue, the square of the corresponding singular value. • The new variables are made to be orthogonal. Since they are also centered, this means they are uncorrelated. In the case of normal distributed data, this also means they are independent. • When the variables are have been rescaled, the sum of the variances of all the variables is the number of variables ($$=p$$). The sum of the variances is computed by adding the diagonal of the crossproduct matrix9. • The principal components are ordered by the size of their eigenvalues. We always check the screeplot before deciding how many components to retain. It is also best practice to do as we did in Figure 7.18 and annotate each PC axis with the proportion of variance it explains. Eigen Decomposition: The crossproduct of X with itself verifies $X^tX=VSU^tUSV^t=VS^2V^t=V\Lambda V^t$ where $$V$$ is called the eigenvector matrix of the symmetric matrix $$X^tX$$ and $$\Lambda$$ is the diagonal matrix of eigenvalues of $$X^tX$$. • 9 This sum of the diagonal elements is called the trace of the matrix. • Look up eigenvalue in the Wikipedia. Try to find a sentence that defines it without using a formula. Why would eigenvectors come into use in Cinderella (at a stretch)? (See the xkcd cartoon in Figure 7.20.) For help with the basics of linear algebra, a motivated student pressed for time may consult Khan’s Academy. If you have more time and would like in depth coverage, Gil Strang’s MIT course is a classic, and some of the book is available online . ### 7.7.2 A complete analysis: the decathlon athletes We started looking at these data earlier in this chapter. Here, we will follow step by step a complete multivariate analysis. First, let us have another look at the correlation matrix (rounded to 2 digits after the decimal point), which captures the bivariate associations. We already plotted it as a colored heatmap in Figure 7.3. cor(athletes) |> round(2)  m100 long weight high m400 m110 disc pole javel m1500 m100 1.00 -0.54 -0.21 -0.15 0.61 0.64 -0.05 -0.39 -0.06 0.26 long -0.54 1.00 0.14 0.27 -0.52 -0.48 0.04 0.35 0.18 -0.40 weight -0.21 0.14 1.00 0.12 0.09 -0.30 0.81 0.48 0.60 0.27 high -0.15 0.27 0.12 1.00 -0.09 -0.31 0.15 0.21 0.12 -0.11 m400 0.61 -0.52 0.09 -0.09 1.00 0.55 0.14 -0.32 0.12 0.59 m110 0.64 -0.48 -0.30 -0.31 0.55 1.00 -0.11 -0.52 -0.06 0.14 disc -0.05 0.04 0.81 0.15 0.14 -0.11 1.00 0.34 0.44 0.40 pole -0.39 0.35 0.48 0.21 -0.32 -0.52 0.34 1.00 0.27 -0.03 javel -0.06 0.18 0.60 0.12 0.12 -0.06 0.44 0.27 1.00 0.10 m1500 0.26 -0.40 0.27 -0.11 0.59 0.14 0.40 -0.03 0.10 1.00 Then we look at the screeplot, which will help us choose a rank $$k$$ for representing the essence of these data. pca.ath = dudi.pca(athletes, scannf = FALSE) pca.ath$eig
 [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
[8] 0.3067981 0.2669494 0.1018542
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")

The screeplot in Figure 7.21 shows a clear drop in the eigenvalues after the second one. This indicates that a good approximation will be obtained at rank 2. Let’s look at an interpretation of the first two axes by projecting the loadings of the original variables onto the two new ones, the principal components.

fviz_pca_var(pca.ath, col.var = "blue", repel = TRUE) + ggtitle("")

The correlation circle Figure 7.22 displays the projection of the original variables onto the two first new principal axes. The angles between vectors are interpreted as correlations. On the right side of the plane, we have the track and field events (m110, m100, m400, m1500), and on the left, we have the throwing and jumping events. Maybe there is an opposition of skills as characterized in the correlation matrix. We did see the correlations were negative between variables from these two groups. How can we interpret this?

It seems that those who throw the best have lower scores in the track competitions. In fact, if we look at the original measurements, we can see what is happening. The athletes who run in short times are the stronger ones, as are the ones who throw or jump longer distances. We should probably change the scores of the track variables and redo the analysis.

Question 7.28 What transformations of the variables induce the best athletic performances to vary in the same direction, i.e., be mostly positively correlated?

If we change the signs on the running performances, almost all the variables will be positively correlated.

runningvars = grep("^m", colnames(athletes), value = TRUE)
runningvars
[1] "m100"  "m400"  "m110"  "m1500"
athletes[, runningvars] = -athletes[, runningvars]
cor(athletes) |> round(2)
       m100 long weight high  m400 m110  disc pole javel m1500
m100   1.00 0.54   0.21 0.15  0.61 0.64  0.05 0.39  0.06  0.26
long   0.54 1.00   0.14 0.27  0.52 0.48  0.04 0.35  0.18  0.40
weight 0.21 0.14   1.00 0.12 -0.09 0.30  0.81 0.48  0.60 -0.27
high   0.15 0.27   0.12 1.00  0.09 0.31  0.15 0.21  0.12  0.11
m400   0.61 0.52  -0.09 0.09  1.00 0.55 -0.14 0.32 -0.12  0.59
m110   0.64 0.48   0.30 0.31  0.55 1.00  0.11 0.52  0.06  0.14
disc   0.05 0.04   0.81 0.15 -0.14 0.11  1.00 0.34  0.44 -0.40
pole   0.39 0.35   0.48 0.21  0.32 0.52  0.34 1.00  0.27  0.03
javel  0.06 0.18   0.60 0.12 -0.12 0.06  0.44 0.27  1.00 -0.10
m1500  0.26 0.40  -0.27 0.11  0.59 0.14 -0.40 0.03 -0.10  1.00
pcan.ath = dudi.pca(athletes, nf = 2, scannf = FALSE)
pcan.ath$eig  [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952 [8] 0.3067981 0.2669494 0.1018542 Now all the negative correlations are quite small. The screeplot will show no change, as the eigenvalues of the matrix are unaffected by the above sign flips. The only ouput that changes are the signs of the coefficients of the principal component loadings for the variables whose signs we flipped. fviz_pca_var(pcan.ath, col.var = "blue", repel = TRUE) + ggtitle("") Figure 7.23 shows the correlation circle of the transformed variables. We now see we have a broad common overall axis: all the arrows are pointing broadly in the same direction. We now plot the athletes projected in the principal plane using: fviz_pca_ind(pcan.ath, repel = TRUE) + ggtitle("")  Question 7.29 If we look at the athletes themselves as they are shown in Figure 7.24, we notice a slight ordering effect. Do you see a relation between the performance of the athletes and their numbering in Figure 7.24 ? If you play join the dots following the order of the numbers, you will probably realize you are spending more time on one side of the plot than you would be if the numbers were randomly assigned. It turns out there is complementary information available in the olympic dataset. An extra vector variable called score reports the final scores at the competition, the men’s decathlon at the 1988 Olympics. olympic$score
 [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
[16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
[31] 7231 7016 6907

So let us look at the scatterplot comparing the first principal component coordinate of the athletes to this score. This is shown in Figure 7.25. We can see a strong correlation between the two variables. We note that athlete number 1 (who in fact won the Olympic decathlon gold medal) has the highest score, but not the highest value in PC1. Why do you think that is?

ggplot(data = tibble(pc1 = pcan.ath$li[, 1], score = olympic$score, label = rownames(athletes)),
mapping = aes(y = score, x = pc1)) +
geom_text(aes(label = label)) + stat_smooth(method = "lm", se = FALSE)

### 7.7.3 How to choose $$k$$, the number of dimensions ?

We have seen in the examples that the first step in PCA is to make the screeplot of the variances of the new variables (equal to the eigenvalues). We cannot decide how many dimensions are needed before seeing this plot. The reason is that there are situations when the principal components are ill-defined: when two or three successive PCs have very similar variances, giving a screeplot as in Figure 7.26, the subspace corresponding to a group of similar eigenvalues exists. In this case this would be 3D space generated by $$u_2,u_3,u_4$$. The vectors are not meaningful individually and one cannot interpret their loadings. This is because a very slight change in one observations could give a completely different set of three vectors. These would generate the same 3D space, but could have very different loadings. We say the PCs are unstable.

## 7.8 PCA as an exploratory tool: using extra information

We have seen that unlike regression, PCA treats all variables equally (to the extent that they were preprocessed to have equivalent standard deviations). However, it is still possible to map other continuous variables or categorical factors onto the plots in order to help interpret the results. Often we have complementary information on the samples, for example, diagnostic labels in the diabetes data or the cell types in the T-cell gene expression data.

Here we see how we can use such extra variables to inform our interpretation. The best place to store such so-called metadata is in appropriate slots of the data object (such as in the Bioconductor SummarizedExperiment class); the second-best, in additional columns of the data frame that also contains the numeric data. In practice, such information is often stored in a more or less cryptic manner in the row names of the matrix. Below, we need to face the latter scenario, and we use substr gymnastics to extract the cell types and show the screeplot in Figure 7.27 and the PCA in Figure 7.28.

pcaMsig3 = dudi.pca(Msig3transp, center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
fviz_screeplot(pcaMsig3) + ggtitle("")
ids = rownames(Msig3transp)
celltypes = factor(substr(ids, 7, 9))
status = factor(substr(ids, 1, 3))
table(celltypes)
celltypes
EFF MEM NAI
10   9  11 
label = rownames(pcamat1$li), number = substr(label, 3, 4), type = factor(substr(label, 1, 2)))) pcsplot = ggplot(dfmat1, aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) + geom_text(size = 4, vjust = -0.5)+ geom_point(size = 3)+ylim(c(-18,19)) pcsplot + geom_hline(yintercept = 0, linetype = 2) + geom_vline(xintercept = 0, linetype = 2) Question 7.30 Looking at Figure 7.30, do the samples seem to be randomly placed in the plane? Do you notice any structure explained by the labels? The answer becomes (even more) evident if you make this plot. Knockouts are always below their paired wildtype sample. We will revisit this example when we look at supervised multivariate methods in our next chapter. pcsplot + geom_line(colour = "red") ### 7.8.2 Biplots and scaling In the previous example, the number of variables measured was too large to enable useful concurrent plotting of both variables and samples. In this example we plot the PCA biplot of a simple data set where chemical measurements were made on different wines for which we also have a categorical wine.class variable. We start the analysis by looking at the two-dimensional correlations and a heatmap of the variables. library("pheatmap") load("../data/wine.RData") load("../data/wineClass.RData") wine[1:2, 1:7]  Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav 1 14.23 1.71 2.43 15.6 127 2.80 3.06 2 13.20 1.78 2.14 11.2 100 2.65 2.76 pheatmap(1 - cor(wine), treeheight_row = 0.2) A biplot is a simultaneous representation of both the space of observations and the space of variables. In the case of a PCA biplot like Figure 7.32 the arrows represent the directions of the old variables as they project onto the plane defined by the first two new axes. Here the observations are just colored dots, the color has been chosen according to which type of wine is being plotted. We can interpret the variables’ directions with regards to the sample points, for instance the blue points are from the barbera group and show higher Malic Acid content than the other wines. winePCAd = dudi.pca(wine, scannf=FALSE) table(wine.class) wine.class barolo grignolino barbera 59 71 48  fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class, col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) + ggtitle("") + coord_fixed() Interpretation of multivariate plots requires the use of as much of the available information as possible; here we have used the samples and their groups as well as the variables to understand the main differences between the wines. ### 7.8.3 An example of weighted PCA Sometimes we want to see variability between different groups or observations, but want to weight them. This can be the case if, e.g., the groups have very different sizes. Let’s re-examine the Hiiragi data we already saw in Chapter 3. In the code below, we select the wildtype (WT) samples and the top 100 features with the highest overall variance. data("x", package = "Hiiragi2013") xwt = x[, x$genotype == "WT"]
sel = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt = xwt[sel, ]
tab = table(xwt$sampleGroup) tab  E3.25 E3.5 (EPI) E3.5 (PE) E4.5 (EPI) E4.5 (PE) 36 11 11 4 4  xwt$weight = 1 / as.numeric(tab[xwt$sampleGroup]) pcaMouse = dudi.pca(as.data.frame(t(Biobase::exprs(xwt))), row.w = xwt$weight,
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_eig(pcaMouse) + ggtitle("")
fviz_pca_ind(pcaMouse, geom = "point", col.ind = xwt$sampleGroup) + ggtitle("") + coord_fixed() We see from tab that the groups are represented rather unequally. To account for this, we reweigh each sample by the inverse of its group size. The function dudi.pca in the ade4 package has a row.w argument into which we can enter the weights. The output of the code is in Figures 7.33 and 7.34. ## 7.9 Summary of this chapter Preprocessing matrices Multivariate data analyses require “conscious” preprocessing. After consulting all the means, variances and one-dimensional histograms we saw how to rescale and recenter the data. Projecting onto new variables We saw how we can make projections into lower dimensions (2D planes and 3D are the most frequently used) of high dimensional data without losing too much information. PCA searches for new “more informative” variables that are linear combinations of the original (old) ones. Matrix decomposition PCA is based on finding decompositions of the matrix $$X$$ called SVD. This decomposition provides a lower rank approximation and is equivalent to the eigendecomposition of $$X^tX$$. The squares of the singular values are equal to the eigenvalues and to the variances of the new variables. We systematically plotted these values before deciding how many axes are necessary to reproduce the signal in the data. Two eigenvalues which are quite close can give rise to scores or PC scores which are highly unstable. It is always necessary to look at the screeplot of the eigenvalues and avoid separating the axes corresponding to the these close eigenvalues. This may require using interactive three or four-dimensional projections, which are available in several R packages. Biplot representations The space of observations is naturally a $$p$$-dimensional space (the $$p$$ original variables provide the coordinates). The space of variables is $$n$$-dimensional. Both decompositions we have studied (singular values / eigenvalues and singular vectors / eigenvectors) provide new coordinates for both of these spaces, sometimes we call one the dual of the other. We can plot the projection of both the observations and the variables onto the same eigenvectors. This provides a biplot that can be useful for interpreting the PCA output. Projecting other group variables Interpretation of PCA can also be facilitated by redundant or contiguous data about the observations. ## 7.10 Further reading The best way to deepen your understanding of singular value decomposition is to read Chapter 7 of Strang (2009). The whole book sets the foundations for the linear algebra necessary to understanding the meaning of the rank of matrix and the duality between row spaces and column spaces . Complete textbooks have been written on the subject of PCA and related methods. Mardia, Kent, and Bibby (1979) is a standard text that covers all multivariate methods in a classical way, with linear algebra and matrices. By making the parametric assumptions that the data come from multivariate normal distributions, Mardia, Kent, and Bibby (1979) also provide inferential tests for the number of components and limiting properties for principal components. Jolliffe (2002) is a book-long treatment of everything to do with PCA with extensive examples. We can incorporate supplementary information into weights for the observations and the variables. This was introduced in the 1970’s by French data scientists, see Holmes (2006) for a review and Chapter 9 for further examples. Improvements to the interpretation and stability of PCA can be obtained by adding a penalty that minimizes the number of nonzero coefficients that appear in the linear combinations. Zou, Hastie, and Tibshirani (2006) and Witten, Tibshirani, and Hastie (2009) have developed sparse versions of principal component analysis, and their packages elasticnet and PMA provide implementations in R. ## 7.11 Exercises Exercise 7.1 Revise the material about svd by reading sections 1, 2, and 3 of the Wikipedia article about SVD. It will also be beneficial to read about the related eigenvalue decomposition by reading sections 1, 2, and 2.1 of the Wikipedia article about eigendecomposition of a matrix. We know that we can decompose a $$n$$ row by $$p$$ column rank 1 matrix $$X$$ as: $**X** = \begin{pmatrix} x_{11} & x_{12} & ... & x_{1p}\\ x_{21} & x_{22} & ... & x_{2p}\\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & ... & x_{np} \end{pmatrix} = \begin{pmatrix} u_{11} \\ u_{21} \\ \vdots \\ u_{n1} \\ \end{pmatrix} \times \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{p1} \end{pmatrix}$ 1. If $$X$$ has no rows and no columns which are all zeros, then is this decomposition unique? 2. Generate a rank-one matrix. Start by taking a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3, 6, 9, 12, then take their outer product. u = seq(2, 30, by = 2) v = seq(3, 12, by = 3) X1 = u %*% t(v) Why do we have to take t(v)? 1. Now we add some noise in the form a matrix we call Materr so we have an “approximately rank-one” matrix. Materr = matrix(rnorm(60,1),nrow=15,ncol=4) X = X1+Materr Visualize $$X$$ using ggplot. 1. Redo the same analyses with a rank 2 matrix. Note that X1 can also be computed as outer(u, v) ggplot(data=data.frame(X), aes(x=X1, y=X2, col=X3, size=X4)) + geom_point() Here we see that the data looks linear in all four dimensions. This is what it means to be of rank-one. Now let’s consider a rank 2 matrix. n = 100 p = 4 Y2 = outer(rnorm(n), rnorm(p)) + outer(rnorm(n), rnorm(p)) head(Y2)  [,1] [,2] [,3] [,4] [1,] -0.34642766 0.4500204 0.2453097 -0.1979948 [2,] 0.40992879 -1.7589255 -0.5800029 -0.7874186 [3,] 0.38865199 3.0618895 0.5673989 3.1935373 [4,] -0.78604388 -0.2768918 0.2499721 -1.5305821 [5,] -0.54264178 -2.3744740 -0.3432191 -2.8755210 [6,] 0.01111087 2.4578408 0.5761800 2.0659614 ggplot(data=data.frame(Y2), aes(x=X1, y=X2, col=X3, size=X4)) + geom_point() Now there are obviously at least two dimensions because if we project the data onto the first two coordinates (by default called X1 and X2 when you convert a matrix into a data frame in R), then the data varies in both dimensions. So the next step is to try to decide if there are more than two dimensions. The top right points are the closest to you (they’re biggest) and as you go down and left in the plot those points are farther away. In the left are the bluest points and they seem to get darker linearly as you move right. As you can probably tell, it is very hard to visually discover a low dimensional space in higher dimensions, even when “high dimensions” only means 4! This is one reason why we rely on the singular value decomposition. svd(Y2)$d # two non-zero eigenvalues
[1] 2.957254e+01 1.130368e+01 1.735344e-15 7.621118e-16
Y = Y2 + matrix(rnorm(n*p, sd=0.01),n,p) # add some noise to Y2
svd(Y)$d # four non-zero eigenvalues (but only 2 big ones) [1] 29.58236762 11.31576409 0.10065395 0.09342547 Here we have two dimensions which are non-zero and two dimensions which are approximately 0 (for “Y2”, they are within square root of computer tolerance of 0). Exercise 7.2 1. create a first a matrix of highly correlated bivariate data such as that shown in Figure 7.37. Hint: Use the function mvrnorm. Check the rank of the matrix by looking at its singular values. 1. perform a PCA and show the rotated principal component axes. 1. we generate correlated bivariate normal data using: library("MASS") mu1 = 1; mu2 = 2; s1=2.5; s2=0.8; rho=0.9; sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2) sim2d = data.frame(mvrnorm(50, mu = c(mu1,mu2), Sigma = sigma)) svd(scale(sim2d))$d
[1] 9.575684 2.511230
svd(scale(sim2d))$v[,1] [1] -0.7071068 -0.7071068 1. We use prcomp to perform a PCA and the scores provide the desired rotation. respc = princomp(sim2d) dfpc = data.frame(pc1=respc$scores[,1],
pc2=respc\$scores[,2])

ggplot(data.frame(sim2d), aes(x=X1,y=X2)) + geom_point()
ggplot(dfpc, aes(x=pc1, y=pc2)) + geom_point() + coord_fixed(2)

Exercise 7.3

Part (a) in Figure 7.37 shows a very elongated plotting region, why?
What happens if you do not use the coord_fixed() option and have a square plotting zone? Why can this be misleading?

Exercise 7.4

Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches.

1. make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)?

2. make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the the sample-points.

Page built on 2023-08-03 21:37:40.828927 using R version 4.3.0 (2023-04-21)

Support for maintaining the online version of this book is provided by de.NBI