Multivariate Analysis

Susan Holmes (c) STAMPS, August, 2014


Matrices and their Motivation

It is now common practice to measure many variables on the same patients. For 100 patients we may have biometrical characteristics such as height, weight, BMI and age, as well as clinical variables such as blood pressure, blood sugar and heart rate; these variables will not be independent.

What is our data?

Diabetes

diabetes=read.table(url("http://bios221.stanford.edu/data/diabetes.txt"),header=TRUE,row.names=1)
diabetes[1:4,]
##   relwt glufast glutest steady insulin Group
## 1  0.81      80     356    124      55     3
## 3  0.94     105     319    143     105     3
## 5  1.00      90     323    240     143     3
## 7  0.91     100     350    221     119     3

Microbial Ecology

RNA-Seq

Mass Spectrometry

Expression Data (microarray)

#######Melanoma/Tcell Data: Peter Lee, Susan Holmes, PNAS.
load(url("http://bios221.stanford.edu/data/Msig3transp.RData"))
round(Msig3transp,2)[1:5,1:6]
##              X3968 X14831 X13492 X5108 X16348  X585
## HEA26_EFFE_1 -2.61  -1.19  -0.06 -0.15   0.52 -0.02
## HEA26_MEM_1  -2.26  -0.47   0.28  0.54  -0.37  0.11
## HEA26_NAI_1  -0.27   0.82   0.81  0.72  -0.90  0.75
## MEL36_EFFE_1 -2.24  -1.08  -0.24 -0.18   0.64  0.01
## MEL36_MEM_1  -2.68  -0.15   0.25  0.95  -0.20  0.17
celltypes=factor(substr(rownames(Msig3transp),7,9))
status=factor(substr(rownames(Msig3transp),1,3))

Biometrical Measurements

#require(MSBdata)
Turtles=read.table("http://bios221.stanford.edu/data/PaintedTurtles.txt",header=TRUE)
Turtles[1:4,]
##   sex length width height
## 1   f     98    81     38
## 2   f    103    84     38
## 3   f    103    86     42
## 4   f    105    86     40
require(ade4)
## Loading required package: ade4
data(lizards)
lizards$traits[1:4,c(1,5,6,7,8)]
##    mean.L hatch.m clutch.S age.mat clutch.F
## Sa   69.2   0.572      6.0      13      1.5
## Sh   48.4   0.310      3.2       5      2.0
## Tl  168.4   2.235     16.9      19      1.0
## Mc   66.1   0.441      7.2      11      1.5

Data visualization and preparation

It is always beneficial to start a multidimensional analysis by checking the simple one dimensional and two dimensional summary statistics. We can visualize these using GGally, a graphics package that builds on ggplot2.

Diabetes data: the first 5 variables are numeric and the last is a factor variable encoding the specific patient group of the subjects.
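
A minimal sketch of such a pairwise overview (it assumes the GGally package is installed; the column selection is illustrative):

require(GGally)
### pairwise scatterplots, densities and correlations for the five numeric variables
ggpairs(diabetes[,1:5])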

Various scales

Transform the data: one choice is to make all the variables (the columns of the matrix) have the same standard deviation; this puts all the variables on the same footing in terms of their variance.

We also make sure that the means of all columns are zero; this is called centering.

apply(diabetes[,-6],2,sd)
##   relwt glufast glutest  steady insulin 
##   0.128  61.338 306.221 120.505 106.243
apply(diabetes[,-6],2,mean)
##   relwt glufast glutest  steady insulin 
##   0.979 120.431 536.500 187.306 183.729
diabetes.scaled=scale(diabetes)
###Replace the group variable by its original values
diabetes.scaled[,6]=diabetes[,6]
apply(diabetes.scaled[,-6],2,mean)
##     relwt   glufast   glutest    steady   insulin 
## -2.82e-16 -1.31e-17  1.02e-18  9.51e-17  9.31e-17
apply(diabetes.scaled[,-6],2,sd)
##   relwt glufast glutest  steady insulin 
##       1       1       1       1       1

A Little History

Invented in 1901 by Karl Pearson as a way to reduce a two variable scatterplot to a single coordinate.

Used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects (Hotelling, 1933), extracting overall scores that could summarize many variables at once. It is called Principal Component Analysis (abbreviated PCA). (Useful books with relevant chapters are Flury for an introductory account and Mardia for a more mathematical approach.)

Dimension reduction

PCA is an ‘unsupervised learning technique’ because it treats all variables as having the same status.

PCA is primarily a visualization technique that produces maps showing the relations between the variables in a useful way. We will use some geometrical notions such as projections, which take points in higher dimensions and project them down onto lower dimensions.

We are going to give you a flavor of what is called multivariate analysis. As a useful first approximation we formulate many of the methods through matrix manipulations, that is, linear algebra.

Linear Algebra

PCA is a linear technique (based on linear algebra) and thus particularly tractable.

There are also non-linear methods, which can be useful.

We will see in the next chapter, where we study multidimensional scaling methods, that there are even 'non metric' methods that can capture much more general relationships and show data in a useful way, not just by projecting them linearly onto a plane.

The raison d’être for multivariate analyses is connections or associations between the different variables. If the columns of the matrix are unrelated, we should just study each column separately and do standard univariate statistics on them one by one.

Low Dimensional Projections

Here we show one way of projecting two dimensional data onto a line.

The olympic data come from the ade4 package; they are the performances of decathlon athletes in an olympic competition.

plot of chunk SS

Scatterplot of two variables showing projection on the x coordinate in red.

How do we summarize two dimensional data by a line?

In general, we lose information about the points when we project down from two dimensions (a plane) to one (a line).

If we do it just by using the original coordinates, for instance the x coordinate as we did above, we lose all the information about the second one. There are actually many ways of projecting the point cloud onto a line. One is to use what are known as regression lines. Let’s look at these lines and how they are constructed in R:

Regressing one variable on the other

The disc variable on the weight

plot of chunk Reg1

The blue line minimizes the sum of squares of the vertical residuals (in red).

Regression of weight on discus

plot of chunk Reg2


The orange line minimizes the horizontal residuals for the weight variable in orange.

A line that minimizes in both directions

plot of chunk PCAmin

Minimizing the distance to the line in both directions: the purple line is the principal component line; the green and blue lines are the regression lines.


The purple line minimizes both residuals and thus (through Pythagoras) it minimizes the sum of squared distances from the points to the line.
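
A minimal sketch of how these three lines can be constructed in R, using the Turtles data loaded above as a stand-in for the decathlon variables shown in the figures:

x=Turtles$length; y=Turtles$width
plot(x,y)
abline(lm(y~x),col="blue")                    ### regression of y on x: minimizes vertical residuals
fitxy=lm(x~y)                                 ### regression of x on y: minimizes horizontal residuals
abline(a=-coef(fitxy)[1]/coef(fitxy)[2],b=1/coef(fitxy)[2],col="orange")
pc=prcomp(cbind(x,y))                         ### principal component line: minimizes orthogonal distances
slope=pc$rotation[2,1]/pc$rotation[1,1]
abline(a=mean(y)-slope*mean(x),b=slope,col="purple")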

Notes about Lines

The line created here is sensitive to the choice of units, and to the center of the cloud.

Note that Pythagoras’ theorem tells us something interesting here: if we are minimizing in both the horizontal and vertical directions, we are in fact minimizing the orthogonal (diagonal) projections onto the line from each point.

Principal Components are Linear Combinations of the ‘old’ variables

To understand what a linear combination really is, we can use an analogy: when making a healthy juice mix, you can follow a recipe.


\[V=2\times \mbox{ Beets }+ 1\times \mbox{Carrots } +\frac{1}{2} \mbox{ Gala}+ \frac{1}{2} \mbox{ GrannySmith} +0.02\times \mbox{ Ginger} +0.25 \mbox{ Lemon }\] This recipe is a linear combination of individual juice types; in our analogy the juices play the role of the original variables. The result is a new variable, and the coefficients \((2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)\) are called the loadings.
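
A minimal sketch of the same idea on data already loaded above: a new variable built as a linear combination of the three (scaled) turtle measurements. The loadings here are hypothetical, chosen only for illustration.

loadings=c(0.58,0.58,0.58)                 ### hypothetical, roughly equal weights
newvar=scale(Turtles[,-1]) %*% loadings    ### one value (score) per turtle
head(newvar)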

Optimal lines

A linear combination of variables defines a line in our space, in the same way as we saw lines in the scatterplot plane for two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data; there is however a ‘best’ line for our purposes.

Total variance can be decomposed: for any line, the total sum of squares of the points about their center splits into the sum of squared distances to the line plus the sum of squares of the projections along the line.

We saw that the principal component line minimizes the distance to the line; it therefore also maximizes the variance of the projections along the line.
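
A sketch verifying this decomposition numerically on the scaled turtle data (any of the datasets above would do):

X=scale(Turtles[,-1])
pc=prcomp(X)
total=sum(X^2)                                      ### total sum of squares
along=sum(pc$x[,1]^2)                               ### sum of squares along the first PC line
resid=sum((X - pc$x[,1]%*%t(pc$rotation[,1]))^2)    ### squared distances to that line
all.equal(total, along+resid)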

Good Projections


Which projection do you think is better? It is the one that maximizes the area of the shadow; an equivalent measure is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible, and that is what PCA does.

The PCA workflow

Many choices have to be made during PCA processing.

PCA is based on the principle of finding the largest axis of inertia/variability and then iterating to find the next best axis which is orthogonal to the previous one and so on.

The Inner Workings of PCA: the Singular Value Decomposition

The eigenvalues of \(X'X\), or equivalently the singular values of \(X\), tell us the rank of \(X\): the rank is the number of nonzero ones.

What does rank mean?

   X |  2  4  8
  ---|---------
   1 |
   2 |
   3 |
   4 |

   X |  2  4  8
  ---|---------
   1 |  2
   2 |  4
   3 |  6
   4 |  8

   X |  2  4  8
  ---|---------
   1 |  2  4  8
   2 |  4  8 16
   3 |  6 12 24
   4 |  8 16 32

We say that the matrix \[ \bigl(\begin{smallmatrix} 2&4&8\\ 4&8&16\\ 6 &12&24\\ 8&16&32\\ \end{smallmatrix} \bigr) \] is of rank one: every entry is the product of its row label \((1,2,3,4)\) and its column label \((2,4,8)\), so the matrix is the outer product of those two vectors.
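
A sketch of the same construction in R: the matrix is an outer product, and its singular values confirm that only the first one is nonzero (rank one).

rankone=outer(1:4, c(2,4,8))      ### the 4 x 3 multiplication table above
rankone
svd(rankone)$d                    ### only the first singular value is (numerically) nonzero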

Backwards from the matrix to decomposition

X=matrix(c(780, 75, 540, 936, 90, 648, 1300, 125, 900,
          728, 70, 504),nrow=3)
X
##      [,1] [,2] [,3] [,4]
## [1,]  780  936 1300  728
## [2,]   75   90  125   70
## [3,]  540  648  900  504
u1=c(0.8,0.1,0.6)      ### a rounded approximation to the first left singular vector
v1=c(0.4,0.5,0.7,0.4)  ### a rounded approximation to the first right singular vector
sum(u1^2)              ### should be 1 for a unit vector; close but not exact
## [1] 1.01
sum(v1^2)
## [1] 1.06
s1=2348.2              ### an approximation to the first singular value
s1*u1 %*%t(v1)
##       [,1] [,2] [,3]  [,4]
## [1,] 751.4  939 1315 751.4
## [2,]  93.9  117  164  93.9
## [3,] 563.6  704  986 563.6
X-s1*u1 %*%t(v1)
##       [,1]   [,2]  [,3]  [,4]
## [1,]  28.6  -3.28 -15.0 -23.4
## [2,] -18.9 -27.41 -39.4 -23.9
## [3,] -23.6 -56.46 -86.2 -59.6

Graphical Decompositions

Decompose1

Matrix \(X\) we would like to decompose.

Decompose1

Areas are proportional to the entries

Decompose2

Looking at different possible margins

Decompose3

Forcing the margins to have norm \(1\)

Check with R

## ----checkX--------------------------------------------------------------
u1=c(0.8196, 0.0788, 0.5674)
v1=c(0.4053, 0.4863, 0.6754, 0.3782)
s1=2348.2
s1*u1 %*%t(v1)
##      [,1] [,2] [,3] [,4]
## [1,]  780  936 1300  728
## [2,]   75   90  125   70
## [3,]  540  648  900  504
Xsub=matrix(c(12.5 , 35.0 , 25.0 , 25,9,14,26,18,16,21,49,
           32,18,28,52,36,18,10.5,64.5,36),ncol=4,byrow=T)
Xsub
##      [,1] [,2] [,3] [,4]
## [1,] 12.5 35.0 25.0   25
## [2,]  9.0 14.0 26.0   18
## [3,] 16.0 21.0 49.0   32
## [4,] 18.0 28.0 52.0   36
## [5,] 18.0 10.5 64.5   36
USV=svd(Xsub)
USV
## $d
## [1] 1.35e+02 2.81e+01 3.10e-15 1.85e-15
## 
## $u
##        [,1]    [,2]    [,3]   [,4]
## [1,] -0.344  0.7717  0.5193 -0.114
## [2,] -0.264  0.0713 -0.3086 -0.504
## [3,] -0.475 -0.0415 -0.0386  0.803
## [4,] -0.528  0.1426 -0.6423 -0.103
## [5,] -0.554 -0.6143  0.4702 -0.280
## 
## $v
##        [,1]    [,2]   [,3]    [,4]
## [1,] -0.250  0.0404 -0.967  0.0244
## [2,] -0.343  0.8798  0.133  0.3010
## [3,] -0.755 -0.4668  0.186  0.4214
## [4,] -0.500  0.0808  0.111 -0.8551
## ----CheckUSV------------------------------------------------------------
Xsub-(135*USV$u[,1]%*%t(USV$v[,1]))
##         [,1]   [,2]    [,3]    [,4]
## [1,]  0.8802  19.05 -10.088  1.7604
## [2,]  0.0849   1.76  -0.921  0.1698
## [3,] -0.0396  -1.01   0.565 -0.0792
## [4,]  0.1698   3.53  -1.842  0.3397
## [5,] -0.6877 -15.15   8.069 -1.3754
Xsub-(135*USV$u[,1]%*%t(USV$v[,1]))-(28.1*USV$u[,2]%*%t(USV$v[,2]))
##         [,1]     [,2]   [,3]    [,4]
## [1,] 0.00387 -0.02528 0.0335 0.00774
## [2,] 0.00398  0.00264 0.0140 0.00796
## [3,] 0.00749  0.01192 0.0214 0.01498
## [4,] 0.00796  0.00527 0.0281 0.01592
## [5,] 0.00983  0.03784 0.0123 0.01965
Xsub- USV$d[1]*USV$u[,1]%*%t(USV$v[,1])-USV$d[2]*USV$u[,2]%*%t(USV$v[,2])
##          [,1]      [,2]     [,3]     [,4]
## [1,] 7.22e-15 -1.07e-14 8.88e-15 4.88e-15
## [2,] 2.04e-15 -6.00e-15 1.05e-14 3.22e-15
## [3,] 2.87e-15 -9.55e-15 1.55e-15 6.23e-15
## [4,] 4.39e-15 -5.77e-15 1.78e-14 7.05e-15
## [5,] 5.11e-15 -1.78e-15 1.78e-14 1.78e-14

The Same Example, Using the Exact Singular Values

Here the same subtraction is done step by step using the exact singular values stored in USV$d; after removing two rank-one terms the residual is zero to machine precision.

USV=svd(Xsub)
XS1=Xsub-USV$d[1]*(USV$u[,1]%*% t(USV$v[,1]))
XS1
##         [,1]   [,2]    [,3]   [,4]
## [1,]  0.8748  19.05 -10.104  1.750
## [2,]  0.0808   1.76  -0.933  0.162
## [3,] -0.0470  -1.02   0.543 -0.094
## [4,]  0.1616   3.52  -1.866  0.323
## [5,] -0.6963 -15.16   8.043 -1.393
XS2=XS1-USV$d[2]*(USV$u[,2]%*% t(USV$v[,2]))
XS2
##          [,1]      [,2]     [,3]     [,4]
## [1,] 7.22e-15 -1.07e-14 8.88e-15 4.88e-15
## [2,] 2.04e-15 -6.00e-15 1.05e-14 3.22e-15
## [3,] 2.87e-15 -9.55e-15 1.55e-15 6.23e-15
## [4,] 4.39e-15 -5.77e-15 1.78e-14 7.05e-15
## [5,] 5.11e-15 -1.78e-15 1.78e-14 1.78e-14

Special Example of Rank one matrix: independence

require(ade4)
HairColor=HairEyeColor[,,2]
HairColor
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
chisq.test(HairColor)
## Warning: Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  HairColor
## X-squared = 107, df = 9, p-value < 2.2e-16
### prows and pcols are the marginal row and column proportions;
### 313 is the total count, sum(HairColor)
prows=rowSums(HairColor)/sum(HairColor)
pcols=colSums(HairColor)/sum(HairColor)
Indep=313*as.matrix(prows)%*%t(as.matrix(pcols))
round(Indep)
### Indep, the expected table under independence, is a rank one matrix
### (the outer product of the two margins)
sum((Indep-HairColor)^2/Indep)
### this sum is the Pearson chi-squared statistic reported above (about 107)

SVD for real data

## ------------------------------------------------------------------------
diabetes.svd=svd(scale(diabetes[,-5]))
names(diabetes.svd)
## [1] "d" "u" "v"
diabetes.svd$d
## [1] 20.09 13.38  9.89  5.63  1.70
Turtles.svd=svd(scale(Turtles[,-1]))
Turtles.svd$d
## [1] 11.75  1.42  1.00

Principal Components

The singular vectors from the singular value decomposition (the svd function above) give us the coefficients to put in front of the old variables to make our new ones with better properties. We write this as: \[PC_1=c_1 X_{\bullet 1} +c_2 X_{\bullet 2}+ c_3 X_{\bullet 3}+\cdots + c_p X_{\bullet p}\]
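
A minimal sketch on the scaled turtle data: the first principal component computed directly as this linear combination, with coefficients taken from the first column of svd()$v.

X=scale(Turtles[,-1])
USV=svd(X)
pc1=X %*% USV$v[,1]                              ### c_1 = X v_1
all.equal(as.vector(pc1), USV$d[1]*USV$u[,1])    ### identical to s_1 u_1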

Mathematical Facts (in mathematical shorthand)

The number of principal components is less than or equal to the number of original variables. \[K\leq \min(n,p)\]

The principal component transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each successive component in turn has the highest possible variance under the constraint that it be orthogonal to the preceding components: \[\max_{\|a\|=1} \mbox{var}(Xa)\]

Suppose the matrix of data \(X\) has been made to have column means 0 and standard deviations 1.

Matrix Decomposition

We write our horizontal/vertical decomposition of the matrix \(X\) in shorthand as: \[X = USV', \qquad V'V=I, \qquad U'U=I,\] where \(S\) is the diagonal matrix of singular values (given by the d component of the R svd function). The crossproduct of \(X\) with itself verifies \[X'X=VSU'USV'=VS^2V'=V\Lambda V'\] where \(V\) is called the eigenvector matrix of the symmetric matrix \(X'X\) and \(\Lambda\) is the diagonal matrix of eigenvalues of \(X'X\).
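
A sketch checking this relation numerically on the scaled turtle data: the eigenvalues of \(X'X\) are the squared singular values of \(X\).

X=scale(Turtles[,-1])
USV=svd(X)
E=eigen(crossprod(X))              ### crossprod(X) is X'X
all.equal(E$values, USV$d^2)       ### Lambda = S^2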

Why are Eigenvectors Useful?

Why would eigenvectors come into use in Cinderella?

Khan Academy

We call the principal components the columns of the matrix, \(C=US\).

The columns of U (the matrix given as USV$u in the output from the svd function above) are rescaled by the singular values so that each principal component has squared norm \(s^2\), the variance it is responsible for. If the matrix \(X\) comes from the study of \(n\) different samples or specimens, then the principal components provide new coordinates for these \(n\) points; these are sometimes also called the scores in some of the (many) PCA functions available in R (princomp, prcomp, dudi.pca in ade4).

Transition Formulae

If we only want the first one then it is just \(c_1=s_1 u_1\). Notice that \(\|c_1\|^2 = s_1^2\, u_1'u_1 = s_1^2 = \lambda_1\).

\(X'C=VSU'US=VS^2\)

Remarks:

  1. Each principal component is chosen to maximize the variance it explains; this variance is measured by the corresponding eigenvalue.
  2. The new variables are made to be orthogonal; if the data are multivariate normal, the new variables will also be independent.
  3. When the variables are rescaled, or equivalently when we choose to study the correlation matrix instead of the covariance matrix, the sum of the variances of all the variables equals the number of variables (=p); this is sometimes called the trace (see the sketch after this list).
  4. The principal components are always ordered by "importance"; always look at what proportion of the variability you are interpreting (and check the screeplot before deciding how many components to keep).
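
A quick check of Remark 3 on the standardized turtle data (a sketch; the eigenvalues here are the squared singular values divided by \(n-1\)):

X=scale(Turtles[,-1])
sum(svd(X)$d^2)/(nrow(X)-1)     ### equals 3, the number of (standardized) variables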

A few examples of using PCA

We start with the turtles data, which has 3 continuous variables and a sex variable that we leave out for the initial PCA.

Turtles Data

When computing the variance-covariance matrix, many programs use 1/(n-1) as the denominator; here n=48, so the sum of the variances is off by a small fudge factor of 48/47.

turtles3var=Turtles[,-1]
apply(turtles3var,2,mean)
## length  width height 
##  124.7   95.4   46.3
turtles.pca=princomp(turtles3var)
print(turtles.pca)
## Call:
## princomp(x = turtles3var)
## 
## Standard deviations:
## Comp.1 Comp.2 Comp.3 
##  25.06   2.26   1.94 
## 
##  3  variables and  48 observations.
(25.06^2+2.26^2+1.94^2)*(48/47)
## [1] 650
apply(turtles3var,2,var)
## length  width height 
##  419.5  160.7   70.4
apply(Turtles[,-1],2,sd)
## length  width height 
##  20.48  12.68   8.39
turtlesc=scale(Turtles[,-1])
cor(turtlesc)
##        length width height
## length  1.000 0.978  0.965
## width   0.978 1.000  0.961
## height  0.965 0.961  1.000
pca1=princomp(turtlesc)
pca1
## Call:
## princomp(x = turtlesc)
## 
## Standard deviations:
## Comp.1 Comp.2 Comp.3 
##  1.695  0.205  0.145 
## 
##  3  variables and  48 observations.

Step one: always the screeplot

The screeplot showing the eigenvalues for the standardized data: one very large component in this case and two very small ones, the data are (almost) one dimensional.

pca.turtles=dudi.pca(Turtles[,-1],scannf=F,nf=2)
scatter(pca.turtles)

plot of chunk Turtlesbiplot

Step Two: Variables

plot of chunk TurtlesCircle

Biplot

plot of chunk Turtlesbiplotfunction

All together

scatter(pca.turtles)

plot of chunk scatterturtles

Exercise: How are the following numbers related?

svd(turtlesc)$d/pca1$sdev
## [1] 6.86 6.86 6.86
sqrt(47)
## [1] 6.86
nrow(turtlesc)
## [1] 48

Lizards Data Analyses

This data set describes 18 lizard species, as reported by Bauwens and Díaz-Uriarte (1997); it gives life-history traits for these 18 species.

mean.L (mean length (mm)), matur.L (length at maturity (mm)), max.L (maximum length (mm)), hatch.L (hatchling length (mm)), hatch.m (hatchling mass (g)), clutch.S (Clutch size), age.mat (age at maturity (number of months of activity)), clutch.F (clutch frequency).

library(ade4)
data(lizards)
names(lizards)
## [1] "traits" "hprA"   "hprB"
lizards$traits[1:4,]
##    mean.L matur.L max.L hatch.L hatch.m clutch.S age.mat clutch.F
## Sa   69.2      58    82    27.8   0.572      6.0      13      1.5
## Sh   48.4      42    56    22.9   0.310      3.2       5      2.0
## Tl  168.4     132   190    42.8   2.235     16.9      19      1.0
## Mc   66.1      56    72    25.0   0.441      7.2      11      1.5

It is always a good idea to check the variables one at a time and two at a time to see what the basic statistics are for the data.

tabtraits=lizards$traits
options(digits=2)
colMeans(tabtraits)
##   mean.L  matur.L    max.L  hatch.L  hatch.m clutch.S  age.mat clutch.F 
##    71.34    59.39    82.83    26.88     0.56     5.87    10.89     1.56
cor(tabtraits)
##          mean.L matur.L max.L hatch.L hatch.m clutch.S age.mat clutch.F
## mean.L     1.00    0.99  0.99    0.89    0.94     0.92    0.77    -0.48
## matur.L    0.99    1.00  0.99    0.90    0.92     0.92    0.79    -0.49
## max.L      0.99    0.99  1.00    0.88    0.92     0.91    0.78    -0.51
## hatch.L    0.89    0.90  0.88    1.00    0.96     0.72    0.58    -0.42
## hatch.m    0.94    0.92  0.92    0.96    1.00     0.78    0.64    -0.45
## clutch.S   0.92    0.92  0.91    0.72    0.78     1.00    0.81    -0.55
## age.mat    0.77    0.79  0.78    0.58    0.64     0.81    1.00    -0.62
## clutch.F  -0.48   -0.49 -0.51   -0.42   -0.45    -0.55   -0.62     1.00

Biplot

require(ade4)
res=dudi.pca(tabtraits,scannf=F,nf=2)
res
## Duality diagramm
## class: pca dudi
## $call: dudi.pca(df = tabtraits, scannf = F, nf = 2)
## 
## $nf: 2 axis-components saved
## $rank: 8
## eigen values: 6.5 0.83 0.42 0.17 0.045 ...
##   vector length mode    content       
## 1 $cw    8      numeric column weights
## 2 $lw    18     numeric row weights   
## 3 $eig   8      numeric eigen values  
## 
##   data.frame nrow ncol content             
## 1 $tab       18   8    modified array      
## 2 $li        18   2    row coordinates     
## 3 $l1        18   2    row normed scores   
## 4 $co        8    2    column coordinates  
## 5 $c1        8    2    column normed scores
## other elements: cent norm
biplot(res)

plot of chunk lizardbiplot

res$eig/(sum(res$eig))
## [1] 0.81118 0.10387 0.05219 0.02133 0.00563 0.00488 0.00061 0.00031

The Decathlon Athletes
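
The athletes data frame is not created in the code shown here; a minimal sketch of how it could be set up is below, assuming it is the decathlon table olympic$tab from ade4 (also used later in this section) with columns renamed to match the output that follows.

data(olympic)       ### decathlon data from the ade4 package
athletes=olympic$tab
### assumed renaming so the columns match the correlation matrix printed below
colnames(athletes)=c("m100","long","weight","highj","m400",
                     "m110","disc","pole","javel","m1500")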

round(cor(athletes),1)
##        m100 long weight highj m400 m110 disc pole javel m1500
## m100    1.0 -0.5   -0.2  -0.1  0.6  0.6  0.0 -0.4  -0.1   0.3
## long   -0.5  1.0    0.1   0.3 -0.5 -0.5  0.0  0.3   0.2  -0.4
## weight -0.2  0.1    1.0   0.1  0.1 -0.3  0.8  0.5   0.6   0.3
## highj  -0.1  0.3    0.1   1.0 -0.1 -0.3  0.1  0.2   0.1  -0.1
## m400    0.6 -0.5    0.1  -0.1  1.0  0.5  0.1 -0.3   0.1   0.6
## m110    0.6 -0.5   -0.3  -0.3  0.5  1.0 -0.1 -0.5  -0.1   0.1
## disc    0.0  0.0    0.8   0.1  0.1 -0.1  1.0  0.3   0.4   0.4
## pole   -0.4  0.3    0.5   0.2 -0.3 -0.5  0.3  1.0   0.3   0.0
## javel  -0.1  0.2    0.6   0.1  0.1 -0.1  0.4  0.3   1.0   0.1
## m1500   0.3 -0.4    0.3  -0.1  0.6  0.1  0.4  0.0   0.1   1.0
pca.ath <- dudi.pca(athletes, scan = F)
pca.ath$eig
##  [1] 3.42 2.61 0.94 0.88 0.56 0.49 0.43 0.31 0.27 0.10
barplot(pca.ath$eig)

plot of chunk ade4athletes

The screeplot is the first thing to look at; it tells us that it is satisfactory to use a two dimensional plot.
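
A sketch quantifying this: the cumulative proportion of variance explained by the axes (from the eigenvalues shown above, the first two axes carry roughly 60% of the total).

cumsum(pca.ath$eig)/sum(pca.ath$eig)   ### cumulative proportion of variance per axis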

Correlation Circle

s.corcircle(pca.ath$co,clab=1, grid=FALSE, fullcircle = TRUE,box=FALSE)

plot of chunk athletecorr

The correlation circle made by showing the projection of the old variables onto the first two new principal axes.

athletes[,c(1,5,6,10)]=-athletes[,c(1,5,6,10)]
round(cor(athletes),1)
##        m100 long weight highj m400 m110 disc pole javel m1500
## m100    1.0  0.5    0.2   0.1  0.6  0.6  0.0  0.4   0.1   0.3
## long    0.5  1.0    0.1   0.3  0.5  0.5  0.0  0.3   0.2   0.4
## weight  0.2  0.1    1.0   0.1 -0.1  0.3  0.8  0.5   0.6  -0.3
## highj   0.1  0.3    0.1   1.0  0.1  0.3  0.1  0.2   0.1   0.1
## m400    0.6  0.5   -0.1   0.1  1.0  0.5 -0.1  0.3  -0.1   0.6
## m110    0.6  0.5    0.3   0.3  0.5  1.0  0.1  0.5   0.1   0.1
## disc    0.0  0.0    0.8   0.1 -0.1  0.1  1.0  0.3   0.4  -0.4
## pole    0.4  0.3    0.5   0.2  0.3  0.5  0.3  1.0   0.3   0.0
## javel   0.1  0.2    0.6   0.1 -0.1  0.1  0.4  0.3   1.0  -0.1
## m1500   0.3  0.4   -0.3   0.1  0.6  0.1 -0.4  0.0  -0.1   1.0
pcan.ath=dudi.pca(athletes,nf=2,scannf=F)
pcan.ath$eig
##  [1] 3.42 2.61 0.94 0.88 0.56 0.49 0.43 0.31 0.27 0.10

Now all the negative correlations are quite small. Redoing the screeplot shows no change in the eigenvalues; the only thing that changes is the sign of the loadings for the running variables (the m variables).

New Data changing signs

s.corcircle(pcan.ath$co,clab=1.2,box=FALSE)

plot of chunk athletecorrn

Correlation circle after changing the signs of the running variables.

Observations

plot of chunk athletepc

data(olympic)
olympic$score
##  [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
## [16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
## [31] 7231 7016 6907

PCA as an exploratory tool: using meta-information

######center and scale the data 
###(they have already had variance normalization applied to them)
res.Msig3=dudi.pca(Msig3transp,center=TRUE,scale=TRUE,scannf=F,nf=4)
screeplot(res.Msig3,main="")

plot of chunk tcellexpr

Plot by cell types

celltypes=factor(substr(rownames(Msig3transp),7,9))
table(celltypes)
## celltypes
## EFF MEM NAI 
##  10   9  11
status=factor(substr(rownames(Msig3transp),1,3))
require(ggplot2)
gg <- cbind(res.Msig3$li,Cluster=celltypes)
gg <- cbind(sample=rownames(gg),gg)
ggplot(gg, aes(x=Axis1, y=Axis2)) + 
  geom_point(aes(color=factor(Cluster)),size=5) + 
  geom_hline(yintercept=0,linetype=2) + 
  geom_vline(xintercept=0,linetype=2) +
  scale_color_discrete(name="Cluster") +
  xlim(-14,18)

plot of chunk tcelltypes

PCA of gene expression for a subset of 156 genes involved in specificities of each of the three separate T cell types: effector, naive and memory

Mass Spectrometry Data Analysis

Example from the paper: Kashyap et al., PNAS, 2013.

Sample situations in PC map

##  PCA Example
require(ade4)
require(ggplot2)
load(url("http://bios221.stanford.edu/data/logtmat.RData"))
pca.result=dudi.pca(logtmat, scannf=F,nf=3)
labs=rownames(pca.result$li)
nos=substr(labs,3,4)
type=as.factor(substr(labs,1,2))
kos=which(type=="ko")
wts=which(type=="wt")
pcs=data.frame(Axis1=pca.result$li[,1],Axis2=pca.result$li[,2],labs,type)

pcsplot=ggplot(pcs,aes(x=Axis1,y=Axis2,label=labs,group=nos,colour=type)) + 
  geom_text(size=4,vjust=-0.5) + geom_point()
 pcsplot +  geom_hline(yintercept=0,linetype=2) + 
  geom_vline(xintercept=0,linetype=2)

plot of chunk VanillaGGplot

Extra Connections

pcsplot+geom_line(colour="red")

plot of chunk RedConnects

Checking data by frequent multivariate projections

PhyloChip data allowed us to discover a batch effect.

PhyloChip data for three different batches and two different arrays; the first principal plane explains 66% of the total variation.

Summary of this Section