PCA : Interpretation Examples ========================================= These example provide a short introduction to using *R* for PCA analysis. We will use the ``dudi.pca`` function from the ``ade4`` package .. sourcecode:: r install.packages('ade4') .. rcode:: library(ade4) data(olympic) attach(olympic) #Turtles=read.table("http://stats366.stanford.edu/data/PaintedTurtles.txt",header=TRUE) Two example datasets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1) ``Turtles`` is Jolicoeur and Mossiman's 1960's Painted Turtles Dataset with size variables for two turtle populations. 2) The decathlon data are scores on various olympic decathlon events for 33 athletes. It is already contained in the package *ade4*. The R object ``olympic`` is composed of a list of 2 components: `tab` is a data frame with 33 rows and 10 columns events of the decathlon: 100 meters (100), long jump (long), shotput (poid), high jump (haut),400 meters (400), 110-meter hurdles (110), discus throw (disq), pole vault (perc), javelin (jave) and 1500 meters (1500). `score` is a vector of the final points scores of the competition. Start with unvariate and bivariate statistics ------------------------------------------------------ .. rcode:: :force: Turtles=read.table("/Users/susan/sphinx/data/PaintedTurtles.txt",header=T) turtlem=Turtles[,2:4] save(Turtles,file="Turtles.save") save(turtlem,file="turtlem.save") apply(turtlem,2,summary) round(cor(turtlem),1) round(cov(turtlem),2) turtlesc=scale(turtlem) round(cov(turtlesc),2) Now we compute the principal components of the data: .. rplot:: :force: require(ade4) load("turtlem.save") pca.turtles = dudi.pca(turtlem,scannf=F,nf=2) print(pca.turtles) scatter(pca.turtles) We can easily add in the discrete categorical `sex' variable with the ``s.class`` function .. rplot:: :force: s.class(pca.turtles$li,factor(Turtles[,1])) We can interpret the first component as the overall ``size`` of the turtles. We see that overall the females are smaller than the males. If we only look at the id numbers of the turtles down the list we notice a double pattern. What is it? .. rplot:: :force: s.label(pca.turtles$li,boxes=F) Turtles[c(20:29,45:48),] .. rflush:: Decathlon Data ~~~~~~~~~~~~~~~~~~~~ .. rcode:: :force: require(ade4) data(olympic) apply(olympic$tab,2,summary) print(round(cov(olympic$tab),2)) print(round(apply(olympic$tab,2,var),1)) print(round(cor(olympic$tab),1)) Now we do a PCA using the variance covariance cross product matrix: .. rplot:: :force: require(ade4) data(olympic) pca.olympic = dudi.pca(olympic$tab,scale=F,scannf=F,nf=2) scatter(pca.olympic) Here are the variables .. rplot:: :force: s.label(pca.olympic$co,boxes=F) From this plot, we see that the first principal component is positively associated with longer times on the 1500. So, slower runners will have higher value on this component. .. rcode:: :force: cor(olympic$tab[,'1500'],pca.olympic$li[,1]) The first principal component is negatively correlated to the ``javelin`` variable. The second principal component is positively associated with the javelin variable. .. rcode:: :force: cor(olympic$tab[,'jave'],pca.olympic$li[,2]) cor(olympic$tab[,'jave'],pca.olympic$li[,1]) .. rflush:: Standardizing ~~~~~~~~~~~~~ In the previous example, we saw that the two variables were based somewhat on speed and strength. However, we did not scale the variables so the 1500 has much more weight than the 400, for instance. We correct this by rescaling the variables (this is actually the default in dudi.pca). .. rplot:: :force: pca.olympic = dudi.pca(olympic$tab,scale=T,scannf=F,nf=2) scatter(pca.olympic) This plot reinforces our earlier interpretation and has put the running events on an "even playing field" by standardizing. .. rplot:: :force: require(ade4) pca.olympic = dudi.pca(olympic$tab,scale=T,scannf=F,nf=2) s.label(pca.olympic$co,boxes=F) .. rcode:: :force: require(ade4) pca.olympic = dudi.pca(olympic$tab,scannf=F,nf=2) cor(olympic$tab[,'1500'],pca.olympic$li[,1]) cor(olympic$tab[,'jave'],pca.olympic$li[,2]) cor(olympic$tab[,'jave'],pca.olympic$li[,1]) .. rflush:: More meaningful variables and Overall score ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We replace the times with their opposites so as the values increase performance increases. .. rplot:: :force: olympic2=olympic$tab olympic2[,c(1,5,6,10)]=-olympic2[,c(1,5,6,10)] pca.olympic2=dudi.pca(olympic2,nf=2,scannf=F) s.label(pca.olympic2$co,boxes=F) round(cor(olympic2),2) Now we can see that the olympic score is very (negatively) correlated to the first component .. rplot:: :force: cor(pca.olympic2$li[,1],olympic$score) plot(pca.olympic2$li[,1], olympic$score, xlab='Comp. 1', pch=23, bg='red', cex=2) .. rflush::