PCA (classical)

STATS 305C

PCA

Two views

Low-rank matrix approximation
Maximal variance projection

Low-rank approximation

Consider problem

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k},B \in \mathbb{R}^{q \times k}} \|R_cX-AB^T\|^2_F \]

with

\[ R_c = \left(I - \frac{1}{n}11^T\right) \]

This is reduced rank regression objective with \(\tilde{Y}=R_cX\), \(\tilde{X}=I\), \(\Sigma=I\).
Going forward, we’ll somtimes drop \(R_c\) by assuming that \(R_cX=X\), i.e. that it has already been centered.

Solution

Our reduced-rank regression solution told us to look at \(\tilde{W}\), \(k\) leading eigenvectors of

\[ XX^T \]

Then, we take

\[ \begin{aligned} \tilde{A} &= \tilde{W} \\ \tilde{B} &= X^T\tilde{A}(\tilde{A}^T\tilde{A})^{-1} \\ &= X^T\tilde{A} \end{aligned} \]

Relation to SVD

Given matrix \(M \in \mathbb{R}^{n \times p}\) (assume \(n>p\)) there exists an SVD

\[ M = U_{n \times p} D_{p \times p} V_{p \times p}^T \]

Properties

\[ \begin{aligned} U^TU &= I_{p \times p} \\ D &= \text{diag}(d_1 \geq d_2 \geq \dots \geq d_p \geq 0) \\ V^TV &= \text{I}_{p \times p} \end{aligned} \]

Consider

\[ MM^T = UD^2U^T \]

Eigen-decomposition of \(MM^T\) can be found from SVD of \(M\).
Similarly

\[ M^TM = VD^2V^T. \]

Finally,

\[ M^TU = VD \]

Can absorb \(D\) into \(U\) by writing

\[ M = (UD)V^T \]

Back to reduced rank

Our reduced rank solution with \(X=UDV^T\) picks

\[ \begin{aligned} \hat{A} &= U[,1:k] \\ \hat{B} &= D[1:k,1:k] V[,1:k]^T \end{aligned} \]

More common to think of rank-\(k\) PCA in terms of first \(k\) scores

\[ U[,1:k] D[1:k,1:k] \]

and loadings

\[ V[,1:k] \]

A second motivation

Could also consider the problem

\[ \text{minimize}_{A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{p \times k}} \|X - XAB^T\|^2_F \]

Also a reduced rank problem whose solutions are projections onto a k-leading eigenspace of \((X^TX)\).
This one clearly isn’t a likelihood – \(X\) appears as response and design.

Maximal variance

Minimizing first over \(A\), we see this is equivalent to solving this problem

\[ \text{minimize}_{P_k} \|X - XP_k\|^2_F \]

where \(P_k\) is a projection matrix of rank \(k\).

Equivalently, we solve (recall \(R_cX=X\))

\[ \text{minimize}_{P_k} \text{Tr}(X^TR_cX) - \text{Tr}(P_kX^TR_cXP_k) \]

This is the same as

\[ \text{maximize}_{P_k} \text{Tr}(P_kX^TR_cXP_k) \]

Rank \(k\) PCA captures \(k\) dimensional subspace of features that maximizes variance…

Population version

The maximizing variance picture lends itself to a population version
Thinking of rows of \(X\) as IID with law \(F\), let \(\Sigma = \text{Var}_F(X[1,])\)

\[ \text{minimize}_{P_k} \text{Tr}(\Sigma) - \text{Tr}(P_k\Sigma P_k) \iff \text{maximize}_{P_k} \text{Tr}(P_k \Sigma P_k) \]

Solution is to choose top \(k\) eigen-subspace of \(\Sigma\)…

PCA = SVD

Given data matrix \(X \in \mathbb{R}^{n \times p}\) with rows \(X[i,] \overset{IID}{\sim} F\), the sample PCA can be described by the SVD of \(R_cX\)
As \(V\) can be thought as an estimate of \(\Gamma\) in an SVD of

\[ \text{Var}_F(X) \approx \Gamma \Lambda \Gamma^T \]

we might write \(V = \widehat{\Gamma}\).

Columns of \(\widehat{\Gamma}\) are the loadings of the principal components while columns of \(UD = \hat{Z}\) are the (sample) principal components scores.

Scaling

Common to scale columns of \(X\) to have standard deviation 1.
In this case, we are really computing the PCA of

\[ R_c X \text{diag}(S)^{-1} \]

Scalings are

\[ S^2_{i,i} = \frac{1}{n-1}X[,i]^TR_cX[,i] = {\tt sd(X[,i])}^2 \]

Synthetic example

n = 50
p = 10
X = matrix(rnorm(n*p), n, p)
R_c = diag(rep(1, n)) - matrix(1, n, n) / n
SVD_X = svd(R_c %*% X)
Z = SVD_X$u %*% diag(SVD_X$d)
Gamma_hat = SVD_X$v
pca_X = prcomp(X, center=TRUE, scale.=FALSE)
names(pca_X)

[1] "sdev"     "rotation" "center"   "scale"    "x"

Screeplot: is there an elbow?

   varZ2 
1.436072

Check loadings


Call:
lm(formula = Gamma_hat[, 1] ~ pca_X$rotation[, 1])

Residuals:
       Min         1Q     Median         3Q        Max 
-9.336e-16 -7.160e-16 -5.859e-17  4.491e-16  1.941e-15 

Coefficients:
                      Estimate Std. Error    t value Pr(>|t|)    
(Intercept)         -6.733e-16  3.150e-16 -2.137e+00    0.065 .  
pca_X$rotation[, 1]  1.000e+00  9.962e-16  1.004e+15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.387e-16 on 8 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.008e+30 on 1 and 8 DF,  p-value: < 2.2e-16

Check scores

With scaling

S_inv = 1/ (sqrt(diag(t(X) %*% R_c %*% X) / 49))
SVD_Xs = svd(R_c %*% X %*% diag(S_inv))
Z_s = SVD_Xs$u %*% diag(SVD_Xs$d)
Gamma_hat_s = SVD_Xs$v
pca_Xs = prcomp(X, center=TRUE, scale.=TRUE)

Check loadings

summary(lm(Gamma_hat_s[,1] ~ pca_Xs$rotation[,1]))


Call:
lm(formula = Gamma_hat_s[, 1] ~ pca_Xs$rotation[, 1])

Residuals:
       Min         1Q     Median         3Q        Max 
-4.314e-15 -2.378e-15 -2.862e-16  1.956e-15  7.417e-15 

Coefficients:
                       Estimate Std. Error    t value Pr(>|t|)    
(Intercept)          -1.391e-15  1.239e-15 -1.123e+00    0.294    
pca_Xs$rotation[, 1]  1.000e+00  3.917e-15  2.553e+14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.698e-15 on 8 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.516e+28 on 1 and 8 DF,  p-value: < 2.2e-16

Check scores

Some example applications

`USArrests` data

pr.out = prcomp(USArrests, scale = TRUE)
names(pr.out)

[1] "sdev"     "rotation" "center"   "scale"    "x"

Output of `prcomp`

The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to computing SVD

pr.out$center

  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232

pr.out$scale

   Murder   Assault  UrbanPop      Rape 
 4.355510 83.337661 14.474763  9.366385

Rotation / loading

pr.out$rotation

                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432

Biplot

biplot(pr.out)

Variance explained

pr.out$sdev

[1] 1.5748783 0.9948694 0.5971291 0.4164494

pr.var = pr.out$sdev^2
pr.var

[1] 2.4802416 0.9897652 0.3565632 0.1734301

Congressional voting records (1984) – UCI

voting_data = read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header=FALSE)
head(voting_data)

          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 republican  n  y  n  y  y  y  n  n   n   y   ?   y   y   y   n   y
2 republican  n  y  n  y  y  y  n  n   n   n   n   y   y   y   n   ?
3   democrat  ?  y  y  ?  y  y  n  n   n   n   y   n   y   y   n   n
4   democrat  n  y  y  n  ?  y  n  n   n   n   y   n   y   n   n   y
5   democrat  y  y  y  n  y  y  n  n   n   n   y   ?   y   y   y   y
6   democrat  n  y  y  n  y  y  n  n   n   n   n   n   y   y   y   y

voting_X = c()
for (i in 2:7) {
    voting_data[,i] = factor(voting_data[,i], c('n', '?', 'y'), ordered=TRUE)
    voting_X = cbind(voting_X, as.numeric(voting_data[,i]))
}
pca_votes = prcomp(voting_X, scale.=TRUE)
dem = voting_data[,1] == 'democrat'
repub = !dem

plot(pca_votes$x[,1], pca_votes$x[,2], type='n', xlab='PC1', ylab='PC2')
points(pca_votes$x[dem, 1], pca_votes$x[dem, 2], pch=22, bg='blue')
points(pca_votes$x[repub, 1], pca_votes$x[repub, 2], pch=22, bg='red')

dim(voting_data)

[1] 435  17

Congress of 2017

[1] 422 709

Fewer votes (randomly chosen 100 votes)

Fewer votes (randomly chosen 20 votes)

Flow cytometry data

flow = read.table('http://www.stanford.edu/class/stats305c/data/sachs.data', sep=' ', header=FALSE)
dim(flow)

[1] 7466   11

plot(prcomp(flow, center=TRUE, scale.=TRUE))

A rank 4 model?

Olive oil data

library(pdfCluster)
data(oliveoil)
featO = oliveoil[,3:10]
pcaO = prcomp(featO, scale.=TRUE, center=TRUE)
plot(pcaO)

plot(pcaO$x[,1], pcaO$x[,2], col=oliveoil[,'macro.area'])

plot(pcaO$x[,1], pcaO$x[,2], col=oliveoil[,'region'])

PCA (classical)

PCA

Two views

Low-rank approximation

Solution

Relation to SVD

Properties

Back to reduced rank

A second motivation

Maximal variance

Population version

PCA = SVD

Scaling

Synthetic example

Screeplot: is there an elbow?

Check loadings

Check scores

With scaling

Check loadings

Check scores

Some example applications

USArrests data

Output of prcomp

Rotation / loading

Biplot

Variance explained

Congressional voting records (1984) – UCI

Congress of 2017

Fewer votes (randomly chosen 100 votes)

Fewer votes (randomly chosen 20 votes)

Flow cytometry data

A rank 4 model?

Olive oil data

`USArrests` data

Output of `prcomp`