PCA (applications and extensions)

STATS 305C

Principal components regression

Supervised setting with response \(Y\) and design \(X\): model

\[ Y | X \sim N(X\beta, \sigma^2 I) \equiv N(Z\alpha, \sigma^2 I), \qquad (Y_i, X_i) \overset{IID}{\sim} F \]

where \(\alpha = \Gamma^T\beta\) and \(\text{Var}_F(X) = \Gamma \Lambda \Gamma^T\) and \(Z=X\Gamma\).

For sample version take \(\Gamma=V\) in SVD of \(R_cX\) so \(Z=\hat{Z}\)
We might hypothesize the \(\alpha[(k+1):p] = 0\), i.e. only the first \(k\) principal components of \(X\) contribute to regression…
Estimation is easy

\[ \widehat{\alpha}_i = \frac{\widehat{Z}[,i]^T Y}{n\widehat{\Lambda}[i]} \sim N\left(\alpha_i, \frac{1}{n \Lambda[i]}\right) \]

Example of Principal Components Regression

prostate = read.table('http://www.stanford.edu/class/stats305c/data/prostate.data', header=TRUE)
M = lm(lpsa ~ . - train, data=prostate)
X = model.matrix(M)[,-1]
summary(M)


Call:
lm(formula = lpsa ~ . - train, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76644 -0.35510 -0.00328  0.38087  1.55770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.181561   1.320568   0.137  0.89096    
lcavol       0.564341   0.087833   6.425 6.55e-09 ***
lweight      0.622020   0.200897   3.096  0.00263 ** 
age         -0.021248   0.011084  -1.917  0.05848 .  
lbph         0.096713   0.057913   1.670  0.09848 .  
svi          0.761673   0.241176   3.158  0.00218 ** 
lcp         -0.106051   0.089868  -1.180  0.24115    
gleason      0.049228   0.155341   0.317  0.75207    
pgg45        0.004458   0.004365   1.021  0.31000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6995 on 88 degrees of freedom
Multiple R-squared:  0.6634,    Adjusted R-squared:  0.6328 
F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

prostate$Z = prcomp(X, center=TRUE, scale.=TRUE)$x
prcomp_reg3 = lm(lpsa ~ Z[,1:3], data=prostate)
summary(prcomp_reg3)


Call:
lm(formula = lpsa ~ Z[, 1:3], data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92094 -0.42694 -0.03113  0.52694  1.48932 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.47839    0.07661  32.352  < 2e-16 ***
Z[, 1:3]PC1  0.43188    0.04201  10.282  < 2e-16 ***
Z[, 1:3]PC2  0.07136    0.05998   1.190    0.237    
Z[, 1:3]PC3  0.38651    0.07796   4.958 3.19e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7545 on 93 degrees of freedom
Multiple R-squared:  0.5861,    Adjusted R-squared:  0.5728 
F-statistic:  43.9 on 3 and 93 DF,  p-value: < 2.2e-16

anova(prcomp_reg3, M)

Analysis of Variance Table

Model 1: lpsa ~ Z[, 1:3]
Model 2: lpsa ~ (lcavol + lweight + age + lbph + svi + lcp + gleason + 
    pgg45 + train) - train
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     93 52.941                                
2     88 43.058  5     9.883 4.0396 0.002395 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variations on PCA

Matrix completion

Source: Wikipedia
This was the task for the Netflix prize

Noisy version

We imagine an ideal

\[ Y_{n \times p} = \mu_{n \times p} + \sigma \cdot \epsilon_{n \times p}, \qquad \epsilon \sim N(0, I \otimes I) \]

We observe entries of \(Y\) only on \(\Omega \subset \{(i,j):1 \leq i \leq n, 1 \leq j \leq p\}\).
Goal is to estimate \(\mu\).
Natural problem to try to solve:

\[ \text{minimize}_{\mu} \frac{1}{2} \|P_{\Omega}(Y - \mu)\|^2_F \]

Needs some regularization… low-rank?

Nuclear norm

Noiseless case

When \(\sigma=0\) important work on the Netflix prize considered the problem

\[ \text{minimize}_{\mu: P_{\Omega}\mu=P_{\Omega}Y} \|\mu\|_* \]

Nuclear norm: if \(\mu=UDV^T\):

\[ \|\mu\|_* = \sum_{i} d_i \]

Noisy case: Soft Impute

\[ \text{minimize}_{\mu} \frac{1}{2} \|P_{\Omega}(Y-\mu)\|^2_F + \lambda \|\mu\|_* \]

Authors note that the data and estimate are “sparse + low-rank”

\[ P_{\Omega}(Y-\mu) = P_{\Omega}Y + P_{\Omega}^{\perp}\mu - \mu \]

Algorithm involves repeated calls of nuclear norm proximal map (this soft-thresholds the singular values…)
When solution is low-rank the proximal map can be computed very quickly (faster than a full SVD) due to “sparse + low-rank” structure

PCA as an autoencoder

Recall that the rank \(k\) PCA solution can be recast (with some particular choices) as

\[ \text{minimize}_{A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|X - XAB^T\|^2_F \]

The map \(\mathbb{R}^p \ni x \mapsto Ax \in \mathbb{R}^k\) is a “low-dimensional projection” say \(P_A\).
The map \(\mathbb{R}^k \ni u \mapsto B^Tu \in \mathbb{R}^p\) is an embedding, say \(E_B\).

We could rewrite this objective as

\[ \text{minimize}_{A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \sum_{i=1}^n \|X_i - E_B(P_A(X_i))\|^2_2 \]

No (strict) reason to parametrize the projection and reconstruction by linear maps…
\(\implies\) PCA is a (vanilla) version of an auto-encoder…

Wikipedia

Structured PCA

Sparse PCA

We may want to interpret loading vectors: “what features are important for determining major modes of variation in the data?”
Why not try to make them sparse?
Rank-1 PCA has left and right singular vectors \(u, v\) that satisfy

\[ \text{maximize}_{u,v:\|u\|_2=1,\|v\|_2=1} u^TXv \]

Leading singular value is achieved value \(\hat{u}^TX\hat{v}\).

Penalized Matrix Decomposition

Suppose we want sparsity in \(v\), the PMD proposes

\[ \text{maximize}_{u,v:\|u\|_2 \leq 1,\|v\|_2 \leq 1, \|v\|_1 \leq c} u^TXv \]

Problem (phrased as minimization) is bi-convex in \((u,v)\) which induces a natural ascent algorithm: starting at \((u^{(t)}, v^{(t)})\)

\[ \begin{aligned} u^{(t+1)} &= \text{argmax}_{u: \|u\|_2 \leq 1} u^TXv^{(t)} \\ v^{(t+1)} &= \text{argmax}_{v: \|v\|_2 \leq 1, \|v\|_1 \leq c} (u^{(t+1)})^TXv \\ \end{aligned} \]

Can have structure inducing penalty for \(u\) as well…
Beyond rank 1? Not as clear how to deflate as in PCA…

Missing data

PMD can be fit with missing entries: as in matrix completion, replace \(u^TXv\) with

\[ \text{Tr}(P_{\omega}(X) P_{\omega}(vu^T)) \]

Iterative solution

\[ \begin{aligned} u^{(t+1)} &= \text{argmax}_{u: \|u\|_2 \leq 1} \text{Tr}(P_{\omega}(X) P_{\omega}(v^{(t)}u^T)) \\ v^{(t+1)} &= \text{argmax}_{v: \|v\|_2 \leq 1, \|v\|_1 \leq c} \text{Tr}(P_{\omega}(X) P_{\omega}(vu^{(t+1)}^T)) \\ \end{aligned} \]

Choice of tuning parameter

Split \(X\) into 10 matrices removing elements at random, resulting in matrices similar to matrix completion
Run CV: fit PMD on the union of 9 out of 10, predict on missing 10th…
Similar to a scheme proposed for tuning PCA Bi-CV

CGH (copy number data)

Wikipedia

Fused LASSO

They use the fused LASSO:

\[ {\cal P}(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| \]

Solutions generically will be sparse and piecewise constant.

PMD with fused LASSO

Non-negative matrix factorization

We may want other structure in factors
If \(X_{ij} > 0\) (e.g. count data) we may consider (constraints considered elementwise)

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}: A \geq 0, B \geq 0} \frac{1}{2} \|X-AB^T\|^2_F \]

Natural model for NNMF?

Suppose

\[ X_{ij} \sim \text{Poisson}(\lambda_{ij}) \]

Under independence model

\[ \lambda_{ij} = \lambda_0 \pi^R_i \pi^C_j \]

Corresponds to sampling a Poisson number of individuals from a population with \(R\) independent of \(C\).

Suppose there is a mixture of \(k\) independence models

\[ \lambda_{ij} = \lambda_0 \left(\sum_{l=1}^k \pi^L_l \pi^R_{i|l} \pi^C_{j|l} \right) \]

Such \(\lambda\) matrix has an NNMF of rank \(k\), though likelihood would not be Frobenius norm…

Structured covariance

We looked at PCA solely with Frobenius norm, i.e. the rank \(k\) PCA is equivalent to

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|X-AB^T\|^2_F \]

What if there were structure across rows, i.e. \(X[,i] \sim N(0, \Sigma_R)\)?
Reasonable to whiten first and consider

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|\Sigma_R^{-1/2}(X-AB^T)\|^2_F \]

Could have structure in the columns as well, why not whiten there as well?

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|\Sigma_R^{-1/2}(X-AB^T)\Sigma_C^{-1/2}\|^2_F \]

Replaces Frobenius norm with a norm

\[ X \mapsto \text{Tr}(\Sigma_C^{-1}X^T\Sigma_R^{-1}X)^{1/2} \]

Frobenius is the special case \(\Sigma_C, \Sigma_R = I_p, I_n\).