PCA II#


Principal components regression#

  • Supervised setting with response \(Y\) and design \(X\): model

\[ Y | X \sim N(X\beta, \sigma^2 I) \equiv N(Z\alpha, \sigma^2 I), \qquad (Y_i, X_i) \overset{IID}{\sim} F \]

where \(\alpha = \Gamma^T\beta\) and \(\text{Var}_F(X) = \Gamma \Lambda \Gamma^T\) and \(Z=X\Gamma\).

  • For the sample version, take \(\Gamma=V\) in the SVD of \(R_cX\), so that \(Z=\hat{Z}\)

  • We might hypothesize that \(\alpha[(k+1):p] = 0\), i.e. only the first \(k\) principal components of \(X\) contribute to the regression…

  • Estimation is easy because the columns of \(\widehat{Z}\) are orthogonal:

\[ \widehat{\alpha}_i = \frac{\widehat{Z}[,i]^T Y}{n\widehat{\Lambda}[i]} \sim N\left(\alpha_i, \frac{\sigma^2}{n \widehat{\Lambda}[i]}\right) \]
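A minimal sketch on simulated data (illustrative names and setup, not from the course code): the coefficients from lm() on the principal components match \(\widehat{Z}[,i]^TY/\widehat{Z}[,i]^T\widehat{Z}[,i]\); the denominator equals \(n\widehat{\Lambda}[i]\) up to the divisor convention used for the sample covariance.

# Minimal sketch (simulated data): PC regression coefficients from lm()
# agree with the closed form Zhat[,i]' Y / Zhat[,i]' Zhat[,i].
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * p), p, p)
Y <- as.vector(X %*% rnorm(p) + rnorm(n))
Zhat <- prcomp(X, center = TRUE)$x                   # centered, orthogonal columns
alpha_lm <- coef(lm(Y ~ Zhat))[-1]                   # drop the intercept
alpha_closed <- colSums(Zhat * Y) / colSums(Zhat^2)  # Zhat[,i]'Y / Zhat[,i]'Zhat[,i]
cbind(alpha_lm, alpha_closed)                        # the two columns agree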

Example of Principal Components Regression#

#| echo: true
prostate = read.table('http://www.stanford.edu/class/stats305c/data/prostate.data', header=TRUE)
M = lm(lpsa ~ . - train, data=prostate)
X = model.matrix(M)[,-1]
summary(M)
Call:
lm(formula = lpsa ~ . - train, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76644 -0.35510 -0.00328  0.38087  1.55770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.181561   1.320568   0.137  0.89096    
lcavol       0.564341   0.087833   6.425 6.55e-09 ***
lweight      0.622020   0.200897   3.096  0.00263 ** 
age         -0.021248   0.011084  -1.917  0.05848 .  
lbph         0.096713   0.057913   1.670  0.09848 .  
svi          0.761673   0.241176   3.158  0.00218 ** 
lcp         -0.106051   0.089868  -1.180  0.24115    
gleason      0.049228   0.155341   0.317  0.75207    
pgg45        0.004458   0.004365   1.021  0.31000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6995 on 88 degrees of freedom
Multiple R-squared:  0.6634,	Adjusted R-squared:  0.6328 
F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

#| echo: true
prostate$Z = prcomp(X, center=TRUE, scale.=TRUE)$x
prcomp_reg3 = lm(lpsa ~ Z[,1:3], data=prostate)
summary(prcomp_reg3)
Call:
lm(formula = lpsa ~ Z[, 1:3], data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92094 -0.42694 -0.03113  0.52694  1.48932 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.47839    0.07661  32.352  < 2e-16 ***
Z[, 1:3]PC1  0.43188    0.04201  10.282  < 2e-16 ***
Z[, 1:3]PC2  0.07136    0.05998   1.190    0.237    
Z[, 1:3]PC3  0.38651    0.07796   4.958 3.19e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7545 on 93 degrees of freedom
Multiple R-squared:  0.5861,	Adjusted R-squared:  0.5728 
F-statistic:  43.9 on 3 and 93 DF,  p-value: < 2.2e-16

#| echo: true
anova(prcomp_reg3, M)
A anova: 2 × 6
  Res.Df      RSS Df Sum of Sq        F      Pr(>F)
1     93 52.94137 NA        NA       NA          NA
2     88 43.05842  5   9.88295 4.039626 0.002395333

prcomp_R2 = c()
for (i in 1:8) {
    prcomp_reg = lm(lpsa ~ Z[,1:i], data=prostate)
    prcomp_R2 = c(prcomp_R2, summary(prcomp_reg)$r.squared)
}
plot(prcomp_R2, ylim=c(0,1), pch=22, bg='red')
[Figure: prcomp_R2, the R² of the principal components regression, plotted against the number of components used (vertical axis from 0 to 1).]

Variations on PCA#

Matrix completion#

Noisy version#

  • We imagine an ideal (complete) data matrix

\[ Y_{n \times p} = \mu_{n \times p} + \sigma \cdot \epsilon_{n \times p}, \qquad \epsilon \sim N(0, I \otimes I) \]
  • We observe entries of \(Y\) only on \(\Omega \subset \{(i,j):1 \leq i \leq n, 1 \leq j \leq p\}\).

  • Goal is to estimate \(\mu\).

  • Natural problem to try to solve:

\[ \text{minimize}_{\mu} \frac{1}{2} \|P_{\Omega}(Y - \mu)\|^2_F \]
  • Needs some regularization… low-rank?

Nuclear norm#

Noiseless case#

  • When \(\sigma=0\), important work on the Netflix prize considered the problem

\[ \text{minimize}_{\mu: P_{\Omega}\mu=P_{\Omega}Y} \|\mu\|_* \]
  • Nuclear norm: if \(\mu=UDV^T\):

\[ \|\mu\|_* = \sum_{i} d_i \]
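A one-line numerical check (illustrative):

# Nuclear norm: the sum of the singular values returned by svd().
set.seed(1)
mu <- matrix(rnorm(20), 5, 4)
sum(svd(mu)$d)   # equals ||mu||_*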

Noisy case: Soft Impute#

\[ \text{minimize}_{\mu} \frac{1}{2} \|P_{\Omega}(Y-\mu)\|^2_F + \lambda \|\mu\|_* \]
  • Authors note that the data and estimate are “sparse + low-rank”

\[ P_{\Omega}(Y-\mu) = P_{\Omega}Y + P_{\Omega}^{\perp}\mu - \mu \]
  • The algorithm involves repeated calls to the nuclear-norm proximal map, which soft-thresholds the singular values (see the sketch after this list)…

  • When solution is low-rank the proximal map can be computed very quickly (faster than a full SVD) due to “sparse + low-rank” structure
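A minimal sketch of the resulting iteration, assuming a small dense matrix so a full SVD per step is affordable; the names soft_svd and soft_impute and the simulated example are illustrative, and the fast “sparse + low-rank” implementation is not attempted here.

# Nuclear-norm proximal map: soft-threshold the singular values of M.
soft_svd <- function(M, lambda) {
  s <- svd(M)
  d <- pmax(s$d - lambda, 0)
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)
}

# Soft-Impute sketch: fill unobserved entries with the current estimate, then prox.
soft_impute <- function(Y, Omega, lambda, n_iter = 100) {
  mu <- matrix(0, nrow(Y), ncol(Y))
  for (it in 1:n_iter) {
    filled <- ifelse(Omega, Y, mu)   # P_Omega(Y) + P_Omega^perp(mu)
    mu <- soft_svd(filled, lambda)
  }
  mu
}

# Tiny example: a rank-1 signal observed on about 60% of entries.
set.seed(2)
n <- 50; p <- 20
mu_true <- outer(rnorm(n), rnorm(p))
Y <- mu_true + 0.1 * matrix(rnorm(n * p), n, p)
Omega <- matrix(runif(n * p) < 0.6, n, p)
mu_hat <- soft_impute(Y, Omega, lambda = 1)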

PCA as an autoencoder#

  • Recall that the rank \(k\) PCA solution can be recast (with some particular choices) as

\[ \text{minimize}_{A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|X - XAB^T\|^2_F \]
  • The map \(\mathbb{R}^p \ni x \mapsto A^Tx \in \mathbb{R}^k\) is a “low-dimensional projection”, say \(P_A\).

  • The map \(\mathbb{R}^k \ni u \mapsto Bu \in \mathbb{R}^p\) is an embedding, say \(E_B\).


  • We could rewrite this objective as

\[ \text{minimize}_{A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \sum_{i=1}^n \|X_i - E_B(P_A(X_i))\|^2_2 \]
  • No (strict) reason to parametrize the projection and reconstruction by linear maps…

  • \(\implies\) PCA is a (vanilla) version of an auto-encoder (a small numerical check follows)…
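A small numerical check (illustrative), assuming \(X\) is column-centered and taking the particular choice \(A = B = V_k\), the top \(k\) right singular vectors:

# With A = B = V_k, the reconstruction X A B' equals the rank-k SVD truncation.
set.seed(1)
n <- 100; p <- 6; k <- 2
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
s <- svd(X)
V_k <- s$v[, 1:k]
recon_autoenc <- X %*% V_k %*% t(V_k)                             # X A B' with A = B = V_k
recon_svd <- s$u[, 1:k] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k])   # rank-k SVD truncation
max(abs(recon_autoenc - recon_svd))                               # ~ 0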


[Figure: autoencoder schematic (source: Wikipedia).]

Structured PCA#

Sparse PCA#

  • We may want to interpret loading vectors: “what features are important for determining major modes of variation in the data?”

  • Why not try to make them sparse?

  • Rank-1 PCA has left and right singular vectors \(u, v\) that solve

\[ \text{maximize}_{u,v:\|u\|_2=1,\|v\|_2=1} u^TXv \]
  • The leading singular value is the optimal value \(\hat{u}^TX\hat{v}\) (checked numerically below).
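A quick check of this characterization (illustrative):

# u'Xv evaluated at the leading singular vectors equals d_1.
set.seed(1)
X <- matrix(rnorm(40), 8, 5)
s <- svd(X)
c(value = as.numeric(t(s$u[, 1]) %*% X %*% s$v[, 1]), d1 = s$d[1])   # equal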

Penalized Matrix Decomposition#

  • Suppose we want sparsity in \(v\); the penalized matrix decomposition (PMD) proposes

\[ \text{maximize}_{u,v:\|u\|_2 \leq 1,\|v\|_2 \leq 1, \|v\|_1 \leq c} u^TXv \]
  • The problem (phrased as a minimization) is bi-convex in \((u,v)\), which induces a natural ascent algorithm (sketched in code after this list): starting at \((u^{(t)}, v^{(t)})\)

\[\begin{split} \begin{aligned} u^{(t+1)} &= \text{argmax}_{u: \|u\|_2 \leq 1} u^TXv^{(t)} \\ v^{(t+1)} &= \text{argmax}_{v: \|v\|_2 \leq 1, \|v\|_1 \leq c} (u^{(t+1)})^TXv \\ \end{aligned} \end{split}\]
  • Can have a structure-inducing penalty for \(u\) as well…

  • Beyond rank 1? Not as clear how to deflate as in PCA…
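A minimal sketch of the rank-1 updates. This uses the Lagrangian (fixed soft-threshold) form of the \(\ell_1\) constraint on \(v\) rather than the binary search that enforces \(\|v\|_1 \leq c\) exactly; the names soft and pmd_rank1 are illustrative, not from any package.

# Rank-1 PMD sketch: alternate a unit-norm update for u with a
# soft-threshold-and-renormalize update for v.
soft <- function(a, lam) sign(a) * pmax(abs(a) - lam, 0)

pmd_rank1 <- function(X, lam, n_iter = 50) {
  v <- svd(X, nu = 1, nv = 1)$v[, 1]          # warm start: leading right singular vector
  for (it in 1:n_iter) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))                   # argmax over ||u||_2 <= 1
    v <- soft(t(X) %*% u, lam)                # sparsify X'u ...
    if (sum(v^2) == 0) break                  # lam too large: loading shrunk to zero
    v <- v / sqrt(sum(v^2))                   # ... then rescale to unit l2 norm
  }
  list(u = as.vector(u), v = as.vector(v))
}

# Example: larger lam gives a sparser loading vector v.
set.seed(3)
fit <- pmd_rank1(matrix(rnorm(200), 20, 10), lam = 0.5)
fit$v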

Missing data#

  • PMD can be fit with missing entries: as in matrix completion, replace \(u^TXv\) with

\[ \text{Tr}(P_{\omega}(X) P_{\omega}(vu^T)) \]
  • Iterative solution

\[\begin{split} \begin{aligned} u^{(t+1)} &= \text{argmax}_{u: \|u\|_2 \leq 1} \text{Tr}(P_{\omega}(X) P_{\omega}(v^{(t)}u^T)) \\ v^{(t+1)} &= \text{argmax}_{v: \|v\|_2 \leq 1, \|v\|_1 \leq c} \text{Tr}(P_{\omega}(X) P_{\omega}(v(u^{(t+1)})^T)) \\ \end{aligned} \end{split}\]
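Since \(\text{Tr}(P_{\omega}(X) P_{\omega}(vu^T)) = u^T P_{\omega}(X) v\), these updates are the same as before with \(X\) replaced by its zero-filled version. A hedged illustration, reusing the pmd_rank1 sketch above:

# Missing-data PMD: zero-fill the unobserved entries, then run the same updates.
set.seed(4)
X <- matrix(rnorm(200), 20, 10)
Omega <- matrix(runif(200) < 0.7, 20, 10)   # observed entries
X0 <- ifelse(Omega, X, 0)                   # P_omega(X)
fit_missing <- pmd_rank1(X0, lam = 0.5)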

Choice of tuning parameter#

  • Split the entries of \(X\) at random into 10 folds; removing one fold at a time yields 10 partially observed matrices, as in matrix completion

  • Run CV: fit PMD on the union of 9 out of 10, predict on missing 10th…

  • Similar to a scheme proposed for tuning PCA: bi-cross-validation (Bi-CV)


CGH (copy number data)#

[Figure: copy-number (CGH) data illustration (source: Wikipedia).]

Fused LASSO#

  • They use the fused LASSO:

\[ {\cal P}(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| \]
  • Solutions generically will be sparse and piecewise constant.
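Evaluating the penalty is a one-liner (an illustrative helper, not a fitting routine):

# Fused LASSO penalty: l1 on the coefficients plus l1 on successive differences.
fused_penalty <- function(beta, lambda1, lambda2) {
  lambda1 * sum(abs(beta)) + lambda2 * sum(abs(diff(beta)))
}
fused_penalty(c(0, 0, 1, 1, 1, 0), lambda1 = 1, lambda2 = 2)   # 1*3 + 2*2 = 7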

PMD with fused LASSO#

Non-negative matrix factorization#

  • We may want other structure in factors

  • If \(X_{ij} \geq 0\) (e.g. count data) we may consider (constraints below are elementwise)

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}: A \geq 0, B \geq 0} \frac{1}{2} \|X-AB^T\|^2_F \]
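A minimal sketch of one standard way to fit this objective, the Lee–Seung multiplicative updates (illustrative code; serious use would call a dedicated NMF package):

# NNMF via multiplicative updates for the Frobenius objective; eps avoids division by zero.
nnmf <- function(X, k, n_iter = 200, eps = 1e-9) {
  A <- matrix(runif(nrow(X) * k), nrow(X), k)
  B <- matrix(runif(ncol(X) * k), ncol(X), k)
  for (it in 1:n_iter) {
    A <- A * (X %*% B) / (A %*% (t(B) %*% B) + eps)
    B <- B * (t(X) %*% A) / (B %*% (t(A) %*% A) + eps)
  }
  list(A = A, B = B)
}

set.seed(5)
X <- matrix(rpois(200, 3), 20, 10)
fit <- nnmf(X, k = 2)
max(abs(X - fit$A %*% t(fit$B)))   # size of the largest residual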

Natural model for NNMF?#

  • Suppose

\[ X_{ij} \sim \text{Poisson}(\lambda_{ij}) \]
  • Under independence model

\[ \lambda_{ij} = \lambda_0 \pi^R_i \pi^C_j \]
  • Corresponds to sampling a Poisson number of individuals from a population in which the row label \(R\) is independent of the column label \(C\).


  • Suppose there is a mixture of \(k\) independence models

\[ \lambda_{ij} = \lambda_0 \left(\sum_{l=1}^k \pi^L_l \pi^R_{i|l} \pi^C_{j|l} \right) \]
  • Such a \(\lambda\) matrix has an NNMF of rank \(k\), though the likelihood would not correspond to the Frobenius norm…
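A small illustration of this construction (all probability vectors below are made up): the resulting intensity matrix has rank \(k\).

# Build lambda from a k-component mixture of independence models and check its rank.
set.seed(6)
n <- 30; p <- 12; k <- 3; lambda0 <- 100
piL <- rep(1 / k, k)                                                  # mixture weights pi^L
piR <- apply(matrix(runif(n * k), n, k), 2, function(x) x / sum(x))   # columns pi^R_{.|l}
piC <- apply(matrix(runif(p * k), p, k), 2, function(x) x / sum(x))   # columns pi^C_{.|l}
lambda <- lambda0 * piR %*% diag(piL) %*% t(piC)                      # n x p intensities
qr(lambda)$rank                                                       # = 3, i.e. rank k
X <- matrix(rpois(n * p, lambda), n, p)                               # Poisson counts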


Structured covariance#

  • We looked at PCA solely with the Frobenius norm, i.e. the rank-\(k\) PCA problem is equivalent to

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|X-AB^T\|^2_F \]
  • What if there were structure across rows, i.e. \(X[,i] \sim N(0, \Sigma_R)\)?

  • Reasonable to whiten first and consider

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|\Sigma_R^{-1/2}(X-AB^T)\|^2_F \]

  • There could be structure in the columns as well, so why not whiten there too?

\[ \text{minimize}_{A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{p \times k}} \frac{1}{2} \|\Sigma_R^{-1/2}(X-AB^T)\Sigma_C^{-1/2}\|^2_F \]
  • Replaces Frobenius norm with a norm

\[ X \mapsto \text{Tr}(\Sigma_C^{-1}X^T\Sigma_R^{-1}X)^{1/2} \]
  • The Frobenius norm is the special case \(\Sigma_C = I_p\), \(\Sigma_R = I_n\) (a sketch of the whitened rank-\(k\) fit follows).
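A minimal sketch of the whitened rank-\(k\) fit, assuming \(\Sigma_R\) and \(\Sigma_C\) are known and positive definite (function names are illustrative): whiten, truncate the SVD at rank \(k\), and map back.

# Whitened low-rank fit: minimize the doubly whitened Frobenius norm over rank-k matrices.
whitened_pca <- function(X, Sigma_R, Sigma_C, k) {
  isqrt <- function(S) {               # symmetric inverse square root
    e <- eigen(S, symmetric = TRUE)
    e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
  }
  WR <- isqrt(Sigma_R); WC <- isqrt(Sigma_C)
  s <- svd(WR %*% X %*% WC)            # SVD of the whitened matrix
  fit_w <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k, drop = FALSE])
  solve(WR) %*% fit_w %*% solve(WC)    # un-whiten the rank-k fit
}

set.seed(7)
n <- 15; p <- 8
Sigma_R <- 0.5^abs(outer(1:n, 1:n, "-"))   # made-up AR(1)-style row covariance
Sigma_C <- diag(p)
M_hat <- whitened_pca(matrix(rnorm(n * p), n, p), Sigma_R, Sigma_C, k = 2)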

Example#