CCA (Canonical Correlation Analysis)
For a data matrix \(X \in \mathbb{R}^{n \times p}\) with \(X_i \overset{IID}{\sim} F\) , PCA and Factor Analysis can be thought of as models / decompositions of
\[
\Sigma = \text{Var}_F(X)
\]
Canonical correlation analysis deals with pairs \(\mathbb{R}^p \times \mathbb{R}^q \ni (X_i, Y_i) \overset{IID}{\sim} F\) and models / decomposes the cross-covariance matrix
\[
\text{Var}_F(X,Y) = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}.
\]
The first canonical pair \((a, b)\) solves
\[
\text{maximize}_{(a,b): a^T\Sigma_{XX}a \leq 1, b^T\Sigma_{YY}b \leq 1} a^T\Sigma_{XY}b
\]
Example from the UCLA IDRE statistics pages:
\(X\) : psychological variables: (Control, Concept, Motivation)
\(Y\) : academic + sex: (Read, Write, Math, Science, Sex)
mm = read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
head(mm)
locus_of_control self_concept motivation read write math science female
1 -0.84 -0.24 1.00 54.8 64.5 44.5 52.6 1
2 -0.38 -0.47 0.67 62.7 43.7 44.7 52.6 1
3 0.89 0.59 0.67 60.6 56.7 70.5 58.0 0
4 0.71 0.28 0.67 62.7 56.7 54.7 58.0 0
5 -0.64 0.03 1.00 41.6 46.3 38.4 36.3 1
6 1.11 0.90 0.33 62.7 64.5 61.4 58.0 1
Substituting \(u=\Sigma_{XX}^{1/2}a, v=\Sigma_{YY}^{1/2}b\) we see that \((u, v)\) solve
\[
\text{maximize}_{(u,v): \|u\|_2 \leq 1, \|v\|_2 \leq 1} u^T\Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}v
\]
So \(u=U[,1]\) and \(v=V[,1]\) are leading left and right singular vectors of \(\Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}= U \rho V^T\) with \(\rho=\text{diag}(\rho_1, \dots,\rho_q)\) .
Achieved maximal correlation value is the leading singular value \(\rho_1\) .
Practically speaking, one can find the SVD via an eigendecomposition of the symmetric matrix
\[
\Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} \Sigma_{XX}^{-1/2} = U\rho^2 U^T
\]
Loadings \(a=\Sigma_{XX}^{-1/2}u, b=\Sigma_{YY}^{-1/2}v\) determine first canonical variates \((a^TX, b^TY)\) with
\[
\text{Var}_F(a^TX) = \text{Var}_F(b^TY) = 1, \qquad \text{Cov}_F(a^TX, b^TY) = \rho_1.\]
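The whitening-and-SVD recipe above can be sketched in a few lines of numpy. This is an illustrative check on synthetic data (not the mmreg example); all names here are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 5000, 4, 3
X = rng.normal(size=(n, p))
Y = 0.5 * X[:, :q] + rng.normal(size=(n, q))  # correlated with X

# sample covariance blocks
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Syy, Sxy = S[:p, :p], S[p:, p:], S[:p, p:]

def inv_sqrt(S):
    """Symmetric inverse square root via eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return (Q * w ** -0.5) @ Q.T

# SVD of the whitened cross-covariance
U, rho, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
a = inv_sqrt(Sxx) @ U[:, 0]   # first canonical loadings
b = inv_sqrt(Syy) @ Vt[0, :]

zx, zy = X @ a, Y @ b         # first canonical variates
# Var(zx) = Var(zy) = 1 and Corr(zx, zy) = rho[0] by construction
```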
Second canonical variate problem:
\[
\text{maximize}_{(a_2,b_2): a_2^T\Sigma_{XX}a_2 \leq 1, b_2^T\Sigma_{YY}b_2 \leq 1} a_2^T\Sigma_{XY}b_2
\]
subject to \(\text{Cov}_F(a_1^TX,a_2^TX)=\text{Cov}_F(b_1^TY, b_2^TY)=0\) .
Not hard to see that \(a_2=\Sigma_{XX}^{-1/2}U[,2], b_2=\Sigma_{YY}^{-1/2}V[,2]\) .
Realized correlation is \(\rho_2\) .
Continuing
\[
A = \Sigma_{XX}^{-1/2}U, \qquad B=\Sigma_{YY}^{-1/2}V.
\]
Final form of covariance matrix
\[
\text{Cov}_F(A'X, B'Y) = \begin{pmatrix} I & R \\
R^T & I \end{pmatrix}
\]
Assuming non-singular covariances, when \(p > q\)
\[
R = \begin{pmatrix} \rho \\ 0 \end{pmatrix}
\]
transposing when \(p \leq q\) .
Generative model
\[
\begin{aligned}
\check{Z} & \sim N(0, I_q) \\
\epsilon_X & \sim N(0, I_p) \\
\epsilon_Y & \sim N(0, I_q) \\
Z_Y &= \sqrt{1 - \rho} \cdot \epsilon_Y + \sqrt{\rho} \cdot \check{Z} \\
Z_X[1:q] &= \sqrt{1 - \rho} \cdot \epsilon_X[1:q] + \sqrt{\rho} \cdot \check{Z} \\
Z_X[(q+1):p] &= \epsilon_X[(q+1):p] \\
X &= \Sigma_{XX}^{1/2} \textcolor{red}{\bar{U}} Z_X \\
Y &= \Sigma_{YY}^{1/2} \textcolor{red}{V} Z_Y
\end{aligned}
\]
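A quick simulation of this generative model, taking \(\Sigma_{XX}=I_p\), \(\Sigma_{YY}=I_q\), \(U=I\), \(V=I\) for simplicity (a sketch, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50_000, 3, 2
rho = np.array([0.8, 0.3])               # target canonical correlations

Zc = rng.normal(size=(n, q))             # shared latent \check{Z}
eX = rng.normal(size=(n, p))
eY = rng.normal(size=(n, q))

Zy = np.sqrt(1 - rho) * eY + np.sqrt(rho) * Zc
Zx = eX.copy()
Zx[:, :q] = np.sqrt(1 - rho) * eX[:, :q] + np.sqrt(rho) * Zc
X, Y = Zx, Zy                            # Sigma^{1/2}, U, V are identities here

# coordinate-wise correlations recover the canonical correlations
emp = np.array([np.corrcoef(X[:, j], Y[:, j])[0, 1] for j in range(q)])
```

Each coordinate pair has unit variance and covariance \(\rho_j\), so `emp` should be close to `rho` up to sampling noise.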
Sample version
Simply replace \(\Sigma\) with \(\widehat{\Sigma}\) .
Under non-degeneracy assumptions, matrices \(\widehat{\Sigma}_{XX}\) and \(\widehat{\Sigma}_{YY}\) will be invertible if \(n > \text{max}(q,p)\) .
As in PPCA, Bach & Jordan (2006) show that sample estimates are MLE for this generative model
\(\implies\) could use likelihood to choose rank…
library(CCA)
X = mm[,1:3]
Y = mm[,-c(1:3)]
CCA_mm = CCA::cc(X, Y)
names(CCA_mm)
[1] "cor" "names" "xcoef" "ycoef" "scores"
Checking CCA solution
crosscov = cov(mm)[1:3,4:8]
t(CCA_mm$xcoef) %*% crosscov %*% CCA_mm$ycoef
[,1] [,2] [,3]
[1,] 4.640861e-01 5.599087e-17 -1.930228e-16
[2,] 3.610083e-16 1.675092e-01 -3.725764e-16
[3,] 4.012970e-16 -5.498621e-16 1.039911e-01
The estimated \(X\) loadings (xcoef):
[,1] [,2] [,3]
locus_of_control -1.2538339 -0.6214776 -0.6616896
self_concept 0.3513499 -1.1876866 0.8267210
motivation -1.2624204 2.0272641 2.0002283
Diagonalizes the \(X\) and \(Y\) subblocks
Xcov = cov(mm)[1:3,1:3]
t(CCA_mm$xcoef) %*% Xcov %*% CCA_mm$xcoef
[,1] [,2] [,3]
[1,] 1.000000e+00 2.478633e-16 2.048775e-16
[2,] 3.413977e-16 1.000000e+00 -1.345034e-16
[3,] 2.169002e-16 -1.469862e-16 1.000000e+00
Ycov = cov(mm)[4:8,4:8]
t(CCA_mm$ycoef) %*% Ycov %*% CCA_mm$ycoef
[,1] [,2] [,3]
[1,] 1.000000e+00 -1.975726e-16 2.677851e-17
[2,] -1.686276e-16 1.000000e+00 -4.013920e-15
[3,] -1.444754e-17 -4.112697e-15 1.000000e+00
plot(CCA_mm$scores$xscores[,1], CCA_mm$scores$yscores[,1])
from sklearn.cross_decomposition import CCA
import numpy as np
import matplotlib.pyplot as plt

cca = CCA(n_components=3)
cca.fit(X, Y)
X_t, Y_t = cca.transform(X, Y)
fig, ax = plt.subplots()
ax.scatter(X_t[:,0], Y_t[:,0])
ax.set_title(f'Correlation: {np.corrcoef(X_t[:,0], Y_t[:,0])[0,1]}')
CCA_mm_scaled = CCA::cc(scale(X, TRUE, TRUE), scale(Y, TRUE, TRUE))
cbind(CCA_mm$cor, CCA_mm_scaled$cor)
[,1] [,2]
[1,] 0.4640861 0.4640861
[2,] 0.1675092 0.1675092
[3,] 0.1039911 0.1039911
lm(CCA_mm$scores$xscores[,1] ~ CCA_mm_scaled$scores$xscores[,1])
Call:
lm(formula = CCA_mm$scores$xscores[, 1] ~ CCA_mm_scaled$scores$xscores[,
1])
Coefficients:
(Intercept) CCA_mm_scaled$scores$xscores[, 1]
9.15e-16 1.00e+00
A = matrix(rnorm(9), 3, 3)
XA = as.matrix(X) %*% A
CCA_affine = CCA::cc(XA, Y)
cbind(CCA_mm$cor, CCA_affine$cor)
[,1] [,2]
[1,] 0.4640861 0.4640861
[2,] 0.1675092 0.1675092
[3,] 0.1039911 0.1039911
lm(CCA_mm$scores$xscores[,1] ~ CCA_affine$scores$xscores[,1])
Call:
lm(formula = CCA_mm$scores$xscores[, 1] ~ CCA_affine$scores$xscores[,
1])
Coefficients:
(Intercept) CCA_affine$scores$xscores[, 1]
8.153e-16 -1.000e+00
Relation to reduced rank regression
For each of \(X\) and \(Y\) , we saw that there are mappings
\[
X \mapsto U^T\Sigma^{-1/2}_{XX}X = A^TX = Z_X, \qquad Y \mapsto V^T\Sigma_{YY}^{-1/2}Y = B^TY = Z_Y
\]
to unitless canonical variates .
These canonical variates drive the correlation between \(X\) and \(Y\) …
Reduced rank regression for rank \(k \leq \text{min}(p, q)\)
Due to affine equivariance, it’s not hard to see that
\[
\hat{Y} = \Sigma_{YY}^{1/2} V[,1:k] \rho[1:k] Z_X[1:k]
\]
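For linear \(X\), \(Y\) one can check that at full rank \(k=q\) this coincides with ordinary least squares regression of \(Y\) on \(X\). A numpy sketch on synthetic data (illustrative; names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 500, 4, 3
X = rng.normal(size=(n, p))
Y = X[:, :q] @ rng.normal(size=(q, q)) + rng.normal(size=(n, q))
X = X - X.mean(0); Y = Y - Y.mean(0)

Sxx = X.T @ X / (n - 1); Syy = Y.T @ Y / (n - 1); Sxy = X.T @ Y / (n - 1)

def mat_pow(S, alpha):
    """Symmetric matrix power via eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return (Q * w ** alpha) @ Q.T

U, rho, Vt = np.linalg.svd(mat_pow(Sxx, -0.5) @ Sxy @ mat_pow(Syy, -0.5))
V = Vt.T

k = q  # full rank; smaller k gives the reduced rank fit
B_rrr = (mat_pow(Sxx, -0.5) @ U[:, :k] @ np.diag(rho[:k])
         @ V[:, :k].T @ mat_pow(Syy, 0.5))
B_ols = np.linalg.solve(Sxx, Sxy)   # OLS coefficient of Y on X
# at k = q, B_rrr equals B_ols
```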
Inference: how to test \(\rho_j=0, j \geq k\) ?
Complicated joint densities…
Not too hard under global null \(H_0:\Sigma_{XY}=0\) , though these tests will all look like tests for regression \(Y\) onto \(X\) .
Under the global null \(H_0: X \ \text{independent of} \ Y\) we could permute either \(X\) or \(Y\)…
Generalizations
For non-Euclidean data we might featurize either \(X\) , \(Y\) or both.
This idea shows up over and over: given features \(X\) , make new features through a basis expansion (which include the kernel trick ).
We see this in two-sample tests (e.g. in the kernelized versions); linear regression; kernel PCA; etc.
Wherever we form random variables like \(\beta^TX\) , we can replace with \(W=f(X)\) …
Often, methods with richer (i.e. more) features will also include a regularization term …
Which new features? This is domain dependent…
Kernelized CCA
Given linear spaces of features \({\cal H}_X, {\cal H}_Y\) we might look at
\[
\text{maximize}_{(f, g) \in {\cal H}_X \times {\cal H}_Y: \text{Var}_F(f(X)) \leq 1, \text{Var}_F(g(Y)) \leq 1}
\text{Cov}_F(f(X), g(Y)).
\]
If \({\cal H}_X\) and \({\cal H}_Y\) are RKHS, this is kernelized CCA .
A possible issue?
For kernelized CCA, if \(\text{max}(\text{dim}({\cal H}_X), \text{dim}({\cal H}_Y)) > n\) the matrix
\[
\widehat{\text{Cov}}_F(f_i(X), f_j(X)) = \widehat{\Sigma}_{f(X),f(X)}
\]
is degenerate so we can’t really form \(\widehat{\Sigma}_{f(X),f(X)}^{-1/2}\) …
Generic RKHS will always have this problem. How to resolve?
Regularized covariance estimate
Let’s just assume that \({\cal H}_X\) is finite dimensional of dimension \(k > n\) and that \({\cal H}_Y\) consists of just the usual linear functions.
We can then just work on the usual CCA with data matrices \(Y\) and \(W=f(X) \in \mathbb{R}^{n \times k}\) .
A simple remedy is to consider instead
\[
\text{maximize}_{(a,b): a^T(\widehat{\Sigma}_{WW}+\epsilon I)a \leq 1, b^T\widehat{\Sigma}_{YY}b \leq 1} a^T\widehat{\Sigma}_{WY}b
\]
Will require SVD of \((\widehat{\Sigma}_{WW} + \epsilon I)^{-1/2} \widehat{\Sigma}_{WY} \widehat{\Sigma}_{YY}^{-1/2}\) .
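A sketch of the degenerate case and its ridge fix, with a random matrix standing in for the feature matrix \(W=f(X)\) (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, q = 40, 100, 2                  # k > n: Sigma_WW is rank deficient
W = rng.normal(size=(n, k))           # stand-in for the feature matrix f(X)
Y = rng.normal(size=(n, q))
W = W - W.mean(0); Y = Y - Y.mean(0)

Sww = W.T @ W / (n - 1)               # rank <= n - 1 < k: not invertible
Syy = Y.T @ Y / (n - 1)
Swy = W.T @ Y / (n - 1)

eps = 1e-2

def inv_sqrt(S):
    w, Q = np.linalg.eigh(S)
    return (Q * w ** -0.5) @ Q.T

# regularized whitening is well defined: eigenvalues of Sww + eps*I are >= eps
M = inv_sqrt(Sww + eps * np.eye(k)) @ Swy @ inv_sqrt(Syy)
rho = np.linalg.svd(M, compute_uv=False)
# regularization shrinks: the resulting correlations stay <= 1
```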
Kernelized version
Instead of \(\widehat{\text{Var}}(f(X))\) could use
\[
f \mapsto \widehat{\text{Var}}(f(X)) + \epsilon \|f\|^2_{{\cal H}_X}.
\]
Even simpler (in theory): subsample the “knots” in the kernel (e.g. Snelson & Ghahramani (2005) )
By the usual kernel trick, this reduces the problem to an \(n\)-dimensional one.
In this case, an eigenvalue problem for \(n \times n\) matrix…
Other covariance regularization
(PMD) Witten et al. (pretends \(\Sigma_{WW}= \kappa I\) )
NOCCO in Fukumizu et al. (uses the regularized covariance operator in RKHS)
Use row/column covariance Allen et al.
Each of these methods boils down to choosing two quadratic forms \(Q_X, Q_Y\) …
Other structure
Restricting to finite dimensions, the regularized problems above morally look like (assuming \(X\) and \(Y\) have been centered)
\[
\text{maximize}_{(a,b): a^TQ_Xa \leq 1, b^TQ_Yb \leq 1} a^TX^TYb
\]
It is often tempting to interpret loadings \((a,b)\) just as in PCA.
For such settings, it is often desirable to impose structure on \((a,b)\) .
Sparse CCA and variants
In bound form: \[
\text{maximize}_{(a,b): a^TQ_Xa \leq 1, b^TQ_Yb \leq 1, \|a\|_1 \leq c_X } a^TX^TYb
\]
In Lagrange form: \[
\text{maximize}_{(a,b): a^TQ_Xa \leq 1, b^TQ_Yb \leq 1} a^TX^TYb - \lambda_X \|a\|_1
\]
Non-negative factors
\[
\text{maximize}_{(a,b): a^TQ_Xa \leq 1, b^TQ_Yb \leq 1, a \geq 0, b \geq 0} a^TX^TYb
\]
Relation to sparse PCA
Corresponds to swapping \(X^TY\) with \(X\) …
A bi-convex problem
The function \((a,b) \mapsto -a^TX^TYb\) is not convex.
Numeric feasibility of solving PCA rests on computability of SVD even though it is not a convex problem.
However, for \(b\) fixed, \(a \mapsto -a^TX^TYb\) is linear, hence convex.
Alternating algorithm: given a feasible initial \((\widehat{a}^{(0)}, \widehat{b}^{(0)})\) we iterate (for the \(\ell_1\) bound form above):
\[
\begin{aligned}
\widehat{a}^{(t)} &= \text{argmin}_{a: a^TQ_Xa \leq 1, \|a\|_1 \leq c_X} - a^TX^TY\widehat{b}^{(t-1)} \\
\widehat{b}^{(t)} &= \text{argmin}_{b: b^TQ_Yb \leq 1} - (\widehat{a}^{(t)})^TX^TYb \\
\end{aligned}
\]
Yields an ascent algorithm on objective \((a,b) \mapsto a^TX^TYb\) … often with pretty simple updates.
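A minimal sketch of such an alternating scheme, in the Lagrange form with \(Q_X = Q_Y = I\), where each update is a soft-thresholding step (PMD-style; the function names are ours):

```python
import numpy as np

def soft(x, lam):
    """Soft-thresholding, the prox of the l1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_cca_rank1(C, lam_a=0.0, lam_b=0.0, iters=500, seed=0):
    """Alternating maximization of a^T C b subject to ||a||_2 <= 1,
    ||b||_2 <= 1, with l1 shrinkage (Lagrange form, Q_X = Q_Y = I)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=C.shape[1])
    b /= np.linalg.norm(b)
    for _ in range(iters):
        a = soft(C @ b, lam_a)
        a /= max(np.linalg.norm(a), 1e-12)
        b = soft(C.T @ a, lam_b)
        b /= max(np.linalg.norm(b), 1e-12)
    return a, b
```

With the penalties set to zero the updates are exactly power iteration for the SVD of \(C = X^TY\); positive penalties zero out small coordinates of the loadings.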
Deflation not completely obvious, but there are reasonable proposals
An example with a pretty picture: CGH
Dataset breastdata from PMA has both gene expression data (rna) and CGH measurements (dna) on 89 samples.
Our example (which is in the PMA package) will just use CGH from chromosome 1: t(dna)[,chrom==1] is an \(89 \times 136\) matrix, used together with all of t(rna), an \(89 \times 19672\) matrix.
CGH data is modeled as piecewise constant with a (sparse) fused LASSO penalty constraint \[
{\cal P}_{FL}(v) = \|v\|_1 + \alpha \|Dv\|_1 \leq c_v
\] while gene expression is sparse \[
{\cal P}_S(u) = \|u\|_1.
\]
Optimization problem: \[
\text{maximize}_{(u,v): \|u\|_2\leq 1, {\cal P}_S(u) \leq c_u, \|v\|_2 \leq 1, {\cal P}_{FL}(v) \leq c_v} u^TX^TYv
\]
Inevitably, one has to choose tuning parameters…
The authors propose permuting \(X\) and rerunning the algorithm while retaining the achieved value of the problem. Tuning parameters are chosen to yield the smallest p-value.
set.seed(22)
dna_t = t(dna)
rna_t = t(rna)
perm.out = CCA.permute(x=rna_t,
                       ## Run CCA using all gene exp. data, but CGH data on chrom 1 only.
                       z=dna_t[,chrom==1],
                       typex="standard",
                       typez="ordered",
                       nperms=10,
                       penaltyxs=seq(.02,.7,len=10))
Permutation 1 out of 10 12345678910
Permutation 2 out of 10 12345678910
Permutation 3 out of 10 12345678910
Permutation 4 out of 10 12345678910
Permutation 5 out of 10 12345678910
Permutation 6 out of 10 12345678910
Permutation 7 out of 10 12345678910
Permutation 8 out of 10 12345678910
Permutation 9 out of 10 12345678910
Permutation 10 out of 10 12345678910
Call: CCA.permute(x = rna_t, z = dna_t[, chrom == 1], typex = "standard",
typez = "ordered", penaltyxs = seq(0.02, 0.7, len = 10),
nperms = 10)
Type of x: standard
Type of z: ordered
X Penalty Z Penalty Z-Stat P-Value Cors Cors Perm FT(Cors) FT(Cors Perm)
1 0.020 0.032 3.694 0.0 0.825 0.641 1.171 0.762
2 0.096 0.032 1.487 0.0 0.801 0.684 1.101 0.845
3 0.171 0.032 1.882 0.0 0.805 0.649 1.113 0.783
4 0.247 0.032 1.786 0.0 0.782 0.623 1.050 0.739
5 0.322 0.032 1.591 0.0 0.753 0.603 0.980 0.706
6 0.398 0.032 1.396 0.0 0.726 0.587 0.921 0.681
7 0.473 0.032 1.221 0.0 0.701 0.574 0.869 0.661
8 0.549 0.032 1.063 0.0 0.677 0.563 0.823 0.644
9 0.624 0.032 0.922 0.2 0.655 0.553 0.784 0.630
10 0.700 0.032 0.796 0.2 0.635 0.546 0.750 0.619
# U's Non-Zero # Vs Non-Zero
1 12 90
2 267 72
3 1189 71
4 2639 62
5 4329 62
6 6278 61
7 8399 61
8 10860 61
9 13536 61
10 16252 61
Best L1 bound for x: 0.02
Best lambda for z: 0.03178034
What is the \(Z\) -score?
For each permutation, there is a realized value to the optimization problem for each value of the tuning parameter.
Can compute mean and SD over permutation data (for each value of the tuning parameter), yielding a reference distribution.
\(Z\) -score compares observed realized value to the permutation distribution.
Tuning parameter is chosen to maximize the \(Z\) -score.
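The permutation \(Z\)-score idea in miniature, using the leading canonical correlation as the realized value (synthetic data; a sketch only, not the PMA implementation):

```python
import numpy as np

def first_cancor(X, Y):
    """Leading canonical correlation via the whitened SVD."""
    Xc = X - X.mean(0); Yc = Y - Y.mean(0)
    def inv_sqrt(S):
        w, Q = np.linalg.eigh(S)
        return (Q * w ** -0.5) @ Q.T
    Sxx, Syy, Sxy = Xc.T @ Xc, Yc.T @ Yc, Xc.T @ Yc
    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(4)
n, p, q = 200, 3, 3
X = rng.normal(size=(n, p))
Y = X + rng.normal(size=(n, q))       # genuinely correlated pair

obs = first_cancor(X, Y)
# permuting rows of X breaks the X-Y pairing: draws from the null
null = np.array([first_cancor(X[rng.permutation(n)], Y) for _ in range(50)])
z = (obs - null.mean()) / null.std()
```

With real signal present, the observed value sits far in the tail of the permutation distribution, so `z` is large.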
out = CCA(x=rna_t,
          z=dna_t[,chrom==1],
          typex="standard",
          typez="ordered",
          penaltyx=perm.out$bestpenaltyx,
          v=perm.out$v.init,
          penaltyz=perm.out$bestpenaltyz,
          xnames=substr(genedesc,1,20),
          znames=paste("Pos", sep="", nuc[chrom==1]))
print(out, verbose=FALSE) # could do print(out, verbose=TRUE)
Call: CCA(x = rna_t, z = dna_t[, chrom == 1], typex = "standard", typez = "ordered",
penaltyx = perm.out$bestpenaltyx, penaltyz = perm.out$bestpenaltyz,
v = perm.out$v.init, xnames = substr(genedesc, 1, 20), znames = paste("Pos",
sep = "", nuc[chrom == 1]))
Num non-zeros u's: 12
Num non-zeros v's: 90
Type of x: standard
Type of z: ordered
Penalty for x: L1 bound is 0.02
Penalty for z: Lambda is 0.03178034
Cor(Xu,Zv): 0.8247056
print(genechr[out$u != 0]) # which chromosome the selected genes are on
[1] 1 1 1 1 1 1 1 1 1 1 1 1
PlotCGH(out$v,
        nuc=nuc[chrom==1],
        chrom=chrom[chrom==1],
        main="Regions of gain/loss on Chrom 1 assoc'd with gene expression")
Multiple CCA
We had gene expression and CGH data on the same samples.
Can easily conceive of situations where we have yet another type of phenotype.
Given samples from \(X=(X_1, \dots, X_r) \sim F, X_i \in \mathbb{R}^{d_i}\) we can write its covariance as
\[
\text{Var}_F(X) = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \dots \\
\Sigma_{21} & \Sigma_{22} & \dots \\
\vdots & \vdots & \ddots
\end{pmatrix}
\] with \(\Sigma_{ij} \in \mathbb{R}^{d_i \times d_j}\) .
Forming linear combinations \(a_i^TX_i\) yields
\[
\text{Var}_F(a_1^TX_1,\dots, a_r^TX_r) = \begin{pmatrix} a_1^T\Sigma_{11}a_1 & a_1^T\Sigma_{12}a_2 & \dots \\
a_2^T\Sigma_{21}a_1 & a_2^T\Sigma_{22}a_2 & \dots \\
\vdots & \vdots & \ddots
\end{pmatrix} \in \mathbb{R}^{r \times r}
\]
Pick your favorite functional \(\Phi\) of the off-diagonal blocks – try to maximize
\[
\Phi\left(\text{Var}_F(a_1^TX_1,\dots, a_r^TX_r)\right)
\]
subject to \(\text{Var}_F(a_i^TX_i)=1, 1 \leq i \leq r\) .
Examples of \(\Phi\) : sum of off-diagonals; max of off-diagonals…
Covariance can be regularized…
Can impose structure (sparsity, non-negativity, etc.)
Solving the problem? Certainly the sum of off-diagonals is multi-convex…
See Witten and Tibshirani (2009)
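For the sum-of-off-diagonals choice of \(\Phi\), block-wise updates have a closed form: fixing the other blocks, the optimal \(a_i\) is proportional to \(\Sigma_{ii}^{-1}\sum_{j \neq i}\Sigma_{ij}a_j\). A sketch with three synthetic views sharing one latent factor (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
shared = rng.normal(size=(n, 1))                    # common latent factor
views = [shared @ rng.normal(size=(1, d)) + rng.normal(size=(n, d))
         for d in (3, 4, 5)]
views = [V - V.mean(0) for V in views]
covs = [V.T @ V / (n - 1) for V in views]

def objective(a):
    """Sum of off-diagonal covariances of the block variates."""
    return sum((views[i] @ a[i]) @ (views[j] @ a[j]) / (n - 1)
               for i in range(3) for j in range(i + 1, 3))

a = [rng.normal(size=V.shape[1]) for V in views]
a = [ai / np.sqrt(ai @ S @ ai) for ai, S in zip(a, covs)]
obj0 = objective(a)

for _ in range(100):
    for i, Vi in enumerate(views):
        g = sum(Vi.T @ views[j] @ a[j] / (n - 1) for j in range(3) if j != i)
        ai = np.linalg.solve(covs[i], g)            # closed-form block update
        a[i] = ai / np.sqrt(ai @ covs[i] @ ai)      # enforce unit variance
# each sweep cannot decrease the sum-of-covariances objective
```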
Cooperative learning
The CCA problem is unsupervised (though is related to reduced rank regression…)
In supervised setting, we might have different views (qualitatively different types of data)
Genetic data
Different types of biological assay
Imaging data
\[
\text{minimize}_{f_X, f_Z} \frac{1}{2} \mathbb{E} \left[ \left(Y - f_X(X) - f_Z(Z) \right)^2 \right] + \frac{\rho}{2}\mathbb{E}\left[\left(f_X(X) - f_Z(Z)\right)^2 \right]
\]
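For linear working models \(f_X(X) = X\beta\), \(f_Z(Z) = Z\gamma\), the objective is quadratic and can be solved as a single stacked least-squares problem, where a second block of rows encodes the agreement penalty (a sketch of the stacking trick on synthetic data, not any particular package's API):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 300, 4, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
y = X[:, 0] + Z[:, 0] + rng.normal(size=n)
rho = 0.5

# minimize (1/2)||y - X b - Z g||^2 + (rho/2)||X b - Z g||^2:
# the second row block below contributes rho * ||X b - Z g||^2
A = np.vstack([np.hstack([X, Z]),
               np.sqrt(rho) * np.hstack([-X, Z])])
b_stacked = np.concatenate([y, np.zeros(n)])
coef, *_ = np.linalg.lstsq(A, b_stacked, rcond=None)
beta, gamma = coef[:p], coef[p:]
```

At \(\rho = 0\) this is ordinary least squares on the concatenated features; large \(\rho\) forces the two views' fitted contributions to agree.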
Underlying generative model (?)
\(X\) , \(Z\) as in the CCA generative model with common variation \(\check{Z}\)
\(Y | \check{Z}, X, Z \sim N(\alpha'\check{Z} + \beta'X + \gamma'Z, \sigma^2)\) with \(\beta\) , \(\gamma\) simple
Supervised target shares some of the same latent variables as \(X\) and \(Z\) …