1 Combining a phylogenetic trees into a data analysis

Data Integration

Data Integration

We now need to combine the phylogenetic tree, the denoised read abundances with the complementary information provided about the samples from which the reads were gathered.

This sample information is often provided as a spreadhseet (or .csv), and (mis)named the meta -data.


1.1 Multidomain (multiway) Data

  • DNA: The Genomic material present (16sRNA-gene especially).
  • RNA: What genes are being turned on (gene expression), transcriptomics.
  • Mass Spec: Specific signatures of chemical compounds present.
  • Clinical: Multivariate information about patients’ clinical status, medication, weight.
  • Environmental: Location, nutrition, time.
  • Domain Knowledge: Metabolic networks, phylogenetic trees, gene ontologies.

Then the full suite of data for this study – the sample-by-sequence feature table, the sample (meta)data, the sequence taxonomies, and the phylogenetic tree – are combined into a single object as follows:

library("phyloseq")
mimarksPathClean = "/Users/susan/Books/CUBook/data/MIMARKS_Data_clean.csv"
samdf = read.csv(mimarksPathClean, header=TRUE)
physeq = phyloseq(tax_table(taxtab), sample_data(samdf),
        otu_table(seqtab, taxa_are_rows = FALSE), phy_tree(fitGTR$tree))

This block will only usually be executed once. Once created, we save the object and do all the manipulations starting from that container.

We can also make data transformations, while maintaining the integrity of the links between all the data components.

2 Data Integration: Multi-table methods

2.1 Mantel coefficient between two distances

Earliest version of matrix association.

Given arbitrary dissimilarity matrices, it is defined as: \[\mbox{r}_{\rm m}(\bf X,\bf Y)= \frac{\sum_{i=1}^n \sum_{j=1,j \neq i}^n (d_{ij}^{\bf X}-\bar d^{\bf X})(d_{ij}^{\bf Y}-\bar d^{\bf Y})}{\sqrt{\sum_{i,j,j \neq i}(d_{ij}^{\bf X}-\bar d^{\bf X})^2 \sum_{i,j,j \neq i}(d_{ij}^{\bf Y}-\bar d^{\bf Y})^2}},\] with \(\bar d^{\bf X}\) (resp \(\bar d^{\bf Y}\)) the mean of the upper diagonal terms of the dissimilarity matrix associated to \(\bf X\) (resp. to \(\bf Y\)).

Correlation coefficient between the vectors gathering the upper diagonal terms of the dissimilarity matrices.

Its significance is assessed via a permutation test. The coefficient and its test are implemented in several R packages such as ade4, vegan and ecodist


Data Cubes

Data Cubes


2.2 Inertia, Co-Inertia and the RV coefficient

As in physics, we define inertia as a weighted sum of distances of weighted points.

This enables us to use abundance data in a contingency table and compute its inertia which in this case will be the weighted sum of the squares of distances between observed and expected frequencies, such as is used in computing the chisquare statistic.

Another generalization of variance-inertia is the useful Phylogenetic diversity index. (computing the sum of distances between a subset of taxa through the tree).

We also have such generalizations that cover variability of points on a graph taken from standard spatial statistics.


When studying two standardized variables measured at the same 10 locations; for instance PH and humidity the standard quantification of covariation is their covariance. \[cov(x,y)= sum(x1*y1+x2*y2+x3*y3 + \cdots + x10*y10)\] if x and y co-vary -in the same direction this will be big; we already saw that \[cor(x,y)=\frac{cov(x,y)}{sqrt(var(x))sqrt(var(y))}\]

A simple generalization to multiple matrices to measure as above is done through Co-Inertia analysis (CIA).


3 RV coefficient

The global measure of similarity of two data tables as opposed to two vectors can be done by a generalization of covariance provided by an inner product between tables that gives the RV coefficient, a number between 0 and 1, like a correlation coefficient, but for tables. \[RV(A,B)=\frac{Tr(A'AB'B)}{\sqrt{Tr(A'A)}\sqrt{Tr(B'B)}}\] There are several other measures of matrix correlation available in the package MatrixCorrelation.

If we do ascertain a link between two matrices.


3.1 Canonical Correlation Analysis (CCA)

CCA (Hotelling, 1938) works with two sets of variables and its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set of variables.

Finding correlated functions (covariates) of the two views of the same phenomenon by discarding the representation-specific details (noise) is expected to reveal the underlying hidden yet influential factors responsible for the correlation


3.2 Canonical Correlation Algorithm

Two matrices \(X\) and \(Y\) of order \(n > p\) and \(n > q\) respectively.

The columns of \(X\) and \(Y\) correspond to variables and the rows correspond to experimental units.

The jth column of the matrix \(X\) is denoted by \(X_j\), likewise the kth column of \(Y\) is denoted by \(Y_k\). Without loss of generality it will be assumed that the columns of X and Y are standardized (mean 0 and variance 1).

\(p \leq q\) (in other words, the group which contains the least variables is denoted by X).


We denote by \(S_{XX}\) and \(S_{YY}\) the sample covariance matrices for variable sets X and Y respectively, and by \(S_{XY} = S_{YX}^T\) the sample cross-covariance matrix between X and Y .

Classical CCA assumes first \(p \leq n\) and \(q \leq n\), then that matrices \(X\) and \(Y\) are of full column rank p and q respectively. In the following, the principle of CCA is presented as a problem solved through an iterative algorithm. The first stage of CCA consists in finding two vectors \(a =(a_1,...,a_p)^T\) and \(b =(b_1,...,b_q)^T\) that maximize the correlation between the linear combinations \[\begin{eqnarray*} U=Xa&=&a_1 X_1 +a_2 X_2 +\cdots a_p X_p\\ V=Yb&=&b_1 Y_1 +b_2 Y_2 +\cdots a_q Y_q\\ \end{eqnarray*}\] and assuming that vectors a and b are normalized so that \[var(U) = var(V) = 1 .\]


In other words, the problem consists in solving \[\rho_1=cor(U,V)=max_{a,b} cor (Xa,Yb) \mbox{ subject to } var(Xa)=var(Yb)=1\]

The resulting variables U and V are called the first canonical variates and \(\rho_1\) is referred to as the first canonical correlation.

Higher order canonical variates and canonical correlations can be found as a stepwise problem. For \(s = 1,...,p\), we can successively find positive correlations \(\rho_1 \geq \rho_2 \geq\ldots \geq \rho_p\) with corresponding vectors \((a^1, b^1), \ldots, (a^p, b^p)\), by maximizing \[\rho_s=cor(U^s,V^s)=max_{a^s,b^s} cor (Xa^s,Yb^s)\qquad \mbox{ subject to } var(Xa^s)=var(Yb^s)=1\] under the additional restrictions \[cor(U^s,U^t)=cor(V^s,V^t)=0, \mbox{ for } 1\leq t < s \leq p.\]


3.3 Sparse Canonical Correlation Analysis (sCCA)

When \(p>>n\), we want to keep non-zero coefficients to a minimum.

Sparse canonical correlation analysis (sparse CCA or sCCA), well-suited to both exploratory comparisons between samples and the identification of features with interesting variation.

We will use an implementation from the PMA package.

3.4 Example: microbiome and metabolites

Two tables: one contingency table of bacterial abundances and another with metabolites.

There are 12 samples, the metabolite table has measurements on 637 feature and the bacterial abundances had a total of 20,609 taxa, which we will filter down.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("phyloseq")
metab_path = "../../data/metabolites.csv"
load(path.expand("../../data/microbe.rda"))
metab   = read.csv(metab_path, row.names = 1) %>% as.matrix

We first filter down to bacteria and metabolites of interest, removing those that are zero across many samples. Then, we transform them to weaken the heavy tails.

library("genefilter")
metab   = metab[rowSums(metab == 0) <= 3, ]
microbe = prune_taxa(taxa_sums(microbe) > 4, microbe)
microbe = filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE)
metab = log(1 + metab, base = 10)
X = as.matrix(otu_table(microbe))
X[X > 50] = 50

We look as if there is any association between the two matrices, we have to match the same dimensions on both matrices:

library("ade4")
colnames(X)=colnames(metab)
pca1 = dudi.pca(t(metab), scal = TRUE, scann = FALSE)
pca2 = dudi.pca(t(X), scal = TRUE, scann = FALSE)
rv1 = RV.rtest(pca1$tab, pca2$tab, 999)
rv1
## Monte-Carlo test
## Call: RV.rtest(df1 = pca1$tab, df2 = pca2$tab, nrepet = 999)
## 
## Observation: 0.8190389 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: greater 
## 
##     Std.Obs Expectation    Variance 
## 6.166298307 0.344807728 0.005914678

3.5 Mantel Test

This is equivalent to the Mantel test on the distances:

distmetab<-dist(t(metab))
distmicro<- dist(t(X)) 
mtest<-mantel.randtest(distmicro, distmetab, 999)
mtest
## Monte-Carlo test
## Call: mantel.randtest(m1 = distmicro, m2 = distmetab, nrepet = 999)
## 
## Observation: 0.6442379 
## 
## Based on 999 replicates
## Simulated p-value: 0.002 
## Alternative hypothesis: greater 
## 
##      Std.Obs  Expectation     Variance 
##  5.273298169 -0.002450098  0.015039209

3.6 Where is the relation between the two domains?

We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables.


Our implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Larger values of penaltyx will result in fewer selected microbes, similarly penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide.

library("PMA")
ccaRes = CCA(t(X), t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 123456789101112131415
ccaRes
## Call: CCA(x = t(X), z = t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 
## 
## Num non-zeros u's:  5 
## Num non-zeros v's:  15 
## Type of x:  standard 
## Type of z:  standard 
## Penalty for x: L1 bound is  0.15 
## Penalty for z: L1 bound is  0.15 
## Cor(Xu,Zv):  0.9735258

With these parameters, 5 bacteria and 15 metabolites were selected based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables.

We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features.

Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the ``true’’ features in any sense.


We can still use these 20 features to compress information from the two tables without much loss.

To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA.

combined = cbind(t(X[ccaRes$u != 0, ]),
                 t(metab[ccaRes$v != 0, ]))
pcaRes = dudi.pca(combined, scannf = FALSE, nf = 3)

# annotation
genotype     = substr(rownames(pcaRes$li), 1, 2)
sampleType  = substr(rownames(pcaRes$l1), 3, 4)
featureType = grepl("\\.", colnames(combined))
featureType = ifelse(featureType, "Metabolite", "OTU")

sampleInfo  = data.frame(pcaRes$li, genotype, sampleType)
featureInfo = data.frame(pcaRes$c1,
                          feature = substr(colnames(combined), 1, 6))

We make a PCA triplot, where we show different types of samples and the multidomain features (Metabolites and OTUs).

This allows comparison across the measured samples – triangles for knockout and circles for wild type – and characterizes the influence the different features – diamonds with text labels.

For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features.

library("ggrepel")
## Loading required package: ggplot2
triplot = ggplot() +  geom_point(data = sampleInfo,
  aes(x = Axis1, y = Axis2, col = sampleType, shape = genotype), size = 3) +
  geom_label_repel(data = featureInfo,
  aes(x = 5.5 * CS1, y = 5.5 * CS2, label = feature, fill = featureType),
      size = 2, segment.size = 0.3,
      label.padding = unit(0.1, "lines"), label.size = 0) +
  geom_point(data = featureInfo,
             aes(x = 5.5 * CS1, y = 5.5 * CS2, fill = featureType),
             size = 1, shape = 23, col = "#383838") +
  scale_color_brewer(palette = "Set2") +
  scale_fill_manual(values = c("#a6d854", "#e78ac3")) +
  guides(fill = guide_legend(override.aes = list(shape = 32, size = 0))) +
  coord_fixed()+
  labs(x = sprintf("Axis1 [%s%% Variance]",
                   100 * round(pcaRes$eig[1] / sum(pcaRes$eig), 2)),
       y = sprintf("Axis2 [%s%% Variance]",
                   100 * round(pcaRes$eig[2] / sum(pcaRes$eig), 2)),
       fill = "Feature Type", col = "Sample Type")

triplot 


3.7 Three domain analysis

  • Taxa: Read counts (3 patients taking cipro: two time courses).
  • Mass-Spec Positive and Negative ion Mass Spec features and their intensities.
  • RNA-seq Metagenomic data on genes.

Here is the RV table of the three array types:

> fourtable$RV
          Taxa      Kegg   MassSpec+ MassSpec-
Taxa      1        0.565   0.561      0.670
Kegg      0.565     1      0.686      0.644
MassSpec+ 0.561    0.686   1          0.568
MassSpec- 0.670    0.644   0.568      1

3.8 STATIS and DiSTATIS method: general

Do the eigendecomposition of the RV matrix between tables:

  • used for data integration.
  • used for bootstrap and Bayes- multi-tables.
  • used for comparing networks.

4 Multidomain and Microbiome Analysis: A summary

  • Multiple data combined into special BioC containers.

  • Testing two distances through permutation Mantel or RV tests

  • Hierarchical testing.

  • Multiple Tables: RV coefficients and Canonical Correlation Analyses

  • Sparse methods

  • Multiple communities within single samples.