9 Multivariate methods for heterogeneous data

Real situations often involve, graphs, point clouds, attraction points, noise and different spatial milieux, a little like this picture where we have a rigid skeleton, waves, sun and starlings.

In Chapter 7, we saw how to summarize rectangular matrices whose columns were continuous variables. The maps we made used unsupervised dimensionality reduction techniques such as principal component analysis aimed at isolating the most important signal component in a matrix $$X$$ when all the columns have meaningful variances.

Here we extend these ideas to more complex heterogeneous data where continuous and categorical data are combined or even to data where individual variables are not available. Indeed, sometimes our observations cannot be easily described by features – but it is possible to determine distances or (dis)similarities between them, or to put them into a graph or a tree. Examples include species in a species tree or biological sequences. Outside of biology, there are text documents or sound files, where we may have a reasonable method to determine (dis)similarity between objects, but no absolute ‘coordinate system’ of features.

This chapter contains more advanced techniques for which we have omitted many technical details. We hope that by giving the reader some hands-on experience with examples and extensive references. We hope to enable readers who have come this far to understand some of the more cutting edge’ techniques in nonlinear multivariate analyses.

9.1 Goals for this chapter

In this chapter, we will:

• Extend linear dimension reduction methods to cases when the distances between observations are available, known as multidimensional scaling (MDS) or principal coordinates analysis.

• Find modifications of MDS that are nonlinear and robust to outliers.

• Encode combinations of categorical data and continuous data as well as so-called ‘supplementary’ information. We will see that this enables us to remove batch effects.

• Use chi-square distances and correspondence analysis (CA) to see where categorical data (contingency tables) contain notable dependencies.

• Generalize clustering methods that can uncover latent variables that are not categorical. This will allow us to detect gradients, “pseudotime” and hidden nonlinear effects in our data.

• Generalize the notion of variance and covariance to the study of tables of data from multiple different data domains.

9.2 Multidimensional scaling and ordination

Sometimes, data are not represented as points in a feature space. This can occur when we are provided with (dis)similarity matrices between objects such as drugs, images, trees or other complex objects, which have no obvious coordinates in $${\mathbb R}^n$$.

In Chapter 5 we saw how to produce clusters from distances. Here our goal is to visualize the data in maps in low dimensional spaces (e.g., planes) reminiscent of the ones we make from the first few principal axes in PCA.

We start with an example showing what we can do with simple geographic data. In Figure 9.1 a heatmap and clustering of the approximate road distances between some of the European cities is shown.

Figure 9.1: A heatmap of the distances between some of the cities. The function has re-arranged the order of the cities, grouping the closest ones.

library("pheatmap")
seteuro = as.matrix(distEuroN)[1:12, 1:12]
pheatmap(seteuro, cluster_rows = TRUE,
treeheight_row = 0.0001, treeheight_col = 0.8,
fontsize_col = 8, cellwidth = 13, cellheight = 13)

Given the these distances between cities, multidimensional scaling (MDS) provides a map’ of their relative locations. Of course, in this case the distances were originally measured as road distances (except for ferries), so we actually expect to find a two dimensional map that would represent the data well. With biological data, our maps are likely to be less clearcut. We call the function with:

MDSEuro = cmdscale(distEuroN, eig = TRUE)

We make a function that we can reuse to make the MDS screeplot from the result of a call to the cmdscale function:

Figure 9.2: Screeplot of the first 5 eigenvalues. The drop after the first two eigenvalues is very visible.

library("tibble")
plotbar = function(res, m = 9) {
tibble(eig = res$eig[seq_len(m)], k = seq(along = eig)) %>% ggplot(aes(x = k, y = eig)) + scale_x_discrete("k", limits = seq_len(m)) + theme_minimal() + geom_bar(stat="identity", width=0.5, color="orange", fill="pink") } plotbar(MDSEuro, m = 5) ► Question Make a barplot of all the eigenvalues ouput by the cmdscale function: what do you notice? ► Solution To position the points on the map we have projected them on the new coordinates created from the distances (we will discuss how the algorithm works in the next section). Note that while relative positions in Figure 9.3 are correct, the orientation of the map is unconventional: e.g., Istanbul, which is in the South-East of Europe, is at the top left. Figure 9.3: MDS map of European cities based on their distances. MDSeur = tibble( PCo1 = MDSEuro$points[, 1],
PCo2 = MDSEuro$points[, 2], labs = rownames(MDSEuro$points))
g = ggplot(MDSeur, aes(x = PCo1, y = PCo2, label = labs)) +
geom_point(color = "red") + xlim(-1950, 2000) + ylim(-1150, 1150) +
coord_fixed() + geom_text(size = 4, hjust = 0.3, vjust = -0.5)
g

We reverse the signs of the principal coordinates and redraw the map. We also read in the cities’ true longitudes and latitudes and plot these alongside for comparison (Figure 9.4).

g %+% mutate(MDSeur, PCo1 = -PCo1, PCo2 = -PCo2)
ggplot(Eurodf, aes(x = Long,y = Lat, label = rownames(Eurodf))) +
geom_point(color = "blue") + geom_text(hjust = 0.5, vjust = -0.5)

► Question

Which cities seem to have the worst representation on the PCoA map in the left panel of Figure 9.4?

► Solution

► Question

We drew the longitudes and latitudes in the right panel of Figure 9.4 without much attention to aspect ratio. What is the right aspect ratio for this plot?

► Solution

Note: MDS creates similar output as PCA, however there is only one ‘dimension’ to the data (the sample points). There is no ‘dual’ dimension and biplots are unavailable. This is a drawback when coming to interpreting the maps. Interpretation can be facilitated by examining carefully the extreme points and their differences.

9.2.1 How does the method work?

Let’s take a look at what would happen if we really started with points whose coordinates were known125 Here we commit a slight ‘abuse’ by using the longitude and longitude of our cities as Cartesian cooordinates and ignoring the curvature of the earth’s surface.. We put these coordinates into the two columns of a matrix with 24 rows. Now we compute the distances between points based on these coordinates. To go from the coordinates $$X$$ to distances, we write $d^2_{i,j} = (x_i^1 - x_j^1)^2 + \dots + (x_i^p - x_j^p)^2.$ We will call the matrix of squared distances DdotD in R and $$D\bullet D$$ in the text $$D^2$$ would mean D multiplied by itself, which is different than this.. We want to find points such that the square of their distances is as close as possible to the $$D\bullet D$$ observed.

X = as.matrix(Eurodf)
DdotD = as.matrix(dist(X)^2)

The relative distances do not depend on the point of origin of the data. We center the data by using a matrix $$H$$: the centering matrix defined as $$H=I-\frac{1}{n}{\mathbf{11}}^t$$. Let’s check the centering property of $$H$$ using:

n = nrow(X)
H = diag(rep(1,n))-(1/n) * matrix(1, nrow = n, ncol = n)
Xc = sweep(X,2,apply(X,2,mean))
Xc[1:2, ]
##                       Lat      Long
## Saint_Petersburg 10.78194 15.543056
## Stockholm        10.16528  3.309722
HX = H %*% X
HX[1:2, ]
##           Lat      Long
## [1,] 10.78194 15.543056
## [2,] 10.16528  3.309722
apply(HX, 2, mean)
##          Lat         Long
## 7.901629e-15 2.266633e-15

► Question

Call B0 the matrix obtained by applying the centering matrix both to the right and to the left of DdotD Consider the points centered at the origin given by the $$HX$$ matrix and compute its cross product, we’ll call this B2. What do you have to do to B0 to make it equal to B2?

► Solution

Therefore, given the squared distances between rows ($$D\bullet D$$) and the cross product of the centered matrix $$B=(HX)(HX)^t$$, we have shown:

$$$-\frac{1}{2} H(D\bullet D) H=B \tag{9.1}$$$

This is always true, and we use it to reverse-engineer an $$X$$ which satisfies Equation (9.1) when we are given $$D\bullet D$$ to start with.

From $$D\bullet D$$ to $$X$$ using singular vectors.

We can go backwards from a matrix $$D\bullet D$$ to $$X$$ by taking the eigen-decomposition of $$B$$ as defined in Equation (9.1). This also enables us to choose how many coordinates, or columns, we want for the $$X$$ matrix. This is very similar to how PCA provides the best rank $$r$$ approximation.
Note: As in PCA, we can write this using the singular value decomposition of $$HX$$ (or the eigen decomposition of $$HX(HX)^t$$):

$\begin{equation*} HX^{(r)} = US^{(r)}V^t \mbox{ with } S^{(r)} \mbox{ the diagonal matrix of the first } r \mbox{ singular values}, \end{equation*}$

$S^{(r) } = \begin{pmatrix} s_1 &0 & 0 &0 &...\\ 0&s_2&0 & 0 &...\\ 0& 0& ...& ... & ...\\ 0 & 0 & ... & s_r &...\\ ...& ...& ...& 0 & 0 \\ \end{pmatrix}$ This provides the best approximate representation in an Euclidean space of dimension $$r$$. The method is often called Principal Coordinates Analysis, or PCoA which stresses the connection to PCA. The algorithm gives us the coordinates of points that have approximately the same distances as those provided by the $$D$$ matrix.

Classical MDS Algorithm.

In summary, given an $$n \times n$$ matrix of squared interpoint distances $$D\bullet D$$, we can find points and their coordinates $$\tilde{X}$$ by the following operations:

1. Double center the interpoint distance squared and multiply it by $$-\frac{1}{2}$$:
$$B = -\frac{1}{2}H D\bullet D H$$.

2. Diagonalize $$B$$: $$\quad B = U \Lambda U^t$$.

3. Extract $$\tilde{X}$$: $$\quad \tilde{X} = U \Lambda^{1/2}$$.

Finding the right underlying dimensionality.

As an example, let’s take objects for which we have similarities (surrogrates for distances) but for which there is no natural underlying Euclidean space. In a psychology experiment from the 1950s, Ekman (1954) asked 31 subjects to rank the similarities of 14 different colors. His goal was to understand the underlying dimensionality of color perception. The similarity or confusion matrix was scaled to have values between 0 and 1. The colors that were often confused had similarities close to 1. We transform the data into a dissimilarity by subtracting the values from 1:

rownames(ekm) = colnames(ekm)
disekm = 1 - ekm - diag(1, ncol(ekm))
disekm[1:5, 1:5]
##      w434 w445 w465 w472 w490
## w434 0.00 0.14 0.58 0.58 0.82
## w445 0.14 0.00 0.50 0.56 0.78
## w465 0.58 0.50 0.00 0.19 0.53
## w472 0.58 0.56 0.19 0.00 0.46
## w490 0.82 0.78 0.53 0.46 0.00
disekm = as.dist(disekm)

We compute the MDS coordinates and eigenvalues. We combine the eigenvalues in the screeplot shown in Figure 9.5:

Figure 9.5: The screeplot shows us that the phenomenon is two dimensional, giving a clean answer to Ekman’s question.

mdsekm = cmdscale(disekm, eig = TRUE)
plotbar(mdsekm)

We plot the different colors using the first two principal coordinates as follows:

dfekm = as_tibble(mdsekm$points[,1:2])%>% setNames(paste0("MDS", 1:2)) %>% mutate( name = rownames(ekm), rgb = photobiology::w_length2rgb( as.numeric(sub("w", "", name)))) ## Warning: as_tibble.matrix() requires a matrix with column names or a .name_repair argument. Using compatibility .name_repair. ## This warning is displayed once per session. Figure 9.6: The layout of the scatterpoints in the first two dimensions has a horseshoe shape. The labels and colors show that the arch corresponds to the wavelengths. library("ggrepel") ggplot(dfekm, aes(x = MDS1, y = MDS2)) + geom_point(col = dfekm$rgb, size = 4) +
geom_text_repel(aes(label = name)) + coord_fixed()

Figure 9.6 shows the Ekman data in the new coordinates. There is a striking pattern that calls for explanation. This horseshoe or arch structure in the points is often an indicator of a sequential latent ordering or gradient in the data (Diaconis, Goel, and Holmes 2007). We will revisit this in Section 9.5.

9.2.2 Robust versions of MDS

Robustness: A method is robust if it is not too influenced by a few outliers. For example, the median of a set of $$n$$ numbers does not change by a lot even if we change 20 the numbers by arbitrarily large amounts; to drastically shift the median, we need to change more than half of the numbers. In contrast, we can change the mean by a large amount by just manipulating one of the numbers. We say that the breakdown point of the median is 1/2, while that of the mean is only $$1/n$$. Both mean and median are estimators of the location of a distribution (i.e., what is a “typical” value of the numbers), but the median is more robust. The median is based on the ranks; more generally, methods based on ranks are often more robust than those based on the actual values. Many nonparametric tests are based on reductions of data to their ranks. Multidimensional scaling aims to minimize the difference between the squared distances as given by $$D\bullet D$$ and the squared distances between the points with their new coordinates. Unfortunately, this objective tends to be sensitive to outliers: one single data point with large distances to everyone else can dominate, and thus skew, the whole analysis. Often, we like to use something that is more robust, and one way to achieve this is to disregard the actual values of the distances and only ask that the relative rankings of the original and the new distances are as similar as possible. Such a rank based approach is robust: its sensitivity to outliers is reduced.

We will use the Ekman data to show how useful robust methods are when we are not quite sure about the ‘scale’ of our measurements. Robust ordination, called non metric multidimensional scaling (NMDS for short) only attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix.

Non metric MDS looks for a transformation $$f$$ of the given dissimilarities in the matrix $$d$$ and a set of coordinates in a low dimensional space (the map) such that the distance in this new map is $$\tilde{d}$$ and $$f(d)\thickapprox \tilde{d}$$. The quality of the approximation can be measured by the standardized residual sum of squares (stress) function:

$\begin{equation*} \text{stress}^2=\frac{\sum(f(d)-\tilde{d})^2}{\sum d^2}. \end{equation*}$

NMDS is not sequential in the sense that we have to specify the underlying dimensionality at the outset and the optimization is run to maximize the reconstruction of the distances according to that number. There is no notion of percentage of variation explained by individual axes as provided in PCA. However, we can make a simili-screeplot by running the program for all the successive values of $$k$$ ($$k=1, 2, 3, ...$$) and looking at how well the stress drops. Here is an example of looking at these successive approximations and their goodness of fit. As in the case of diagnostics for clustering, we will take the number of axes after the stress has a steep drop.

Because each calculation of a NMDS result requires a new optimization that is both random and dependent on the $$k$$ value, we use a similar procedure to what we did for clustering in Chapter 4. We execute the metaMDS function, say, 100 times for each of the four possible values of $$k$$ and record the stress values.

library("vegan")
nmds.stress = function(x, sim = 100, kmax = 4) {
sapply(seq_len(kmax), function(k)
replicate(sim, metaMDS(x, k = k, autotransform = FALSE)$stress)) } stress = nmds.stress(disekm, sim = 100) dim(stress) Let’s look at the boxplots of the results. This can be a useful diagnostic plot for choosing $$k$$ (Figure 9.7). Figure 9.7: Several replicates at each dimension were run to evaluate the stability of the stress. We see that the stress drops dramatically with two or more dimensions, thus indicating that a two dimensional solution is appropriate here. dfstr = reshape2::melt(stress, varnames = c("replicate","dimensions")) ggplot(dfstr, aes(y = value, x = dimensions, group = dimensions)) + geom_boxplot() We can also compare the distances and their approximations using what is known as a Shepard plot for $$k=2$$ for instance, computed with: Figure 9.8: The Shepard’s plot compares the original distances or dissimilarities (along the horizonal axis) to the reconstructed distances, in this case for $$k=2$$ (vertical axis). nmdsk2 = metaMDS(disekm, k = 2, autotransform = FALSE) stressplot(nmdsk2, pch = 20) Both the Shepard’s plot in Figure 9.8 and the screeplot in Figure 9.7 point to a two-dimensional solution for Ekman’s color confusion study. Let’s compare the output of the two different MDS programs, the classical metric least squares approximation and the nonmetric rank approximation method. The right panel of Figure 9.9 shows the result from the nonmetric rank approximation, the left panel is the same as Figure 9.6. The projections are almost identical in both cases. For these data, it makes little difference whether we use a Euclidean or nonmetric multidimensional scaling method. dfnmdsk2 = as_tibble(nmdsk2$points[,1:2]) %>%
setNames(paste0("NmMDS", 1:2)) %>%
bind_cols(select(dfekm, rgb, name))
ggplot(dfnmdsk2, aes(x = NmMDS1, y = NmMDS2)) +
geom_point(col = dfekmrgb, size = 4) + geom_text_repel(aes(label = name)) 9.3 Contiguous or supplementary information centerlinetextsfMetadata Many programs and workflows for biological sequence analysis or assays separate the environmental and contextual information, which they call metadata, from the assay data or sequence reads. We discourage such practice as the exact connections between the samples and covariates are important. A lost connection between the assays and covariates makes later analyses impossible. Covariates such as clinical history, time, batch or location are important and should be considered components of the data. In Chapter 3 we introduced the R data.frame class that enables us to combine heterogeneous data types: categorical factors, text and continuous values. Each row of a dataframe corresponds to an object, or a record, and the columns are the different variables, or features. Extra information about sample batches, dates of measurement, different protocols are often named metadata; this can be misnomer if it is implied that metadata are somehow less important. Such information is real data that need to be integrated into the analyses. We typically store it in a data.frame or a similar R class and tightly link it to the primary assay data. 9.3.1 Known batches in data Here we show an example of an analysis that was done by Holmes et al. (2011) on bacterial abundance data from Phylochip (Brodie et al. 2006) microarrays. The experiment was designed to detect differences between a group of healthy rats and a group who had Irritable Bowel Disease (Nelson et al. 2010). This example shows a case where the nuisance batch effects become apparent in the analysis of experimental data. It is an illustration of the fact that best practices in data analyses are sequential and that it is better to analyse data as they are collected to adjust for severe problems in the experimental design as they occur, instead of having to deal with them post mortem126 Fisher’s terminology, see Chapter 13.. When data collection started on this project, days 1 and 2 were delivered and we made the plot that appears in Figure 9.11. This showed a definite day effect. When investigating the source of this effect, we found that both the protocol and the array were different in days 1 and 2. This leads to uncertainty in the source of variation, we call this confounding of effects. Bioconductor container: These data are an example of an awkward way of combining batch information with the actual data. The day information has been combined with the array data and encoded as a number and could be confused with a continuous variable. We will see in the next section a better practice for storing and manipulating heterogeneous data using a Bioconductor container called SummarizedExperiment. We load the data and the packages we use for this section: IBDchip = readRDS("../data/vsn28Exprd.rds") library("ade4") library("factoextra") library("sva") ► Question What class is the IBDchip ? Look at the last row of the matrix, what do you notice? ► Solution The data are normalized abundance measurements of 8634 taxa measured on 28 samples. We use a rank-threshold transformation, giving the top 3000 most abundant tax scores from 3000 to 1, and letting the 5634 least abundant all have a score of 1. We separate out the assay data from the day variable which should be considered a factor: assayIBD = IBDchip[-nrow(IBDchip), ] day = factor(IBDchip[nrow(IBDchip), ]) Figure 9.10: The screeplot shows us that the phenomenon can be usefully represented in two dimensions. Instead of using the continuous, normalized data, we use a robust analysis replacing the values by their ranks. The lower values are considered ties encoded as a threshold chosen to reflect the number of expected taxa thought to be present: rankthreshPCA = function(x, threshold = 3000) { ranksM = apply(x, 2, rank) ranksM[ranksM < threshold] = threshold ranksM = threshold - ranksM dudi.pca(t(ranksM), scannf = FALSE, nf = 2) } pcaDay12 = rankthreshPCA(assayIBD[,day!=3]) day12 = day[day!=3] fviz(pcaDay12, element="ind", axes=c(1,2), geom=c("point","text"), habillage = day12, repel = TRUE, palette = "Dark2", addEllipses = TRUE, ellipse.type = "convex") + ggtitle("") + coord_fixed() Figure 9.11: We have used colors to identify the different days and have kept the sample labels as well. We have also added convex hulls for each day. The group mean is identified as the point with the larger symbol (circle, triangle or square). ► Question Why do we use a threshold for the ranks? ► Solution Figure 9.11 shows that the sample arrange themselves naturally into two different groups according to the day of the samples. After discovering this effect, we delved into the differences that could explain these distinct clusters. There were two different protocols used (protocol 1 on day 1, protocol 2 on day 2) and unfortunately two different provenances for the arrays used on those two days (array 1 on day 1, array 2 on day 2). A third set of data of four samples had to be collected to deconvolve the confounding effect. Array 2 was used with protocol 2 on Day 3, Figure 9.12 shows the new PCA plot with all the samples created by the following: pcaDay123 = rankthreshPCA(assayIBD) fviz(pcaDay123, element="ind", axes=c(1,2), geom=c("point","text"), habillage = day, repel=TRUE, palette = "Dark2", addEllipses = TRUE, ellipse.type = "convex") + ggtitle("") + coord_fixed() ► Question In which situation would it be preferable to make confidence ellipses around the group means using the following code? fviz_pca_ind(pcaDay123, habillage = day, labelsize = 3, palette = "Dark2", addEllipses = TRUE, ellipse.level = 0.69) Figure 9.13: The eigenvalue screeplot the case of 3 groups is extremely similar to that with two groups shown in Figure 9.10. Through this visualization we were able to uncover a flaw in the original experimental design. The first two batches shown in green and brown were both balanced with regards to IBS and healthy rats. They do show very different levels of variability and overall multivariate coordinates. In fact, there are two confounded effects. Both the arrays and protocols were different on those two days. We had to run a third batch of experiments on day 3, represented in purple, this used protocol from day 1 and the arrays from day 2. The third group faithfully overlaps with batch 1, telling us that the change in protocol was responsible for the variability. 9.3.2 Removing batch effects Through the combination of the continuous measurements from assayIBD and the supplementary batch number as a factor, the PCA map has provided an invaluable investigation tool. This is a good example of the use of supplementary points127 This is called a supplementary point because the new observation-point is not used in the matrix decomposition.. The mean-barycenter points are created by using the group-means of points in each of the three groups and serve as extra markers on the plot. We can decide to re-align the three groups by subtracting the group means so that all the batches are centered on the origin. A slightly more effective way is to use the ComBat function available in the sva package. This function uses a similar, but slightly more sophisticated method (Empirical Bayes mixture approach (Leek et al. 2010)). We can see its effect on the data by redoing our robust PCA (see the result in Figure 9.14): model0 = model.matrix(~1, day) combatIBD = ComBat(dat = assayIBD, batch = day, mod = model0) ## Standardizing Data across genes Figure 9.14: The modified data with the batch effects removed now show three batch-groups heavily overlapping and centered almost at the origin. pcaDayBatRM = rankthreshPCA(combatIBD) fviz(pcaDayBatRM, element = "ind", geom = c("point", "text"), habillage = day, repel=TRUE, palette = "Dark2", addEllipses = TRUE, ellipse.type = "convex", axes =c(1,2)) + coord_fixed() + ggtitle("") 9.3.3 Hybrid data and Bioconductor containers A more rational way of combining the batch and treatment information into compartments of a composite object is to use the SummarizedExperiment class. It includes special slots for the assay(s) where rows represent features of interest (e.g., genes, transcripts, exons, etc.) and columns represent samples. Supplementary information about the features can be stored in a DataFrame object, accessible using the function rowData. Each row of the DataFrame provides information on the feature in the corresponding row of the SummarizedExperiment object. A confusing notational similarity occurs here, in the SummarizedExperiment framework a DataFrame is not the same as a data.frame. Here we insert the two covariates day and treatment in the colData object and combine it with assay data in a new SummarizedExperiment object. library("SummarizedExperiment") sampletypes = c("IBS","CTL") status = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) colData = DataFrame(day = day, treatment = factor(sampletypes[status])) chipse = SummarizedExperiment(assays = list(abund = assayIBD), colData = colData) This is the best way to keep all the relevant data together, it will also enable you to quickly filter the data while keeping all the information aligned properly. You can explore composite objects using the tt Environment pane in tt RStudio, you will that some of the slots are empty. ► Question Make a new SummarizedExperiment object by choosing the subset of the samples that were created on day 2. ► Solution Columns of the DataFrame represent different attributes of the features of interest, e.g., gene or transcript IDs, etc. An example of hybrid data container from single cell experiments (see Bioconductor workflow in Perraudeau et al. (2017) for more details). After the pre-processing and normalization steps prescribed in the workflow, we retain the 1,000 most variable genes measured on 747 cells. corese = readRDS("../data/normse.rds") norm = assays(corese)normalizedValues

► Question

How many different batches do the cells belong to ?

► Solution

We can look at a PCA of the normalized values and check graphically that the batch effect has been removed:

Figure 9.15: Screeplot of the PCA of the normalized data.

respca = dudi.pca(t(norm), nf = 3, scannf = FALSE)
plotbar(respca, 15)
PCS = respca$li[, 1:3] We have set up colors for the clusters as in the workflow, (the code is not shown here). Since the screeplot in Figure 9.15 shows us that we must not dissociate axes 2 and 3, we will make a three dimensional plot with the rgl package. We use the following interactive code: library("rgl") batch = colData(corese)$Batch
plot3d(PCS,aspect=sqrt(c(84,24,20)),col=col_clus[batch])
plot3d(PCS,aspect=sqrt(c(84,24,20)),
col = col_clus[as.character(publishedClusters)])

Note: Of course, the book medium is limiting here, as we are showing two static projections that do not do justice to the depth available when looking at the interactive dynamic plots as they appear using the plot3d function. We encourage the reader to experiment extensively with these and other interactive packages and they provide a much more intuitive experience of the data.

9.4 Correspondence analysis for contingency tables

9.4.1 Cross-tabulation and contingency tables

Categorical data abound in biological settings: sequence status (CpG/non-CpG), phenotypes, taxa are often coded as factors as we saw in Chapter 2. Cross-tabulation of two such variables gives us a contingency table; the result of counting the co-occurrence of two phenotypes (sex and colorblindness was such an example). We saw that the first step is to look at the independence of the two categorical variables; the standard statistical measure of independence uses the chisquare distance. This quantity will replace the variance we used for continuous measurements.

The columns and rows of the table have the same status’ and we are not in supervised/regression type setting. We won’t see a sample/variable divide; as a consequence the rows and columns will have the same status and we will ‘center’ both the rows and the columns. This symmetry will also translate in our use of biplots where both dimensions appear on the same plot.

Table 9.1: Sample by mutation matrix.

Patient Mut1 Mut2 Mut3
AHX112 0 0 0
AHX717 1 0 1
AHX543 1 0 0

Transforming the data to tabular form.

If the data are collected as long lists with each subject (or sample) associated to its levels of the categorical variables, we may want to transform them into a contingency table. Here is an example. In table 9.1 HIV mutations are tabulated as indicator (0/1) binary variables. These data are then transformed into a mutation co-occurrence matrix shown in table 9.2.

Table 9.2: Cross-tabulation of the HIV mutations showing two-way co-occurrences.

Patient Mut1 Mut2 Mut3
Mut1 853 29 10
Mut2 29 853 52
Mut3 10 52 853

► Question

What information is lost in this cross-tabulation ?
When will this matter?

Here are some co-occurrence data from the HIV database (Rhee et al. 2003). Some of these mutations have a tendency to co-occur.

► Question

Test the hypothesis of independence of the mutations.

Before explaining the details of how correspondence analysis works, let’s look at the output of one of many correspondence analysis functions. We use dudi.coa from the ade4 package to plot the mutations in a lower dimensional projection; the procedure follows what we did for PCA.

cooc[1:4,1:11]
##     X4S X6D X6K X11R X20R X21I X35I X35L X35M X35T X39A
## 4S    0  28   8    0   99    0   22    5   15    3   45
## 6D   26   0   0   34  131    0  108    4   30   13   84
## 6K    7   0   0    6   45    0    5   13   38   35   12
## 11R   0  35   7    0  127   12   60   17   15    6   42

Figure 9.17: The dependencies between HIV mutations is clearly a three dimensional phenomenon, the three first eigenvalues show a clear signal in the data.

HIVca=dudi.coa(cooc,nf=4,scannf=FALSE)
fviz_eig(HIVca,geom="bar",bar_width=0.6)+ggtitle("")

Figure 9.18: A screenshot of the output from an interactive 3d plotting function (plot3d).

After looking at a screeplot, we see that dimensionality of the underlying variation is definitely three dimensional, we plot these three dimensions. Ideally this would be done with an interactive three-dimensional plotting function such as that provided through the package rgl as shown in Figure 9.18.

► Question

Using the car and rgl packages make 3d scatterplot similar to Figure 9.18.
Compare to the plot obtained using aspect=FALSE with the plot3d function from rgl.
What structure do you notice by rotating the cloud of points?

► Solution

fviz_ca_row(HIVca,axes = c(1, 2),geom="text", col.row="purple",
labelsize=3)+ggtitle("") + xlim(-0.55, 1.7) + ylim(-0.53,1.1) +
theme_bw() +  coord_fixed()
fviz_ca_row(HIVca,axes = c(1, 3), geom="text",col.row="purple",
labelsize=3)+ggtitle("")+ xlim(-0.55, 1.7)+ylim(-0.5,0.6) +
theme_bw() + coord_fixed()

► Question

Show the code for plotting the plane defined by axes 1 and 3 of the correspondence analysis respecting the scaling of the vertical axis as shown in the bottom figure of Figure 9.19.

► Solution

This first example showed how to map all the different levels of one categorical variable (the mutations) in a similar way to how PCA projects continuous variables. We will now explore how this can be extended to two or more categorical variables.

9.4.2 Hair color, eye color and phenotype co-occurrence

We will consider a small table so we can follow the analysis in detail. The data are a contingency table of hair-color and eye-color phenotypic co-occurrence from students as shown in Table 9.3. In Chapter 2 we used a chisquare test of independence that uncovered the existence of possible dependencies:

HairColor = HairEyeColor[,,2]
chisq.test(HairColor)
##
##  Pearson's Chi-squared test
##
## data:  HairColor
## X-squared = 106.66, df = 9, p-value < 2.2e-16

Table 9.3: Cross tabulation of students hair and eye color

Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8

However, stating non independence between hair and eye color is not enough. We need a more detailed explanation of where the dependencies occur: which hair color occurs more often with green eyes ? Are some of the variable levels independent? In fact we can study the departure from independence using a special weighted version of SVD. This method can be understood as a simple extension of PCA and MDS to contingency tables.

Independence: computationally and visually.

We start by computing the row and column sums; we use these to build the table that would be expected if the two phenotypes were independent. We call this expected table HCexp.

rowsums=as.matrix(apply(HairColor,1,sum))
rowsums
##       [,1]
## Black   52
## Brown  143
## Red     37
## Blond   81
colsums=as.matrix(apply(HairColor,2,sum))
t(colsums)
##      Brown Blue Hazel Green
## [1,]   122  114    46    31
HCexp=rowsums%*%t(colsums)/sum(colsums)

Now we compute the $$\chi^2$$ (chi-squared) statistic, which is the sum of the scaled residuals for each of the cells of the table:

sum((HairColor  - HCexp)^2/HCexp)
## [1] 106.6637

We can study these residuals from the expected table, first numerically then in Figure 9.20.

round(t(HairColor-HCexp))
##        Hair
## Eye     Black Brown Red Blond
##   Brown    16    10   2   -28
##   Blue    -10   -18  -6    34
##   Hazel    -3     8   2    -7
##   Green    -3     0   3     0
library("vcd")

Mathematical Formulation.

Here are the computations we just did in R in a more mathematical form. For a general contingency table $${\mathbf N}$$ with $$I$$ rows and $$J$$ columns and a total sample size of $$n=\sum_{i=1}^I \sum_{j=1}^J n_{ij}= n_{\cdot \cdot}$$. If the two categorical variables were independent, each cell frequency would be approximately equal to

$\begin{equation*} n_{ij} = \frac{n_{i \cdot}}{n} \frac{n_{\cdot j}}{n} \times n \end{equation*}$

can also be written:

$\begin{equation*} {\mathbf N} = {\mathbf c r'} \times n, \qquad \mbox{ where } c= \frac{1}{n} {{\mathbf N}} {\mathbb 1}_m \;\mbox{ and }\; r'=\frac{1}{n} {\mathbf N}' {\mathbb 1}_p \end{equation*}$

The departure from independence is measured by the $$\chi^2$$ statistic

$\begin{equation*} {\cal X}^2=\sum_{i,j} \frac{\left(n_{ij}-\frac{n_{i\cdot}}{n}\frac{n_{\cdot j}}{n}n\right)^2} {\frac{n_{i\cdot}n_{\cdot j}}{n^2}n} \end{equation*}$

Correspondece Analysis functions CCA in vegan, CA in FactoMineR, ordinate in phyloseq, dudi.coa in ade4. Once we have ascertained that the two variables are not independent, we use a weighted multidimensional scaling using $$\chi^2$$ distances to visualize the associations.

The method is called Correspondence Analysis (CA) or Dual Scaling and there are multiple R packages that implement it.

Here we make a simple biplot of the Hair and Eye colors.

HC=as.data.frame.matrix(HairColor)
coaHC=dudi.coa(HC,scannf=FALSE,nf=2)
round(coaHC$eig[1:3]/sum(coaHC$eig)*100)
## [1] 89 10  2
fviz_ca_biplot(coaHC,repel=TRUE,col.col="brown", col.row="purple") +
ggtitle("") + ylim(c(-0.5,0.5))

► Question

What percentage of the Chisquare statistic is explained by the first two axes of the Correspondence Analysis?

► Question

Compare the results with those obtained by using CCA in the vegan package with the appropriate value for the scaling parameter.

► Solution

Interpreting the biplots

CA has a special barycentric property, the biplot scaling is chosen so that the row points are placed at the center of gravity of the column levels with their respective weights. For instance the Blue eyes column point is at the center gravity of the (Black, Brown, Red, Blond) with weights proportional to (9,34,7,64). The Blond row point is very heavily weighted, this is why the figure 9.21 shows Blond and Blue quite close together.

9.5 Finding time…and other important gradients.

All the methods we have studied in the last sections are commonly known as ordination methods. In the same way clustering allowed us to detect and interpret a hidden factor/categorical variable, ordination enables us to detect and interpret a hidden ordering, gradient or latent variable in the data.

The first examples of seriation or chronology detection was that of archaelogical artifacts by Kendall (1969), who used presence/absence of features on pottery to date them. These so-called seriation methods are still relevant today as we follow developmental trajectories in single cell data for instance. Ecologists have a long history of interpreting the arches formed by observations points in correspondence analysis and principal components as ecological gradients (Prentice 1977). Let’s illustrate this first with a very simple data set on which we perform a correspondence analysis.

lakelike[1:3,1:8]
##      plant1 plant2 plant3 plant4 plant5 plant6 plant7 plant8
## loc1      6      4      0      3      0      0      0      0
## loc2      4      5      5      3      4      2      0      0
## loc3      3      4      7      4      5      2      1      1
reslake=dudi.coa(lakelike,scannf=FALSE,nf=2)
round(reslake$eig[1:8]/sum(reslake$eig),2)
## [1] 0.56 0.25 0.09 0.03 0.03 0.02 0.01 0.00
fviz_ca_row(reslake,repel=TRUE)+ggtitle("")+ylim(c(-0.55,1.7))
fviz_ca_biplot(reslake,repel=TRUE)+ggtitle("")+ylim(c(-0.55,1.7))

We plot both the row-location points (in the upper plot of Figure 9.22) and the biplot of both location and plant species in the lower par of Figure 9.22; this plot was made with:

► Question

Looking back at the raw matrix lakes as it appears, do you see a pattern in its entries?
What would happen if the plants had been ordered by actual taxa names for instance?

9.5.1 Dynamics of cell development

We will now analyse a more interesting data set that was published by Moignard et al. (2015). This paper describes the dynamics of blood cell development. The data are single cell gene expression measurements of 3,934 cells with blood and endothelial potential from five populations from between embryonic days E7.0 and E8.25.

Figure 9.23: The four cell populations studied here are representative of three sequential states (PS,NP,HF) and two possible final branches (4SG and 4SFG$$^{-}$$).

Remember from Chapter 4 that several different distances are available for comparing our cells. Here, we start by computing both an $$L_2$$ distance and the $$\ell_1$$ distance between the 3,934 cells.

cellt = rowData(Moignard)$celltypes colsn = c("red", "purple", "orange", "green", "blue") blom = assay(Moignard) dist2n.euclid = dist(blom) dist1n.l1 = dist(blom, "manhattan") The classical multidimensional scaling on these two distances matrices can be carried out using: ce1Mds = cmdscale(dist1n.l1, k = 20, eig = TRUE) ce2Mds = cmdscale(dist2n.euclid, k = 20, eig = TRUE) perc1 = round(100*sum(ce1Mds$eig[1:2])/sum(ce1Mds$eig)) perc2 = round(100*sum(ce2Mds$eig[1:2])/sum(ce2Mds$eig)) We look at the underlying dimension and see in Figure 9.24 that two dimensions can provide a substantial fraction of the variance. Figure 9.24: Screeplots from MDS on $$\ell_1$$ (left) and $$L_2$$ (right) distances. We see that the eigenvalues are extremely similar and both point to a $$2$$ dimensional phenomenon. plotbar(ce1Mds, m = 4) plotbar(ce2Mds, m = 4) The first 2 coordinates account for 78 % of the variability when the $$\ell_1$$ distance is used between cells, and 57% when the $$L^2$$ distance is used. We see in Figure 9.25A the first plane for the MDS on the $$\ell_1$$ distances between cells: c1mds = as_tibble(ce1Mds$points[, 1:2]) %>%
setNames(paste0("L1_PCo", 1:2))
ggplot(c1mds,aes(x = L1_PCo1,y = L1_PCo2, color = cellt)) +
geom_point(aes(color = cellt), alpha = 0.6) +
scale_colour_manual(values = colsn) + guides(color = FALSE)
c2mds = as_tibble(ce1Mds$points[, 1:2]) %>% setNames(paste0("L2_PCo", 1:2)) ggplot(c2mds,aes(x=L2_PCo1,y=L2_PCo2, color = cellt)) + geom_point(aes(color = cellt), alpha = 0.6) + scale_colour_manual(values = colsn) + guides(color = FALSE) Figure 9.25B is created in the same way and shows the two-dimensional projection created by using MDS on the L2 distances. Figure 9.25 shows that the both distances ($$\ell_1$$ and L2) give the same first plane for the MDS with very similar representations of the underlying gradient followed by the cells. We can see from Figure 9.25 that the cells are not distributed uniformly in the lower dimensions we have been looking at, we see a definite organization of the points. All the cells of type $$4SG$$ represented in red form an elongated cluster who are much less mixed with the other cell types. 9.5.2 Local, nonlinear methods Multidimensional scaling and non metric multidimensional scaling aims to represent all distances as precisely as possible and the large distances between far away points skew the representations. It can be beneficial when looking for gradients or low dimensional manifolds to restrict ourselves to approximations of points that are close together. This calls for methods that try to represent local (small) distances well and do not try to approximate distances between faraway points with too much accuracy. There has been substantial progress in such methods in recent years. The use of kernels computed using the calculated interpoint distances allows us to decrease the importance of points that are far apart. A radial basis kernel is of the form $\begin{equation*} 1-\exp\left(-\frac{d(x,y)^2}{\sigma^2}\right),\quad\mbox{where } \sigma^2 \mbox{ is fixed.} \end{equation*}$ It has the effect of heavily discounting large distances. This can be very useful as the precision of interpoint distances is often better at smaller ranges; several examples of such methods are covered in Exercise 9.6 at the end of this chapter. ► Question Why do we take the difference between the 1 and the exponential? What happens when the distance between $$x$$ and $$y$$ is very big? t-SNE. This widely used method adds flexibility to the kernel defined above and allows the $$\sigma^2$$ parameter to vary locally (there is a normalization step so that it averages to one). The t-SNE method starts out from the positions of the points in the high dimensional space and derives a probability distribution on the set of pairs of points, such that the probabilities are proportional to the points’ proximities or similarities. It then uses this distribution to construct a representation of the dataset in low dimensions. The method is not robust and has the property of separating clusters of points artificially; however, this property can also help clarify a complex situation. One can think of it as a method akin to graph (or network) layout algorithms. They stretch the data to clarify relations between the very close (in the network: connected) points, but the distances between more distal (in the network: unconnected) points cannot be interpreted as being on the same scales in different regions of the plot. In particular, these distances will depend on the local point densities. Here is an example of the output of t-SNE on the cell data: library("Rtsne") restsne = Rtsne(blom, dims = 2, perplexity = 30, verbose = FALSE, max_iter = 900) dftsne = as_tibble(restsne$Y[, 1:2]) %>%
setNames(paste0("taxis", 1:2))
ggplot(dftsne,aes(x = taxis1, y = taxis2, color = cellt)) +
geom_point(aes(color = cellt), alpha = 0.6) +
scale_color_manual(values = colsn) + guides(color = FALSE)
restsne3 = Rtsne(blom, dims = 3, perplexity = 30, verbose = FALSE,
max_iter = 900)
dftsne3 = as_tibble(restsne3Y[, 1:3]) %>% setNames(paste0("taxis", 1:3)) ggplot(dftsne3,aes(x = taxis3, y = taxis2, group = cellt)) + geom_point(aes(color = cellt), alpha = 0.6) + scale_colour_manual(values = colsn) + guides(color = FALSE) In this case in order to see the subtle differences between MDS and t-SNE, it is really necessary to use 3d plotting. First we generate the 3d t-SNE representation with: ► Task Use the rgl package to look at the three t-SNE dimensions and add the correct cell type colors to the display. Two of these 3d snapshots are shown in Figure 9.27, we see a much stronger grouping of the purple points than in the MDS plots. Note: A site worth visiting in order to appreciate more about the sensitivity of the t-SNE method to the complexity and $$\sigma$$ parameters can be found at http://distill.pub/2016/misread-tsne. There are several other nonlinear methods for estimating nonlinear trajectories followed by points in the relevant state spaces. Here are a few examples. ****RDRToolbox**** Local linear embedding (LLE) and isomap are methods that solve this problem. ****diffusionMap**** This package models connections between points as a Markovian kernel. ****kernlab**** Kernel methods. ****LPCM-package**** Local principal curves. 9.6 Multitable techniques Current studies often attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features – microbes, genes, or metabolites, for example – are associated with different sample conditions. There are many ways to approach these questions. Which to apply depends on the study’s focus. 9.6.1 Co-variation, inertia, co-inertia and the RV coefficient As in physics, we define inertia as a sum of distances with ‘weighted’ points. This enables us to compute the inertia of counts in a contingency table as the weighted sum of the squares of distances between observed and expected frequencies (as in the chisquare statistic). Another generalization of variance-inertia is the useful Phylogenetic diversity index. (computing the sum of distances between a subset of taxa through the tree). Other useful generalizations include using variability of points on a graph taken from standard spatial statistics. If we want to study two standardized variables measured at the same 10 locations together, we use their covariance. If $$x$$ represents the standardized PH, and and $$y$$ the standardized humidity, we measure their covariation using the mean $$$\text{cov}(x,y) = \text{mean}(x1*y1 + x2*y2 + x3*y3 + \cdots + x10*y10). \tag{9.2}$$$ If $$x$$ and $$y$$ co-vary in the same direction, this will be big. We saw how useful the correlation coefficient we defined in Chapter 8 was to our multivariate analyses. Multitable generalizations will be just as useful. 9.6.2 Mantel coefficient and a test of distance correlation The Mantel coefficient, one of the earliest version of matrix association, developed and used by Henry Daniels, FN David and co-authors (Josse and Holmes 2016) is very popular, especially in ecology. There are some precautions to be taken when using the Mantel coefficient, see a critical review in Guillot and Rousset (2013). Given two dissimilarity matrices $${\mathbf D^X}$$ and $${\mathbf D^Y}$$, make these matrices into vectors the way the R dist function does, and compute their correlation. This is defined mathematically as: \begin{equation*} \begin{aligned} \mbox{r}_{\rm m}(\bf X,\bf Y)= \frac{\sum_{i=1}^n \sum_{j=1,j \neq i}^n (d_{ij}^{\bf X}- \bar{d}^{\bf X})(d_{ij}^{\bf Y}-\bar d^{\bf Y})}{\sqrt{\sum_{i,j,j \neq i}(d_{ij}^{\bf X}- \bar{d}^{\bf X})^2 \sum_{i,j,j \neq i}(d_{ij}^{\bf Y}-\bar{d}^{\bf Y})^2}}, \nonumber\end{aligned} \end{equation*} with $$\bar{d}^{\bf X}$$ (resp $$\bar{d}^{\bf Y}$$) the mean of the lower diagonal terms of the dissimilarity matrix associated to $$\bf X$$ (resp. to $$\bf Y$$). This formulation shows us that it is a measure of linear correlation between distances. It has been widely used for testing two sets of distances, for instance one distance $${\mathbf D^X}$$ could be computed using the soil chemistry at 17 differnt locations. The other distance $${\mathbf D^Y}$$ coul record plant abundance dissimilarities using a Jaccard index between the 17 same locations. The correlation’s significance is often assessed via a simple permutation test, see Josse and Holmes (2016) for a review with its historical background and modern incarnations. The coefficient and associated tests are implemented in several R packages such as ade4 (Chessel, Dufour, and Thioulouse 2004), vegan and ecodist (Goslee, Urban, and others 2007). RV coefficient The global measure of similarity of two data tables as opposed to two vectors can be done by a generalization of covariance provided by an inner product between tables that gives the RV coefficient, a number between 0 and 1, like a correlation coefficient, but for tables. $\begin{equation*} RV(A,B)=\frac{Tr(A'BAB')}{\sqrt{Tr(A'A)}\sqrt{Tr(B'B)}} \end{equation*}$ There are several other measures of matrix correlation available in the package MatrixCorrelation. If we do ascertain a link between two matrices, we then need to find a way to understand that link. One such method is explained in the next section. 9.6.3 Canonical correlation analysis (CCA) CCA is a method similar to PCA as it was developed by Hotelling in the 1930s to search for associations between two sets of continuous variables $$X$$ and $$Y$$. Its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set of variables. Finding correlated functions (covariates) of the two views of the same phenomenon by discarding the representation-specific details (noise) is expected to reveal the underlying hidden yet influential factors responsible for the correlation. Canonical Correlation Algorithm. Let us consider two matrices $$X$$ and $$Y$$ of order $$n > p$$ and $$n > q$$ respectively. The columns of $$X$$ and $$Y$$ correspond to variables and the rows correspond to the same $$n$$ experimental units. The jth column of the matrix $$X$$ is denoted by $$X_j$$, likewise the kth column of $$Y$$ is denoted by $$Y_k$$. Without loss of generality it will be assumed that the columns of $$X$$ and $$Y$$ are standardized (mean 0 and variance 1). We denote by $$S_{XX}$$ and $$S_{YY}$$ the sample covariance matrices for variable sets $$X$$ and $$Y$$ respectively, and by $$S_{XY} = S_{YX}^t$$ the sample cross-covariance matrix between $$X$$ and $$Y$$ . Classical CCA assumes first $$p \leq n$$ and $$q \leq n$$, then that matrices $$X$$ and $$Y$$ are of full column rank p and q respectively. In the following, the principle of CCA is presented as a problem solved through an iterative algorithm. The first stage of CCA consists in finding two vectors $$a =(a_1,...,a_p)^t$$ and $$b =(b_1,...,b_q)^t$$ that maximize the correlation between the linear combinations $$U$$ and $$V$$ defined as \begin{equation*} \begin{aligned} U=Xa&=&a_1 X_1 +a_2 X_2 +\cdots a_p X_p\\ V=Yb&=&b_1 Y_1 +b_2 Y_2 +\cdots a_q Y_q\end{aligned} \end{equation*} and assuming that vectors $$a$$ and b$are normalized so that $$var(U) = var(V) = 1$$ In other words, the problem consists in finding $$a$$ and $$b$$ such that $$$\rho_1 = \text{cor}(U, V) = \max_{a,b} \text{cor} (Xa, Yb)\quad \text{subject to}\quad \text{var}(Xa)=\text{var}(Yb) = 1. \tag{9.3}$$$ The resulting variables $$U$$ and $$V$$ are called the first canonical variates and $$\rho_1$$ is referred to as the first canonical correlation. Note: Higher order canonical variates and canonical correlations can be found as a stepwise problem. For $$s = 1,...,p$$, we can successively find positive correlations $$\rho_1 \geq \rho_2 \geq ... \geq \rho_p$$ with corresponding vectors $$(a^1, b^1), ..., (a^p, b^p)$$, by maximizing $$$\rho_s = \text{cor}(U^s,V^s) = \max_{a^s,b^s} \text{cor} (Xa^s,Yb^s)\quad \text{subject to}\quad \text{var}(Xa^s) = \text{var}(Yb^s)=1 \tag{9.4}$$$ under the additional restrictions $$$\text{cor}(U^s,U^t) = \text{cor}(V^s, V^t)=0 \quad\text{for}\quad 1 \leq t < s \leq p. \tag{9.5}$$$ We can think of CCA as a generalization of PCA where the variance we maximize is the covariance’ between the two matrices (see Holmes (2006) for more details). 9.6.4 Sparse canonical correlation analysis (sCCA) When the number of variables in each table is very large finding two very correlated vectors can be too easy and unstable: we have too many degrees of freedom. We will see many examples of regularization and danger of overfitting in Chapter 12. Then it is beneficial to add a penalty maintains the number of non-zero coefficients to a minimum. This approach is called sparse canonical correlation analysis (sparse CCA or sCCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting covariation. We will use an implementation from the PMA package. Here we study a dataset collected by Kashyap et al. (2013) with two tables. One is a contingency table of bacterial abundances and another an abundance table of metabolites. There are 12 samples, so $$n = 12$$. The metabolite table has measurements on $$p = 637$$ feature and the bacterial abundances had a total of$ q = 20,609$OTUs, which we will filter down to around 200. We start by loading the data. mb_path = "../data/metabolites.csv" library("genefilter") load("../data/microbe.rda") metab = read.csv(mb_path, row.names = 1) %>% as.matrix We first filter down to bacteria and metabolites of interest, removing (“by hand”) those that are zero across many samples and giving an upper threshold of 50 to the large values. We transform the data to weaken the heavy tails. metab = metab[rowSums(metab == 0) <= 3, ] microbe = prune_taxa(taxa_sums(microbe) > 4, microbe) microbe = filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE) metab = log(1 + metab, base = 10) X = as.matrix(otu_table(microbe)) X = log(1 + X, base=10) A second step in our preliminary analysis is to look if there is any association between the two matrices using the RV.test from the ade4 package: colnames(metab)=colnames(X) pca1 = dudi.pca(t(metab), scal = TRUE, scann = FALSE) pca2 = dudi.pca(t(X), scal = TRUE, scann = FALSE) rv1 = RV.rtest(pca1$tab, pca2$tab, 999) rv1 ## Monte-Carlo test ## Call: RV.rtest(df1 = pca1$tab, df2 = pca2$tab, nrepet = 999) ## ## Observation: 0.8400429 ## ## Based on 999 replicates ## Simulated p-value: 0.003 ## Alternative hypothesis: greater ## ## Std.Obs Expectation Variance ## 5.822569490 0.311286561 0.008246732 We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables. We then apply PCA to this selected subset of features. In this sense, we use sparse CCA as a screening procedure, rather than as an ordination method. The implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Smaller values of penaltyx will result in fewer selected microbes, similarly penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide. library("PMA") ccaRes = CCA(t(X), t(metab), penaltyx = 0.15, penaltyz = 0.15) ## 123456789 ccaRes ## Call: CCA(x = t(X), z = t(metab), penaltyx = 0.15, penaltyz = 0.15) ## ## ## Num non-zeros u's: 5 ## Num non-zeros v's: 16 ## Type of x: standard ## Type of z: standard ## Penalty for x: L1 bound is 0.15 ## Penalty for z: L1 bound is 0.15 ## Cor(Xu,Zv): 0.9904707 With these parameters, 5 bacteria and 16 metabolites were selected based on their ability to explain covariation between tables. Further, these features result in a correlation of 0.99 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the “true” features in any sense. Nonetheless, we can still use these 21 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA. We have omitted the code we used to generate Figure 9.28, we refer the reader to the online material accompanying the book or the workflow published in Ben J Callahan et al. (2016). Figure 9.28 displays the PCA triplot, where we show different types of samples and the multidomain features (Metabolites and OTUs). This allows comparison across the measured samples – triangles for knockout and circles for wild type –and characterizes the influence the different features – diamonds with text labels. For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features. 9.6.5 Canonical (or constrained) correspondence analysis (CCpnA) Notational overload for CCA: Originally invented by Braak (1985) and called Canonical Correspondence analysis, we will call this method Constrained Correspondence Analysis and abbreviate it CCpnA to avoid confusion with Canonical Correlation Analysis (CCA). However several R packages, such as ade4 and vegan use the name cca for their correspondence analyses function. The term constrained correspondence analysis translates the fact that this method is similar to a constrained regression. The method attempts to force the latent variables to be correlated with the environmental variables provided as `explanatory’. CCpnA creates biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics. In contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, it ensures that the resulting CCpnA directions lie in the span of the environmental variables. For thorough explanations see Braak (1985; Greenacre 2007). This method can be run using the function ordinate in phyloseq. In order to use the covariates from the sample data, we provide an extra argument, specifying which of the features to consider. Here, we take the data we denoised using dada2 in Chapter 4. We will see more details about creating the phyloseq object in Chapter 10. For the time being, we use the otu_table component containing a contingency table of counts for different taxa. We would like to compute the constrained correspondence analyses that explain the taxa abundances by the age and family relationship (both variables are contained in the sample_data slot of the ps1 object). We would like to make two dimensional plots showing only using the four most abundant taxa (making the biplot easier to read): ps1=readRDS("../data/ps1.rds") ps1p=filter_taxa(ps1, function(x) sum(x) > 0, TRUE) psCCpnA = ordinate(ps1p, "CCA", formula = ps1p ~ ageBin + family_relationship) To access the positions for the biplot, we can use the scores function in the vegan. Further, to facilitate figure annotation, we also join the site scores with the environmental data in the sample_data slot. Of the 23 total taxonomic orders, we only explicitly annotate the four most abundant – this makes the biplot easier to read. evalProp = 100 * psCCpnA$CCA$eig[1:2] / sum(psCCpnA$CA\$eig)
ggplot() +
geom_point(data = sites,aes(x =CCA2, y =CCA1),shape =2,alpha=0.5) +
geom_point(data = species,aes(x =CCA2,y =CCA1,col = Order),size=1)+
geom_text_repel(data = species %>% filter(CCA2 < -2),
aes(x = CCA2, y = CCA1, label = otu_id),
size = 2, segment.size = 0.1) +
facet_grid(. ~ ageBin) +
guides(col = guide_legend(override.aes = list(size = 2))) +
labs(x = sprintf("Axis2 [%s%% variance]", round(evalProp[2])),
y = sprintf("Axis1 [%s%% variance]", round(evalProp[1]))) +
scale_color_brewer(palette = "Set1") + theme(legend.position="bottom")

Look up the extra code for creating the tax and species objects in the online resources accompanying the book. Then make the analogue of Figure 9.29 but using litter as the faceting variable.

Figures 9.29 and 9.30, show the plots of these annotated scores, splitting sites by their age bin and litter membership, respectively. Not that to keep the appropriate aspect ratio in the presence of facetting, we have taken the vertical axis as our first canonical component. We have labeled individual bacteria that are outliers along the second CCpnA direction.

Evidently, the first CCpnA direction distinguishes between mice in the two main age bins. Circles on the left and right of the biplot represent bacteria that are characteristic of younger and older mice, respectively. The second CCpnA direction splits off the few mice in the oldest age group; it also partially distinguishes between the two litters. These samples low in the second CCpnA direction have more of the outlier bacteria than the others.

This CCpnA analysis supports the conclusion that the main difference between the microbiome communities of the different mice lies along the age axis. However, in situations where the influence of environmental variables is not so strong, CCA can have more power in detecting such associations. In general, it can be applied whenever it is desirable to incorporate supplemental data, but in a way that (1) is less aggressive than supervised methods, and (2) can use several environmental variables at once.

9.7 Summary of this chapter

Heterogeneous data A mixture of many continuous and a few categorical variables can be handled by adding the categorical variables as supplementary information to the PCA. This is done by projecting the mean of all points in a group onto the map.

Using distances Relations between data objects can often be summarized as interpoint distances (whether distances between trees, images, graphs, or other complex objects).

Ordination A useful representation of these distances is available through a method similar to PCA called multidimensional scaling (MDS), otherwise known as PCoA (principal coordinate analysis). It can be helpful to think of the outcome of these analyses as uncovering latent variable. In the case of clustering the latent variables are categorical, in ordination they are latent variables like time or environmental gradients like distance to the water. This is why these methods are often called ordination.

Robust versions can be used when interpoint distances are wildly different. NMDS (nonmetric multidimensional scaling) aims to produce coordinates such that the order of the interpoint distances is respected as closely as possible.

Correspondence analysis: a method for computing low dimensional projections that explain dependencies in categorical data. It decomposes chisquare distance in much the same way that PCA decomposes variance. Correspondence analysis is usually the best way to follow up on a significant chisquare test. Once we have ascertained there are significant dependencies between different levels of categories, we can map them and interpret proximities on this map using plots and biplots.

Permutation test for distances Given two sets of distances between the same points, we can measure whether they are related using the Mantel permutation test.

Generalizations of variance and covariance When dealing with more than one matrix of measurements on the same data, we can generalize the notion of covariance and correlations to vectorial measurements of co-inertia.

Canonical correlation is a method for finding a few linear combinations of variables from each table that are as correlated as possible. When using this method on matrices with large numbers of variables, we use a regularized version with an L1 penalty that reduces the number of non-zero coefficients.

Interpretation of PCoA maps and nonlinear embeddings can also be enhanced the way we did for PCA using generalizations of the supplementary point method, see Trosset and Priebe (2008) or Bengio et al. (2004). We saw in chapter 7 how we can project one categorical variable onto a PCA. The correspondence analysis framework actually allows us to mix several categorical variables in with any number of continuous variables. This is done through an extension called multiple correspondence analysis (MCA) whereby we can do the same analysis on a large number of binary categorical variables and obtain useful maps. The trick here will be to turn the continuous variables into categorical variables first. For extensive examples using R see for instance the book by Pagès (2016).

A simple extension to PCA that allows for nonlinear principal curve estimates instead of principal directions defined by eigenvectors was proposed in Hastie and Stuetzle (1989) and is available in the package princurve.

Finding curved subspaces containing a high density data for dimensions higher than $$1$$ is now called manifold embedding and can be done through Laplacian eigenmaps (Belkin and Niyogi 2003), local linear embedding as in Roweis and Saul (2000) or using the isomap method (Tenenbaum, De Silva, and Langford 2000). For textbooks covering nonlinear unsupervised learning methods see Hastie, Tibshirani, and Friedman (2008 Chapter 14) or Izenman (2008).

A review of many multitable correlation coefficients, and analysis of applications can be found in Josse and Holmes (2016).

9.9 Exercises

► Exercise 9.1

We are going to take another look at the Phylochip data, replacing the original expression values by presence/absence. We threshold the data to retain only those that have a value of at least 8.633 in at least 8 samples128 These values were chosen to give about retain about 3,000 taxa, similar to our previous choice of threshold..

ibd.pres = ifelse(assayIBD[, 1:28] > 8.633, 1, 0)

Perform a correspondence analysis on these binary data and compare the plot you obtain to what we saw in Figure 9.12.

► Exercise 9.2

Correspondence Analysis on color association tables:
Here is an example of data collected by looking at the number of Google hits resulting from queries of pairs of words. The numbers in Table 9.4 are to be multiplied by 1000. For instance, the combination of the words “quiet” and “blue” returned 2,150,000 hits. Perform a correspondence analysis of these data. What do you notice when you look at the two-dimensional biplot?

Table 9.4: Contingency table of co-occurring terms from search engine results.

black blue green grey orange purple white
quiet 2770 2150 2140 875 1220 821 2510
angry 2970 1530 1740 752 1040 710 1730
clever 1650 1270 1320 495 693 416 1420
depressed 1480 957 983 147 330 102 1270
happy 19300 8310 8730 1920 4220 2610 9150
lively 1840 1250 1350 659 621 488 1480
perplexed 110 71 80 19 23 15 109
virtuous 179 80 102 20 25 17 165
► Exercise 9.3

The dates Plato wrote his various books are not known. We take the sentence endings and use those pattern frequencies as the data.

platof[1:4, ]
##       Rep Laws Crit Phil Pol Soph Tim
## uuuuu  42   91    5   24  13   26  18
## -uuuu  60  144    3   27  19   33  30
## u-uuu  64   72    3   20  24   31  46
## uu-uu  72   98    2   25  20   24  14

Figure 9.33: Biplot of Plato’s sentence endings.

resPlato = dudi.coa(platof, scannf = FALSE, nf = 2)
fviz_ca_biplot(resPlato, axes=c(2, 1)) + ggtitle("")
fviz_eig(resPlato, geom = "bar", width = 0.6) + ggtitle("")

9.3a From the biplot in Figure 9.33 can you guess at the chronological order of Plato’s works?
Hint: the first (earliest) is known to be Republica. The last (latest) is known to be Laws.

9.3b Which sentence ending did Plato use more frequently early in his life? 9.3c What percentage of the inertia ($$\chi^2$$-distance) is explained by the map in Figure 9.33?

► Exercise 9.4

We are going to look at two datasets, one is a perturbed version of the other and they both present gradients as often seen in ecological data. Read in the two species count matrices lakelike and lakelikeh, which are stored as the object lakes.RData. Compare the output of correspondence analysis and principal component analysis on each of the two data sets; restrict yourself two dimensions. In the plots and the eigenvalues, what do you notice?

► Exercise 9.5

We analyzed the normalized Moignard data in Section 9.5.1. Now redo the analysis with the raw data (in file nbt.3154-S3-raw.csv) and compare the output with that obtained using the normalized values.

► Exercise 9.6

We are going to explore the use of kernel methods. 9.6a Compute kernelized distances using the kernlab for the Moignard data using various values for the sigma tuning parameter in the definition of the kernels. Then perform MDS on these kernelized distances. What difference is there in variability explained by the first four components of kernel multidimensional scaling?

9.6b Make interactive three dimensional representations of the components: is there a projection where you see a branch for the purple points?

► Exercise 9.7

Higher resolution study of cell data.
Take the original expression data blom we generated in Section 9.5.1. Map the intensity of expression of each of the top 10 most variable genes onto the 3d plot made with the diffusion mapping. Which dimension, or which one of the principal coordinates (1,2,3,4) can be seen as the one that clusters the 4SG (red) points the most?

► Exercise 9.8

Here we explore more refined distances and diffusion maps that can show cell development trajectories as in Figure 9.35.

Figure 9.35: Ouput from a three-dimensional diffusion map projection.

The diffusion map method restricts the estimation of distances to local points, thus further pursuing the idea that often only local distances should be represented precisely and as points become further apart they are not being measured with the same ‘reference’. This method also uses the distances as input but then creates local probabilistic transitions as indicators of similarity, these are combined into an affinity matrix for which the eigenvalues and eigenvectors are also computed much like in standard MDS.

Compare the output of the diffuse function from the diffusionMap package on both the l1 and l2 distances computed between the cells available in the dist2n.euclid and dist1n.l1 objects from section 9.5.1.

References

Belkin, Mikhail, and Partha Niyogi. 2003. “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.” Neural Computation 15 (6). MIT Press: 1373–96.

Bengio, Yoshua, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and Marie Ouimet. 2004. “Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering.” Advances in Neural Information Processing Systems 16. Cambridge, MA, USA: MIT Press: 177–84.

Braak, Cajo ter. 1985. “Correspondence Analysis of Incidence and Abundance Data: Properties in Terms of a Unimodal Respose.” Biometrics 41 (January).

Brodie, Eoin L, Todd Z DeSantis, Dominique C Joyner, Seung M Baek, Joern T Larsen, Gary L Andersen, Terry C Hazen, et al. 2006. “Application of a High-Density Oligonucleotide Microarray Approach to Study Bacterial Population Dynamics During Uranium Reduction and Reoxidation.” Applied and Environmental Microbiology 72 (9). Am Soc Microbiol: 6288–98.

Callahan, Ben J, Kris Sankaran, Julia A Fukuyama, Paul J McMurdie, and Susan P Holmes. 2016. “Bioconductor Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses.” F1000Research 5. Faculty of 1000 Ltd.

Chessel, Daniel, Anne Dufour, and Jean Thioulouse. 2004. “The ade4 Package - I: One-Table Methods.” R News 4 (1): 5–10. http://CRAN.R-project.org/doc/Rnews/.

Diaconis, Persi, Sharad Goel, and Susan Holmes. 2007. “Horseshoes in Multidimensional Scaling and Kernel Methods.” Annals of Applied Statistics.

Ekman, Gosta. 1954. “Dimensions of Color Vision.” The Journal of Psychology 38 (2). Taylor & Francis: 467–74.

Goslee, Sarah C, Dean L Urban, and others. 2007. “The Ecodist Package for Dissimilarity-Based Analysis of Ecological Data.” Journal of Statistical Software 22 (7): 1–19.

Greenacre, Michael J. 2007. Correspondence Analysis in Practice. Chapman & Hall.

Guillot, Gilles, and François Rousset. 2013. “Dismantling the Mantel Tests.” Methods in Ecology and Evolution 4 (4). Wiley Online Library: 336–44.

Hastie, Trevor, and Werner Stuetzle. 1989. “Principal Curves.” Journal of the American Statistical Association 84 (406). Taylor & Francis Group: 502–16.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2008. The Elements of Statistical Learning. $$2^{\text{nd}}$$. Springer.

Holmes, Susan. 2006. “Multivariate Analysis: The French way.” In Probability and Statistics: Essays in Honor of David a. Freedman, edited by D. Nolan and T. P. Speed. Vol. 56. IMS Lecture Notes–Monograph Series. Beachwood, OH: IMS. http://www.imstat.org/publications/lecnotes.htm.

Holmes, Susan, Alexander V Alekseyenko, Alden Timme, Tyrrell Nelson, Pankaj Jay Pasricha, and Alfred Spormann. 2011. “Visualization and Statistical Comparisons of Microbial Communities Using R Packages on Phylochip Data.” In Pacific Symposium on Biocomputing, 142–53. World Scientific.

Izenman, Alan Julian. 2008. “Nonlinear Dimensionality Reduction and Manifold Learning.” In Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, 597–632. New York, NY: Springer New York.

Josse, Julie, and Susan Holmes. 2016. “Measuring Multivariate Association and Beyond.” Statistics Surveys 10. IMS: 132–67.

Kashyap, Purna C, Angela Marcobal, Luke K Ursell, Samuel A Smits, Erica D Sonnenburg, Elizabeth K Costello, Steven K Higginbottom, et al. 2013. “Genetically Dictated Change in Host Mucus Carbohydrate Landscape Exerts a Diet-Dependent Effect on the Gut Microbiota.” PNAS 110 (42). National Acad. Sciences: 17059–64.

Kendall, David. 1969. “Incidence Matrices, Interval Graphs and Seriation in Archeology.” Pacific Journal of Mathematics 28 (3): 565–70.

Leek, Jeffrey T, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry. 2010. “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data.” Nature Reviews Genetics 11 (10): 733–39.

Moignard, Victoria, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka, Adam C Wilkinson, Florian Buettner, et al. 2015. “Decoding the Regulatory Network of Early Blood Development from Single-Cell Gene Expression Measurements.” Nature Biotechnology. Nature Publishing Group.

Nelson, Tyrell A, Susan Holmes, Alexander Alekseyenko, Masha Shenoy, Todd DeSantis, Cindy Wu, Gary Andersen, et al. 2010. “PhyloChip Microarray Analysis Reveals Altered Gastrointestinal Microbial Communities in a Rat Model of Colonic Hypersensitivity.” Neurogastroenterology & Motility. Wiley Online Library.

Pagès, Jérôme. 2016. Multiple Factor Analysis by Example Using R. CRC Press.

Perraudeau, Fanny, Davide Risso, Kelly Street, Elizabeth Purdom, and Sandrine Dudoit. 2017. “Bioconductor Workflow for Single-Cell RNA Sequencing: Normalization, Dimensionality Reduction, Clustering, and Lineage Inference.” F1000Research 6.

Prentice, IC. 1977. “Non-Metric Ordination Methods in Ecology.” The Journal of Ecology. JSTOR, 85–94.

Rhee, Soo-Yon, Matthew J Gonzales, Rami Kantor, Bradley J Betts, Jaideep Ravela, and Robert W Shafer. 2003. “Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database.” Nucleic Acids Research 31 (1). Oxford Univ Press: 298–303.

Roweis, Sam T, and Lawrence K Saul. 2000. “Nonlinear Dimensionality Reduction by Locally Linear Embedding.” Science 290 (5500). American Association for the Advancement of Science: 2323–6.

Tenenbaum, Joshua B, Vin De Silva, and John C Langford. 2000. “A Global Geometric Framework for Nonlinear Dimensionality Reduction.” Science 290 (5500). American Association for the Advancement of Science: 2319–23.

Trosset, Michael W, and Carey E Priebe. 2008. “The Out-of-Sample Problem for Classical Multidimensional Scaling.” Computational Statistics & Data Analysis 52 (10). Elsevier: 4635–42.

Page built: 2019-05-15