Goal

In this lab we will learn the basics of the analysis of microbiome data with a special emphasis on dimension reduction.

Packages

Install packages.

biocpkgs_needed = c("ggplot2","genefilter", "impute",
                "phyloseq")
pkgs_needed = c("PMA","dplyr","ade4","ggrepel","vegan")
letsinstall = setdiff(pkgs_needed, installed.packages()) 
letsinstallb = setdiff(biocpkgs_needed, installed.packages()) 
if (length(letsinstallb) > 0) {
  source("http://bioconductor.org/biocLite.R")
  biocLite(letsinstallb)
}

if (length(letsinstall) > 0) {
  install.packages(letsinstall)
}

Load packages.

library("ggplot2")
library("PMA")
library("dplyr")
library("ade4")
library("genefilter")
library("ggrepel")
library("phyloseq")
library("vegan")

Microbiome analysis

Data import

The data we will analyze in the first part of the lab corresponds to 360 fecal samples which were collected from 12 mice longitudinally over the first year of life, to investigate the development and stabilization of the murine microbiome. Let’s (down)load the dataset:

download.file("https://cdn.rawgit.com/spholmes/F1000_workflow/891463f6/data/ps.rds",
              "ps.rds",
              mode="wb")
ps = readRDS("ps.rds")

The ps object is of class phyloseq from the package phyloseq.

class(ps)
## [1] "phyloseq"
## attr(,"package")
## [1] "phyloseq"

As has been mentioned before, the phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data. Take some time to explore the object, before we start doing statistical analyses:

Question 1: How many slots does the ps object have?

Question 2: How many distinct Phyla (such as “Bacteroidetes”) have been identified in this dataset? (Hint: Look up the tax_table function and then inspect its output. Also make sure to ignore potential NA values!)

tax_table(ps)[,"Phylum"] %>% unique %>% na.exclude %>% length
## [1] 10

Question 3: In total, does this dataset include more measurements for female or male mice (Hint: Look up the documentation of sample_data)?

table(sample_data(ps)$sex)
## 
##   F   M 
## 173 187

Question 4: How many unique female mice were part of this experiment?

sample_data(ps) %>% group_by(sex) %>% dplyr::summarize(n = length(unique(host_subject_id)))
## # A tibble: 2 x 2
##     sex     n
##   <chr> <int>
## 1     F     6
## 2     M     6

Preprocessing

Before doing the multivariate projections, we will do some basic preprocessing. First remove features with ambiguous Phylum annotation:

ps = subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))

Now let’s look at a histogram of the ages of the mice:

ggplot(sample_data(ps), aes(x=age)) + geom_histogram() + xlab("age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that the ages of the mice come in a couple of groups, and so we make a categorical variable corresponding to young, middle-aged, and old mice.

sample_data(ps)$age_binned = cut(sample_data(ps)$age,
                          breaks = c(0, 100, 200, 400))

Furthermore, we log-transform the data. This is an approximate variance stabilizing transformation (it would be more appropriate to use the variance stabilizing functionality available in the DESeq2 package).

pslog = transform_sample_counts(ps, function(x) log(1 + x))

A first principal coordinates analysis (PCoA)

For a first pass, we look at principal coordinates analysis (PCoA) with either the Bray-Curtis dissimilarity on the weighted Unifrac distance.

set.seed(0xdada2)
out_wuf_log = ordinate(pslog, method = "MDS", distance = "wunifrac")
## Warning in UniFrac(physeq, weighted =
## TRUE, ...): Randomly assigning root as --
## GCGAGCGTTATCCGGATTCACTGGGTGTAAAGGGAGCGTAGACGGCCAAGCAAGCCAGGGGTGAAAGCCCGGGGCCCAACCCCGGGACTGCCCTTGGAACTGCTTGGCTGGAGTGCGGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGCGGCGAAGGCGGCCTGCTGGACCGCGACTGACGTTGAGGCTCGAAAGCGTGGGGAG
## -- in the phylogenetic tree in the data you provided.
evals =  out_wuf_log$values$Eigenvalues
plot_ordination(pslog, out_wuf_log, color = "age_binned", shape="sex") +
  labs(col = "Binned Age") +
  coord_fixed(sqrt(evals[2] / evals[1]))

Question 5: The above plot shows the proportion of variance explained by the 1st axis (47%) and the 2nd axis (11.2%). What is the variance explained by the 3rd axis? (Hint: Look at evals. Note that evals[2],evals[1] is 11.2, 47.)

round(100*evals[3]/sum(evals),digits = 1)
## [1] 7.9

We see immediately that there are some outliers. We will take them out, since we are mainly interested in the relationships between the non-outlier points (although we skip the details of how to specify the outliers).

outlier_idx =   rownames(sample_data(pslog)) %in% c("M3D149","M2D19","M1D9", "M1D149", "F5D165", "F6D165")

pslog2 = prune_samples(!outlier_idx, pslog)

Question 6: Redo the same ordination as above after outlier removal. What is the proportion of variance explained by the 1st axis?

Different ordination projections

In this section we will explore different ordination projections.

First we will perform a PCoA using Bray-Curtis dissimilarity.

out_bc_log = ordinate(pslog2, method = "MDS", distance = "bray")
evals_bc = out_bc_log$values$Eigenvalues
plot_ordination(pslog2, out_bc_log, color = "age_binned") +
  coord_fixed(sqrt(evals_bc[2] / evals_bc[1])) +
  labs(col = "Binned Age")

The plot above shows the ordination of the samples, and we see that the second axis corresponds to an age effect, with the samples from the younger and older mice separating fairly well.

Next we look at double principal coordinates analysis (DPCoA), which is a phylogenetic ordination method that provides a biplot representation of both samples and taxonomic categories. (Warning: This next line will take a while to compute.)

out_dpcoa_log = ordinate(pslog2, method = "DPCoA")
evals_dpcoa = out_dpcoa_log$eig
plot_ordination(pslog2, out_dpcoa_log, color = "age_binned",
                  shape = "family_relationship") +
  coord_fixed(sqrt(evals_dpcoa[2] / evals_dpcoa[1])) +
  labs(col = "Binned Age", shape = "Litter")

We see again that the second axis corresponds to young vs. old mice.

plot_ordination(pslog2, out_dpcoa_log, type = "species", color = "Phylum") +
  coord_fixed(sqrt(evals_dpcoa[2] / evals_dpcoa[1]))

The second plot suggests an interpretation of the second axis: samples that have larger scores on the second axis have more taxa from Bacteroidetes and one subset of Firmicutes.

Question 7: Consider a sample which is particularly enriched in bacteria from the Bacteroidetes Phylum. Do you think it is more likely that its projection on the first axis will have a positive value or negative value?

PCA on ranks

Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc.

abund = otu_table(pslog2)
abund_ranks = t(apply(abund, 1, rank))

Naively using these ranks could make differences between pairs of low and high abundance microbes comparable. In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks.

abund_ranks = abund_ranks - 329
abund_ranks[abund_ranks < 1] = 1

We can now perform PCA and study the resulting biplot. Run the PCA first:

ranks_pca = dudi.pca(abund_ranks, scannf = F, nf = 3)

To produce annotation for this figure, we used the following blocks.

tax <- tax_table(ps)@.Data %>%
  data.frame(stringsAsFactors = FALSE)
tax$seq <- rownames(tax)

main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                   "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
tax$otu_id <- seq_len(ncol(otu_table(ps)))
row_scores = data.frame(li = ranks_pca$li,
                            SampleID = rownames(abund_ranks))
col_scores <- data.frame(co = ranks_pca$co,
                            seq = colnames(abund_ranks))

evals_prop_pca <- 100 * (ranks_pca$eig / sum(ranks_pca$eig))

row_scores <- row_scores %>%
  left_join(sample_data(pslog))
## Joining, by = "SampleID"
## Warning: Column `SampleID` joining factor and character vector, coercing
## into character vector
col_scores <- col_scores %>%
  left_join(tax)
## Joining, by = "seq"
## Warning: Column `seq` joining factor and character vector, coercing into
## character vector

Finally, let’s use ggplot:

ggplot() +
  geom_point(data = row_scores, aes(x = li.Axis1, y = li.Axis2), shape = 2) +
  geom_point(data = col_scores, aes(x = 25 * co.Comp1, y = 25 * co.Comp2, col = Order),
              size = .3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  facet_grid(~ age_binned) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop_pca[1], 2)), 
       y = sprintf("Axis2 [%s%% variance]", round(evals_prop_pca[2], 2))) +
  coord_fixed(sqrt(ranks_pca$eig[2] / ranks_pca$eig[1])) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

Canonical Correspondence

Canonical Correspondence Analysis (CCpnA) is an approach to ordination of a species by sample table that incorporates supplemental information about the samples. As before, the purpose of creating biplots is to determine which types of bacterial communities are most prominent in different mouse sample types. It can be easier to interpret these biplots when the ordering between samples reflects sample characteristics – variations in age or litter status in the mouse data, for example – and this central to the design of CCpnA.

The function allows to create biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics; in contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, it ensures that the resulting CCpnA directions lie in the span of the environmental variables.

Like PCoA and DPCoA, this method can be run using ordinate in phyloseq. In order to use supplemental sample data, it is necessary to provide an extra argument, specifying which of the features to consider – otherwise, phyloseq defaults to using all sample_data measurements when producing the ordination.

ps_ccpna = ordinate(pslog2, "CCA", formula = pslog2 ~ age_binned + family_relationship)

To access the positions for the biplot, we can use the scores function in the vegan package. Further, to facilitate figure annotation, we also join the site scores with the environmental data in sample_data. We only explicitly annotate the four most abundant taxonomic orders – this makes the biplot easier to read.

ps_scores = vegan::scores(ps_ccpna)
sites = data.frame(ps_scores$sites)
sites$SampleID = rownames(sites)
sites = sites %>%
  left_join(sample_data(ps))
## Joining, by = "SampleID"
species = data.frame(ps_scores$species)
species$otu_id = seq_along(colnames(otu_table(ps)))
species = species %>% left_join(tax)
## Joining, by = "otu_id"
evals_prop_ccpna = 100 * ps_ccpna$CCA$eig[1:2] / sum(ps_ccpna$CA$eig)
ggplot(data=sites) +
  geom_point(data = sites, aes(x = CCA1, y = CCA2), shape = 2, alpha = 0.5) +
  geom_point(data = species, aes(x = CCA1, y = CCA2, col = Order), size = 0.5) +
  geom_text_repel(data = species %>% filter(CCA2 < -2),
            aes(x = CCA1, y = CCA2, label = otu_id),
            size = 1.5, segment.size = 0.1) +
  facet_grid(. ~ age_binned) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop_ccpna[1], 2)), 
       y = sprintf("Axis2 [%s%% variance]", round(evals_prop_ccpna[2], 2))) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed(sqrt(ps_ccpna$CCA$eig[2] / ps_ccpna$CCA$eig[1])*0.33) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

The sites and species are triangles and circles, respectively.

We have labeled individual microbes that are outliers along the second CCpnA direction.

Evidently, the first CCpnA direction distinguishes between mice in the two main age bins. Circles on the left and right of the biplot represent microbes that are characteristic of younger and older mice, respectively.

This CCpnA analysis supports our conclusions from the earlier ordinations – the main difference between the microbiome communities of the different mice lies along the age axis. However, in situations where the influence of environmental variables is not so strong, CCA can have more power in detecting such associations. In general, it can be applied whenever it is desirable to incorporate supplemental data, but in a way that (1) is less aggressive than supervised methods, and (2) can use several environmental variables at once.

Question 8: Repeat the above plot, this time facetting on the litter (column family_relationship). Is Axis 1 or Axis 2 better at distinguishing the litter?

Multitables:

Many microbiome studies attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features – bacteria, genes, or metabolites, for example – are associated with different sample conditions. There are many ways to approach these questions, which to apply depends on the study’s focus.

Here, we will use sparse Canonical Correlation Analysis (sparse CCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting variation. We will use an implementation from the PMA package.

Since the mouse data used above included only a single table, we use a new data set. There are two tables here, one for bacteria and another with metabolites. 12 samples were obtained, each with measurements at 637 m/z values and 20,609 OTUs.

Preprocessing

First let’s load the metabolite and microbe data.

metab_path = url("https://cdn.rawgit.com/spholmes/F1000_workflow/891463f6/data/metabolites.csv")
microbe_path = url("https://cdn.rawgit.com/spholmes/F1000_workflow/891463f6/data/microbe.rda")
metab_init = read.csv(metab_path, row.names = 1)
metab_init = as.matrix(metab_init)
microbe = get(load(microbe_path))

We will do some preprocessing steps now, e.g. removing taxa which appear very rarely and log-transforming the metabolite data.

keep_ix = rowSums(metab_init == 0) <= 3
metab = metab_init[keep_ix, ]
microbe = prune_taxa(taxa_sums(microbe) > 4, microbe)
microbe = filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE)
metab = log(1 + metab, base = 10)
X = otu_table(microbe)@.Data
X[X > 50] <- 50

Question 9: How many metabolites were removed from the dataset because their measurement was too sparse?

We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables. We then apply PCA to this selected subset of features. In this sense, we use sparse CCA as a screening procedure, rather than as an ordination method.

Our implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Larger values of penaltyx will result in more selected microbes, similarly penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide.

cca_res = CCA(t(X), t(metab),  penaltyx = .15, penaltyz = .15)
## 123456789101112131415
cca_res
## Call: CCA(x = t(X), z = t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 
## 
## Num non-zeros u's:  5 
## Num non-zeros v's:  15 
## Type of x:  standard 
## Type of z:  standard 
## Penalty for x: L1 bound is  0.15 
## Penalty for z: L1 bound is  0.15 
## Cor(Xu,Zv):  0.9735258

With these parameters, 5 microbes and 15 metabolites have been selected, based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the “true” features in any sense.

Question 10: Increase penaltyz to 0.5 in the above call to CCA while keeping penaltyx fixed. How many metabolites are selected?

Nonetheless, we can still use these 20 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA.

combined = cbind(t(X[cca_res$u != 0, ]),
            t(metab[cca_res$v != 0, ]))
pca_res = dudi.pca(combined, scannf = F, nf = 3)
# annotation
genotype = substr(rownames(pca_res$li), 1, 2)
sample_type = substr(rownames(pca_res$li), 3, 4)
feature_type = grepl("\\.", colnames(combined))
feature_type = ifelse(feature_type, "Metabolite", "OTU")

sample_info = data.frame(pca_res$li, genotype, sample_type)
feature_info = data.frame(pca_res$c1,
                  feature = substr(colnames(combined), 1, 6),
                  feature_type = feature_type)
ggplot() + geom_point(data = sample_info,
            aes(x = Axis1, y = Axis2, col = sample_type, shape = genotype), size = 3) +
  geom_label_repel(data = feature_info,
                     aes(x = 5.5 * CS1, y = 5.5 * CS2, label = feature, fill = feature_type),
                     size = 2, segment.size = 0.3,
                     label.padding = unit(0.1, "lines"), label.size = 0) +
  geom_point(data = feature_info,
          aes(x = 5.5 * CS1, y = 5.5 * CS2, fill = feature_type),
          size = 1, shape = 23, col = "#383838") +
  scale_color_brewer(palette = "Set2") +
  scale_fill_manual(values = c("#a6d854", "#e78ac3")) +
  guides(fill = guide_legend(override.aes = list(shape = 32, size = 0))) +
  coord_fixed(sqrt(pca_res$eig[2] / pca_res$eig[2])) +
  labs(x = sprintf("Axis1 [%s%% Variance]",
                100 * round(pca_res$eig[1] / sum(pca_res$eig), 2)),
       y = sprintf("Axis2 [%s%% Variance]",
                100 * round(pca_res$eig[2] / sum(pca_res$eig), 2)),
       fill = "Feature Type", col = "Sample Type")

This is a PCA triplot, where we show different types of samples and the multidomain features (Metabolites and OTUs). This allows comparison across the measured samples – triangles for Knockout and circles for wild type – and characterizes the influence the different features – diamonds with text labels. For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. For example, large values of 15 of the features are associated with ST status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features.