Contents

Abundances are ranked within samples

This was first published in the Bioconductor workflow paper.

Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc.

library(phyloseq)
pst = readRDS("../../data/ps1000R.rds")
abund <- otu_table(pst)
abund_ranks <- t(apply(abund, 1, rank))

Naively using these ranks could make differences between pairs of low and high abundance microbes comparable. In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur[@holmes2011] for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks.

This is done by doing something like this:

abund_ranks <- abund_ranks - 329
abund_ranks[abund_ranks < 1] <- 1

329 is an arbirary tuning parameter, it indicates here that we think there are really about 60 taxa present.

PCA analysis on ranks

In the phyloseq lab we created a imilar abund_ranks data set for the Kostic data.

We use these ranks to do a PCA:

library("ade4")
library("dplyr")
library("reshape2")
ranks_pca <- dudi.pca(abund_ranks, scannf = F, nf = 3)

To produce annotation for this figure, we use the following “data wrangling” tricks, do not worry about this too muc, for reproducibility reasons, we are showing you all the details.

tax <- tax_table(pst)@.Data %>%
  data.frame(stringsAsFactors = FALSE)
tax$seq <- rownames(tax)

main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                   "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
tax$otu_id <- seq_len(ncol(otu_table(pst)))
row_scores = data.frame(li = ranks_pca$li,
            SampleID = rownames(abund_ranks))
col_scores <- data.frame(co = ranks_pca$co,
              seq = colnames(abund_ranks))

sample_data(pst)$age_binned <- cut(sample_data(pst)$age,
                                  breaks = c(0, 100, 200, 400))
## Proportion of variance explained
evals_prop <- 100 * (ranks_pca$eig / sum(ranks_pca$eig))

row_scores <- row_scores %>%
  left_join(data.frame(sample_data(pst)))
## Warning: Column `SampleID` joining factor and character vector,
## coercing into character vector
col_scores <- col_scores %>%
  left_join(tax)
## Warning: Column `seq` joining factor and character vector, coercing
## into character vector

Finally, let’s use ggplot:

require(ggplot2)
ggplot() +
  geom_point(data = row_scores, aes(x = li.Axis1, y = li.Axis2), shape = 2) +
  geom_point(data = col_scores, aes(x = 25 * co.Comp1, y = 25 * co.Comp2, col = Order),
              size = .3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  facet_grid(~ age_binned) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)), 
  y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) +
  coord_fixed(sqrt(ranks_pca$eig[2] / ranks_pca$eig[1])) +
  theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))

A first principal coordinates analysis (PCoA or MDS)

For a first pass, we look at principal coordinates analysis (PCoA) with either the Bray-Curtis dissimilarity on the weighted Unifrac distance.

Question M1: Here is the code for the weighted Unifrac: you should now generate a plot with the Bray-Curtis distance instead.

set.seed(0xdada2)
out_wuf = ordinate(pst, method = "MDS", distance = "wunifrac")
## Warning in UniFrac(physeq, weighted =
## TRUE, ...): Randomly assigning root as --
## GCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCTAATCAAGTCAGAAGTGAAAACCCAGGGCTCAACTTTGGGACTGCTTTTGAAACTGAGTAGCTAGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAG
## -- in the phylogenetic tree in the data you provided.
evals =  out_wuf$values$Eigenvalues
plot_ordination(pst, out_wuf, color = "age_binned", shape="sex")  +
  coord_fixed(sqrt(evals[2] / evals[1]))

Question M2: The above plot shows the proportion of variance explained by the 1st axis (48%) and the 2nd axis (21%). What is the variance explained by the 3rd axis? (Hint: Look at evals. Note that evals[2],evals[1] is 21, 48.)

round(100*evals[3]/sum(evals))
## [1] 8

We see immediately that there are some outliers. We will take them out, since we are mainly interested in the relationships between the non-outlier points (although we skip the details of how to specify the outliers).

outlier_idx =   rownames(sample_data(pst)) %in% c("M3D149","M2D19","M1D9", "M1D149", "F5D165", "F6D165")
pst2 = prune_samples(!outlier_idx, pst)

Question M3: Redo the same ordination as above after outlier removal. What is the proportion of variance explained by the 1st axis?

Different ordination projections

In this section we will explore different ordination projections.

First we will perform a PCoA using Bray-Curtis dissimilarity.

out_bc = ordinate(pst2, method = "MDS", distance = "bray")
evals_bc = out_bc$values$Eigenvalues
plot_ordination(pst2, out_bc, color = "age_binned") +
  coord_fixed(sqrt(evals_bc[2] / evals_bc[1])) 

The plot above shows the ordination of the samples, and we see that the second axis corresponds to an age effect, with the samples from the younger and older mice separating fairly well.

Next we look at double principal coordinates analysis (DPCoA), which is a phylogenetic ordination method that provides a biplot representation of both samples and taxonomic categories. (Warning: This next line will take a while to compute.)

out_dpcoa = ordinate(pst2, method = "DPCoA")
evals_dpcoa = out_dpcoa$eig
plot_ordination(pst2, out_dpcoa, color = "age_binned",
                  shape = "family_relationship") +
  coord_fixed(sqrt(evals_dpcoa[2] / evals_dpcoa[1])) +
  labs(col = "Binned Age", shape = "Litter")

We see again that the second axis corresponds to young vs. old mice.

plot_ordination(pst2, out_dpcoa, type = "species", color = "Phylum") +
  coord_fixed(sqrt(evals_dpcoa[2] / evals_dpcoa[1]))

The second plot suggests an interpretation of the second axis: samples that have larger scores on the second axis have more taxa from Bacteroidetes and one subset of Firmicutes.

Question M5: Consider a sample which is particularly enriched in bacteria from the Bacteroidetes Phylum. Do you think it is more likely that its projection on the first axis will have a positive value or negative value?