In this lab we will learn the basics of the analysis of microbiome data
using the phyloseq
package and some
refinements of the graphics using the ggplot2
package.
(You will hear more about ggplot2 tomorrow in the visualization lecture.)
The goal of the phyloseq package is to facilitate the kind of interactive, “not canned” workflow depicted in the graphic below.
The three main steps in phyloseq are:
The phyloseq package provides special functions for accomplishing this in a way that is consistent, reliable, low-error, and reproducible.
A phyloseq object can contain multiple related tables, a tree, and even the sequence corresponding to each observation in the “OTU table”.
A picture is maybe better than words, here:
Make sure you have the installed packages from both CRAN and BioConductor.
biocpkgs_needed = c("ggplot2", "phyloseq")
pkgs_needed = c("PMA","dplyr","ade4","ggrepel","vegan", "tidyverse", "ggtree")
letsinstall = setdiff(pkgs_needed, installed.packages())
letsinstallb = setdiff(biocpkgs_needed, installed.packages())
if (length(letsinstallb) > 0) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(letsinstallb, version = "3.9")
}
Load packages.
require("tidyverse")
library("ggplot2")
library("dplyr")
library("phyloseq")
ggplot2 package theme set. See the ggplot2 online documentation for further help.
theme_set(theme_bw())
phyloseq
There are multiple example data sets included in phyloseq. Many are from published investigations and include documentation with a summary and references, as well as some example code representing some aspect of analysis available in phyloseq.
To load example data into the working environment, use the data()
command:
data(GlobalPatterns)
data(esophagus)
data(enterotype)
data(soilrep)
In the package index, go to the names beginning with “data-” to see the documentation of currently available example datasets.
Try for example
?GlobalPatterns
to access the documentation for the so-called “GlobalPatterns” dataset.
You can also look at a composite object with the object viewer by clicking on
the object name in RStudio after you have loaded it with the data()
command.
data(enterotype)
Question 1: Using the Environment tab/pane in RStudio, click on enterotpype to inspect the enterotype object, how many biological samples does it contain?
The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data. Take some time to explore the object, before we start doing statistical analyses.
You can also try the examples included with the example data documentation (as well as examples for functions/methods) using the standard example
command in R – in this case the examples for the enterotype
dataset.
Question 2 : What type of output does the enterotype example provide?
example(enterotype, ask=FALSE)
##
## entrty> data(enterotype)
##
## entrty> ig <- make_network(enterotype, "samples", max.dist=0.3)
##
## entrty> plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
## Warning: attributes are not identical across measure variables; they
## will be dropped
Question 3 : Bioconductor objects are often lists whose components are called slots
.
How many slots
does the GlobalPatterns
object have?
data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
Question 4 : What is the dimension of its otu_table
component?
dim(otu_table(GlobalPatterns))
## [1] 19216 26
Question 5: How many different levels does the SampleType
variable in the
sample_data
slot of the GlobalPatterns data have?
Question 6: Make alpha diversity plots, one for each different Sample Type. Here is an example with the Chao index, try other methods (Shannon for instance).
plot_richness(GlobalPatterns, x="SampleType", measures=c("Chao1"))
The data we will analyze in the first part of the lab corresponds to 360 fecal samples which were collected from 12 mice longitudinally over the first year of life, to investigate the development and stabilization of the murine microbiome. Let’s (down)load the dataset:
#download.file("https://cdn.rawgit.com/spholmes/F1000_workflow/891463f6/data/ps.rds",
# "ps.rds",
# mode="wb")
ps = readRDS("ps.rds")
The ps
object is of class phyloseq
from the package phyloseq
.
class(ps)
## [1] "phyloseq"
## attr(,"package")
## [1] "phyloseq"
We often use “rds” files because they have better compression than the ordinary .rda
or
.RData
files.
Question 7: How many slots does the ps
object have?
Question 8: How many distinct Phyla (such as “Bacteroidetes”) have been identified in this dataset? (Hint: Look up the tax_table
function and then inspect its output. Also make sure to ignore potential NA
values!)
tax_table(ps)[,"Phylum"] %>% unique %>% na.exclude %>% length
## [1] 10
Question 9: In total, does this dataset include more measurements for female or male mice (Hint: Look up the documentation of sample_data
)?
table(sample_data(ps)$sex)
##
## F M
## 173 187
Question 10: How many unique female mice were part of this experiment?
sample_data(ps) %>% group_by(sex) %>% dplyr::summarize(n = length(unique(host_subject_id)))
## # A tibble: 2 x 2
## sex n
## <chr> <int>
## 1 F 6
## 2 M 6
The phyloseq package comes with example legacy QIIME output files. The way to find these files on your system is:
otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq")
otufile
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/phyloseq/extdata/GP_otu_table_rand_short.txt.gz"
mapfile <- system.file("extdata", "master_map.txt", package="phyloseq")
mapfile
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/phyloseq/extdata/master_map.txt"
trefile <- system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
trefile
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/phyloseq/extdata/GP_tree_rand_short.newick.gz"
Now use the import_qiime
function to import the data
from these three files, and generate a phyloseq object.
gp = import_qiime(otufile, mapfile, trefile, verbose=FALSE)
class(gp)
## [1] "phyloseq"
## attr(,"package")
## [1] "phyloseq"
gp
?gp
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 500 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 500 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 500 tips and 499 internal nodes ]
ntaxa(gp)
## [1] 500
26
## [1] 26
sample_variables(gp)
## [1] "X.SampleID" "Primer"
## [3] "Final_Barcode" "Barcode_truncated_plus_T"
## [5] "Barcode_full_length" "SampleType"
## [7] "Description"
gp %>% sample_data() %>% ncol
## [1] 7
import_biom()
):biomFile <- system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq")
biomFile
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/phyloseq/extdata/rich_dense_otu_table.biom"
Answer:
import_biom(biomFile)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
gp
, but for this new data?Before doing the statistical analyses, we will do some basic “data munging”
First remove features with ambiguous Phylum annotation:
ps = subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
Now let’s look at a histogram of the ages of the mice:
ggplot(sample_data(ps), aes(x=age)) + geom_histogram() + xlab("age")
We see that the ages of the mice come in a couple of groups, and so we make a categorical variable corresponding to young, middle-aged, and old mice.
sample_data(ps)$age_binned = cut(sample_data(ps)$age,
breaks = c(0, 100, 200, 400))
Furthermore, we could log-transform the data. This is an approximate variance stabilizing transformation (it would be more appropriate to use the variance stabilizing functionality available in the DESeq2
package).
pslog = transform_sample_counts(ps, function(x) log(1 + x))
Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks.
This is called a robust method and we demonstrate this idea using a rank-transformed version of the data.
First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc.
The following are key phyloseq functions for filtering/subsetting
prune_taxa
prune_samples
subset_taxa
subset_samples
subset_taxa
and subset_samples
are convenience wrapper around the subset
function.
You can think of them as analogous to the filter()
function in the tidyverse.
They support expressions of inclusion based on
the tax_table
or sample_data
, respectively.
# Only Firmicutes
subset_taxa(gp, Phylum == "Firmicutes")
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 100 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 100 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 100 tips and 99 internal nodes ]
# Only ocean samples
subset_samples(gp, SampleType == "Ocean")
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 500 taxa and 3 samples ]
## sample_data() Sample Data: [ 3 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 500 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 500 tips and 499 internal nodes ]
prune_taxa
and prune_samples
are functions
for filtering based on an previously defined inclusion vector.
You can think of this as similar to square bracket notation
on rows or columns of a table.
# keep only taxa that were observed at least twice
prune_taxa(taxa_sums(gp) >= 2, gp)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 444 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 444 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 444 tips and 443 internal nodes ]
# keep only samples with count greater than 2000
prune_samples(sample_sums(gp) >= 2000, gp)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 500 taxa and 25 samples ]
## sample_data() Sample Data: [ 25 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 500 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 500 tips and 499 internal nodes ]
(taxa_sums(gp) > 1) %>% sum
## [1] 444
filter_taxa(gp, function(x){sum(x > 0) > 1}, prune = TRUE)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 414 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 414 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 414 tips and 413 internal nodes ]
What about more than 5 samples?
filter_taxa(gp, function(x){sum(x > 0) > 5}, prune = TRUE)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 196 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 196 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 196 tips and 195 internal nodes ]
prune_samples(sample_sums(gp) > 4500, gp)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 500 taxa and 23 samples ]
## sample_data() Sample Data: [ 23 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 500 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 500 tips and 499 internal nodes ]
# Which samples are human
sample_data(gp)[["SampleType"]] %in% c("Mock", "Skin", "Tongue", "Feces")
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE TRUE
# get_variable(gp, "SampleType") %in% c("Mock", "Skin", "Tongue", "Feces")
humanSample = sample_data(gp)[["SampleType"]] %in% c("Mock", "Skin", "Tongue", "Feces")
prune_samples(samples = humanSample, gp)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 500 taxa and 12 samples ]
## sample_data() Sample Data: [ 12 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 500 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 500 tips and 499 internal nodes ]
tax_glom()
– Taxonomy-defined, taxonomy-aware agglomeration of observations.tip_glom()
– Tree-defined, taxonomy-aware agglomeration of observations.# agglomerate at the taxonomic rank of Family
(tax_glom(gp, taxrank="Order"))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 87 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 87 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 87 tips and 86 internal nodes ]
tip_glom()
can take time for a dataset with a large number of taxa,
so let’s first apply our newly-acquired tidyverse knowledge,
and determine a phylum with a manageable number of unique OTUs.
We will subset to just this Phylum for the purposes of demonstrating
tree-based agglommeration quickly.
gp %>%
psmelt %>%
group_by(Phylum) %>%
summarize(OTUn = (unique(OTU) %>% length)) %>%
arrange(desc(OTUn))
## # A tibble: 31 x 2
## Phylum OTUn
## <fct> <int>
## 1 Proteobacteria 159
## 2 Firmicutes 100
## 3 Bacteroidetes 63
## 4 Actinobacteria 50
## 5 Acidobacteria 27
## 6 Planctomycetes 23
## 7 Chloroflexi 12
## 8 Cyanobacteria 12
## 9 Verrucomicrobia 9
## 10 Nitrospirae 5
## # … with 21 more rows
gpAcid <- subset_taxa(gp, Phylum == "Acidobacteria")
plot_tree(gpAcid, label.tips="taxa_names",
size="abundance", title="Before tip_glom()", ladderize = "left")
gpTipGlom <- tip_glom(gpAcid, h = 0.2)
plot_tree(gpTipGlom, label.tips="taxa_names",
size="abundance", title="After tip_glom()", ladderize = "left")
gp
dataset, what are available ranks in the taxonomy
(Hint: try rank_names()
)?rank_names(gp)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus"
## [7] "Species"
tax_glom()
to agglomerate at taxonomic rank “Family”.gpFam <- tax_glom(gp, "Family")
gpFam
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 131 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 131 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 131 tips and 130 internal nodes ]
ntaxa
)?ntaxa(gp)
## [1] 500
ntaxa(gpFam)
## [1] 131
h
parameter in tip_glom
?
h
is the height where the tree should be cut to define agglomeration groups. This refers to the tree resulting from hierarchical clustering ofcophenetic.phylo(phy_tree(physeq))
, not necessarily the original phylogenetic tree,phy_tree(physeq)
h
?
> It will make cluster with larger number of members,
> smaller number of distinct features.tip_glom(gp, h = 0.5)
Question 11 What does the function t()
do?
abund = otu_table(ps)
abund_ranks = t(apply(abund, 1, rank))
Question 12 Pick a few taxa (say the first, second and forth) and compare the ranks for the log transformed data and the raw data:
abund_l = otu_table(pslog)
abund_logranks = t(apply(abund_l, 1, rank))
cor(abund_logranks[,1],abund_ranks[,1])
## [1] 1
cor(abund_logranks[,2],abund_ranks[,2])
## [1] 1
cor(abund_logranks[,4],abund_ranks[,4])
## [1] 1
What do you notice?
Naively using these ranks could make differences between pairs of low and high abundance microbes comparable. In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks.
abund_ranks = abund_ranks - 329
abund_ranks[abund_ranks < 1] = 1
If we want to use these data later on, we can save them as a compressed R
file with:
saveRDS(abund_ranks,"abund_ranks.rds")
Question 13 Compute the correlations between these data using the function cor
and then as a bonus, plot the correlations as usefully as you can.
Let’s jump into plotting using phyloseq.
plot_richness()
plot_ordination()
plot_heatmap()
plot_tree()
(use gpAcid
, or something < 200 taxa)plot_bar()
plot_net()
# go for it
When might you use each? >depends on what you need to emphasize, show, explore.
What is plot_richness()
showing you?
Which variables in this plot data are good
for mapping to color
or faceting?
plot_richness(gp, x = "SampleType", color = "SampleType",
measures = "Shannon")
Does it look like there are differences in alpha-diversity between sample sources? >Yes
Try different distances and ordination methods for
plot_ordination
and plot_heatmap
.
ordinate(gp, "PCoA", "bray") %>%
plot_ordination(gp, ., color = "SampleType", title = "Bray-Curtis")
ordinate(gp, "PCoA", "wUniFrac") %>%
plot_ordination(gp, ., color = "SampleType", title = "weighted-UniFrac")
ordinate(gp, "NMDS", "wUniFrac") %>%
plot_ordination(gp, ., color = "SampleType", title = "weighted-UniFrac")
## Run 0 stress 0.1169949
## Run 1 stress 0.1169902
## ... New best solution
## ... Procrustes: rmse 0.001477037 max resid 0.003713577
## ... Similar to previous best
## Run 2 stress 0.1169938
## ... Procrustes: rmse 0.001300766 max resid 0.003219018
## ... Similar to previous best
## Run 3 stress 0.1169951
## ... Procrustes: rmse 0.001488058 max resid 0.003739813
## ... Similar to previous best
## Run 4 stress 0.171832
## Run 5 stress 0.1169928
## ... Procrustes: rmse 0.000941071 max resid 0.002291706
## ... Similar to previous best
## Run 6 stress 0.120547
## Run 7 stress 0.1169914
## ... Procrustes: rmse 0.0007433478 max resid 0.001859757
## ... Similar to previous best
## Run 8 stress 0.1790242
## Run 9 stress 0.1169919
## ... Procrustes: rmse 0.0008900212 max resid 0.002156984
## ... Similar to previous best
## Run 10 stress 0.1205484
## Run 11 stress 0.1379723
## Run 12 stress 0.1169954
## ... Procrustes: rmse 0.001533784 max resid 0.003757569
## ... Similar to previous best
## Run 13 stress 0.1169904
## ... Procrustes: rmse 0.0002494493 max resid 0.0006255789
## ... Similar to previous best
## Run 14 stress 0.1169925
## ... Procrustes: rmse 0.001050537 max resid 0.00255894
## ... Similar to previous best
## Run 15 stress 0.116996
## ... Procrustes: rmse 0.001537947 max resid 0.003829604
## ... Similar to previous best
## Run 16 stress 0.1169945
## ... Procrustes: rmse 0.001423018 max resid 0.003530047
## ... Similar to previous best
## Run 17 stress 0.1169972
## ... Procrustes: rmse 0.001744119 max resid 0.004337548
## ... Similar to previous best
## Run 18 stress 0.1169903
## ... Procrustes: rmse 0.0002227621 max resid 0.00046977
## ... Similar to previous best
## Run 19 stress 0.1205447
## Run 20 stress 0.1728626
## *** Solution reached
plot_heatmap(physeq = gp, method = "MDS", distance = "wUniFrac",
title = "weighted-UniFrac", taxa.label = FALSE)
## Warning: Transformation introduced infinite values in discrete y-axis
plot_net
.plot_net(gp, "bray", color = "SampleType")
plot_net(gp, "WUniFrac", color = "SampleType")
plot_net
# Here are supported options
plot_net(laymeth='list')
## [1] "auto" "random"
## [3] "circle" "sphere"
## [5] "fruchterman.reingold" "kamada.kawai"
## [7] "spring" "reingold.tilford"
## [9] "fruchterman.reingold.grid" "lgl"
## [11] "graphopt" "svd"
# Here are examples
plot_net(gp, "bray", color = "SampleType", laymeth = "circle")
plot_net(gp, "bray", color = "SampleType", laymeth = "reingold.tilford")
plot_net(gp, "bray", color = "SampleType", laymeth = "svd")
## Warning in laymeth(g, ...): SVD layout was removed, we use
## Fruchterman-Reingold instead.
We want to plot trees, sometimes even bootstrap values, but notice that the node labels in the GlobalPatterns
dataset are actually a bit strange. They look like they might be bootstrap values, but they sometimes have two decimals.
head(phy_tree(GlobalPatterns)$node.label, 10)
## [1] "" "0.858.4" "1.000.154" "0.764.3" "0.995.2"
## [6] "1.000.2" "0.943.7" "0.971.6" "0.766" "0.611"
There are actual a second set of spurious labels that were appended to some node labels. Could systematically remove the second decimal, but why not just take the first 4 characters instead?
phy_tree(GlobalPatterns)$node.label = substr(phy_tree(GlobalPatterns)$node.label, 1, 4)
Great, now that we’re more happy with the node labels at least looking like bootstrap values, we can move on to using these along with other information about data mapped onto the tree graphic.
The GlobalPatterns
dataset has many OTUs, more than we would want to try to fit on a tree graphic
ntaxa(GlobalPatterns)
## [1] 19216
So, let’s arbitrarily prune to just the first 50 OTUs in GlobalPatterns
, and store this as physeq
, which also happens to be the name for most main data parameters of function in the phyloseq package.
physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
Now let’s look at what happens with the default plot_tree
settings.
plot_tree(physeq)
By default, black dots are annotated next to tips (OTUs) in the tree, one for each sample in which that OTU was observed. Some have more dots than others.
Also by default, the node labels that were stored in the tree were added next to each node without any processing (although we had trimmed their length to 4 characters in the previous step).
What if we want to just see the tree with no sample points next to the tips?
plot_tree(physeq, "treeonly")
And what about without the node labels either?
# plot_tree(physeq, "treeonly", nodeplotblank)
We can adjust the way branches are rotated to make it look nicer using the ladderize
parameter.
# plot_tree(physeq, "treeonly", nodeplotblank, ladderize="left")
# plot_tree(physeq, "treeonly", nodeplotblank, ladderize="right")
And what if we want to add the OTU labels next to each tip?
# plot_tree(physeq, nodelabf=nodeplotblank, label.tips="taxa_names", ladderize="left")
Any method
parameter argument other than "sampledodge"
(the default) will not add dodged sample points next to the tips.
# plot_tree(physeq, "anythingelse")
In the default argument to method
, "sampledodge"
, a point is added next to each OTU tip in the tree for every sample in which that OTU was observed. We can then map certain aesthetic features of these points to variables in our data.
Color is one of the most useful aesthetics in tree graphics when they are complicated. Color can be mapped to either taxonomic ranks or sample covariates. For instance, we can map color to the type of sample collected (environmental location).
# plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType")
and we can alternatively map color to taxonomic class.
# plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="Class")
You can also map a variable to point shape if it has 6 or fewer categories, and this can be done even when color is also mapped. Here we map shape to taxonomic class so that we can still indicate it in the graphic while also mapping SampleType
to point color.
# plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType", shape="Class")
One of the most common reasons to label nodes is to add confidence measures, often a bootstrap value, to the nodes of the tree. The following graphics show different ways of doing this (labels are added by default if present in your tree).
# # The default
# plot_tree(physeq, color="SampleType", ladderize="left")
# # Special bootstrap label
# plot_tree(physeq, nodelabf=nodeplotboot(), color="SampleType", ladderize="left")
# # Special bootstrap label with alternative thresholds
# plot_tree(physeq, nodelabf=nodeplotboot(80,0,3), color="SampleType", ladderize="left")
label.tips - The label.tips
parameter controls labeling of tree tips (AKA leaves). Default is NULL, indicating that no tip labels will be printed.
Using taxa_names
is a special argument resulting in the OTU name (try taxa_names
function) being labelled next to the leaves or next to the set of points that label the leaves. Alternatively, if your data object contains a tax_table
, then one of the rank names (from rank_names(physeq)
) can be provided, and the classification of each OTU at that rank will be labeled instead.
text.size - A positive numeric argument indicating the ggplot2 size parameter for the taxa labels. Default is NULL
. If left as NULL
, this function will automatically calculate a (hopefully) optimal text size given the size constraints posed by the tree itself (for a vertical tree). This argument is included mainly in case the automatically-calculated size is wrong and you want to change it. Note that this parameter is only meaningful if label.tips
is not NULL
# plot_tree(physeq, nodelabf=nodeplotboot(80,0,3),
# color="SampleType", label.tips="taxa_names", ladderize="left")
The following is the default barplot when no parameters are given. The dataset is plotted with every sample mapped individually to the horizontal (x
) axis, and abundance values mapped to the veritcal (y
) axis. At each sample’s horizontal position, the abundance values for each OTU are stacked in order from greatest to least, separate by a thin horizontal line. As long as the parameters you choose to separate the data result in more than one OTU abundance value at the respective position in the plot, the values will be stacked in order as a means of displaying both the sum total value while still representing the individual OTU abundances.
We start by taking a subset of taxa, those of the Chlamydiae Phylum:
gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
Then the default plot_bar()
functions visualizes the data.
plot_bar(gp.ch)
Add fill color to represent the Genus to which each OTU belongs.
plot_bar(gp.ch, fill="Genus")
Now keep the same fill color, and group the samples together by the SampleType
variable; essentially, the environment from which the sample was taken and sequenced.
plot_bar(gp.ch, x="SampleType", fill="Genus")
Note that abundance values for the same OTU from the same SampleType
will be stacked as separate bar segments, and so the segment lines may not accurately portray the observed richness (because the same OTU might be shown more than once for the same horizontal axis grouping). However, all other aspects of the representation are quantitative, with the total stacked bar height at each horizontal position indicating the sum of all reads for that sample(s). There is no attempt by plot_bar
to normalize or standardize your data, which is your job to do (using other tools in the phyloseq pacakge, for instance) before attempting to interpret/compare these values between samples.
In the following example we elected to further organize the data using “facets” – separate, adjacent sub-plots. In this case the facets allow us to according to the genus of each OTU. Within each genus facet, the data is further separated by sequencing technology, and the enterotype label for the sample from which each OTU originated is indicated by fill color.
plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
Note that additional customizations of the plot are always possible using standard ggplot2 layers. For example, the following code chunk shows a plot with jittered points add using a second plot layer.
library("ggplot2")
p = plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
p + geom_point(aes(x=Family, y=Abundance), color="black", position="jitter", size=3)
First, load package (if you haven’t already), then trim Enterotype data to most abundant 10 genera.
TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10])
ent10 <- prune_species(TopNOTUs, enterotype)
## Warning: 'prune_species' is deprecated.
## Use 'prune_taxa' instead.
## See help("Deprecated") and help("phyloseq-deprecated").
The parameters to plot_bar
in the following code-chunk were chosen after various trials. We suggest that you also try different parameter settings while you’re exploring different features of the data.
In addition to the variables names of sample_data
, the plot_bar
function recognizes the names of taxonomic ranks, if present.
In this example we have also elected to organize data by “facets” (separate, adjacent sub-plots) according to the genus of each OTU. Within each genus facet, the data is further separated by sequencing technology, and the enterotype label for the sample from which each OTU originated is indicated by fill color. Abundance values from different samples and OTUs but having the same variables mapped to the horizontal (x
) axis are sorted and stacked, with thin horizontal lines designating the boundaries. With this display it is very clear that the choice of sequencing technology had a large effect on which genera were detected, as well as the fraction of OTUs that were assigned to a Genus.
plot_bar(ent10, "SeqTech", fill="Enterotype", facet_grid=~Genus)
You could nix the approach in which OTU abundance values from different samples, different enterotypes, are stacked together and simply shaded differently, and instead opt to separate both the enterotype designation of the samples and the genus designation of the OTUs into one grid. Only a slight modification to the previous function call is necessary in that case (with an added fill to make it even easier to read):
plot_bar(ent10, "Genus", fill="Genus", facet_grid=SeqTech~Enterotype)
:-)
Sometimes you’ll want to define your own ggplot from scratch,
or you’ll want to use the many Data Frame oriented tools
provided in the tidyverse.
To address these, phyloseq provides a “melting” function,
psmelt()
, that converts the complicated
phyloseq-encoded data into a “tidy” data frame.
In fact, most phyloseq plot functions depend on psmelt()
.
Reading the code of these functions can help you get ideas
for how you might define your own custom plots.
(Caveat, trees are not naturally represented by a table,
so psmelt
does not attempt to return anything encoded in the tree).
Try it:
gpAcid %>%
psmelt() %>%
group_by(Sample) %>%
mutate(Proportion = Abundance / sum(Abundance, na.rm = TRUE)) %>%
filter(Proportion > 0) %>%
filter(!is.na(Class)) %>%
ggplot(aes(y = Class, x = log10(Proportion), fill = Class)) +
ggridges::geom_density_ridges2(scale = 1, alpha = 0.7, show.legend = FALSE) +
ggtitle("Compare distribution of relative abundances")
# geom_violin() +
# coord_flip()