Load the phyloseq package:
require("phyloseq")
packageVersion("phyloseq")
## [1] '1.4.5'
require(ggplot2)
The custom functions that read external data files and return an instance of the phyloseq-class are called “importers”. Validity and coherency between data components are checked by the phyloseq-class constructor, phyloseq()
which is invoked internally by the importers, and is also the suggested function for creating a phyloseq object from “manually” imported data. The component indices representing OTUs or samples are checked for intersecting indices, and trimmed/reordered such that all available (non-) component data describe exactly the same OTUs and samples, in the same order.
See ?import
after phyloseq has been loaded (library("phyloseq")
), to get an overview of available import functions and documentation links to their specific doc pages, or see below for examples using some of the more popular importers.
We'll show how to use the import_biom function.
Newer versions of QIIME produce a more-comprehensive and formally-defined JSON file format, called biom file format:
“The biom file format is designed to be a general-use format for representing counts of observations in one or more biological samples. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project.”
The phyloseq package includes small examples of biom files with different levels and organization of data. The following shows how to import each of the four main types of biom files (in practice, you don't need to know which type your file is, only that it is a biom file). In addition, the import_biom
function allows you to simultaneously import an associated phylogenetic tree file and reference sequence file (e.g. fasta).
First, define the file paths. In this case, this will be within the phyloseq package, so we use special features of the system.file
command to get the paths. This should also work on your system if you have phyloseq installed, regardless of your Operating System.
rich_dense_biom = system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq")
rich_sparse_biom = system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq")
min_dense_biom = system.file("extdata", "min_dense_otu_table.biom", package="phyloseq")
min_sparse_biom = system.file("extdata", "min_sparse_otu_table.biom", package="phyloseq")
treefilename = system.file("extdata", "biom-tree.phy", package="phyloseq")
refseqfilename = system.file("extdata", "biom-refseq.fasta", package="phyloseq")
Now that we've defined the file paths, let's use these as argument to the import_biom
function. Note that the tree and reference sequence files are both suitable for any of the example biom files, which is why we only need one path for each. In practice, you will be specifying a path to a sequence or tree file that matches the rest of your data (include tree tip names and sequence headers)
import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(rich_sparse_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(min_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(min_sparse_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
Example code for importing large file with parallel backend
library("doParallel")
registerDoParallel(cores=6)
import_biom("my/file/path/file.biom", parseFunction=parse_taxonomy_greengenes, parallel=TRUE)
In practice, you will store the result of your import as some variable name, like myData
, and then use this data object in downstream data manipulations and analysis. For example,
myData = import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
plot_tree(myData, color="Genus", shape="BODY_SITE", size="abundance")
plot_richness(myData, x="BODY_SITE", color="Description")
plot_bar(myData, fill="Genus")
refseq(myData)
## A DNAStringSet instance of length 5
## width seq names
## [1] 334 AACGTAGGTCACAAGCGTTGT...TTCCGTGCCGGAGTTAACAC GG_OTU_1
## [2] 465 TACGTAGGGAGCAAGCGTTAT...CCTTACCAGGGCTTGACATA GG_OTU_2
## [3] 249 TACGTAGGGGGCAAGCGTTAT...GGCTCGAAAGCGTGGGGAGC GG_OTU_3
## [4] 453 TACGTATGGTGCAAGCGTTAT...AAGCAACGCGAAGAACCTTA GG_OTU_4
## [5] 178 AACGTAGGGTGCAAGCGTTGT...GGAATGCGTAGATATCGGGA GG_OTU_5
Load the GlobalPatterns
dataset, included with the phyloseq package.
data(GlobalPatterns)
Components of a phyloseq object, like the OTU Table, can be accessed by special accessor functions, or “accessors'', which return specific information about phylogenetic sequencing data, if present. These accessor functions are available for direct interaction by users and dependent functions/packages.
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
ntaxa(GlobalPatterns)
## [1] 19216
nsamples(GlobalPatterns)
## [1] 26
sample_names(GlobalPatterns)[1:5]
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw"
rank_names(GlobalPatterns)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
sample_variables(GlobalPatterns)
## [1] "X.SampleID" "Primer"
## [3] "Final_Barcode" "Barcode_truncated_plus_T"
## [5] "Barcode_full_length" "SampleType"
## [7] "Description"
otu_table(GlobalPatterns)[1:5, 1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are rows
## CL3 CC1 SV1 M31Fcsw M11Fcsw
## 549322 0 0 0 0 0
## 522457 0 0 0 0 0
## 951 0 0 0 0 0
## 244423 0 0 0 0 0
## 586076 0 0 0 0 0
tax_table(GlobalPatterns)[1:5, 1:4]
## Taxonomy Table: [5 taxa by 4 taxonomic ranks]:
## Kingdom Phylum Class Order
## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA
## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA
## 951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales"
## 244423 "Archaea" "Crenarchaeota" "Sd-NA" NA
## 586076 "Archaea" "Crenarchaeota" "Sd-NA" NA
phy_tree(GlobalPatterns)
##
## Phylogenetic tree with 19216 tips and 19215 internal nodes.
##
## Tip labels:
## 549322, 522457, 951, 244423, 586076, 246140, ...
## Node labels:
## , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
##
## Rooted; includes branch lengths.
taxa_names(GlobalPatterns)[1:10]
## [1] "549322" "522457" "951" "244423" "586076" "246140" "143239"
## [8] "244960" "255340" "144887"
myTaxa = taxa_names(GlobalPatterns)[1:10]
plot(phy_tree(prune_taxa(myTaxa, GlobalPatterns)))
The phyloseqBase package also includes functions for filtering, subsetting, and merging abundance data. Filtering in phyloseq is designed in a modular fashion similar to the approach in the genefilter package. This includes the prune_taxa
and prune_samples
methods for directly removing unwanted indices, as well as the filterfun_sample
and genefilter_sample
functions for building arbitrarily complex sample-wise filtering criteria, and the filter_taxa
function for taxa-wise filtering. In the following example, the GlobalPatterns
data is first transformed to relative abundance, creating the new GPr
object, which is then filtered such that all OTUs with a variance greater than 10-5 are removed. This results in a highly-subsetted object containing just 177 of the original ~19000 OTUs (GPfr
below). Note that in both lines we have provided a custom function for transformation and filtering, respectively.
GPr = transform_sample_counts(GlobalPatterns, function(x) x / sum(x) )
GPfr = filter_taxa(GPr, function(x) var(x) > 1e-5, TRUE)
The subsetting methods prune_taxa
and prune_samples
are for cases where the complete subset of desired OTUs or samples is directly available. Alternatively, the subset_taxa
and subset_samples
functions are for subsetting based on auxiliary data contained in the Taxonomy Table or Sample Data components, respectively. These functions are analogous to the subset
function in core R, in which the initial data argument is followed by an arbitrary logical expression that indicates elements or rows to keep. Thus, entire experiment-level data objects can be subset according to conditional expressions regarding the auxiliary data. For example, the following code will first assign to GP.chl
the subset of the GlobalPatterns dataset that are part of the Chlamydiae phylum, and then remove samples with less than 20 total reads.
GP.chl = subset_taxa(GlobalPatterns, Phylum=="Chlamydiae")
GP.chl = prune_samples(sampleSums(GP.chl)>=20, GP.chl)
Merging methods include merge_taxa
and merge_samples
, intended for merging specific OTUs or samples, respectively. There is also the merge_phyloseq
function for a complete merge of two or more phyloseq-objects (or a phyloseq-object and one or more separate components). For example, the following code merges the first 5 OTUs in the Chlamydiae-only dataset.
GP.chl.merged = merge_taxa(GP.chl, taxa_names(GP.chl)[1:5])
Building on the merge_taxa
methods, the phyloseq-package includes the agglomeration functions, tip_glom
and tax_glom
, for merging all OTUs in an experiment that are similar beyond a phylogenetic or taxonomic threshold, respectively. The following code demonstrates how to agglomerate the "Bacteroidetes-only” dataset (called gpsfb
) at the taxonomic rank of Family, and create an annotated tree of the result.
gpsfb = subset_taxa(GPfr, Phylum == "Bacteroidetes")
gpsfbg = tax_glom(gpsfb, "Family")
plot_tree(gpsfbg, color="SampleType", shape="Class", size="abundance")
For transforming abundance values by an arbitrary R function, phyloseqBase includes the transform_sample_counts
function. It takes as arguments a phyloseq-object and an R function, and returns a phyloseq-object in which the abundance values have been transformed, sample-wise, according to the transformations specified by the function. For example, the following command transforms GP.chl
abundance counts to fractional abundance.
transform_sample_counts(GP.chl, function(OTU) OTU/sum(OTU) )
Finally, the following is the remaining set of preprocessing steps that was applied to the GlobalPatterns OTU counts prior to creating the figures in the main phyloseq manuscript.
Remove taxa not seen more than 3 times in at least 20% of the samples. This protects against an OTU with small mean & trivially large C.V.
GP = filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2*length(x)), TRUE)
Define a human versus non-human categorical variable, and add this new variable to sample data:
sample_data(GP)$human = factor( get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") )
Standardize abundances to the median sequencing depth
total = median(sample_sums(GP))
standf = function(x, t=total) round(t * (x / sum(x)))
gps = transform_sample_counts(GP, standf)
Filter the taxa using a cutoff of 3.0 for the Coefficient of Variation
gpsf = filter_taxa(gps, function(x) sd(x)/mean(x) > 3.0, TRUE)
Subset the data to Bacteroidetes, used in some plots
gpsfb = subset_taxa(gpsf, Phylum=="Bacteroidetes")
Now let's summarize this slice of the data with some graphics.
title = "plot_bar; Bacteroidetes-only"
plot_bar(gpsfb, "SampleType", "Abundance", title=title)
plot_bar(gpsfb, "SampleType", "Abundance", "Family", title=title)
plot_bar(gpsfb, "Family", "Abundance", "Family",
title=title, facet_grid="SampleType~.")
## Warning: Removed 336 rows containing missing values (position_stack).
## Warning: Removed 168 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 168 rows containing missing values (position_stack).
The distance
function takes a phyloseq-class object and method option, and returns a dist
-class distance object suitable for certain ordination methods and other distance-based analyses. There are currently 44 explicitly supported method options in the phyloseq package, as well as user-provided arbitrary methods via an interface to vegan::designdist
. For the complete list of currently supported options/arguments to the method parameter, type distance("list")
in the command-line of your R session. Only sample-wise distances are currently supported (the type argument), but eventually OTU-wise (e.g. species) distances will be supported as well.
See the in-package documentation of distance
for further details:
?distance
Because the distance()
function organizes distance calculations into one function, it is relatively straightforward to calculate all supported distance methods and investigate the results. The following code will perform such a loop on the “Enterotypes” dataset, perform multi-dimensional scaling (a.k.a. principle coordinates analysis), and plot the first two axes, shading and shaping the points in each plot according to sequencing technology and assigned “Enterotype” label.
Note that we have omitted the options that require a phylogenetic tree because the "enterotype"
example dataset currently included in the phyloseq-package does not have one.
Note that this may take a little while to run, depending on the size of your data set, but you may not be interested in all supported distances…
Load the enterotype data
data(enterotype)
Remove the OTUs that included all unassigned sequences ("-1"
)
enterotype <- subset_species(enterotype, Genus != "-1")
The available distance methods coded in distance
dist_methods <- unlist(distance("list"))
print(dist_methods)
## UniFrac DPCoA JSD vegdist1 vegdist2
## "unifrac" "dpcoa" "jsd" "manhattan" "euclidean"
## vegdist3 vegdist4 vegdist5 vegdist6 vegdist7
## "canberra" "bray" "kulczynski" "jaccard" "gower"
## vegdist8 vegdist9 vegdist10 vegdist11 vegdist12
## "altGower" "morisita" "horn" "mountford" "raup"
## vegdist13 vegdist14 vegdist15 betadiver1 betadiver2
## "binomial" "chao" "cao" "w" "-1"
## betadiver3 betadiver4 betadiver5 betadiver6 betadiver7
## "c" "wb" "r" "I" "e"
## betadiver8 betadiver9 betadiver10 betadiver11 betadiver12
## "t" "me" "j" "sor" "m"
## betadiver13 betadiver14 betadiver15 betadiver16 betadiver17
## "-2" "co" "cc" "g" "-3"
## betadiver18 betadiver19 betadiver20 betadiver21 betadiver22
## "l" "19" "hk" "rlb" "sim"
## betadiver23 betadiver24 dist1 dist2 dist3
## "gl" "z" "maximum" "binary" "minkowski"
## designdist
## "ANY"
Remove the two distance-methods that require a tree, and the generic custom method that requires user-defined distance arguments.
# These require tree
dist_methods[(1:2)]
## UniFrac DPCoA
## "unifrac" "dpcoa"
# Remove them from the vector
dist_methods <- dist_methods[-(1:2)]
# This is the user-defined method:
dist_methods["designdist"]
## designdist
## "ANY"
# Remove the user-defined distance
dist_methods = dist_methods[-which(dist_methods=="ANY")]
Loop through each distance method, save each plot to a list, called plist
.
plist <- vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
# Calculate distance matrix
iDist <- distance(enterotype, method=i)
# Calculate ordination
iMDS <- ordinate(enterotype, "MDS", distance=iDist)
## Make plot
# Don't carry over previous plot (if error, p will be blank)
p <- NULL
# Create plot, store as temp variable, p
p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
# Add title to each plot
p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
# Save the graphic to file.
plist[[i]] = p
}
Shade according to sequencing technology
require(plyr)
df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
The following are some selected examples among the created plots.
Jensen-Shannon Divergence
print(plist[["jsd"]])
Jaccard
print(plist[["jaccard"]])
Bray-Curtis
print(plist[["bray"]])
Gower
print(plist[["gower"]])
print(plist[["w"]])
One of the problems in clustering is to figure out how many clusters there are. One way to do this is with the gap statistic. We'll illustrate with the enterotype data.
require(cluster)
exord = ordinate(enterotype, method="MDS", distance="jsd")
Compute the gap statistic with
pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)
## Clustering k = 1,2,..., K.max (= 6): .. done
## Bootstrapping, b = 1,2,..., B (= 50) [one "." per sample]:
## .................................................. 50
gskmn
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
## logW E.logW gap SE.sim
## [1,] 2.712 3.031 0.3197 0.02008
## [2,] 2.328 2.755 0.4276 0.02219
## [3,] 2.063 2.520 0.4568 0.02178
## [4,] 1.823 2.319 0.4959 0.02225
## [5,] 1.732 2.210 0.4772 0.01937
## [6,] 1.566 2.115 0.5493 0.01864
Ordination methods can be a useful tool for exploring complex phylogenetic sequencing data, particularly when the hypothesized structure of the data is poorly defined (or there isn't a hypothesis). The phyloseq package provides some useful tools for performing ordinations and plotting their results, via the ordinate() and plot ordination() functions, respectively. Although there are many options and methods supported, a first-step will probably look something like the following:
my.physeq <- import("Biom", BIOMfilename = "myBiomFile.biom")
my.ord <- ordinate(my.physeq)
plot_ordination(my.physeq, my.ord, color = "myFavoriteVarible")
We will show continue our exploration of the “GlobalPatterns” dataset using various features of an ordination method called Correspondence Analysis. Let's limit the number of species:
data(GlobalPatterns)
# Take a subset of the GP dataset, top 200 species
topsp <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:200])
GP <- prune_species(topsp, GlobalPatterns)
# Subset further to top 5 phyla, among the top 200 OTUs.
top5ph <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), decreasing = TRUE)[1:5]
GP <- subset_species(GP, Phylum %in% names(top5ph))
# Re-add human variable to sample data:
sample_data(GP)$human <- factor( get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") )
First see how the samples behave on the first few CA axes:
gpca <- ordinate(GP, "CCA")
barplot(gpca$CA$eig/sum(gpca$CA$eig), las = 2)
Now the correspondence analysis on axes (1,2) and (3,4)
(p12 <- plot_ordination(GP, gpca, "samples", color = "SampleType") + geom_line() + geom_point(size = 5))
(p34 <- plot_ordination(GP, gpca, "samples", axes = c(3, 4), color = "SampleType") + geom_line() + geom_point(size = 5))
You can now do the homework associated with this lab, which is at https://web.stanford.edu/class/bios221/cgi-bin/index.cgi/.