%>%)Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.
library(phyloseq)
filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package="phyloseq")
kostic = microbio_me_qiime(filepath)
## Found biom-format file, now parsing it...
## Done parsing biom...
## Importing Sample Metdadata from mapping file...
## Merging the imported objects...
## Successfully merged, phyloseq-class created.
## Returning...
kostic
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2505 taxa and 190 samples ]
## sample_data() Sample Data: [ 190 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 2505 taxa by 7 taxonomic ranks ]
head(sample_data(kostic)$DIAGNOSIS, 10)
## [1] Healthy Tumor Tumor Healthy Healthy Healthy Tumor Healthy Healthy
## [10] Healthy
## Levels: Healthy None Tumor
First remove the 5 samples that had no DIAGNOSIS attribute assigned. These introduce a spurious third design class that is actually a rare artifact in the dataset. Also remove samples with less than 500 reads (counts). Note that this kind of data cleanup is useful, necessary, and should be well-documented because it can also be dangerous to alter or omit data without clear documentation. In this case I actually explored the data first, and am omitting some of the details (and explanatory plots) here for clarity.
kostic <- subset_samples(kostic, DIAGNOSIS != "None")
kostic <- prune_samples(sample_sums(kostic) > 500, kostic)
kostic
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2505 taxa and 177 samples ]
## sample_data() Sample Data: [ 177 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 2505 taxa by 7 taxonomic ranks ]
sample_sums(kostic)
## C0333.N.518126 C0333.T.518046 38U4VAHB.518100 XZ33PN7O.518030
## 5651 1286 6546 4572
## GQ6LSNI6.518106 C0270.N.518041 HZIMMAM6.518095 LRQ9WN6C.518048
## 8714 8018 9327 5833
## A8A34NCW.518023 C0332.N.518027 C10.S.517995 MQE9ONGV.518038
## 6971 1833 4536 2666
## C0230.N.517992 C0256.N.518170 HZIMMNL5.518062 GQ6LSABJ.518010
## 1018 8641 6394 6250
## C0282.N.518138 TV28INUO.518004 C0315.N.518021 C0235.N.517993
## 1672 4250 7439 1513
## 5TA9VN6K.518161 A8A34AP9.518086 UEL2GAPI.518173 O436FAO8.518097
## 8934 7214 8341 7389
## C0378.N.518158 41E1KNBP.517985 C0378.T.518104 R8J9ZNOW.518040
## 7613 4798 9584 7524
## R8J9ZAGC.518059 5TA9VAT9.518088 OTGGZAZO.518108 C0269.N.518130
## 10973 6269 7846 7771
## C0271.T.518078 C0318.N.518105 C0159.T.518087 C0318.T.518140
## 1872 5341 4311 7853
## O436FNP3.518007 C0209.N.517984 C0240.N.518020 C0306.N.518003
## 1041 5664 6017 6253
## MQETMNL4.518037 C0209.T.518128 C0112.T.518024 C0355.T.518157
## 4103 3075 617 3874
## C0322.N.518165 6G2KBNS6.518045 C0240.T.518052 C0268.T.518114
## 2410 6963 3952 1388
## C0285.T.518044 41E1KAMA.517991 QFHRSAG2.518005 I7ROLN9P.518079
## 4169 7016 5150 5630
## QFHRSNIH.518116 C0271.N.518022 KIXFRARY.518160 C0133.N.518051
## 6937 6316 5692 3594
## C0268.N.518089 OTGGZN5Y.518063 C0134.N.518072 C0395.N.518054
## 2778 5324 5903 3874
## UZ65XAS5.518119 C0311.T.518131 5EKFOAO4.518164 C0294.T.518134
## 7817 8921 5864 2594
## C0388.N.518136 C0186.N.518143 38U4VNBW.518090 C0334.T.518147
## 7084 3523 4320 4759
## JIDZEN4J.518073 JIDZEAPD.518149 LRQ9WAHA.518036 UZ65XN27.518028
## 3016 6192 3868 6081
## C0159.N.518042 C0306.T.518122 C0394.N.518031 C0342.N.518068
## 5583 1451 3371 8459
## 82S3MAZ4.518145 C4.T.518118 C0398.N.518152 C0388.T.518015
## 6000 5191 4906 3504
## C0285.N.518084 C0308.N.518154 C0355.N.518077 C0308.T.518162
## 8164 4774 2306 1965
## C0211.T.518096 C0206.N.518065 C0149.N.518039 C4.S.518169
## 4140 6837 6303 5118
## C0374.N.518099 C0362.N.518069 C0211.N.518117 G3UBQNDP.518121
## 8580 1911 6653 5902
## C0341.N.518101 C1.T.518120 C0399.T.518153 C0294.N.517987
## 4005 1361 2115 2454
## C0186.T.518019 G2OD5N55.518146 YOTV6ASH.518142 C0258.T.517997
## 4415 5018 936 11187
## C0230.T.518144 C0315.T.518034 C0349.N.517989 C0212.N.518167
## 2032 1909 6014 5222
## YOTV6NC8.518156 C0112.N.518001 C0335.N.518107 C0344.N.518103
## 5711 2374 8019 7234
## 82S3MNBY.518050 5EKFONB3.518125 C0225.N.518047 C6.S.518082
## 6896 7131 3957 2034
## C0241.N.518043 C1.S.518137 XBS5MNEH.518074 C0195.N.518110
## 6053 2556 3135 7847
## C0322.T.518018 C0342.T.518093 C0198.T.518166 C0395.T.518075
## 868 865 2147 2581
## G3UBQAJU.518060 C0335.T.518115 C0349.T.518111 C0149.T.518132
## 6430 3228 5434 4504
## C0311.N.518124 C0154.N.518002 C0258.N.518014 C0371.N.518009
## 4507 6152 5458 7256
## C0330.N.518148 C0344.T.518029 C0206.T.518053 C0371.T.518139
## 1207 2330 2717 6757
## C0394.T.518064 C0332.T.518017 G2OD5ABL.518057 C6.T.518061
## 1440 2322 6628 2077
## C0256.T.518080 C0095.T.518141 C0252.T.518006 C0366.T.518172
## 1953 1992 2354 7770
## C0241.T.518067 C0198.N.518081 C0330.T.518035 C0325.N.518109
## 2738 3532 1925 3966
## C0225.T.518033 C0154.T.518091 C0366.N.518094 C0270.T.518163
## 5478 1477 5503 3560
## MQETMAZC.518171 C0095.N.518123 C0341.T.518113 C0252.N.518016
## 2074 2642 4233 5628
## C0325.T.518083 C0212.T.518155 C0399.N.518011 C0334.N.518066
## 4722 524 9734 2285
## XZ33PA3O.518135 KIXFRNL2.518129 C0374.T.518127 C0275.N.518076
## 3206 6973 926 5875
## 59S9WAIH.518013 C0314.T.518025 6G2KBASQ.517999 C0133.T.518133
## 3420 3099 3098 2694
## UEL2GN92.518058 59S9WNC6.518055 C0235.T.518049 C0362.T.518026
## 4917 4724 4566 4063
## TV28IANZ.518070 C10.T.518056 C0275.T.518032 MQE9OAS7.518008
## 2606 812 6317 4263
## 32I9UAPQ.518112 UTWNWANU.518168 UTWNWN3P.518012 BFJMKAKB.518159
## 7359 6669 7122 2283
## 32I9UNA9.518098
## 3207
summary(sample_sums(kostic))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 524 2606 4722 4726 6430 11187
%>%)library(dplyr)
sample_sums(kostic) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 524 2606 4722 4726 6430 11187
sum(otu_table(kostic)>0)
## [1] 34776
sum(otu_table(kostic)>3)
## [1] 14773
PA = data.frame(otu_table(kostic))
PA[PA<3] = 0
PA[PA>2] = 1
apply(PA,2,sum)
## C0333.N.518126 C0333.T.518046 X38U4VAHB.518100 XZ33PN7O.518030
## 100 70 119 97
## GQ6LSNI6.518106 C0270.N.518041 HZIMMAM6.518095 LRQ9WN6C.518048
## 143 139 115 102
## A8A34NCW.518023 C0332.N.518027 C10.S.517995 MQE9ONGV.518038
## 108 100 111 68
## C0230.N.517992 C0256.N.518170 HZIMMNL5.518062 GQ6LSABJ.518010
## 83 124 103 130
## C0282.N.518138 TV28INUO.518004 C0315.N.518021 C0235.N.517993
## 92 77 113 70
## X5TA9VN6K.518161 A8A34AP9.518086 UEL2GAPI.518173 O436FAO8.518097
## 128 105 124 25
## C0378.N.518158 X41E1KNBP.517985 C0378.T.518104 R8J9ZNOW.518040
## 210 75 106 98
## R8J9ZAGC.518059 X5TA9VAT9.518088 OTGGZAZO.518108 C0269.N.518130
## 99 90 107 194
## C0271.T.518078 C0318.N.518105 C0159.T.518087 C0318.T.518140
## 124 148 88 68
## O436FNP3.518007 C0209.N.517984 C0240.N.518020 C0306.N.518003
## 31 89 138 147
## MQETMNL4.518037 C0209.T.518128 C0112.T.518024 C0355.T.518157
## 85 74 41 89
## C0322.N.518165 X6G2KBNS6.518045 C0240.T.518052 C0268.T.518114
## 111 160 134 68
## C0285.T.518044 X41E1KAMA.517991 QFHRSAG2.518005 I7ROLN9P.518079
## 106 64 129 59
## QFHRSNIH.518116 C0271.N.518022 KIXFRARY.518160 C0133.N.518051
## 141 194 201 115
## C0268.N.518089 OTGGZN5Y.518063 C0134.N.518072 C0395.N.518054
## 134 148 196 104
## UZ65XAS5.518119 C0311.T.518131 X5EKFOAO4.518164 C0294.T.518134
## 123 84 111 116
## C0388.N.518136 C0186.N.518143 X38U4VNBW.518090 C0334.T.518147
## 141 112 96 108
## JIDZEN4J.518073 JIDZEAPD.518149 LRQ9WAHA.518036 UZ65XN27.518028
## 130 121 177 189
## C0159.N.518042 C0306.T.518122 C0394.N.518031 C0342.N.518068
## 117 63 114 197
## X82S3MAZ4.518145 C4.T.518118 C0398.N.518152 C0388.T.518015
## 90 97 135 111
## C0285.N.518084 C0308.N.518154 C0355.N.518077 C0308.T.518162
## 120 163 103 100
## C0211.T.518096 C0206.N.518065 C0149.N.518039 C4.S.518169
## 129 81 114 150
## C0374.N.518099 C0362.N.518069 C0211.N.518117 G3UBQNDP.518121
## 158 96 152 96
## C0341.N.518101 C1.T.518120 C0399.T.518153 C0294.N.517987
## 148 51 105 147
## C0186.T.518019 G2OD5N55.518146 YOTV6ASH.518142 C0258.T.517997
## 105 79 39 166
## C0230.T.518144 C0315.T.518034 C0349.N.517989 C0212.N.518167
## 92 52 126 137
## YOTV6NC8.518156 C0112.N.518001 C0335.N.518107 C0344.N.518103
## 90 97 143 123
## X82S3MNBY.518050 X5EKFONB3.518125 C0225.N.518047 C6.S.518082
## 108 117 79 69
## C0241.N.518043 C1.S.518137 XBS5MNEH.518074 C0195.N.518110
## 112 75 16 146
## C0322.T.518018 C0342.T.518093 C0198.T.518166 C0395.T.518075
## 57 58 64 73
## G3UBQAJU.518060 C0335.T.518115 C0349.T.518111 C0149.T.518132
## 60 102 94 122
## C0311.N.518124 C0154.N.518002 C0258.N.518014 C0371.N.518009
## 77 120 134 117
## C0330.N.518148 C0344.T.518029 C0206.T.518053 C0371.T.518139
## 62 74 34 117
## C0394.T.518064 C0332.T.518017 G2OD5ABL.518057 C6.T.518061
## 42 52 102 64
## C0256.T.518080 C0095.T.518141 C0252.T.518006 C0366.T.518172
## 60 80 58 162
## C0241.T.518067 C0198.N.518081 C0330.T.518035 C0325.N.518109
## 79 96 86 106
## C0225.T.518033 C0154.T.518091 C0366.N.518094 C0270.T.518163
## 65 51 144 71
## MQETMAZC.518171 C0095.N.518123 C0341.T.518113 C0252.N.518016
## 51 77 41 193
## C0325.T.518083 C0212.T.518155 C0399.N.518011 C0334.N.518066
## 84 47 180 97
## XZ33PA3O.518135 KIXFRNL2.518129 C0374.T.518127 C0275.N.518076
## 62 78 53 103
## X59S9WAIH.518013 C0314.T.518025 X6G2KBASQ.517999 C0133.T.518133
## 63 28 117 64
## UEL2GN92.518058 X59S9WNC6.518055 C0235.T.518049 C0362.T.518026
## 95 70 37 34
## TV28IANZ.518070 C10.T.518056 C0275.T.518032 MQE9OAS7.518008
## 62 41 55 43
## X32I9UAPQ.518112 UTWNWANU.518168 UTWNWN3P.518012 BFJMKAKB.518159
## 41 47 35 25
## X32I9UNA9.518098
## 40
apply(PA,2,sum) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 68.00 99.00 99.28 123.00 210.00
If we actually want to have a phyloseq object and have a new otu_table with only presence absence, we do the following.
# joey version
kMat <- kostic %>% otu_table %>% as("matrix")
PA <- ifelse(kMat > 2, 1, 0) %>% otu_table(taxa_are_rows = TRUE)
What is the highest rank in the following vector?
tt = sample(100,19)
tt
## [1] 57 68 10 56 2 49 1 87 47 27 37 86 69 3 11 25 59 36 82
rank(tt)
## [1] 13 15 4 12 2 11 1 19 10 7 9 18 16 3 5 6 14 8 17
abund <- otu_table(kostic)
abund[1:5,1:8]
## OTU Table: [5 taxa and 8 samples]
## taxa are rows
## C0333.N.518126 C0333.T.518046 38U4VAHB.518100 XZ33PN7O.518030
## 304309 40 4 1 2
## 469478 0 0 0 0
## 208196 0 0 0 0
## 358030 0 0 0 0
## 16076 271 28 110 7
## GQ6LSNI6.518106 C0270.N.518041 HZIMMAM6.518095 LRQ9WN6C.518048
## 304309 30 1 4 8
## 469478 0 0 0 0
## 208196 0 0 0 0
## 358030 0 0 0 0
## 16076 34 0 0 0
Kabund_ranks <- apply(abund, 2, rank)
Kabund_ranks[1:5,1:8]
## C0333.N.518126 C0333.T.518046 38U4VAHB.518100 XZ33PN7O.518030
## 304309 2477.0 2452.0 2326 2398.5
## 469478 1146.5 1183.5 1145 1164.5
## 208196 1146.5 1183.5 1145 1164.5
## 358030 1146.5 1183.5 1145 1164.5
## 16076 2499.0 2494.0 2482 2444.5
## GQ6LSNI6.518106 C0270.N.518041 HZIMMAM6.518095 LRQ9WN6C.518048
## 304309 2459.5 2272.5 2408.5 2449
## 469478 1093.5 1110.0 1134.5 1155
## 208196 1093.5 1110.0 1134.5 1155
## 358030 1093.5 1110.0 1134.5 1155
## 16076 2467.0 1110.0 1134.5 1155
Kabund_ranks <-Kabund_ranks - 2000
Kabund_ranks[Kabund_ranks < 1] <- 1
Kabund_ranks[1:5,1:8]
## C0333.N.518126 C0333.T.518046 38U4VAHB.518100 XZ33PN7O.518030
## 304309 477 452 326 398.5
## 469478 1 1 1 1.0
## 208196 1 1 1 1.0
## 358030 1 1 1 1.0
## 16076 499 494 482 444.5
## GQ6LSNI6.518106 C0270.N.518041 HZIMMAM6.518095 LRQ9WN6C.518048
## 304309 459.5 272.5 408.5 449
## 469478 1.0 1.0 1.0 1
## 208196 1.0 1.0 1.0 1
## 358030 1.0 1.0 1.0 1
## 16076 467.0 1.0 1.0 1
Plug the rankings to replace the original abundances in the phyloseq object:
kostica <-kostic
otu_table(kostica) <- otu_table(Kabund_ranks,taxa_are_rows=TRUE)
tab2 <- Kabund_ranks[c(2008,2315,1886, 733, 816, 1481),]
diagnosis=as.factor(sample_data(kostic)$DIAGNOSIS)
x <- tab2[1,]
wilcox.test(x ~ diagnosis,
alternative ="two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: x by diagnosis
## W = 2625.5, p-value = 0.0001506
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(tab2[4,] ~ diagnosis,
alternative ="two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: tab2[4, ] by diagnosis
## W = 3956, p-value = 0.3366
## alternative hypothesis: true location shift is not equal to 0
Multiple testing correction can be done using the multtest Bioconductor
package:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("multtest")
See the complete analysis of the kostic data with deseq2 and phyloseq here: deseq2.
The paper on variance stabilization for microbiome data is:
, link to paper
Waste not, Want not
(McMurdie and Holmes 2014)
video
There are many variance stabilization methods that are beneficial for different data, see our book with Wolfgang Huber: Modern Statistics for Modern Biology.
Hypothesis testing can identify individual bacteria whose abundance relates to sample variables of interest.
A standard approach is very similar to the approach we already visited in lecture 7 .
To integrate this information, we proposed a hierarchical testing procedure; where taxonomic groups are only tested if higher levels are found to be be associated.
In the case where many related species have a slight signal, this pooling of information can increase power.
We apply this method to test the association between microbial abundance and age.
Start by using the normalization protocols we discussed in lecture 7
following DESeq2 for RNA-seq data and the Waste Not, Want Not paper for 16S rRNA
generated count data and available in the DESeq2 package:
library("DESeq2")
library("phyloseq")
ps1 = readRDS("../../data/ps1.rds")
ps_dds = phyloseq_to_deseq2(ps1, ~ ageBin + family_relationship)
gm_mean = function(x, na.rm = TRUE){
exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}
geoMeans = apply(counts(ps_dds), 1, gm_mean)
ps_dds = estimateSizeFactors(ps_dds, geoMeans = geoMeans)
ps_dds = estimateDispersions(ps_dds)
abund = getVarianceStabilizedData(ps_dds)
We use the structSSI package to perform the hierarchical testing (Sankaran and Holmes 2014).
library("phyloseq")
ps1 = readRDS("../../data/psmice.rds")
ps1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 389 taxa and 344 samples ]
## sample_data() Sample Data: [ 344 samples by 16 sample variables ]
## tax_table() Taxonomy Table: [ 389 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 389 tips and 387 internal nodes ]
abund=readRDS("../../data/abund.rds")
dim(abund)
## [1] 389 344
Procedure needs univariate tests for each higher-level taxonomic group, not just every taxa.
library("structSSI")
el = phy_tree(ps1)$edge
el0 = el
el0 = el0[rev(seq_len(nrow(el))), ]
el_names = c(rownames(abund), seq_len(phy_tree(ps1)$Nnode))
el[, 1] = el_names[el0[, 1]]
el[, 2] = el_names[as.numeric(el0[, 2])]
unadj_p = treePValues(el, abund, sample_data(ps1)$ageBin)
We can now do our FDR calculations using the hierarchical testing procedure. The test results are guaranteed to control several variants of FDR, but at different levels; we defer details to (Benjamini and Yekutieli 2003; Benjamini and Bogomolov 2014; Sankaran and Holmes 2014).
Interactive plotting command that will open a browser window (take away the comment character.)
hfdr_res = hFDR.adjust(unadj_p, el, 0.75)
summary(hfdr_res)
#plot(hfdr_res, height = 5000) # opens in a browser
tax = tax_table(ps1)[, c("Family", "Genus")] %>% data.frame()
tax$seq = rownames(abund)
hfdr_res@p.vals$seq = rownames(hfdr_res@p.vals)
tax %>% left_join(hfdr_res@p.vals[,-3]) %>%
arrange(adjp) %>% head(10)
## Family Genus seq unadjp adjp
## 1 Lachnospiraceae <NA> GCAAG.96 1.673295e-82 3.346591e-82
## 2 Erysipelotrichaceae <NA> GCGAG.46 1.134371e-79 2.268741e-79
## 3 Lachnospiraceae Roseburia GCAAG.71 3.078334e-75 6.156668e-75
## 4 Lachnospiraceae Clostridium_XlVa GCAAG.190 8.173451e-59 1.634690e-58
## 5 Lachnospiraceae <NA> GCAAG.254 3.227107e-50 6.454215e-50
## 6 Lachnospiraceae Clostridium_XlVa GCAAG.150 1.056944e-49 2.113888e-49
## 7 Lachnospiraceae <NA> GCAAG.30 9.057568e-49 1.811514e-48
## 8 Erysipelotrichaceae Turicibacter GCAAG.4 2.896917e-48 5.793834e-48
## 9 Ruminococcaceae Ruminococcus GCGAG.78 1.978950e-46 3.957900e-46
## 10 Lachnospiraceae Clostridium_XlVa GCAAG.170 6.107952e-43 1.221590e-42
It seems that the most strongly associated bacteria all belong to family Lachnospiraceae.
Benjamini, Yoav, and Marina Bogomolov. 2014. “Selective Inference on Multiple Families of Hypotheses.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1). Wiley Online Library: 297–318.
Benjamini, Yoav, and Daniel Yekutieli. 2003. “Hierarchical Fdr Testing of Trees of Hypotheses.” Technical report, Department of Statistics; Operations Research, Tel Aviv University.
McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4). Public Library of Science: e1003531.
Sankaran, Kris, and Susan Holmes. 2014. “StructSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data.” Journal of Statistical Software 59 (1): 1–21.