Question

Using DESeq2 after agglomerating OTUs

0

Entering edit mode

brandilyn.peters • 0

@brandilynpeters-9288

Last seen 8.4 years ago

United States

Hi,

Preface: this is my first post on this site so please excuse me if I don't know the proper posting protocols.

I am using the DESeq function in DESeq2 to look at differential abundance of OTUs between a case and control group (N~600 human subjects). I would also like to use this function to look at differential abundance of agglomerated counts at the phylum through genus levels. I haven't been able to find any posts about this, so my questions are:

1) Is it appropriate to use the DESeq function on raw agglomerated counts?

2) Should size factors be estimated only at the OTU level, and then should these same size factors be applied to the agglomerated datasets? Or should size factors be estimated at each level separately?

3) Another question unrelated to agglomeration: I have a very sparse OTU table (lots of zeros, ~260,000 OTUs), so I have been using this solution from a phyloseq vignette to estimate size factors:

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }

geoMeans = apply(counts(diagdds), 1, gm_mean)

diagdds = estimateSizeFactors(diagdds, geoMeans = geoMeans)

However when I do this with my data, the large majority of the size factors end up being 1, with a few size factors of 0.99.

When I filter the OTU table to include only OTUs with at least 2 counts in at least 30 people (now ~3,000 OTUs), and then use the above code, the size factors appear to be calculated normally. Any idea why the size factor calculation fails with the larger OTU table?

Thanks in advance for your help!

Here is my session info:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.3 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] DESeq2_1.10.0 RcppArmadillo_0.6.200.2.0
[3] Rcpp_0.12.1 SummarizedExperiment_1.0.1
[5] Biobase_2.30.0 GenomicRanges_1.22.1
[7] GenomeInfoDb_1.6.1 IRanges_2.4.1
[9] S4Vectors_0.8.2 BiocGenerics_0.16.1
[11] phyloseq_1.14.0

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 ape_3.3 lattice_0.20-33
[4] Biostrings_2.38.0 digest_0.6.8 foreach_1.4.3
[7] biom_0.3.12 plyr_1.8.3 chron_2.3-47
[10] futile.options_1.0.0 acepack_1.3-3.3 RSQLite_1.0.0
[13] ggplot2_1.0.1 zlibbioc_1.16.0 data.table_1.9.6
[16] annotate_1.48.0 vegan_2.3-1 rpart_4.1-10
[19] Matrix_1.2-2 proto_0.3-10 splines_3.2.0
[22] BiocParallel_1.4.0 geneplotter_1.48.0 stringr_1.0.0
[25] foreign_0.8-66 igraph_1.0.1 munsell_0.4.2
[28] multtest_2.26.0 mgcv_1.8-9 nnet_7.3-11
[31] gridExtra_2.0.0 Hmisc_3.17-0 codetools_0.2-14
[34] XML_3.98-1.3 permute_0.8-4 MASS_7.3-45
[37] grid_3.2.0 nlme_3.1-122 xtable_1.7-4
[40] gtable_0.1.2 DBI_0.3.1 magrittr_1.5
[43] scales_0.3.0 stringi_1.0-1 XVector_0.10.0
[46] reshape2_1.4.1 genefilter_1.52.0 latticeExtra_0.6-26
[49] futile.logger_1.4.1 Formula_1.2-1 lambda.r_1.1.7
[52] RColorBrewer_1.1-2 iterators_1.0.8 tools_3.2.0
[55] RJSONIO_1.3-0 ade4_1.7-2 survival_2.38-3
[58] AnnotationDbi_1.32.0 colorspace_1.2-6 cluster_2.0.3

phyloseq deseq2 estimatesizefactors • 2.1k views

ADD COMMENT • link updated 9.3 years ago by Michael Love 43k • written 9.3 years ago by brandilyn.peters • 0

score 2 · Accepted Answer · 2015-11-30

I'm not an expert in metagenomic analysis, so I'll hold off from answering (1).

(2) You should estimate the size factors for the same count matrix you are analyzing.

(3) I'm not sure what the optimal approach for size factor correction is when the matrices become very sparse (e.g. all rows having a majority of zeros), as I don't analyze such datasets myself. I would consider using upper quantiles, although you would have to choose what quantile to use. This paper discusses some methods:

http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html