Hey all,
I'm analysing a 16S microbial community dataset and am using DESeq2 to test for differential abundance. I supply raw counts to DESeq(), following the vignette's rationale that the model fitting assumes raw count data. If I later want to try, e.g., ordination of my samples, I might first apply the rlog transformation or VST to my data.
The PICRUSt package provides inferred gene content for microbial communities, by referencing 16S taxon abundances against known (or extrapolated) genome content, and providing a table of likely microbial gene abundances for your 16S dataset. The accuracy of this prediction is reflected by a Nearest Sequenced Taxon Index (NSTI), with scores below 0.05 being 'good' and above 0.15 being 'undesirable'.
I would like to use DESeq2 to test for differential abundance of PICRUSt-inferred genes/gene pathways, but:
- DESeq2 was intended for RNA expression data, although it is often extended to 16S analysis - what are likely conditions under which DESeq2's suitability for a data type becomes questionable? How broadly suitable is it to large, sparse count datasets?
- DESeq2 takes raw counts - this conflicts with PICRUSt, where the original 16S dataset is normalised by copy number, before the new predicted gene set is calculated. Are there any thoughts on how this should be dealt with?
A 16S copy number correction for DESeq2 has been discussed elsewhere, but I'm not sure that it addresses this issue in the same way.
Thanks!
I don't have definitive answers to these, so I'll just put this as a comment. I've seen in at least two papers that DESeq2 performs reasonably on 16S data. That is to say, there is room for improvement, but its sensitivity-versus-FDR curves are sometimes better than those of methods designed for these data. As to your second question, I don't know exactly what the conflict is with this other tool. Can you say specifically what you'd like DESeq2 to do here?
Thanks for the response, and sorry for not being clearer. I would like to use DESeq2 to test for differential abundance of predicted gene content between samples, much as one would test for differential 16S abundance. My concern is that methods like PICRUSt transform the data (16S copy number normalisation, then gene inference) beyond the point where DESeq2 is an appropriate analysis, not least because DESeq2 expects raw abundances, but I don't know how to examine this.
To rephrase/repeat, how broadly applicable is DESeq2 to examining large, sparse datasets? I appreciate there might not be hard answers to this.
Not knowing anything about PICRUSt, I can't recommend or suggest how it could be used in conjunction with DESeq2. Yes, DESeq2 requires raw counts; it then uses size factors (one per sample) or normalization factors (one per gene and sample) to account for systematic increases or decreases in the counts. Users have, for example, put copy number information into the normalization factors to find changes in gene expression not explained by copy number.
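To make the normalization-factor idea concrete, here is a minimal sketch in plain Python (not R, so that it runs without DESeq2 installed). It builds a gene-by-sample matrix of factors from copy numbers and per-sample size factors, then centres each row to a geometric mean of 1, which as far as I recall is what the DESeq2 vignette suggests for user-supplied normalization factors. The function names and the toy numbers are hypothetical, and whether a per-feature copy number is even meaningful for PICRUSt output is an assumption I can't vouch for.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalization_factors(copy_number, size_factors):
    """Build a gene-by-sample matrix of normalization factors.

    copy_number:  list of rows (one per gene/feature), each a list of
                  strictly positive copy numbers, one per sample.
    size_factors: one strictly positive library-size factor per sample.

    Each row is divided by its own geometric mean, so the factors only
    describe relative differences between samples for that feature.
    """
    factors = []
    for row in copy_number:
        raw = [cn * sf for cn, sf in zip(row, size_factors)]
        gm = geometric_mean(raw)
        factors.append([r / gm for r in raw])
    return factors


# Hypothetical toy data: 2 features x 3 samples.
copy_number = [
    [1.0, 2.0, 4.0],  # feature whose copy number differs between samples
    [2.0, 2.0, 2.0],  # feature with constant copy number
]
size_factors = [1.0, 1.0, 1.0]  # equal library sizes, for simplicity

nf = normalization_factors(copy_number, size_factors)
```

The raw counts themselves stay untouched. In R, a matrix like `nf` would be handed to DESeq2 through `normalizationFactors(dds) <- ...` before calling `DESeq()`, so the copy number information enters the model as an offset instead of being divided out of the counts beforehand.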
Re: how broadly applicable, I don't have much more information than the above comment for now.