We have a bacterial mock community consisting of 22 bacterial taxa. The mock community exists in 2 basic versions - one in which each taxon is (supposed to be) equally abundant, and one in which one of the species has inflated abundance (at several different levels).
We would like to use the mock community datasets to test our wet lab and bioinformatics pipeline. Our idea was to construct the OTU abundance table (this is, IMHO, the equivalent of the gene counts in RNA-Seq) and to use edgeR to perform an exact test or to do the equivalent analysis in Deseq2. When the comparison is made between samples which contain equally distributed taxons we expect that no taxon would be flagged as being differentially expressed. In this way we hope to verify the precision of our pipeline. In a next series of tests we plan to compare the equally distributed samples to the ones containing inflated abundance for one of the taxons. In this way we hope to establish the sensitivity threshold for our pipeline.
Question 1: Does such an experimental setup make sense methodologically? Please feel free to offer suggestions.
Question 2 (related): Are Deseq2/edgeR suitable to test low diversity samples?
I was told by someone with a statistical background that neither edgeR nor Deseq2 are suitable to perform the differential abundance test in our experimental design. The explanation was something along the lines that: "Neither Deseq2, nor edgeR are appropriate for extremely low diversity microbial communities like the above mock communities because they require at least a conditional independence of each feature. This is not the case where you have just 20 or so different features (OTUs, genes, etc.). Changes in the true proportion of any of these features 'causes' changes in the proportion of the others, even if there is no biologically meaningful change."
This comment leaves us in a methodological limbo. Shall we go for massive expansion of our mock community in order to simulate "conditional independence"? And, if so, how many shall we include - 100, 500, 1000?
Isn't there still a chance to make meaningful inference based on the mock community which we already have?
Yes, the mock community is a sort of spike-in. We were thinking of adding it to each run but we don't know what probabilistic framework to incorporate it. If I remember correctly, there were control probes on the microarrays which could be used for normalisation, We were thinking of using the mock community in the same way.
Back to the lack of conditional independence - the thing is the same problem exists, in principle, with RNA-Seq, too. But it is moderated by the fact that we have tens of thousands of genes. Here we have just 22 "genes". What I would like to know is at what level we shall consider the "non-independence" issue to be a non-issue - do we need a mock community of 100, 500 or 1000? What level is sufficient?
Well, if the variability of your counts is low enough, then ~20 species may well be sufficient. If your counts are more variable, then you'll need more features to estimate the normalization factor precisely. It all depends on your data.
The problem is that there is a technical variability between the mock replicates.
ps. apologies for posting multiple times the same comment - somehow the submit button did not seem to respond.
Well, there's always going to be variability. The question is whether that variability is too large for effective normalization. If we only had to consider technical variability (i.e., Poisson variance from repeated sequencing runs), there wouldn't be a problem because that's quite small. However, because you're comparing between different communities, you need to be more concerned with the biological variability. This could be quite large, depending on how reproducible your procedures are for constructing the community and its corresponding sequencing library.
Look, long story short; just do it and see whether the 21 non-differential species have consistent M-values between your two communities. If they do, then that's good enough. If not, then you'll have to reconsider the community's structure.