Can someone enlighten me as to the justification for summing counts across technical replicates in DESeq2, especially with respect to the collapseReplicates() function?
I would have thought that statistically the correct thing to do would be to add a column to the design matrix to account for technical replicates and include all samples. This effectively doubles (or multiplies by N, for N technical replicates) the number of samples you have, which obviously "increases" the power and so affects the all-important p-values. On the other hand you are assuming biologically-identical replicates constitute independent samples, but I can't see how you would adjust for large batch effects any other way.
The second argument to collapseReplicates is the factor that you want to collapse on. Here you gave it mock/zika and it collapsed to two samples, one per group. You want to instead collapse based on donor.
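For example, a minimal sketch of that collapsing step (assuming the DESeqDataSet has colData columns named donor and run; adjust the names to your object):

# groupby identifies the biological sample whose technical replicates should be
# summed (here, donor, as suggested above); run records which runs were combined
ddsColl <- collapseReplicates(dds, groupby = dds$donor, run = dds$run)
colData(ddsColl)  # each donor now appears once, with counts summed across its runs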
This isn't really an "answer" to the first question. On the support forum, the form at the bottom is "Add your answer" which is really supposed to be used by people who are answering the post at the top.
When you do differential expression across samples, the kind of variability you need to estimate is the variability across biological replicates. So you don't get a gain in power, because treating technical replicates as extra samples doesn't help you estimate the variability that goes into a test of differential expression across conditions.
Technical replicate variability is small compared to biological replicate variability, and the former is well approximated by a Poisson distribution for the large majority of genes (I've looked into SEQC technical replicates and confirmed this to myself recently). Since the technical replicates of a biological replicate aren't helping you estimate variability across biological replicates at all, it's best to simply add them together, increasing the sequencing depth of that individual biological replicate. Increasing sequencing depth increases power for differential expression, as does increasing the number of biological replicates.
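As a rough sketch of how you might check that Poisson behavior yourself (not the SEQC analysis itself, just an illustration assuming a matrix techreps whose columns are technical replicates of one library):

# under Poisson variability the per-gene variance should track the per-gene mean
m <- rowMeans(techreps)
v <- apply(techreps, 1, var)
plot(m + 1, v + 1, log = "xy", xlab = "mean + 1", ylab = "variance + 1")
abline(0, 1)  # points falling near this line are consistent with Poisson noise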
"On the other hand you are assuming biologically-identical replicates constitute independent samples, but I can't see how you would adjust for large batch effects any other way."
I don't follow this last part. Can you add a comment to my post that explains this question more?
Thanks for your answer. The last part relates to an RNA-seq dataset I'm currently working on where the batch effect dominates (i.e. technical replicate variability is large; it sadly explains ~80% of the variance in the data...). So the follow-up questions would be: (1) what's the best practice for dealing with dominating technical effects, and (2) what's the point of doing technical replicates if we throw away that information by summing over counts?
In my answer above, a technical replicate is when you produce more sequences from the same library, and I wouldn't expect much variation above Poisson.
The point of summing is that you increase the sequencing depth for that sample, which improves power by allowing more precise measurement of gene expression, and it increases the set of genes that have at least minimal read counts.
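A toy illustration with made-up numbers:

# a gene with 6 and 7 reads in two runs of the same library has 13 reads after
# summing, so it clears a minimal-count threshold (say 10) that neither run
# clears on its own
runs <- c(run1 = 6, run2 = 7)
sum(runs) >= 10  # TRUE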
If you prepare a new library, I wouldn't refer to this as a technical replicate.
Regarding what to do about batches, the recommended approach is to add a term to the design that accounts for this sample dependence, e.g. ~ batch + condition. This typically improves power if there are batches.
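A minimal sketch of that design (assuming colData columns named batch and condition; the object names are illustrative):

# the batch term absorbs batch-to-batch differences so they don't inflate the
# apparent within-condition variability
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds)  # by default tests the last variable in the design, condition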
Okay, I think I was just confused by the terms here and thought that technical replicate and batch (i.e. independent library prep) were equivalent. It entirely makes sense that if you do different sequencing runs of the same library, you should just sum the counts.
And more generally, what would be the consequences of NOT collapsing technical replicates?
Not collapsing replicates is not appropriate. A simple way to describe this: failing to collapse technical replicates and providing these to a DE method is "pretending" you have more independent samples than you really do. You can think of a technical replicate as just more reads from the same library of cDNA. You could take a library and split it in two, again and again, and make many technical replicates. None of these would contain any biological variability, because they all come from a single, static library of molecules.
So you can think of an idealized experiment where you have, say, 2 vs 2 biological replicates, which is very under-powered to find any significant differences in expression. But if you make many technical replicates from these by splitting the reads, and pretend these are independent samples, the DE method will think you have very low within-group biological variability and will tend to report many genes as DE. It will greatly increase your FPR for the "truly null" genes.
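A toy sketch of that library-splitting point (simulated numbers, purely for illustration):

# reads per gene from one cDNA library
set.seed(1)
lib <- rpois(1000, lambda = 50)
# split the reads 50/50 into two "technical replicates"
tech1 <- rbinom(1000, size = lib, prob = 0.5)
tech2 <- lib - tech1
# tech1 and tech2 differ only by sampling noise; no biological variability was
# added, yet a DE method treating them as independent samples would count them
# as extra replicates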