Further clarification on when not to use duplicateCorrelation with technical replicates (RNA-seq)
paul.alto

After reading the limma manual and paper and several posts about using duplicateCorrelation with technical replicates mixed with biological replicates, I am still unsure when to use it (and when not to).

The 2015 limma paper says about duplicateCorrelation: "More generally, the same idea is also used to model the correlation between related RNA samples, for example repeated measures on the same individual or RNA samples collected at the same time."

The duplicateCorrelation help in the limma R package says: Estimate the correlation between duplicate spots (regularly spaced replicate spots on the same array) or between technical replicates from a series of arrays.

However, several posts here suggest not using duplicateCorrelation in the designs proposed and instead pooling the technical replicates.

In this thread ( https://support.bioconductor.org/p/86867 ), Aaron Lun says that duplicateCorrelation "does better when you have samples across a large number of levels of the blocking factor". So should duplicateCorrelation be used when mixing biological and technical replicates, but only when there is a minimum number of samples/replicates/levels? If so, what are the minimums that should be observed?

Thank you in advance.

Tags: limma, duplicateCorrelation, rna-seq, technical replicates, biological replicates

FYI, by definition, biological samples from different individuals/subjects are also biological replicates. If you have, e.g., multiple biological samples per subject, then that is a repeated measures design that you would use duplicateCorrelation on. A repeated measures design is any design where you have multiple correlated biological samples per higher-level biological unit.

Aaron Lun

As Gordon suggests, the diversity of possible designs makes it difficult to suggest a hard-and-fast rule. Nonetheless, here are some thoughts:

Technical replicates: If these are generated by literally sequencing the same sample multiple times (e.g., on different lanes), just add them together and treat the resulting sum as a single sample.
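
For concreteness, a minimal sketch of that summation step in R, assuming a raw count matrix `counts` with one column per sequencing run and a hypothetical factor `run_sample` recording which RNA sample each run came from:

```r
library(edgeR)

# counts: genes x sequencing-runs matrix of raw counts (hypothetical name)
# run_sample: which RNA sample each run came from (hypothetical name)
y   <- sumTechReps(counts, ID = run_sample)  # one summed column per RNA sample
dge <- calcNormFactors(DGEList(counts = y))  # then proceed with the usual voom workflow
```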

Not-quite-technical replicates: These are usually things like "we took multiple samples from the same donor", so they're not fully fledged biological replicates but they aren't totally technical either. In most cases, I would just add them together and move on because I don't care about capturing the variability within levels of the blocking factor. For example, if biopsies are variable within a patient but the average expression across multiple biopsies is consistent across patients, then the latter is all I care about. ~~On the other hand, if I did expect the repeated samples to be similar, I would want to penalize genes that exhibit variation between them, so I'd like to capture that variation with duplicateCorrelation.~~ (Update: see comment below.)

Also, when adding, it is better that each repeated sample contributes evenly to the sum for a particular blocking level; this gives you a more stable sum and thus lower across-level variance. It may also be wise to use voomWithQualityWeights to adjust for differences in the number of repeated samples per donor.
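
As a rough illustration of that suggestion (not a prescription), assuming a DGEList `dge` that already holds one summed column per donor and a hypothetical factor `donor_group` giving each donor's condition:

```r
library(edgeR)
library(limma)

# dge: DGEList with one summed column per donor (hypothetical, as above)
# donor_group: hypothetical factor of donor-level conditions
design <- model.matrix(~ donor_group)

# voomWithQualityWeights estimates a weight per sample, which can down-weight
# donors whose sums were built from fewer repeated samples
v   <- voomWithQualityWeights(dge, design)
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2)
```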

Repeated samples with different uninteresting predictors: This refers to situations where repeated samples do not have the same set of predictors in the design matrix, e.g., because some repeated samples were processed in a different batch. If the repeated samples for each blocking level have the same pattern of values for those predictors (e.g., each blocking level has one repeated sample in each of three batches), summation is still possible. However, in general, this is not the case and then duplicateCorrelation must be used.

Repeated samples with different interesting predictors: This refers to situations where repeated samples do not have the same set of predictors in the design matrix, because those predictors are interesting and their effects are to be tested. The archetypical example would be to collect samples before and after treatment for each patient. Here, we can either use duplicateCorrelation or we can block on the uninteresting factors in the design matrix. I prefer the latter as it avoids a few assumptions of the former, namely that all genes have the same consensus correlation. (There's also an assumption about the distribution of the random effect, but I can't remember what it was - maybe normal i.i.d.) However, duplicateCorrelation is more general and is the only solution when you want to compare across blocking levels, e.g., comparing diseased and healthy donors when each donor also contributes before/after treatment samples.
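
To make the two options concrete, here is a rough sketch for the before/after example, assuming a normalized DGEList `dge` and hypothetical per-sample factors `patient`, `treatment` (before/after) and `disease` (healthy/diseased, constant within each patient); the repeated voom/duplicateCorrelation call follows the two-pass pattern commonly recommended for limma-voom analyses:

```r
library(edgeR)
library(limma)

treatment <- factor(treatment, levels = c("before", "after"))
disease   <- factor(disease,   levels = c("healthy", "diseased"))

## Option 1: block on patient in the design matrix (within-patient effects only)
design1 <- model.matrix(~ patient + treatment)
v1   <- voom(dge, design1)
fit1 <- eBayes(lmFit(v1, design1))
topTable(fit1, coef = "treatmentafter")

## Option 2: treat patient as a random effect via duplicateCorrelation,
## which also allows comparisons between patients (e.g. diseased vs healthy)
design2 <- model.matrix(~ disease + treatment)
v2   <- voom(dge, design2)
cf   <- duplicateCorrelation(v2, design2, block = patient)
v2   <- voom(dge, design2, block = patient, correlation = cf$consensus.correlation)
cf   <- duplicateCorrelation(v2, design2, block = patient)
fit2 <- lmFit(v2, design2, block = patient, correlation = cf$consensus.correlation)
fit2 <- eBayes(fit2)
topTable(fit2, coef = "diseasediseased")
```

Option 1 only estimates within-patient effects; option 2 also recovers the between-patient disease comparison, at the cost of the consensus-correlation assumption mentioned above.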


Thanks for your reply, Aaron. Your summary is very helpful. In the "Not-quite-technical replicates" scenario, my reasoning was the opposite of yours. I thought that if the replicates are expected to be similar, then I would treat them as "technical replicates", and if they are expected to be variable but still more similar to each other than to samples from a different individual, then duplicateCorrelation would correct for the "excess" similarity of the "not-quite-technical replicates". Is my reasoning flawed?


Yes, I can see how the comment was misleading, so I've reworded the answer.

My point was more about what you want to see in the DE genes that the analysis will detect. Would you be happy with DE genes that are highly variable between repeated samples, as long as they are consistent across biological replicates? If this is fine, then you don't want to model the variability between samples, and summation makes sense to mask that variability. One example would be single-cell data analysis where you might not care about cell-to-cell variability as long as the response at the population level was consistent.

I also forgot that duplicateCorrelation-based p-values don't actually penalize genes with strong variation between repeated samples; the relevant terms cancel out at some point, so my comment above was wrong and there's no benefit in that respect. Thus, it boils down to the speed and relatively assumption-free nature of summation versus the power improvement from having more samples when using duplicateCorrelation. I prefer the former.


Thank you for the clarification and for taking the time to answer my question!

Gordon Smyth (WEHI, Melbourne, Australia)

> So should duplicateCorrelation be used when mixing biological and technical replicates, but only when there is a minimum number of samples/replicates/levels?

Sure, treating factor effects as random often makes more sense when the number of levels is larger, but there is no minimum number. You can apply duplicateCorrelation with only two blocks, and there are examples of this in the User's Guide.

Judging from your previous question that Aaron answered, you don't actually have technical replicates at all. If you really did have pure technical replicates (sequencing the same RNA samples twice) then you would normally just sum the counts using edgeR::sumTechReps. There is an infinite variety of designs and an infinite spectrum of "semi" technical replicates that may be strongly or weakly correlated, so it is impossible to give a universal rule that covers all cases. When we advised against duplicateCorrelation in previous posts there was always an alternative, and we gave a reason for choosing the alternative.


Thank you for your reply. In my case, if I don't have real technical replicates, should I still pool them (RNA-seq) or use duplicateCorrelation?
