There's two choices here. The first is to use duplicateCorrelation
:
design <- model.matrix(~0 + donor + disease)
block <- paste0(donor, ".", disease)
dc <- duplicateCorrelation(y, design=design, block=block)
The additive formulation design
is fairly self-explanatory, but block
is less obvious. It aims to account for correlations between samples with the same combination of donor/disease, assuming that they are technical replicates. Donor-wide correlations are already handled by the donor
terms in design
, while disease-wide correlations are indistinguishable from your disease effect and should not be removed.
The second and simpler approach is to just average all samples with the same block
together:
y0 <- avearrays(y, ID=block)
This avoids having to deal with correlations; each set of correlated samples is now a single averaged sample, and the averaged samples are independent of each other conditional on the design matrix. You can analyze this with a simple additive design where the number of samples is twice the number of donors (one averaged healthy/diseased sample per donor). For extra sophistication, run arrayWeights
to account for the different variance of average observations derived from different numbers of samples.
The first approach will consider the variance between samples in the same block
, whereas the second approach does not. This does provide some benefit as it encourages consistent measurements across the same donor/disease combination; however, it comes at the cost of other statistical issues, namely anticonservativeness (see A: limma - technical replicates: duplicateCorrelation() or avereps()? for some thoughts on this). Personally, I would go with the second approach - it's a lot faster, too.
Great answer, as usual, but I think the OP should pay particular attention to the "assuming they are technical replicates" bit of this answer. It might be useful for the OP to elaborate on how repeated "healthy" or "diseased" measures from the same patient are ending up in these data.
For instance, imagine "diseased" meant "tumor" in this question and we have two separate biopsies from the same tumor, I wouldn't call these technical replicates and therefore not sure if using duplicateCorrelation in the way you describe here makes sense ... right?
The use of
duplicateCorrelation
should still be valid with separate biopsies - all the function does is model the correlations, and if you're taking repeated biopsies (i.e., measurements) from the same tumour, there most likely will be correlations between the resulting samples. I just mentioned technical replicates because that was easier to explain. The correlations would probably be much smaller with biopsies, though.Thank you very much for your comments. The diseased samples are biopsies from separate diseased regions of the same organ (not cancer). I think I will go with `avearrays` and `arrayWeights` as Aaron suggested. Am I correct in assuming that `arrayWeights` will detect the reduced variance of my averaged samples and thus increase the weight for these samples when running `lmFit`?
On a related note, would it be possible to use `arrayWeights` to compute weights for `avearrays` as well?
For your first question, yes. For your second question... I suppose you could apply
arrayWeights
to the original samples (using a one-way layout withblock
) and then use those weights inavearrays
. This would downweight outliers within each donor/disease combination, but it would only be effective when you have 3 or more samples for that combination.Hi Aaron,
I'm faced with an analysis that has basically the same problem as OP had, but is probably even more unbalanced: In most cases, I have either multiple diseased or multiple healthy samples per donor. Paired d/h samples from the same donor exist but are rare. There are also many cases where there is only a single sample per donor.
I need to use DESeq2 and cannot use limma for reasons I don't want to disclose.
At the moment, I'm simply doing
~ donor + disease(y/n)
in the DESeq2 design. However, I'm concerned that when I later do my disease contrast, that the signal will be driven only by the few paired cases where d/h were available from the same patient and would not integrate info from patients with either unique samples or multiple samples from either only d or h. Is this concern justified? How would you deal with it? If I understand your previous answers correctly, I should simply be averaging samples from the same block and then excludedonor
from the design?Lastly, if I wanted to try
duplicateCorrelation
, how do I technically integrate it with the DESeq2 workflow? As I understand the function returns the inter-block correlations, how can DESeq2 use this info?Your answer would be greatly appreciated.