(I asked this as a side question before, but I though it deserves it's own post)
I have a bulk-RNA experiment with 3 tissue types: Lesional, Non-Lesional and Controls, whereas Lesional and Non-Lesional are taken from the same patients. Most of the Controls were processed in batch#2. To batch-correct, two technical replicates of lesional samples and two technical replicates of non-lesional samples were also included in batch #2. All samples have Patient ID and Sample ID, like so:
Sample ID | Sample Type | Patient ID | Batch |
---|---|---|---|
S1 | Lesional | P1 | 1 |
S2 | Non-Lesional | P1 | 1 |
S3 | Lesional | P2 | 1 |
S4 | Non-Lesional | P2 | 1 |
... | ... | ... | ... |
S55 | Lesional | P20 | 1 |
S56 | Non-Lesional | P20 | 1 |
S55 | Lesional | P20 | 2 |
S56 | Non-Lesional | P20 | 2 |
S57 | Lesional | P21 | 1 |
S58 | Non-Lesional | P21 | 1 |
S57 | Lesional | P21 | 2 |
S58 | Non-Lesional | P21 | 2 |
S59 | Control | P22 | 2 |
S60 | Control | P23 | 2 |
... | ... | ... | ... |
.
I've read some comments regarding batch correction in the forum, and a similar problem was raised here with a more thorough explanation here, but I couldn't find an answer for a case of batch correction followed by repeated measures.
I can use sva
and ComBat_seq
(or RUVSeq
), but the comments recommend utilizing limma's duplicateCorrelation
. This is a valid option, but I don't know which blocking parameter to use - because of the repeated-measures nature of the experiment, I need to use PatientID, but this might not suit the batch correction, which should match SampleID.
My current approach is something like this, which I'm not sure is the "right" way.
exprs.mat.batchcorrected = ComBat_seq(counts=exprs.mat, batch=phenodata$batch, group = phenodata$SampleID, full_mod = T)
DGE.cpm<-DGEList(counts = exprs.mat.batchcorrected, samples = phenodata)
... # filtering genes etc ... #
design<-model.matrix(~0 + Sample.Type + batch, data=DGE.cpm$samples)
voom.result <-voomWithQualityWeights(DGE.cpm, design, plot=T, normalize="quantile")
corfit.rna<-duplicateCorrelation(voom.result, design, block=DGE.cpm$samples$PatientID)
voom.result.randeffect <-voomWithQualityWeights(DGE.cpm, design, plot=TRUE, block=DGE.cpm$samples$PatientID, correlation=corfit.rna$consensus, normalize="quantile")
corfit.rna2<-duplicateCorrelation(voom.result.randeffect, design, block=DGE.cpm$samples$PatientID)
fit2.rna<-lmFit(voom.result.randeffect, design, block=voom.result.randeffect$targets$PatientID, correlation=corfit.rna2$consensus)
Thank you so much for helping!
Thank you for your swift answer :)
Have you checked from the MDS plot that there is a visible batch effect for the repeated samples? Batch correction has lots of negative consequences for the analysis so you should avoid it if there is little or no effect to remove.
The technical replicates do look rather similar on the MDS plot, but the axes percentages are rather low... is that enough to draw conclusions?
Thank you again!
I see no evidence of a batch effect. I would be tempted to sum the technical replicates and analyse the data without batch correction.
Thanks, I'll try that. BTW, what do you mean by "sum them"? Do you mean averaging them?
I mean add the counts, for example by
sumTechReps
. Technical RNA-seq replicates are always combined by summing rather than by "averaging", equivalent to what you would get by combining the FASTQ files. In fact I am not even sure what "averaging" would mean for a count.I'm sorry for raising this thread, but I have a very similar project, and after returning to your answer I have a few more question regarding your guidance:
Thank you again!
In your code, it was clear that
expr.mat
was a matrix rather than a DGEList, so it has to be a count matrix. If you have a DGEList object, then of course you could use that instead.normalize="quantile"
is an option in the code so you, again, you can obviously use it. Or alternatively runnormLibSizes
on the DGEList in edgeR beforevoomLmFit()
.