RNA-seq Batch Correction
1
0
Entering edit mode
cvu • 0
@cvu-14922
Last seen 6.9 years ago

Hello All,

I have an RNA-seq dataset and I've performed PCA analysis to check batch effect using package edgeR. and have used raw counts from cuffnorm output.

here is my code

    library(edgeR)
    a1<-read.table("raw_counts.txt", sep="\t", header=TRUE)
    dge1 <- DGEList(counts=a1[(-1)])
    dge1 <- calcNormFactors(dge1)
    logCPM1 <- cpm(dge1,log=TRUE,prior.count=5)
    batch=c(rep("batch1",8), rep("batch2",8))
    logCPM1 <- removeBatchEffect(logCPM1, batch=batch)
    B.res = prcomp(logCPM1, scale = TRUE)
    pc.s = summary(B.res)$importance[2,1:2]
    pc1.var = round(pc.s[["PC1"]],2)
    pc2.var = round(pc.s[["PC2"]],2)
    pdf(file = "after1_125.pdf")
    plot(B.res$rotation[,1:2], main = "After Batch Correction",  xlab = paste("PC1 (variance:",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance: ",pc2.var*100,"%)", sep =""), col=resp, pch = 16, cex = .9)
    dev.off()

From PCA plot I could see there is little batch effect and I have corrected batch effect. Now I want to use batch corrected data for further analysis in cuffdiff. Is this possible? If yes how? or is Tuxedo Suite takes care of batch correction?

Please suggest me tools or workflows which uses batch corrected data for RNA-seq analysis.

Any help would be appreciated.
Thanks you.
 

rna-seq edger limma removebatcheffect() • 1.7k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 5 hours ago
The city by the bay

I won't spreak for other analyses you might want to do with RNA-seq data; but for differential expression analyses, you should not use batch-corrected data. This is because the batch correction does not account for the uncertainty of the estimation of the batch effect, which usually results in anticonservativeness (i.e., p-values that are lower than they should be, too many false positives). Rather, you should include the batch as a blocking factor in the statistical model. This can be easily done with GLMs, see Section 3.4.3 of the edgeR user's guide.

Also, edgeR doesn't perform PCA, you're actually using the prcomp function from stats.

ADD COMMENT
0
Entering edit mode

Thank you Aaron
I am also inclined to believe that batch corrected data should not be used for differential expression analysis.

Just had one counter-comment and an additional question
If you could share your thoughts on these, that would be great.
=================================================================
Comment:
As far as the anticonservativeness goes, I think the jury is still divided
Example:
This study states that the correction of batch factors
"it did not increase the number of small p values (e.g., p values < 0.05), but reversely, the small p values are getting less."
https://www.researchgate.net/publication/304497610_Evaluation_of_Methods_in_Removing_Batch_Effects_on_RNA-seq_Data
Would be keen to know your comments.
=================================================================
Question:
if we apply "batch as a blocking factor"
Should this blocking factor be applied for the samples of comparing combination?
or
For all samples? and then use the output data for differential expression?
or
am I getting this wrong?

Example:
I have 24 samples, with 3 batches of 8 samples each.
One specific combination, say A_vs_B, has 3 replicates each (6 samples)

Now should I apply the batch-correction using GLM for the combination? and then perform differential expression analysis?
or
Should I fist corrected the data of all 24 samples, and perform the differential analysis for A_vs_B? and so on so forth for all combination
=================================================================
Thanks a lot for your inputs
really appreciate the time

ADD REPLY
0
Entering edit mode

The quote you've taken from that paper seems to be showing what happens before correction/modelling of the batch factors. Obviously, if you simulate data for a balanced design with a batch effect, and then fail to model that batch effect, you will inflate the variance and increase conservativeness, resulting in a skew towards large p-values. This is a different problem to what I was describing, relating to the use of batch-corrected data for DE analysis.

There is no argument that using pre-corrected data will fail to propagate the uncertainty of the batch effect estimates. Failing to account for uncertainty will generally result in anticonservativeness during hypothesis testing. (I cannot think of a specific case where failing to account for uncertainty would result in conservativeness, but I suppose it might be possible.)

For your second question; I don't understand what you're confused about. Try reading Section 3.4.3 and the case study in Section 4.2 of the edgeR user's guide; I'm not going to recite the documentation here.

ADD REPLY

Login before adding your answer.

Traffic: 540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6