Hello All,
I have an RNA-seq dataset and I've performed a PCA analysis with the edgeR package to check for batch effects, using raw counts from the Cuffnorm output.
Here is my code:
library(edgeR)

# Read the raw count table; the first column is assumed to hold gene IDs
# and is dropped before building the DGEList.
a1 <- read.table("raw_counts.txt", sep = "\t", header = TRUE)
dge1 <- DGEList(counts = a1[, -1])
dge1 <- calcNormFactors(dge1)

# Log-CPM values for exploratory analysis.
logCPM1 <- cpm(dge1, log = TRUE, prior.count = 5)

# Two batches of 8 samples each, in the column order of the count table.
batch <- c(rep("batch1", 8), rep("batch2", 8))
logCPM1 <- removeBatchEffect(logCPM1, batch = batch)

# PCA and plot; 'resp' is one colour per sample (placeholder values here).
resp <- c(rep("blue", 8), rep("red", 8))
B.res <- prcomp(logCPM1, scale = TRUE)
pc.s <- summary(B.res)$importance[2, 1:2]
pc1.var <- round(pc.s[["PC1"]], 2)
pc2.var <- round(pc.s[["PC2"]], 2)

pdf(file = "after1_125.pdf")
plot(B.res$rotation[, 1:2], main = "After Batch Correction",
     xlab = paste0("PC1 (variance: ", pc1.var * 100, "%)"),
     ylab = paste0("PC2 (variance: ", pc2.var * 100, "%)"),
     col = resp, pch = 16, cex = 0.9)
dev.off()
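For reference, I understand the more common convention is to run the PCA with the samples as the observations, i.e. on the transposed log-CPM matrix. A rough sketch of that variant, reusing the placeholder colour vector resp from above:

# PCA on transposed log-CPM values, so each sample is one point in the plot.
pca <- prcomp(t(logCPM1))
pvar <- round(100 * summary(pca)$importance[2, 1:2], 1)
plot(pca$x[, 1:2], main = "After Batch Correction",
     xlab = paste0("PC1 (variance: ", pvar[[1]], "%)"),
     ylab = paste0("PC2 (variance: ", pvar[[2]], "%)"),
     col = resp, pch = 16, cex = 0.9)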
From the PCA plot I can see that there is a small batch effect, which I have now corrected. I would like to use the batch-corrected data for further analysis in Cuffdiff. Is this possible? If yes, how? Or does the Tuxedo suite take care of batch correction itself?
Please suggest tools or workflows that can use batch-corrected data for RNA-seq analysis.
Any help would be appreciated.
Thank you.
Thank you, Aaron.
I am also inclined to believe that batch-corrected data should not be used for differential expression analysis.
I just have one counter-comment and one additional question.
If you could share your thoughts on these, that would be great.
=================================================================
Comment:
As far as the anticonservativeness goes, I think the jury is still out.
Example:
This study states that, after correcting for batch factors,
"it did not increase the number of small p values (e.g., p values < 0.05), but reversely, the small p values are getting less."
https://www.researchgate.net/publication/304497610_Evaluation_of_Methods_in_Removing_Batch_Effects_on_RNA-seq_Data
I would be keen to hear your comments.
=================================================================
Question:
If we apply batch as a blocking factor, should the blocking factor be applied only to the samples in the comparison of interest, or to all samples, with the resulting output then used for differential expression? Or am I getting this wrong?
Example:
I have 24 samples, in 3 batches of 8 samples each.
One specific comparison, say A_vs_B, has 3 replicates per group (6 samples).
Should I apply the batch correction with the GLM for just that comparison and then perform the differential expression analysis? Or should I first correct the data for all 24 samples, and then perform the differential analysis for A_vs_B, and so on for every comparison?
=================================================================
Thanks a lot for your input; I really appreciate your time.
The quote you've taken from that paper seems to be showing what happens before correction/modelling of the batch factors. Obviously, if you simulate data for a balanced design with a batch effect, and then fail to model that batch effect, you will inflate the variance and increase conservativeness, resulting in a skew towards large p-values. This is a different problem to what I was describing, relating to the use of batch-corrected data for DE analysis.
There is no argument that using pre-corrected data will fail to propagate the uncertainty of the batch effect estimates. Failing to account for uncertainty will generally result in anticonservativeness during hypothesis testing. (I cannot think of a specific case where failing to account for uncertainty would result in conservativeness, but I suppose it might be possible.)
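To make this concrete, here is a quick null simulation (a rough sketch with arbitrary parameters, run with limma on normal log-expression values for simplicity): pre-correcting the data and then dropping batch from the model yields more small p-values than modelling batch directly, even though no gene is truly DE.

library(limma)
set.seed(1)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 4))
batch <- factor(rep(c("b1", "b2"), times = 4))

# Null log-expression values plus a genewise batch effect; no group effect at all.
y <- matrix(rnorm(ngenes * 8), ngenes, 8)
y <- y + outer(rnorm(ngenes, sd = 2), as.numeric(batch == "b2"))

# (i) Model the batch effect in the linear model.
fit1 <- eBayes(lmFit(y, model.matrix(~batch + group)))

# (ii) Pre-correct the data, then ignore batch in the model.
y2 <- removeBatchEffect(y, batch = batch)
fit2 <- eBayes(lmFit(y2, model.matrix(~group)))

mean(fit1$p.value[, "groupB"] < 0.05)  # close to the nominal 5%
mean(fit2$p.value[, "groupB"] < 0.05)  # inflated above 5%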
For your second question: I don't understand what you're confused about. Try reading Section 3.4.3 and the case study in Section 4.2 of the edgeR user's guide; I'm not going to recite the documentation here.
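But briefly, the usual approach is to put the batch factor into the design matrix for all samples and test each comparison as a contrast within that single fit, rather than pre-correcting the data. A rough sketch for your 24-sample example, with a hypothetical sample layout and object names (replace with your real annotation and counts):

library(edgeR)

# Hypothetical annotation: 8 groups (A-H) of 3 replicates, spread over 3 batches.
targets <- data.frame(
    group = factor(rep(LETTERS[1:8], each = 3)),
    batch = factor(rep(c("b1", "b2", "b3"), times = 8))
)
# Batch enters as an additive blocking factor alongside the group factor.
design <- model.matrix(~0 + group + batch, data = targets)

# 'counts' is the raw (uncorrected) count matrix for all 24 samples.
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test A vs B within the one batch-adjusted fit; the other comparisons are
# just other contrasts against the same fit.
con <- makeContrasts(groupA - groupB, levels = design)
res <- glmQLFTest(fit, contrast = con)
topTags(res)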