Hello All,
I have an RNA-seq dataset and I've performed a PCA analysis with the edgeR package to check for batch effects, using raw counts from the Cuffnorm output.
Here is my code:
library(edgeR)

# Read the raw count table; the first column is assumed to hold gene IDs
# and is dropped before building the DGEList.
a1 <- read.table("raw_counts.txt", sep = "\t", header = TRUE)
dge1 <- DGEList(counts = a1[, -1])
dge1 <- calcNormFactors(dge1)

# Log-CPM values for exploratory analysis.
logCPM1 <- cpm(dge1, log = TRUE, prior.count = 5)

# Two batches of 8 samples each, in the column order of the count table.
batch <- c(rep("batch1", 8), rep("batch2", 8))
logCPM1 <- removeBatchEffect(logCPM1, batch = batch)

# PCA and plot; 'resp' is one colour per sample (placeholder values here).
resp <- c(rep("blue", 8), rep("red", 8))
B.res <- prcomp(logCPM1, scale = TRUE)
pc.s <- summary(B.res)$importance[2, 1:2]
pc1.var <- round(pc.s[["PC1"]], 2)
pc2.var <- round(pc.s[["PC2"]], 2)

pdf(file = "after1_125.pdf")
plot(B.res$rotation[, 1:2], main = "After Batch Correction",
     xlab = paste0("PC1 (variance: ", pc1.var * 100, "%)"),
     ylab = paste0("PC2 (variance: ", pc2.var * 100, "%)"),
     col = resp, pch = 16, cex = 0.9)
dev.off()
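For reference, I understand the more common convention is to run the PCA with the samples as the observations, i.e. on the transposed log-CPM matrix. A rough sketch of that variant, reusing the placeholder colour vector resp from above:

# PCA on transposed log-CPM values, so each sample is one point in the plot.
pca <- prcomp(t(logCPM1))
pvar <- round(100 * summary(pca)$importance[2, 1:2], 1)
plot(pca$x[, 1:2], main = "After Batch Correction",
     xlab = paste0("PC1 (variance: ", pvar[[1]], "%)"),
     ylab = paste0("PC2 (variance: ", pvar[[2]], "%)"),
     col = resp, pch = 16, cex = 0.9)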
From the PCA plot I can see that there is a small batch effect, which I have now corrected. I would like to use the batch-corrected data for further analysis in Cuffdiff. Is this possible? If yes, how? Or does the Tuxedo suite take care of batch correction itself?
Please suggest tools or workflows that can use batch-corrected data for RNA-seq analysis.
Any help would be appreciated.
Thank you.
Thank you, Aaron.
I am also inclined to believe that batch-corrected data should not be used for differential expression analysis.
I just have one counter-comment and one additional question.
If you could share your thoughts on these, that would be great.
=================================================================
Comment:
As far as the anticonservativeness goes, I think the jury is still out.
Example:
This study states that, after correcting for batch factors,
"it did not increase the number of small p values (e.g., p values < 0.05), but reversely, the small p values are getting less."
https://www.researchgate.net/publication/304497610_Evaluation_of_Methods_in_Removing_Batch_Effects_on_RNA-seq_Data
I would be keen to hear your comments.
=================================================================
Question:
If we apply batch as a blocking factor, should the blocking factor be applied only to the samples in the comparison of interest, or to all samples, with the resulting output then used for differential expression? Or am I getting this wrong?
Example:
I have 24 samples, in 3 batches of 8 samples each.
One specific comparison, say A_vs_B, has 3 replicates per group (6 samples).
Should I apply the batch correction with the GLM for just that comparison and then perform the differential expression analysis? Or should I first correct the data for all 24 samples, and then perform the differential analysis for A_vs_B, and so on for every comparison?
=================================================================
Thanks a lot for your input; I really appreciate your time.
The quote you've taken from that paper seems to be showing what happens before correction/modelling of the batch factors. Obviously, if you simulate data for a balanced design with a batch effect, and then fail to model that batch effect, you will inflate the variance and increase conservativeness, resulting in a skew towards large p-values. This is a different problem to what I was describing, relating to the use of batch-corrected data for DE analysis.
There is no argument that using pre-corrected data will fail to propagate the uncertainty of the batch effect estimates. Failing to account for uncertainty will generally result in anticonservativeness during hypothesis testing. (I cannot think of a specific case where failing to account for uncertainty would result in conservativeness, but I suppose it might be possible.)
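To make this concrete, here is a quick null simulation (a rough sketch with arbitrary parameters, run with limma on normal log-expression values for simplicity): pre-correcting the data and then dropping batch from the model yields more small p-values than modelling batch directly, even though no gene is truly DE.

library(limma)
set.seed(1)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 4))
batch <- factor(rep(c("b1", "b2"), times = 4))

# Null log-expression values plus a genewise batch effect; no group effect at all.
y <- matrix(rnorm(ngenes * 8), ngenes, 8)
y <- y + outer(rnorm(ngenes, sd = 2), as.numeric(batch == "b2"))

# (i) Model the batch effect in the linear model.
fit1 <- eBayes(lmFit(y, model.matrix(~batch + group)))

# (ii) Pre-correct the data, then ignore batch in the model.
y2 <- removeBatchEffect(y, batch = batch)
fit2 <- eBayes(lmFit(y2, model.matrix(~group)))

mean(fit1$p.value[, "groupB"] < 0.05)  # close to the nominal 5%
mean(fit2$p.value[, "groupB"] < 0.05)  # inflated above 5%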
For your second question: I don't understand what you're confused about. Try reading Section 3.4.3 and the case study in Section 4.2 of the edgeR user's guide; I'm not going to recite the documentation here.
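But briefly, the usual approach is to put the batch factor into the design matrix for all samples and test each comparison as a contrast within that single fit, rather than pre-correcting the data. A rough sketch for your 24-sample example, with a hypothetical sample layout and object names (replace with your real annotation and counts):

library(edgeR)

# Hypothetical annotation: 8 groups (A-H) of 3 replicates, spread over 3 batches.
targets <- data.frame(
    group = factor(rep(LETTERS[1:8], each = 3)),
    batch = factor(rep(c("b1", "b2", "b3"), times = 8))
)
# Batch enters as an additive blocking factor alongside the group factor.
design <- model.matrix(~0 + group + batch, data = targets)

# 'counts' is the raw (uncorrected) count matrix for all 24 samples.
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test A vs B within the one batch-adjusted fit; the other comparisons are
# just other contrasts against the same fit.
con <- makeContrasts(groupA - groupB, levels = design)
res <- glmQLFTest(fit, contrast = con)
topTags(res)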