Dear all, I have read (EdgeR: Accounting for batch effects in a pairwise analysis) that there is no need to remove batch effects in a paired design, as they are corrected for automatically.
I am dealing with data from primary cells, and there is a lot of heterogeneity in the cells. I initially planned to use RUVSeq, which estimates factors of unwanted variation in the data set and returns them in the sample annotation (W_1 when k=1).
library(RUVSeq)
filtered <- read.delim("filt_counts.txt", header=TRUE, row.names=1)
treat <- as.factor(rep(c("treated","Untreated"),8))
subjects <- factor(rep(1:8, each=2))
design <- model.matrix(~subjects+treat)
set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(treat, row.names=colnames(filtered)))
set <- betweenLaneNormalization(set, which="upper")
#first-pass DE analysis to pick empirical control genes (all genes outside the top 5000)
y <- DGEList(counts=filtered, group=treat)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
top <- topTags(lrt, n=nrow(y))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
#estimate one factor of unwanted variation (W_1) from the empirical control genes
set2 <- RUVg(set, empirical, k=1)
#DE analysis using the estimated W_1, final result
design <- model.matrix(~subjects+W_1+treat, data=pData(set2))
y <- DGEList(counts=counts(set2), group=treat)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design,verbose=TRUE)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
But I have now learned that a paired analysis does not need batch correction, and that my design might introduce bias into the analysis, as I might be doing it wrong.
I would like to know whether it is wrong to batch correct a paired analysis, and whether there is any way to remove hidden, unwanted variation so that the true signal can be seen in heterogeneous data.
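The "corrected automatically" point can be checked directly on the design matrix. The following is a minimal sketch in base R, assuming a hypothetical known `batch` factor in which both samples from a subject were processed together (the batch labels and layout are made up for illustration, not taken from the data above):

```r
# Hypothetical layout: 8 subjects, 2 samples each
subjects <- factor(rep(1:8, each = 2))
treat <- factor(rep(c("treated", "untreated"), 8))
# Assumed batch: subjects 1-4 processed in batch A, subjects 5-8 in batch B
batch <- factor(rep(c("A", "B"), each = 8))

paired <- model.matrix(~ subjects + treat)
with_batch <- model.matrix(~ subjects + batch + treat)

# The batch column is a sum of subject columns, so adding it
# adds a column to the design but does not increase its rank
rank_paired <- qr(paired)$rank
rank_with_batch <- qr(with_batch)$rank
ncol_with_batch <- ncol(with_batch)
```

Here the design with `batch` has one more column than the paired design but the same rank, which is another way of saying that any effect that is constant within a subject is already absorbed by the subject terms. This only holds for effects that do not differ between the two samples of a pair.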
Just to be clear, if by heterogeneity you mean the data is simply highly variable, with truly random independent variations in each gene's expression, there's nothing you can do about this variation. RUV and similar methods are designed for detecting and removing systematic patterns of heterogeneity that appear consistently across many genes and are not explained by your known factors (i.e. treatment and subject).
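To illustrate what "systematic" means here, this is a rough base R sketch on simulated data. It is not RUVSeq itself, only the underlying idea: a hidden factor that affects many genes shows up as a dominant pattern in the residuals left after fitting the known factors. The hidden factor `w`, the gene loadings `alpha`, and all sizes are assumptions made up for the simulation.

```r
set.seed(1)
n_genes <- 1000
subjects <- factor(rep(1:8, each = 2))
treat <- factor(rep(c("treated", "untreated"), 8))
design <- model.matrix(~ subjects + treat)

# Hypothetical hidden factor that varies within subjects,
# with a different loading (alpha) on each gene
w <- rnorm(16)
alpha <- rnorm(n_genes)
noise <- matrix(rnorm(n_genes * 16, sd = 0.3), n_genes, 16)
logexpr <- outer(alpha, w) + noise   # genes x samples

# Residuals after removing the known factors (subject and treatment)
res <- t(lm.fit(design, t(logexpr))$residuals)   # genes x samples

# Leading sample-level pattern in the residuals
sv <- svd(res)
v1 <- sv$v[, 1]

# Part of the hidden factor not explained by the known factors
w_res <- lm.fit(design, w)$residuals

# Agreement between the recovered pattern and the hidden factor (sign is arbitrary)
agreement <- abs(cor(v1, w_res))
```

In this simulation the leading residual pattern tracks the hidden factor closely because the factor is shared across genes. Purely gene-by-gene random noise has no such shared pattern, so there would be nothing for a method like RUV to estimate.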