Dear Bioconductor list,
I have a couple of questions regarding batch effect removal:
Question 1
I have 8 RNA-seq samples to analyze and detect genes that respond to some treatment in different genotypes (wt vs ko). The genotype is described by the vector
a <- c(1,1,1,1,2,2,2,2), the treatment by the vector: b <- c(1,2,1,2,1,2,1,2)
and the batch effect (which is coming from the fact that the same animal has been used in 2 samples) by the vector
batch<- c(1,1,2,2,3,3,4,4)
The effect of the batch is quite strong as it is shown in the mds plot:
http://139.91.162.50/problems/fig1.png
Thus, it is clear that the genotype effect is confounded by the batch effect, and therefore I cannot remove it. I guess this is the meaning of the warning:
Coefficients not estimable: batch3
Warning message:
Partial NA coefficients for 13533 probe(s)
However, it seems that
logCPMc <- removeBatchEffect(logCPM, batch=batch, design=design.nobatch)
, where design.nobatch <- model.matrix(~a*b) can remove the batch effect despite this warning. The MDS plot is as expected:
http://139.91.162.50/problems/fig2.png
Should I trust this batch effect removal process despite the warning message? Why did it produce some result (and how is this result produced), since it cannot estimate the batch effect due to confounding?
Question2
I run batchEffectRemoval on logCPM values: i.e.
y <- DGEList(counts=d1, group = groups)
y <- calcNormFactors( y )
logCPM <- cpm(y, log=TRUE, prior.count=5)
logCPMc <- removeBatchEffect(logCPM, batch=batchVector2, design=design.nobatch)
Is this correct, or should I use the y data structure instead? BTW when I use the y instead of logCPM, I get the following error message:
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'
Thanks a lot in advance,
pavlos
Thanks Aaron,
I hadn't thought about the design matrix you suggest. Thanks a lot. This will work. Can this design matrix however capture effects of the genotype and treatment independent of each other? But perhaps of the confounding between genotype and batch it could be the best that I can do
Another option would be to explore the
duplicateCorrelation
approach withvoom
. This will allow you to fit a model where the genotype and treatment effects are represented by separate coefficients, i.e.,~genotype + treatment
. However, it has its own drawbacks - namely, it assumes that all genes are correlated to the same consensus correlation (which probably isn't true, each gene would have its own correlation depending on its variance); and it will absorb mouse-to-mouse variability into the variance estimate, which will result in conservativeness. In contrast, putting the batch effect in the design matrix will absorb this variability into the coefficients rather than the sample variance.Hi Aaron,
I have similar problem as main post. Instead of genotype and treatment, I have cell type and time point, both of which are part of experimental design. The replicate prep batch are the batch here.
Just for visualization purpose, I used
normalizeVSN()
result as input forremoveBatchEffect
I got similar warning as
I compared cpmVSN_batchCorrected and cpmVSN value, but no value was change, thus basically batchEffect removal failed. Do you know what caused the batch correction failure? I just want to remove the batch co-efficients effect for PCA plot (visualization).
Please post a new question rather than resurrecting old threads.