Hi,
I have been using Combat to correct the batch effect in 450k data (~10 batches). Recently, I read an old reply from Dr. Peter Langfelder where he mentioned that "ComBat should NOT be used before running association testing (lmFit); association testing should be run with batch as a covariate on original data."
I have read the paper from Combat authors where they mentioned that Combat performs better than SVD when sample size is small and comparably similar for large sample size.
Now, my question is since my sample size is large and I am using limma for calculating differential methylation, should I be adjusting batch directly in limma as a covariate (or use removeBatchEffect(), not sure about this, as this will again be like removing batch separately)
batch <- pheno$batch BSC <- pheno$BSC_batch group <- factor(targets$status,levels=c("Control","Case")) design <- model.matrix(~targets$Age+batch+BSC+group) fit <- lmFit(Mval, design)
or should I continue with Combat for removing batch and limma for adjusting other biological covariates?
MvalC <- ComBat(Mval, batch=batch, mod=NULL, par.prior = TRUE, prior.plots = FALSE) modcombat <- model.matrix(~target$Age, data=pheno) MvalC1 <- ComBat(MvalC, batch=BSC, mod=NULL, par.prior = TRUE, prior.plots = FALSE) design <- model.matrix(~targets$Age+group) fit <- lmFit(MvalC1, design)
(I corrected for Chip batch and BSC batch separately, as they were confounded and limma was showing error.)
You've answered your own question with Peter's quote (for anyone who's interested, the original is at C: Limma, blocking batch effect). That you have large sample sizes and are doing differential methylation does not change the correctness of his advice.
The only part I didn't understand (as he also didn't elaborate it) in that comment is: how does it make difference, if I remove batch first, then adjust for covariates while doing differential analysis?
Also, I have seen many people used limma function removeBatchEffect() to remove batch prior to any differential analysis. I guess that will also not be a correct approach
It's got to do with the true number of residual d.f. and the uncertainty of the model parameters. I believe I gave you a more complete answer at C: Adjustment of covariates in 450k data using Limma, but to sum up; batch removal is not perfect. Like any statistical procedure, it is subject to errors (e.g., from imprecise estimation of the batch effect), and if you don't model those errors during the DE analysis, then your analysis will not be correct. Blocking on the batch effect in the linear model will account for those errors and is the right approach; removing the batch effect before modelling is not.
Thanks Aaron, I was also doubting that it has to do with the residual D.f. Thanks once again for clarifying.