Dear all,
I have patient data (microarray data, > 100 samples, very noisy), and as always there are many factors (disease/control, infection, age, sex, treatment, cohort, PMI, batch/scan date).
So my question is basically a general one about ComBat. I am interested in two biological variables: disease/control and infection. Should I, or can I, correct for the others by running ComBat sequentially? If yes, what about the order? The order does influence the outcome.
Or is multiple batch correction basically overfitting the data?
Thank you very much!
Best,
Julia
Thank you very much for your answer! Very helpful!
I have done something similar to what you described.
ComBat cannot take multiple batches at the same time, so I ran it sequentially: at each step I gave it one batch variable to correct for and a design matrix containing my factors of interest plus the remaining covariates (uninteresting factors). Once an uninteresting factor had been corrected for, I omitted it from the design matrix in the next step, until only the factors of interest were left.
But thanks! I will give removeBatchEffect a try.
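For readers who want to see what a single-model correction looks like, here is a minimal NumPy sketch. It is only an illustration of the general idea (fit the factors of interest and all uninteresting factors in one linear model, then subtract only the uninteresting part); it is not limma's code, and the function name `remove_nuisance`, the toy factors, and all effect sizes are made up for the example. It assumes two-level or numeric nuisance columns; a batch with more levels would need dummy coding first.

```python
import numpy as np

def remove_nuisance(Y, design, nuisance):
    """Subtract fitted nuisance effects from Y (genes x samples).

    design:   samples x p matrix of factors to keep (with intercept column)
    nuisance: samples x q matrix of factors to remove
    """
    # Center the nuisance columns so each gene keeps its overall mean.
    nuisance = nuisance - nuisance.mean(axis=0)
    X = np.column_stack([design, nuisance])
    # One joint least-squares fit per gene: all factors are estimated together.
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    nuisance_coef = coef[design.shape[1]:]
    return Y - (nuisance @ nuisance_coef).T

# Toy data (invented): 12 samples, 2 genes, no noise.
rng = np.random.default_rng(0)
n = 12
disease = np.tile([0.0, 1.0], 6)
batch = np.repeat([0.0, 1.0], 6)
age = rng.uniform(40, 90, size=n)

design = np.column_stack([np.ones(n), disease])
nuisance = np.column_stack([batch, age])
Y = np.vstack([
    3.0 + 2.0 * disease + 1.5 * batch + 0.05 * age,
    7.0 - 1.0 * disease - 0.5 * batch,
])

corrected = remove_nuisance(Y, design, nuisance)

# Refit the corrected data on all factors to check what is left.
X_full = np.column_stack([design, nuisance])
refit, *_ = np.linalg.lstsq(X_full, corrected.T, rcond=None)
print(np.round(refit, 6))  # rows: intercept, disease, batch, age
```

In the refit, the disease row keeps its simulated effects while the batch and age rows drop to zero. Data corrected this way is mainly useful for plots and clustering; for differential expression, the limma documentation recommends putting the uninteresting factors into the linear model itself rather than testing on corrected values.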
Iterative regression is probably a bad idea if the uninteresting factors are not all orthogonal to the interesting ones. Earlier iterations would not include all relevant factors in the model, resulting in biased estimates of the terms that were included. This becomes an issue if you then use those biased coefficients to regress out the corresponding effects. The variance in earlier iterations would also be inflated by the effects of the missing factors, which might interfere with ComBat's empirical Bayes shrinkage.
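The bias is easy to reproduce on a toy example. The sketch below uses plain least squares (not ComBat) and invented, noise-free numbers: one gene, eight samples, with disease status partly confounded with batch. The helper `fit` is hypothetical and exists only for this illustration.

```python
import numpy as np

# Toy data (invented): batch 1 holds 3 disease + 1 control,
# batch 0 holds 1 disease + 3 controls, so the factors are not orthogonal.
disease = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
batch = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
y = 5.0 + 2.0 * disease + 1.0 * batch  # true effects: disease 2, batch 1

def fit(columns, response):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones_like(response)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef

# Sequential: step 1 estimates batch with disease missing from the model,
# step 2 estimates disease from the "corrected" values.
_, batch_alone = fit([batch], y)
y_corrected = y - batch_alone * batch
_, disease_sequential = fit([disease], y_corrected)

# Joint: both factors are in the model at the same time.
_, disease_joint, batch_joint = fit([disease, batch], y)

print(round(batch_alone, 3))         # 2.0, batch effect overestimated (true 1)
print(round(disease_sequential, 3))  # 1.5, disease effect attenuated (true 2)
print(round(disease_joint, 3), round(batch_joint, 3))  # 2.0 1.0
```

Because disease is more frequent in batch 1, the batch-only fit absorbs part of the disease effect, and removing that inflated batch estimate then removes real disease signal. The joint fit recovers both simulated effects exactly.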
Thanks again! Sequential correction was suggested previously as a workaround for ComBat not being able to handle multiple batches, but I did not feel very confident about it.