Hello,
I am analyzing a longitudinal RNA-seq dataset with a large number of subjects (~2000) and several batch effects (technical covariates) that I would like to adjust for, in the context of an exploratory analysis of normalized log (pseudo)counts. The batch effects are both categorical (sequencing center, operator, library preparation date) and continuous (insert size, mean GC content, etc.). I have run removeBatchEffect from limma accounting for all these covariates and including timepoint in the design matrix:
# removeBatchEffect() accepts at most two factor batches (batch, batch2),
# so the third factor is encoded as dummy columns and passed via covariates
adj_tmm <- removeBatchEffect(log1p(tmm), batch = batch$a, batch2 = batch$b,
    covariates = cbind(model.matrix(~ batch$c)[, -1], as.matrix(batch[4:6])),
    design = design)
I am reproducing a previous analysis (not performed by me) that tests the correlation between the covariates and the principal components (PCs) calculated on the normalized values, in my case both before and after batch correction, since the association was reported to be significant for all the covariates considered. My doubt arises from the fact that after running removeBatchEffect there is no clear loss of significance or of correlation between the covariates and the post-adjustment PCs. However, the boxplots do appear centered, as would be expected from subtracting the batch effects, and comparing the heatmaps supports this. Is this behaviour expected, so that I can proceed with the analysis, or does the batch correction process require further investigation?
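For reference, this is roughly how I test the PC-covariate associations (a minimal sketch; adj_tmm is the corrected matrix from above, batch is the data.frame of covariates, and the number of PCs kept is an arbitrary choice):

pca <- prcomp(t(adj_tmm))
pcs <- pca$x[, 1:10]
# one p-value per PC/covariate pair: Pearson correlation for continuous
# covariates, one-way ANOVA for categorical ones
assoc_p <- sapply(batch, function(cov) {
  apply(pcs, 2, function(pc) {
    if (is.numeric(cov)) cor.test(pc, cov)$p.value
    else anova(lm(pc ~ cov))[["Pr(>F)"]][1]
  })
})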
I am not a biostatistician, but for comparison I fitted a basic lm() to the same count matrix, accounting for all the covariates, and then tested their correlation with the residuals of the model: there the correlation is no longer significant (at least for the continuous covariates).
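Concretely, the lm() comparison was done along these lines (again a sketch; it assumes batch contains only the technical covariates):

# per-gene regression on all technical covariates; rows of log1p(tmm) are genes
# note: unlike the removeBatchEffect() call above, no timepoint term is included here
resid_mat <- t(apply(log1p(tmm), 1, function(y) residuals(lm(y ~ ., data = batch))))
# resid_mat is then used for PCA and the same covariate association test as above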
Thank you for the help,
Massimo
This is not a classical "batch" in the common sense; at least I have never seen it corrected for explicitly. In my head this is nested within the library prep batches. Maybe estimate surrogate variables with something like sva (rough sketch below) to see whether an early SV captures the relevant unwanted variation, rather than trying to correct for it individually. The advantage is that early SVs can capture both the known batches and other hidden unwanted variation that cannot be assigned to a variable a priori. Especially with human data this is often necessary. Longitudinal data are unfortunately riddled with inevitable batches, so SVs might help dampen this a bit. The downside is that the number of SVs is arbitrary and must be chosen by the user, but I would at least explore this option to see how the observed variation compares to the known covariates.
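A minimal sketch of what I mean, assuming timepoint is the biological variable of interest and lives in a pheno data.frame (adapt the models to your actual design):

library(sva)
mod  <- model.matrix(~ timepoint, data = pheno)  # full model: biology to protect
mod0 <- model.matrix(~ 1, data = pheno)          # null model: intercept only
svobj <- sva(log1p(tmm), mod, mod0)              # estimates the surrogate variables
# correlate svobj$sv with the known covariates to see what each SV captures;
# the SVs can also be passed to removeBatchEffect() via covariates for visualization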