Question

Check removeBatchEffect effectiveness

0

Entering edit mode

massimo • 0

@35c0d58e

Last seen 17 hours ago

Denmark

Hello,

I am analyzing a longitudinal RNA-seq dataset with a large number of subjects (~ 2000) with several batch effects (technical covariates) I would like to adjust for in the context of an exploratory analysis of normalized log (pseudo)counts. The batch effects are both categorical, such as sequencing center, operator, library preparation date, and continuous (insert size, mean GC, etc...). I have run removeBatchEffect from limma accounting for all these covariates and including timepoint in the design matrix.

adj_tmm <- removeBatchEffect(log1p(tmm), batch=batch$a, batch2=batch$b, batch3=batch$c, covariates=batch[4:6], design)

I am reproducing a previous analysis not performed by me to test the correlation between the covariates and the principal components (PCs) calculated on the normalized values, in my case before and after batch correction, since the association was reported to be significant with all the covariates taken into account. My doubt arises from the fact that after performing removeBatchEffect there is not a clear trend of loss of significance and/or correlation between the covariates and post-adjustment PCs. However, there seems to be a centering of the boxplots as should it be expected from the subtraction of batch effects, supported also from comparing the heatmaps. Is this behaviour expected and can proceed with the analysis or further investigation is required about the batch correction process?

I am not a biostatistician but for comparison, by fitting a basic lm() function modelling the same count matrix and accounting for all the covariates and then comparing their correlation with the residuals of the model, there is no more a significant correlation (at least with the continous covariates).

Thank you for the help Massimo

RNAseq limma removeBatchEffect • 86 views

ADD COMMENT • link updated 1 hour ago by ATpoint ★ 4.8k • written 18 hours ago by massimo • 0

0

Entering edit mode

insert size, mean GC, etc...

This is not a classical "batch" in the common sense, at least I've never seen this to be corrected for explicitely. In my head this is nested with library prep batches. Maybe estimate surrogate variables with something like sva to see if an early SV captures the relevant unwanted variation rather than trying to correct for this individually. The advantage of this is that early SVs might capture both known batches and other hidden unwanted variation that cannot be assigned to a variable a priori. Especially with human data this is often necessary. Longitudinal data are unfortunately riddled by inevitable batches, so SVs might help dampening this a bit. The downside is that the number of SVs is arbitrary and needs to be defined by the user, but I would at least explore this option to see how observed variation compares to known covariates.

ADD REPLY • link 1 hour ago ATpoint ★ 4.8k

score 0 · Answer 1 · 2025-04-28

Your call to removeBatchEffect() is not correct because that function does not have an argument called batch3, meaning that the batch factor `batch$c$ has been ignored. If you have many batch effects, then the correct way to proceed is

removeBatchEffect(y, covariates=design.batch, design=design.timepoint)

where design.batch is the columns of the full design matrix corresponding to batch effects or covariates and design.timepoint is the design matrix for the conditions of interest, that you would have if there were no batch effects or covariates. With this approach, you can adjust for any number of batch factors. For the categorical batch effects, design.batch should contain indicator variables.

In any case, removeBatchEffect() is designed to remove batch effects while preserving the treatment differences of interest. If the covariates are correlated with the treatment conditions (timepoint in this case), then it is possible for the adjusted log-expression values to still have some correlation with the covariates. This is intended behaviour.