Question

Using both voom and removeBatchEffect, will they double adjust the data or is it the right way to perform this

0

Entering edit mode

sukeshinik5 ▴ 20

@8cebf978

Last seen 6 months ago

India

Hello,

In RNA sequencing data, I want to adjust for the covariates such as age, BMI, sex etc. I have created colData with these covariates and factorised (for binary) and scaled (for continuous) them. My concern is I have normalised the data and then applying adjustment. However, during adjustment using both voom and removeBatchEffect, will they double adjust the data or is it the right way to perform this. Please see below the functions used.

design_matrix <- model.matrix(~Age + Gender + BMI + 
                                Batch + Condition, data=colData)

v <- voom(raw_counts, design_matrix, plot=TRUE)

fit <- lmFit(v, design_matrix)

adjusted_counts <- removeBatchEffect(v$E, covariates=phenodata[, c("Age", "BMI", "Batch")]).

This is because I wish to use this matrix for correlation studies and not for differential gene expression analysis.

Is there any other way to adjust for covariates?

Thank you,

limma adjustment voom removeBatchEffect covariates • 1.4k views

ADD COMMENT • link 10 months ago sukeshinik5 ▴ 20

Gordon Smyth · Accepted Answer · 2024-06-25

3

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 14 hours ago

WEHI, Melbourne, Australia

The primary purpose of the voom() function is to compute precision weights for the downstream differential expression analysis. vooma() also does some normalization but the quantities that it produced are designed to work in conjuction with the weights. It is not intended as a normalization function for other purposes. Since you are not planning to use the voom precision weights in your downstream analysis, there is no purpose in using the voom() function at all.

If you want to export batch-corrected quantities from limma and edgeR, and the purpose is to preserve the effects of Condition, we recommend:

logCPM <- cpm(raw_counts, log=TRUE)
design.Covariates <- model.matrix(~Age+Gender+BMI+Batch)
logCPM.adjusted <- removeBatchEffect(logCPM, group=Condition, covariates=design.Covariates[,-1])

Note that the call to removeBatchEffect() in your question is not correct because you are passing a non-numeric matrix to the covariates argument.

The reason by cpm() with log=TRUE is better than voom() for your purpose is that it applies a larger prior count to the data, damping down the variability of the low log-count values and making the resulting logCPM values more nearly homoscedastic. By contrast, the logCPM values from voom() have a strong mean-variance relationship, which is fine in a limma analysis because it is taken care of by the precision weights.

BTW, I am the author of the voom() and removeBatchEffect() functions.

ADD COMMENT • link 10 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much for the response and for the development of such crucial functions: voom() and removeBatchEffect().

I have factorized the covariates so they remain in numeric form and then using them into design matrix. Something like the following,

colData <- data.frame(Condition = phenodata$Status,
                      SampleName = rownames(phenodata), 
                      Age = phenodata$Age, 
                      Gender= as.factor(phenodata$Gender), 
                      BMI = phenodata$BMI, 
                      Batch = phenodata$Batch)

factorize variables to keep it numeric

colData$Gender <- as.factor(colData$Gender)
label_mapping<-c("Female" = 1, "Male" = 2)
colData$Gender <- label_mapping[colData$Gender] # 1=Female 2=Male
colData$Condition <- as.factor(colData$Condition)
colData$Condition <- relevel(colData$Condition, ref = "HC")
colData$Batch <- as.factor(colData$Batch)
Batch_mapping<- c("Batch_1" =1, "Batch_2" = 2, "Batch_3"= 3)
colData$Batch <- Batch_mapping[colData$Batch] # 1, 2, 3

I am not sure if age and BMI should be scaled or factorized.

ADD REPLY • link updated 10 months ago by Gordon Smyth 52k • written 10 months ago by sukeshinik5 ▴ 20

1

Entering edit mode

Ah, no, that's not right. You have misinterpreted my comment about passing a non-numeric matrix to the covariate argument.

You don't need to recode or scale any of the variables. Just use the code exactly as in my answer with the variables as they already are in your phenodata or colData data.frames. The model.matrix function will convert the factors into a numeric matrix in the correct way. You do not need to recode the factors to numeric yourself, nor should you do so.