Question

Removing batch effect from gene expression before Limma linear modelling

0

Entering edit mode

abelsousa • 0

@abelsousa-14090

Last seen 5.1 years ago

UK/Cambridge/EMBL-EBI

Hello,

I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.

Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:

design <- model.matrix(~ ethnicity + age + tissue_type)
(tissue_type: factor with tumour or normal levels)

However, it happens that the samples come from two different research projects, like this:

Data Tissue
Project1 Tumour
Project1 Normal (15% of the normal samples)
Project2 Normal (85% of the normal samples)

As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).

So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.

After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.

After this I applied the standard limma pipeline for differential expression analysis:

# fit the linear model for each gene
fit <- lmFit(gene_expression_corrected, design)

#empirical bayes statistics for differential expression
fit <- eBayes(fit)

# extract summary and gene table with statistics
diff_expression_summary <- summary(decideTests(fit))
diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, sort.by = "p", p = 1)

My question is: Is this approach correct?

Thanks in advance for the help.

Abel

Limma differential gene expression batch effect correction covariates • 2.0k views

ADD COMMENT • link updated 6.9 years ago by Gordon Smyth 52k • written 6.9 years ago by abelsousa • 0

score 0 · Answer 1 · 2018-06-10

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 11 minutes ago

WEHI, Melbourne, Australia

No, that approach is not correct, as has been pointed out before on this support site. You need to include the batch factor in the linear model rather than attempt to remove the batch before hand.

The help page for removeBatchEffect() says that it is intended to prepare data for plotting, not for input to linear modeling.

If there is strong batch effect between projects 1 and 2, and if project 1 is the only one to include both normals and tumours, then project 2 will provide only limited additional information. The main contribution of project 2 will be to improve the estimates of the coefficients for the ethnicity and age covariates. Your analysis as it stands risks over-stating the evidence you have for comparing normals and tumours.

ADD COMMENT • link 6.9 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you Professor Gordon.

In fact, I compared the results of one analysis where I included the batch factor in the linear model with other analysis where I removed the samples from project 2, and the genes that appear to be differentially expressed between tumour and normal are mostly the same.

I just thought in adding these normal samples from project 2 because I have many more tumour samples than normal samples in project 1. But, anyway, probably I am going to remove these normal samples from the analysis.

ADD REPLY • link 6.9 years ago abelsousa • 0