Removing batch effect from gene expression before Limma linear modelling
1
0
Entering edit mode
abelsousa • 0
@abelsousa-14090
Last seen 4.8 years ago
UK/Cambridge/EMBL-EBI

Hello,

 

I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.


Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:


design <- model.matrix(~ ethnicity + age + tissue_type)
(tissue_type: factor with tumour or normal levels)

 

However, it happens that the samples come from two different research projects, like this:

Data    Tissue
Project1    Tumour
Project1    Normal (15% of the normal samples)
Project2    Normal (85% of the normal samples)

 

As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).

 

So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.


After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.


After this I applied the standard limma pipeline for differential expression analysis:

# fit the linear model for each gene
fit <- lmFit(gene_expression_corrected, design)

#empirical bayes statistics for differential expression
fit <- eBayes(fit)

# extract summary and gene table with statistics
diff_expression_summary <- summary(decideTests(fit))
diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, sort.by = "p", p = 1)

 

My question is: Is this approach correct?

Thanks in advance for the help.

 

Abel

Limma differential gene expression batch effect correction covariates • 1.9k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia

No, that approach is not correct, as has been pointed out before on this support site. You need to include the batch factor in the linear model rather than attempt to remove the batch before hand.

The help page for removeBatchEffect() says that it is intended to prepare data for plotting, not for input to linear modeling.

If there is strong batch effect between projects 1 and 2, and if project 1 is the only one to include both normals and tumours, then project 2 will provide only limited additional information. The main contribution of project 2 will be to improve the estimates of the coefficients for the ethnicity and age covariates. Your analysis as it stands risks over-stating the evidence you have for comparing normals and tumours.

ADD COMMENT
0
Entering edit mode

Thank you Professor Gordon.

In fact, I compared the results of one analysis where I included the batch factor in the linear model with other analysis where I removed the samples from project 2, and the genes that appear to be differentially expressed between tumour and normal are mostly the same.

I just thought in adding these normal samples from project 2 because I have many more tumour samples than normal samples in project 1. But, anyway, probably I am going to remove these normal samples from the analysis.

ADD REPLY

Login before adding your answer.

Traffic: 523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6