Question

limma: same data in different formats produces different DEG-analysis results

0

Entering edit mode

littlefishes20 ▴ 10

@littlefishes20-13054

Last seen 17 months ago

Taiwan

Background:

Sample Information: 8 groups of cells (cell1,cell2,cell3,cell4,cell5,cell6,cell7,cell8), 4 replicates each group, 32 samples in total.

Comparison we are interested in: cell1-vs-cell2, cell3-vs-cell4,cell5-vs-cell6,cell7-vs-cell8

*Pay attention: sample 1,2 and sample 3,4 in each sample are sequenced in different platform, in an order word, there are batch effect in the samples

I analysed the data twice, with the same operation but with different input. First, I inputed all the data of 32 samples. In the second analysis, I inputed the data only with cell7 and cell8 (8 samples in total). I want to get DEGs of cell7-vs-cell8. In the first time, I got 266 DEGs, while I got none in the second time. It confused me a lot, it shouldn't be the result like that.

I hope someone would offer me some ideas. From all of my heart, sincerely.

limma • 1.7k views

ADD COMMENT • link 7.5 years ago • updated 7.4 years ago littlefishes20 ▴ 10

0

Entering edit mode

How did you evaluate the comparison between cell7 and cell8 in the first analysis? Did you use constrasts?
How have you corrected the batch effect in the second analysis and in the first?

ADD REPLY • link 7.5 years ago Lluís Revilla Sancho ▴ 760

0

Entering edit mode

Thanks for quick response!

The answer to the first question is YES, I made the comparison by using the function makeContrasts.

I didn't remove the batch effect in both two analysis, instead, I specified the batch effect in the linear model, like design <- model.matrix(~0+group+batch)

ADD REPLY • link 7.5 years ago littlefishes20 ▴ 10

0

Entering edit mode

The code of my twice analysis

The first analysis:

require(limma)
library(edgeR)
##read in count data and remove genes that are not expressed##
rawx <- read.table("config.readcount.txt")
x <-DGEList(counts=rawx[,1:24],lib.size=colSums(rawx[,1:24]),genes=rownames(rawx),remove.zeros=TRUE)

##organising sample information##
samplenames <- substring(colnames(x),1,nchar(colnames(x)))
colnames(x) <- samplenames
group <- as.factor(c("Cell1","Cell1","Cell1","Cell1","Cell2","Cell2","Cell2","Cell2","Cell3","Cell3","Cell3","Cell3","Cell4","Cell4","Cell4","Cell4","Cell5","Cell5","Cell5","Cell5","Cell6","Cell6","Cell6","Cell6","Cell7","Cell7","Cell7","Cell7","Cell8","Cell8","Cell8","Cell8"))
x$samples$group <- group
batch <- as.factor(c("Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2"))
x$samples$batch <- batch

##data pre-processing#
cpm <- cpm(x)
lcpm <- cpm(x,log=TRUE)
keep.expres <- rowSums(cpm>1)>=4
x <- x[keep.expres,,keep.lib.sizes=FALSE]

##differential expression analysis##
design <- model.matrix(~0+group+batch)
colnames(design) <- gsub("group","",colnames(design))
contr.matrix <- makeContrasts(
Cell1vsCell2 = Cell1 – Cell2,
Cell3vsCell4 = Cell3 – Cell4,
Cell5vsCell6 = Cell5 – Cell6,
Cell7vsCell8 = Cell7 – Cell8,
levels = colnames(design))

v <- voom(x,design,plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
summary(decideTests(efit))
q()

The second analysis:

require(limma)
library(edgeR)
##read in count data and remove genes that are not expressed##
rawx <- read.table("config.readcount.c1.txt")
x <-DGEList(counts=rawx[,1:8],lib.size=colSums(rawx[,1:8]),genes=rownames(rawx),remove.zeros=TRUE)

##organising sample information##
samplenames <- substring(colnames(x),1,nchar(colnames(x)))
colnames(x) <- samplenames
group <- as.factor(c("Cell1","Cell1","Cell1","Cell1","Cell2","Cell2","Cell2","Cell2"))
x$samples$group <- group
batch <- as.factor(c("Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2"))
x$samples$batch <- batch

##data pre-processing#
cpm <- cpm(x)
lcpm <- cpm(x,log=TRUE)
keep.expres <- rowSums(cpm>1)>=4
x <- x[keep.expres,,keep.lib.sizes=FALSE]

##differential expression analysis##
design <- model.matrix(~0+group+batch)
colnames(design) <- gsub("group","",colnames(design))
contr.matrix <- makeContrasts(
Cell1vsCell2 = Cell1 – Cell2,
levels = colnames(design))

v <- voom(x,design,plot=TRUE,save.plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
summary(decideTests(efit))
q()

ADD REPLY • link 7.5 years ago littlefishes20 ▴ 10

score 3 · Answer 1 · 2017-06-19

3

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 7 weeks ago

Icahn School of Medicine at Mount Sinai…

You didn't provide the code that you used for the analyses, so it's impossible to give you an exact answer, but the most likely reason for the discrepancy is that limma uses all samples in the model to estimate the variance, so with 32 samples, the variance estimate is much more precise than with only 8 samples. Since the variance is estimated more precisely, limma can more confidently call DEGs. Chances are that the logFC values in both analyses are very similar, as is the rank order of the genes by significance. With methods based on (generalized) linear models such as limma, edgeR, or DESeq2, it's recommended to always fit the full model with all samples included for this reason.

Assuming this is your problem, if you search these forums, you'll find many already-answered questions with the same answer.

ADD COMMENT • link 7.5 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks very much for your help, your answer really help!

I can understand what you means. However, to be honest, I found this problem just because I have got the total 32 samples' data. The most important problem puzzles me is what happens if I have only the data of cell7 and cell8? I would find no difference in this comparison? Doesn't it sound so strange?

The codes are listed as follows.

ADD REPLY • link 7.5 years ago littlefishes20 ▴ 10

3

Entering edit mode

With the full data, you have 24 - 9 = 15 degrees of freedom. With the subset of data, you have 8 - 3 = 5 degrees of freedom. This means that when you use all of the data, you're estimating the variances with much more precision. This is especially true when the prior degrees of freedom is low and there's not much information to share between genes.

As Ryan said, the precision of the variance estimate will have an effect on your downstream tests. If you don't have a precise estimate of the variance, you cannot be sure that a log-fold change is genuine or just due to random variability. This means that limma cannot give you a low p-value, as there is no strong evidence against the null hypothesis.

Obviously if you give limma more data, it will give you different (usually better) results; this is not particularly strange.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k

score 1 · Answer 2 · 2017-06-22

"The hallmark of the limma approach is the use of linear models to analyse entire experiments as an integrated whole rather than making piece-meal comparisons between pairs of treatments. This has the effect of sharing information between samples. Analysing the data as a whole also allows us to model correlations that may exist between samples due to repeated measures or other causes. This kind of analysis would not be feasible were the data partitioned into subsets and analysed as a series of pairwise comparisons."

-Cited from paper "limma powers differential expression analyses for RNA-sequencing and microarray studies"

It really helps if I used the dataset of all samples.

I thinks it means the sample from the same tissue, what happens if I want to figure out the difference in one same tissue with different phenotype while I have the data of many samples (such as 3 different samples)? Should I use all of them to fit the linear model?