Incredibly high/low foldChange
2
2
Entering edit mode
jovel_juan ▴ 30
@jovel_juan-7129
Last seen 6 months ago
Canada

I conducted differential expression analysis with DESeq2. I have brain cells from three different subjects, half of cells were treated and half were control. We constructed RNAseq libraries in triplicate. So, in total 18 libraries. I analyzed data with the following model:

ddsMat <- DESeqDataSetFromMatrix(countData = countdata, colData = sampleInfo, design =~ group + subject

ddsMat <- DESeq (ddsMat)
ddsMat <- DESeq(ddsMat, test="LRT", reduced = ~ group)

I got > 24,000 transcript deregulated (that's a lot!). More importantly, the fold change for some transcripts is exorbitantly high or low. One example follows as illustration:

Transcript:ENST00000398706.6

baseMean: 70,42

log2FoldChange: > 25

FoldChange: 39,726,699

Such results come from the following raw data (quantification conducted with Salmon):

Name    donor1_mock1    donor1_mock2    donor1_mock3    donor1_treated1 donor1_treated2 donor1_treated3 donor4_mock1    donor4_mock2    donor4_mock3    donor4_treated1 donor4_treated2 donor4_treated3 donor5_mock1    donor5_mock2    donor5_mock3    donor5_treated1 donor5_treated2 donor5_treated3
ENST00000398706.6       0       0       0       0       0       0       0       0       3       3       0       293     16      265     242     157     1       4

Are this results even possible? The Foldchange of millions does not seem to coincide with the numerical values in the raw data.

Thanks!

 

deseq2 • 2.2k views
ADD COMMENT
0
Entering edit mode

Thanks Steve!

Yes, I also think that the high foldChange comes from the zeroes. To test that, I added 1 to the whole table count before DESeq2 analysis, and the foldChage of :ENST00000398706.6 becomes only ~ 315.

The problem is that Salmon reports abundance for many transcripts as a fraction of one i.e. < 1. When I round up, many of those become zero.

Is it statistically valid to add 1 to my count matrix before DESeq2 analysis? The results change both, quantitatively and qualitatively.

ADD REPLY
0
Entering edit mode

We have a procedure for keeping the LFC from going toward Inf, and so they are capped around +/- 20. This isn't a problem at all to have an undefined MLE LFC. You should be using lfcShrink() to produce accurate and bounded LFC estimates, please see the vignette for code examples.

You shouldn't add 1 to the count matrix. Please instead use the Bayesian procedure in lfcShrink().

ADD REPLY
0
Entering edit mode

Thanks Mike. Makes sense to me to use the lfcShrink procedure. I will play with that. Cheers!

ADD REPLY
0
Entering edit mode

I tried reanalyzing my data with the lfcShrink function. While it solves the problem of getting HUGE foldchanges, it clearly is not the way to go. Instead of 24,000 deregulated genes (FC > 2), I know get about 1,500. I loose a lot of information.

It is a good option to plot the data (e.g. in an MA plot), but not a good option for subsequent analysis like pathway analysis.

ADD REPLY
0
Entering edit mode

It's not true that you would lose any information using lfcShrink. It is simply giving you an alternate, more accurate LFC estimate than the maximum likelihood estimate.

I don't know what your purpose is here or what you have done after lfcShrink. Do you want to find the largest LFC, or any gene that shows differential expression at all?

Are you performing an ad hoc filtering step? Can you please post your code?

ADD REPLY
0
Entering edit mode

Hi Mike,

How I do not loose information if the number of differentially expressed transcripts is reduced from >24,000 to ~ 1,800?

Here is the relevant code:

## DIFFERENTIAL EXPRESSION
# This single command performs differential expression analysis
ddsMat <- DESeq (ddsMat)
ddsMat <- DESeq(ddsMat, test="LRT", reduced = ~ group)

resultsNames(ddsMat)

# We can extract all results and store them in "res"
res <- results(ddsMat)
resLFC <- lfcShrink(ddsMat, coef = "group_zika_vs_mock", type = "ashr")

# Extract all gene records in the results that have a p-value smaller
# than 0.05
res.05 <- results(ddsMat, alpha=0.05)

# ...and from those ones, extract only the ones that had an FDR smaller
# than 0.05
table(res.05$padj < 0.05)

# Visualize how many results have a p-value smaller than 0.05
sum(res$pvalue < 0.05, na.rm=TRUE)

# Visualize how many results have a FDR smaller than 0.05
sum(res$padj < 0.05, na.rm=TRUE)

resSig <- subset(resLFC, padj < 0.05)
head(resSig[ order( resSig$log2FoldChange ), 1])

# Export significantly differentially expressed results to a text file
write.table(resSig, file = outFile, sep = "\t", eol = "\n")

 

ADD REPLY
0
Entering edit mode

You should have the same number of padj < .05 in res and resLFC.

Note that results(dds, alpha=.05) does not do any subsetting. The argument 'alpha' tells results() that the independent filtering should have a target FDR of 0.05. To subset, you should use 'alpha' and then use the subset() command.

These should have the same number of rows:

subset(res, padj < .05)

subset(resLFC, padj < .05)
ADD REPLY
0
Entering edit mode

Thanks Mike!

I was sub-setting the shrinkaged fold change correctly (padj < 0.05). My mistake was subsequently, parsing only the transcripts that have a |log2FC| >= 1. I should obviously not do that when applying the function lfcShrink.

Thanks for your help. 

ADD REPLY
0
Entering edit mode

Sorry, another question.

Is it necessary to consolidate Salmon outputs to the gene level, or analysis transcripts abundance is also OK in DESeq2?

I would, obviously, prefer to analyse the data at the level of transcripts.

ADD REPLY
0
Entering edit mode

We've been looking into this in the lab. DESeq2 and other gene-level methods like edgeR and limma-voom perform reasonably well across various sample sizes on transcript-level but there is also room for improvement. You will find the DE transcripts, but you might consider a lower FDR threshold to be more conservative with the list.

ADD REPLY
2
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States

A 39e6 fold change seems crazy ... but, of course, if a gene has all zero counts in one group, and some counts in another than the fold change is infinite, right?

Whatever the case may be, it does warrant further inquiry.

This first thing you should do is try to visualize your data. DESeq2 has a "plotCounts" function that can help you get started: you want to look at the expression level of these suspicious genes across your groups, which is exactly what this function does for you.

ADD COMMENT
0
Entering edit mode
jovel_juan ▴ 30
@jovel_juan-7129
Last seen 6 months ago
Canada

Thanks Mike. That helps a lot!

ADD COMMENT

Login before adding your answer.

Traffic: 911 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6