I conducted differential expression analysis with DESeq2. I have brain cells from three different subjects; half of the cells were treated and half served as controls. We constructed RNA-seq libraries in triplicate, for a total of 18 libraries. I analyzed the data with the following model:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata, colData = sampleInfo, design = ~ group + subject)
ddsMat <- DESeq(ddsMat)
ddsMat <- DESeq(ddsMat, test = "LRT", reduced = ~ group)
I got > 24,000 deregulated transcripts (that's a lot!). More importantly, the fold changes for some transcripts are exorbitantly high or low. One example as illustration:
Transcript:ENST00000398706.6
baseMean: 70.42
log2FoldChange: > 25
FoldChange: 39,726,699
Such results come from the following raw data (quantification conducted with Salmon):
Name donor1_mock1 donor1_mock2 donor1_mock3 donor1_treated1 donor1_treated2 donor1_treated3 donor4_mock1 donor4_mock2 donor4_mock3 donor4_treated1 donor4_treated2 donor4_treated3 donor5_mock1 donor5_mock2 donor5_mock3 donor5_treated1 donor5_treated2 donor5_treated3
ENST00000398706.6 0 0 0 0 0 0 0 0 3 3 0 293 16 265 242 157 1 4
Are these results even possible? A fold change in the millions does not seem to match the numerical values in the raw data.
Thanks!
Thanks Steve!
Yes, I also think that the high fold change comes from the zeroes. To test that, I added 1 to the entire count table before the DESeq2 analysis, and the fold change of ENST00000398706.6 became only ~315.
The problem is that Salmon reports the abundance of many transcripts as a fraction of one, i.e. < 1. When I round, many of those become zero.
Is it statistically valid to add 1 to my count matrix before the DESeq2 analysis? The results change both quantitatively and qualitatively.
We have a procedure for keeping the LFC from going toward Inf, so they are capped around +/- 20. Having an undefined MLE LFC is not a problem at all. You should be using lfcShrink() to produce accurate and bounded LFC estimates; please see the vignette for code examples.
You shouldn't add 1 to the count matrix. Please instead use the Bayesian procedure in lfcShrink().
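A minimal sketch of the recommended route, reusing the object names from the question above (the coefficient name is illustrative here and should be taken from the output of resultsNames()):

```r
library(DESeq2)

# Fit the model as before (objects countdata / sampleInfo as in the question)
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData   = sampleInfo,
                                 design    = ~ group + subject)
ddsMat <- DESeq(ddsMat)
resultsNames(ddsMat)  # find the coefficient of interest

# Shrunken, bounded LFC estimates instead of the raw MLE
# (coefficient name is an assumption; use one listed by resultsNames())
resShrunk <- lfcShrink(ddsMat, coef = "group_treated_vs_mock", type = "ashr")
```

The shrinkage pulls noisy LFCs (e.g. from genes with many zero counts) toward zero, so no pseudocount is needed.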
Thanks Mike. Makes sense to me to use the lfcShrink procedure. I will play with that. Cheers!
I tried reanalyzing my data with the lfcShrink function. While it solves the problem of getting huge fold changes, it clearly is not the way to go: instead of 24,000 deregulated genes (FC > 2), I now get about 1,500. I lose a lot of information.
It is a good option for plotting the data (e.g. in an MA plot), but not for subsequent analyses like pathway analysis.
It's not true that you would lose any information using lfcShrink. It is simply giving you an alternate, more accurate LFC estimate than the maximum likelihood estimate.
I don't know what your purpose is here or what you have done after lfcShrink. Do you want to find the largest LFC, or any gene that shows differential expression at all?
Are you performing an ad hoc filtering step? Can you please post your code?
Hi Mike,
How do I not lose information if the number of differentially expressed transcripts is reduced from >24,000 to ~1,800?
Here is the relevant code:
## DIFFERENTIAL EXPRESSION
# These commands perform the differential expression analysis
ddsMat <- DESeq(ddsMat)
ddsMat <- DESeq(ddsMat, test = "LRT", reduced = ~ group)
resultsNames(ddsMat)
# We can extract all results and store them in "res"
res <- results(ddsMat)
resLFC <- lfcShrink(ddsMat, coef = "group_zika_vs_mock", type = "ashr")
# Extract all gene records in the results that have a p-value smaller
# than 0.05
res.05 <- results(ddsMat, alpha=0.05)
# ...and tabulate how many of those have an FDR smaller than 0.05
table(res.05$padj < 0.05)
# Count how many results have a p-value smaller than 0.05
sum(res$pvalue < 0.05, na.rm = TRUE)
# Count how many results have an FDR smaller than 0.05
sum(res$padj < 0.05, na.rm = TRUE)
resSig <- subset(resLFC, padj < 0.05)
head(resSig[order(resSig$log2FoldChange), 1])
# Export significantly differentially expressed results to a text file
write.table(resSig, file = outFile, sep = "\t", eol = "\n")
You should have the same number of padj < .05 in res and resLFC.
Note that results(dds, alpha=.05) does not do any subsetting. The argument 'alpha' tells results() that the independent filtering should have a target FDR of 0.05. To subset, you should use 'alpha' and then use the subset() command.
These should have the same number of rows:
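For example, using the objects from the code above (res from results() and resLFC from lfcShrink(); both carry the same tests and adjusted p-values, only the LFC column differs):

```r
# Same genes, same padj; only the LFC estimates differ
table(res$padj < 0.05)
table(resLFC$padj < 0.05)

# To actually subset significant genes, set 'alpha' in results()
# and then subset on padj, rather than tabulating
res.05 <- results(ddsMat, alpha = 0.05)
sig <- subset(res.05, padj < 0.05)
nrow(sig)
```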
Thanks Mike!
I was subsetting the shrunken fold changes correctly (padj < 0.05). My mistake was subsequently keeping only the transcripts with |log2FC| >= 1. I obviously should not do that when applying lfcShrink.
Thanks for your help.
Sorry, another question.
Is it necessary to consolidate the Salmon output to the gene level, or is analyzing transcript abundances in DESeq2 also OK?
I would, obviously, prefer to analyse the data at the transcript level.
We've been looking into this in the lab. DESeq2 and other gene-level methods such as edgeR and limma-voom perform reasonably well at the transcript level across various sample sizes, but there is also room for improvement. You will find the DE transcripts, but you might consider a lower FDR threshold to be more conservative with the list.
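For reference, both routes from Salmon into DESeq2 go through tximport; a sketch, assuming a named vector of paths to the quant.sf files (files) and a two-column transcript-to-gene data frame (tx2gene), both hypothetical here:

```r
library(tximport)
library(DESeq2)

# Gene-level: sum transcript abundances per gene via tx2gene
txi.gene <- tximport(files, type = "salmon", tx2gene = tx2gene)

# Transcript-level: keep Salmon's transcript quantifications as-is
txi.tx <- tximport(files, type = "salmon", txOut = TRUE)

# Either object can be passed to DESeq2 (sampleInfo as in the question)
dds <- DESeqDataSetFromTximport(txi.gene,
                                colData = sampleInfo,
                                design  = ~ group + subject)
```

Using tximport rather than rounding the Salmon TPM/count estimates yourself also avoids the fractional-count problem mentioned above.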