Hi,
I am new to RNASeq and learning by reading the tutorials for DESEq2. When I ran my first DE analysis, my Log2FC range from -26 to 26, which I hadn't seen before. When I mentioned this fact to experienced bioinformaticians in my company, they all agreed that this was odd; however, no one has the time to take a look at my data. Reading more about how DESeq2 calculates LFC and looking at the genes which have this large LFCs I realized this is due to one condition having many reads (e.g. ~80 - normalized counts) while the other condition has 0 reads. I came across a post in Bioconductor that suggested using betaPrior=TRUE while running DESeq and that reduced my LFC range (-3 to 3). What is the correct way to report the DE for these genes?
This is the code I used:
data <- read.table("gene-count.txt", header = T, row.names = 1, check.names=F)
meta <- read.table("meta/Meta_ordered.txt", header = T, row.names = 1)
# check that both dataset have the same samples and in the same order
all(rownames(meta) %in% colnames(data))
all(rownames(meta1) == colnames(data1))
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ gender + condition)
dds <- DESeq(dds)
res <- results(dds)
When I type res, I get the following:
> res log2 fold change (MLE): contion 2 vs 1 Wald test p-value: condition 2 vs 1 DataFrame with 35118 rows and 6 columns
Example of extreme LFC:
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
---|
gene1 |
27.1703929 |
-26.086127 |
2.3313637 |
-11.1892141 |
4.604968e-29 |
5.260255e-25 |
gene2 |
26.1996811 |
-26.085583 |
2.3469943 |
-11.1144639 |
1.066896e-28 |
8.124766e-25 |
gene3 |
36.8962467 |
-8.289083 |
3.6075064 |
-2.2977320 |
2.157705e-02 |
1.149870e-01 |
normalized count for these genes:
samples in condition 1 are labeled 1_# and samples in condition 2 are labeled 2_#
1_1 |
1_2 | 1_3 | 1_4 | 1_5 | 1_6 | 1_7 | 1_8 | |
gene1 |
80.03106 |
0 |
80.57861 |
89.56059 |
89.49031 |
66.46199 |
78.72004 |
0 |
gene2 |
74.94973 |
0 |
70.77851 |
74.40419 |
97.7193 |
80.17636 |
55.76003 |
0 |
gene3 |
287.0956 |
251.6573 |
0 |
345.8417 |
0 |
0 |
0 |
0 |
2_1 | 2_2 | 2_3 | 2_4 | 2_5 | 2_6 | 2_7 | 2_8 | |
gene1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
gene2 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
gene3 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
When I run DESeq with betaPrior = TRUE:
dds2 <- DESeq(dds1, betaPrior = TRUE)
res2 <- results(dds2)
log2 fold change (MAP): condition 2 vs 1 Wald test p-value: condition 2 vs 1 DataFrame with 35118 rows and 6 columns
the LFC for the same genes is now:
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
---|
gene1 |
27.17039 |
-0.2768540 |
0.1798314 |
-1.539520 |
0.1236774 |
0.3430296 |
gene2 |
26.19968 |
-0.2651210 |
0.1787147 |
-1.483487 |
0.1379451 |
0.3679503 |
gene3 |
36.89625 |
-0.1502297 |
0.1187575 |
-1.265012 |
0.2058670 |
0.4613731 |
But now my genes don't make the cutoff for being differentially expressed. Any guidance is appreciated.
Let me know if I need to provide more data or clarification.
Thanks!
Thanks, Mike! I guess I was looking at a tutorial that was for beginners but a bit old. I have found the latest version of the full vignette.
Hi Michael Love
Edit: I see that you are recommending setting svalue=TRUE for dealing with the low p, low lfc situations here, which works. However, apeglm doesn't seem to support contrasts. Do you have a suggestion for adjusting LFCs and pvalues (s-values) for contrasts?
Original question:
The lfcShrink solution indeed fixes the exaggerated LFC values but it has no effect on p-values as they are not re-calculated. So I am ending up with very small LFCs having very small p-values in a similar situation. For example LFC=0.04, p=2e-9. Do you have any suggestions on this?
In my case, the very large original LFCs seem to be stemming from samples that are in the DDS object but not in a given comparison. If I create a DDS only with samples in that comparison I do not see the same large LFCs. The second reason I am suspecting the samples that are not in the comparison is that lfcShrink with apeglm gives reasonable LFCs but ashr does not, and I think apeglm only considers the samples in the comparison while ashr considers all samples, although I might be mistaken in this.
I didn't want to start a new question because I think the issue is the same but I'd be happy to provide more details in a separate question if it is more appropriate.
Thanks in advance, Ozkan
You can use
lfcThreshold
, either inresults
or withashr
inlfcShrink
which can be used with contrasts.Michael Love Thanks for your response.
The problem is the artefactually inflated LFCs. While apeglm shrinks these LFCs, results or ashr does not. So passing a small (or even moderately large) lfcThreshold values to these functions does not solve the issue.
If ashr doesn't shrink these, are you sure they are artifact? It's hard to say for sure. Maybe filtering number of positive counts is another approach to help prioritize genes.