DESeq2 1.20.0 lfcshrink function
3
4
Entering edit mode
emi.sch ▴ 40
@emisch-16234
Last seen 5.2 years ago

Dear all,

I would appreciate your feedback on this one. I have been using DESeq2(version 1.20.0) for differential expression analysis comparing response in cancer patients. 

My question is about the lfcShrink() function. There have been some updates on the function from the previous versions of the package and I have read the documentation and many posts in here, but I have still some doubts on how I do my analysis.

I wanted to be strict strict so I set both thresholds for padj as well as log fold change:

padj.cutoff <- 0.05
lfc.cutoff <- 0.58

dds <- DESeq(dds)

res <- results(dds, contrast=c("response","PD","CR+PR"), 
               lfcThreshold = lfc.cutoff, alpha = padj.cutoff, 
               altHypothesis = "greaterAbs")

res_setThres <- lfcShrink(dds=dds, contrast=c("response","PD","CR+PR"), 
                          res=res, lfcThreshold = lfc.cutoff) 

res_noThres <- lfcShrink(dds=dds, contrast=c("response","PD","CR+PR"), res=res)

My question is: should I set the threshold for LFC inside the lfcShrink() function too? In the res_setThres the p-values have changed (as mentioned in the documentation) and I eventually get no DEGs based on these thresholds. Even when I increased padj to 0.1 I still got zero significant DEGs. My res_noThres table produces a small set of genes for these thresholds. Is it wrong if I use the lfcSchrink without setting the lfcThreshold and then subset my results table to those genes passing the filters? I know it is important  to set these thresholds in the results() function and not filter post-hoc, I was wondering if that applies to lfcshrink too.

Thank you!

deseq2 differential expression logfoldchange lfcshrink lfcthreshold • 8.3k views
ADD COMMENT
17
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Good question. I can see how it's a bit confusing, as we have two inference frameworks now, one more traditional one around p-values with BH adjustment, augmented with IHW, and a new one around posterior estimates of effect sizes and "s-values", which is interfaced via lfcShrink().

It will help me explain things to first walk through some example code. If we make some data, and run DESeq():

dds <- makeExampleDESeqDataSet(betaSD=1)
dds <- DESeq(dds)

We can test against an LFC threshold of 1, from results():

res <- results(dds, lfcThreshold=1)

And we can find a set of genes where we reject the null that |beta| < 1 with an FDR bound of 0.1 (or 0.05 as you have above, specifying alpha).

As you say, this is not the same as the set of shrunken LFC > 1, just because they are different statistical procedures:

lfc1 <- lfcShrink(dds, coef=2, type="apeglm") # apeglm is now recommended

Although there is some overlap:

> table(test=res$padj < .1, shrink=lfc1$log2FoldChange > 1)
       shrink
test    FALSE TRUE
  FALSE   426   69
  TRUE      5   15

And it's also not the same as setting lfcThreshold within lfcShrink, which produces "s-values", although note that this set is more similar:

> lfc2 <- lfcShrink(dds, coef=2, lfcThreshold=1, type="apeglm")
> table(test=res$padj < .1, svalue=lfc2$svalue < .005) # we use a smaller probability for s-value
       svalue
test    FALSE TRUE
  FALSE   486    9
  TRUE      0   20

Or we can use the older version of shrinkage, called "normal", which is more strict than the first set of results.

> lfc3 <- lfcShrink(dds, coef=2, lfcThreshold=1, type="normal")
> table(test=res$padj < .1, old=lfc3$padj < .1)
       old
test    FALSE TRUE
  FALSE   495    0
  TRUE      6   14

Comments:

"My question is: should I set the threshold for LFC inside the lfcShrink() function too? In the res_setThres the p-values have changed (as mentioned in the documentation) and I eventually get no DEGs based on these thresholds."

In your code you are using type="normal", which we have seen can shrink large effect sizes too much, and so when you then perform an LFC threshold test, it is being too conservative. We have a new procedure called "apeglm" which does better than "normal" across many datasets, in particular on the larger effect sizes.

If you want to stick with an FDR bounded set, and have the posterior estimates of LFC, I would recommend to use the following (where you can include your additional arguments in "..."):

res <- results(dds, lfcThreshold=1, ...)
res <- lfcShrink(dds, coef="response_PD_vs_CR+PR", type="apeglm", res=res)

This will keep the p-values and padj from the results() call, and simply update the LFCs so they are posterior estimates.

Alternatively, you can instead use the s-value framework, which produces a set where we bound the expected number of LFCs that will have a false sign or be smaller in magnitude than 1. I recommend to use a smaller probability threshold on the s-values produced here, such as 0.005:

lfc <- lfcShrink(dds, coef="response_PD_vs_CR+PR", lfcThreshold=1, type="apeglm")
ADD COMMENT
0
Entering edit mode

Thank you Michael for the quick reply. Your explanation was very helpful!

ADD REPLY
0
Entering edit mode

Hi Michael,

Thank you for your answer, but I'm still a little confused and have a question as described below:

By calling res <- results(dds, lfcThreshold=1, ...), we test whether the absolute LFC is larger than 1, but after shrinking LCF using res <- lfcShrink(dds, coef="response_PD_vs_CR+PR", type="apeglm", res=res), some genes with an original |LFC| > 1 and padj < 0.05 will now have a |LFC| less than 1. Are those genes still significant? Are we going to use the shrunken LFC and p/padj values for downstream analysis even though the p/padj values were based on the original LFC and do not match the shrunken LFC?

I really appreciate it if you would answer my question. Thank you very much. 

ADD REPLY
1
Entering edit mode

 You’re always going to find a few inconsistencies if you switch between the two frameworks, why not just stick with the Bayesian shrinkage framework then (skip the results function) 

dds <- DESeq(dds)

res <- lfcShrink(dds, coef=coef, lfcThreshold=1)

summary(res)

ADD REPLY
0
Entering edit mode

Thank you for your prompt response, Michael.

My concern is that res <- lfcShrink(dds, coef=coef, lfcThreshold=1)will provide s values, which are less commonly used than p/padj values. I'm afraid that this may cause some confusion during the peer review process. 

I'm considering skipping lfcShrink and just use LFC generated by the results function. Does this approach make sense to you? Thanks very much! 

ADD REPLY
3
Entering edit mode

It’s really up to you. I don’t see an inconsistency between a thresholded test of an unshrunken LFC along with a shrunken LFC, which in a minority of cases may be less than the threshold. They are different procedures and estimators for different purposes, both useful output in my opinion.

ADD REPLY
0
Entering edit mode

I was wrong. lfcShrink will produce padj values. However, lfcShrink doesn't allow for specifying alpha values other than 0.1, whereas the results function does. Can I do res <- lfcShrink(dds, coef=coef, lfcThreshold=1) and then filter significant genes by padj < 0.05? I ask because I saw you mentioned in another post that choosing a different value for alpha than the one you use to filter is sub-optimal.

Would you please tell me which approach would be better? Use results with alpha = 0.05 or use lfcShrink and then filter by padj=0.05? 

Thank you very much!

ADD REPLY
0
Entering edit mode

I see, there's no access to the 'alpha' argument of results() which is called inside lfcShrink. Although I have said elsewhere that it's optimal to pick the same 'alpha' that you threshold on, it's not that big of a deal between 5% and 10%. I would go with using lfcShrink and filtering on padj=.05.

ADD REPLY
0
Entering edit mode

Thank you very much, Michael.

ADD REPLY
0
Entering edit mode

Hi Michael, this post was super informative but I also have some questions. I am concerned about the large differences in DEG lists that I am getting when I use lfcThreshold=1 with lfcShrink (filtered by s-value < .005)

resNorm1 <- lfcShrink(dds, coef="Treat2", lfcThreshold = 1, type="apeglm" )

vs when I use the results file for lfcShrink (filtered by lfcThreshold=1, adj p < .1).

res <- results(dds, contrast = c("Treat", "Treat2", "control"), alpha = 0.1, pAdjustMethod="BH", lfcThreshold = 1, altHypothesis = "greaterAbs")
resNorm2 <- lfcShrink(dds, coef="Treat2", res=res, type="apeglm" )

I get 91 DEG for resNorm1 and 81 DEG for resNorm2, only 28 of which are shared between the two datasets. I understand that the two approaches differ in the way they estimate error, but is it common for them to produce such different results? The issue I have with the recommended second approach (resNorm2) is that it outputs many DEGs with log2FC < 1 (53 out of 81 DEGs have log2FC of 0.1 or below), which I find difficult to interpret. I was wondering if it was common to have this many low FCs after shrinking. I know it's hard to say which approach is the "right" one, but in this case, would the first approach (resNorm1) be more appropriate? My concern is that it identifies 63 DEGs that were not found with approach 1, and while I love having larger DEG lists I worry about having so many that are "unique". I hope my questions make sense. Thank you!

ADD REPLY
0
Entering edit mode

This seems like a case where apeglm shrinkage is stricter than null hypothesis testing of the the MLE effect sizes (even with the lfcThreshold). Can you look at the mean count of the genes that are significant by resNorm2 but have small shrunken LFC? E.g. with(resNorm2, hist(baseMean[padj < 0.1 & abs(log2FoldChange) < 0.1])). Do these tend to have small mean count? If so, one way to square these numbers would be to filter at the top of the script, e.g.:

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]

Where X is some minimal number of samples with a count of 10 or more.

Also, I'd be curious if the results are more overlapping with a less strict LFC threshold, e.g. log2(1.5), so asking for statistical evidence that the fold changes are more than 1.5 fold.

ADD REPLY
0
Entering edit mode

I did include a filtering step at the top: keep <- rowSums(counts(dds) >= 5) >= 3 but maybe my choice of 3 for X (since I had 3 biol reps) was too low? The genes that had super low log2FCs all had 0 mean counts in one of the 2 conditions being compared. I know there is a chance that some genes may have low expression in one condition and not at all in another condition but I don't usually feel very confident about these.

I tried using the lower LFC threshold of 0.585 (fold change = 1.5) but I'm getting fairly similar results. Shrinking using resNorm2 <- lfcShrink(dds, coef="Treat2", res=res, type="apeglm" ) results in 131 DEGs, of which 64 had LFC of 0.1 or below. Shrinking using resNorm1 <- lfcShrink(dds, coef="Treat2", lfcThreshold = 0.585, type="apeglm" ) gave 184 DEGs. All of the genes with LFCs > 0.1 from resNorm2 (n = 67) were also identified with resNorm1. However, resNorm1 gave an additional 117 DEGs. So it seems that setting the LFC threshold in the lfcShrink() function is less conservative than setting the threshold in the results() function and using that output (res) for lfcShrink(). That seems counterintuitive to me somehow.

enter image description here

ADD REPLY
1
Entering edit mode

Not that the s-value and the adjusted p-value control different types of error. Adjusted p-value controls, in aggregate, the rate of genes with LFC=0 being called significant. s-value controls, in aggregate, a gene with LFC < 0 being called as LFC > 0, and vice versa.

I think that the apeglm shrinkage is probably strong in this dataset due to limited number of replicates and high within-group variability, and so you may not want to apply shrinkage, but use lfcThreshold alone.

ADD REPLY
0
Entering edit mode

My concern with not using shrinkage is that otherwise all of the LFCs are very high, ranging from 4 to 25 and -6 to -30 (mainly for the samples that had zero counts). I suppose I can remove those DEGs from the analysis. Would an alternate approach be to apply more stringent filtering to remove all genes with zero counts, e.g.

keep <- rowSums(counts(dds) >= 1) >= X

where X is the total # of samples being analyzed?

ADD REPLY
0
Entering edit mode

Right, those are just genes (the ones going to >10) where one group has all zero counts.

I wouldn't remove the genes from the analysis. Either: go with the LFC and s-values, which will give you more conservative results (apeglm is shrinking a lot because low replicate count and high variability), or:

If you don't like to report those with a large LFC because it is undefined, you could easily report them separately with:

idx <- dds$condition == "X"
sigAndUndefined <- padj < T & 
  rowSums(counts(dds)[,idx] == 0) == sum(idx)

this will allow you to set aside those genes with an FDR < 0.1 and the LFC is not defined because the counts were zero in the reference group.

ADD REPLY
0
Entering edit mode
yusuf.zhc • 0
@yusufzhc-17735
Last seen 5.4 years ago

res1LFC <- lfcShrink(dds, coef="time12vs_0", type="apeglm", padj=.05)

ADD COMMENT
0
Entering edit mode
yusuf.zhc • 0
@yusufzhc-17735
Last seen 5.4 years ago

res1LFC <- lfcShrink(dds, coef="conditiontreatedvs_untreated", type="apeglm", padj=.05)

ADD COMMENT

Login before adding your answer.

Traffic: 549 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6