Dear Helpful Strangers,
I am running several different regressions on a rather complex dataset, using DESeq2_1.12.4. I have noticed that using betaPrior=T or betaPrior=F makes a huge difference.
As far as I understand, using betaPrior means that the log fold changes are shrunken, and these shrunken fold changes are used in the Wald test. Various posts here on Bioconductor suggest that using betaPrior or not should not really influence your p-values that much; see for example Michael Love's answer in "New function lfcShrink() in DESeq2".
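To make the difference concrete, here is a minimal, self-contained sketch on simulated data (using DESeq2's built-in makeExampleDESeqDataSet, not my real design); note that the coefficient names also differ, since betaPrior=TRUE uses an expanded model matrix:

```r
## Minimal sketch on simulated data (not my real design):
## the same dataset fit with and without the LFC prior.
library(DESeq2)
dds <- makeExampleDESeqDataSet(n = 200, m = 12)
dds_prior <- DESeq(dds, betaPrior = TRUE)   # shrunken LFCs enter the Wald test
dds_mle   <- DESeq(dds, betaPrior = FALSE)  # unshrunken (MLE) LFCs are tested
resultsNames(dds_prior)  # expanded model: one coefficient per factor level
resultsNames(dds_mle)    # standard model: e.g. "condition_B_vs_A"
```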
As I mentioned, I am seeing huge effects on the p-values and I wonder why. In one of my settings I have gene expression data from two time points per patient (though often a patient only has one sample), a very unbalanced design, and I want to model the effect of treatment (visit a to visit b) and recovery (0 or 1). There are also covariates for additional treatments the patients are on. In this example I get significant results when using betaPrior, but nothing when I am not using it. I am filtering my counts matrix so that at least 20% of the samples have at least 1 normalized count (for reasons unrelated to this exact regression).
This sample set includes 128 samples, and a sneak peek of the design:
> A_design[1:10,]
     pat_id visit response T_after P_after
P09a    P09     a        0       0       1
P09b    P09     b        0       1       1
P10a    P10     a        0       0       1
P10b    P10     b        0       0       2
P14a    P14     a        0       0       0
P14b    P14     b        1       0       1
P15a    P15     a        0       0       0
P15b    P15     b        1       0       1
P17a    P17     a        0       0       0
P17b    P17     b        1       0       0
Code when using betaPrior=F:
dds <- DESeqDataSetFromMatrix(countData = Counts[,rownames(A_design)], colData = A_design, design = ~ pat_id + visit + response + T_after + P_after)
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 1 ) >= 26
dds <- dds[idx,]
ddsIb1F=DESeq(dds, betaPrior=FALSE, parallel=TRUE, BPPARAM=MulticoreParam(4))
# general effect of treatment:
visit.res1F=results(ddsIb1F, name='visit_b_vs_a', pAdjustMethod='BH')
# gene expression related to recovery:
recov.res1F=results(ddsIb1F, name='response_1_vs_0', pAdjustMethod='BH')
Code when using betaPrior=T:
dds <- DESeqDataSetFromMatrix(countData = Counts[,rownames(A_design)], colData = A_design, design = ~ pat_id + visit + response + T_after + P_after)
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 1 ) >= 26
dds <- dds[idx,]
ddsIb1=DESeq(dds, betaPrior=TRUE,parallel=TRUE, BPPARAM=MulticoreParam(4))
# general effect of treatment:
visit.res1=results(ddsIb1,name='visitb',pAdjustMethod='BH')
# gene expression related to recovery:
recov.res1=results(ddsIb1,name='response1',pAdjustMethod='BH')
When betaPrior=T my results are:
par(mfrow=c(1,2))
plotMA(visit.res1)
plotMA(recov.res1)
And I am getting a fair number of significant genes (71 and 38, to be exact).
But when I am not shrinking my fold changes I get:
par(mfrow=c(1,2))
plotMA(visit.res1F)
plotMA(recov.res1F)
Here nothing is significant (well, actually 1 gene for visit).
The correlation between p-values is rather poor when we consider the more extreme p-values:
And the correlation for fold changes is also low:
More importantly, the distribution of FDR changes dramatically:
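For reference, the comparisons above were made roughly like this; res_prior and res_noprior stand in for the two results() objects, and simulated adjusted p-values are used here so the sketch runs standalone:

```r
## Sketch: quantifying how much the two runs disagree at FDR < 0.05.
## res_prior / res_noprior stand in for results() from the two DESeq() calls;
## simulated adjusted p-values are used so the sketch runs on its own.
set.seed(1)
res_prior   <- data.frame(padj = p.adjust(runif(1000)^3, method = "BH"))
res_noprior <- data.frame(padj = p.adjust(runif(1000),   method = "BH"))
alpha <- 0.05
sig_prior   <- which(res_prior$padj < alpha)
sig_noprior <- which(res_noprior$padj < alpha)
c(n_prior   = length(sig_prior),
  n_noprior = length(sig_noprior),
  shared    = length(intersect(sig_prior, sig_noprior)))
```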
Is there anything I can do to further understand how this shrinkage is affecting my data, and whether or not it is appropriate to use it? Any suggestions as to what in my data leads to such a huge effect of using shrinkage or not?
I want to perform the most appropriate test, but here I am at a loss as to which approach that is for my design, and why. In general I really like the idea of shrinking the fold changes, but I have another example where using shrinkage leads to a complete loss of signal (and the significant genes were not all really low in counts).
Thanks in advance,
Boel
Hi Michael,
You are phenomenal at answering questions fast, thank you. You are correct that I made a mistake in my code above: when betaPrior is FALSE it should be "visit_b_vs_a" instead of "visitb", just as you say. I've corrected it above (and did run the correct code).
But using contrast as you specify makes no difference whatsoever.
visit.res1_old=visit.res1
visit.res1=results(ddsIb1,contrast=c("visit","b","a"),pAdjustMethod='BH')
recov.res1=results(ddsIb1,contrast=c('response','1','0'),pAdjustMethod='BH')
My results are identical to before, with the same number of significant genes. These are the two vectors of p-values for visit plotted against each other.
So this was not the reason why I was getting different results when using betaPrior or not. Also, please note above that I only get significant results with betaPrior=TRUE in this example.
I've assumed that the variables listed by resultsNames are the covariates used in the regression. Is this not the case? I've read a few forum posts where the difference between the expanded model and the standard model was discussed, but I thought it was only an issue when doing interactions (and quite frankly I did not understand it). I would be very happy to be pointed towards some resource where I can understand which regression model is being run depending on the arguments used. Also, here using name='visitb' or contrast=c("visit","b","a") gives almost completely identical results.
Also, this is an old project that I am trying to finish, hence the old package. But keeping the fold change shrinkage separate from the statistical testing, as in 1.16 and above, does make me a bit worried in light of the issues I am having now.
Can you show resultsNames(dds) with betaPrior=TRUE? It should not give the same results. Are the LFCs the same as well?
> resultsNames(ddsIb1)
[1] "Intercept" "pat_idP10"
... "visita"
[78] "visitb" "response0" "response1" "T_after0" "T_after1" "P_after"
The fold changes are not identical; they appear to be twice the size when specifying the contrast:
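If I understand the expanded model correctly, this factor of two is expected: with a symmetric prior each level gets its own (shrunken) coefficient, so roughly visitb ≈ -visita, and the contrast returns their difference. A sketch, assuming the ddsIb1 object fitted above:

```r
## Sketch (uses ddsIb1 from above, fitted with betaPrior=TRUE):
## under the expanded model, 'visitb' is one shrunken level coefficient,
## while the contrast returns visitb - visita, roughly twice its size.
res_name     <- results(ddsIb1, name = "visitb")
res_contrast <- results(ddsIb1, contrast = c("visit", "b", "a"))
head(cbind(name     = res_name$log2FoldChange,
           contrast = res_contrast$log2FoldChange))
```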
So where do we end up now? My betaPrior results are not behaving as they should, but there I get significant results (and in this comparison I am expecting significant results). With betaPrior=F I do not get any significant results.
Versions 1.12 and 1.14 were the last ones in which the default was TRUE; these are from 2016. One of the main reasons for the switch was that I don't like what happens when there are many covariates and shrinkage on one affects the others. So this is why the default is now FALSE (and we've implemented new shrinkage methods that operate on one coefficient at a time in a separate function). I'd recommend using FALSE, even if this means there are no significant results. You can try an alternate package if you think DESeq2 isn't right here.
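In current versions the recommended pattern is to test without the prior and shrink the LFCs afterwards for ranking and visualization, e.g. (a sketch on simulated data):

```r
## Sketch of the >= 1.16 workflow on simulated data:
## p-values come from unshrunken coefficients; lfcShrink() only
## moderates the fold changes afterwards.
library(DESeq2)
dds <- makeExampleDESeqDataSet(n = 200, m = 12)
dds <- DESeq(dds)                               # betaPrior=FALSE is now the default
res <- results(dds, name = "condition_B_vs_A")  # unshrunken LFCs, tested
shr <- lfcShrink(dds, coef = 2, res = res)      # coefficient 2 = condition_B_vs_A
```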
Hi Michael,
I have somewhat related questions regarding this post.
I'm still using DESeq2_1.14.1. I've been working on differential expression analysis of drought tolerance in rice. I have 2 genotypes (tolerant and susceptible) and 2 conditions (drought and well-watered), essentially a 2x2 factorial experiment with 4 replications.
Now we want to identify genes that are uniquely upregulated and downregulated in both genotypes under drought. Specifically, we want to identify unique genes in the QTL region of the very drought-tolerant genotype that are upregulated and downregulated, and then use these genes for functional validation through CRISPR-Cpf1. One of the major hypotheses we want to test is that there are differentially expressed genes (DEGs) between the two genotypes in the QTL region under drought, and we want to identify them, specifically DEGs under drought in the tolerant genotype. Since we don't know the mechanisms of drought tolerance at the reproductive stage, we set up contrasts to identify DEGs between several groups.
I set up the code as follows:
Question 1: The vignette says "Using the design is similar to adding an interaction term". Is there any difference between using the grouping variable and using the interaction design?
Without grouping and using the interaction term, I'm getting this under resultsNames with betaPrior=FALSE:
[1] "Intercept" "genotype_Swarna_vs_IL" "condition_Drought_vs_Control"
[4] "genotypeSwarna.conditionDrought"
With grouping and betaPrior=TRUE, we get these four different groups.
With grouping and betaPrior=FALSE, we get these groups:
[1] "Intercept" "group_ILDrought_vs_ILControl" "group_SwarnaControl_vs_ILControl"
[4] "group_SwarnaDrought_vs_ILControl"
I ended up using the grouping with betaPrior=TRUE, doing pairwise comparisons using results() and contrast with the group variable.
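Concretely, the grouping setup I used looks roughly like this (variable names are from my design; this is a sketch, not the full script):

```r
## Sketch of the grouping approach (genotype: IL/Swarna, condition: Control/Drought)
## assumes dds is a DESeqDataSet with genotype and condition in colData
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds, betaPrior = TRUE)
## drought effect within each genotype via pairwise contrasts on group:
res_IL     <- results(dds, contrast = c("group", "ILDrought", "ILControl"))
res_Swarna <- results(dds, contrast = c("group", "SwarnaDrought", "SwarnaControl"))
```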
You mentioned in one of the threads that "lfcShrink() gives the identical moderated LFCs as DESeq() gave in previous versions" and "If you want to obtain (nearly) the same results in version 1.16 as in 1.14 you can do: dds <- DESeq(dds, betaPrior=TRUE)". So running version 1.14.1 with dds <- DESeq(dds, betaPrior=TRUE) and grouping would give (nearly) the same results as using lfcShrink in version 1.16.1?
Please advise.
Sincerely,
Asher
Hi Asher, could you post a new question with this text? I'll get to it soon.
Hi Michael,
Sure, apologies for posting it here.