Hi there,
I have a data set with 3 groups of samples, PD, SD, and CR/PR based on the Response variable. I am interested in finding DEG genes between PD and CR/PR with PD as the reference level.
My design is: design(dds)=~Response
Normally, in order to get the DEG result, I would call results(dds,contrast=c("Response","PRCR","PD")) to retrieve the result. However, I am wondering about the result from calling results(dds,name="ResponsePRCR") which is supposed to give us the individual effect and yet has the all the result columns, baseMean, log2FoldChange, ...
What exactly is this "individual effect"? How do I interpret the result from using name option? What's the difference between using contrast and using name? Thanks a lot.
David
Hi Michael,
Thanks for your quick reply. Sorry for the confusion. There are 3 groups of sample in my data set according to response to some treatment. What I want to do is to find the genes up or down regulated in CR/PR group as compared to PD group. Naively I think there are two ways to do it. The first is to drop SD samples from the data set so there are only 2 levels (groups) left in the data set then run DEG with Response variable (PD as the reference factor level) in the design then call results(dds) to retrieve the results. The other way, is to not to drop any sample and run DEG with Response variable (3 levels, PD as the reference factor level) then call results(dds, contrast=c("Response","CR/PR","PD")). I think the results should be matched between these 2 ways except the baseMean, correct?
My main question is simply what is the difference between results(dds, ,contrast=c("Response","CR/PR","PD")) and results(dds, name="ResponsePRCR")? Since contrast=c("Response","CR/PR","PD") gives the DEG result of "CR/PR" vs "PD", what's name="ResponsePRCR" mean then?
Thanks,
David
To be safe, just use the 'contrast' argument for each run, and results() will make sure that you get the contrast you desire. Then the question is, why are the results different with some samples removed from the dataset. The reason is that DESeq() uses all the samples to estimate the dispersion. There is a FAQ in the vignette about this question.
Hi Michael,
Again, thanks for super fast response. Yes, I understand using the 'contrast' argument is the safe way to get the DEG result. But what if someone accidentally used 'name' argument to pull the result and we would like to know the how to interpret the result? Can you please help?
Yes, I understand that DESeq is using all the samples to estimate the dispersion.
Thanks,
David
The interpretation of the coefficients in resultsNames(dds) depends on how the factors were set up, the design, and how DESeq() was run.
Assuming that you used a design of ~Response, and assuming that you ran DESeq() without arguments, then ResponsePRCR gives the contrast of PRCR vs whatever is the reference level in the current dataset:
levels(dds$Response)[1]
Note that, because the dispersion estimates are different when you remove groups, this assumption is not correct: "I think the results should be matched between these 2 ways except the baseMean, correct?"
Hi Michael,
Thanks so much. Below is essentially the step for DEG.
#################################
dds$Response <- factor(dds$Response,levels=c("PD","SD","PRCR"))
# Set the DEG design
design(dds) <- ~Response
# Run the DEG analysis
dds <- DESeq(dds)
Response.result1.ihw.PRCRvsPD <- results(dds,contrast=c("Response","PRCR","PD"),filterFun = ihw)
Response.result2.ihw.PRCR <- results(dds,name="ResponsePRCR",filterFun = ihw)
#################################
And below is the resultsNames(dds) result:
[1] "Intercept" "ResponsePD" "ResponseSD" "ResponsePRCR"
levels(dds$Response)[1]
[1] "PD"
The problem is Response.result1.ihw.PRCRvsPD and Response.result2.ihw.PRCR are not the same. Does my step violate any assumption? Why the results are not the same between them?
Another question regarding the samples to be included in DEG analysis. I see your point, since the number of samples is different then the size factor will be calculated differently then the normalized read count will be different so the DEG result will not be the same. In this case, what will you recommend for doing DEG? Should the size factor calculated from ALL samples or the size factor calculated from only samples being involved in DEG to be used for DEG?
Thanks,
David
I assumed you were using the latest version of DESeq2, but it seems you are not. You can either update to the latest version (1.16) in which case the default value of betaPrior=FALSE, or just set this argument when you run DESeq(). Then the above advice will hold. (Either way, using 'contrast' always gives you the correct comparison.)
Please consult the FAQ I mentioned before. The relevant change is not in the size factor, but the dispersion estimate. The general recommendation is to use all samples, and then use 'contrast' to compare groups.
Hi Michael,
Thanks. Sorry for the confusion. I thought I am using DESeq2 and below is my session info. It's DESeq2_1.12.4, right? Even if I set the betaPrior to be true by calling "dds <- DESeq(dds, betaPrior=TRUE)", still got different results between 2 results() calls. Any suggestion for further debugging?
Best,
David
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
locale:
[1] C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IHW_1.0.2
[2] matrixStats_0.52.2
[3] genefilter_1.54.2
[4] org.Hs.eg.db_3.3.0
[5] annotate_1.50.1
[6] XML_3.98-1.9
[7] ggplot2_2.2.1
[8] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[9] GenomicFeatures_1.24.5
[10] AnnotationDbi_1.36.2
[11] DESeq2_1.12.4
[12] SummarizedExperiment_1.2.3
[13] Biobase_2.32.0
[14] GenomicRanges_1.26.1
[15] GenomeInfoDb_1.8.7
[16] IRanges_2.8.0
[17] S4Vectors_0.12.0
[18] BiocGenerics_0.20.0
...
Sorry I wrote the wrong code, I meant betaPrior=FALSE
Thanks a lot! By changing the betaPrior to be false a number of thing are changed. First the resultsNames() function result is different.
When betaPrior=TRUE (default by DESeq2):
[1] "Intercept" "ResponsePD" "ResponseSD" "ResponsePRCR"
Now, when betaPrior=FALSE:
[1] "Intercept" "Response_SD_vs_PD" "Response_PRCR_vs_PD"
Therefore the results() call "results(dds,name="ResponsePRCR",filterFun = ihw)" is not working anymore. Instead, now should call "results(dds,name="Response_PRCR_vs_PD", ,filterFun = ihw). In this case, indeed, the results between using contrast argument and using name argument (Response_PRCR_vs_PD) are matched. However the number of significant DEG genes has been increased a lot and I guess that's contributed by mostly lower express genes. Below is the distribution of baseMean of FDR < 0.2 genes.
When betaPrior=TRUE:
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.817 85.130 226.500 607.700 566.100 10390.000
When betaPrior=FALSE:
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.392 38.550 91.510 259.300 213.100 8387.000
In the detail session of nbinomWaldTest() in the manual, it mentions "... The weighting ensures that noisy estimates of log fold changes from small count genes do not overly influence the calculation of the prior variance. See estimateBetaPriorVar. The final prior variance for a factor level is the average of the estimated prior variance over all contrasts of all levels of the factor. ..." Does that mean when the betaPrior is set TRUE, the the final prior variance for a factor level is the average of the estimated prior variance of over all contrasts of all other levels of the factor which generates the results of name="ResponsePRCR"?
I normally use contrast argument instead of name argument to retrieve contrast result. Actually it is one of our previous lab member who used the name argument for reporting DEG results and now we are facing the question of how to interpret the result since it's different with contrast argument results.
Could you please help us on how to interpret the name argument results when betaPrior=TRUE?
Many thanks!
David
The section about the prior variance is explaining how much shrinkage is applied to fold changes, I'm not sure it's very informative for your current questions. (The low count genes don't contribute much to deciding how much shrinkage there should be, they are downweighted. Then the next sentence says that all contrasts of levels of a factor are considered for estimating how much shrinkage there should be. But this is all what happens only when betaPrior=TRUE, and to determine how much shrinkage will be applied to fold changes. This is no longer the default routine in the current version of DESeq2, but shrinkage of fold changes is now accomplished by a separate function lfcShrink().)
The interpretation of using name="ResponsePRCR", and when using betaPrior=TRUE is: this pulls out the difference of the PRCR samples to the middle value between all levels of the response variable. It's not a proper comparison of two levels, as you would get with 'contrast'. This is because the way we implemented the beta prior was to have shrinkage have the same effect no matter what level is declared the reference. But this statement just applies to betaPrior=TRUE which is no longer the default for DESeq().
The current version of DESeq2 just uses standard R coefficients, and so you can see from resultsNames(dds) what the coefficients mean by their name. And you can use either 'name' or 'contrast'.
Thank you so much!
David
No problem. The confusion is mostly my fault, and resulted from our evolving the idea of how best to perform shrinkage over time, as we saw our method behave "in the wild". I think shrinkage estimators are a valuable tool for statistical analysis, and I'm happy with the current pipeline of DESeq(), results(), then lfcShrink() for visualization and ranking of genes by effect size.