I've been following DESeq2 vignette, manuals, and different posts/blog from DESeq2 community that I came up with my analysis. Apologies as this will be a lengthy post
I have a 2x2 factorial experiment on drought tolerance in rice under reproductive stage with four biological replicates
I have 2 genotypes:
- Swarna - elite rice mega variety but drought susceptible
- Near-Isogenic Lines (NIL-drought tolerant) which resulted from a cross between a donor parent and an elite mega variety Swarna.
NIL is basically of Swarna background but with an introgressed segment on chromosome 1. There is a major QTL identified on chromosome 1 which confers drought tolerance at reproductive stage.
I have 2 conditions obviously; Drought and Control (it's a greenhouse set-up for both conditions).
I have two tissues under consideration, flag leaf, and panicles but I'm analyzing them independently at this stage.
The goal then is, to identify genes that co-localizes in QTL region which might explain the drought tolerant phenotype of NIL.
To do this, I used Salmon for transcript abundance quantification followed by tximport package and import gene-level values (as suggested by the authors). The scripts for that is as follow;
tx.all <- tximport(myfiles, type = "salmon", tx2gene = tx2gene, reader = read_tsv)
I then constructed the deseq table
colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2)) rownames(colData) <- colnames(tx.all$counts) dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))
In order to extract multiple condition effects, I combined the factors of interest into a single factor
dds$group<-factor(paste0(dds$geno, dds$condition)) design(dds) <- ~group
I applied the most minimal filtering
nrow(dds) dds <- dds[ rowSums(counts(dds)) > 1, ] nrow(dds)
And then I ran differential expression analysis
dds<-DESeq(dds, parallel = TRUE) resultsNames(dds) [1] "Intercept" "groupNILControl" "groupNILDrought" "groupSwarnaControl" [5] "groupSwarnaDrought"
The comes the result generation. Since we wanted to identify genes that could most likely explain the drought tolerant in NIL compared to Swarna (susceptible), I set-up different contrast test. I have four contrast to be exact
contrast=c("group","NILControl","SwarnaControl") contrast=c("group","NILDrought","NILControl") contrast=c("group","SwarnaDrought","SwarnaControl") contrast=c("group","NILDrought","SwarnaDrought")
I extracted results for each contrast at the default parameter. The most interesting contrast for us would be contrast=c("group","NILDrought","SwarnaDrought"). Thus for illustration purposes, Ill be focusing on this contrast.
res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE) res summary(res)
At the default parameter which is at adjusted p-value of 0.1 we identified genes within the QTL region in all specified contrast.
To be more strict on which genes are considered significant, I lowered the false discovery rate for all contrast as follows;
res.05 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), alpha=.05, parallel = TRUE) table(res.05$padj < 0.05) summary(res.05) sum(res.05$padj < 0.05, na.rm=TRUE)
Again, we got a list of DEGs within the QTL region on all of the four contrast.
And then I raised the log2 fold change
resLFC1 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), lfcThreshold = 1, alpha = 0.05, parallel = TRUE) table(resLFC1$padj < 0.05) summary(resLFC1) sum( resLFC1$padj < 0.05, na.rm=TRUE )
From this additional parameter, we only got a result from two contrast (below) that would explain treatment difference only regardless of the genotype
contrast=c("group","NILDrought","NILControl") and contrast=c("group","SwarnaDrought","SwarnaControl")
We did not get any result from the other two contrast (below) that would explain genotypic difference which for us is the most important contrast. I did lower the alpha =0.1 from this parameter but we still did not get any genes within the QTL region.
contrast=c("group","NILControl","SwarnaControl") and contrast=c("group","NILDrought","SwarnaDrought")
QUESTIONS:
1. Was there something wrong in the way I set-up the design formula and was my grouping appropriate?
2. Was my approach to achieved the goal using the specific contrast method did not address the question of identifying DEGs between NIL and Swarna under drought stress? Would there be a better approach that I should have done to handle this?
3. In defining which genes are differentially express, is it strictly that we should invoke the log2 fold change (LFC=1) parameter with the specified alpha = 0.1 or 0.05? What is the implication in result generation if we did not invoke the log2 fold change (PFC=1) parameter?
4. Is it possible to say that the results generated from the default setting of adjusted p-value = 0.1 or even at 0.05 differentially express without invoking the LFC parameter? Can we use only this parameter for DEGs generation?
5. To say that it is biologically and statistically significant, what are the ultimate parameters that we should use?
Please and kindly enlighten me on these issues and if possible provide suggestions
My deepest and sincerest gratitude,
Asher
R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] readr_1.1.0 tximport_1.2.0 genefilter_1.56.0 [4] pheatmap_1.0.8 RColorBrewer_1.1-2 ggplot2_2.2.1 [7] gplots_3.0.1 DESeq2_1.14.1 SummarizedExperiment_1.4.0 [10] Biobase_2.34.0 GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 [13] IRanges_2.8.1 S4Vectors_0.12.1 BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.10 locfit_1.5-9.1 lattice_0.20-34 snow_0.4-2 [5] gtools_3.5.0 assertthat_0.1 digest_0.6.12 R6_2.2.0 [9] plyr_1.8.4 backports_1.0.5 acepack_1.4.1 RSQLite_1.1-2 [13] zlibbioc_1.20.0 lazyeval_0.2.0 data.table_1.10.4 annotate_1.52.1 [17] gdata_2.17.0 rpart_4.1-10 Matrix_1.2-7.1 checkmate_1.8.2 [21] labeling_0.3 splines_3.3.2 BiocParallel_1.8.1 geneplotter_1.52.0 [25] stringr_1.2.0 foreign_0.8-67 htmlwidgets_0.8 RCurl_1.95-4.8 [29] munsell_0.4.3 base64enc_0.1-3 htmltools_0.3.5 nnet_7.3-12 [33] tibble_1.2 gridExtra_2.2.1 htmlTable_1.9 Hmisc_4.0-2 [37] XML_3.98-1.5 bitops_1.0-6 grid_3.3.2 xtable_1.8-2 [41] gtable_0.2.0 DBI_0.6 magrittr_1.5 scales_0.4.1 [45] KernSmooth_2.23-15 stringi_1.1.3 XVector_0.14.1 latticeExtra_0.6-28 [49] Formula_1.2-1 tools_3.3.2 hms_0.3 survival_2.39-5 [53] AnnotationDbi_1.36.2 colorspace_1.3-2 cluster_2.0.5 caTools_1.17.1 [57] memoise_1.0.0 knitr_1.15.1
Hi Michael,
Thank you so much for the advice.
I did what you suggested
This is how I go about it. Please and kindly comment if I did it correctly.
colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))
rownames(colData) <- colnames(tx.all$counts)
dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))
colData(dds)$geno<-relevel(colData(dds)$geno, ref = "Swarna")
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
dds<-DESeq(dds, parallel = TRUE)
resultsNames(dds)
[1] "Intercept" "geno_NIL_vs_Swarna" "condition_Drought_vs_Control"
[4] "genoNIL.conditionDrought"
res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE)
log2 fold change (MLE): genoNIL.conditionDrought
out of 33915 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2395, 7.1%
LFC < 0 (down) : 1577, 4.6%
outliers [1] : 99, 0.29%
low counts [2] : 9857, 29%
I then lowered the false discovery rate to 5%
res.05 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, alpha=.05)
And I raised the log2 fold change from 0
resLFC1 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, lfcThreshold = 1, alpha = 0.05)
sum( resLFC1$padj < 0.05, na.rm=TRUE )
adjusted p-value < 0.05
LFC > 0 (up) : 106, 0.31%
LFC < 0 (down) : 10, 0.029%
The goal again is to identify differentially expressed genes that co-localizes in QTL region that might explain the tolerance phenotype of the Near-Isogenic Lines (NIL) under reproductive-stage drought stress. So we want to identify drought responsive genes responsible for increased drought tolerance conferred by the QTL region in NIL compared to Swarna at reproductive-stage drought stress.
QUESTIONS:
1. The log2 fold change estimate returns MLE(no shrinkage), would this affect the result generation? This is to find genes with weak differential expression.
2. What's the difference of this interaction term res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE) , with the result generation using res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE).
3. In generating results using res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE) or res.05 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, alpha=.05) , I was able to get gene list that co-localizes in the QTL region. Same is true for the using the previous contrast arguments.
4. When I raised the log fold change (LFC=1) with a defined alpha either at 0.1 or 0.05, there is no gene identified within the QTL region. This is the same result using the contrast argument contrast=c("group","NILDrought","SwarnaDrought").
5. I'm still having trouble getting around the log fold-change threshold choice. You mentioned in your paper that "if any biological processes are genuinely affected by the difference in experimental treatment...this hypothesis (zero LFC) is, in fact, implausible and arguably wrong for many if not most genes." And "a change should, therefore, be of sufficient magnitude to be considered biologically significant." I would like to solicit your idea if do you think I should invoke the LFC threshold choice or generating the result at alpha =0.05 would suffice in generating DEGs based on my goal. I'm still honestly perplexed as to invoke the LFC threshold parameter or not.
Sincerely,
Asher
hi Asher,
It's fine to not have LFC shrinkage. You will still be able to detect weak DE genes.
As to your two results table: the first is an interaction, and the second is a two group comparison. You should try to find a local statistician who can help explain the difference to you, if you don't follow after reading the description in the DESeq2 vignette.
I don't know why you are raising the LFC threshold if you say you are interested in detecting weak DE genes.
As I said previously, the choice of FDR threshold and LFC threshold are entirely up to you. I don't know anything about your biological system, and can't say what kind of fold change is meaningful. You should communicate with the rest of your team to decide on what FDR you are comfortable with and what, if any, an effect size (LFC) threshold to use. But again, if you say you want to be able to detect weakly up- or down-regulated genes then you should not be using lfcThreshold.
Hi Michael,
I'm sorry if I may have confused you in detecting weak DE genes. Our goal is to identify differentially express genes. That's why I invoke the LFC threshold. And also the reason being adding the LFC threshold is that from your paper, "a statistical test against the conventional null hypothesis of zero LFC may report genes with statistically significant changes that are so weak in effect that they could be considered irrelevant or distracting." And that a change should be of sufficient magnitude in order to be considered biologically significant. Adding all these together, I decided to add the LFC threshold but I ended up not getting anything within my QTL region.
So question is, without invoking the LFC threshold, does it necessarily mean that the results we generated using the specified FDR value would only be statistically significant but not biologically significant?
The interaction term name="genoNIL.conditionDrought" answers the question, is the condition effect different across genotypes. But is it not the same with the contrast
res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE) or
res<-results(dds, contrast=c("group","NILControl","SwarnaControl"), parallel = TRUE)?
Sincerely,
Asher
You need to discuss these questions with a local statistical collaborator. These are not really software related questions, but scientific and statistical ones. and I've given all the answers I can.