DEGs generation parameters and specificities
1
0
Entering edit mode
tarun2 • 0
@tarun2-11885
Last seen 3.0 years ago
United States

I've been following DESeq2 vignette, manuals, and different posts/blog from DESeq2 community that I came up with my analysis. Apologies as this will be a lengthy post

I have a 2x2 factorial experiment on drought tolerance in rice under reproductive stage with four biological replicates

I have 2 genotypes:

        - Swarna - elite rice mega variety but drought susceptible

        - Near-Isogenic Lines (NIL-drought tolerant) which resulted from a cross between a donor parent and an elite mega variety Swarna.

NIL is basically of Swarna background but with an introgressed segment on chromosome 1. There is a major QTL identified on chromosome 1 which confers drought tolerance at reproductive stage.

I have 2 conditions obviously; Drought and Control (it's a greenhouse set-up for both conditions).

I have two tissues under consideration, flag leaf, and panicles but I'm analyzing them independently at this stage. 

The goal then is, to identify genes that co-localizes in QTL region which might explain the drought tolerant phenotype of NIL.

To do this,  I used Salmon for transcript abundance quantification followed by tximport package and import gene-level values (as suggested by the authors). The scripts for that is as follow;

tx.all <- tximport(myfiles, type = "salmon", tx2gene = tx2gene, reader = read_tsv)

I then constructed the deseq table

colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))

rownames(colData) <- colnames(tx.all$counts)

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))

In order to extract multiple condition effects, I combined the factors of interest into a single factor

dds$group<-factor(paste0(dds$geno, dds$condition)) 
design(dds) <- ~group 

I applied the most minimal filtering

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

And then I ran differential expression analysis

dds<-DESeq(dds, parallel = TRUE)
resultsNames(dds)

[1] "Intercept"          "groupNILControl"    "groupNILDrought"    "groupSwarnaControl"
[5] "groupSwarnaDrought"

 

The comes the result generation. Since we wanted to identify genes that could most likely explain the drought tolerant in NIL compared to Swarna (susceptible), I set-up different contrast test. I have four contrast to be exact

contrast=c("group","NILControl","SwarnaControl")

contrast=c("group","NILDrought","NILControl")

contrast=c("group","SwarnaDrought","SwarnaControl")

contrast=c("group","NILDrought","SwarnaDrought")

I extracted results for each contrast at the default parameter. The most interesting contrast for us would be contrast=c("group","NILDrought","SwarnaDrought"). Thus for illustration purposes, Ill be focusing on this contrast. 

res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE)
res

summary(res)

At the default parameter which is at adjusted p-value of 0.1 we identified genes within the QTL region in all specified contrast.

To be more strict on which genes are considered significant, I lowered the false discovery rate for all contrast as follows;

res.05 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), alpha=.05, parallel = TRUE)
table(res.05$padj < 0.05)
summary(res.05)
sum(res.05$padj < 0.05, na.rm=TRUE) 

Again, we got a list of DEGs within the QTL region on all of the four contrast.

 

And then I raised the log2 fold change 

resLFC1 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), lfcThreshold = 1, alpha = 0.05, parallel = TRUE)
table(resLFC1$padj < 0.05)
summary(resLFC1)
sum( resLFC1$padj < 0.05, na.rm=TRUE )

From this additional parameter, we only got a result from two contrast (below) that would explain treatment difference only regardless of the genotype  

contrast=c("group","NILDrought","NILControl") and contrast=c("group","SwarnaDrought","SwarnaControl")

We did not get any result from the other two contrast (below) that would explain genotypic difference which for us is the most important contrast. I did lower the alpha =0.1 from this parameter but we still did not get any genes within the QTL region. 

contrast=c("group","NILControl","SwarnaControl") and contrast=c("group","NILDrought","SwarnaDrought")

QUESTIONS:

1. Was there something wrong in the way I set-up the design formula and was my grouping appropriate?

2. Was my approach to achieved the goal using the specific contrast method did not address the question of identifying DEGs between NIL and Swarna under drought stress? Would there be a better approach that I should have done to handle this? 

3. In defining which genes are differentially express, is it strictly that we should invoke the log2 fold change (LFC=1) parameter with the specified alpha = 0.1 or 0.05? What is the implication in result generation if we did not invoke the log2 fold change (PFC=1) parameter?

4. Is it possible to say that the results generated from the default setting of adjusted p-value = 0.1 or even at 0.05 differentially express without invoking the LFC parameter? Can we use only this parameter for DEGs generation?

5. To say that it is biologically and statistically significant, what are the ultimate parameters that we should use?  

Please and kindly enlighten me on these issues and if possible provide suggestions 

My deepest and sincerest gratitude,

Asher

R version 3.3.2 (2016-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_1.1.0                tximport_1.2.0             genefilter_1.56.0         
 [4] pheatmap_1.0.8             RColorBrewer_1.1-2         ggplot2_2.2.1             
 [7] gplots_3.0.1               DESeq2_1.14.1              SummarizedExperiment_1.4.0
[10] Biobase_2.34.0             GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
[13] IRanges_2.8.1              S4Vectors_0.12.1           BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10         locfit_1.5-9.1       lattice_0.20-34      snow_0.4-2          
 [5] gtools_3.5.0         assertthat_0.1       digest_0.6.12        R6_2.2.0            
 [9] plyr_1.8.4           backports_1.0.5      acepack_1.4.1        RSQLite_1.1-2       
[13] zlibbioc_1.20.0      lazyeval_0.2.0       data.table_1.10.4    annotate_1.52.1     
[17] gdata_2.17.0         rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.2     
[21] labeling_0.3         splines_3.3.2        BiocParallel_1.8.1   geneplotter_1.52.0  
[25] stringr_1.2.0        foreign_0.8-67       htmlwidgets_0.8      RCurl_1.95-4.8      
[29] munsell_0.4.3        base64enc_0.1-3      htmltools_0.3.5      nnet_7.3-12         
[33] tibble_1.2           gridExtra_2.2.1      htmlTable_1.9        Hmisc_4.0-2         
[37] XML_3.98-1.5         bitops_1.0-6         grid_3.3.2           xtable_1.8-2        
[41] gtable_0.2.0         DBI_0.6              magrittr_1.5         scales_0.4.1        
[45] KernSmooth_2.23-15   stringi_1.1.3        XVector_0.14.1       latticeExtra_0.6-28 
[49] Formula_1.2-1        tools_3.3.2          hms_0.3              survival_2.39-5     
[53] AnnotationDbi_1.36.2 colorspace_1.3-2     cluster_2.0.5        caTools_1.17.1      
[57] memoise_1.0.0        knitr_1.15.1  
deseq2 results design and contrast matrix • 1.7k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 1 day ago
United States

hi Asher,

If you are interested in the drought resistant difference of NIL, I would use a design with interaction: ~ geno + condition + geno:condition, and then test on the interaction term: results(dds, name="genoNIL.conditiondrought"). This test finds genes where the NIL genotype differs in the drought vs control effect relative to Swarna, which sounds like what you want.

You should set up the geno factor such that Swarna is the reference level (see vignette on factors and reference levels).

Regarding the FDR threshold choice and LFC threshold choice, this is entirely up to you as the analyst. The LFC threshold changes the null hypothesis, such that you are not just interested in the genes with LFCs which are non-zero, but also large (in absolute value). Setting the threshold at 1 implies you are not interested in genes which are less than doubling (or greater than halving). You need to determine this threshold. It's useful to examine the MA plot of the results object in setting the threshold. The FDR threshold also is up to you: it sets the expected number of false discoveries in a set. If having 10% of the set be false discoveries is acceptable then use 0.1. If 5%, use 0.05, and so on.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thank you so much for the advice.

I did what you suggested

This is how I go about it. Please and kindly comment if I did it correctly.

colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))

rownames(colData) <- colnames(tx.all$counts)

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))

colData(dds)$geno<-relevel(colData(dds)$geno, ref = "Swarna")

nrow(dds)

dds <- dds[ rowSums(counts(dds)) > 1, ]

nrow(dds)

 

dds<-DESeq(dds, parallel = TRUE)
resultsNames(dds)

[1] "Intercept"                    "geno_NIL_vs_Swarna"           "condition_Drought_vs_Control"
[4] "genoNIL.conditionDrought"

res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE)

 

log2 fold change (MLE): genoNIL.conditionDrought 

out of 33915 with nonzero total read count

adjusted p-value < 0.1

LFC > 0 (up)     : 2395, 7.1% 
LFC < 0 (down)   : 1577, 4.6%
 
outliers [1]     : 99, 0.29% 
low counts [2]   : 9857, 29% 
 

I then lowered the false discovery rate to 5%

res.05 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, alpha=.05)
 

And I raised the log2 fold change from 0

resLFC1 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, lfcThreshold = 1, alpha = 0.05)
sum( resLFC1$padj < 0.05, na.rm=TRUE )


adjusted p-value < 0.05
LFC > 0 (up)     : 106, 0.31% 
LFC < 0 (down)   : 10, 0.029% 

The goal again is to identify differentially expressed genes that co-localizes in QTL region that might explain the tolerance phenotype of the Near-Isogenic Lines (NIL) under reproductive-stage drought stress. So we want to identify drought responsive genes responsible for increased drought tolerance conferred by the QTL region in NIL compared to Swarna at reproductive-stage drought stress. 

 

QUESTIONS:

1. The log2 fold change estimate returns MLE(no shrinkage), would this affect the result generation? This is to find genes with weak differential expression.

2. What's the difference of this interaction term res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE) , with the result generation using res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE). 

3. In generating results using res<-results(dds, name="genoNIL.conditionDrought", parallel = TRUE) or res.05 <- results(dds, name="genoNIL.conditionDrought", parallel = TRUE, alpha=.05) , I was able to get gene list that co-localizes in the QTL region. Same is true for the using the previous contrast arguments. 

4. When I raised the log fold change (LFC=1) with a defined alpha either at 0.1 or 0.05, there is no gene identified within the QTL region. This is the same result using the contrast argument contrast=c("group","NILDrought","SwarnaDrought"). 

5.  I'm still having trouble getting around the log fold-change threshold choice. You mentioned in your paper that "if any biological processes are genuinely affected by the difference in experimental treatment...this hypothesis (zero LFC) is, in fact, implausible and arguably wrong for many if not most genes."  And "a change should, therefore, be of sufficient magnitude to be considered biologically significant." I would like to solicit your idea if do you think I should invoke the LFC threshold choice or generating the result at alpha =0.05 would suffice in generating DEGs based on my goal. I'm still honestly perplexed as to invoke the LFC threshold parameter or not.

 

Sincerely,

Asher

 

 

ADD REPLY
1
Entering edit mode

hi Asher,

It's fine to not have LFC shrinkage. You will still be able to detect weak DE genes.

As to your two results table: the first is an interaction, and the second is a two group comparison. You should try to find a local statistician who can help explain the difference to you, if you don't follow after reading the description in the DESeq2 vignette.

I don't know why you are raising the LFC threshold if you say you are interested in detecting weak DE genes.

As I said previously, the choice of FDR threshold and LFC threshold are entirely up to you. I don't know anything about your biological system, and can't say what kind of fold change is meaningful. You should communicate with the rest of your team to decide on what FDR you are comfortable with and what, if any, an effect size (LFC) threshold to use. But again, if you say you want to be able to detect weakly up- or down-regulated genes then you should not be using lfcThreshold.

ADD REPLY
0
Entering edit mode

Hi Michael,

I'm sorry if I may have confused you in detecting weak DE genes. Our goal is to identify differentially express genes. That's why I invoke the LFC threshold. And also the reason being adding the LFC threshold is that from your paper, "a statistical test against the conventional null hypothesis of zero LFC may report genes with statistically significant changes that are so weak in effect that they could be considered irrelevant or distracting." And that a change should be of sufficient magnitude in order to be considered biologically significant. Adding all these together, I decided to add the LFC threshold but I ended up not getting anything within my QTL region.

So question is, without invoking the LFC threshold, does it necessarily mean that the results we generated using the specified FDR value would only be statistically significant but not biologically significant?

The interaction term name="genoNIL.conditionDrought" answers the question, is the condition effect different across genotypes. But is it not the same with the contrast 

res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"), parallel = TRUE) or

res<-results(dds, contrast=c("group","NILControl","SwarnaControl"), parallel = TRUE)? 

 

Sincerely,

Asher

ADD REPLY
0
Entering edit mode

You need to discuss these questions with a local statistical collaborator. These are not really software related questions, but scientific and statistical ones. and I've given all the answers I can.

ADD REPLY

Login before adding your answer.

Traffic: 669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6