DESeq2 --- differences in log2FC between results from DESeq() command and nbinomLRT() command
1
0
Entering edit mode
msm545 • 0
@msm545-24422
Last seen 4.2 years ago

Dear Mike,

Have you observed differences in log2FC (Question 1) and baseMean (Question 2 -- suggestion) when running DESeq() command and nbinomLRT() or nbinomWaldTest() command on a subset of the dataset?

library(DESeq2)
library("BiocParallel")
register(SnowParam(10))

# count tables from RNA-Seq data
cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10)
cond <- factor(rep(1:2, each=5))

# object construction
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)
dds<-estimateSizeFactors(dds)
dds<-estimateDispersions(dds)

# standard analysis - Wald
dds <- DESeq(dds)
res <- results(dds)

# an alternate analysis: likelihood ratio test
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
resLRT <- results(ddsLRT)

# analysis using nbinomWaldTest
n.dds <- nbinomWaldTest(dds)
n.res <- results(n.dds)

# analysis using nbinomLRT
n.ddsLRT <- nbinomLRT(dds, full=~ cond,reduced= ~ 1)
n.resLRT <- results(n.ddsLRT)

Note: The results in baseMean and log2FC are same in all four results - res, resLRT, n.res, n.resLRT

Now when performing analysis on a subset of the data as follows:

sel<-c(1:3,8:10)
# standard analysis - Wald
t.dds <- DESeq(dds[,sel])
t.res <- results(t.dds)

# an alternate analysis: likelihood ratio test
t.ddsLRT <- DESeq(dds[,sel], test="LRT", reduced= ~ 1)
t.resLRT <- results(t.ddsLRT)

# analysis using nbinomWaldTest
t.n.dds <- nbinomWaldTest(dds[,sel])
t.n.res <- results(t.n.dds)

# analysis using nbinomLRT
t.n.ddsLRT <- nbinomLRT(dds[,sel], full=~ cond,reduced= ~ 1)
t.n.resLRT <- results(t.n.ddsLRT)

Note: The results in baseMean and log2FC are not the same in all four results - (t.res and t.resLRT are similar) and different from (t.n.res and t.n.resLRT-- which are similar).


Question 1:: Not sure why log2FC is different (even though smaller differences in the demo dataset --- the differences are high in the actual dataset)?

Note: using DESeq() and results () command log2FC of A vs.B; A vs. C and B vs. C (esitmated using contrasts) - addup [logBC=logAB-logAC] but using nbinomLRT() or nbinomWaldTest -- the log2FC does not addup due to the differences explained above - could you please clarify and address it if needed.


Question 2:: baseMean is different using DESeq() and nbinomLRT or nbinomWaldTest on the subset of data?

Suggestion: The difference is because baseMean seems to be copied from the actual dataset when using nbinomLRT or binomWaldTest -- whereas when using DESeq - baseMean is recalcualted using the subset of data. This needs to be addressed - please correct if wrong.


Related to queston 3 below: Since dispersion estimates were done ahead of the analysis in the demo dataset - DESeq command prints following message: found already estimated dispersions, replacing these


Question 3:: But both in demo dataset and an actual dataset - even though dispersion estimates is calculated ahead of time - DESeq command recalculates it when implementing parallel processing ? (since its a large dataset -- have to perform parallel processing).

dds <- DESeq(dds)

  • using pre-existing size factors
  • estimating dispersions
  • found already estimated dispersions, replacing these
  • gene-wise dispersion estimates
  • mean-dispersion relationship
  • final dispersion estimates
  • fitting model and testing

dds <- DESeq(dds, parallel=T)

  • using pre-existing size factors
  • estimating dispersions
  • gene-wise dispersion estimates: 10 workers
  • mean-dispersion relationship
  • final dispersion estimates, fitting model and testing: 10 workers

Suggestion: its not good to re-estimate dispersions when implementing parallel processing either on the subset of data (which would change results - as explained in the note for question# 1 - when calculating log2FC of three conditions or more) or on entire dataset (increasing additional processing time) -- this probably needs to be addressed and corrected - please correct me if wrong.

Session info for the above analysis are as follows:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.28.1               SummarizedExperiment_1.18.2
 [3] DelayedArray_0.14.1         matrixStats_0.56.0         
 [5] Biobase_2.48.0              GenomicRanges_1.40.0       
 [7] GenomeInfoDb_1.24.2         IRanges_2.22.2             
 [9] S4Vectors_0.26.1            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5             pillar_1.4.6           compiler_4.0.2        
 [4] RColorBrewer_1.1-2     XVector_0.28.0         bitops_1.0-6          
 [7] tools_4.0.2            zlibbioc_1.34.0        digest_0.6.25         
[10] bit_4.0.4              tibble_3.0.3           lifecycle_0.2.0       
[13] gtable_0.3.0           annotate_1.66.0        RSQLite_2.2.0         
[16] memoise_1.1.0          lattice_0.20-41        pkgconfig_2.0.3       
[19] rlang_0.4.7            Matrix_1.2-18          DBI_1.1.0             
[22] GenomeInfoDbData_1.2.3 dplyr_1.0.2            genefilter_1.70.0     
[25] generics_0.0.2         vctrs_0.3.4            locfit_1.5-9.4        
[28] tidyselect_1.1.0       bit64_4.0.5            grid_4.0.2            
[31] glue_1.4.2             R6_2.4.1               AnnotationDbi_1.50.3  
[34] XML_3.99-0.5           survival_3.2-3         BiocParallel_1.22.0   
[37] purrr_0.3.4            magrittr_1.5           ggplot2_3.3.2         
[40] geneplotter_1.66.0     blob_1.2.1             ellipsis_0.3.1        
[43] scales_1.1.1           splines_4.0.2          colorspace_1.4-1      
[46] xtable_1.8-4           munsell_0.5.0          RCurl_1.98-1.2        
[49] crayon_1.3.4          

Thank you

deseq2 • 2.5k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

In your first part, this makes sense, DESeq re-estimates dispersion while just running the nbinom... functions you are manually by-passing dispersion estimation on your selected samples and instead using dispersion from a different set of samples (this is not part of the recommended workflow).

baseMean is calculated early on in the steps. You should not run later steps on selected columns of data, you're going "off script" this way.

Re: your last question, dispersions are still being re-estimated, it just isn't printing the message from every core (that would be too verbose).

ADD COMMENT
0
Entering edit mode

Mike, thank you for the response. But it would make sense to use the dispersion based on the entire dataset rather than a subset each time right? -- so technically the nbinom is not reestimating the dispersion -- if thats the case then logBC = logAB-logAC when using nbiom separately with AB, AC and BC and not the case with DESeq. logBC=logAB-logAC will only hold true when run with all factors in one model using DESeq, but not separately.

May be question for different thread: also one would like to get different P-value for three factor pair-wise comparisons: AB, AC and BC -- this could be achieved only by using nbion separately - so using same baseMean calculated from earlier step would not be good and also re-estimating the dispersion as in DESeq

As per the last question, re-estimating dispersion using all cores is just additional processing time and it could be minimized.

Please pardon my naivety or language.

ADD REPLY
0
Entering edit mode

I would just say: the recommended steps are to use contrast in results for making comparisons between multiple levels of a factor in the design, and to run nbinomWaldTest or nbinomLRT on the same dataset (the same dds) that was used to run estimateDispersions (which happens inside of DESeq).

DESeq() is really the wrapper, so if you don't want to do all the steps, you would just use the lower-level functions. They are really easy to parallelize, just run nbinomWaldTest on subsets of the data and combine the mcols, then re-run p.adjust() on the pvalue column to create adjusted p-values.

ADD REPLY
0
Entering edit mode

Mike, following up on the same dataset as described above. Could you please explain why the dispersion estimates change with design - as per my understanding dispersion estimates should be based on the gene counts in the sample and not based on the design used in the dataset - which could vary based on different test conditions (either separate of all variables or together in combinations - depending on the question being asked).

### 1
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ 1)
dds<-estimateSizeFactors(dds)
dds<-estimateDispersions(dds)

### 2
cds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)
cds<-estimateSizeFactors(cds)
cds<-estimateDispersions(cds)

Question 1: The dispersion of dds and cds are different and this affects all downstream analysis and statistical calculations. So every time a DESeq command is run - depending on the design and/or subset of data - dispersion changes?


Another questions - in regards to the previous comment - Question 2: Could you please explain the difference in log2FC between DESeq() and nbinomLRT() eventhough dispersions estimates are fixed at the top before analysis - it seems DESeq does not use them - it recalculates the dispersions everytime - and it varies based on the design.

Question 3 - as per your explanation - if nbinomLRT is not performing dispersion estimates - then log2FC of yz should be equal to the log2FC (xy-xz) -- but this not the case there are minor differences in this demo dataset but in a larger actual dataset the differences are high and unexplain-able - could you please clarify ?

# see vignette for suggestions on generating
# count tables from RNA-Seq data
cnts <- matrix(rnbinom(n=1200, mu=100, size=1/0.5), ncol=12)
cond <- factor(rep(1:3, each=4))

# object construction
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ 1)
dds<-estimateSizeFactors(dds)
dds<-estimateDispersions(dds)

##3
design(dds)=~cond
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
resLRT <- results(ddsLRT,name="cond_3_vs_1")
resultsNames(ddsLRT)

results(ddsLRT,name="cond_3_vs_1")
results(ddsLRT,name="cond_2_vs_1")
results(ddsLRT,contrast=c(0,-1,1))

In this case as you mentioned log2FC holds true log2FC(yz) = log2FC(xy-xz) - and i try to use the ddsLRT in nbinomLRT - the relationship holds up true - but when i go back to the original dds - the relationship fails::

sel=cond!=3
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
dds.xy <- nbinomLRT(dds[,sel], full=m1, reduced= m2)
results(dds.xy)

sel=cond!=2
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
dds.xz <- nbinomLRT(dds[,sel], full=m1, reduced= m2)
results(dds.xz)

dds$cond<-relevel(dds$cond,"2")
sel=cond!=1
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
dds.yz <- nbinomLRT(dds[,sel], full=m1, reduced= m2)
results(dds.yz)
ADD REPLY
0
Entering edit mode

"Could you please explain why the dispersion estimates change with design - as per my understanding dispersion estimates should be based on the gene counts in the sample and not based on the design used in the dataset"

This has been asked before on the support site, but I'll just refer you to the 2014 DESeq2 paper which explains how the design is used in estimating dispersion.

Q2 Log2 fold change can depend on dispersion, so then if you change the dispersion you may get different LFC. Yes, DESeq is a wrapper that involves estimation of dispersion, whereas nbinom*** do not estimate dispersion.

Is there still an issue here though if you use our recommended steps (and avoid the issues I've brought up already of subsetting data halfway through the analysis)?

Q3, I get all.equal() giving TRUE in the case you bring up: "if nbinomLRT is not performing dispersion estimates - then log2FC of yz should be equal to the log2FC (xy-xz)"

set.seed(123)
cnts <- matrix(rnbinom(n=1200, mu=100, size=1/0.5), ncol=12)
cond <- factor(rep(1:3, each=4))
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~cond)
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
resLRT <- results(ddsLRT,name="cond_3_vs_1")
r1 <- results(ddsLRT,name="cond_3_vs_1")
r2 <- results(ddsLRT,name="cond_2_vs_1")
r3 <- results(ddsLRT,contrast=c(0,-1,1))
all.equal(r3$log2FoldChange, r1$log2FoldChange - r2$log2FoldChange)
# [1] TRUE
ADD REPLY
0
Entering edit mode

reply for Q1 Ans: Mike, will read the paper - the question was why dispersion would change every time a design is changed (even though different combinations of the variables - but with same set of samples will be used) ?

reply for Q2 Ans: agreed - the question was why the value of dispersion changes with design even using all set of samples - I'm not able to wrap my head around - will read the paper and get back to you soon.

reply for Q3 Ans: yes, as explained and agreed in the question and also your solution - this holds true for DESeq() but the question was - why not this relationship is not holding true for nbionLRT() ?


set.seed(123)
cnts <- matrix(rnbinom(n=1200, mu=100, size=1/0.5), ncol=12)
cond <- factor(rep(1:3, each=4))
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~cond)

dds<-estimateSizeFactors(dds)
dds<-estimateDispersions(dds)

sel=cond!=3
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
r1 <- results(nbinomLRT(dds[,sel], full=m1, reduced= m2))

sel=cond!=2
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
r2 <- results(nbinomLRT(dds[,sel], full=m1, reduced= m2))

dds$cond<-relevel(dds$cond,"2")
sel=cond!=1
m1<-model.matrix(~ cond, colData(dds)[sel,]);all.zero <- apply(m1, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m1 <- m1[,-idx]}
m2<-model.matrix(~ 1, colData(dds)[sel,]);all.zero <- apply(m2, 2, function(x) all(x==0));idx <- which(all.zero);if(length(idx)>0){m2 <- m2[,-idx]}
r3 <- results(nbinomLRT(dds[,sel], full=m1, reduced= m2))

all.equal(r3$log2FoldChange, r2$log2FoldChange - r1$log2FoldChange)
# [1] "Mean relative difference: 2.352333e-06"
ADD REPLY
0
Entering edit mode

I have no problem with using nbinomLRT in the correct steps. I may not have any more responses, as I think I've made my points clear above.

set.seed(123)
cnts <- matrix(rnbinom(n=1200, mu=100, size=1/0.5), ncol=12)
cond <- factor(rep(1:3, each=4))
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~cond)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced=~1)
r1 <- results(dds,name="cond_3_vs_1")
r2 <- results(dds,name="cond_2_vs_1")
r3 <- results(dds,contrast=c(0,-1,1))
all.equal(r3$log2FoldChange, r1$log2FoldChange - r2$log2FoldChange)
# [1] TRUE
ADD REPLY
0
Entering edit mode

Mike, I understand - performing contrast is similar to subtraction here and of course the results will be similar and all true. Sorry to bother you one last time with the following code by changing factor level.

set.seed(123)
cnts <- matrix(rnbinom(n=1200, mu=100, size=1/0.5), ncol=12)
cond <- factor(rep(1:3, each=4))
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~cond)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced=~1)
r1 <- results(dds,name="cond_2_vs_1")
r2 <- results(dds,name="cond_3_vs_1")
dds$cond<-relevel(dds$cond,"2") # think its important step in understanding changes happening. 
r3 <- results(nbinomLRT(dds, reduced=~1),name="cond_3_vs_2")
all.equal(r3$log2FoldChange, r2$log2FoldChange - r1$log2FoldChange)
# [1] "Mean relative difference: 2.371032e-06"
ADD REPLY
0
Entering edit mode

This (minor fluctuation after releveling) is just due to numerical convergence, and can be ignored.

> delta <- (r3$log2FoldChange) - (r2$log2FoldChange - r1$log2FoldChange)
> mean(abs(delta)/abs(r3$log2FoldChange))
[1] 6.886913e-06
ADD REPLY
0
Entering edit mode

Mike, I agree using recommended steps would be easy for simple experimental designs - however when using complex design the wrappers need to be broken down to individual steps in order to understand and interpret the results.

For example: taking a complex design where 3 conditions x 2 time points - paired design (at least 5 unique individuals per condition) - in addition one biological variable (eg. Gender) is playing a significant role so that needs to be adjusted.

Following results are needed:

  1. at 1st time point: 3 pair wise comparison of conditions - unpaired (3 results) - adjusted by gender
  2. at 2nd time point: 3 pair wise comparison of conditions - unpaired (3 results) - adjusted by gender
  3. Condition wise changes (Δ) between time points - paired (3 results) - since adjusting for pin (no adjustment for gender is technically needed)
  4. three ΔΔ changes between condition-wise changes across time points - paired (3 results) - need to adjust by pin and gender

in total 12 results

so using a design as follows:

~Gender + Cond + time + cond:pin + cond:time
(cond:pin - will be used as recommended in the vignette/RNA-Seq workflow)

one would not be able to capture the entire spectrum of results needed. So in order to accomplish this, the steps need to be broken down with different designs - and every time estimating dispersions based on the design would change the interpretation of results. (Just for the simplicity of things, it would make it easy if statistical steps could be broken down by using fitNbinomGLMs function as implemented in DESeq.v1) - I know adn understand that DESeq2 was formed for more robust analysis and to make it easy for biologists - its just that when things are hard to be troubleshooted - it makes it more difficult.

ADD REPLY
0
Entering edit mode

I don't think I have any further response here beyond my points above.

ADD REPLY

Login before adding your answer.

Traffic: 693 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6