Deviance decomposition with DESeq2
1
0
Entering edit mode
@giovanni-bacci-15185
Last seen 4.6 years ago
Italy, Florence, University of Florence

Hello,

I'm using DESeq2 to analyse a large dataset with basically three predictors:

1 - Subjects = the ID of the subject involved in the experiment (categorical)

2 - Culture medium = the culture medium (categorical)

3 - Time = the number of days (continuous)

I would like to inspect the proportion of variance explained by the three predictors for each gene to categorize them based on the most influencing factor (this is only an exploratory analysis). Following the DESeq2 manual, the paper, and this post, I wrote this little script but I'm not sure if it is correct:

models <- list(
  all = "~ medium + subject + days",
  no.food = "~ subject + days",
  no.subject = "~ days",
  null = "~ 1"
)

deviances <- lapply(models, function(m){
  design(dds) <- as.formula(m)
  dds <- DESeq(dds, test = "Wald", fitType = "parametric")
  mcols(dds)$deviance
})

residual <- deviances[[length(deviances)]]

deviances <- lapply(seq_along(deviances)[-1], function(i){
  n <- i - 1
  abs(deviances[[i]] - deviances[[n]])
})

deviances <- data.frame(do.call(cbind, deviances))
deviances$residual <- residual - rowSums(deviances)
deviances <- data.frame(t(apply(deviances, 1, function(x) x / sum(x))))
deviances <- setNames(deviances, c("medium", "subject", "days", "residual"))

Probably I'm doing something wrong since I ended up with negative residuals for some genes ...

Thanks in advance!

cheers

Giovanni

 

sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggtern_2.2.1               vegan_2.4-6                lattice_0.20-35           
 [4] permute_0.9-4              ggsci_2.8                  scales_0.5.0.9000         
 [7] bindrcpp_0.2               dplyr_0.7.4                ape_5.0                   
[10] ggthemes_3.4.0             DESeq2_1.18.1              SummarizedExperiment_1.8.1
[13] DelayedArray_0.4.1         matrixStats_0.53.1         GenomicRanges_1.30.2      
[16] GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0          
[19] phyloseq_1.23.1            metagenomeSeq_1.20.1       RColorBrewer_1.1-2        
[22] glmnet_2.0-13              foreach_1.4.4              Matrix_1.2-11             
[25] limma_3.34.9               Biobase_2.38.0             BiocGenerics_0.24.0       
[28] gridExtra_2.3              ggplot2_2.2.1             

loaded via a namespace (and not attached):
 [1] nlme_3.1-131.1         bitops_1.0-6           bit64_0.9-7            latex2exp_0.4.0       
 [5] tensorA_0.36           tools_3.4.4            backports_1.1.2        utf8_1.1.3            
 [9] R6_2.2.2               rpart_4.1-13           KernSmooth_2.23-15     Hmisc_4.1-1           
[13] DBI_0.7                lazyeval_0.2.1         mgcv_1.8-23            colorspace_1.3-2      
[17] ade4_1.7-10            nnet_7.3-12            bayesm_3.1-0.1         bit_1.1-12            
[21] compiler_3.4.4         compositions_1.40-1    cli_1.0.0              htmlTable_1.11.2      
[25] labeling_0.3           caTools_1.17.1         checkmate_1.8.5        DEoptimR_1.0-8        
[29] robustbase_0.92-8      genefilter_1.60.0      stringr_1.3.0          digest_0.6.15         
[33] foreign_0.8-69         XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1       
[37] htmltools_0.3.6        htmlwidgets_1.0        rlang_0.2.0            rstudioapi_0.7        
[41] RSQLite_2.0            bindr_0.1              energy_1.7-2           jsonlite_1.5          
[45] BiocParallel_1.12.0    gtools_3.5.0           acepack_1.4.1          RCurl_1.95-4.10       
[49] magrittr_1.5           GenomeInfoDbData_1.0.0 Formula_1.2-2          biomformat_1.6.0      
[53] Rcpp_0.12.15           munsell_0.4.3          proto_1.0.0            stringi_1.1.6         
[57] yaml_2.1.16            MASS_7.3-49            zlibbioc_1.24.0        rhdf5_2.22.0          
[61] gplots_3.0.1           plyr_1.8.4             blob_1.1.0             gdata_2.18.0          
[65] crayon_1.3.4           Biostrings_2.46.0      splines_3.4.4          multtest_2.34.0       
[69] annotate_1.56.1        locfit_1.5-9.1         knitr_1.20             pillar_1.1.0          
[73] igraph_1.1.2           boot_1.3-20            geneplotter_1.56.0     reshape2_1.4.3        
[77] codetools_0.2-15       glue_1.2.0             XML_3.98-1.10          latticeExtra_0.6-28   
[81] data.table_1.10.4-3    gtable_0.2.0           assertthat_0.2.0       xtable_1.8-2          
[85] survival_2.41-3        tibble_1.4.2           iterators_1.0.9        AnnotationDbi_1.40.0  
[89] memoise_1.1.0          cluster_2.0.6         

 

deseq2 analysis of deviance • 1.4k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

The deviance I am returning in mcols(dds) is -2 * log likelihood.

The LRT statistic I return is 2 * (full log likelihood - reduced log likelihood) = reduced deviance - full deviance, which is assumed to follow a chi square distribution. Since the log likelihood increases when coefficients are added to the model, and deviance as defined above has the opposite sign, it's expected that reduced deviance > full deviance. You should not put an absolute value in the difference calculation above.

ADD COMMENT
0
Entering edit mode

Sorry if I come back to this after quite a long time, but I'm still struggling to produce reliable results. It seems that the above calculation produced some negative values. I would say that the factor considered is somehow worsening the model by increasing the deviance of the full model in respect with the reduced one but I don't know how to deal with it.  Should I treat those values as zeros or should I leave them as negative values?

Thanks again!

Giovanni

ADD REPLY
0
Entering edit mode

I would suggest using the dispersion from the full model. If you run separate models, you are re-estimating dispersion each time. You can run estimateDispersion(dds) for the full model and then for the reduced models, you can set dispersion with dispersions(dds_reduce) <- dispersions(dds_full). Then you only need to re-run the nbinomWald() function, not the whole procedure.

ADD REPLY
0
Entering edit mode

Dear Michael,
thanks for your help! I changed my code as follows:

# Full model
dds.full <- DESeq(dds.full, test = "Wald", fitType = "parametric", betaPrior = T)

# Deviance estimation of reduced models
deviances <- sapply(models[-1], function(m){
  dds <- dds.full # Copying the full model
  design(dds) <- as.formula(m) # Changing design
  dds <- nbinomWaldTest(dds, betaPrior = T) # Re-run wald test
  return(mcols(dds)$deviance) # Returning deviances
})
deviances <- cbind(mcols(dds.full)$deviance, deviances) # Adding deviances of the full model

This way I need only to change the design of my experiment leaving everything else as already estimated in the full model . Then, to build the full deviance table:

deviances <- data.frame(medium = deviances[,2] - deviances[,1],
                        food = deviances[,3] - deviances[,2],
                        days = deviances[,4] - deviances[,3]) %>%
  mutate(residual = deviances[,4] - (medium + food + days)) %>%
  as.matrix() %>%
  prop.table(margin = 1) %>%
  data.frame(row.names = rownames(dds.full))

This seems to work, now I get only positive values and they seem reasonable. Does it sound good to you?

Thaks again,

ciao,

Giovanni

 

ADD REPLY
0
Entering edit mode

I wouldn't use the beta prior if you're calculating deviances and comparing them across models. The LRT theory is not for biased estimates of coefficients (the beta prior induces a bias).

ADD REPLY

Login before adding your answer.

Traffic: 742 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6