I have a data set in which the rlog transformation produces, in one condition, a few genes with very high log-transformed counts. Here is a minimal example (the real data set has more genes but shows the same problem):
library("DESeq2") library("tidyverse") # load data load(url("https://github.com/frederikziebell/rlog-example/raw/master/data.RData")) # # re-run DESeq and rlog; takes about 15min on a laptop # rowData(dds) <- NULL # colData(dds) <- colData(dds)[,c("condition","replicate")] # dds <- DESeq(dds) # rld <- rlog(dds) # rlog vs. log2(.+1) assay(rld) %>% data.frame %>% rownames_to_column(var="gene_id") %>% gather("sample","rlog",-gene_id) %>% left_join( counts(dds, normalized=TRUE) %>% data.frame %>% rownames_to_column(var="gene_id") %>% gather("sample","count",-gene_id), by=c("gene_id","sample") ) %>% ggplot(aes(log2(count+1),rlog)) + xlim(-5,30) + ylim(-5,100) + coord_fixed(ratio=35/105)+ geom_hex(bins=80) + scale_fill_gradient(name = "count", trans = "log", breaks = 10^(0:5)) + geom_abline(slope=1, intercept=0, color="red")
Looking at those genes, it appears that some (but not all) of them have a very high MAP dispersion estimate and are therefore flagged as dispersion outliers.
# find genes with high rlog value
trouble_genes <- assay(rld) %>%
  data.frame %>%
  rownames_to_column(var="gene_id") %>%
  gather("sample","value",-gene_id) %>%
  filter(value>40) %>%
  pull(gene_id) %>%
  unique

# info on genes with high rlog value
rowData(rld)[which(rownames(rld) %in% trouble_genes), c(1:8,316:323)]
DataFrame with 7 rows and 16 columns
   baseMean   baseVar allZero   dispersion dispIter dispOutlier      dispMAP  Intercept betaConv betaIter deviance    maxCooks replace dispGeneEst    dispFit    rlogIntercept
  <numeric> <numeric> <logical>  <numeric> <integer>  <logical>    <numeric>  <numeric> <logical> <numeric> <numeric>  <numeric> <logical>  <numeric>  <numeric>        <matrix>
1  33.48204  81957.56     FALSE   0.98408592       10       FALSE   0.98408592  1.4531139     TRUE        9 456.8104 0.007709898   FALSE    15.51009 0.09718006  5.0658900019208
2  33.27621  80449.15     FALSE   0.05568701        4       FALSE   0.05568701  0.7721303     TRUE        7 452.0163 0.008590950   FALSE    13.36176 0.09737754 5.05696691913032
3  15.17365  17086.37     FALSE 228.00000000       12        TRUE 228.00000000 -0.1547366     TRUE        7 275.3917 0.005929372   FALSE    82.04500 0.13570113 3.92289498948781
4  20.64698  31552.81     FALSE 228.00000000       11        TRUE  39.37817492 -0.1036617     TRUE        8 475.7777 0.006615980    TRUE    41.18565 0.11702636 4.36741465803137
5  20.65373  31552.54     FALSE 228.00000000       11        TRUE  35.66040070 -0.1034237     TRUE        8 487.4940 0.006618849    TRUE    39.87769 0.11700946  4.3678728069692
6  17.43267  22559.42     FALSE 228.00000000        7        TRUE 228.00000000 -0.1543131     TRUE        6 309.6846 0.002378541   FALSE    71.02055 0.12657227 4.12325093031352
7  33.27158  80449.98     FALSE   0.05742513        4       FALSE   0.05742513  0.7724590     TRUE        7 449.8221 0.008426312   FALSE    13.52475 0.09738201 5.05676727448283
Here is a representative example for one of those genes:
plotCounts(dds, trouble_genes[1], intgroup="condition", returnData=TRUE) %>%
  ggplot(aes(condition, log2(count+1))) +
    geom_point() +
    theme(axis.text.x=element_text(angle=90, hjust=0))
Is there a way to get more meaningful rlog-transformed values for those gene/sample pairs? I also tried setting blind=FALSE in the rlog() call, but without success.
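For reference, this is roughly the variant I tried (just a sketch; the values for the genes listed above look essentially the same as with the blind transformation):

# use the design in the transformation instead of the blind default;
# this did not change the extreme values for the trouble genes above
rld_design <- rlog(dds, blind=FALSE)
assay(rld_design)[trouble_genes, ]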
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2               hexbin_1.27.2              forcats_0.2.0              stringr_1.2.0              dplyr_0.7.4
 [6] purrr_0.2.4                readr_1.1.1                tidyr_0.7.2                tibble_1.4.2               ggplot2_2.2.1
[11] tidyverse_1.2.1            DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.0
[16] Biobase_2.38.0             GenomicRanges_1.30.1       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0
[21] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           bitops_1.0-6           lubridate_1.7.1        bit64_0.9-7            RColorBrewer_1.1-2     httr_1.3.1
 [7] tools_3.4.2            backports_1.1.2        R6_2.2.2               rpart_4.1-12           Hmisc_4.1-1            DBI_0.7
[13] lazyeval_0.2.1         colorspace_1.3-2       nnet_7.3-12            tidyselect_0.2.3       gridExtra_2.3          mnormt_1.5-5
[19] bit_1.1-12             compiler_3.4.2         cli_1.0.0              rvest_0.3.2            htmlTable_1.11.2       xml2_1.2.0
[25] labeling_0.3           scales_0.5.0           checkmate_1.8.5        psych_1.7.8            genefilter_1.60.0      digest_0.6.14
[31] foreign_0.8-69         XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6        htmlwidgets_1.0
[37] rlang_0.1.6            readxl_1.0.0           rstudioapi_0.7         RSQLite_2.0            bindr_0.1              jsonlite_1.5
[43] BiocParallel_1.12.0    acepack_1.4.1          RCurl_1.95-4.10        magrittr_1.5           GenomeInfoDbData_1.0.0 Formula_1.2-2
[49] Matrix_1.2-12          Rcpp_0.12.15           munsell_0.4.3          stringi_1.1.6          yaml_2.1.16            zlibbioc_1.24.0
[55] plyr_1.8.4             grid_3.4.2             blob_1.1.0             crayon_1.3.4           lattice_0.20-35        haven_1.1.1
[61] splines_3.4.2          annotate_1.56.1        hms_0.4.1              locfit_1.5-9.1         knitr_1.18             pillar_1.1.0
[67] geneplotter_1.56.0     reshape2_1.4.3         XML_3.98-1.9           glue_1.2.0             latticeExtra_0.6-28    data.table_1.10.4-3
[73] modelr_0.1.1           cellranger_1.1.0       gtable_0.2.0           assertthat_0.2.0       xtable_1.8-2           broom_0.4.3
[79] survival_2.41-3        AnnotationDbi_1.40.0   memoise_1.1.0          cluster_2.0.6
Hi Mike,
thank you for your explanations. The reason I prefer rlog() over vst() is that it achieves almost identical within-sample count distributions across all samples, which may be connected to the size-factor issue you mentioned. Do you see potential future improvements of rlog() using the different shrinkage methods currently implemented in lfcShrink()?
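To illustrate what I mean (just a quick sketch, comparing the per-sample distributions of the two transformations on this data set):

# per-sample distributions of the transformed values;
# with rlog the boxes line up almost perfectly across samples
vsd <- vst(dds)
boxplot(assay(rld), las=2, main="rlog")
boxplot(assay(vsd), las=2, main="vst")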
That’s a really good question. In theory the new estimators would be better than the normal-prior-based estimators, but I’m not directly working on that implementation right now and have too many other projects on my plate. So for DESeq2 I think vst() is best here. Another option would be to try zinbwave factorization, which can account for groups with zeros and high counts.
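For example, something along these lines (a minimal sketch, assuming dds has already been run through DESeq() as above):

# variance stabilizing transformation in place of rlog();
# blind=FALSE re-uses the design-based dispersion estimates from DESeq()
vsd <- vst(dds, blind=FALSE)
head(assay(vsd))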
Two more points: