Hi,
I am encountering an unexpected behaviour with regularized logarithmic values for one gene, when there is:
- a large number of conditions (12),
- limited replication number (3),
- the read counts in most samples is 0, and
- there is one sample for which the read count for the gene is high.
When these circumstances are met, the regularized logarithm value for this gene in the outlier condition is much higher than any other value in the dataset.
This effect can be seen in the toy example below, using the highest expressed gene in my dataset, the gene with the unexpected behaviour, and 10 genes selected at random to generate the dds object. Please note that the number of reads in sample A_b_1
for gene Question
is much higher (2471) than in all other samples, and that its behaviour is just as unexpected in my full dataset (60321 genes in 107 samples).
E <- rbind(
HighExpr=c(14413010, 14729567, 16831319, 9285, 42687, 208961, 7768481, 12181302, 10001122, 1275, 19230, 2846911, 15888015, 15805554, 13694644, 4124374, 43071, 16584140, 17740751, 15121608, 740, 675, 1093, 15783974, 14289011, 13051649, 1432938, 2229190, 2194025, 15630065, 14096331, 14922741, 2925, 4603, 26915),
Question=c(0, 0, 0, 2471, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0),
Random_0=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_1=c(0, 0, 0, 0, 14, 1, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
Random_2=c(64, 67, 71, 35, 15, 216, 87, 166, 220, 100, 114, 321, 105, 167, 173, 155, 158, 186, 235, 166, 56, 177, 119, 50, 102, 43, 311, 373, 333, 74, 106, 123, 289, 267, 374),
Random_3=c(22, 40, 16, 12, 28, 77, 4, 18, 5, 45, 59, 20, 50, 37, 26, 21, 55, 49, 40, 39, 52, 58, 35, 49, 55, 50, 58, 54, 59, 26, 29, 25, 40, 65, 120),
Random_4=c(1, 3, 7, 77, 25, 40, 4, 1, 7, 15, 46, 17, 7, 0, 4, 8, 25, 8, 15, 10, 7, 22, 5, 5, 3, 2, 6, 13, 6, 12, 9, 3, 8, 3, 17),
Random_5=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_6=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_7=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
Random_8=c(3255, 2734, 3156, 11996, 14702, 8891, 3011, 3161, 3364, 8394, 9060, 8458, 3991, 4067, 3630, 9788, 11737, 4580, 5183, 4147, 9150, 10252, 8110, 3530, 3557, 2837, 10429, 9828, 8649, 3863, 3709, 4091, 9683, 10046, 15143),
Random_9=c(0, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 1, 5, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 5, 0, 4, 2, 0, 0, 1, 0, 1, 0, 5)
)
colnames(E) <- c("A_a_1", "A_a_2", "A_a_3", "A_b_1", "A_b_2", "A_b_3", "A_c_1", "A_c_2", "A_c_3", "A_d_1", "A_d_2", "A_d_3", "B_a_1", "B_a_2", "B_a_3", "B_b_1", "B_b_3", "B_c_1", "B_c_2", "B_c_3", "B_d_1", "B_d_2", "B_d_3", "C_a_1", "C_a_2", "C_a_3", "C_b_1", "C_b_2", "C_b_3", "C_c_1", "C_c_2", "C_c_3", "C_d_1", "C_d_2", "C_d_3")
Cov <- data.frame(Sample=colnames(E), Group=sub("_[123]$", "", colnames(E)), Repl=sub("^[ABC]_[abcd]_", "", colnames(E)))
dds <- DESeqDataSetFromMatrix(countData=E, colData=Cov, design=~Group)
dds <- DESeq(dds)
rld <- rlog(dds)
The regularised logarithm value of the Question
gene in sample A_b_1
is more than twice higher than the highest regularized likelihood of the HighExpr
gene.
> counts(dds)[1:2, 1:5]
A_a_1 A_a_2 A_a_3 A_b_1 A_b_2
HighExpr 14413010 14729567 16831319 9285 42687
Question 0 0 0 2471 4
> assay(rld)[1:2, 1:5]
A_a_1 A_a_2 A_a_3 A_b_1 A_b_2
HighExpr 24.260943 23.913180 24.598851 15.18478 17.231491
Question 6.480747 6.480516 6.480839 53.75706 6.558154
> max(assay(rld)[1,])
[1] 24.59885
However, when fitNbinomGLMs is invoked with argument forceOptim=TRUE
from rlogData
, this unexpected behaviour disappears, see below:
> cbind(assay(rld_forceFalse)["Question",1:5], assay(rld_forceTrue)["Question",1:5])
[,1] [,2]
A_a_1 6.480747 -3.815157
A_a_2 6.480516 -3.935015
A_a_3 6.480839 -3.778702
A_b_1 53.757064 12.202386
A_b_2 6.558154 3.153798
My questions are:
- Is my expectation of regularized logarithm values plain wrong? (i.e. could super high values be sometimes reached for genes with limited counts)
- Am I using DESeq or rlog wrong?
- Is there a way to force optimisation for all genes (or is it too slow?)
Thanks in advance for your help! Best, Eric
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS/LAPACK: /home/eblanc/miniconda3/envs/R_test/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.28.0 SummarizedExperiment_1.18.1
[3] DelayedArray_0.14.0 matrixStats_0.56.0
[5] Biobase_2.48.0 GenomicRanges_1.40.0
[7] GenomeInfoDb_1.24.0 IRanges_2.22.1
[9] S4Vectors_0.26.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 pillar_1.4.6 compiler_4.0.2
[4] RColorBrewer_1.1-2 XVector_0.28.0 bitops_1.0-6
[7] tools_4.0.2 zlibbioc_1.34.0 digest_0.6.25
[10] bit_4.0.4 tibble_3.0.3 lifecycle_0.2.0
[13] gtable_0.3.0 annotate_1.66.0 RSQLite_2.2.0
[16] memoise_1.1.0 lattice_0.20-41 pkgconfig_2.0.3
[19] rlang_0.4.7 Matrix_1.2-18 DBI_1.1.0
[22] GenomeInfoDbData_1.2.3 genefilter_1.70.0 vctrs_0.3.2
[25] locfit_1.5-9.4 bit64_4.0.2 grid_4.0.2
[28] glue_1.4.1 R6_2.4.1 AnnotationDbi_1.50.0
[31] XML_3.99-0.3 survival_3.2-3 BiocParallel_1.22.0
[34] magrittr_1.5 ggplot2_3.3.2 geneplotter_1.66.0
[37] blob_1.2.1 ellipsis_0.3.1 scales_1.1.1
[40] splines_4.0.2 colorspace_1.4-1 xtable_1.8-4
[43] munsell_0.5.0 RCurl_1.98-1.2 crayon_1.3.4
Hi Michael,
OK, I'll stick to vst then.
Thanks for your help (& for DESeq2!) Eric