Hi everyone,
I have several questions about outlier detection for continuous phenotype using DESeq2. When analyzing continuous phenotype, I got weird results. Below is a reproducible example.
Suppose the phenotype of interest is a continuous variable. However, the continuous variable only takes integer values. e.g. the pheno1
in the following short example.
library(DESeq2)
set.seed(1)
countData = matrix(rnbinom(30*20, mu=1000, size=0.5), ncol=20)
mode(countData) = "integer"
pheno1 = c(rep(3,3),rep(5,5),rep(2,3),rep(10,4),rep(15,5))
##pheno1
## 3 3 3 5 5 5 5 5 2 2 2 10 10 10 10 15 15 15 15 15
colData = DataFrame(pheno = pheno1)
dds = DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ pheno)
dds = DESeq(dds)
mcols(dds)
the maxCooks
column in mcols(dds)
is not NA, meaning that the outlier detection will be performed in results(dds,name="pheno")
.
However, if we add a very small random number to the pheno1
, e.g pheno2
in the following code
set.seed(1)
pheno2 = pheno1 + rnorm(20,0,0.0001)
pheno2
> pheno2
[1] 2.999937 3.000018 2.999916 5.000160 5.000033 4.999918 5.000049
[8] 5.000074 2.000058 1.999969 2.000151 10.000039 9.999938 9.999779
[15] 10.000112 14.999996 14.999998 15.000094 15.000082 15.000059
colData2 = DataFrame(pheno = pheno2)
dds2 = DESeqDataSetFromMatrix(countData = countData, colData = colData2, design = ~ pheno)
dds2 = DESeq(dds2)
mcols(dds2)
the maxCooks
column in mcols(dds2)
is ALL NA, meaning that no outlier will be detected.
In this example, results(dds,name="pheno")
and results(dds2,name="pheno")
produce almost the same result. But in real data, they could be different, especially for the pvalue
column, due to the outlier detection.
My question is:
(1) For the two examples above, the difference between pheno1
and pheno2
is very tiny. But when calling results
function, one performs outlier detection, the other one doesn't. Is this make sense?
(2) For the continuous phenotype, should the outlier be considered when using DESeq2?
Thanks in advance.
Xu Ren
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.18.1 SummarizedExperiment_1.8.1
[3] DelayedArray_0.4.1 matrixStats_0.54.0
[5] Biobase_2.38.0 GenomicRanges_1.30.3
[7] GenomeInfoDb_1.14.0 IRanges_2.12.0
[9] S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.0 Formula_1.2-3
[4] assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1
[7] GenomeInfoDbData_1.0.0 pillar_1.3.1 RSQLite_2.1.1
[10] backports_1.1.3 lattice_0.20-35 glue_1.3.0
[13] digest_0.6.18 RColorBrewer_1.1-2 XVector_0.18.0
[16] checkmate_1.9.1 colorspace_1.4-0 htmltools_0.3.6
[19] Matrix_1.2-9 plyr_1.8.4 XML_3.98-1.17
[22] pkgconfig_2.0.2 genefilter_1.60.0 zlibbioc_1.24.0
[25] purrr_0.3.0 xtable_1.8-3 scales_1.0.0
[28] BiocParallel_1.12.0 htmlTable_1.13.1 tibble_2.0.1
[31] annotate_1.56.2 ggplot2_3.1.0 nnet_7.3-12
[34] lazyeval_0.2.1 survival_2.41-3 magrittr_1.5
[37] crayon_1.3.4 memoise_1.1.0 foreign_0.8-67
[40] tools_3.4.0 data.table_1.12.0 stringr_1.4.0
[43] locfit_1.5-9.1 munsell_0.5.0 cluster_2.0.6
[46] AnnotationDbi_1.40.0 bindrcpp_0.2.2 compiler_3.4.0
[49] rlang_0.3.1 grid_3.4.0 RCurl_1.95-4.11
[52] rstudioapi_0.9.0 htmlwidgets_1.3 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.2.0 DBI_1.0.0
[58] R6_2.3.0 gridExtra_2.3 knitr_1.21
[61] dplyr_0.7.8 bit_1.1-14 bindr_0.1.1
[64] Hmisc_4.2-0 stringi_1.3.1 Rcpp_1.0.0
[67] geneplotter_1.56.0 rpart_4.1-11 acepack_1.4.1
[70] tidyselect_0.2.5 xfun_0.4