I'm currently having some issues with how DESeq2
implements count outlier replacement when the number of samples suffices. From what I understand, the default is to use:
99% quantile of the F(p,m-p)
My specific questions are:
- What is the rational behind the default Cook's distance threshold for TMM replacement?
- What is the risk of lowering the threshold?
The reason I ask is that I'm wondering if the default setting is too stringent for my dataset. I have ~200 samples with a minimum of 12 samples per group. There's one particular gene which I expect to be positively associated with one of my groups. In my data, I see it as significantly differentially regulated padj < .05
, but it shows the opposite expected trend (down-regulated instead of up-regulated). I traced this back to a single count outlier in my control group. This can been seen in the ![attached plot](https://pitt-my.sharepoint.com/:b:/g/personal/del53_pitt_edu/EYoKB0_KHFJNk6ePSaAh-pABvz6Asu92nFrJ86fzukA6eQ?e=LJvYv3) , which shows box-plots of the raw and transformed counts. The Cook's distance for this particular outlier is ~4 which is below the threshold for replacement.
The code chunk below simulates the issues I described. Here, gene1
has a single outlier in the A
group which is below the cooks and is therefore not replaced.
require(DESeq2)
require(dplyr)
set.seed(42)
n_genes <- 500
m_samples <- 60
# Creating DDS with 3 groups
dds <- makeExampleDESeqDataSet(n=n_genes, m = m_samples)
colData(dds) <- DataFrame(condition=factor(c(rep('A', m_samples/4 - 3),rep('B', m_samples/4 + 3),rep('C', m_samples/2))),row.names = colnames(dds))
mcols(dds) <- NULL # not sure if neccessary
# Creates NB counts with last group having highest mean
# and first containing a single high count outlier
counts(dds)[1,] <- as.integer(c(rnbinom(m_samples/2, 5, mu=50) + rnbinom(m_samples/2, 5, mu=25), rnbinom(m_samples/2, 1, mu=150) + rnbinom(m_samples/2, 1, mu=100)))
counts(dds)[1, 1] <- as.integer(4*max(counts(dds)[1,]))
counts(dds) <- matrix(sapply(counts(dds), as.integer), dim(counts(dds)))
plotCounts(dds, gene = 'gene1')
results(DESeq(dds))['gene1',]
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] reshape2_1.4.3 ggpubr_0.2
[3] magrittr_1.5 ggplot2_3.1.0
[5] bindrcpp_0.2.2 DESeq2_1.22.1
[7] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[9] BiocParallel_1.16.2 matrixStats_0.54.0
[11] Biobase_2.42.0 GenomicRanges_1.34.0
[13] GenomeInfoDb_1.18.1 IRanges_2.16.0
[15] S4Vectors_0.20.1 BiocGenerics_0.28.0
[17] dplyr_0.7.8
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 doParallel_1.0.14
[4] RColorBrewer_1.1-2 rprojroot_1.3-2 ggsci_2.9
[7] tools_3.5.1 backports_1.1.2 utf8_1.1.4
[10] R6_2.3.0 rpart_4.1-13 Hmisc_4.1-1
[13] DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2
[16] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5
[19] gridExtra_2.3 bit_1.1-14 compiler_3.5.1
[22] cli_1.0.1 htmlTable_1.12 labeling_0.3
[25] scales_1.0.0 checkmate_1.8.5 genefilter_1.64.0
[28] stringr_1.3.1 digest_0.6.18 foreign_0.8-71
[31] rmarkdown_1.10 XVector_0.22.0 base64enc_0.1-3
[34] pkgconfig_2.0.2 basicOmics_0.1.0 htmltools_0.3.6
[37] limma_3.38.3 htmlwidgets_1.3 rlang_0.3.0.1
[40] rstudioapi_0.8 RSQLite_2.1.1 BiocInstaller_1.32.1
[43] bindr_0.1.1 jsonlite_1.5 acepack_1.4.1
[46] RCurl_1.95-4.11 GenomeInfoDbData_1.2.0 Formula_1.2-3
[49] Matrix_1.2-15 Rcpp_1.0.0 munsell_0.5.0
[52] fansi_0.4.0 stringi_1.2.4 yaml_2.2.0
[55] zlibbioc_1.28.0 plyr_1.8.4 grid_3.5.1
[58] blob_1.1.1 crayon_1.3.4 lattice_0.20-38
[61] cowplot_0.9.3 splines_3.5.1 annotate_1.60.0
[64] locfit_1.5-9.1 knitr_1.20 pillar_1.3.0
[67] geneplotter_1.60.0 codetools_0.2-15 XML_3.98-1.16
[70] glue_1.3.0 evaluate_0.12 latticeExtra_0.6-28
[73] data.table_1.11.8 BiocManager_1.30.4 foreach_1.4.4
[76] gtable_0.2.0 purrr_0.2.5 assertthat_0.2.0
[79] xtable_1.8-3 ggcorrplot_0.1.2 survival_2.43-3
[82] tibble_1.4.2 iterators_1.0.10 AnnotationDbi_1.44.0
[85] memoise_1.1.0 cluster_2.0.7-1
Hi Michael!
Thanks for getting back to me so quickly! I went back over your paper and the vignette and my interpretation was that the TMM replacement method would err on the side of false negatives rather than positives.
This seems a reasonable trade off compared to the alternative methods described in your paper. My initial concern was that lowering the threshold would produce an unpredictable bias. However, after thinking a bit more, I suppose that any threshold I choose will replace the counts of
0
ton
samples with the TMM. If I set the threshold such that alln
samples have their counts replaced with the TMM, then I would obviously not reject the null. If I did this with half the samples, then I would expect the counts to be more similar than the they were initially and thus less likely to reject the null. Likewise for any number of samples' counts with the TMM.For my own dataset, the number of genes containing a Cook's value
>= 1
is around 700, and of those only about 200 had apadj < .05
. So my assumption is that lowering the threshold will not drastically alter the results.To implement TMM replacement using a lowered Cook's cutoff, I ran the standard
DESeq
wrapper function, followed by updating theDESeqDataSet
object using thereplaceOutliersWithTrimmedMean
and then re-runningDESeq
wrapper with the updated counts. This is shown in the following code chunk:That should work.
Well hopefully have an improved method for outliers out soon next year.