Hello everyone,
I am trying to increase the number of outliers identified and replaced in my use of DESeq2.
I create my DESeq object and get the following result
## Create DESeq2Dataset object
metadata$groups <- factor(metadata$groups, levels = c("CON","INJL","INJS","RCNL","RCNS","REPL","REPS"))
dds <- DESeqDataSetFromMatrix(count_file[,c(2:43)], colData = metadata, design = ~ groups)
## Run analysis
dds <- DESeq(dds, minReplicatesForReplace = 6)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 39 genes
-- DESeq argument 'minReplicatesForReplace' = 6
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
As I understand it, this means that 39 genes (i.e. rows from count_file[,c(2:43)]) were identified as having at least one count above the default cook's distance threshold--which in my case is 0.99*F(7,42-7) = 3.199952. Here, 7 is my number of groups, and 42 is my number of samples.
Upon looking at my final DEG's, I realized that I should try setting this cook's distance threshold lower because some of my top hits are still clearly being driven by outliers. Once again, I want more than 39 genes to be replaced.
As I understand it, I should be able to accomplish this by modifying my results function like the following:
contrast_REPL_RCNL <- c("groups", "REPL", "RCNL")
res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05) # original
res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05, cooksCutoff=1.0) # modified results function
However, this does nothing to change the results table output. In fact, the results table gives this result regardless of cooksCutoff:
## Number of genes removed due to extreme cook's distance values, where TRUE == removal
table(res_table_REPL_RCNL$baseMean > 0 & is.na(res_table_REPL_RCNL$pvalue))
FALSE
31887
I'm very lost as to why I can't modify the cooksCutoff. My end goal is to have DESeq replace more than 39 genes. Could someone explain how I could make this happen?
Here is my sessionInfo()
> sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] sjPlot_2.8.4 viridis_0.5.1 viridisLite_0.3.0 MASS_7.3-51.6 org.Ss.eg.db_3.11.4
[6] clusterProfiler_3.16.0 pathview_1.28.1 DOSE_3.14.0 gridExtra_2.3 biomaRt_2.44.1
[11] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3 apeglm_1.10.0 ggrepel_0.8.2 tximport_1.16.1
[16] DEGreport_1.24.1 pheatmap_1.0.12 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0
[21] dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 tibble_3.0.3
[26] ggplot2_3.3.2 tidyverse_1.3.0 DESeq2_1.28.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1
[31] matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 IRanges_2.22.2
[36] S4Vectors_0.26.1 BiocGenerics_0.34.0 edgeR_3.30.3 limma_3.44.3
loaded via a namespace (and not attached):
[1] utf8_1.1.4 lme4_1.1-23 tidyselect_1.1.0 RSQLite_2.2.0 grid_4.0.1
[6] BiocParallel_1.22.0 scatterpie_0.1.4 munsell_0.5.0 effectsize_0.3.2 statmod_1.4.34
[11] withr_2.2.0 colorspace_1.4-1 GOSemSim_2.14.1 knitr_1.29 rstudioapi_0.11
[16] emmeans_1.4.8 labeling_0.3 KEGGgraph_1.48.0 lasso2_1.2-20 urltools_1.7.3
[21] bbmle_1.0.23.1 GenomeInfoDbData_1.2.3 mnormt_2.0.1 polyclip_1.10-0 bit64_4.0.2
[26] farver_2.0.3 rprojroot_1.3-2 downloader_0.4 coda_0.19-3 vctrs_0.3.2
[31] generics_0.0.2 xfun_0.16 BiocFileCache_1.12.1 R6_2.4.1 clue_0.3-57
[36] graphlayouts_0.7.0 locfit_1.5-9.4 gridGraphics_0.5-0 bitops_1.0-6 reshape_0.8.8
[41] fgsea_1.14.0 assertthat_0.2.1 scales_1.1.1 ggraph_2.0.3 enrichplot_1.8.1
[46] gtable_0.3.0 tidygraph_1.2.0 rlang_0.4.7 genefilter_1.70.0 GlobalOptions_0.1.2
[51] splines_4.0.1 rtracklayer_1.48.0 europepmc_0.4 broom_0.7.0 BiocManager_1.30.10
[56] yaml_2.2.1 reshape2_1.4.4 modelr_0.1.8 backports_1.1.8 qvalue_2.20.0
[61] tools_4.0.1 psych_2.0.7 ggplotify_0.0.5 logging_0.10-108 ellipsis_0.3.1
[66] ggdendro_0.1-20 ggridges_0.5.2 Rcpp_1.0.5 plyr_1.8.6 progress_1.2.2
[71] zlibbioc_1.34.0 RCurl_1.98-1.2 prettyunits_1.1.1 openssl_1.4.2 GetoptLong_1.0.2
[76] cowplot_1.0.0 haven_2.3.1 cluster_2.1.0 fs_1.5.0 magrittr_1.5
[81] data.table_1.13.0 DO.db_2.9 circlize_0.4.10 triebeard_0.3.0 reprex_0.3.0
[86] tmvnsim_1.0-2 mvtnorm_1.1-1 sjmisc_2.8.5 pkgload_1.1.0 hms_0.5.3
[91] xtable_1.8-4 XML_3.99-0.5 sjstats_0.18.0 emdbook_1.3.12 readxl_1.3.1
[96] ggeffects_0.15.1 shape_1.4.4 testthat_2.3.2 compiler_4.0.1 bdsmatrix_1.3-4
[101] crayon_1.3.4 minqa_1.2.4 geneplotter_1.66.0 lubridate_1.7.9 DBI_1.1.0
[106] sjlabelled_1.1.6 tweenr_1.0.1 dbplyr_1.4.4 ComplexHeatmap_2.4.3 rappdirs_0.3.1
[111] boot_1.3-25 Matrix_1.2-18 cli_2.0.2 insight_0.9.0 igraph_1.2.5
[116] pkgconfig_2.0.3 rvcheck_0.1.8 GenomicAlignments_1.24.0 numDeriv_2016.8-1.1 xml2_1.3.2
[121] annotate_1.66.0 XVector_0.28.0 estimability_1.3 rvest_0.3.6 digest_0.6.25
[126] parameters_0.8.2 ConsensusClusterPlus_1.52.0 graph_1.66.0 Biostrings_2.56.0 cellranger_1.1.0
[131] fastmatch_1.1-0 curl_4.3 Rsamtools_2.4.0 nloptr_1.2.2.2 rjson_0.2.20
[136] lifecycle_0.2.0 nlme_3.1-148 jsonlite_1.7.0 desc_1.2.0 askpass_1.1
[141] fansi_0.4.1 pillar_1.4.6 lattice_0.20-41 Nozzle.R1_1.1-1 KEGGREST_1.28.0
[146] httr_1.4.2 survival_3.2-3 GO.db_3.11.4 glue_1.4.1 bayestestR_0.7.2
[151] png_0.1-7 bit_4.0.4 Rgraphviz_2.32.0 performance_0.4.8 ggforce_0.3.2
[156] stringi_1.4.6 blob_1.2.1 org.Hs.eg.db_3.11.4 memoise_1.1.0
Hi Michael,
Here is a count plot for one of my top DEG's (IGFBP5 or ENSSSCG00000025924): https://ibb.co/txFS1TZ
The comparison here is between the purple REPL group and the green RCNL group. The Cook's distance for the outlier sample is 1.36522020, whereas all other samples in the comparison are well below 0.08. Here is the code for finding the Cook's distances:
The real problem is that setting cooksCutoff to 1.0 doesn't seem to change the results() output:
Shouldn't this gene be excluded from the results table because one of its Cook's distances is over the cutoff? I could remove this gene manually or with a count cutoff, but I figured results() would remove it based on one of the samples being over cooksCutoff... Am I doing something wrong with the results() function?
What do you have in
mcols(dds)$maxCooks
for this gene? I am surprised that the last line didn'tNA
the p-value.mcols(dds)$maxCooks is NA for every gene (which includes the gene of interest). This might be the source of the issue. Any idea as to why this is?
Ah I am remembering the logic now. The outlier replacement procedure is as follows: filtering will only occur after replacement for groups that were too small for replacement. So filtering here would only be triggered for groups with 3-5 samples. So because the outlier is too small to hit the threshold for replacement it isn't considered for filtering. You could turn off replacement in
DESeq()
(minRep=Inf
), which would allow such a gene to be filtered withresults()
. Or you could manually pull out these genes using your code above, scanning for large Cook's values.I've never been satisfied with the outlier procedures for GLM. I am sympathetic to the nonparametric approaches, e.g. SAMseq, which are robust to data that doesn't fit the distribution (e.g. the above count is not really consistent with Negative Binomial).
Ah! That successfully filtered out the gene of interest. Thank you!
Per your last comment, are you suggesting that I utilize SAMseq instead of DESeq because some of my counts don't fit the Negative Binomial? I would say that my data largely does fit the NB distribution with the exception of a few genes.
Ah! That successfully filtered out the gene of interest. Thank you!
Per your last comment, are you suggesting that I utilize SAMseq instead of DESeq because some of my counts don't fit the Negative Binomial? I would say that my data largely does fit the NB distribution with the exception of a few genes.
Re: my last comment, no I think filtering can work. I made that comment more to express my frustration with outlier procedures, which require a lot of hands-on inspection to be trusted (this is ok, this is what you've done and you've found a setting that works through manual inspection), compared to SAMseq, which "just works".