Hello!
I am running DESeq2 on a dichotomous outcome (77 no, 41 yes). My concern, is that I am getting an enormous number of counts labeled as low. I've been using the same count matrix for other runs, just pulling the samples I need for the process (which I get. changes the matrix depending on which samples are used), but I have never gotten anything like this. The data was originally 183 samples, parred down to 117 for this specific outcome. I have 20338 genes.
Code:
data<-read.csv("TotalRNA.csv", header=TRUE, row.names = 1, stringsAsFactors = FALSE)
cD<-read.csv("Book2.csv",header = TRUE,row.names = 1)
all(rownames(cD)==colnames(data))
dds<-DESeqDataSetFromMatrix(countData = data, colData = cD, design =~Sex+Condition)
dds<-DESeq(dds)
res<-results(dds)
plotMA(res,ylim=c(-2,2))
summary(res)
sum(res$padj < 0.1, na.rm=TRUE)
Results:
out of 20338 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 907, 4.5%
LFC < 0 (down) : 101, 0.5%
outliers [1] : 0, 0%
low counts [2] : 15378, 76%
(mean count < 160)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Thanks for any feedback!
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.16.1 SummarizedExperiment_1.6.5 DelayedArray_0.2.7 matrixStats_0.52.2 Biobase_2.36.2
[6] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.10.5 S4Vectors_0.14.7 BiocGenerics_0.22.1
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 Rcpp_0.12.14 lattice_0.20-35 tidyr_0.7.2 assertthat_0.2.0 digest_0.6.12
[7] R6_2.2.2 plyr_1.8.4 backports_1.1.1 acepack_1.4.1 RSQLite_2.0 ggplot2_2.2.1
[13] zlibbioc_1.22.0 rlang_0.1.4 lazyeval_0.2.1 rstudioapi_0.7 data.table_1.10.4-3 annotate_1.54.0
[19] blob_1.1.0 rpart_4.1-11 Matrix_1.2-12 checkmate_1.8.5 splines_3.4.2 BiocParallel_1.10.1
[25] geneplotter_1.54.0 stringr_1.2.0 foreign_0.8-69 htmlwidgets_0.9 bit_1.1-12 RCurl_1.95-4.8
[31] munsell_0.4.3 compiler_3.4.2 pkgconfig_2.0.1 base64enc_0.1-3 htmltools_0.3.6 nnet_7.3-12
[37] tibble_1.3.4 gridExtra_2.3 htmlTable_1.11.0 GenomeInfoDbData_0.99.0 Hmisc_4.0-3 XML_3.98-1.9
[43] dplyr_0.7.4 bitops_1.0-6 grid_3.4.2 DBI_0.7 xtable_1.8-2 gtable_0.2.0
[49] magrittr_1.5 scales_0.5.0 stringi_1.1.6 XVector_0.16.0 genefilter_1.58.1 bindrcpp_0.2
[55] latticeExtra_0.6-28 Formula_1.2-2 RColorBrewer_1.1-2 tools_3.4.2 bit64_0.9-7 glue_1.2.0
[61] purrr_0.2.4 survival_2.41-3 AnnotationDbi_1.38.2 colorspace_1.3-2 cluster_2.0.6 memoise_1.1.0
[67] knitr_1.17 bindr_0.1
Thanks for looking!
Ok, then, can you try this code from the vignette:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#independent-filtering-of-results
This will plot the reason that the high threshold was chosen. If the plot looks strange, then you may want to just pick a single reasonable threshold across all your result tables, and use
independentFiltering=FALSE
. Some code for that is in a recent thread:C: contamination and DESeq2 performance