I'm getting strange results from DESeq2 results. I'm comparing 3 samples to another group of 3 samples, which I know is less than ideal
counts<-read.csv("raw_180803.csv", header = TRUE, row.names = 1, check.names = FALSE)
anno <- read.csv("anno_180803.csv", header = TRUE, row.names = 1)
anno$donor <- factor(anno$donor)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = anno, design =~ 1)
dds <- dds[rowSums(counts(dds)) > 1,]
design(dds) <- ~ group + donor
dds <-DESeq(dds)
res<-results(dds, contrast = c("group", "hIgG1", "unstim_none"))
summary(res)
out of 20305 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 1, 0.0049%
LFC < 0 (down) : 0, 0%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
res<-results(dds, contrast = c("group", "hIgG4", "unstim_none"))
summary(res)
out of 20305 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 26, 0.13%
LFC < 0 (down) : 10, 0.049%
outliers [1] : 0, 0%
low counts [2] : 12991, 64%
(mean count < 124)
But the counts for IgG4 and IgG1 just don't look that different across the board. Why is the mean count threshold set so differently between the two contrasts? (The mean of normalized counts for the hIgG1 samples ranges between 650-790).
Here are normalized counts for one gene, but every comparison I do, except for one, shows NA for the adjusted p-value
59.04707 | 79.40045 | 37.52253 | 63.28349 | 82.1004 | 42.47123 | 53.48357 | 68.60829 | 36.32574 | 71.87155 | 73.46171 | 46.15779 | 56.68832 | 60.98544 |
39.50018 |
I thought that NA's meant either outliers or too few reads, but can that be the case with these normalized counts?
Should I stop worrying and just believe my data? What should I look at to figure out why these two contrasts are so different from each other?
EDIT:
I can force the software to filter based on normalized means at the level I choose (I chose 180 here) instead of letting the software pick for me
res<-results(dds, contrast = c("group", "hIgG1", "unstim_none"), filter = dds@rowRanges@elementMetadata$baseMean > 180)
summary(res)
out of 20305 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2, 0.0098%
LFC < 0 (down) : 1, 0.0049%
outliers [1] : 0, 0%
low counts [2] : 13766, 68%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
I guess the issue is that there are almost no significant differences in this particular comparison, (while there are big donor differences), while for the other comparisons, dropping large number of genes which have decent expression is improving the stats for the best ones.
Is the default independent filtering in the results command maybe not doing that great a job, and I should be setting it myself? It seems plausible that there might be good DE genes with normalized means of about 100 or lower, but the default filtering settings are getting rid of these in most comparisons.
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)
Matrix products: default
BLAS: /.automount/tools/GNU/R/3.5.0/lib64/R/lib/libRblas.so
LAPACK: /.automount/tools/GNU/R/3.5.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Hs.eg.db_3.6.0 AnnotationDbi_1.42.1
[3] DESeq2_1.20.0 SummarizedExperiment_1.10.1
[5] DelayedArray_0.6.2 BiocParallel_1.14.2
[7] matrixStats_0.54.0 Biobase_2.40.0
[9] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
[11] IRanges_2.14.10 S4Vectors_0.18.3
[13] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.5.0 Formula_1.2-3
[4] assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1
[7] GenomeInfoDbData_1.1.0 pillar_1.3.0 RSQLite_2.1.1
[10] backports_1.1.2 lattice_0.20-35 glue_1.3.0
[13] digest_0.6.15 RColorBrewer_1.1-2 XVector_0.20.0
[16] checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6
[19] Matrix_1.2-14 plyr_1.8.4 XML_3.98-1.13
[22] pkgconfig_2.0.1 genefilter_1.62.0 zlibbioc_1.26.0
[25] purrr_0.2.5 xtable_1.8-2 scales_0.5.0
[28] htmlTable_1.12 tibble_1.4.2 annotate_1.58.0
[31] ggplot2_3.0.0 nnet_7.3-12 lazyeval_0.2.1
[34] survival_2.42-6 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 foreign_0.8-71 tools_3.5.0
[40] data.table_1.11.4 stringr_1.3.1 locfit_1.5-9.1
[43] munsell_0.5.0 cluster_2.0.7-1 bindrcpp_0.2.2
[46] compiler_3.5.0 rlang_0.2.1 grid_3.5.0
[49] RCurl_1.95-4.11 rstudioapi_0.7 htmlwidgets_1.2
[52] bitops_1.0-6 base64enc_0.1-3 gtable_0.2.0
[55] DBI_1.0.0 R6_2.2.2 gridExtra_2.3
[58] knitr_1.20 dplyr_0.7.6 bit_1.1-14
[61] bindr_0.1.1 Hmisc_4.1-1 stringi_1.2.4
[64] Rcpp_0.12.18 geneplotter_1.58.0 rpart_4.1-13
[67] acepack_1.4.1 tidyselect_0.2.4
Thanks, I tried some things, and added a follow-up in my question
There may be genes that you’re missing with low count, and you could turn off independent filtering to try to capture these, but you’ll end up with less differentially expresses genes in total (because that’s how the independent filter works, see the diagram in vignette).
Yeah, I realized that after I wrote it. Thanks again.