Hi, I am analysing my gene expression data with DESeq2 (v1.16.1).
I have 12 samples in each of the two groups I am comparing.
Some genes have been set to 0/NA by independent filtering, however my top 2 most differentially expressed genes have read counts of 0 in every sample except one (where the raw read count is 6 in each case).
I don't understand why these have not been filtered out. Please could you advise?
Complete code and session info below:
> rm(list=ls(all=TRUE)) > library(DESeq2) > directory<-'~/Google Drive/R_projects/doxy_deseq2/T1_A0vA1' > sampleFiles <- grep("R-",list.files(directory),value=TRUE) > sampleCondition <- gsub("R.*","",sampleFiles) > sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition) > head(sampleTable) sampleName fileName condition 1 A_0_T1_R-1-0303.txt A_0_T1_R-1-0303.txt A_0_T1_ 2 A_0_T1_R-1-0349.txt A_0_T1_R-1-0349.txt A_0_T1_ 3 A_0_T1_R-1-0376.txt A_0_T1_R-1-0376.txt A_0_T1_ 4 A_0_T1_R-1-0391.txt A_0_T1_R-1-0391.txt A_0_T1_ 5 A_0_T1_R-1-0456.txt A_0_T1_R-1-0456.txt A_0_T1_ 6 A_0_T1_R-1-0464.txt A_0_T1_R-1-0464.txt A_0_T1_ > ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design= ~ condition) > dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ] > dds$condition <- relevel(dds$condition, ref="A_0_T1_") > dds <- DESeq(dds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing -- replacing outliers and refitting for 543 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing > res <- results(dds) > resOrdered <- res[order(res$padj),] > resOrdered log2 fold change (MLE): condition A 1 T1 vs A 0 T1 Wald test p-value: condition A 1 T1 vs A 0 T1 DataFrame with 26689 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000224721 1.024156 17.9432461 2.6726772 6.713585 1.898996e-11 2.523576e-07 ENSG00000257986 1.024156 17.9432461 2.6726772 6.713585 1.898996e-11 2.523576e-07 ENSG00000196159 4.798811 5.8913645 1.1092384 5.311180 1.089179e-07 9.649402e-04 ENSG00000179820 991.787350 1.2533887 0.2704511 4.634437 3.579100e-06 2.378133e-02 ENSG00000231389 379.860911 -0.7549993 0.1720601 -4.387998 1.143988e-05 6.080983e-02 ... ... ... ... ... ... ... ENSG00000278153 0 0 0 0 1 NA ENSG00000278642 0 0 0 0 1 NA ENSG00000278840 0 0 0 0 1 NA ENSG00000279047 0 0 0 0 1 NA ENSG00000280365 0 0 0 0 1 NA > plotCounts(dds, gene=which.min(res$padj), intgroup="condition") > plotCounts(dds, gene="ENSG00000257986", intgroup="condition") > use <- res$baseMean > metadata(res)$filterThreshold > h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE) > h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE) > colori <- c(`do not pass`="khaki", `pass`="powderblue") > barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, + col = colori, space = 0, main = "", ylab="frequency") > text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), + adj = c(0.5,1.7), xpd=NA) > legend("topleft", fill=rev(colori), legend=rev(names(colori))) > sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: OS X El Capitan 10.11.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods [9] base other attached packages: [1] DESeq2_1.16.1 BiocInstaller_1.26.0 SummarizedExperiment_1.6.1 [4] DelayedArray_0.2.2 matrixStats_0.52.2 Biobase_2.36.2 [7] GenomicRanges_1.28.1 GenomeInfoDb_1.12.0 IRanges_2.10.0 [10] S4Vectors_0.14.0 BiocGenerics_0.22.0 loaded via a namespace (and not attached): [1] genefilter_1.58.1 locfit_1.5-9.1 splines_3.4.0 [4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 [7] base64enc_0.1-3 survival_2.41-3 XML_3.98-1.7 [10] foreign_0.8-68 DBI_0.6-1 BiocParallel_1.10.1 [13] RColorBrewer_1.1-2 GenomeInfoDbData_0.99.0 plyr_1.8.4 [16] stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3 [19] gtable_0.2.0 htmlwidgets_0.8 memoise_1.1.0 [22] latticeExtra_0.6-28 knitr_1.15.1 geneplotter_1.54.0 [25] AnnotationDbi_1.38.0 htmlTable_1.9 Rcpp_0.12.10 [28] acepack_1.4.1 xtable_1.8-2 scales_0.4.1 [31] backports_1.0.5 checkmate_1.8.2 Hmisc_4.0-3 [34] annotate_1.54.0 XVector_0.16.0 gridExtra_2.2.1 [37] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5 [40] grid_3.4.0 tools_3.4.0 bitops_1.0-6 [43] magrittr_1.5 RSQLite_1.1-2 lazyeval_0.2.0 [46] RCurl_1.95-4.8 tibble_1.3.0 Formula_1.2-1 [49] cluster_2.0.6 Matrix_1.2-10 data.table_1.10.4 [52] rpart_4.1-11 nnet_7.3-12 compiler_3.4.0 > as.data.frame(colData(dds)) condition sizeFactor replaceable A_0_T1_R-1-0303.txt A_0_T1_ 0.4740729 TRUE A_0_T1_R-1-0349.txt A_0_T1_ 1.0197968 TRUE A_0_T1_R-1-0376.txt A_0_T1_ 1.4085827 TRUE A_0_T1_R-1-0391.txt A_0_T1_ 1.3654126 TRUE A_0_T1_R-1-0456.txt A_0_T1_ 1.2789439 TRUE A_0_T1_R-1-0464.txt A_0_T1_ 1.4638590 TRUE A_0_T1_R-1-0538.txt A_0_T1_ 1.3184049 TRUE A_0_T1_R-1-0567.txt A_0_T1_ 1.3049172 TRUE A_0_T1_R-1-0715.txt A_0_T1_ 0.7848035 TRUE A_0_T1_R-1-0818.txt A_0_T1_ 0.8989433 TRUE A_0_T1_R-1-0877.txt A_0_T1_ 1.7172302 TRUE A_0_T1_R-1-0951.txt A_0_T1_ 1.3992738 TRUE A_1_T1_R-1-0088.txt A_1_T1_ 0.2441034 TRUE A_1_T1_R-1-0121.txt A_1_T1_ 1.4030008 TRUE A_1_T1_R-1-0156.txt A_1_T1_ 1.1589636 TRUE A_1_T1_R-1-0162.txt A_1_T1_ 1.3313694 TRUE A_1_T1_R-1-0179.txt A_1_T1_ 0.5823990 TRUE A_1_T1_R-1-0189.txt A_1_T1_ 0.9231308 TRUE A_1_T1_R-1-0193.txt A_1_T1_ 0.8648280 TRUE A_1_T1_R-1-0203.txt A_1_T1_ 1.1520682 TRUE A_1_T1_R-1-0329.txt A_1_T1_ 0.9307242 TRUE A_1_T1_R-1-0399.txt A_1_T1_ 1.1902978 TRUE A_1_T1_R-1-0448.txt A_1_T1_ 0.9559443 TRUE A_1_T1_R-1-0451.txt A_1_T1_ 1.3852975 TRUE
Links to plots:
Histogram of P values:
https://www.dropbox.com/s/e8rda8qo6a0dlnz/T1_A0vA1.jpeg?dl=0
Plot counts for top 2 DE genes:
https://www.dropbox.com/s/t4ra5ry0c7lt9xu/T1_A0vA1%20721%20plot.jpeg?dl=0 https://www.dropbox.com/s/wfew64ydvtgn5tt/T1_A0vA1%20986%20plot.jpeg?dl=0
Thank you!
Dear Michael,
Thanks very much for your quick response.
Before analysing these data myself in R I used an online platform to perform the same analysis, which uses DESeq2 behind the scenes. The results from the online platform were quite different and the two top hits from my analysis were set to 0/NA.
I got the script from the online platform and the commands were the same. Hence why I was confused that in the same dataset, with the same commands, these genes were filtered out in one analysis but not the other.
Would different versions of DESeq2 result in these differences?
If so is it possible to install an older version of DESeq2 that might be more stringent on letting these genes through?
Thanks again
Possibly. I would just recommend using the latest version of DESeq2 (1.16) and using the filtering step I recommended below.