Question

DESeq2 independent filtering not working

0

Entering edit mode

tamsyn.derrick • 0

@tamsynderrick-13020

Last seen 6.0 years ago

Hi, I am analysing my gene expression data with DESeq2 (v1.16.1).

I have 12 samples in each of the two groups I am comparing.

Some genes have been set to 0/NA by independent filtering, however my top 2 most differentially expressed genes have read counts of 0 in every sample except one (where the raw read count is 6 in each case).

I don't understand why these have not been filtered out. Please could you advise?

Complete code and session info below:

> rm(list=ls(all=TRUE))
> library(DESeq2)
> directory<-'~/Google Drive/R_projects/doxy_deseq2/T1_A0vA1'
> sampleFiles <- grep("R-",list.files(directory),value=TRUE)
> sampleCondition <- gsub("R.*","",sampleFiles)
> sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition)
> head(sampleTable)
           sampleName            fileName condition
1 A_0_T1_R-1-0303.txt A_0_T1_R-1-0303.txt   A_0_T1_
2 A_0_T1_R-1-0349.txt A_0_T1_R-1-0349.txt   A_0_T1_
3 A_0_T1_R-1-0376.txt A_0_T1_R-1-0376.txt   A_0_T1_
4 A_0_T1_R-1-0391.txt A_0_T1_R-1-0391.txt   A_0_T1_
5 A_0_T1_R-1-0456.txt A_0_T1_R-1-0456.txt   A_0_T1_
6 A_0_T1_R-1-0464.txt A_0_T1_R-1-0464.txt   A_0_T1_
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design= ~ condition)
> dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
> dds$condition <- relevel(dds$condition, ref="A_0_T1_")
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 543 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> res <- results(dds)
> resOrdered <- res[order(res$padj),]
> resOrdered
log2 fold change (MLE): condition A 1 T1  vs A 0 T1  
Wald test p-value: condition A 1 T1  vs A 0 T1  
DataFrame with 26689 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                 <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000224721   1.024156     17.9432461 2.6726772  6.713585 1.898996e-11 2.523576e-07
ENSG00000257986   1.024156     17.9432461 2.6726772  6.713585 1.898996e-11 2.523576e-07
ENSG00000196159   4.798811      5.8913645 1.1092384  5.311180 1.089179e-07 9.649402e-04
ENSG00000179820 991.787350      1.2533887 0.2704511  4.634437 3.579100e-06 2.378133e-02
ENSG00000231389 379.860911     -0.7549993 0.1720601 -4.387998 1.143988e-05 6.080983e-02
...                    ...            ...       ...       ...          ...          ...
ENSG00000278153          0              0         0         0            1           NA
ENSG00000278642          0              0         0         0            1           NA
ENSG00000278840          0              0         0         0            1           NA
ENSG00000279047          0              0         0         0            1           NA
ENSG00000280365          0              0         0         0            1           NA
> plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
> plotCounts(dds, gene="ENSG00000257986", intgroup="condition")
> use <- res$baseMean > metadata(res)$filterThreshold
> h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE)
> h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE)
> colori <- c(`do not pass`="khaki", `pass`="powderblue")
> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
+         col = colori, space = 0, main = "", ylab="frequency")
> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
+      adj = c(0.5,1.7), xpd=NA)
> legend("topleft", fill=rev(colori), legend=rev(names(colori)))
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] DESeq2_1.16.1              BiocInstaller_1.26.0       SummarizedExperiment_1.6.1
 [4] DelayedArray_0.2.2         matrixStats_0.52.2         Biobase_2.36.2            
 [7] GenomicRanges_1.28.1       GenomeInfoDb_1.12.0        IRanges_2.10.0            
[10] S4Vectors_0.14.0           BiocGenerics_0.22.0       

loaded via a namespace (and not attached):
 [1] genefilter_1.58.1       locfit_1.5-9.1          splines_3.4.0          
 [4] lattice_0.20-35         colorspace_1.3-2        htmltools_0.3.6        
 [7] base64enc_0.1-3         survival_2.41-3         XML_3.98-1.7           
[10] foreign_0.8-68          DBI_0.6-1               BiocParallel_1.10.1    
[13] RColorBrewer_1.1-2      GenomeInfoDbData_0.99.0 plyr_1.8.4             
[16] stringr_1.2.0           zlibbioc_1.22.0         munsell_0.4.3          
[19] gtable_0.2.0            htmlwidgets_0.8         memoise_1.1.0          
[22] latticeExtra_0.6-28     knitr_1.15.1            geneplotter_1.54.0     
[25] AnnotationDbi_1.38.0    htmlTable_1.9           Rcpp_0.12.10           
[28] acepack_1.4.1           xtable_1.8-2            scales_0.4.1           
[31] backports_1.0.5         checkmate_1.8.2         Hmisc_4.0-3            
[34] annotate_1.54.0         XVector_0.16.0          gridExtra_2.2.1        
[37] ggplot2_2.2.1           digest_0.6.12           stringi_1.1.5          
[40] grid_3.4.0              tools_3.4.0             bitops_1.0-6           
[43] magrittr_1.5            RSQLite_1.1-2           lazyeval_0.2.0         
[46] RCurl_1.95-4.8          tibble_1.3.0            Formula_1.2-1          
[49] cluster_2.0.6           Matrix_1.2-10           data.table_1.10.4      
[52] rpart_4.1-11            nnet_7.3-12             compiler_3.4.0         
> as.data.frame(colData(dds))
                    condition sizeFactor replaceable
A_0_T1_R-1-0303.txt   A_0_T1_  0.4740729        TRUE
A_0_T1_R-1-0349.txt   A_0_T1_  1.0197968        TRUE
A_0_T1_R-1-0376.txt   A_0_T1_  1.4085827        TRUE
A_0_T1_R-1-0391.txt   A_0_T1_  1.3654126        TRUE
A_0_T1_R-1-0456.txt   A_0_T1_  1.2789439        TRUE
A_0_T1_R-1-0464.txt   A_0_T1_  1.4638590        TRUE
A_0_T1_R-1-0538.txt   A_0_T1_  1.3184049        TRUE
A_0_T1_R-1-0567.txt   A_0_T1_  1.3049172        TRUE
A_0_T1_R-1-0715.txt   A_0_T1_  0.7848035        TRUE
A_0_T1_R-1-0818.txt   A_0_T1_  0.8989433        TRUE
A_0_T1_R-1-0877.txt   A_0_T1_  1.7172302        TRUE
A_0_T1_R-1-0951.txt   A_0_T1_  1.3992738        TRUE
A_1_T1_R-1-0088.txt   A_1_T1_  0.2441034        TRUE
A_1_T1_R-1-0121.txt   A_1_T1_  1.4030008        TRUE
A_1_T1_R-1-0156.txt   A_1_T1_  1.1589636        TRUE
A_1_T1_R-1-0162.txt   A_1_T1_  1.3313694        TRUE
A_1_T1_R-1-0179.txt   A_1_T1_  0.5823990        TRUE
A_1_T1_R-1-0189.txt   A_1_T1_  0.9231308        TRUE
A_1_T1_R-1-0193.txt   A_1_T1_  0.8648280        TRUE
A_1_T1_R-1-0203.txt   A_1_T1_  1.1520682        TRUE
A_1_T1_R-1-0329.txt   A_1_T1_  0.9307242        TRUE
A_1_T1_R-1-0399.txt   A_1_T1_  1.1902978        TRUE
A_1_T1_R-1-0448.txt   A_1_T1_  0.9559443        TRUE
A_1_T1_R-1-0451.txt   A_1_T1_  1.3852975        TRUE

Links to plots:

Histogram of P values:

https://www.dropbox.com/s/e8rda8qo6a0dlnz/T1_A0vA1.jpeg?dl=0

Plot counts for top 2 DE genes:

https://www.dropbox.com/s/t4ra5ry0c7lt9xu/T1_A0vA1%20721%20plot.jpeg?dl=0 https://www.dropbox.com/s/wfew64ydvtgn5tt/T1_A0vA1%20986%20plot.jpeg?dl=0

Thank you!

deseq2 independent filtering cookscutoff • 3.5k views

ADD COMMENT • link 7.5 years ago tamsyn.derrick • 0

0

Entering edit mode

Dear Michael,

Thanks very much for your quick response.

Before analysing these data myself in R I used an online platform to perform the same analysis, which uses DESeq2 behind the scenes. The results from the online platform were quite different and the two top hits from my analysis were set to 0/NA.

I got the script from the online platform and the commands were the same. Hence why I was confused that in the same dataset, with the same commands, these genes were filtered out in one analysis but not the other.

Would different versions of DESeq2 result in these differences?

If so is it possible to install an older version of DESeq2 that might be more stringent on letting these genes through?

Thanks again

ADD REPLY • link 7.5 years ago tamsyn.derrick • 0

0

Entering edit mode

Possibly. I would just recommend using the latest version of DESeq2 (1.16) and using the filtering step I recommended below.

ADD REPLY • link 7.5 years ago Michael Love 42k

score 0 · Answer 1 · 2017-05-12

hi,

The independent filtering is designed only to filter out low count genes to the extent that they are not enriched with small p-values. Here the problem is not independent filtering, but that these two genes get a small p-value rather than being filtered or having an insignificant p-value. Datasets can be different in many ways, and for whatever reason, these two genes survive the filtering and get a counterintuitive small p-value.

I'd recommend you just use a more strict filter in the very beginning, e.g. at least three samples with counts greater than 10:

keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]