DESeq2 and high prefiltering cutoff
1
0
Entering edit mode
Chris • 0
@255004b1
Last seen 2.8 years ago
United States

Hi, I am curious about prefiltering with DESeq2. I understand from this site and reading the DESeq2 vignette that prefiletering is really unnecessary as DESeq2 has a stringent filtering that it does. However, I'm seeing better results with filtering (much higher # DEGs with sig adj p-value after DESeq2).

I have a one factor design and three samples in two different groups. I do have a sample that's a bit noisier than the others but doesn't seem there is any significant grounds to remove it. With prefiltering, I find the highest number of significant DEGs when I filter out anything below 50 normalized counts for each of the six samples.

My question is, is this wrong? Is creating such a high prefiltering cutoff affecting my false positivity rate somehow? Though I love that I get triple the amount of significant DEGs, I don't want to go down the wrong path in my subsequent pathway analsyis due to a prefiltering error.

Thank you for your time.

My code

#dds <- DESeqDataSetFromMatrix(countData=countdata, colData=sampleData, design = ~ condition, tidy= FALSE)
#dds <- estimateSizeFactors(dds)
#keep <- rowSums(counts(dds, normalized=TRUE) >= 50 ) >=6
#dds <- dds[keep,]
#dds <- DESeq(dds)



sessionInfo( )
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] dplyr_1.0.8                 DESeq2_1.34.0              
 [3] SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [5] MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [7] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
 [9] IRanges_2.28.0              S4Vectors_0.32.3           
[11] BiocGenerics_0.40.0         BiocManager_1.30.16        

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.4         Rcpp_1.0.8             lattice_0.20-45       
 [4] png_0.1-7              Biostrings_2.62.0      assertthat_0.2.1      
 [7] utf8_1.2.2             R6_2.5.1               RSQLite_2.2.9         
[10] httr_1.4.2             ggplot2_3.3.5          pillar_1.7.0          
[13] zlibbioc_1.40.0        rlang_1.0.1            rstudioapi_0.13       
[16] annotate_1.72.0        blob_1.2.2             Matrix_1.3-4          
[19] splines_4.1.2          BiocParallel_1.28.3    geneplotter_1.72.0    
[22] RCurl_1.98-1.6         bit_4.0.4              munsell_0.5.0         
[25] DelayedArray_0.20.0    compiler_4.1.2         pkgconfig_2.0.3       
[28] tidyselect_1.1.2       KEGGREST_1.34.0        tibble_3.1.6          
[31] GenomeInfoDbData_1.2.7 XML_3.99-0.8           fansi_1.0.2           
[34] crayon_1.5.0           bitops_1.0-7           grid_4.1.2            
[37] xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.1       
[40] DBI_1.1.2              magrittr_2.0.2         scales_1.1.1          
[43] cli_3.1.1              cachem_1.0.6           XVector_0.34.0        
[46] genefilter_1.76.0      ellipsis_0.3.2         vctrs_0.3.8           
[49] generics_0.1.2         RColorBrewer_1.1-2     tools_4.1.2           
[52] bit64_4.0.5            glue_1.6.1             purrr_0.3.4           
[55] parallel_4.1.2         fastmap_1.1.0          survival_3.2-13       
[58] AnnotationDbi_1.56.2   colorspace_2.0-2       memoise_2.0.1         
>
DESeq2 prefiltering • 3.3k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

Pre-filtering is fine, so you can go with your current settings and report these results. Probably what is happening is that you have a lot of high dispersion low count features that are changing the dispersion trend. It's reasonable to filter in this case. In the user guide we should really make it clear that filtering is not necessary for consideration of multiple testing burden, because of independent filtering.

ADD COMMENT
0
Entering edit mode

Wonderful. Thank you so much for the speedy response! It's much appreciated.

ADD REPLY
0
Entering edit mode

Out of curiosity (since I'm quite a newbie to bioinformatics and DESeq2): In explaining my methods pipeline and DESeq2 results, is there a good way to plot as a visual the possibility of "a lot of high dispersion low count features that are changing the dispersion trend" that could give reason to pre-filtering?

If I do NOT prefilter low counts, perform DEseq, then look at a dispersion plot and a histogram of ratios of low p-value vs. mean count, it seems normal I think?histogram estdispersionplot

plotDispEsts( dds, ylim = c(1e-6, 1e1) )
qs <- c( 0, quantile( res$baseMean[res$baseMean > 0], 0:7/7 ) )
bins <- cut( res$baseMean, qs )
levels(bins) <- paste0("~",round(.5*qs[-1] + .5*qs[-length(qs)]))
ratios <- tapply( res$pvalue, bins, function(p) mean( p < .01, na.rm=TRUE ) )
barplot(ratios, xlab="mean normalized count", ylab="ratio of small $p$ values")
ADD REPLY
1
Entering edit mode

The fact that on the left side of the dispersion plot you have many features with mean of < 10 counts and their dispersion is at the limit in the y-axis (the limit is due to the maximal ratio of SD to mean for positive valued data).

Generally, I tend to remove features with single digit counts for all samples.

ADD REPLY
0
Entering edit mode

Thank you very much for explaining that. It is much appreciated.

ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6