Question

DESeq2::DESeq(..., parallel = TRUE) fails if a partition has only genes with 0 counts across all samples

0

Entering edit mode

Steffi Grote ▴ 10

@f67b142b

Last seen 15 months ago

Germany, Göttingen

Hi,

using DESeq2::DESeq(..., parallel = TRUE) with a specific dataset, where genes (rows) were sorted by abundance, I have observed the following error:

Error in (function (cond) :
error in evaluating the argument 'args' in selecting a method for function 'do.call': BiocParallel errors element index: 3
first error: error in evaluating the argument 'x' in selecting a method for function 't': no right-hand side in 'b'

I can reproduce the error with a modified example dataset as used in DESeq2's vignette:

# 1) Read example dataset from pasilla

library("pasilla")
pasCts <- system.file("extdata",
  "pasilla_gene_counts.tsv",
  package = "pasilla", mustWork = TRUE
)
pasAnno <- system.file("extdata",
  "pasilla_sample_annotation.csv",
  package = "pasilla", mustWork = TRUE
)
cts <- as.matrix(read.csv(pasCts, sep = "\t", row.names = "gene_id"))

# there are already quite some genes which have 0 counts across all samples
table(rowSums(cts) == 0)
# FALSE  TRUE 
# 12359  2240 

coldata <- read.csv(pasAnno, row.names = 1)
coldata <- coldata[, c("condition", "type")]
coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)


# 2) Modify the count matrix to trigger the error

# set the last 2/5 of genes to 0 counts for all samples
i <- cut(seq_len(nrow(cts)), 5, labels = FALSE)
cts[i %in% c(4,5), ] <- 0


# 3) Run DESeq with different numbers of cores

library("DESeq2")
dds <- DESeqDataSetFromMatrix(
  countData = cts,
  colData = coldata,
  design = ~condition
)

# using 3 cores or more triggers the error
BiocParallel::register(BiocParallel::MulticoreParam(3))
res <- DESeq(dds, parallel = TRUE)

# using less than 3 cores works just fine
BiocParallel::register(BiocParallel::MulticoreParam(2))
res <- DESeq(dds, parallel = TRUE)
res <- DESeq(dds, parallel = FALSE)


# 4) It looks like blocks where all genes have 0 counts make the difference:

counts_per_gene <- rowSums(cts)

# if genes are divided into 3 blocks, the last block would have 0-count genes only:
i <- cut(seq_along(counts_per_gene), 3, labels = FALSE)
mean_counts_per_partition <- sapply(unique(i), function(j) {
  mean(counts_per_gene[i == j])  
})
mean_counts_per_partition
# [1] 11378.01  3433.68     0.00

# if divided into 2 blocks, none would have 0-count genes only:
i <- cut(seq_along(counts_per_gene), 2, labels = FALSE)
mean_counts_per_partition <- sapply(unique(i), function(j) {
  mean(counts_per_gene[i == j])  
})
mean_counts_per_partition
# [1] 9123.2308  751.5357

While this problem can easily be tackled by using parallel = FALSE or by removing rows with 0 counts beforehand, it might be nicer if such situation would be caught inside DESeq2::DESeq.

Thanks in advance,
Steffi

Here is my session Info:

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /software/R-v4.0.4/lib/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1       
 [5] matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.4         IRanges_2.24.1             
 [9] S4Vectors_0.28.1            BiocGenerics_0.36.0         pasilla_1.18.1             

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             locfit_1.5-9.4         lattice_0.20-41        assertthat_0.2.1       utf8_1.2.1            
 [6] R6_2.5.0               RSQLite_2.2.4          evoBase_0.1.0          httr_1.4.2             ggplot2_3.3.3         
[11] pillar_1.5.1           zlibbioc_1.36.0        rlang_0.4.10           annotate_1.68.0        blob_1.2.1            
[16] shinyEvotec_1.0        Matrix_1.3-2           splines_4.0.4          BiocParallel_1.24.1    geneplotter_1.68.0    
[21] RCurl_1.98-1.3         bit_4.0.4              munsell_0.5.0          DelayedArray_0.16.2    compiler_4.0.4        
[26] pkgconfig_2.0.3        tidyselect_1.1.0       tibble_3.1.0           GenomeInfoDbData_1.2.4 XML_3.99-0.6          
[31] fansi_0.4.2            crayon_1.4.1           dplyr_1.0.5            bitops_1.0-6           grid_4.0.4            
[36] xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.0        DBI_1.1.1              magrittr_2.0.1        
[41] scales_1.1.1           debugme_1.1.0          cachem_1.0.4           XVector_0.30.0         genefilter_1.72.1     
[46] ellipsis_0.3.1         vctrs_0.3.6            generics_0.1.0         fst_0.9.4              RColorBrewer_1.1-2    
[51] tools_4.0.4            bit64_4.0.5            glue_1.4.2             purrr_0.3.4            fastmap_1.1.0         
[56] survival_3.2-10        yaml_2.2.1             AnnotationDbi_1.52.0   colorspace_2.0-0       memoise_2.0.0

DESeq2 BiocParallel • 2.3k views

ADD COMMENT • link 3.7 years ago Steffi Grote ▴ 10

score 0 · Answer 1 · 2021-08-05

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 days ago

United States

Thanks for the note. I'll add this to the documentation now. It will be much easier to handle this just by filtering out all 0's genes from the start. Otherwise it's a waste of memory: sending a large matrix of 0's to a child process though this will result only in NAs.

ADD COMMENT • link 3.7 years ago Michael Love 43k

0

Entering edit mode

Thanks for the quick answer!

ADD REPLY • link 3.7 years ago Steffi Grote ▴ 10

0

Entering edit mode

I took some time to also update the section in the vignette on parallelization:

https://github.com/mikelove/DESeq2/commit/e414a2fba3db15b77db2f590ad189ae24ff9af42