Question

Trouble with DESeq2, edgeR, and limma not accepting zeros in input

0

Entering edit mode

Kaustav • 0

@kaustav-13212

Last seen 5.7 years ago

NY

Hi

I'm trying to find differentially expressed genes from an RNA-Seq dataset with 2 conditions and 3 replicates each (6 samples). I used kallisto to generate counts and then tximport to generate a DESeqDataSet (dds).

> sampleTableEB <- data.frame(condition = factor(rep(c("A", "B"), each = 3))) > rownames(sampleTableEB) <- colnames(txi.kallisto.tsv$counts)

> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition)
using counts and average transcript lengths from tximport
> dds <- dds[ rowSums(counts(dds)) > 1, ]

In my understanding, the last filtering step should get rid of the rows that have '0' for every sample. Yet when I do:

> dds <- DESeq(dds) estimating size factors

I get an error:
Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means > traceback() 7: stop("every gene contains at least one zero, cannot compute log geometric means") 6: estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc, geoMeans = geoMeans, controlGenes = controlGenes) 5: estimateNormFactors(counts(object), normMatrix = nm, locfunc = locfunc, geoMeans = geoMeans, controlGenes = controlGenes) 4: .local(object, ...) 3: estimateSizeFactors(object) 2: estimateSizeFactors(object) 1: DESeq(dds)

So I go to my friend google and there's some posts on Bioconductor support regarding this. Michael Love suggests:

> estimateSizeFactors(dds, type = "iterate", geoMeans = geoMeans) Error in estimateSizeFactorsIterate(object) : iterative size factor normalization did not converge

I also tried:

> dds <- dds[ rowSums(counts(dds)) > 5, ] > cts <- counts(dds)

> geoMeans <- apply(exp(sum(log(row[row != 0]))/length(row)))
Error in match.fun(FUN) : argument "FUN" is missing, with no default

rs <- rowSums( counts(dds) == 0 ) > table(rs) rs 3 1

For the original dds (without removing zeros):

> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition) using counts and average transcript lengths from tximport > rs <- rowSums( counts(dds) == 0 ) > table(rs) rs 3 4 5 6 1 1 1 1

Then I tried:

dds <- dds[ rowSums(counts(dds)) > 5, ] > cts <- counts(dds) > geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) > dds <- DESeq(dds) estimating size factors Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means > dds <- estimateSizeFactors(dds, geoMeans=geoMeans) Error: all(!is.na(value)) is not TRUE > sizeFactors(dds) <- calcNormFactors(counts(dds)) Error in quantile.default(x, p = p) : missing values and NaN's not allowed if 'na.rm' is FALSE

Finally tried edgeR and limma:

> cts <- txi.kallisto.tsv$counts > normMat <- txi.kallisto.tsv$length > normMat <- normMat/exp(rowMeans(log(normMat))) > library(edgeR) > o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat)) Error in quantile.default(x, p = p) : missing values and NaN's not allowed if 'na.rm' is FALSE > traceback() 8: stop("missing values and NaN's not allowed if 'na.rm' is FALSE") 7: quantile.default(x, p = p) 6: quantile(x, p = p) 5: FUN(newX[, i], ...) 4: apply(y, 2, function(x) quantile(x, p = p)) 3: .calcFactorQuantile(data = x, lib.size = lib.size, p = 0.75) 2: calcNormFactors.default(cts/normMat) 1: calcNormFactors(cts/normMat)

txilim <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM") reading in files 1 2 3 4 5 6 transcripts missing genes: 465071 summarizing abundance summarizing counts summarizing length > library(limma) > y <- DGEList(txilim$counts) Error in if (any(lib.size == 0L)) warning("Zero library size detected.") : missing value where TRUE/FALSE needed

> traceback() 1: DGEList(txilim$counts)

As you can see it's been a frustrating day. Nothing has worked. I used DESeq before with zeros in samples and it never gave me this error. Apparently edgeR is supposed to be able to handle that, but it hasn't worked for me. I would really appreciate some help. Also I'm not really good with R, so if you want me to show you the contents of a file or object, you may have to provide the R code for that. Thanks, and hope you've had a better day!

> sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS release 6.2 (Final)

Matrix products: default BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so

locale: [1] C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] edgeR_3.14.0 [2] limma_3.28.4 [3] DESeq2_1.12.2 [4] SummarizedExperiment_1.2.3 [5] org.Hs.eg.db_3.3.0 [6] rhdf5_2.16.0 [7] readr_0.2.2 [8] tximport_1.0.2 [9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 [10] GenomicFeatures_1.26.0 [11] AnnotationDbi_1.36.0 [12] Biobase_2.34.0 [13] GenomicRanges_1.26.1 [14] GenomeInfoDb_1.10.0 [15] IRanges_2.8.0 [16] S4Vectors_0.12.0 [17] BiocGenerics_0.20.0

loaded via a namespace (and not attached): [1] Rcpp_0.12.9.4 locfit_1.5-9.1 lattice_0.20-35 [4] Rsamtools_1.26.1 Biostrings_2.40.2 assertthat_0.1 [7] digest_0.6.12 plyr_1.8.4 backports_1.0.5 [10] acepack_1.4.1 RSQLite_1.0.0 DESeq_1.24.0 [13] ggplot2_2.2.1 zlibbioc_1.20.0 lazyeval_0.2.0 [16] data.table_1.10.4 annotate_1.50.0 rpart_4.1-10 [19] Matrix_1.2-9 checkmate_1.8.2 splines_3.4.0 [22] BiocParallel_1.11.2 geneplotter_1.50.0 stringr_1.2.0 [25] foreign_0.8-67 htmlwidgets_0.8 RCurl_1.95-4.8 [28] biomaRt_2.30.0 munsell_0.4.3 compiler_3.4.0 [31] rtracklayer_1.34.2 base64enc_0.1-3 htmltools_0.3.5 [34] nnet_7.3-12 tibble_1.2 gridExtra_2.2.1 [37] htmlTable_1.9 Hmisc_4.0-2 XML_3.98-1.4 [40] GenomicAlignments_1.8.4 bitops_1.0-6 grid_3.4.0 [43] xtable_1.8-2 gtable_0.2.0 DBI_0.5-1 [46] magrittr_1.5 scales_0.4.1 stringi_1.1.2 [49] XVector_0.12.1 genefilter_1.54.2 latticeExtra_0.6-28 [52] Formula_1.2-1 RColorBrewer_1.1-2 tools_3.4.0 [55] survival_2.40-1 colorspace_1.2-6 cluster_2.0.6 [58] knitr_1.15.1

limma deseq2 edgeR • 4.1k views

ADD COMMENT • link updated 7.9 years ago by Gordon Smyth 52k • written 7.9 years ago by Kaustav • 0

score 1 · Answer 1 · 2017-06-09

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 6 hours ago

WEHI, Melbourne, Australia

None of these software programs have any difficulty with zeros, so that is not the problem. There is one exception: you cannot reasonably have a entire sample with nothing in it at all, i.e., a column that is entirely zero. However even that would not lead to the errors you are seeing.

Rows (genes) that are all zero are common and are no problem.

The problem appears to be with the data you are getting from kallisto. Have you made any attempt to look at or check your data? For example:

colSums(txi.kallisto.tsv$counts)

ADD COMMENT • link 7.9 years ago Gordon Smyth 52k

0

Entering edit mode

I guess I have to look through the entire data to find 'NA's if any. I briefly looked at the kallisto results, but not all of it. The thing is, I previously used DESeq2 to analyze this very same dataset, but then I had used STAR to map the reads and htseq-count to generate counts. This time I thought I would use an alignment-independent algorithm to get counts and compare the results from both methods. Thank you for your suggestion

ADD REPLY • link 7.9 years ago Kaustav • 0

1

Entering edit mode

You say "have to look through the entire data" as if this is a difficult thing. Really you can just type

anyNA(cts)

ADD REPLY • link 7.9 years ago Gordon Smyth 52k

score 0 · Answer 2 · 2017-06-09

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 21 minutes ago

The city by the bay

It looks like you have NA values in your count matrix, which causes the litany of errors with the various packages. You need to found out why the missing values are being generated, and either fill them in with zeroes (though this may not be appropriate) or remove any genes that have a missing value.