In my OTU data matrix that I would like to input into DESeq2, many of the count values are zeros. I am able to create a DESeqDataSetFromMatrix object from this matrix with the following command:
dds=DESeqDataSetFromMatrix(csv, csv_meta, design=~Habitat)
But when I run the DESeq command I get the following error message:
> out1=DESeq(dds, fitType="mean")
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2? I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data.
I found in another Bioconductor post (A: DESeq2 Error in varianceStabilizingTransformation) the following workaround to calculating the log geometric means:
ts = counts(dds) geoMeans = apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) dds = estimateSizeFactors(dds, geoMeans=geoMeans)
I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts? The estimateSizeFactors object can then be input into estimateDispersions, but how do you run the final step of the the DESeq command (fitting model and testing)?
Here is the output for traceback():
6: stop("every gene contains at least one zero, cannot compute log geometric means")
5: estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,
geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds, fitType = "mean")
And the sessionInfo():
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.6.2 RcppArmadillo_0.4.500.0 Rcpp_0.11.3 ShortRead_1.24.0 GenomicAlignments_1.2.1
[6] Rsamtools_1.18.2 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 Biostrings_2.34.0 XVector_0.6.0
[11] IRanges_2.0.0 S4Vectors_0.4.0 BiocParallel_1.0.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5
[6] BBmisc_1.8 Biobase_2.26.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.0
[11] cluster_1.15.3 codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4
[16] fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1
[21] geneplotter_1.44.0 ggplot2_1.0.0 grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6
[26] hwriter_1.3.2 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1
[31] MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1 proto_0.3-10
[36] RColorBrewer_1.0-5 reshape2_1.4 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4
[41] sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
[46] XML_3.98-1.1 xtable_1.7-4 zlibbioc_1.12.0
Thanks much,
Jesse Port
It's not the dispersion estimation but the size factor calculation which requires the geometric mean, which is 0 when a single sample has a 0. You can look over the formula for size factors in the paper (both DESeq and DESeq2 papers have this formula), to see why the default size factor estimation doesn't work when each row has a zero.