Trouble with DESeq2, edgeR, and limma not accepting zeros in input
2
0
Entering edit mode
Kaustav • 0
@kaustav-13212
Last seen 5.3 years ago
NY

Hi

I'm trying to find differentially expressed genes from an RNA-Seq dataset with 2 conditions and 3 replicates each (6 samples). I used kallisto to generate counts and then tximport to generate a DESeqDataSet (dds).

> sampleTableEB <- data.frame(condition = factor(rep(c("A", "B"), each = 3)))
> rownames(sampleTableEB) <- colnames(txi.kallisto.tsv$counts)

> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition)
using counts and average transcript lengths from tximport
> dds <- dds[ rowSums(counts(dds)) > 1, ]

In my understanding, the last filtering step should get rid of the rows that have '0' for every sample. Yet when I do:

> dds <- DESeq(dds)
estimating size factors

I get an error:
Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc,  :
  every gene contains at least one zero, cannot compute log geometric means
> traceback()
7: stop("every gene contains at least one zero, cannot compute log geometric means")
6: estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc,
       geoMeans = geoMeans, controlGenes = controlGenes)
5: estimateNormFactors(counts(object), normMatrix = nm, locfunc = locfunc,
       geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds)

So I go to my friend google and there's some posts on Bioconductor support regarding this. Michael Love suggests:

> estimateSizeFactors(dds, type = "iterate", geoMeans = geoMeans)
Error in estimateSizeFactorsIterate(object) :
  iterative size factor normalization did not converge

I also tried:

> dds <- dds[ rowSums(counts(dds)) > 5, ]
> cts <- counts(dds)
 

> geoMeans <- apply(exp(sum(log(row[row != 0]))/length(row)))
Error in match.fun(FUN) : argument "FUN" is missing, with no default

rs <- rowSums( counts(dds) == 0 )
> table(rs)
rs
3
1

For the original dds (without removing zeros):

> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition)
using counts and average transcript lengths from tximport
> rs <- rowSums( counts(dds) == 0 )
> table(rs)
rs
3 4 5 6
1 1 1 1

Then I tried:

dds <- dds[ rowSums(counts(dds)) > 5, ]
> cts <- counts(dds)
> geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
> dds <- DESeq(dds)
estimating size factors
Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc,  :
  every gene contains at least one zero, cannot compute log geometric means
> dds <- estimateSizeFactors(dds, geoMeans=geoMeans)
Error: all(!is.na(value)) is not TRUE
> sizeFactors(dds) <- calcNormFactors(counts(dds))
Error in quantile.default(x, p = p) :
  missing values and NaN's not allowed if 'na.rm' is FALSE

Finally tried edgeR and limma:

> cts <- txi.kallisto.tsv$counts
> normMat <- txi.kallisto.tsv$length
> normMat <- normMat/exp(rowMeans(log(normMat)))
> library(edgeR)
> o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
Error in quantile.default(x, p = p) :
  missing values and NaN's not allowed if 'na.rm' is FALSE
> traceback()
8: stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
7: quantile.default(x, p = p)
6: quantile(x, p = p)
5: FUN(newX[, i], ...)
4: apply(y, 2, function(x) quantile(x, p = p))
3: .calcFactorQuantile(data = x, lib.size = lib.size, p = 0.75)
2: calcNormFactors.default(cts/normMat)
1: calcNormFactors(cts/normMat)

 txilim <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
reading in files
1 2 3 4 5 6
transcripts missing genes: 465071
summarizing abundance
summarizing counts
summarizing length
> library(limma)
> y <- DGEList(txilim$counts)
Error in if (any(lib.size == 0L)) warning("Zero library size detected.") :
  missing value where TRUE/FALSE needed

> traceback()
1: DGEList(txilim$counts)

As you can see it's been a frustrating day. Nothing has worked. I used DESeq before with zeros in samples and it never gave me this error. Apparently edgeR is supposed to be able to handle that, but it hasn't worked for me. I would really appreciate some help. Also I'm not really good with R, so if you want me to show you the contents of a file or object, you may have to provide the R code for that. Thanks, and hope you've had a better day!

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)

Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] edgeR_3.14.0                            
 [2] limma_3.28.4                            
 [3] DESeq2_1.12.2                           
 [4] SummarizedExperiment_1.2.3              
 [5] org.Hs.eg.db_3.3.0                      
 [6] rhdf5_2.16.0                            
 [7] readr_0.2.2                             
 [8] tximport_1.0.2                          
 [9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[10] GenomicFeatures_1.26.0                  
[11] AnnotationDbi_1.36.0                    
[12] Biobase_2.34.0                          
[13] GenomicRanges_1.26.1                    
[14] GenomeInfoDb_1.10.0                     
[15] IRanges_2.8.0                           
[16] S4Vectors_0.12.0                        
[17] BiocGenerics_0.20.0                     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9.4           locfit_1.5-9.1          lattice_0.20-35        
 [4] Rsamtools_1.26.1        Biostrings_2.40.2       assertthat_0.1         
 [7] digest_0.6.12           plyr_1.8.4              backports_1.0.5        
[10] acepack_1.4.1           RSQLite_1.0.0           DESeq_1.24.0           
[13] ggplot2_2.2.1           zlibbioc_1.20.0         lazyeval_0.2.0         
[16] data.table_1.10.4       annotate_1.50.0         rpart_4.1-10           
[19] Matrix_1.2-9            checkmate_1.8.2         splines_3.4.0          
[22] BiocParallel_1.11.2     geneplotter_1.50.0      stringr_1.2.0          
[25] foreign_0.8-67          htmlwidgets_0.8         RCurl_1.95-4.8         
[28] biomaRt_2.30.0          munsell_0.4.3           compiler_3.4.0         
[31] rtracklayer_1.34.2      base64enc_0.1-3         htmltools_0.3.5        
[34] nnet_7.3-12             tibble_1.2              gridExtra_2.2.1        
[37] htmlTable_1.9           Hmisc_4.0-2             XML_3.98-1.4           
[40] GenomicAlignments_1.8.4 bitops_1.0-6            grid_3.4.0             
[43] xtable_1.8-2            gtable_0.2.0            DBI_0.5-1              
[46] magrittr_1.5            scales_0.4.1            stringi_1.1.2          
[49] XVector_0.12.1          genefilter_1.54.2       latticeExtra_0.6-28    
[52] Formula_1.2-1           RColorBrewer_1.1-2      tools_3.4.0            
[55] survival_2.40-1         colorspace_1.2-6        cluster_2.0.6          
[58] knitr_1.15.1

 

limma deseq2 edgeR • 3.8k views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

None of these software programs have any difficulty with zeros, so that is not the problem. There is one exception: you cannot reasonably have a entire sample with nothing in it at all, i.e., a column that is entirely zero. However even that would not lead to the errors you are seeing.

Rows (genes) that are all zero are common and are no problem.

The problem appears to be with the data you are getting from kallisto. Have you made any attempt to look at or check your data? For example:

colSums(txi.kallisto.tsv$counts)
ADD COMMENT
0
Entering edit mode

I guess I have to look through the entire data to find 'NA's if any. I briefly looked at the kallisto results, but not all of it. The thing is, I previously used DESeq2 to analyze this very same dataset, but then I had used STAR to map the reads and htseq-count to generate counts. This time I thought I would use an alignment-independent algorithm to get counts and compare the results from both methods. Thank you for your suggestion

ADD REPLY
1
Entering edit mode

You say "have to look through the entire data" as if this is a difficult thing. Really you can just type

anyNA(cts)
ADD REPLY
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

It looks like you have NA values in your count matrix, which causes the litany of errors with the various packages. You need to found out why the missing values are being generated, and either fill them in with zeroes (though this may not be appropriate) or remove any genes that have a missing value.

ADD COMMENT
0
Entering edit mode

Thanks will check!

ADD REPLY

Login before adding your answer.

Traffic: 648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6