Hi
I'm trying to find differentially expressed genes from an RNA-Seq dataset with 2 conditions and 3 replicates each (6 samples). I used kallisto to generate counts and then tximport to generate a DESeqDataSet (dds).
> sampleTableEB <- data.frame(condition = factor(rep(c("A", "B"), each = 3)))
> rownames(sampleTableEB) <- colnames(txi.kallisto.tsv$counts)
> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition)
using counts and average transcript lengths from tximport
> dds <- dds[ rowSums(counts(dds)) > 1, ]
In my understanding, the last filtering step should get rid of the rows that have '0' for every sample. Yet when I do:
> dds <- DESeq(dds)
estimating size factors
I get an error:
Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
> traceback()
7: stop("every gene contains at least one zero, cannot compute log geometric means")
6: estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc,
geoMeans = geoMeans, controlGenes = controlGenes)
5: estimateNormFactors(counts(object), normMatrix = nm, locfunc = locfunc,
geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds)
So I go to my friend google and there's some posts on Bioconductor support regarding this. Michael Love suggests:
> estimateSizeFactors(dds, type = "iterate", geoMeans = geoMeans)
Error in estimateSizeFactorsIterate(object) :
iterative size factor normalization did not converge
I also tried:
> dds <- dds[ rowSums(counts(dds)) > 5, ]
> cts <- counts(dds)
> geoMeans <- apply(exp(sum(log(row[row != 0]))/length(row)))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
rs <- rowSums( counts(dds) == 0 )
> table(rs)
rs
3
1
For the original dds (without removing zeros):
> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTableEB, ~condition)
using counts and average transcript lengths from tximport
> rs <- rowSums( counts(dds) == 0 )
> table(rs)
rs
3 4 5 6
1 1 1 1
Then I tried:
dds <- dds[ rowSums(counts(dds)) > 5, ]
> cts <- counts(dds)
> geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
> dds <- DESeq(dds)
estimating size factors
Error in estimateSizeFactorsForMatrix(counts/normMatrix, locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
> dds <- estimateSizeFactors(dds, geoMeans=geoMeans)
Error: all(!is.na(value)) is not TRUE
> sizeFactors(dds) <- calcNormFactors(counts(dds))
Error in quantile.default(x, p = p) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Finally tried edgeR and limma:
> cts <- txi.kallisto.tsv$counts
> normMat <- txi.kallisto.tsv$length
> normMat <- normMat/exp(rowMeans(log(normMat)))
> library(edgeR)
> o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
Error in quantile.default(x, p = p) :
missing values and NaN's not allowed if 'na.rm' is FALSE
> traceback()
8: stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
7: quantile.default(x, p = p)
6: quantile(x, p = p)
5: FUN(newX[, i], ...)
4: apply(y, 2, function(x) quantile(x, p = p))
3: .calcFactorQuantile(data = x, lib.size = lib.size, p = 0.75)
2: calcNormFactors.default(cts/normMat)
1: calcNormFactors(cts/normMat)
txilim <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
reading in files
1 2 3 4 5 6
transcripts missing genes: 465071
summarizing abundance
summarizing counts
summarizing length
> library(limma)
> y <- DGEList(txilim$counts)
Error in if (any(lib.size == 0L)) warning("Zero library size detected.") :
missing value where TRUE/FALSE needed
> traceback()
1: DGEList(txilim$counts)
As you can see it's been a frustrating day. Nothing has worked. I used DESeq before with zeros in samples and it never gave me this error. Apparently edgeR is supposed to be able to handle that, but it hasn't worked for me. I would really appreciate some help. Also I'm not really good with R, so if you want me to show you the contents of a file or object, you may have to provide the R code for that. Thanks, and hope you've had a better day!
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] edgeR_3.14.0
[2] limma_3.28.4
[3] DESeq2_1.12.2
[4] SummarizedExperiment_1.2.3
[5] org.Hs.eg.db_3.3.0
[6] rhdf5_2.16.0
[7] readr_0.2.2
[8] tximport_1.0.2
[9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[10] GenomicFeatures_1.26.0
[11] AnnotationDbi_1.36.0
[12] Biobase_2.34.0
[13] GenomicRanges_1.26.1
[14] GenomeInfoDb_1.10.0
[15] IRanges_2.8.0
[16] S4Vectors_0.12.0
[17] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9.4 locfit_1.5-9.1 lattice_0.20-35
[4] Rsamtools_1.26.1 Biostrings_2.40.2 assertthat_0.1
[7] digest_0.6.12 plyr_1.8.4 backports_1.0.5
[10] acepack_1.4.1 RSQLite_1.0.0 DESeq_1.24.0
[13] ggplot2_2.2.1 zlibbioc_1.20.0 lazyeval_0.2.0
[16] data.table_1.10.4 annotate_1.50.0 rpart_4.1-10
[19] Matrix_1.2-9 checkmate_1.8.2 splines_3.4.0
[22] BiocParallel_1.11.2 geneplotter_1.50.0 stringr_1.2.0
[25] foreign_0.8-67 htmlwidgets_0.8 RCurl_1.95-4.8
[28] biomaRt_2.30.0 munsell_0.4.3 compiler_3.4.0
[31] rtracklayer_1.34.2 base64enc_0.1-3 htmltools_0.3.5
[34] nnet_7.3-12 tibble_1.2 gridExtra_2.2.1
[37] htmlTable_1.9 Hmisc_4.0-2 XML_3.98-1.4
[40] GenomicAlignments_1.8.4 bitops_1.0-6 grid_3.4.0
[43] xtable_1.8-2 gtable_0.2.0 DBI_0.5-1
[46] magrittr_1.5 scales_0.4.1 stringi_1.1.2
[49] XVector_0.12.1 genefilter_1.54.2 latticeExtra_0.6-28
[52] Formula_1.2-1 RColorBrewer_1.1-2 tools_3.4.0
[55] survival_2.40-1 colorspace_1.2-6 cluster_2.0.6
[58] knitr_1.15.1
I guess I have to look through the entire data to find 'NA's if any. I briefly looked at the kallisto results, but not all of it. The thing is, I previously used DESeq2 to analyze this very same dataset, but then I had used STAR to map the reads and htseq-count to generate counts. This time I thought I would use an alignment-independent algorithm to get counts and compare the results from both methods. Thank you for your suggestion
You say "have to look through the entire data" as if this is a difficult thing. Really you can just type