Question

DESeq2 size factors estimation after RSEM quantification

1

Entering edit mode

Hugo Varet ▴ 80

@hugo-varet-6301

Last seen 5 weeks ago

France

Dear Mike Love,

I just upgraded DESeq2 to get the last version (1.12.0) and I have a problem when using the estimateSizeFactors() function on a DESeqDataSet object built from RSEM outputs (with the help of tximport). The problem is that infinite values are generated because of the use of the log() on transcript/gene lengths equal to 0 (in the txi.rsem$length matrix) when computing the geometric mean. Here is a piece of R code from the tximport vignette slightly modified to reproduce the error:

library(tximportData)
library(tximport)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem")
### modification: ###
txi.rsem$length[which.min(txi.rsem$length)] <- 0
ddsrsem <- DESeqDataSetFromTximport(txi.rsem, colData=samples, design=~1)
ddsrsem <- estimateSizeFactors(ddsrsem) # Erreur : all(!is.na(value)) is not TRUE

Do you think I should filter out rows for which there is at least one transcript/gene length equal to 0?

Below is the result of my sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2                tximport_1.0.0             tximportData_1.0.0         DESeq2_1.12.0              SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0      
 [8] GenomeInfoDb_1.8.0         IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3           XVector_0.12.0       tools_3.3.0          zlibbioc_1.18.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0     
[10] gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.0    cluster_2.0.4        locfit_1.5-9.1       grid_3.3.0          
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.0 XML_3.98-1.4         survival_2.39-3      BiocParallel_1.6.0   foreign_0.8-66       latticeExtra_0.6-28  Formula_1.2-1       
[28] geneplotter_1.50.0   ggplot2_2.1.0        Hmisc_3.17-4         scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2         acepack_1.3-3.3      munsell_0.4.3       
[37] chron_2.3-47

Please let me know if you need more information.

Many thanks for your help, best regards,

Hugo

deseq2 tximport rsem estimatesizefactors • 2.3k views

ADD COMMENT • link 8.9 years ago Hugo Varet ▴ 80

score 0 · Answer 1 · 2016-05-10

Dear Hugo,

Thanks for the bug report.

I should add some warnings and errors at some point because a gene length of 0 doesn't make sense (in the modeling or otherwise).

I guess the easiest thing would be for tximport to warn about this and then DESeqDataSetFromTximport to give an error and not allow length of 0. Does that sound right to you?

Yes, in the meantime, I would add a check in your scripts to somehow deal with rows that have a gene length of 0, either by somehow filling in a reasonable value or removing the gene from analysis. I don't know which makes more sense.

score 0 · Answer 2 · 2016-05-11

0

Entering edit mode

Hugo Varet ▴ 80

@hugo-varet-6301

Last seen 5 weeks ago

France

Dear Mike,

thanks for you answer and your advices. I added a check in my script to deal with that and continue the analysis.

I am not very familiar with RSEM, but I think the difficulty comes from the "effective_length" which can be set to zero for some samples and positive for others, depending on whether reads from the gene have been sequenced. From a statistical point of view, may it be correct to remove the infinite values coming from the log(0) as it is done to compute the size factors?

Best,

Hugo

ADD COMMENT • link 8.9 years ago Hugo Varet ▴ 80

0

Entering edit mode

Well the gene length used as a normalization factor in the tximport workflow, so then it will cause a problem again for dispersion and LFC estimation. It may make sense to plug in the standard calculation of effective gene length.

ADD REPLY • link 8.9 years ago Michael Love 43k