transcriptsBy via TxDb.Hsapiens.UCSC.hg19.knownGene painfully slow
Murat Tasan ▴ 70
@murat-tasan-5676
Last seen 10.4 years ago
hi all - does anyone have any performance tips for using transcriptsBy(TXDB, by = "gene") with the UCSC transcript database?
in particular, is the SQLite backing database file indexed (along columns holding the internal IDs)?
i'd provide some timing results for the command execution, but i ran out of patience after about 10 minutes with no results...

cheers,

-m
@martin-morgan-1513
Last seen 5 months ago
United States
On 01/01/2013 01:32 PM, Murat Tasan wrote:
> i'd provide some timing results for the command execution, but i ran out of
> patience after about 10 minutes with no results...

It is 'slow', but only in the couple-of-seconds definition of slow. Something else is going on, so a reproducible example, including sessionInfo(), would be helpful.

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
On 01/01/2013 02:05 PM, Martin Morgan wrote:
> it is 'slow' but only in the couple of seconds definition of slow. Something
> else is going on so a reproducible example, including sessionInfo(), would be
> helpful.

Just to follow my own advice...

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene"))
    length(res)
    sessionInfo()

gives me

    > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    > system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene"))
       user  system elapsed
      3.020   0.012   3.042
    > length(res)
    [1] 22932
    > sessionInfo()
    R version 2.15.2 Patched (2012-12-23 r61401)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
    [2] GenomicFeatures_1.10.1
    [3] AnnotationDbi_1.20.3
    [4] Biobase_2.18.0
    [5] GenomicRanges_1.10.5
    [6] IRanges_1.16.4
    [7] BiocGenerics_0.4.0

    loaded via a namespace (and not attached):
     [1] biomaRt_2.14.0     Biostrings_2.26.2  bitops_1.0-5       BSgenome_1.26.1
     [5] DBI_0.2-5          parallel_2.15.2    RCurl_1.95-3       Rsamtools_1.10.2
     [9] RSQLite_0.11.2     rtracklayer_1.18.1 stats4_2.15.2      tools_2.15.2
    [13] XML_3.95-0.1       zlibbioc_1.4.0
forgot to reply to the list...

here's the full output (not including the result of the last timing line, since that's the offender):

########################################

    > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    Loading required package: GenomicFeatures
    Loading required package: BiocGenerics

    Attaching package: ‘BiocGenerics’

    The following object(s) are masked from ‘package:stats’:

        xtabs

    The following object(s) are masked from ‘package:base’:

        anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get,
        intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int,
        pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply,
        setdiff, table, tapply, union, unique

    Loading required package: IRanges
    Loading required package: GenomicRanges
    Loading required package: AnnotationDbi
    Loading required package: Biobase
    Welcome to Bioconductor

        Vignettes contain introductory material; view with 'browseVignettes()'.
        To cite Bioconductor, see 'citation("Biobase")', and for packages
        'citation("pkgname")'.

    > TXDB <- TxDb.Hsapiens.UCSC.hg19.knownGene
    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0 GenomicFeatures_1.10.1
    [3] AnnotationDbi_1.20.3                    Biobase_2.18.0
    [5] GenomicRanges_1.10.5                    IRanges_1.16.4
    [7] BiocGenerics_0.4.0

    loaded via a namespace (and not attached):
     [1] biomaRt_2.14.0     Biostrings_2.26.2  bitops_1.0-5       BSgenome_1.26.1    DBI_0.2-5
     [6] parallel_2.15.0    RCurl_1.95-3       Rsamtools_1.10.2   RSQLite_0.11.2     rtracklayer_1.18.1
    [11] stats4_2.15.0      tools_2.15.0       XML_3.95-0.1       zlibbioc_1.4.0

########################################

our sessions look pretty much identical, with the exception of R 2.15.0 for me and 2.15.2 for you.
i'll try to push an upgrade in the next day or so and see if that might make a difference.

it also occurred to me that the SQLite query can't be the offender, since this line runs perfectly swiftly:

    > foo <- select(TXDB, keys = keys(TXDB), cols = c("TXCHROM", "TXSTRAND", "TXSTART", "TXEND"), keytype = "GENEID")

so i'm guessing something in the GRangesList construction might be going haywire?

cheers,

-m
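One way to test the guess that the time goes into GRangesList construction rather than into the SQLite query is to profile the call with base R's sampling profiler. What follows is only a minimal sketch: profile_expr() is a hypothetical helper (it is not part of GenomicFeatures), and the TxDb part assumes the annotation package from this thread is installed. Interrupting the call (Ctrl-C) should still summarise the samples collected so far, which may help when the call never seems to return.

```r
# Hypothetical helper: profile an expression with base R's sampling profiler
# and return the per-function summary ordered by total time.
profile_expr <- function(expr, interval = 0.02) {
  prof_file <- tempfile(fileext = ".out")
  on.exit(unlink(prof_file), add = TRUE)
  Rprof(prof_file, interval = interval)
  tryCatch(
    force(expr),  # 'expr' is a promise, so it is evaluated here, under the profiler
    interrupt = function(e) message("interrupted; summarising samples so far"),
    finally = Rprof(NULL)  # always stop the profiler
  )
  # Note: summaryRprof() signals an error if no samples were recorded.
  summaryRprof(prof_file)$by.total
}

# Only run the TxDb part when the annotation package is installed.
if (requireNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene", quietly = TRUE)) {
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  TXDB <- TxDb.Hsapiens.UCSC.hg19.knownGene
  top <- profile_expr(res <- transcriptsBy(TXDB, by = "gene"))
  print(head(top, 15))
}
```

If the top entries turn out to be DBI/RSQLite calls, the database file or the file system it lives on is the more likely culprit; if they are IRanges or GenomicRanges functions, the object construction is.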
Hi Murat,

On 01/01/2013 02:23 PM, Murat Tasan wrote:
> it also occurred to me that the SQLite query can't be the offender, since
> this line runs perfectly swiftly:
>> foo <- select(TXDB, keys = keys(TXDB), cols = c("TXCHROM", "TXSTRAND",
> "TXSTART", "TXEND"), keytype = "GENEID")

A fair comparison would be to also select the TXID and TXNAME cols because transcriptsBy() extracts them:

    foo <- select(TXDB, keys = keys(TXDB), cols = c("TXCHROM", "TXSTRAND", "TXSTART", "TXEND", "TXID", "TXNAME"), keytype = "GENEID")

I'm not sure why, but requesting those 2 additional cols slows down select() by a factor of 10 for me (from 1 sec to 10 sec).

One way to make sure the SQLite query isn't the offender is to use the SQLite client (sqlite3) from the Unix command line to query the TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite file directly. Try the following query, which is more or less the query used by transcriptsBy(. , by="gene"):

    SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand, tx_start, tx_end, gene_id
    FROM transcript INNER JOIN gene ON (transcript._tx_id=gene._tx_id)
    WHERE gene_id IS NOT NULL
    ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end;

On my laptop:

    time sqlite3 TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite 'SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand, tx_start, tx_end, gene_id FROM transcript INNER JOIN gene ON transcript._tx_id=gene._tx_id WHERE gene_id IS NOT NULL ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end' > sql.result

    real    0m0.507s
    user    0m0.368s
    sys     0m0.136s

FWIW, I remember SQLite queries being painfully slow when the sqlite file is located on OCFS (Oracle Cluster File System), but that was a long time ago (with an old version of OCFS). It could be worth checking the file system where your packages are installed (your .libPaths() folder).

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
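For anyone without the sqlite3 command-line client, the same check can be sketched from within R using DBI and RSQLite. This is an illustration only, under a few assumptions: time_sqlite_query() is a hypothetical helper, the location of the .sqlite file under the package's extdata folder is assumed rather than confirmed in this thread, and the SQL is the query quoted above. The helper also lists the indexes recorded in sqlite_master, which speaks to the original question of whether the file is indexed.

```r
library(DBI)

# Hypothetical helper: run 'sql' directly against an SQLite file, bypassing
# GenomicFeatures, and report elapsed time, row count, and the file's indexes.
time_sqlite_query <- function(db_path, sql) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  timing <- system.time(rows <- dbGetQuery(con, sql))
  indexes <- dbGetQuery(
    con, "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"
  )
  list(elapsed = timing[["elapsed"]], nrow = nrow(rows), indexes = indexes)
}

# The query quoted in the post above.
sql <- paste(
  "SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,",
  "tx_start, tx_end, gene_id",
  "FROM transcript INNER JOIN gene ON (transcript._tx_id=gene._tx_id)",
  "WHERE gene_id IS NOT NULL",
  "ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end"
)

# Assumption: the database file sits in the package's extdata folder.
# system.file() returns "" when the package or the file is not found.
db_path <- system.file(
  "extdata", "TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite",
  package = "TxDb.Hsapiens.UCSC.hg19.knownGene"
)

if (nzchar(db_path)) {
  print(.libPaths())  # where packages are installed, for the file-system check
  print(db_path)
  print(time_sqlite_query(db_path, sql))
}
```

If the elapsed time here is well under a second while transcriptsBy() still hangs, the SQLite file and its file system are probably not the problem.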