Why is *ply-ing over a GRangesList much slower than *ply-ing over an IRangesList?


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 9 days ago

United States

Hi,

Looping using any of the *ply functions (lapply, sapply, seqapply, etc.) seems to be significantly slower when you are iterating over a GRangesList vs. an IRangesList:

R> library(GenomicFeatures)
R> txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite",
                                    package="GenomicFeatures"))
R> xcripts <- transcriptsBy(txdb, 'gene')
R> system.time(l1 <- sapply(xcripts, length))
   user  system elapsed
  2.298   0.003   2.302

R> irl <- IRangesList(lapply(xcripts, ranges))
R> system.time(l2 <- sapply(irl, length))
   user  system elapsed
  0.047   0.001   0.049

R> identical(l1, l2)
[1] TRUE

I was curious if this is known/expected behavior and it's unavoidable, or .. ?

Thanks,
-steve

R> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-08-21 r52791)
Platform: i386-apple-darwin10.4.0/i386 (32-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] org.Hs.eg.db_2.4.1     RSQLite_0.9-2          DBI_0.2-5              AnnotationDbi_1.11.4
[5] Biobase_2.9.0          GenomicFeatures_1.1.11 GenomicRanges_1.1.20   IRanges_1.7.21

loaded via a namespace (and not attached):
[1] BSgenome_1.17.6    Biostrings_2.17.29 RCurl_1.4-3        XML_3.1-1          biomaRt_2.5.1
[6] rtracklayer_1.9.7  tools_2.12.0

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Cancer Cancer • 1.6k views

ADD COMMENT • link updated 14.7 years ago by Michael Lawrence ★ 11k • written 14.7 years ago by Steve Lianoglou ★ 13k


Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.4 years ago

United States

My guess is that your GRangesList is compressed, whereas the IRangesList is uncompressed. Extracting an element from a compressed list will be slower due to the compression.

Michael
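To make the compression explanation in this reply concrete, here is a minimal base-R sketch of the idea. It is only an analogy: the helper names (`make_compressed`, `get_element`, `lengths_by_extraction`, `lengths_from_offsets`) are hypothetical and this is not how IRanges or GenomicRanges implement their compressed lists. It assumes a compressed list keeps one concatenated vector plus the end offset of each element, so a per-element loop pays an extraction cost on every iteration while a vectorized query can read the lengths straight from the offsets.

```r
# Base-R analogy only (hypothetical helpers, not the IRanges implementation):
# a "compressed" list keeps all elements in one concatenated vector plus
# the end offset of each element.
make_compressed <- function(x) {
  list(unlistData = unlist(x, use.names = FALSE),
       ends = unname(cumsum(lengths(x))),
       names = names(x))
}

# Extracting element i means computing its offsets and copying a slice.
get_element <- function(cl, i) {
  start <- if (i == 1L) 1L else cl$ends[i - 1L] + 1L
  cl$unlistData[seq.int(start, length.out = cl$ends[i] - start + 1L)]
}

# Element-by-element: one extraction per element, like sapply(x, length).
lengths_by_extraction <- function(cl) {
  out <- vapply(seq_along(cl$ends),
                function(i) length(get_element(cl, i)), integer(1))
  names(out) <- cl$names
  out
}

# Vectorized: the lengths fall straight out of the stored offsets.
lengths_from_offsets <- function(cl) {
  out <- diff(c(0L, cl$ends))
  names(out) <- cl$names
  out
}

x <- list(geneA = c(10L, 20L, 30L), geneB = 40L, geneC = c(50L, 60L))
cl <- make_compressed(x)
same <- identical(lengths_by_extraction(cl), lengths_from_offsets(cl))
```

Both functions return the same named integer vector; the difference is that the first one materializes every element before measuring it. To my knowledge, current Bioconductor releases expose this kind of vectorized query as `elementNROWS()` in S4Vectors, which avoids the per-element extraction that `sapply(x, length)` triggers.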

ADD COMMENT • link 14.7 years ago Michael Lawrence ★ 11k


Hi Michael,

On Wed, Aug 25, 2010 at 10:21 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
> My guess is that your GRangesList is compressed, whereas the IRangesList is
> uncompressed. Extracting an element from a compressed list will be slower
> due to the compression.

Actually, the IRangesList from the example above is also compressed:

R> is(irl)
[1] "CompressedIRangesList" "IRangesList"           "CompressedList"
[4] "RangesList"            "Sequence"              "Annotated"

So I'm not sure that is what's causing the speed difference, right?

I wrote the portion below before I checked whether `irl` was compressed or not, but I'm curious about it, so I'll keep the question, assuming that there will be some significant speed difference when iterating over compressed lists anyway:

My next question was whether there was any way to have an uncompressed GRangesList, so I went poking around the IRanges/GenomicRanges code.

It seems the answer to that is no, since GRangesList extends/contains CompressedList ... right?

Would it be (technically) possible to have something like CompressedGRangesList and a "normal" GRangesList -- analogous to how we currently have an IRangesList and CompressedIRangesList ... or is there some other reason that all GRangesList objects must be CompressedLists?

Thanks,
-steve

ADD REPLY • link 14.7 years ago Steve Lianoglou ★ 13k


Steve,

I haven't profiled the code yet to know what is going on, but I will address your followup question.

I have a feeling that the GRangesList concept will be growing over time, and I am not sure what the tipping point will be for changes in code to occur. I see two issues related to GRangesList. The first is its internal storage (as you mentioned), and the second is its semantics: are the ranges/intervals contained within each of the elements "grouped", as exons within a transcript, or are they considered to be independent entities, as collections of tracks for a genome browser?

Patrick

ADD REPLY • link 14.7 years ago Patrick Aboyoun ★ 1.6k


I second the need for a GRangesList variant that behaves consistently with e.g. RangesList. The compressed storage makes sense for the current use case and design. The more general list would want multiple storage modes.

Michael

ADD REPLY • link 14.7 years ago Michael Lawrence ★ 11k


Hi,

On Wed, Aug 25, 2010 at 1:58 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> I have a feeling that the GRangesList concept will be growing over time and
> I am not sure what the tipping point will be for changes in code to occur. I
> see two issues related to GRangesList. The first being its internal storage
> (as you mentioned)

Yeah. That's a +1 vote on addressing that at some point from me :-)

> and the second being its semantics (are the
> ranges/intervals contained within each of the elements "grouped" as exons
> within a transcript or are the ranges/intervals considered to be independent
> entities as collections of tracks for a genome browser).

I'm not sure ... my first reaction is to think that one would consider each element in a GRangesList to be grouped "in some way" (like exons, as you mention). I would think to model separate tracks as separate GRangesList objects, not as separate elements of a *Ranges object in a *RangesList.

I actually can't think of a scenario where I would want the firepower of *RangesList objects (primarily fast overlap and set-like queries) to address your 2nd scenario (different tracks), whereas I can easily appreciate (more and more) the value of considering each element in the *RangesList as being grouped ... and if I don't want the elements to be grouped at all, I'd just unlist() it ...

If elements weren't "grouped", I think I'd probably only ever want to iterate over each *RangesList element and do set operations within those (maybe overlap TF binding (one IRanges) with some acetylation track (another IRanges)) -- but again, I haven't thought about it in this way before ...

I'm not even sure this was 2 cents' worth, but there you have it ...

-steve
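The "just unlist() it" remark in this reply can be sketched in base R as a flatten, operate, regroup pattern. This is only an illustration under stated assumptions: the `tracks` list and its names are made-up stand-ins for range widths, and plain integer vectors are used in place of real *Ranges objects.

```r
# Hypothetical data: widths of ranges grouped by track (names are made up).
tracks <- list(tf_binding = c(12L, 30L, 7L), acetylation = c(100L, 55L))

# Flatten once, remembering which group each value came from.
flat <- unlist(tracks, use.names = FALSE)
group <- factor(rep(names(tracks), lengths(tracks)), levels = names(tracks))

# Work on the flat vector in one vectorized step ...
doubled <- flat * 2L

# ... and regroup only if the grouping still matters afterwards.
regrouped <- split(doubled, group)

# A per-group summary can also be computed directly from the flat form.
totals <- tapply(flat, group, sum)
```

The design point is that the grouping lives in a separate index (`group`) rather than in the container, so ungrouped operations run once over `flat` instead of once per list element.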

ADD REPLY • link 14.7 years ago Steve Lianoglou ★ 13k


On Wed, Aug 25, 2010 at 11:59 AM, Steve Lianoglou <mailinglist.honeypot@gmail.com> wrote:
> I actually can't think of a scenario where I would want the firepower
> of *RangesList objects (primarily fast overlap and set-like queries)
> to address your 2nd scenario (different tracks) where I can easily
> appreciate (more and more) where considering each element in the
> *RangesList as being grouped ... and if I don't want the elements to
> be grouped at all, I'd just unlist() it ...
>
> If elements weren't "grouped" I think I'd probably only ever want to
> iterate over each *RangesList element and do set operations w/in those
> (maybe overlap TF binding (one IRange) with some acetylation track
> (another IRange)) -- but again, haven't thought about it in this way
> before ...

Just consider the case of dozens of samples. It would be impractical to have a separate variable in the workspace for each one. Sure, there's the old R list, but I'd like the convenience of the high-level list classes. And then there's just the desire for consistency with the other List classes, which simply perform operations element-wise without any special semantics.

ADD REPLY • link 14.7 years ago Michael Lawrence ★ 11k
