Question

looping through a GRangesList object

4

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi

Given a GRangesList object grl, running

lapply( grl, length )

is very slow. Why is that? What should I do instead?

Background and example: I want to get, for all human genes, the range that their transcripts cover. So I did

> library( GenomicRanges )
> library( TxDb.Hsapiens.UCSC.hg19.knownGene )
> transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )
> genes <- reduce( transcripts )

I was happy to notice that reduce seems to do what I want:

> genes
GRangesList object of length 23459:
$1 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr19 [58858172, 58874214]      -

$10 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]     chr8 [18248755, 18258723]      +

$100 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]    chr20 [43248163, 43280376]      -

...
<23456 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

However, I want a GRanges object, not a GRangesList with each list element containing only a single range. Hence, I used unlist:

> unlist( genes )
GRanges object with 25004 ranges and 0 metadata columns:
        seqnames                 ranges strand
           <Rle>              <IRanges>  <Rle>
      1    chr19 [ 58858172,  58874214]      -
     10     chr8 [ 18248755,  18258723]      +
    100    chr20 [ 43248163,  43280376]      -
   1000    chr18 [ 25530930,  25757445]      -
  10000     chr1 [243651535, 244006886]      -
    ...      ...                    ...    ...
   9991     chr9 [114979995, 115095944]      -
   9992    chr21 [ 35736323,  35743440]      +
   9993    chr22 [ 19023795,  19109967]      -
   9994     chr6 [ 90539619,  90584155]      +
   9997    chr22 [ 50961997,  50964905]      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

However, the  GenomicRange returned by unlist contains more rows than genes had list elements. Some of the list elements must have contained more than one range. I would like to do

> lengths <- lapply( genes, length )
> which( lengths > 1 )

However, the lapply taskes ages. Why?

Cheers

Simon

 



> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.22.8                 
[3] AnnotationDbi_1.32.3                   
[4] Biobase_2.30.0                         
[5] GenomicRanges_1.22.3                   
[6] GenomeInfoDb_1.6.1                     
[7] IRanges_2.4.6                          
[8] S4Vectors_0.8.7                        
[9] BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0           
 [3] GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [5] tools_3.2.2                SummarizedExperiment_1.0.2
 [7] DBI_0.3.1                  lambda.r_1.1.7            
 [9] futile.logger_1.4.1        rtracklayer_1.30.1        
[11] futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.7             biomaRt_2.26.1            
[15] RSQLite_1.0.0              Biostrings_2.38.3         
[17] Rsamtools_1.22.0           XML_3.98-1.3

GenomicRanges GRangesList • 4.9k views

ADD COMMENT • link updated 9.2 years ago by Julian Gehring ★ 1.3k • written 9.2 years ago by Simon Anders ★ 3.8k

0

Entering edit mode

BTW, I know by now why I had more than one range in some of the object returned by reduce: that was simply because the original object contained alignments to alternate MHC versions and the like. For example:

> reduce( transcripts )[ "3115" ]
GRangesList object of length 1:
$3115 
GRanges object with 8 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
  [1]           chr6 [33043703, 33057473]      +
  [2]  chr6_apd_hap1 [ 4330504,  4344264]      +
  [3]  chr6_cox_hap2 [ 4487883,  4501576]      +
  [4]  chr6_dbb_hap3 [ 4325042,  4338804]      +
  [5] chr6_mann_hap4 [ 4501243,  4554476]      +
  [6]  chr6_mcf_hap5 [ 4380558,  4394318]      +
  [7]  chr6_qbl_hap6 [ 4276154,  4289926]      +
  [8] chr6_ssto_hap7 [ 4524207,  4537977]      +

But I still would like to know why lapply is so slow, and how to better inspect the elements in a GRangesList object.

ADD REPLY • link 9.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

You could try elementLengths(grl) for this case.

ADD REPLY • link 9.2 years ago Sean Davis 21k

score 2 · Answer 1 · 2016-01-22

2

Entering edit mode

Julian Gehring ★ 1.3k

@julian-gehring-5818

Last seen 5.9 years ago

Hi Simon,

A way faster way for

lengths <- lapply( genes, length ) ## runtime: ~115s

would be a dedicated function like elementLengths:

lengths <- elementLengths( genes ) ## runtime: practically none

This is of course not a general solution for *applying arbitrary functions efficiently over each element, but should at least remove the bottleneck.

ADD COMMENT • link 9.2 years ago Julian Gehring ★ 1.3k

3

Entering edit mode

Use lengths() instead of elementLengths(). lengths() is a base R function that works on list() as well as *List, and is generally useful. I'm sure others will chime in with details, but lapply() is slow because it creates a new GRanges for each list element. Generally the pattern for efficient operations on GRangesList, when no existing functionality exists, is gr = unlist(grl); answer = f(gr); relist(answer, grl).

ADD REPLY • link 9.2 years ago Martin Morgan 25k

0

Entering edit mode

Just throwing this out there as a challenge: it would be cool to experiment with the data.table strategy of memcpy()ing slots from one GRanges to one that is re-used for each iteration. This requires modifying the lengths of vectors to truncate them. It could be made to work for specific high-level vector types, like Rle. If an unfamiliar column type is encountered, fall back to the existing loop. Should start with CompressedSplitDataFrameList. Even better would be changing R so that the DATAPTR is not a fixed offset from the SEXP header. Then, no memcpy() required.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Wow, after so many years of using R, I still hear about new functions in base. Had never heard of lengths.

The unlist/relist idiom is useful, thanks. This should be mention in the GenomicRanges introduction vignette (in Section 3.5, "Looping over GRangesList objects").

ADD REPLY • link 9.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

That function was added in only the last year or so.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Simon,

Yes, this should definitely be mentioned there. I'll add it.

lapply( grl, length ) is slow because it does [[ in the loop and [[ is slow on GRangesList objects and on CompressedList derivatives in general. For various reasons. One of them is the complexity of the implementation of the [[ method for CompressedList objects. Another one is method dispatch (there are many levels of dispatch involved when doing [[ on a CompressedList object). There is probably room for some optimization but I don't expect the gain to be dramatic. Maybe by a factor 2 or 3... at best. So that wouldn't really solve the problem. I've learned to live with the idea that lapply() on CompressedList derivatives is almost never an option :-/

Cheers,

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

I need to correct myself about lengths(), which returns the equivalent of sapply(x, length), versus elementLengths(), which returns (from it's help page) the equivalent of sapply(x, NROW). So for instance l = list(data.frame(x=1, y=1)); lengths(l) returns 2 (since length(data.frame(x=1, y=1)) == 2) whereas elementLengths(l) returns 1 (the data.frame has a single row). This does not make a difference for Simon's case, but does in general.