Question

Speed efficient way to select the n-th element from GRangsList

1

Entering edit mode

foehn ▴ 100

@foehn-16281

Last seen 3.4 years ago

United States

Hello,

Suppose we have a simple task, that is, to select the fifth range from each GRanges object from a GRangesList object. I can think of two methods:

library(EnsDb.Hsapiens.v86)
library(microbenchmark)
X <- exonsBy(EnsDb.Hsapiens.v86, by = "tx")
X <- X[lengths(X) > 10] # to make sure every `GRanges` object has at least 11 elements;
X <- X[1:100] # to make a smaller dataset

# Method 1
> microbenchmark(res <- endoapply(X, FUN = "[", 5))
Unit: milliseconds
                              expr     min       lq     mean   median       uq
 res <- endoapply(X, FUN = "[", 5) 726.677 841.4338 879.6332 872.5642 905.1298
      max neval
 1755.904   100

# Method 2
> microbenchmark(res <- lapply(X, FUN = "[", 5))
Unit: milliseconds
                           expr      min     lq     mean   median      uq
 res <- lapply(X, FUN = "[", 5) 594.1233 638.44 710.8889 682.4832 739.132
      max neval
 1616.825   100
# `res` is of `list` class, so it needs to be converted into `GRangesList` as `endoapply` does.  
> microbenchmark(as(res, "GRangesList"))
Unit: milliseconds
                   expr      min       lq     mean   median       uq     max
 as(res, "GRangesList") 124.2346 137.1553 187.5402 152.5325 252.8355 353.438
 neval
   100

The benchmark test shows that both methods have comparable speed, and neither is fast.

I'm wondering if there is any faster way to do it, besides using multi-core parallel computing. Thanks.

---------------------------------------------------------------------------------------------------------------------------------------------------

Update1. Thanks to @Martin Morgan's suggestion; it makes things 100 times faster.

> microbenchmark(as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList"))
Unit: milliseconds
                                                                 expr      min
 as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList") 4.518822
       lq     mean   median       uq      max neval
 5.559363 7.725312 6.223389 7.224878 100.3304   100

Update 2. According to @Martin Morgan's second suggestion, splitAsList goes even faster than as(..., "GRangesList").

> microbenchmark(splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100))
Unit: milliseconds
                                                                       expr
 splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5],      1:100)
      min       lq     mean   median       uq      max neval
 4.748592 4.852561 6.094437 4.961308 5.099946 105.5112   100

Update 3. Thanks to @Michael Lawrence's answer too. Indeed it takes about twice the amount of time.

> microbenchmark(res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList"))
Unit: milliseconds
                                                         expr     min       lq
 res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList") 10.4255 11.49089
     mean   median       uq     max neval
 13.41262 12.74597 14.06349 25.7567   100

If we break down the process,

> microbenchmark(i <- rep(IntegerList(5L), length(X)))
Unit: milliseconds
                                 expr      min       lq    mean   median
 i <- rep(IntegerList(5L), length(X)) 5.770752 7.612974 9.61129 8.744468
       uq      max neval
 10.86042 38.87412   100
> microbenchmark(res <- X[i])
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 res <- X[i] 5.101723 5.827044 6.797364 6.401072 7.196423 14.13441   100
> microbenchmark(as(res, "GRangesList"))
Unit: microseconds
                   expr   min     lq    mean median    uq   max neval
 as(res, "GRangesList") 2.064 2.1705 2.39714   2.24 2.313 16.42   100

we can see that constructing a CompressedIntegerList object is the main time consuming step, while subsetting takes similar amount of time as @Martin Morgan's method.

In summary, @Martin Morgan's solution is faster, while @Michael Lawrence's seems more versatile. Both are much more efficient than the original methods, thanks.

R GenomicRanges IRanges • 1.3k views

ADD COMMENT • link 6.1 years ago foehn ▴ 100

score 2 · Answer 1 · 2018-12-18

You could also use List-based subsetting.

X[rep(IntegerList(5L), length(X))]

That takes about twice the time as Martin's example but one could argue that is more comprehensible once one understands how to subset by a List.

Also, instead of splitAsList(), you could just coerce the result to a GRangesList. It takes the same amount of time.

as(unlist(X)[offset], "GRangesList")

score 1 · Answer 2 · 2018-12-18

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 8 days ago

United States

It's fast to unlist a GRangesList, so figure out the offsets of interest

offset = cumsum(head(c(0, lengths(X)), -1)) + 5

and subset the unlisted elements

unlist(X)[offset]

For GRangesList (but why?) use splitAsList(unlist(X)[offset], seq_along(offset)).

ADD COMMENT • link 6.1 years ago Martin Morgan 25k