Question

Implementing fast mathematical and statistical operations on genomic variables in bioconductor

2

Entering edit mode

Devin King ▴ 20

@devin-king-13481

Last seen 6.8 years ago

The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?

Consider the following (based on the genomicvars help page of GenomicRanges):

library(BSgenome.Scerevisiae.UCSC.sacCer3)
        BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3
        # Function to make fake artificial genomic variables
              makeVars <- function(BSgen,n) {
                RleList(
                  lapply(seqlengths(Scerevisiae),
                         function(seqlen) {
                           tmp <- sample(n, seqlen, replace=TRUE)
                           Rle(cumsum(tmp - rev(tmp)))
                         })
                )
              }
        # Make genomic variables
              set.seed(17)
              x <- makeVars(BSgen,50)
              set.seed(300)
              y <- makeVars(BSgen,50)
              y[y<0] <- 1 # keep y positive

We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:

newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome!

newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works
identical(newVar.base,newVar.bioc)

However, what if I wanted to use other operations that aren't supported, e.g. ppois:

newVar.ppois <- stats::ppois(x,y)

This can be circumvented by manual coercion, but it defeats the purpose of using RleList

x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList
y.n <- lapply(y,as.numeric)

ppois.xy <- RleList(lapply(seq_along(x.n), function(i){
  x.i <- x.n[[i]]
  y.i <- y.n[[i]]
  ppois.i <- ppois(x.i,lambda=y.i)
}))

My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.

Devin

genomicranges biocgenerics rlelist genomicvars iranges • 1.9k views

ADD COMMENT • link updated 7.4 years ago by Hervé Pagès 16k • written 7.4 years ago by Devin King ▴ 20

score 2 · Answer 1 · 2017-09-21

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 9 minutes ago

Seattle, WA, United States

Hi Devin,

Just a heads up that the plan is to move away from the named RleList representation or the disjoint GRanges with metadata columns for representing genomic variables. In a not too distant future genomic variables will be better represented as metadata columns on a GPos object.

Note that this is already possible for "small" genomes i.e. for genomes with less than 2^31 nucleotide positions. In order to support longer genomes (e.g. the Human genome) we first need to support long GPos and long Rle objects. Some progress was made a few months ago in the S4Vectors package towards this (long Rle objects are partially supported) but there is still a lot to do. Once we have this, a more natural/intuitive/convenient way to represent the coverage of a full genome will be with an Rle in the metadata column of a GPos object.

H.

ADD COMMENT • link 7.4 years ago Hervé Pagès 16k

0

Entering edit mode

I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.

ADD REPLY • link 7.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?

ADD REPLY • link 7.4 years ago maltethodberg ▴ 180

0

Entering edit mode

I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.

H.

ADD REPLY • link 7.4 years ago Hervé Pagès 16k

score 1 · Answer 2 · 2017-09-19

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.2 years ago

United States

I am not sure whether C/C++ would really help unless there is some library for numerical computing with RLEs. The code you propose for ppois() is OK but I would use mapply() for cleanliness and coerce to the numeric vector within the loop in order to save memory (at a time cost).

We are working to push these types of abstractions down into the internals of R, so more numeric functions will gain support, but it will be a long time before all of the third party native code is ported.

ADD COMMENT • link 7.4 years ago Michael Lawrence ★ 11k

1

Entering edit mode

Map() / mapply() also retain names. mendoapply() returns the same object as input, avoiding explicit coercion to RleList.

ppoisRle <- function(q, lambda, ...) {
    q <- as.numeric(q)
    lambda <- as.numeric(lambda)
    value <- ppois(q = q, lambda = lambda, ...)
    Rle(value)
}
ppois.xy <- RleList(Map(ppois, x.n, y.n))
ppois.xy.2a <- RleList(Map(ppoisRle, x, y))
ppois.xy.2b <- mendoapply(ppoisRle, x, y)
identical(ppois.xy, ppois.xy.2a)           # TRUE
identical(ppois.xy.2a, ppois.xy.2b)        # TRUE

I don't think there's a meaningful time-difference between doing the coercion to / from integer in or outside the mapping, but obviously a space savings.

Using the same amount of memory as the original, a faster (about 2 vs. 3 seconds for me, 1.75s in ppois()) calculation is the unlist / relist trick (ok for smaller genomes!)

x.all <- as.numeric(unlist(unname(x)))
y.all <- as.numeric(unlist(unname(y)))
value.all <- ppois(x.all, y.all)
ppois.all <- relist(Rle(value.all), x)

ADD REPLY • link 7.4 years ago Martin Morgan 25k