The current paradigm to represent genomic variables (e.g. ChIP-seq coverage) in Bioconductor is as a named RleList or as a metadata column on a disjoint GRanges object. This can provide remarkable functionality and performance. However, how does one properly implement mathematical or statistical operations on these objects?
Consider the following (based on the genomicvars help page of GenomicRanges):
library(BSgenome.Scerevisiae.UCSC.sacCer3) BSgen <- BSgenome.Scerevisiae.UCSC.sacCer3 # Function to make fake artificial genomic variables makeVars <- function(BSgen,n) { RleList( lapply(seqlengths(Scerevisiae), function(seqlen) { tmp <- sample(n, seqlen, replace=TRUE) Rle(cumsum(tmp - rev(tmp))) }) ) } # Make genomic variables set.seed(17) x <- makeVars(BSgen,50) set.seed(300) y <- makeVars(BSgen,50) y[y<0] <- 1 # keep y positive
We can use certain supported functionality for performing mathematical operations, e.g. finding parallel maxima using pmax:
newVar.bioc <- BiocGenerics::pmax(x,y) # works right away and is fast and awesome! newVar.base <- base::pmax(x,y) # slow and unsatisfying but still works identical(newVar.base,newVar.bioc)
However, what if I wanted to use other operations that aren't supported, e.g. ppois:
newVar.ppois <- stats::ppois(x,y)
This can be circumvented by manual coercion, but it defeats the purpose of using RleList
x.n <- lapply(x,as.numeric) # defeats the puropse of using RleList y.n <- lapply(y,as.numeric) ppois.xy <- RleList(lapply(seq_along(x.n), function(i){ x.i <- x.n[[i]] y.i <- y.n[[i]] ppois.i <- ppois(x.i,lambda=y.i) }))
My question, then, is what is the best practice in bioconductor for performing mathematical/statistical operations like ppois on genomicvars? Does this require extending R with c/c++? I'm a little unsure where to begin and any advice would be much appreciated.
Devin
I like GPos and it would be great if it were applicable to human-size genomes. It would be nice to have easy way (simple function call) of moving between GPos and the RLE representation in a GRanges. Sometimes you want to operate on runs, other times positions.
Interesting! I am using functions operating on RleLists extensively (coverage, slice, Views, viewSums, etc), how will they be affected?
I'm not sure yet. We'll need to come up with a plan to make the transition as smooth as possible. But we are not here yet.
H.