Hi:
I want to implement element-wise-overlaps with iteration for two different bed files. Where I read two bed files as set of GRanges objects such as:
library(rtracklayer)
bed1 <- import("foo.bed")
bed2 <- import("bar.bed")
then callling findOverlaps():
overlap[1] <- findOverlaps(bed1[1], bed.2)
I have tried for loop for iterating each row of bed.1 versus bed.2 and discovering overlapped interval. But, code was not efficient. Anyone can provide simple R script for iterating each peak interval from bed.1 versus bed.2 to detect overlapped regions?? Thanks a lot
@ Hervé Pagès : Thanks a lot for your answer.
@Hervé Pagès: sorry for this repeated question. I am trying to evaluate every genomic interval from bed.1 versus all element in bed.2 and I am gonna evaluate combined p-value of overlapped genomic regions by Fisher' exact test. based on your answer, I have coded like this:
but, certainly it is not good idea, R-studio still executing the code and haven't returned overlapped interval. Is there any alternative solution for that ? Do you know any package that processing multiple bed files in parallel ? many thanks
First of all the
f1
function doesn't make much sense: the value of overlap will be overwritten at each iteration so the function will only return the result of the last call tofindOverlaps()
.Other than that: yes we both know that
for(i in 1:length(x)) findOverlaps(x[i], y)
is very inefficient and I gave you a replacement for it. Did you try it? Did you run into some problems while trying it?About your last question: "Do you know any package that processing multiple bed files in parallel?"
Just to clarify: the inefficiency you're seeing only involves 2 bed files and is easy to work around. I'm not sure parallelizing would help much here but you could try (with either parallel::mclapply() or BiocParallel::bplapply()). I'm pretty confident it won't be as fast as
as(findOverlaps(x, y), "List")
though. In any case, I wouldn't call this kind of parallelization "processing multiple bed files in parallel", that's misleading. The files are not processed in parallel. Anyway and FWIW the closest thing to a "package for processing multiple bed files in parallel" is GenomicFiles. Be aware that it's low-level and pretty generic: you have to learn how to use it and will need to write the code that implements the particular processing you want to apply to each file.One last thing: in the current version of BioC (3.2), the
algorithm
argument is defunct. I highly recommend that you update your installation to use the current BioC. Oldest versions are not supported.Cheers,
H.
@ Hervé Pagès: really appreciated your kind favor so far !!
Hi,
I tried your solution it works. But, when I do like this:
instead, I tried like this: (I am still insist on row wise operation for GRanges objects)
Could you evaluate the feasibility of my code? Thanks a lot
Hi,
I was suggesting that you coerce to List, not to list, i.e. that you do
res <- as(findOverlaps(x,y), "List")
, notres <- as(findOverlaps(x,y), "list")
. The former is more efficient. Anyway the result is not a Hits object so it doesn't make sense to callqueryHits()
orsubjectHits()
on it.I'm not sure what your code is trying to do. If all you want to do is store the hits in a data.frame, then just call
as.data.frame()
on the Hits object returned byfindOverlaps()
:You don't need any loop for that.
H.