Hi everyone,
I want to run findOverlaps when both the query and the subject are very large. My query is a GRanges object of genomic intervals on a single chromosome (~8 million ranges), and my subject is another GRanges object of roughly 2 million reads. My plan was to divide the query into smaller chunks (~76K ranges each) and run the overlaps on these smaller chunks as separate jobs on a Torque scheduler. For each job I requested one node with 4 processors, 40 GB of physical and virtual memory in total, and 10 GB of physical and virtual memory per processor.
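For reference, each per-chunk job is submitted with a script along these lines (a sketch only: the script name overlap_chunk.R and the CHUNK variable are placeholders, not my actual file names):

```sh
#!/bin/bash
## Torque resource request: 1 node x 4 processors,
## 40 GB total physical/virtual memory, 10 GB of each per processor.
#PBS -l nodes=1:ppn=4
#PBS -l mem=40gb,vmem=40gb
#PBS -l pmem=10gb,pvmem=10gb
cd $PBS_O_WORKDIR
## CHUNK is the chunk index, passed at submission time with e.g.:
##   qsub -v CHUNK=1 run_chunk.pbs
Rscript overlap_chunk.R $CHUNK
```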
> reads
GRangesList object of length 2:
$R1
GRanges object with 1948817 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]    chr21 [34574583, 34574683]      -
         [2]    chr21 [29410526, 29410626]      +
         [3]    chr21 [16007419, 16007519]      -
         [4]    chr21 [18682677, 18682777]      -
         [5]    chr21 [16577271, 16577323]      -
         ...      ...                  ...    ...
   [1948813]    chr21 [37892775, 37892875]      +
   [1948814]    chr21 [44730179, 44730279]      +
   [1948815]    chr21 [21558129, 21558229]      +
   [1948816]    chr21 [40091321, 40091421]      +
   [1948817]    chr21 [23817243, 23817339]      +

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> pairs
GRangesList object of length 2:
$R1
GRanges object with 7651444 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]    chr21   [9410946, 9417138]      *
         [2]    chr21   [9410946, 9417138]      *
         [3]    chr21   [9410946, 9417138]      *
         [4]    chr21   [9410946, 9417138]      *
         [5]    chr21   [9410946, 9417138]      *
         ...      ...                  ...    ...
   [7651440]    chr21 [48113915, 48117945]      *
   [7651441]    chr21 [48113915, 48117945]      *
   [7651442]    chr21 [48113915, 48117945]      *
   [7651443]    chr21 [48113915, 48117945]      *
   [7651444]    chr21 [48113915, 48117945]      *

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
This is my code:
cores <- 4
library(GenomicRanges)
## Split x into n roughly equal-sized chunks, preserving order.
chunk <- function(x, n) split(x, sort(rank(seq_len(length(x)) %% n)))

overlaps <- function(G1, G2, forward, reverse, cores) {
  library(parallel)
  hitsL <- findOverlaps(G1, forward)
  hitsR <- findOverlaps(G2, reverse)
  ## Use explicit factor levels so that queries with no hits still get an
  ## (empty) list element; plain split() drops them, and since L[[x]] and
  ## R[[x]] index by position, the lists would fall out of sync with the
  ## query index x.
  L <- split(subjectHits(hitsL), factor(queryHits(hitsL), levels = seq_along(G1)))
  R <- split(subjectHits(hitsR), factor(queryHits(hitsR), levels = seq_along(G2)))
  unlist(mclapply(seq_along(G1),
                  function(x) length(intersect(L[[x]], R[[x]])),
                  mc.cores = cores))
}
n <- length(pairs$R1)
forward <- c(reads$R1, reads$R2)
reverse <- c(reads$R2, reads$R1)
G1 <- chunk(pairs$R1, 100)
G2 <- chunk(pairs$R2, 100)
counts <- NULL
time.proc <- system.time({
  for (i in 1:3) counts <- c(counts, overlaps(G1[[i]], G2[[i]], forward, reverse, cores))
})[3] / 60
time.proc
elapsed
4.4179
At this rate (~4.4 minutes for 3 of the 100 chunks), the full job would take well over 2 hours to complete. How can I speed this up?
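One alternative I have been considering, to avoid the per-range mclapply entirely, is to count the common hits in a vectorized way. Since forward and reverse are the same reads in swapped order, they index the same subject space, so the count for query x is just the number of (query, subject) pairs present in both hit sets. A rough sketch with base R set operations (untested at full scale; the paste() keys on millions of hits may themselves be memory-hungry):

```r
library(GenomicRanges)

fast_counts <- function(G1, G2, forward, reverse) {
  hitsL <- findOverlaps(G1, forward)
  hitsR <- findOverlaps(G2, reverse)
  ## Encode each (query, subject) hit as a single string key, then keep
  ## the query indices of the pairs found in both hit sets.
  keyL <- paste(queryHits(hitsL), subjectHits(hitsL))
  keyR <- paste(queryHits(hitsR), subjectHits(hitsR))
  common <- queryHits(hitsL)[keyL %in% keyR]
  ## Count common pairs per query; queries with no common hits get 0.
  tabulate(common, nbins = length(G1))
}
```

This removes the split()/mclapply() step altogether, so it should behave the same regardless of the number of cores.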
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] doMC_1.3.3 iterators_1.0.7 foreach_1.4.2
[4] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0 IRanges_2.0.0
[7] S4Vectors_0.4.0 BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] XVector_0.6.0 codetools_0.2-9 tools_3.1.1
Best,
Mark