Creating all possible k combinations of n objects with a constraint for large n
2
2
Entering edit mode
mrk.carty ▴ 30
@mrkcarty-7442
Last seen 9.0 years ago
United States

Hi all,

If I have an ordered vector of genomic positions on chr1

> head(positions)
[1] 11161 12412 12462 12687 12830 13316
> length(positions)
[1] 576356

The distances between consecutive positions is represented by

> d <- positions[-1] - positions[-length(positions)]
> head(d)
[1] 1251   50  225  143  486  105

I want to enumerate all possible pairs of genomic locations such that the linear distance between the two points is <= D. I want an alternative to combn function from base package. What is fastest and least memory intensity way to do this in R to do this?

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 data.table_1.9.4                 
 [3] BSgenome_1.34.1                   rtracklayer_1.26.2               
 [5] GenomicAlignments_1.2.2           Rsamtools_1.18.3                 
 [7] Biostrings_2.34.1                 XVector_0.6.0                    
 [9] GenomicRanges_1.18.4              GenomeInfoDb_1.2.4               
[11] IRanges_2.0.1                     S4Vectors_0.4.0                  
[13] Biobase_2.26.0                    BiocGenerics_0.12.1              
[15] Matrix_1.1-5                     

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2    BatchJobs_1.5      BBmisc_1.9         BiocParallel_1.0.3
 [5] bitops_1.0-6       brew_1.0-6         checkmate_1.5.1    chron_2.3-45      
 [9] codetools_0.2-10   DBI_0.3.1          digest_0.6.8       fail_1.2          
[13] foreach_1.4.2      grid_3.1.2         iterators_1.0.7    lattice_0.20-30   
[17] plyr_1.8.1         Rcpp_0.11.4        RCurl_1.95-4.5     reshape2_1.4.1    
[21] RSQLite_1.0.0      sendmailR_1.2-1    stringr_0.6.2      tools_3.1.2       
[25] XML_3.98-1.1       zlibbioc_1.12.0   
>

 

base • 1.0k views
ADD COMMENT
2
Entering edit mode
@sean-davis-490
Last seen 12 weeks ago
United States

Use the GenomicRanges functionality.  Create a toy example:

> library(GenomicRanges)
> locs = GRanges(seqnames=rep('chr1',50000),ranges=IRanges(start=sample(1:2.5e8,50000),width=1))
> locs
GRanges object with 50000 ranges and 0 metadata columns:
          seqnames                 ranges strand
             <Rle>              <IRanges>  <Rle>
      [1]     chr1 [233308084, 233308084]      *
      [2]     chr1 [195830003, 195830003]      *
      [3]     chr1 [ 70145590,  70145590]      *
      [4]     chr1 [130363241, 130363241]      *
      [5]     chr1 [173211123, 173211123]      *
      ...      ...                    ...    ...
  [49996]     chr1 [ 78981164,  78981164]      *
  [49997]     chr1 [175125447, 175125447]      *
  [49998]     chr1 [210573156, 210573156]      *
  [49999]     chr1 [109916498, 109916498]      *
  [50000]     chr1 [111139737, 111139737]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, choose your distance, D and find all locations that overlap a region defined by D upstream and downstream.

> D=500
> ovl = findOverlaps(locs,resize(locs,width = D*2,fix = 'center'))
> ovl
Hits object with 60259 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1           1
      [2]         1       36954
      [3]         2           2
      [4]         3           3
      [5]         4       24922
      ...       ...         ...
  [60255]     49996       49996
  [60256]     49997       49997
  [60257]     49998       49998
  [60258]     49999       49999
  [60259]     50000       50000
  -------
  queryLength: 50000
  subjectLength: 50000

This contains all the pairs of locations that are within D of each other.  You may want to remove all the "identity" hits where the query and subject are the same.  With 50k regions, this runs in less than 2 seconds on my laptop (including generating the toy example).

 

ADD COMMENT
0
Entering edit mode
mrk.carty ▴ 30
@mrkcarty-7442
Last seen 9.0 years ago
United States

Hi Sean,

Thanks a lot! This solution has speed up pipeline.

Best,

Mark

ADD COMMENT

Login before adding your answer.

Traffic: 1032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6