Question

How do I subset a GRanges object based on chromosome (and approximate region)?

deepue ▴ 10

I have the GRanges object data_GR, from which I would like to extract all the regions on a specific chromosome (e.g., chr21). How can I extract them without knowing the coordinates of the regions in advance?

library(GenomicRanges)  # provides makeGRangesFromDataFrame()

set.seed(123)
data_bed = circlize::generateRandomBed(nr = 1000, nc = 0)
data_GR = makeGRangesFromDataFrame(data_bed)
data_GR

GRanges object with 1005 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]     chr1   7634457-9204434      *
     [2]     chr1  9853594-10435028      *
     [3]     chr1 10862809-12716970      *
     [4]     chr1 13814692-18272526      *
     [5]     chr1 19243285-20683999      *
     ...      ...               ...    ...
  [1001]     chrY 46296843-48478084      *
  [1002]     chrY 48551532-51056391      *
  [1003]     chrY 52266848-53042784      *
  [1004]     chrY 57968441-58556744      *
  [1005]     chrY 58660263-59131689      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Is it also possible to extract all the regions that fall within a given coordinate range? These are the chr21 regions in the data:

904  chr21    182543   2542946
905  chr21   5976730   7429360
906  chr21  14592916  14657056
907  chr21  19808058  21397649
908  chr21  21820886  22077901
909  chr21  22561006  23005888
910  chr21  25473663  26160273
911  chr21  26693456  28326067
912  chr21  30501245  34710361
913  chr21  35698126  36052399
914  chr21  36701826  38995722
915  chr21  40122532  40673153
916  chr21  41211634  41248211
917  chr21  41644225  43391767
918  chr21  44023336  44630830
919  chr21  47539670  48127414

For example, I would like to get the regions below, which lie within the range 20000000-30000000:

908  chr21  21820886  22077901
909  chr21  22561006  23005888
910  chr21  25473663  26160273
911  chr21  26693456  28326067

GRanges • 6.1k views

updated 4.6 years ago by merv ▴ 150 • written 4.6 years ago by deepue ▴ 10

Kevin Blighe ★ 4.0k

Hi,

It should be a matter of creating a second GRanges object (containing your target regions) and then using findOverlaps() or intersect() between the two GRanges objects.

To account for an "approximate" overlap, make use of the maxgap and minoverlap parameters.
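As a rough illustration of this approach (not code from the original answer), here is a minimal sketch. The toy object gr is hypothetical: it reuses a few of the chr21 coordinates from the question plus one chr1 range, and the 1 Mb maxgap value is an arbitrary choice for demonstration.

```r
library(GenomicRanges)

# Hypothetical toy data: a few ranges taken from the question
gr <- GRanges(
  seqnames = c("chr1", "chr21", "chr21", "chr21", "chr21"),
  ranges = IRanges(
    start = c(7634457, 19808058, 21820886, 26693456, 30501245),
    end   = c(9204434, 21397649, 22077901, 28326067, 34710361)
  )
)

# Second GRanges object holding the target region
target <- GRanges("chr21", IRanges(20000000, 30000000))

# Ranges in gr that overlap the target by at least one base
hits <- findOverlaps(gr, target)
gr[queryHits(hits)]

# "Approximate" overlap: also keep ranges lying within 1 Mb of the target
hits_loose <- findOverlaps(gr, target, maxgap = 1000000)
gr[queryHits(hits_loose)]
```

With the default settings, a range that only partly overlaps the target (here 19808058-21397649) is still returned.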

Kevin

4.6 years ago Kevin Blighe ★ 4.0k

Michael Lawrence ★ 11k

Another option is subsetByOverlaps().
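A minimal sketch of how this might look, again on a hypothetical toy object gr built from a few coordinates in the question (this is an illustration, not code from the original answer). It also shows plain subsetting on seqnames(), which covers the "whole chromosome" part of the question without needing any coordinates.

```r
library(GenomicRanges)

# Hypothetical toy data: a few ranges taken from the question
gr <- GRanges(
  seqnames = c("chr1", "chr21", "chr21", "chr21", "chr21"),
  ranges = IRanges(
    start = c(7634457, 19808058, 21820886, 26693456, 30501245),
    end   = c(9204434, 21397649, 22077901, 28326067, 34710361)
  )
)

# All ranges on chr21, without knowing any coordinates
chr21 <- gr[seqnames(gr) == "chr21"]

# Region of interest
region <- GRanges("chr21", IRanges(20000000, 30000000))

# Ranges with any overlap with the region
any_overlap <- subsetByOverlaps(gr, region)

# Ranges lying entirely inside the region
contained <- subsetByOverlaps(gr, region, type = "within")
```

The type = "within" variant matches the example in the question, where the range 19808058-21397649 is left out because it starts before 20000000.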

4.6 years ago Michael Lawrence ★ 11k

score 4 · Accepted Answer · 2020-08-31

plyranges

The plyranges package can be syntactically helpful with this. Here are some example filtering operations:

# BiocManager::install('plyranges')
library(plyranges)

# all 'chr21' ranges
data_GR %>% 
  filter(seqnames == 'chr21')

# filter by one region (stringent, i.e., fully contained in region)
data_GR %>% 
  filter(seqnames == 'chr21', start >= 2e7L, end <= 3e7L)

# filter by one region (permissive, i.e., any overlap with region)
data_GR %>% 
  filter_by_overlaps(as('chr21:20000000-30000000', 'GRanges'))

# filter by multiple regions
data_GR %>% 
  filter_by_overlaps(as(c('chr1:1-10000000', 'chr21:20000000-30000000'), 'GRanges'))

The latter two are effectively doing what Kevin suggested, i.e., they create a GRanges object for region(s) one wishes to filter on. The filter_by_overlaps method also has the same optional maxgap and minoverlap arguments.
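To illustrate those two arguments, here is a small sketch on a hypothetical three-range object gr (coordinates borrowed from the question). The thresholds of 1 Mb are arbitrary values picked for demonstration, not recommendations.

```r
library(plyranges)

# Hypothetical toy data: three chr21 ranges from the question
gr <- as(c("chr21:19808058-21397649",
           "chr21:21820886-22077901",
           "chr21:30501245-34710361"), "GRanges")
region <- as("chr21:20000000-30000000", "GRanges")

# Default: any overlap with the region
default_hits <- gr %>% filter_by_overlaps(region)

# maxgap: also keep ranges lying within 1 Mb of the region
near_hits <- gr %>% filter_by_overlaps(region, maxgap = 1000000L)

# minoverlap: keep only ranges sharing at least 1 Mb with the region
big_hits <- gr %>% filter_by_overlaps(region, minoverlap = 1000000L)
```

Here maxgap loosens the filter (it picks up the range starting at 30501245), while minoverlap tightens it (it drops the range that overlaps the region by only about 257 kb).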