Subset GRanges by intersection
2
0
Entering edit mode
Robert ▴ 10
@robert-21245
Last seen 3.0 years ago
United States

I would like to subset a subject GRanges object by intersection with a query GRanges object. When I use subsetByOverlaps, some entries in the resulting GRanges object can extend "outside" the query GRanges.

library(GenomicRanges)

subject <- GRanges(c("chrI:1-99", "chrI:100-199", "chrI:200-299"), score = 1:3)
subject
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI      1-99      * |         1
#   [2]     chrI   100-199      * |         2
#   [3]     chrI   200-299      * |         3
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

query <- GRanges("chrI:50-150")
query
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI    50-150      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

subsetByOverlaps(x = subject, ranges = query)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI      1-99      * |         1
#   [2]     chrI   100-199      * |         2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

I would like these to be "truncated" such that they end at the coordinates defined by the query GRanges. I can achieve this by first converting the subject GRanges object to a GPos object.

subject_GPos <- GPos(subject)
subject_GPos$score <- rep(subject$score, width(subject))
subject_GPos
# StitchedGPos object with 299 positions and 1 metadata column:
#         seqnames       pos strand |     score
#            <Rle> <integer>  <Rle> | <integer>
#     [1]     chrI         1      * |         1
#     [2]     chrI         2      * |         1
#     [3]     chrI         3      * |         1
#     [4]     chrI         4      * |         1
#     [5]     chrI         5      * |         1
#     ...      ...       ...    ... .       ...
#   [295]     chrI       295      * |         3
#   [296]     chrI       296      * |         3
#   [297]     chrI       297      * |         3
#   [298]     chrI       298      * |         3
#   [299]     chrI       299      * |         3
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

subsetByOverlaps(x = subject_GPos, ranges = query)
# StitchedGPos object with 101 positions and 1 metadata column:
#         seqnames       pos strand |     score
#            <Rle> <integer>  <Rle> | <integer>
#     [1]     chrI        50      * |         1
#     [2]     chrI        51      * |         1
#     [3]     chrI        52      * |         1
#     [4]     chrI        53      * |         1
#     [5]     chrI        54      * |         1
#     ...      ...       ...    ... .       ...
#    [97]     chrI       146      * |         2
#    [98]     chrI       147      * |         2
#    [99]     chrI       148      * |         2
#   [100]     chrI       149      * |         2
#   [101]     chrI       150      * |         2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Is there an easier way to do this without conversion to GPos?

sessionInfo( )
# attached base packages:
# [1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     
# other attached packages:
# [1] Gviz_1.38.0          GenomicRanges_1.46.1 GenomeInfoDb_1.30.0  IRanges_2.28.0       S4Vectors_0.32.3     BiocGenerics_0.40.0
findOverlaps GPos GenomicRanges subsetByOverlaps • 4.2k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

Hervé will probably have something more elegant, but here's my dumb attempt.

> fo <- findOverlaps(subject, query)

> pintersect(subject[queryHits(fo)], query[subjectHits(fo)])
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand |     score       hit
         <Rle> <IRanges>  <Rle> | <integer> <logical>
  [1]     chrI     50-99      * |         1      TRUE
  [2]     chrI   100-150      * |         2      TRUE
  [3]     chrI   330-350      * |         4      TRUE
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

That also included me hitting 'Enter' before adding the triple-backticks, so you'll need to go to the support site to see the edited post

ADD REPLY
0
Entering edit mode

That looks great. If you want to turn this into an answer I will be happy to accept it.

ADD REPLY
0
Entering edit mode
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 5 months ago
United States

try this:

GenomicRanges::intersect(subject,subsetByOverlaps(x = subject, ranges = query))
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chrI     1-199      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
1
Entering edit mode

Wouldn't this be closer to what the OP wanted?

> z <- pintersect(subject, query)
> z[z$hit]
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     score       hit
         <Rle> <IRanges>  <Rle> | <integer> <logical>
  [1]     chrI     50-99      * |         1      TRUE
  [2]     chrI   100-150      * |         2      TRUE
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD REPLY
1
Entering edit mode

We must assume that the OP cares that the ranges in the output can somehow be mapped back to the ranges in subject. Otherwise, they could simply use interesect().

If that's the case then I guess pintersect() works fine but only if query contains a single range. In fact pintersect() expects subject and query to have the same length (i.e. same number of ranges), and, if they don't, it will recycle the shortest to the length of the longest. This is why it seems to be doing what the OP wants in that case.

Note what happens if a range in subject has no intersection with query:

pintersect(subject, query)
# GRanges object with 3 ranges and 2 metadata columns:
#       seqnames    ranges strand |     score       hit
#          <Rle> <IRanges>  <Rle> | <integer> <logical>
#   [1]     chrI     50-99      * |         1      TRUE
#   [2]     chrI   100-150      * |         2      TRUE
#   [3]     chrI   200-199      * |         3     FALSE
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

An empty range is returned (3rd range chrI:200-199 is empty). This is because by default pintersect() always returns a GRanges object _parallel_ to the input (i.e. one range in the output per range in the input). To get rid of these empty ranges, there's either Jim's method above (z[z$hit]) or one can call pintersect() with drop.nohit.ranges=TRUE:

pintersect(subject, query, drop.nohit.ranges=TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI     50-99      * |         1
#   [2]     chrI   100-150      * |         2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This is documented in ?pintersect,GRanges,GRanges-method.

Now if query has more than one range, this is a whole different story. A complication is that in general a range in subject will get fragmented into several ranges when intersected with query so you can no longer easily represent the result with a GRanges object (granted that you care about some notion of mappability between the output and the input ranges). You will need to use a GRangesList for that. But we'll embark on this one only if that's something the OP needs.

H.

ADD REPLY
0
Entering edit mode

Thank you all for the replies. I indeed would like the output to be mapped back to the subject ranges. Seems my minimal example was too minimal. Here is a revised version:

library(GenomicRanges)

subject <- GRanges(c("chrI:1-99", "chrI:100-199", "chrI:200-299", "chrI:300-399"), score = 1:4)
subject
# GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI      1-99      * |         1
#   [2]     chrI   100-199      * |         2
#   [3]     chrI   200-299      * |         3
#   [4]     chrI   300-399      * |         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

query <- GRanges(c("chrI:50-150", "chrI:330-350"))
query
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI    50-150      *
#   [2]     chrI   330-350      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

subsetByOverlaps(x = subject, ranges = query)
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI      1-99      * |         1
#   [2]     chrI   100-199      * |         2
#   [3]     chrI   300-399      * |         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

What I would like to get is:

# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chrI     50-99      * |         1
#   [2]     chrI   100-150      * |         2
#   [3]     chrI   330-350      * |         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

As Hervé mentioned, pintersect won't work here

pintersect(x = subject, y = query)
# Error in .pintersect_GRanges_GRanges(x, y, drop.nohit.ranges = drop.nohit.ranges,  : 
# 'y' must have the length of 'x' or length 1

and intersect will drop the score metadata.

intersect(x = subject, y = query)
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI    50-150      *
#   [2]     chrI   330-350      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD REPLY
1
Entering edit mode

As I said, if query has more than one range, you can no longer expect the result to be a GRanges object parallel to subject, _in general_. That's because intersecting a range in subject with the ranges in query can produce more than one range. For example, if query is GRanges(c("chrI:50-75", "chrI:80-150", "chrI:330-350")), what would you like to get? One possibility is to use a GRangesList object to represent the result:

library(GenomicRanges)
subject <- GRanges(c(A="chrI:1-99", B="chrI:100-199", C="chrI:200-299", D="chrI:300-399"), score=1:4)
query <- GRanges(c("chrI:50-75", "chrI:80-150", "chrI:330-350"))

GRangesList(lapply(setNames(seq_along(subject), names(subject)), function(i) intersect(subject[i], query)))
# GRangesList object of length 4:
# $A
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI     50-75      *
#   [2]     chrI     80-99      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# $B
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI   100-150      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $C
# GRanges object with 0 ranges and 0 metadata columns:
#    seqnames    ranges strand
#       <Rle> <IRanges>  <Rle>
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $D
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrI   330-350      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

H.

ADD REPLY
2
Entering edit mode

Just seeing Jim's solution based on findOverlaps() now. It actually does the trick and is a lot faster. If the original ranges in subject are named, the names get propagated which makes it easy to map back the output ranges to the input ranges:

fo <- findOverlaps(subject, query)
pintersect(subject[queryHits(fo)], query[subjectHits(fo)])
# GRanges object with 4 ranges and 2 metadata columns:
#     seqnames    ranges strand |     score       hit
#        <Rle> <IRanges>  <Rle> | <integer> <logical>
#   A     chrI     50-75      * |         1      TRUE
#   A     chrI     80-99      * |         1      TRUE
#   B     chrI   100-150      * |         2      TRUE
#   D     chrI   330-350      * |         4      TRUE
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

H.

ADD REPLY

Login before adding your answer.

Traffic: 685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6