plyranges: find exact GRanges match keeping metadata columns
1
0
Entering edit mode
@alexisweinreb-22919
Last seen 17 hours ago
United States

I have two GRanges object with metadata. I want to find the rows in the query that match exactly with the subject (not a partial overlap), and I want to keep the mcol() in the result.

Here is an example:

library(plyranges)


gr1 <- data.frame(start = 5,
                  end = 10,
                  seqnames = "I",
                  strand = "+",
                  name1 = "myquery") |>
  as_granges()

gr2 <- data.frame(start = c(1, 5),
                  end = c(8, 10),
                  seqnames = c("I", "I"),
                  strand = c("+", "+"),
                  name2 = paste0("mysubject", 1:2)) |>
  as_granges()

gr1
#> GRanges object with 1 range and 1 metadata column:
#>       seqnames    ranges strand |       name1
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I      5-10      + |     myquery
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


gr2
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |       name2
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I       1-8      + |  mysubject1
#>   [2]        I      5-10      + |  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths



# this returns `mysubject1` which does not exactly match `myquery`

join_overlap_left(gr1, gr2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames    ranges strand |       name1       name2
#>          <Rle> <IRanges>  <Rle> | <character> <character>
#>   [1]        I      5-10      + |     myquery  mysubject1
#>   [2]        I      5-10      + |     myquery  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


# this correctly detects the match, but doesn't keep the metadata

intersect_ranges_directed(gr1, gr2)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]        I      5-10      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths



# this does what I want, but requires some work to build the result GRanges

gr2[
  GenomicRanges::findOverlaps(gr1, gr2, type = "equal") |>
    to()
]
#> GRanges object with 1 range and 1 metadata column:
#>       seqnames    ranges strand |       name2
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I      5-10      + |  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


# and a pure dplyr way which I think is the best match currently

dplyr::left_join(as.data.frame(gr1),
                 as.data.frame(gr2),
                 by = c("seqnames", "start", "end", "width", "strand"))
#>   seqnames start end width strand   name1      name2
#> 1        I     5  10     6      + myquery mysubject2

Created on 2025-01-31 with [reprex v2.1.0](https://reprex.tidyverse.org)

plyranges • 28 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 48 minutes ago
United States

You could use minoverlap but that requires an integer, not a fraction:

gr1 |> join_overlap_left(gr2, minoverlap=6)

Plyranges doesn't have the equal type, but it does have within:

gr1 |> join_overlap_left_within(gr2)

The within suffix, returns only ranges in x that are completely overlapped within in y.

ADD COMMENT

Login before adding your answer.

Traffic: 625 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6