Question

plyranges: find exact GRanges match keeping metadata columns

0

Entering edit mode

alexis.weinreb • 0

@alexisweinreb-22919

Last seen 17 hours ago

United States

I have two GRanges object with metadata. I want to find the rows in the query that match exactly with the subject (not a partial overlap), and I want to keep the mcol() in the result.

Here is an example:

library(plyranges)


gr1 <- data.frame(start = 5,
                  end = 10,
                  seqnames = "I",
                  strand = "+",
                  name1 = "myquery") |>
  as_granges()

gr2 <- data.frame(start = c(1, 5),
                  end = c(8, 10),
                  seqnames = c("I", "I"),
                  strand = c("+", "+"),
                  name2 = paste0("mysubject", 1:2)) |>
  as_granges()

gr1
#> GRanges object with 1 range and 1 metadata column:
#>       seqnames    ranges strand |       name1
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I      5-10      + |     myquery
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


gr2
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |       name2
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I       1-8      + |  mysubject1
#>   [2]        I      5-10      + |  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths



# this returns `mysubject1` which does not exactly match `myquery`

join_overlap_left(gr1, gr2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames    ranges strand |       name1       name2
#>          <Rle> <IRanges>  <Rle> | <character> <character>
#>   [1]        I      5-10      + |     myquery  mysubject1
#>   [2]        I      5-10      + |     myquery  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


# this correctly detects the match, but doesn't keep the metadata

intersect_ranges_directed(gr1, gr2)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]        I      5-10      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths



# this does what I want, but requires some work to build the result GRanges

gr2[
  GenomicRanges::findOverlaps(gr1, gr2, type = "equal") |>
    to()
]
#> GRanges object with 1 range and 1 metadata column:
#>       seqnames    ranges strand |       name2
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        I      5-10      + |  mysubject2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths


# and a pure dplyr way which I think is the best match currently

dplyr::left_join(as.data.frame(gr1),
                 as.data.frame(gr2),
                 by = c("seqnames", "start", "end", "width", "strand"))
#>   seqnames start end width strand   name1      name2
#> 1        I     5  10     6      + myquery mysubject2

^{Created on 2025-01-31 with [reprex v2.1.0](https://reprex.tidyverse.org)}

plyranges • 28 views

ADD COMMENT • link updated 48 minutes ago by Michael Love 43k • written 1 day ago by alexis.weinreb • 0

score 0 · Answer 1 · 2025-02-02

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 48 minutes ago

United States

You could use minoverlap but that requires an integer, not a fraction:

gr1 |> join_overlap_left(gr2, minoverlap=6)

Plyranges doesn't have the equal type, but it does have within:

gr1 |> join_overlap_left_within(gr2)

The within suffix, returns only ranges in x that are completely overlapped within in y.

ADD COMMENT • link 2 hours ago Michael Love 43k