Entering edit mode
I have two GRanges object with metadata. I want to find the rows in the query that match exactly with the subject (not a partial overlap), and I want to keep the mcol() in the result.
Here is an example:
library(plyranges)
gr1 <- data.frame(start = 5,
end = 10,
seqnames = "I",
strand = "+",
name1 = "myquery") |>
as_granges()
gr2 <- data.frame(start = c(1, 5),
end = c(8, 10),
seqnames = c("I", "I"),
strand = c("+", "+"),
name2 = paste0("mysubject", 1:2)) |>
as_granges()
gr1
#> GRanges object with 1 range and 1 metadata column:
#> seqnames ranges strand | name1
#> <Rle> <IRanges> <Rle> | <character>
#> [1] I 5-10 + | myquery
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr2
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | name2
#> <Rle> <IRanges> <Rle> | <character>
#> [1] I 1-8 + | mysubject1
#> [2] I 5-10 + | mysubject2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# this returns `mysubject1` which does not exactly match `myquery`
join_overlap_left(gr1, gr2)
#> GRanges object with 2 ranges and 2 metadata columns:
#> seqnames ranges strand | name1 name2
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] I 5-10 + | myquery mysubject1
#> [2] I 5-10 + | myquery mysubject2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# this correctly detects the match, but doesn't keep the metadata
intersect_ranges_directed(gr1, gr2)
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] I 5-10 +
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# this does what I want, but requires some work to build the result GRanges
gr2[
GenomicRanges::findOverlaps(gr1, gr2, type = "equal") |>
to()
]
#> GRanges object with 1 range and 1 metadata column:
#> seqnames ranges strand | name2
#> <Rle> <IRanges> <Rle> | <character>
#> [1] I 5-10 + | mysubject2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# and a pure dplyr way which I think is the best match currently
dplyr::left_join(as.data.frame(gr1),
as.data.frame(gr2),
by = c("seqnames", "start", "end", "width", "strand"))
#> seqnames start end width strand name1 name2
#> 1 I 5 10 6 + myquery mysubject2
Created on 2025-01-31 with [reprex v2.1.0](https://reprex.tidyverse.org)