nearest without overlapping
2
0
Entering edit mode
endrebak85 ▴ 40
@endrebak85-10660
Last seen 5.3 years ago
github.com/endrebak/

How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?

genomicranges • 2.2k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 5 days ago
Seattle, WA, United States

Hi,

Don't remove anything from subject **before** you call distanceToNearest() or you will loose hits that you want to keep:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8"))
subject <- GRanges("chr1:7-10")

distanceToNearest(x, subject)
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   [2]         2           1 |         0
#   -------
#   queryLength: 2 / subjectLength: 1

overlaps <- subjectHits(findOverlaps(x, subject))
distanceToNearest(x, subject[-overlaps])
# Hits object with 0 hits and 1 metadata column:
#    queryHits subjectHits |  distance
#    <integer>   <integer> | <integer>
#   -------
#   queryLength: 2 / subjectLength: 0

Instead, remove hits **after** calling distanceToNearest():

hits <- distanceToNearest(x, subject)
code <- pcompare(x[queryHits(hits)], subject[subjectHits(hits)])
hits[abs(code) > 4]
# Hits object with 1 hit and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   -------
#   queryLength: 2 / subjectLength: 1

See ?IRanges::pcompare for the meaning of the overlapping codes.

Would be nice to have this as a flag in distanceToNearest().

H

ADD COMMENT
0
Entering edit mode

This will never find the nearest-but-not-overlapping hit for a query if there is any overlap in the subject, right? The same limitation as my solution, but more efficient. And there is poverlaps() for this. 

ADD REPLY
0
Entering edit mode

I see. Yes finding the nearest-but-not-overlapping ranges requires a little bit more work. One way to go is to take advantage of the fact that precede() and nearest() both exclude overlapping ranges:

### A version of distanceToNearest() that finds all the
### nearest-but-not-overlapping ranges in 'subject' for each range
### in 'x':
distanceToNearest2 <- function(x, subject)
{
    hits <- union(precede(x, subject, select="all"),
                  follow(x, subject, select="all"))
    mcols(hits)$distance <- distance(x[queryHits(hits)],
                                     subject[subjectHits(hits)])
    x2subject <- as(hits, "IntegerList")
    x2d <- relist(mcols(hits)$distance, x2subject)
    hits[unlist(min(x2d) == x2d)]
}

Then:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8", "chr1:12-13",
               "chr1:10-15", "chr1:16-17","chr1:20-19"))
subject <- GRanges(c("chr1:8-10", "chr1:15-16", "chr1:15-15"))
distanceToNearest2(x, subject)
# Hits object with 8 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         3
#   [2]         2           2 |         6
#   [3]         2           3 |         6
#   [4]         3           2 |         1
#   [5]         3           3 |         1
#   [6]         3           1 |         1
#   [7]         5           3 |         0
#   [8]         6           2 |         3
#   -------
#   queryLength: 6 / subjectLength: 3

H.

ADD REPLY
0
Entering edit mode

This part of the code

    x2subject <- as(hits, "IntegerList")
    x2d <- relist(mcols(hits)$distance, x2subject)
    hits[unlist(min(x2d) == x2d)]

was to me highly idiosyncratic (in retrospect I guess there was only one thing that I didn't recognize immediately, which was the coercion of hits to IntegerList, but that was enough to send me off on the follow...). I started with a 'base R' solution (using 'query' instead of 'x' for consistency with the argument names of distanceToNearest()). Here's the distance between each 'nearest' query  / subject pair

    distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)])

The challenge is to apply a function (which values are the minimum?) to the distances, grouped by query hit. If I didn't have any groups, I could use fun1() to return a logical vector indicating the values that are minima 

    fun1 <- function(x)
        min(x) == x

To apply this function to grouped values, a base R solution is to use the function ave()

    keep <- as.logical(ave(distance, queryHits(hits), FUN = fun1)

and then to clean up results

    ## clean-up
    mcols(hits)$distance <- distance
    hits[keep]

distanceToNearest() doesn't actually return ties, so I modified the function used in ave() to

    fun2 <- function(x)
        min(x) == x & !duplicated(x)

I think the performance of ave() wouldn't be too bad for a typical script, and the overall code seems to me easier to understand (ave() is probably relatively specialized knowledge, but the problem of operating on grouped values in base R comes up quite frequently on StackOverflow, so once one recognizes the problem being faced -- apply a vectorized function to a vector split into groups -- a solution isn't too far away).

One bag of tricks I do know about S4Vectors is that a list-of-integers can be efficiently created, manipulated 'group-wise', and turned into a vector. So I replace the use of ave() with

    lst <- splitAsList(distance, queryHits(hits)
    keep <- unsplit(fun1(lst), queryHits(hits))

S4Vectors applies fun1() to each element of lst, i.e., group-wise. Maybe this isn't that much less idiosyncratic than Herve's -- my first line is Herve's lines 1 & 2, and the second line is the subset in Herve's line 3. But knowing the splitAsList() / unlist(fun()) trick applies to many different situations, whereas as(hits, "IntegerList") is very specific. So my revision is

    distanceToNearest3 <- function(query, subject, fun = fun2) {
        hits <- union(
            precede(query, subject, select="all"),
            follow(query, subject, select="all")
        )
        ## minimum distance by query
        distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)])
        lst <- splitAsList(distance, queryHits(hits))
        keep <- unsplit(fun(lst), queryHits(hits))
        ## clean-up
        mcols(hits)$distance <- distance
        hits[keep]
    }
ADD REPLY
0
Entering edit mode

Note that the relist/unlist and split/unsplit combos are elegant, powerful, and reliable idioms, but the split/unlist combo is a risky one. It only works if the splitting factor is sorted which is indeed the case here because the class of Hits object hits is actually SortedByQueryHits.

And because hits is a SortedByQueryHits object, relist/unlist can be used instead of split/unsplit or split/unlist. The cost of relist(distance, CompressedIntegerList) is virtually zero whereas splitAsList(distance, f) generally has to pay the cost of sorting the split factor or at least checking that the split factor is already sorted.

Another advantage of relist(distance, skeleton) is that it is guaranteed to return a List derivative that is parallel to skeleton when skeleton is itself a List derivative. splitAsList(distance, f) OTOH returns a List derivative that is parallel to sort(unique(f)) when f is an atomic vector. This means in particular that queries not represented in queryHits(hits) are not represented in splitAsList(distance, queryHits(hits)) either, which could be problematic in some contexts.

H.

ADD REPLY
0
Entering edit mode

Thanks for reminding me (again) of the split / unlist problems; I updated my comment to use split / unsplit.

ADD REPLY
0
Entering edit mode

Wouldn't the code be faster with

        precede(query, subject, select="first"),
        follow(query, subject, select="last")

?

The only problem is that the code errors:

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function 'from' for signature '"integer"'
Calls: distanceToNearest3 -> distance -> [ -> queryHits -> from -> <Anonymous>

I am comparing genomicranges with pyranges for a preprint I am writing and I am trying to optimize the competing libraries as much as possible.

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

You could filter out the overlapping ranges before calling distanceToNearest():

> query <- GRanges("chr1", IRanges(5, width=6))
> subject <- GRanges("chr1", IRanges(start=c(15, 10, 20), width=5))
> overlaps <- subjectHits(findOverlaps(query, subject))
> distanceToNearest(query, subject[-overlaps])
Hits object with 1 hit and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |         4
  -------
  queryLength: 1 / subjectLength: 2

Valerie

ADD COMMENT
0
Entering edit mode

The above will remove all subject ranges with any overlap in the query, so there may be cases where a nearest match is not found, since the subject range overlapped a different query range. Instead, use set arithmetic on the hits objects, like

setdiff(distanceToNearest(query, subject), findOverlaps(query, subject))
ADD REPLY
0
Entering edit mode

Doesn't that leave some subjects (or is it queries?) without nearest queries (or is it subjects?)?

ADD REPLY
0
Entering edit mode

Yes, that's right, if a query overlaps a subject range, no hits for it are returned. Depending on exactly what is needed here, there may not be a good solution. The nearest algorithm, as I recall, is greedy and based on positions, so it would need to be rewritten to support this mode. Perhaps the asker could describe their concrete use case. The bedtools closest -io command is documented to behave like Val's solution, where subject ranges with any overlap in the query are ignored.

ADD REPLY
0
Entering edit mode

To remove overlaps, it's also possible to do this:

distanceToNearest(query, subsetByOverlaps(subject, query, invert=TRUE))

ADD REPLY

Login before adding your answer.

Traffic: 490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6