Question

GRangesList and nearest neighbour search (name-specific)

0

Entering edit mode

Maurits • 0

@maurits-8365

Last seen 9.0 years ago

Australia

Dear all.

Within GenomicRanges, is there a possibility to determine the nearest neighbouring range from two GRangesList (GRangesList query and GRangesList subject) by limiting the search to ranges in query and subject that have the same name?

To be more concrete, I have

a GRangesList, where every (uniquely named) element corresponds to a transcript CDS, and
a GRangesList, with named elements corresponding to multiple transcript 3’UTR (per CDS).

I would like to select those elements from list (2.) that are closest to the CDS from list (1.), based on the name.

A minimal example is given here:

library(GenomicRanges);

cds<-GRangesList(
    "tx1"=GRanges(
        c("chr22","chr22","chr22"),
        IRanges(c(23974307,23973768,23971551),c(23974414,23973943,23971623)),
        c("-","-","-")),
    "tx2"=GRanges(
        c("chr22","chr22"),
        IRanges(c(23974021,23971402),c(23974532,23971532)),
        c("+","+")));

utr<-GRangesList(
    "tx1"=GRanges("chr22",IRanges(23974000,23974020),"+"),
    "tx1"=GRanges("chr22",IRanges(23971540,23971550),"-"),
    "tx2"=GRanges("chr22",IRanges(23974000,23974020),"+"));

idx<-nearest(unlist(range(utr)),unlist(range(cds)));

For example, according to nearest cds[idx[1]] is closest to utr[1], however, they have different names (“tx1” vs. “tx2”).

I could loop through all names in (1.) separately, then filter out the elements from (2.) that match each specific name from (1.), and then use nearest to pull out the relevant element. Unfortunately, given the size of (1.) and (2.) (around 30,000 elements), this takes a very long time.

I was hoping that there might be a smarter way to do this? Any help is greatly appreciated.

grangeslist • 1.9k views

ADD COMMENT • link 9.3 years ago • updated 9.0 years ago Maurits • 0

score 0 · Answer 1 · 2015-12-03

Hi,

Do you care about the fine-grained structure of your CDS and UTRs? If you don't, you can describe each CDS or UTR by a single range and store them in GRanges objects instead of GRangesList objects. That's going to help a lot:

library(GenomicRanges)
cds <- GRangesList(
    tx1=GRanges("chr22", IRanges(c(11, 25, 40), width=8), "-"),
    tx2=GRanges("chr22", IRanges(c(32, 45), width=9), "+")
)
utr <- GRangesList(
    tx1=GRanges("chr22", IRanges(20, 31),"+"),
    tx2=GRanges("chr22", IRanges(20, 31),"+"),
    tx1=GRanges("chr22", IRanges(6, 8),"-"),
    tx1=GRanges("chr22", IRanges(5, 10),"-"),
    tx2=GRanges("chr22", IRanges(15, 25),"+")
)
mcols(utr)$utr_id <- paste0("UTR", seq_along(utr))

Turn cds and utr into GRanges objects:

cds2 <- range(cds)
stopifnot(all(elementLengths(cds2) == 1L))
cds2 <- unlist(cds2)

utr2 <- range(utr)
stopifnot(all(elementLengths(utr2) == 1L))
utr2 <- unlist(utr2)
mcols(utr2) <- mcols(utr)

Then:

cds2
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   tx1    chr22  [11, 47]      -
#   tx2    chr22  [32, 53]      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

utr2
# GRanges object with 5 ranges and 1 metadata column:
#       seqnames    ranges strand |      utr_id
#          <Rle> <IRanges>  <Rle> | <character>
#   tx1    chr22  [20, 31]      + |        UTR1
#   tx2    chr22  [20, 31]      + |        UTR2
#   tx1    chr22  [ 6,  8]      - |        UTR3
#   tx1    chr22  [ 5, 10]      - |        UTR4
#   tx2    chr22  [15, 25]      + |        UTR5
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then group the UTRs in utr2 by transcript:

utr2 <- split(utr2, factor(names(utr2), levels=names(cds2)))

That gives you a GRangesList parallel to cds2 (i.e. utr2 and cds2 have the same length and names):

utr2
# GRangesList object of length 2:
# $tx1 
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |      utr_id
#          <Rle> <IRanges>  <Rle> | <character>
#   tx1    chr22  [20, 31]      + |        UTR1
#   tx1    chr22  [ 6,  8]      - |        UTR3
#   tx1    chr22  [ 5, 10]      - |        UTR4
#
# $tx2 
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames   ranges strand | utr_id
#   tx2    chr22 [20, 31]      + |   UTR2
#   tx2    chr22 [15, 25]      + |   UTR5
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

Compute the distance between the CDS and their corresponding UTRs:

d <- relist(distance(rep(cds2, elementLengths(utr2)),
                     unlist(utr2),
                     ignore.strand=FALSE),
            utr2)

d
# IntegerList of length 2
# [["tx1"]] <NA> 2 0
# [["tx2"]] 0 6

which.min(d)
# tx1 tx2
#   3   1

Extract the nearest UTRs:

offset <- c(0L, head(elementLengths(utr2), n=1L))
unlist(utr2, use.names=FALSE)[offset + which.min(d)]
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |      utr_id
#          <Rle> <IRanges>  <Rle> | <character>
#   tx1    chr22  [ 5, 10]      - |        UTR4
#   tx2    chr22  [20, 31]      + |        UTR2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finally put all the above code in a function:

## Return a GRanges object parallel to 'cds'.
getCdsNearestUtr <- function(cds, utr, ignore.strand=FALSE)
{
    stopifnot(is(cds, "GRangesList"), is(utr, "GRangesList"))

    cds2 <- range(cds)
    stopifnot(all(elementLengths(cds2) == 1L))
    cds2 <- unlist(cds2)
    utr2 <- range(utr)
    stopifnot(all(elementLengths(utr2) == 1L))
    utr2 <- unlist(utr2)
    mcols(utr2) <- mcols(utr)

    utr2 <- split(utr2, factor(names(utr2), levels=names(cds2)))
    d <- relist(distance(rep(cds2, elementLengths(utr2)),
                         unlist(utr2),
                         ignore.strand=ignore.strand),
                utr2)
    offset <- c(0L, head(elementLengths(utr2), n=1L))
    unlist(utr2, use.names=FALSE)[offset + which.min(d)]
}

Then:

getCdsNearestUtr(cds, utr)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |      utr_id
#          <Rle> <IRanges>  <Rle> | <character>
#   tx1    chr22  [ 5, 10]      - |        UTR4
#   tx2    chr22  [20, 31]      + |        UTR2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

getCdsNearestUtr(cds, utr, ignore.strand=TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |      utr_id
#          <Rle> <IRanges>  <Rle> | <character>
#   tx1    chr22  [20, 31]      + |        UTR1
#   tx2    chr22  [20, 31]      + |        UTR2
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

No loops. Should be fast!

Cheers,

H.

score 0 · Answer 2 · 2015-12-08

0

Entering edit mode

Maurits • 0

@maurits-8365

Last seen 9.0 years ago

Australia

Dear Herve.

Thanks for the elaborate answer. I really appreciate you taking the time.

However, my issue is that I do need to properly account the "fine-grained" structure of the features.
For example, one UTR may contain multiple GRange entries (e.g. due to a spliced exon), and multiple UTRs may be given per CDS with the same name.

Pooling UTR's using range() allows me to determine the UTR closest to a CDS with the same name, but information about the exon structure is lost. The latter is needed to e.g. determine overlaps with other features.

As mentioned, my approach for now is to loop through all unique CDS names, pull out UTR entries that match the name, and then do the distance calculations separately per CDS name. This takes about an hour for ~40,000 unique CDS names.

Would you happen to know of a loop-free & fast(er) version?

Best,

M

ADD COMMENT • link 9.3 years ago Maurits • 0

0

Entering edit mode

Hi,

I don't think the fine-grained structure of the CDSes or UTRs is relevant to determine the UTR closest to a CDS with the same name. The only situation where it would be relevant was if, for a given CDS and UTR in the same transcript, the individual ranges of the CDS were overlapping or interlaced with the individual ranges of the UTR. Not sure how this could happen but, even if it did, it would be easy to modify getCdsNearestUtr() to detect this situation. The problem is: what do you want to do in that case?

Also I'm not sure what you mean when you say that the information about the exon structure is "lost". It's just ignored for determining the UTR closest to a CDS with the same name. But it's still available in your original cds and utr objects. Once you've mapped your CDSes to their closest UTR with getCdsNearestUtr() you still have access to the fine-grained structures by looking back at the original objects. This is easy because the output of getCdsNearestUtr() is parallel to cds, and, if you've ids on top of your utr object (like I have in my example), then you can use them to map the output of getCdsNearestUtr() to the fine-grained UTRs in utr.

Hope this makes sense.

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Dear Herve.

I'm not sure I understand what you mean. The "fine-grained structure" of the UTR's and CDS's are very important, if for example you want to determine whether a particular SNP overlaps with a spliced UTR/exon.

And the information is "lost", because you collapse UTR entries per unique ID using range(...). So a SNP might overlap with the range of a UTR but not actually with the individual exons.

It's not clear to me what you mean by "the output of getCdsNearestUtr() is parallel to cds" and "you can use them to map the output of getCdsNearestUTR() to the fine-grained UTRs". Based on your example, collapsing the individual UTR entries for one ID seems irreversible to me.

Maybe we are missing each other's point, so let me rephrase my situation: Imagine two different UTRs with different lengths/distance to the CDS for the same CDS ID. Both UTRs consist of 2 exons. I would like to determine the UTR that is closest to the CDS, but keeping the information about the 2 exons intact, i.e. without collapsing them into one object.

Cheers,

Maurits

ADD REPLY • link 9.2 years ago Maurits • 0

0

Entering edit mode

Yes the "fine-grained structure" of the UTR's and CDS's can be very important, like in your SNP example. But not always. For determining the UTR closest to a CDS with the same name (i.e. in the same transcript), it can be ignored. See for example this situation where 3 UTRs have the same name as the CDS:

  CDS                                    UTRs

  xxxxxx-------xxx---xxxxxxx
                                        xxx----xxxxxxx  UTR1
                                      xxxxx--xxxx       UTR2
                                         xxxxxxxxxx     UTR3

If I ignore the fine-grained i.e. if I replace the CDS and each UTR by a single range spanning the entire feature:

  CDS                                    UTRs

  xxxxxxxxxxxxxxxxxxxxxxxxxx
                                        xxxxxxxxxxxxxx  UTR1
                                      xxxxxxxxxxx       UTR2
                                         xxxxxxxxxx     UTR3

In both cases the UTR that is closest to the CDS is UTR2. It's clear that the fine-grained structure of the individual features is irrelevant here. As I said previously, the only situation where the fine-grained structure would be relevant for solving the closest problem was if, for a given CDS and UTR in the same transcript, the individual ranges of the CDS were overlapping or interlaced with the individual ranges of the UTR. Do you expect this to happen?

In your original post you said that you could implement this with a loop but that you were worried about performance. Why don't you implement this loop anyway and compare what you get with it with what you get with getCdsNearestUtr()? Hopefully you'll get the same results. If not then I guess I don't understand what you mean by "determining the UTR closest to a CDS with the same name" and maybe you could clarify.

Cheers,

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Please see my extended example below.

ADD REPLY • link 9.0 years ago Maurits • 0

score 0 · Answer 3 · 2016-02-24

Dear Herve.

It took me a while to get back to this issue. I know it's been a while, but if you're still willing to give this another shot, it'd be greatly appreciated.

In a nutshell, I think we're not understanding each other properly. Have a look at below sample code:

library(GenomicRanges);

tmp.cds <- GRangesList(
    GRanges(seqnames = Rle(c("chrX", "chrX", "chrX", "chrX", "chrX", "chrX")),
            ranges = IRanges(
                start = c(154182678L, 154187770L, 154190054L, 154191688L, 154193408L, 154195930L),
                width = c(112L, 297L, 169L, 166L, 240L, 111L)),
            strand = Rle(c("+", "+", "+", "+", "+", "+"))),
    GRanges(seqnames = Rle(c("chr10")),
            ranges = IRanges(
                start = c(120354701L),
                width = c(483L)),
            strand = Rle(c("-"))));
names(tmp.cds) <- c("NM_000513", "NM_000982");

tmp.utr <- GRangesList(
    GRanges(seqnames = Rle(c("chrX")),
            ranges = IRanges(
                start = c(154182596L),
                width = c(82L)),
            strand = Rle(c("+"))),
    GRanges(seqnames = Rle(c("chrX")),
            ranges = IRanges(
                start = c(154219734L),
                width = c(82L)),
            strand = Rle(c("+"))),
    GRanges(seqnames = Rle(c("chrX")),
            ranges = IRanges(
                start = c(154257538L),
                width = c(82L)),
            strand = Rle(c("+"))),
    GRanges(seqnames = Rle(c("chr13", "chr13")),
            ranges = IRanges(
                start = c(27251555L, 27253765L),
                width = c(31L, 12L)),
            strand = Rle(c("+", "+"))),
    GRanges(seqnames = Rle(c("chr10", "chr10")),
            ranges = IRanges(
                start = c(120355164L, 120355184L),
                width = c(5L, 40L)),
            strand = Rle(c("-")))
    );
names(tmp.utr) <- c(rep("NM_000513", 3),
                    rep("NM_000982", 2));

# This is your proposed function
GetCDSNearestUTR <- function(cds, utr, ignore.strand = FALSE) {
    stopifnot(is(cds, "GRangesList"), is(utr, "GRangesList"));
    cds2 <- range(cds);
    stopifnot(all(elementLengths(cds2) == 1L));
    cds2 <- unlist(cds2);
    utr2 <- range(utr);
    stopifnot(all(elementLengths(utr2) == 1L));
    utr2 <- unlist(utr2);
    utr2 <- split(utr2, factor(names(utr2), levels = names(cds2)));
    d <- relist(distance(rep(cds2, elementLengths(utr2)),
                         unlist(utr2),
                         ignore.strand = ignore.strand),
                utr2);
    offset <- c(0L, head(elementLengths(utr2), n = 1L));
    unlist(utr2, use.names=FALSE)[offset + which.min(d)];
}

If I now do a GetCDSNearestUTR(tmp.cds, tmp.utr); I get the following output

GRanges object with 2 ranges and 0 metadata columns:
            seqnames                 ranges strand
               <Rle>              <IRanges>  <Rle>
  NM_000513     chrX [154182596, 154182677]      +
  NM_000982    chr10 [120355164, 120355223]      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The positive: The function picks the correct UTR.

The negative: For NM_000982, the exon-specific information on the spliced UTR is lost. The function returns the range of the genomic (UTR) feature. However, as you can see from tmp.utr the corresponding UTR contains two exons:

print(tmp.utr[which(names(tmp.utr) == "NM_000982")]);
GRangesList object of length 2:
$NM_000982
GRanges object with 2 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr13 [27251555, 27251585]      +
  [2]    chr13 [27253765, 27253776]      +

$NM_000982
GRanges object with 2 ranges and 0 metadata columns:
      seqnames                 ranges strand
  [1]    chr10 [120355164, 120355168]      -
  [2]    chr10 [120355184, 120355223]      -

-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths

It is the exon-specific information that I need to retain in the filtered (based on minimum distance to CDS) list of UTRs.

Best,

Maurits