Hi,
Do you care about the fine-grained structure of your CDS and UTRs? If you don't, you can describe each CDS or UTR by a single range and store them in GRanges objects instead of GRangesList objects. That's going to help a lot:
library(GenomicRanges)
cds <- GRangesList(
tx1=GRanges("chr22", IRanges(c(11, 25, 40), width=8), "-"),
tx2=GRanges("chr22", IRanges(c(32, 45), width=9), "+")
)
utr <- GRangesList(
tx1=GRanges("chr22", IRanges(20, 31),"+"),
tx2=GRanges("chr22", IRanges(20, 31),"+"),
tx1=GRanges("chr22", IRanges(6, 8),"-"),
tx1=GRanges("chr22", IRanges(5, 10),"-"),
tx2=GRanges("chr22", IRanges(15, 25),"+")
)
mcols(utr)$utr_id <- paste0("UTR", seq_along(utr))
Turn cds
and utr
into GRanges objects:
cds2 <- range(cds)
stopifnot(all(elementLengths(cds2) == 1L))
cds2 <- unlist(cds2)
utr2 <- range(utr)
stopifnot(all(elementLengths(utr2) == 1L))
utr2 <- unlist(utr2)
mcols(utr2) <- mcols(utr)
Then:
cds2
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# tx1 chr22 [11, 47] -
# tx2 chr22 [32, 53] +
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
utr2
# GRanges object with 5 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# <Rle> <IRanges> <Rle> | <character>
# tx1 chr22 [20, 31] + | UTR1
# tx2 chr22 [20, 31] + | UTR2
# tx1 chr22 [ 6, 8] - | UTR3
# tx1 chr22 [ 5, 10] - | UTR4
# tx2 chr22 [15, 25] + | UTR5
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Then group the UTRs in utr2
by transcript:
utr2 <- split(utr2, factor(names(utr2), levels=names(cds2)))
That gives you a GRangesList parallel to cds2
(i.e. utr2
and cds2
have the same length and names):
utr2
# GRangesList object of length 2:
# $tx1
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# <Rle> <IRanges> <Rle> | <character>
# tx1 chr22 [20, 31] + | UTR1
# tx1 chr22 [ 6, 8] - | UTR3
# tx1 chr22 [ 5, 10] - | UTR4
#
# $tx2
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# tx2 chr22 [20, 31] + | UTR2
# tx2 chr22 [15, 25] + | UTR5
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Compute the distance between the CDS and their corresponding UTRs:
d <- relist(distance(rep(cds2, elementLengths(utr2)),
unlist(utr2),
ignore.strand=FALSE),
utr2)
d
# IntegerList of length 2
# [["tx1"]] <NA> 2 0
# [["tx2"]] 0 6
which.min(d)
# tx1 tx2
# 3 1
Extract the nearest UTRs:
offset <- c(0L, head(elementLengths(utr2), n=1L))
unlist(utr2, use.names=FALSE)[offset + which.min(d)]
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# <Rle> <IRanges> <Rle> | <character>
# tx1 chr22 [ 5, 10] - | UTR4
# tx2 chr22 [20, 31] + | UTR2
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Finally put all the above code in a function:
## Return a GRanges object parallel to 'cds'.
getCdsNearestUtr <- function(cds, utr, ignore.strand=FALSE)
{
stopifnot(is(cds, "GRangesList"), is(utr, "GRangesList"))
cds2 <- range(cds)
stopifnot(all(elementLengths(cds2) == 1L))
cds2 <- unlist(cds2)
utr2 <- range(utr)
stopifnot(all(elementLengths(utr2) == 1L))
utr2 <- unlist(utr2)
mcols(utr2) <- mcols(utr)
utr2 <- split(utr2, factor(names(utr2), levels=names(cds2)))
d <- relist(distance(rep(cds2, elementLengths(utr2)),
unlist(utr2),
ignore.strand=ignore.strand),
utr2)
offset <- c(0L, head(elementLengths(utr2), n=1L))
unlist(utr2, use.names=FALSE)[offset + which.min(d)]
}
Then:
getCdsNearestUtr(cds, utr)
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# <Rle> <IRanges> <Rle> | <character>
# tx1 chr22 [ 5, 10] - | UTR4
# tx2 chr22 [20, 31] + | UTR2
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
getCdsNearestUtr(cds, utr, ignore.strand=TRUE)
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | utr_id
# <Rle> <IRanges> <Rle> | <character>
# tx1 chr22 [20, 31] + | UTR1
# tx2 chr22 [20, 31] + | UTR2
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
No loops. Should be fast!
Cheers,
H.
Hi,
I don't think the fine-grained structure of the CDSes or UTRs is relevant to determine the UTR closest to a CDS with the same name. The only situation where it would be relevant was if, for a given CDS and UTR in the same transcript, the individual ranges of the CDS were overlapping or interlaced with the individual ranges of the UTR. Not sure how this could happen but, even if it did, it would be easy to modify
getCdsNearestUtr()
to detect this situation. The problem is: what do you want to do in that case?Also I'm not sure what you mean when you say that the information about the exon structure is "lost". It's just ignored for determining the UTR closest to a CDS with the same name. But it's still available in your original
cds
andutr
objects. Once you've mapped your CDSes to their closest UTR withgetCdsNearestUtr()
you still have access to the fine-grained structures by looking back at the original objects. This is easy because the output ofgetCdsNearestUtr()
is parallel tocds
, and, if you've ids on top of yourutr
object (like I have in my example), then you can use them to map the output ofg
etCdsNearestUtr()
to the fine-grained UTRs inutr
.Hope this makes sense.
H.
Dear Herve.
I'm not sure I understand what you mean. The "fine-grained structure" of the UTR's and CDS's are very important, if for example you want to determine whether a particular SNP overlaps with a spliced UTR/exon.
And the information is "lost", because you collapse UTR entries per unique ID using
range(...)
. So a SNP might overlap with the range of a UTR but not actually with the individual exons.It's not clear to me what you mean by "the output of
getCdsNearestUtr()
is parallel to cds" and "you can use them to map the output ofgetCdsNearestUTR()
to the fine-grained UTRs". Based on your example, collapsing the individual UTR entries for one ID seems irreversible to me.Maybe we are missing each other's point, so let me rephrase my situation: Imagine two different UTRs with different lengths/distance to the CDS for the same CDS ID. Both UTRs consist of 2 exons. I would like to determine the UTR that is closest to the CDS, but keeping the information about the 2 exons intact, i.e. without collapsing them into one object.
Cheers,
Maurits
Yes the "fine-grained structure" of the UTR's and CDS's can be very important, like in your SNP example. But not always. For determining the UTR closest to a CDS with the same name (i.e. in the same transcript), it can be ignored. See for example this situation where 3 UTRs have the same name as the CDS:
If I ignore the fine-grained i.e. if I replace the CDS and each UTR by a single range spanning the entire feature:
In both cases the UTR that is closest to the CDS is UTR2. It's clear that the fine-grained structure of the individual features is irrelevant here. As I said previously, the only situation where the fine-grained structure would be relevant for solving the closest problem was if, for a given CDS and UTR in the same transcript, the individual ranges of the CDS were overlapping or interlaced with the individual ranges of the UTR. Do you expect this to happen?
In your original post you said that you could implement this with a loop but that you were worried about performance. Why don't you implement this loop anyway and compare what you get with it with what you get with
getCdsNearestUtr()
? Hopefully you'll get the same results. If not then I guess I don't understand what you mean by "determining the UTR closest to a CDS with the same name" and maybe you could clarify.Cheers,
H.
Please see my extended example below.