Dear Valerie,
sorry for my long delay in reacting to your message below. This
addition
to VariantAnnotation by Michael and you is great. I just would like to
make a comment and report what looks to me as a bug related to the
computation of cDNA positions, just in case you have suggestions and
comments, specially with respect to what at the moment seems like a
bug
to me.
the way in which mapCoords() works is by finding all possible overlaps
of each GRanges entry, corresponding to a variant. This is strictly
not
necessary once you have gone through locateVariants() and/or
predictCoding() since these functions already find all possible
overlaps
of each variant to different transcripts and regions. So, to get the
corresponding annotation of cDNA positions I had to write the
following
function which is slightly more elaborated from what you suggest
below:
## this function assumes that locateVariants() or predictCoding()
## have been called so that there is a TXID metadata column
.cDNAloc <- function(gr, txdb) {
if (!"TXID" %in% colnames(mcols(gr)))
stop("cDNApos: metadata column 'TXID' not found in input GRanges
object.")
exonsbytx <- exonsBy(txdb, by="tx")
map <- mapCoords(gr, exonsbytx, eltHits=TRUE)
eolap <- map$eltHits
qolap <- map$queryHits
txids <- rep(names(exonsbytx), elementLengths(exonsbytx))[eolap]
res <- gr[qolap]
mcols(res) <- append(values(res),
DataFrame(cDNALOC=ranges(map),
TXID2=as.integer(txids)))
## restrict results to cDNA positions of query TXID
res <- res[res$TXID == res$TXID2]
start(res$cDNALOC)
}
i'm going to run it with a toy example of a variant that overlaps a
coding region of 7 transcripts in the positive strand and the three
UTR
region of one transcript in the negative strand.
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
width=c(1, 1)), strand=c("+", "-"))
gr$TXID <- c(72816L, 73335L)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
loc <- locateVariants(gr, txdb, AllVariants())
loc$cDNALOC <- .cDNAloc(loc, txdb)
now let's look to the cDNA position of the transcript in the negative
strand (the last entry of the resulting GRanges object):
GRanges with 1 range and 2 metadata columns:
seqnames ranges strand | TXID cDNALOC
<rle> <iranges> <rle> | <integer> <integer>
[1] chr21 [43522349, 43522349] - | 73335 2036
and the exonic structure of this transcript
exonsbytx <- exonsBy(txdb, by="tx")
exonsbytx[["73335"]]
GRanges with 2 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name
exon_rank
<rle> <iranges> <rle> | <integer> <character>
<integer>
[1] chr21 [43528406, 43528644] - | 260152 <na>
1
[2] chr21 [43522244, 43524145] - | 260151 <na>
2
from this structure one would expect that the reported cDNA position
would be:
43524145 - 43522349 + 1 = 1797
instead of 2036.
after investigating a bit where this value may come from, it seems
that
while the strand is being taken into account for the relative position
within the exon, the widths of upstream exons is wrongly selected (in
this case there are no upstream exons):
width(exonsbytx[["73335"]])
[1] 239 1902
239 + 1796 + 1 = 2036
so, my guess is that this must be something to be fixed in
mapCoords().
best regards!!
robert.
On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
> Hi,
>
> Following up on this thread:
>
>
https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>
> These changes are available in VariantAnnotation 1.11.15:
>
> 1) LOCSTART, LOCEND
>
> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
> These are LOCATION-centric coordinates and can be different for each
row
> so I thought these names were more descriptive than REFLOCS
(discussed
> in thread). We have 2 values (start/end) instead of a single column
of
> IRanges() because we can't make an IRanges() with missing values.
> Technically 'missing' ranges are represented by zero-width ranges
but we
> still need a position; there is no position because there was no
overlap.
>
> 2) mapCoords(), pmapCoords()
>
> These functions are courtesy of Michael. mapCoords() maps ranges
onto
> another set of coordinates. You can map to cds-centric, exon-centric
or
> any other type of coordinate. See ?mapCoords in both GenomicRanges
and
> GenomicAlignments.
>
>
> In the previous thread we discussed added cDNA locations to
> predictCoding(). I've decided against this because it adds the
> additional overhead of the exonsBy() extraction and a findOverlaps()
> call. Not all users want the cDNA locations and those that do can
now
> easily get them with mapCoords().
>
> ## The usual predictCoding setup:
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> fl <- system.file("extdata", "chr22.vcf.gz",
> package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> vcf <- renameSeqlevels(vcf, "chr22")
> coding <- predictCoding(vcf, txdb, Hsapiens)
>
> ## Exon-centric or cDNA locations:
> exonsbytx <- exonsBy(txdb, "tx")
> cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx)
> coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>
> Let me know if you run into problems or if the docs need more
detail.
>
> Valerie
>
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550