Question

Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.7 years ago

United States

Hi, Following up on this thread: https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html These changes are available in VariantAnnotation 1.11.15: 1) LOCSTART, LOCEND locateVariants() has 2 new output columns, LOCSTART and LOCEND. These are LOCATION-centric coordinates and can be different for each row so I thought these names were more descriptive than REFLOCS (discussed in thread). We have 2 values (start/end) instead of a single column of IRanges() because we can't make an IRanges() with missing values. Technically 'missing' ranges are represented by zero-width ranges but we still need a position; there is no position because there was no overlap. 2) mapCoords(), pmapCoords() These functions are courtesy of Michael. mapCoords() maps ranges onto another set of coordinates. You can map to cds-centric, exon-centric or any other type of coordinate. See ?mapCoords in both GenomicRanges and GenomicAlignments. In the previous thread we discussed added cDNA locations to predictCoding(). I've decided against this because it adds the additional overhead of the exonsBy() extraction and a findOverlaps() call. Not all users want the cDNA locations and those that do can now easily get them with mapCoords(). ## The usual predictCoding setup: library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <- predictCoding(vcf, txdb, Hsapiens) ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] Let me know if you run into problems or if the docs need more detail. Valerie

VariantAnnotation GenomicRanges VariantAnnotation VariantAnnotation GenomicRanges • 1.3k views

ADD COMMENT • link updated 10.2 years ago by Robert Castelo ★ 3.4k • written 10.2 years ago by Valerie Obenchain ★ 6.8k

score 0 · Answer 1 · 2014-07-24

0

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 1 day ago

Barcelona/Universitat Pompeu Fabra

Dear Valerie, sorry for my long delay in reacting to your message below. This addition to VariantAnnotation by Michael and you is great. I just would like to make a comment and report what looks to me as a bug related to the computation of cDNA positions, just in case you have suggestions and comments, specially with respect to what at the moment seems like a bug to me. the way in which mapCoords() works is by finding all possible overlaps of each GRanges entry, corresponding to a variant. This is strictly not necessary once you have gone through locateVariants() and/or predictCoding() since these functions already find all possible overlaps of each variant to different transcripts and regions. So, to get the corresponding annotation of cDNA positions I had to write the following function which is slightly more elaborated from what you suggest below: ## this function assumes that locateVariants() or predictCoding() ## have been called so that there is a TXID metadata column .cDNAloc <- function(gr, txdb) { if (!"TXID" %in% colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not found in input GRanges object.") exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx, eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <- rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <- gr[qolap] mcols(res) <- append(values(res), DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ## restrict results to cDNA positions of query TXID res <- res[res$TXID == res$TXID2] start(res$cDNALOC) } i'm going to run it with a toy example of a variant that overlaps a coding region of 7 transcripts in the positive strand and the three UTR region of one transcript in the negative strand. library(VariantAnnotation) library(TxDb.Hsapiens.UCSC.hg19.knownGene) gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr, txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb) now let's look to the cDNA position of the transcript in the negative strand (the last entry of the resulting GRanges object): GRanges with 1 range and 2 metadata columns: seqnames ranges strand | TXID cDNALOC <rle> <iranges> <rle> | <integer> <integer> [1] chr21 [43522349, 43522349] - | 73335 2036 and the exonic structure of this transcript exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges with 2 ranges and 3 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank <rle> <iranges> <rle> | <integer> <character> <integer> [1] chr21 [43528406, 43528644] - | 260152 <na> 1 [2] chr21 [43522244, 43524145] - | 260151 <na> 2 from this structure one would expect that the reported cDNA position would be: 43524145 - 43522349 + 1 = 1797 instead of 2036. after investigating a bit where this value may come from, it seems that while the strand is being taken into account for the relative position within the exon, the widths of upstream exons is wrongly selected (in this case there are no upstream exons): width(exonsbytx[["73335"]]) [1] 239 1902 239 + 1796 + 1 = 2036 so, my guess is that this must be something to be fixed in mapCoords(). best regards!! robert. On 07/13/2014 07:26 AM, Valerie Obenchain wrote: > Hi, > > Following up on this thread: > > https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html > > These changes are available in VariantAnnotation 1.11.15: > > 1) LOCSTART, LOCEND > > locateVariants() has 2 new output columns, LOCSTART and LOCEND. > These are LOCATION-centric coordinates and can be different for each row > so I thought these names were more descriptive than REFLOCS (discussed > in thread). We have 2 values (start/end) instead of a single column of > IRanges() because we can't make an IRanges() with missing values. > Technically 'missing' ranges are represented by zero-width ranges but we > still need a position; there is no position because there was no overlap. > > 2) mapCoords(), pmapCoords() > > These functions are courtesy of Michael. mapCoords() maps ranges onto > another set of coordinates. You can map to cds-centric, exon-centric or > any other type of coordinate. See ?mapCoords in both GenomicRanges and > GenomicAlignments. > > > In the previous thread we discussed added cDNA locations to > predictCoding(). I've decided against this because it adds the > additional overhead of the exonsBy() extraction and a findOverlaps() > call. Not all users want the cDNA locations and those that do can now > easily get them with mapCoords(). > > ## The usual predictCoding setup: > library(BSgenome.Hsapiens.UCSC.hg19) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > fl <- system.file("extdata", "chr22.vcf.gz", > package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > vcf <- renameSeqlevels(vcf, "chr22") > coding <- predictCoding(vcf, txdb, Hsapiens) > > ## Exon-centric or cDNA locations: > exonsbytx <- exonsBy(txdb, "tx") > cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx) > coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] > > Let me know if you run into problems or if the docs need more detail. > > Valerie > -- Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550

ADD COMMENT • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert, Now fixed in GenomicRanges 1.17.28. I introduced this bug when trying to be too clever with ignore.strand. Thanks for catching it and, as always, providing a clear reproducible example. Valerie On 07/24/2014 08:24 AM, Robert Castelo wrote: > Dear Valerie, > > sorry for my long delay in reacting to your message below. This addition > to VariantAnnotation by Michael and you is great. I just would like to > make a comment and report what looks to me as a bug related to the > computation of cDNA positions, just in case you have suggestions and > comments, specially with respect to what at the moment seems like a bug > to me. > > the way in which mapCoords() works is by finding all possible overlaps > of each GRanges entry, corresponding to a variant. This is strictly not > necessary once you have gone through locateVariants() and/or > predictCoding() since these functions already find all possible overlaps > of each variant to different transcripts and regions. So, to get the > corresponding annotation of cDNA positions I had to write the following > function which is slightly more elaborated from what you suggest below: > > ## this function assumes that locateVariants() or predictCoding() > ## have been called so that there is a TXID metadata column > .cDNAloc <- function(gr, txdb) { > if (!"TXID" %in% colnames(mcols(gr))) > stop("cDNApos: metadata column 'TXID' not found in input GRanges > object.") > > exonsbytx <- exonsBy(txdb, by="tx") > map <- mapCoords(gr, exonsbytx, eltHits=TRUE) > eolap <- map$eltHits > qolap <- map$queryHits > txids <- rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] > res <- gr[qolap] > mcols(res) <- append(values(res), > DataFrame(cDNALOC=ranges(map), > TXID2=as.integer(txids))) > ## restrict results to cDNA positions of query TXID > res <- res[res$TXID == res$TXID2] > > start(res$cDNALOC) > } > > i'm going to run it with a toy example of a variant that overlaps a > coding region of 7 transcripts in the positive strand and the three UTR > region of one transcript in the negative strand. > > library(VariantAnnotation) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), > width=c(1, 1)), strand=c("+", "-")) > gr$TXID <- c(72816L, 73335L) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > loc <- locateVariants(gr, txdb, AllVariants()) > loc$cDNALOC <- .cDNAloc(loc, txdb) > > now let's look to the cDNA position of the transcript in the negative > strand (the last entry of the resulting GRanges object): > > GRanges with 1 range and 2 metadata columns: > seqnames ranges strand | TXID cDNALOC > <rle> <iranges> <rle> | <integer> <integer> > [1] chr21 [43522349, 43522349] - | 73335 2036 > > and the exonic structure of this transcript > > exonsbytx <- exonsBy(txdb, by="tx") > exonsbytx[["73335"]] > GRanges with 2 ranges and 3 metadata columns: > seqnames ranges strand | exon_id exon_name > exon_rank > <rle> <iranges> <rle> | <integer> <character> > <integer> > [1] chr21 [43528406, 43528644] - | 260152 <na> 1 > [2] chr21 [43522244, 43524145] - | 260151 <na> 2 > > from this structure one would expect that the reported cDNA position > would be: > > 43524145 - 43522349 + 1 = 1797 > > instead of 2036. > > after investigating a bit where this value may come from, it seems that > while the strand is being taken into account for the relative position > within the exon, the widths of upstream exons is wrongly selected (in > this case there are no upstream exons): > > width(exonsbytx[["73335"]]) > [1] 239 1902 > > 239 + 1796 + 1 = 2036 > > > so, my guess is that this must be something to be fixed in mapCoords(). > > > best regards!! > > robert. > > > On 07/13/2014 07:26 AM, Valerie Obenchain wrote: >> Hi, >> >> Following up on this thread: >> >> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html >> >> These changes are available in VariantAnnotation 1.11.15: >> >> 1) LOCSTART, LOCEND >> >> locateVariants() has 2 new output columns, LOCSTART and LOCEND. >> These are LOCATION-centric coordinates and can be different for each row >> so I thought these names were more descriptive than REFLOCS (discussed >> in thread). We have 2 values (start/end) instead of a single column of >> IRanges() because we can't make an IRanges() with missing values. >> Technically 'missing' ranges are represented by zero-width ranges but we >> still need a position; there is no position because there was no overlap. >> >> 2) mapCoords(), pmapCoords() >> >> These functions are courtesy of Michael. mapCoords() maps ranges onto >> another set of coordinates. You can map to cds-centric, exon- centric or >> any other type of coordinate. See ?mapCoords in both GenomicRanges and >> GenomicAlignments. >> >> >> In the previous thread we discussed added cDNA locations to >> predictCoding(). I've decided against this because it adds the >> additional overhead of the exonsBy() extraction and a findOverlaps() >> call. Not all users want the cDNA locations and those that do can now >> easily get them with mapCoords(). >> >> ## The usual predictCoding setup: >> library(BSgenome.Hsapiens.UCSC.hg19) >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> fl <- system.file("extdata", "chr22.vcf.gz", >> package="VariantAnnotation") >> vcf <- readVcf(fl, "hg19") >> vcf <- renameSeqlevels(vcf, "chr22") >> coding <- predictCoding(vcf, txdb, Hsapiens) >> >> ## Exon-centric or cDNA locations: >> exonsbytx <- exonsBy(txdb, "tx") >> cDNA <- mapCoords(coding[!duplicated(ranges(coding))], exonsbytx) >> coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] >> >> Let me know if you run into problems or if the docs need more detail. >> >> Valerie >> >

ADD REPLY • link 10.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

hi Valerie, i'm still not getting the right answer, i've looked at the unit test you added and i'm afraid it does not recreate the situation i described below, which involved one transcript with two exons, while the unit test forms two transcripts of one exon. so as it is now, doing: mapCoords(gr, grl, ignore.strand=TRUE) > map GRanges with 2 ranges and 2 metadata columns: seqnames ranges strand | queryHits subjectHits <rle> <iranges> <rle> | <integer> <integer> [1] chrA [1797, 1797] - | 1 2 [2] chrA [1797, 1797] - | 2 2 you get the right answer, but doing: mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE) > map GRanges with 2 ranges and 2 metadata columns: seqnames ranges strand | queryHits subjectHits <rle> <iranges> <rle> | <integer> <integer> [1] chrA [2036, 2036] - | 1 1 [2] chrA [2036, 2036] - | 2 1 you get the previous wrong answer. you'll find my sessionInfo() below. cheers, robert. ps: sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13 AnnotationDbi_1.27.9 [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31 [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28 [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2 [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3 [16] colorout_1.0-2 loaded via a namespace (and not attached): [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9 biomaRt_2.21.1 [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8 checkmate_1.2 [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 [13] foreach_1.4.2 GenomicAlignments_1.1.22 iterators_1.0.7 Rcpp_0.11.2 [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13 sendmailR_1.1-2 [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 XML_3.98-1.1 [25] zlibbioc_1.11.1 On 7/28/14 1:25 PM, Valerie Obenchain wrote: > Hi Robert, > > Now fixed in GenomicRanges 1.17.28. > > I introduced this bug when trying to be too clever with > ignore.strand. Thanks for catching it and, as always, providing a > clear reproducible example. > > Valerie > > On 07/24/2014 08:24 AM, Robert Castelo wrote: >> Dear Valerie, >> >> sorry for my long delay in reacting to your message below. This >> addition to VariantAnnotation by Michael and you is great. I just >> would like to make a comment and report what looks to me as a bug >> related to the computation of cDNA positions, just in case you have >> suggestions and comments, specially with respect to what at the >> moment seems like a bug to me. >> >> the way in which mapCoords() works is by finding all possible >> overlaps of each GRanges entry, corresponding to a variant. This is >> strictly not necessary once you have gone through locateVariants() >> and/or predictCoding() since these functions already find all >> possible overlaps of each variant to different transcripts and >> regions. So, to get the corresponding annotation of cDNA positions >> I had to write the following function which is slightly more >> elaborated from what you suggest below: >> >> ## this function assumes that locateVariants() or predictCoding() >> ## have been called so that there is a TXID metadata column >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in% >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not >> found in input GRanges object.") >> >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx, >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <- >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <- >> gr[qolap] mcols(res) <- append(values(res), >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ## >> restrict results to cDNA positions of query TXID res <- >> res[res$TXID == res$TXID2] >> >> start(res$cDNALOC) } >> >> i'm going to run it with a toy example of a variant that overlaps >> a coding region of 7 transcripts in the positive strand and the >> three UTR region of one transcript in the negative strand. >> >> library(VariantAnnotation) >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L) >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr, >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb) >> >> now let's look to the cDNA position of the transcript in the >> negative strand (the last entry of the resulting GRanges object): >> >> GRanges with 1 range and 2 metadata columns: seqnames >> ranges strand | TXID cDNALOC <rle> <iranges> >> <rle> | <integer> <integer> [1] chr21 [43522349, 43522349] >> - | 73335 2036 >> >> and the exonic structure of this transcript >> >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges >> with 2 ranges and 3 metadata columns: seqnames ranges >> strand | exon_id exon_name exon_rank <rle> <iranges> >> <rle> | <integer> <character> <integer> [1] chr21 [43528406, >> 43528644] - | 260152 <na> 1 [2] chr21 >> [43522244, 43524145] - | 260151 <na> 2 >> >> from this structure one would expect that the reported cDNA >> position would be: >> >> 43524145 - 43522349 + 1 = 1797 >> >> instead of 2036. >> >> after investigating a bit where this value may come from, it seems >> that while the strand is being taken into account for the relative >> position within the exon, the widths of upstream exons is wrongly >> selected (in this case there are no upstream exons): >> >> width(exonsbytx[["73335"]]) [1] 239 1902 >> >> 239 + 1796 + 1 = 2036 >> >> >> so, my guess is that this must be something to be fixed in >> mapCoords(). >> >> >> best regards!! >> >> robert. >> >> >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote: >>> Hi, >>> >>> Following up on this thread: >>> >>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html >>> >>> >>> These changes are available in VariantAnnotation 1.11.15: >>> >>> 1) LOCSTART, LOCEND >>> >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND. >>> These are LOCATION-centric coordinates and can be different for >>> each row so I thought these names were more descriptive than >>> REFLOCS (discussed in thread). We have 2 values (start/end) >>> instead of a single column of IRanges() because we can't make an >>> IRanges() with missing values. Technically 'missing' ranges are >>> represented by zero-width ranges but we still need a position; >>> there is no position because there was no overlap. >>> >>> 2) mapCoords(), pmapCoords() >>> >>> These functions are courtesy of Michael. mapCoords() maps ranges >>> onto another set of coordinates. You can map to cds-centric, >>> exon-centric or any other type of coordinate. See ?mapCoords in >>> both GenomicRanges and GenomicAlignments. >>> >>> >>> In the previous thread we discussed added cDNA locations to >>> predictCoding(). I've decided against this because it adds the >>> additional overhead of the exonsBy() extraction and a >>> findOverlaps() call. Not all users want the cDNA locations and >>> those that do can now easily get them with mapCoords(). >>> >>> ## The usual predictCoding setup: >>> library(BSgenome.Hsapiens.UCSC.hg19) >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata", >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <- >>> predictCoding(vcf, txdb, Hsapiens) >>> >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))], >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] >>> >>> Let me know if you run into problems or if the docs need more >>> detail. >>> >>> Valerie >>> >> > [[alternative HTML version deleted]]

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

OK. We'll try it again. See 1.17.29. It looks like I did not introduce this with ignore.strand. The bug was deeper and could be handled in either .orderElementsByTranscription() or the pair of .listCumsum* functions. I chose to add an arg to .orderElementsByTranscription() that controls whether negative strand ranges are ordered large-to-small (3' to 5') or small-to-large (5' to 3'). Valerie On 07/28/2014 01:23 PM, Robert Castelo wrote: > hi Valerie, > > i'm still not getting the right answer, i've looked at the unit test you > added and i'm afraid it does not recreate the situation i described > below, which involved one transcript with two exons, while the unit test > forms two transcripts of one exon. so as it is now, doing: > > mapCoords(gr, grl, ignore.strand=TRUE) > > map > GRanges with 2 ranges and 2 metadata columns: > seqnames ranges strand | queryHits subjectHits > <rle> <iranges> <rle> | <integer> <integer> > [1] chrA [1797, 1797] - | 1 2 > [2] chrA [1797, 1797] - | 2 2 > > you get the right answer, but doing: > > mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE) > > map > GRanges with 2 ranges and 2 metadata columns: > seqnames ranges strand | queryHits subjectHits > <rle> <iranges> <rle> | <integer> <integer> > [1] chrA [2036, 2036] - | 1 1 > [2] chrA [2036, 2036] - | 2 1 > > you get the previous wrong answer. you'll find my sessionInfo() below. > > cheers, > robert. > ps: sessionInfo() > R version 3.1.1 (2014-07-10) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13 > AnnotationDbi_1.27.9 > [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31 > [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28 > [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2 > [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3 > [16] colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9 > biomaRt_2.21.1 > [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8 > checkmate_1.2 > [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 > [13] foreach_1.4.2 GenomicAlignments_1.1.22 > iterators_1.0.7 Rcpp_0.11.2 > [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13 > sendmailR_1.1-2 > [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 > XML_3.98-1.1 > [25] zlibbioc_1.11.1 > > > On 7/28/14 1:25 PM, Valerie Obenchain wrote: >> Hi Robert, > > > > Now fixed in GenomicRanges 1.17.28. > > > > I introduced this bug when trying to be too clever with > > ignore.strand. Thanks for catching it and, as always, providing a > > clear reproducible example. > > > > Valerie > > > > On 07/24/2014 08:24 AM, Robert Castelo wrote: > >> Dear Valerie, > >> > >> sorry for my long delay in reacting to your message below. This > >> addition to VariantAnnotation by Michael and you is great. I just > >> would like to make a comment and report what looks to me as a bug > >> related to the computation of cDNA positions, just in case you have > >> suggestions and comments, specially with respect to what at the > >> moment seems like a bug to me. > >> > >> the way in which mapCoords() works is by finding all possible > >> overlaps of each GRanges entry, corresponding to a variant. This is > >> strictly not necessary once you have gone through locateVariants() > >> and/or predictCoding() since these functions already find all > >> possible overlaps of each variant to different transcripts and > >> regions. So, to get the corresponding annotation of cDNA positions > >> I had to write the following function which is slightly more > >> elaborated from what you suggest below: > >> > >> ## this function assumes that locateVariants() or predictCoding() > >> ## have been called so that there is a TXID metadata column > >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in% > >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not > >> found in input GRanges object.") > >> > >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx, > >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <- > >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <- > >> gr[qolap] mcols(res) <- append(values(res), > >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ## > >> restrict results to cDNA positions of query TXID res <- > >> res[res$TXID == res$TXID2] > >> > >> start(res$cDNALOC) } > >> > >> i'm going to run it with a toy example of a variant that overlaps > >> a coding region of 7 transcripts in the positive strand and the > >> three UTR region of one transcript in the negative strand. > >> > >> library(VariantAnnotation) > >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >> > >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), > >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L) > >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr, > >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb) > >> > >> now let's look to the cDNA position of the transcript in the > >> negative strand (the last entry of the resulting GRanges object): > >> > >> GRanges with 1 range and 2 metadata columns: seqnames > >> ranges strand | TXID cDNALOC <rle> <iranges> > >> <rle> | <integer> <integer> [1] chr21 [43522349, 43522349] > >> - | 73335 2036 > >> > >> and the exonic structure of this transcript > >> > >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges > >> with 2 ranges and 3 metadata columns: seqnames ranges > >> strand | exon_id exon_name exon_rank <rle> <iranges> > >> <rle> | <integer> <character> <integer> [1] chr21 [43528406, > >> 43528644] - | 260152 <na> 1 [2] chr21 > >> [43522244, 43524145] - | 260151 <na> 2 > >> > >> from this structure one would expect that the reported cDNA > >> position would be: > >> > >> 43524145 - 43522349 + 1 = 1797 > >> > >> instead of 2036. > >> > >> after investigating a bit where this value may come from, it seems > >> that while the strand is being taken into account for the relative > >> position within the exon, the widths of upstream exons is wrongly > >> selected (in this case there are no upstream exons): > >> > >> width(exonsbytx[["73335"]]) [1] 239 1902 > >> > >> 239 + 1796 + 1 = 2036 > >> > >> > >> so, my guess is that this must be something to be fixed in > >> mapCoords(). > >> > >> > >> best regards!! > >> > >> robert. > >> > >> > >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote: > >>> Hi, > >>> > >>> Following up on this thread: > >>> > >>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html > >>> > >>> > >>> > These changes are available in VariantAnnotation 1.11.15: >>>> > >>> 1) LOCSTART, LOCEND > >>> > >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND. > >>> These are LOCATION-centric coordinates and can be different for > >>> each row so I thought these names were more descriptive than > >>> REFLOCS (discussed in thread). We have 2 values (start/end) > >>> instead of a single column of IRanges() because we can't make an > >>> IRanges() with missing values. Technically 'missing' ranges are > >>> represented by zero-width ranges but we still need a position; > >>> there is no position because there was no overlap. > >>> > >>> 2) mapCoords(), pmapCoords() > >>> > >>> These functions are courtesy of Michael. mapCoords() maps ranges > >>> onto another set of coordinates. You can map to cds-centric, > >>> exon-centric or any other type of coordinate. See ?mapCoords in > >>> both GenomicRanges and GenomicAlignments. > >>> > >>> > >>> In the previous thread we discussed added cDNA locations to > >>> predictCoding(). I've decided against this because it adds the > >>> additional overhead of the exonsBy() extraction and a > >>> findOverlaps() call. Not all users want the cDNA locations and > >>> those that do can now easily get them with mapCoords(). > >>> > >>> ## The usual predictCoding setup: > >>> library(BSgenome.Hsapiens.UCSC.hg19) > >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- > >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata", > >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, > >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <- > >>> predictCoding(vcf, txdb, Hsapiens) > >>> > >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, > >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))], > >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] > >>> > >>> Let me know if you run into problems or if the docs need more > >>> detail. > >>> > >>> Valerie > >>> > >> > > > >

ADD REPLY • link 10.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Excellent! Thanks for fixing it, and again for this new feature that enables calculating cDNA positions. best regards, robert. On 7/28/14 11:30 PM, Valerie Obenchain wrote: > OK. We'll try it again. See 1.17.29. > > It looks like I did not introduce this with ignore.strand. The bug was > deeper and could be handled in either .orderElementsByTranscription() > or the pair of .listCumsum* functions. I chose to add an arg to > .orderElementsByTranscription() that controls whether negative strand > ranges are ordered large-to-small (3' to 5') or small-to-large (5' to > 3'). > > Valerie > > > On 07/28/2014 01:23 PM, Robert Castelo wrote: >> hi Valerie, >> >> i'm still not getting the right answer, i've looked at the unit test you >> added and i'm afraid it does not recreate the situation i described >> below, which involved one transcript with two exons, while the unit test >> forms two transcripts of one exon. so as it is now, doing: >> >> mapCoords(gr, grl, ignore.strand=TRUE) >> > map >> GRanges with 2 ranges and 2 metadata columns: >> seqnames ranges strand | queryHits subjectHits >> <rle> <iranges> <rle> | <integer> <integer> >> [1] chrA [1797, 1797] - | 1 2 >> [2] chrA [1797, 1797] - | 2 2 >> >> you get the right answer, but doing: >> >> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE) >> > map >> GRanges with 2 ranges and 2 metadata columns: >> seqnames ranges strand | queryHits subjectHits >> <rle> <iranges> <rle> | <integer> <integer> >> [1] chrA [2036, 2036] - | 1 1 >> [2] chrA [2036, 2036] - | 2 1 >> >> you get the previous wrong answer. you'll find my sessionInfo() below. >> >> cheers, >> robert. >> ps: sessionInfo() >> R version 3.1.1 (2014-07-10) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13 >> AnnotationDbi_1.27.9 >> [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31 >> [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28 >> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2 >> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3 >> [16] colorout_1.0-2 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9 >> biomaRt_2.21.1 >> [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8 >> checkmate_1.2 >> [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 >> fail_1.2 >> [13] foreach_1.4.2 GenomicAlignments_1.1.22 >> iterators_1.0.7 Rcpp_0.11.2 >> [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13 >> sendmailR_1.1-2 >> [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 >> XML_3.98-1.1 >> [25] zlibbioc_1.11.1 >> >> >> On 7/28/14 1:25 PM, Valerie Obenchain wrote: >>> Hi Robert, >> > >> > Now fixed in GenomicRanges 1.17.28. >> > >> > I introduced this bug when trying to be too clever with >> > ignore.strand. Thanks for catching it and, as always, providing a >> > clear reproducible example. >> > >> > Valerie >> > >> > On 07/24/2014 08:24 AM, Robert Castelo wrote: >> >> Dear Valerie, >> >> >> >> sorry for my long delay in reacting to your message below. This >> >> addition to VariantAnnotation by Michael and you is great. I just >> >> would like to make a comment and report what looks to me as a bug >> >> related to the computation of cDNA positions, just in case you have >> >> suggestions and comments, specially with respect to what at the >> >> moment seems like a bug to me. >> >> >> >> the way in which mapCoords() works is by finding all possible >> >> overlaps of each GRanges entry, corresponding to a variant. This is >> >> strictly not necessary once you have gone through locateVariants() >> >> and/or predictCoding() since these functions already find all >> >> possible overlaps of each variant to different transcripts and >> >> regions. So, to get the corresponding annotation of cDNA positions >> >> I had to write the following function which is slightly more >> >> elaborated from what you suggest below: >> >> >> >> ## this function assumes that locateVariants() or predictCoding() >> >> ## have been called so that there is a TXID metadata column >> >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in% >> >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not >> >> found in input GRanges object.") >> >> >> >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx, >> >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <- >> >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <- >> >> gr[qolap] mcols(res) <- append(values(res), >> >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ## >> >> restrict results to cDNA positions of query TXID res <- >> >> res[res$TXID == res$TXID2] >> >> >> >> start(res$cDNALOC) } >> >> >> >> i'm going to run it with a toy example of a variant that overlaps >> >> a coding region of 7 transcripts in the positive strand and the >> >> three UTR region of one transcript in the negative strand. >> >> >> >> library(VariantAnnotation) >> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> >> >> >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), >> >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L) >> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr, >> >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb) >> >> >> >> now let's look to the cDNA position of the transcript in the >> >> negative strand (the last entry of the resulting GRanges object): >> >> >> >> GRanges with 1 range and 2 metadata columns: seqnames >> >> ranges strand | TXID cDNALOC <rle> <iranges> >> >> <rle> | <integer> <integer> [1] chr21 [43522349, 43522349] >> >> - | 73335 2036 >> >> >> >> and the exonic structure of this transcript >> >> >> >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges >> >> with 2 ranges and 3 metadata columns: seqnames ranges >> >> strand | exon_id exon_name exon_rank <rle> <iranges> >> >> <rle> | <integer> <character> <integer> [1] chr21 [43528406, >> >> 43528644] - | 260152 <na> 1 [2] chr21 >> >> [43522244, 43524145] - | 260151 <na> 2 >> >> >> >> from this structure one would expect that the reported cDNA >> >> position would be: >> >> >> >> 43524145 - 43522349 + 1 = 1797 >> >> >> >> instead of 2036. >> >> >> >> after investigating a bit where this value may come from, it seems >> >> that while the strand is being taken into account for the relative >> >> position within the exon, the widths of upstream exons is wrongly >> >> selected (in this case there are no upstream exons): >> >> >> >> width(exonsbytx[["73335"]]) [1] 239 1902 >> >> >> >> 239 + 1796 + 1 = 2036 >> >> >> >> >> >> so, my guess is that this must be something to be fixed in >> >> mapCoords(). >> >> >> >> >> >> best regards!! >> >> >> >> robert. >> >> >> >> >> >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote: >> >>> Hi, >> >>> >> >>> Following up on this thread: >> >>> >> >>> >> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html >> >>> >> >>> >> >>> >> These changes are available in VariantAnnotation 1.11.15: >>>>> >> >>> 1) LOCSTART, LOCEND >> >>> >> >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND. >> >>> These are LOCATION-centric coordinates and can be different for >> >>> each row so I thought these names were more descriptive than >> >>> REFLOCS (discussed in thread). We have 2 values (start/end) >> >>> instead of a single column of IRanges() because we can't make an >> >>> IRanges() with missing values. Technically 'missing' ranges are >> >>> represented by zero-width ranges but we still need a position; >> >>> there is no position because there was no overlap. >> >>> >> >>> 2) mapCoords(), pmapCoords() >> >>> >> >>> These functions are courtesy of Michael. mapCoords() maps ranges >> >>> onto another set of coordinates. You can map to cds-centric, >> >>> exon-centric or any other type of coordinate. See ?mapCoords in >> >>> both GenomicRanges and GenomicAlignments. >> >>> >> >>> >> >>> In the previous thread we discussed added cDNA locations to >> >>> predictCoding(). I've decided against this because it adds the >> >>> additional overhead of the exonsBy() extraction and a >> >>> findOverlaps() call. Not all users want the cDNA locations and >> >>> those that do can now easily get them with mapCoords(). >> >>> >> >>> ## The usual predictCoding setup: >> >>> library(BSgenome.Hsapiens.UCSC.hg19) >> >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- >> >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata", >> >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, >> >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <- >> >>> predictCoding(vcf, txdb, Hsapiens) >> >>> >> >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, >> >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))], >> >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] >> >>> >> >>> Let me know if you run into problems or if the docs need more >> >>> detail. >> >>> >> >>> Valerie >> >>> >> >> >> > >> >> >

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k