VariantAnnotation locateVariants - meaning of precedesID and followsID

0

Entering edit mode

Paul Theodor Pyl ▴ 80

@paul-theodor-pyl-5014

Last seen 10.6 years ago

Hi, I have a question concerning the meaning of precedesID and followsID in the result of the locateVariants function from the VariantAnnotation package. The reference manual states: precedesID: The id of the gene the query precedes. Only present for ?intergenic? variants. followsID: The id of the gene the query follows. Only present for ?intergenic? variants. Which I read as: The precedesID is the ID of the gene that follows the query i.e. the gene which is preceded by the query. The query in this case would be my snv position?! I assume that the order of elements on the chromosome would be: gene with ID followsID, query(SNV), gene with ID precedesID. In the return value of locateVariants I find different scenarios, an example is given in the attached file, I show a table here: pstrand pstart pend start end fstrand fstart fend rs62637812 - 34611 36082 51803 51803 - 136698 140567 1:54586 + 69091 70009 54586 54586 - 136698 140567 1:145820 - 136698 140567 145820 145820 + 69091 70009 1:231454 + 322037 326939 231454 231454 + 69091 70009 1:333714 + 367659 368598 333714 333714 + 323892 328582 1:467095 + 763064 788903 467095 467095 + 367659 368598 1:715857 - 700245 714069 715857 715857 - 761586 762903 rs1044922 - 761586 762903 792263 792263 + 763064 788903 rs7545373 - 803451 812183 812284 812284 + 763064 788903 rs192553893 + 860530 871277 839873 839873 - 852953 854818 rs6673914 - 852953 854818 855075 855075 - 879583 893919 1:919987 - 910579 912022 919987 919987 + 901877 910485 rs2488989 + 948847 949920 934121 934121 - 934342 935553 rs3128114 - 934342 935553 935715 935715 + 901877 910485 rs9442388 + 955503 991500 951295 951295 + 948847 949920 I got the coordinates from the transcriptDb object I used for the call to locateVariants via the transcriptsBy(..., "gene") function. I see cases with both genes upstream of the variant, or with their positions switched etc. Could someone help me make sense of this, I can't see an obvious pattern explaining this. Cheers, Paul -------------- next part -------------- > library(VariantAnnotation) [...] > library(BSgenome.Hsapiens.UCSC.hg19) [...] > library(TxDb.Hsapiens.UCSC.hg19.knownGene) [...] > subst <- c("RSID", "PREDICTION", "SCORE", "AACHANGE", "PROTEINID") > tab <- TabixFile( "../my.vcf.gz" ) > seqlens <- seqlengths( BSgenome.Hsapiens.UCSC.hg19::Hsapiens )[1:24] > names(seqlens) <- paste( c(1:22,"X","Y") ) > which <- RangesList("1"=IRanges(50000, 1000000)) > svp = ScanVcfParam(which=which) > vcf = readVcf(tab, "hg19", svp ) > rename <- paste("chr",seqlevels(vcf),sep="") > names(rename) <- seqlevels(vcf) > vcf <- renameSeqlevels(vcf, rename) > txdb19 <- TxDb.Hsapiens.UCSC.hg19.knownGene > txdb19 TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: UCSC | Genome: hg19 | Genus and Species: Homo sapiens | UCSC Table: knownGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: yes | miRBase build ID: GRCh37 | transcript_nrow: 80922 | exon_nrow: 286852 | cds_nrow: 235842 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012) | GenomicFeatures version at creation time: 1.7.30 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 > rd = rowData(vcf) > rd GRanges with 1900 ranges and 1 elementMetadata col: seqnames ranges strand | paramRangeID <rle> <iranges> <rle> | <factor> rs62637812 chr1 [51803, 51803] * | <na> rs76402894 chr1 [51898, 51898] * | <na> rs78732933 chr1 [51928, 51928] * | <na> rs62637813 chr1 [52058, 52058] * | <na> 1:52096 chr1 [52096, 52096] * | <na> 1:54586 chr1 [54586, 54586] * | <na> 1:54844 chr1 [54844, 54844] * | <na> 1:58176 chr1 [58176, 58176] * | <na> 1:58211 chr1 [58211, 58211] * | <na> ... ... ... ... ... ... rs56255212 chr1 [985449, 985449] * | <na> 1:985450 chr1 [985450, 985450] * | <na> rs3121562 chr1 [991724, 991724] * | <na> rs113866178 chr1 [991931, 991931] * | <na> rs2710886 chr1 [992084, 992084] * | <na> rs2799074 chr1 [992094, 992094] * | <na> 1:995121 chr1 [995121, 995121] * | <na> 1:995125 chr1 [995125, 995125] * | <na> rs28397086 chr1 [997408, 997408] * | <na> --- seqlengths: chr1 NA > loc = locateVariants(vcf, txdb19, AllVariants()) > loc_df <- as.data.frame(loc) > > txbg <- transcriptsBy(txdb19, "gene") > > loc_df <- loc_df[ is.finite(loc_df$precedesID), c("seqnames","start","end","precedesID","followsID")] > loc_df <- loc_df[!duplicated(loc_df$precedesID),] > > prec <- txbg[ as.character(loc_df$precedesID) ] > foll <- txbg[ as.character(loc_df$followsID) ] > > for( i in 1:length(names(prec)) ){ + np = names(prec)[i] + nf = names(foll)[i] + p = prec[[np]] + f = foll[[nf]] + p1 = p[1] + f1 = f[1] + #print( paste( loc_df$start[i], loc_df$end[i], "|", p1 at seqnames, p1 at strand, p1 at ranges@start, p1 at ranges@start + p1 at ranges@width, "vs.", f1 at seqnames, f1 at strand, f1 at ranges@start, f1 at ranges@start + f1 at ranges@width ) ) + loc_df$pstrand[i] = as.character(p1 at strand) + loc_df$pstart[i] = as.numeric(p1 at ranges@start) + loc_df$pend[i] = as.numeric(p1 at ranges@start) + as.numeric(p1 at ranges@width) + loc_df$fstrand[i] = as.character(f1 at strand) + loc_df$fstart[i] = as.numeric(f1 at ranges@start) + loc_df$fend[i] = as.numeric(f1 at ranges@start) + as.numeric(f1 at ranges@width) + } > > loc_df[, c("pstrand","pstart","pend","start","end","fstrand","fstart","fend") ] pstrand pstart pend start end fstrand fstart fend rs62637812 - 34611 36082 51803 51803 - 136698 140567 1:54586 + 69091 70009 54586 54586 - 136698 140567 1:145820 - 136698 140567 145820 145820 + 69091 70009 1:231454 + 322037 326939 231454 231454 + 69091 70009 1:333714 + 367659 368598 333714 333714 + 323892 328582 1:467095 + 763064 788903 467095 467095 + 367659 368598 1:715857 - 700245 714069 715857 715857 - 761586 762903 rs1044922 - 761586 762903 792263 792263 + 763064 788903 rs7545373 - 803451 812183 812284 812284 + 763064 788903 rs192553893 + 860530 871277 839873 839873 - 852953 854818 rs6673914 - 852953 854818 855075 855075 - 879583 893919 1:919987 - 910579 912022 919987 919987 + 901877 910485 rs2488989 + 948847 949920 934121 934121 - 934342 935553 rs3128114 - 934342 935553 935715 935715 + 901877 910485 rs9442388 + 955503 991500 951295 951295 + 948847 949920 > sessionInfo() R Under development (unstable) (2012-04-16 r59046) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 [2] GenomicFeatures_1.8.1 [3] AnnotationDbi_1.18.0 [4] Biobase_2.16.0 [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 [6] BSgenome_1.24.0 [7] VariantAnnotation_1.2.5 [8] Rsamtools_1.8.3 [9] Biostrings_2.24.1 [10] GenomicRanges_1.8.3 [11] IRanges_1.14.2 [12] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 bitops_1.0-4.1 DBI_0.2-5 grid_2.16.0 [5] lattice_0.20-6 Matrix_1.0-6 RCurl_1.91-1 RSQLite_0.11.1 [9] rtracklayer_1.16.1 snpStats_1.6.0 splines_2.16.0 stats4_2.16.0 [13] survival_2.36-12 tools_2.16.0 XML_3.9-4 zlibbioc_1.2.0

BSgenome TranscriptDb BSgenome GenomicFeatures BSgenome TranscriptDb BSgenome • 1.2k views

ADD COMMENT • link updated 13.0 years ago by Valerie Obenchain ★ 6.8k • written 13.0 years ago by Paul Theodor Pyl ▴ 80

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hi Paul, The precedeID and followID follow the convention of the precede() and follow() methods. See ?precede, ?follow. The IDs in the precedID column output of locateVariants() are genes that come "after" the variant where "after" indicates to the right or towards the 3' end. When talking about a variant at position "5" on the positive strand, a gene at position "8" would be in the precedeID column and one at position "3" would be in the followID column. This is reversed when working on the negative strand. query <- GRanges("chr1", IRanges(c(5, 5), width=1), strand=c("+", '-')) subject <- GRanges("chr1", IRanges(c(3, 8), width=1), strand="+") > precede(query, subject) [1] 2 NA > follow(query, subject) [1] 1 NA Now change the strand to negative on the subject, strand(subject) <- "-" > precede(query, subject) [1] NA 1 > follow(query, subject) [1] NA 2 Does this clarify things? Valerie On 04/17/2012 08:20 AM, Paul Theodor Pyl wrote: > Hi, > > I have a question concerning the meaning of precedesID and followsID > in the result of the locateVariants function from the > VariantAnnotation package. > > The reference manual states: > > precedesID: The id of the gene the query precedes. Only present for > intergenic variants. > followsID: The id of the gene the query follows. Only present for > intergenic variants. > > Which I read as: The precedesID is the ID of the gene that follows the > query i.e. the gene which is preceded by the query. The query in this > case would be my snv position?! Yes, the query is your snv position. > > I assume that the order of elements on the chromosome would be: > gene with ID followsID, query(SNV), gene with ID precedesID. This is true for the positive strand but reversed for the negative strand. > > In the return value of locateVariants I find different scenarios, an > example is given in the attached file, I show a table here: > > pstrand pstart pend start end fstrand fstart fend > rs62637812 - 34611 36082 51803 51803 - 136698 140567 > 1:54586 + 69091 70009 54586 54586 - 136698 140567 > 1:145820 - 136698 140567 145820 145820 + 69091 70009 > 1:231454 + 322037 326939 231454 231454 + 69091 70009 > 1:333714 + 367659 368598 333714 333714 + 323892 328582 > 1:467095 + 763064 788903 467095 467095 + 367659 368598 > 1:715857 - 700245 714069 715857 715857 - 761586 762903 > rs1044922 - 761586 762903 792263 792263 + 763064 788903 > rs7545373 - 803451 812183 812284 812284 + 763064 788903 > rs192553893 + 860530 871277 839873 839873 - 852953 854818 > rs6673914 - 852953 854818 855075 855075 - 879583 893919 > 1:919987 - 910579 912022 919987 919987 + 901877 910485 > rs2488989 + 948847 949920 934121 934121 - 934342 935553 > rs3128114 - 934342 935553 935715 935715 + 901877 910485 > rs9442388 + 955503 991500 951295 951295 + 948847 949920 > > I got the coordinates from the transcriptDb object I used for the call > to locateVariants via the transcriptsBy(..., "gene") function. > > I see cases with both genes upstream of the variant, or with their > positions switched etc. > Could someone help me make sense of this, I can't see an obvious > pattern explaining this. > > Cheers, > Paul > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 13.0 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.