How to get position for each gene ID/gene symbol instead of position for each transcript

1

Entering edit mode

shirley zhang ★ 1.0k

@shirley-zhang-2038

Last seen 10.2 years ago

Hi Marc, Thanks a lot for your detailed reply. It is very helpful. I tried library "GenomicFeatures" and your code. It works great. I did get position for each Transcript. However I am only interested in obtaining position for each gene ID/gene symbol instead of position for each transcript. For example, if one gene has multiple transcripts, I only want the biggest boundary of each gene which includes all of its transcripts. Does GenomicFeatures or other package can do this? Or I have to compute the gene position by combining all of its corresponding transcripts? Thanks again for your suggestions. Shirley On Thu, Aug 19, 2010 at 2:50 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: > Hi Shirly, > > That is a very good question! ?Looking at the latest builds of the > annotations (which is what we normally would recommend), the manual page > will now tell you that the source for the latest CHRLOC and CHRLOCEND > information is from here: > > ? ? Mappings were based on data provided by: UCSC Genome > ? ? Bioinformatics (Homo sapiens) > ? ? ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19 > > That means that this information corresponds to the more recent "hg19" > build which in turn will correspond to "build 37". ?So if you want older > builds you could use older versions of the org packages, but I since I > would never in good conscience recommend using stale annotations, I > think that in that case you would be much better off to do the initial > steps I showed you (to get the entrez gene IDs), and THEN using the > GenomicFeatures package, which should allow you to get positions for the > hg18 build mapped onto entrez gene IDs. > > You can see a vignette for GenomicFeatures here: > > http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatur es.html > > You would basically just need to use makeTranscriptDbFromUCSC(), > serialize the result for safekeeping, and then call transcripts() on the > object it gave you. ?Once you had done so, you could simply ask for the > > library(GenomicFeatures) > foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene") > bar = transcriptsBy(foo, by="gene") > ## At this point you will have ?GRanges object which is really useful > for a lot of great stuff, > ## but since you are only interested in matching the gene_ids to the > positions, > ## you can just pare it down to a much simpler object like so: > > frm = as.data.frame(bar) > > ## The values in "element" will be the names of the elements from the > GRangesList object bar. > ## If you check them you will see that for this UCSC track, they are the > entrez gene IDs that you want to merge with. > > Hope this helps, > > > ?Marc > > > > On 08/18/2010 06:26 PM, shirley zhang wrote: >> Hi Marc, >> >> Thanks a lot for your information. It is very very helpful. May I ask >> one more question? >> >> In my case, ?the snp_position is based on Human Build 36. ?The package >> org.Hs.eg.db was updated in October 2009, so can I assume it is on >> Build 36? >> >> Thanks again, >> Shirely >> >> On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: >> >>> Hi Shirley, >>> >>> You can find this information in an organism package. ?org.Hs.eg.db for >>> humans (for example). >>> >>> What will be harder will be getting all the information from the mixed >>> bag of IDs that you describe. ?But you can still retrieve it for each ID >>> type separately like this example for refseq IDs: >>> >>> ## So for refseq you can start by getting the entrez gene ID >>> library(org.Hs.eg.db) >>> a = c("NM_000014", "NM_000015", "foo") >>> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA)) >>> gene_id = gene_id[!is.na(gene_id)] >>> >>> ## And then once you have the IDs, you can map them to a position >>> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >>> >>> ## And now because we were careful to make sure that we always have an >>> entrez gene ID, >>> ## you can just merge all the results together to make this easier to >>> look at: >>> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc), >>> chrlocend) >>> >>> >>> ## This process can then be repeated for other kinds of IDs etc: >>> b = c("Hs.100217", "Hs.100299") >>> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA)) >>> gene_id = gene_id[!is.na(gene_id)] >>> >>> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >>> ## etc. >>> >>> Does that help? >>> >>> >>> ?Marc >>> >>> >>> >>> >>> >>> On 08/17/2010 07:26 PM, shirley zhang wrote: >>> >>>> Dear list, >>>> >>>> I have a list of human SNPs and genes/transcripts which are obtained >>>> from different studies. ?I would like to calculate the distance >>>> between these SNPs and genes/transcripts on the same chromosome. >>>> However, the ID for the gene/transcript across studies ?is not >>>> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), >>>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is >>>> LOC254266, but its record is an obsolete version), Contig22780_RC, >>>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. >>>> >>>> I would like to retrieve the start and end position for those >>>> gene/transcript. Can anybody help me how to get the position >>>> information based on these different gene/transcript IDs. If I also >>>> need to match all of other IDs to gene symbol/entrezID, what kind of >>>> database or bioConductor package will allow me to do the mapping as >>>> perfect as possible? How about biomaRt package? >>>> >>>> Thanks >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> > >

Organism PROcess biomaRt GenomicFeatures Organism PROcess biomaRt GenomicFeatures • 2.3k views

ADD COMMENT • link updated 14.3 years ago by Steve Lianoglou ★ 13k • written 14.3 years ago by shirley zhang ★ 1.0k

1

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 21 months ago

United States

Hi Shirley, On Tue, Aug 24, 2010 at 9:07 PM, shirley zhang <shirley0818 at="" gmail.com=""> wrote: > Hi Marc, > > Thanks a lot for your detailed reply. It is very helpful. > > I tried library "GenomicFeatures" and your code. It works great. I did > get position for each Transcript. ?However I am only interested in > obtaining position for each gene ID/gene symbol instead of position > for each transcript. For example, if one gene has multiple > transcripts, I only want the biggest boundary of each gene which > includes all of its transcripts. Does GenomicFeatures or other package > can do this? Or I have to compute the gene position by combining all > of its corresponding transcripts? You can do this pretty "simply" with GenomicFeatures, if you want to stick with that: R> txdb <- loadFeatures('your.transcript.db') R> xcripts <- transcriptsBy(txdb, by='gene') ## This part is really slow -- this will be subject of next email R> gene.bounds <- seqapply(xcripts, reduce) the names() of gene.bounds is the entrez.id of the gene. You can use the org.Hs.eg.db pacakges R> library(org.Hs.eg.db) R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) symbols will now be a list (names are entrez ids, values are the gene symbols) that you can manipulate in "the standard R way" Hope that helps, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 14.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Sorry: > You can do this pretty "simply" with GenomicFeatures, if you want to > stick with that: > > R> txdb <- loadFeatures('your.transcript.db') > R> xcripts <- transcriptsBy(txdb, by='gene') > > ## This part is really slow -- this will be subject of next email > R> gene.bounds <- seqapply(xcripts, reduce) Should have used `range` instead of `reduce` here: R> gene.bounds <- seqapply(xcripts, range) The rest is the same ... > the names() of gene.bounds is the entrez.id of the gene. You can use > the org.Hs.eg.db pacakges > > R> library(org.Hs.eg.db) > R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) > > symbols will now be a list (names are entrez ids, values are the gene > symbols) that you can manipulate in "the standard R way" > > Hope that helps, > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 14.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks Steve. It is exactly what I want. Good night! Shirley On Tue, Aug 24, 2010 at 10:41 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Sorry: > >> You can do this pretty "simply" with GenomicFeatures, if you want to >> stick with that: >> >> R> txdb <- loadFeatures('your.transcript.db') >> R> xcripts <- transcriptsBy(txdb, by='gene') >> >> ## This part is really slow -- this will be subject of next email >> R> gene.bounds <- seqapply(xcripts, reduce) > > Should have used `range` instead of `reduce` here: > > R> gene.bounds <- seqapply(xcripts, range) > > The rest is the same ... > >> the names() of gene.bounds is the entrez.id of the gene. You can use >> the org.Hs.eg.db pacakges >> >> R> library(org.Hs.eg.db) >> R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) >> >> symbols will now be a list (names are entrez ids, values are the gene >> symbols) that you can manipulate in "the standard R way" >> >> Hope that helps, >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Xiaoling (Shirley) Zhang M.D., Ph.D. (Bioinformatics) Boston University, Boston, MA Tel: (857) 233-9862 Email: zhangxl at bu.edu

ADD REPLY • link 14.3 years ago shirley zhang ★ 1.0k

Login before adding your answer.