How to map KEGG gene IDs to gene names?


Elliot Kleiman ▴ 150

@elliot-kleiman-2565

Last seen 10.5 years ago

Hi BioC List from {sunny}San Diego, CA!

[Question]:
* How do you map KEGG gene IDs to textual gene names, gene descriptions via BioC?

For example, I am interested in knowing which genes are involved in the calcium signaling pathway in rattus norvegicus, so I did:

> library(KEGG)
> # map pathway id to pathway name
> KEGGPATHID2NAME$"04020"
[1] "Calcium signaling pathway"

> library(KEGGSOAP)
> # get all genes in pathway rno04020
> csp.genes.rno <- get.genes.by.pathway("path:rno04020")
> # how many genes are involved?
> length(csp.genes.rno)
[1] 165
> # print a few of the results out
> csp.genes.rno[1:3]
[1] "rno:113995" "rno:114098" "rno:114099"

The problem is, I don't know what "rno:113995" refers to? [not without visiting the KEGG website]
Instead, I would like to obtain a mapping for each of the retrieved KEGG gene IDs into textual gene names, gene descriptions, etc.

How do you do that exactly?

Thank you,

Elliot Kleiman

> # print SessionInfo
> sessionInfo()
R version 2.6.1 (2007-11-26)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] KEGG_2.0.1 KEGGSOAP_1.12.0 SSOAP_0.4-6 RCurl_0.8-3
[5] XML_1.93-2

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17 tools_2.6.1

--
MS graduate student
Program in Computational Science
San Diego State University
http://www.csrc.sdsu.edu/

Rattus norvegicus Rattus norvegicus • 1.8k views

ADD COMMENT • link updated 17.2 years ago by Martin Morgan 25k • written 17.2 years ago by Elliot Kleiman ▴ 150


Martin Morgan 25k

@martin-morgan-1513

Last seen 29 days ago

United States

Hi Elliot -- not sure that this is the way to go here, but...

> details <- bget(paste(csp.genes.rno, collapse=" "))
> nchar(details) # one long character string
[1] 515265
> records <- strsplit(details, "///\\n")[[1]] # ///\n separates records
> length(records)
[1] 100
> length(unique(csp.genes.rno)) # hmm, a few missing...
[1] 165
> cat(records[[1]]) # 1 record, 1 character string; '\n' separates lines
ENTRY       113995 CDS R.norvegicus
NAME        P2rx5
DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5
ORTHOLOGY   KO: K05219 purinergic receptor P2X, ligand-gated ion channel 5
PATHWAY     PATH: rno04020 Calcium signaling pathway
            PATH: rno04080 Neuroactive ligand-receptor interaction
POSITION    10q24
MOTIF       Pfam: P2X_receptor
            PROSITE: P2X_RECEPTOR
DBLINKS     RGD: 620256
            NCBI-GI: 31377508
            NCBI-GeneID: 113995
            Ensembl: ENSRNOG00000019208
            UniProt: P51578
CODON_USAGE T C A G
          T 8 20 0 4 9 8 1 0 5 10 0 1 7 5 0 6
          C 5 9 2 18 5 4 6 0 2 6 1 17 4 5 2 7
          A 8 16 6 6 5 7 5 3 7 16 11 22 4 5 0 7
          G 6 7 5 18 8 14 5 2 9 17 7 21 5 9 4 14
AASEQ       455
            MGQAAWKGFVLSLFDYKTAKFVVAKSKKVGLLYRVLQLIILLYLLIWVFLIKKSYQDIDT
            SLQSAVVTKVKGVAYTNTTMLGERLWDVADFVIPSQGENVFFVVTNLIVTPNQRQGICAE
            REGIPDGECSEDDDCHAGESVVAGHGLKTGRCLRVGNSTRGTCEIFAWCPVETKSMPTDP
            LLKDAESFTISIKNFIRFPKFNFSKANVLETDNKHFLKTCHFSSTNLYCPIFRLGSIVRW
            AGADFQDIALKGGVIGIYIEWDCDLDKAASKCNPHYYFNRLDNKHTHSISSGYNFRFARY
            YRDPNGVEFRDLMKAYGIRFDVIVNGKAGKFSIIPTVINIGSGLALMGAGAFFCDLVLIY
            LIRKSEFYRDKKFEKVRGQKEDANVEVEANEMEQERPEDEPLERVRQDEQSQELAQSGRK
            QNSNCQVLLEPARFGLRENAIVNVKQSQILHPVKT
NTSEQ       1368
            atgggccaggcggcctggaaggggtttgtgctgtctctgttcgactataagactgcaaag
            ttcgtggtcgccaagagcaagaaggtggggctgctctaccgggtgctgcagctcatcatc
            ctgttgtacttgctcatatgggtgtttctgataaagaagagttatcaggacattgacact
            tccctgcagagtgctgtggtcaccaaagtcaagggggtggcctatactaacaccacgatg
            cttggggaacggctctgggatgtagcagactttgtcattccatctcagggggagaacgtt
            ttcttcgtggtcaccaacctgatcgtgactcctaaccagcggcagggcatctgcgctgag
            cgtgaaggcatccctgatggcgagtgttctgaggacgatgactgtcacgctggggagtct
            gttgtagctgggcacggactgaaaactggccgctgtctccgggtggggaactctacccgg
            ggaacctgtgagatctttgcttggtgcccagtggagacaaagtccatgccaacggatccc
            cttctaaaggatgccgaaagcttcaccatttccataaagaacttcattcgcttccccaag
            ttcaacttctccaaagccaatgtactagaaacagacaacaaacatttcctgaaaacctgt
            cacttcagctccacaaatctctactgtcccatcttccgactggggtctattgtccgctgg
            gcaggggcagacttccaggacatagccctgaagggtggtgtgataggaatctatattgaa
            tgggactgtgaccttgataaagctgcctctaaatgcaacccacactactacttcaaccgc
            ctggacaacaaacacacacactccatctcctctgggtacaacttcaggttcgccaggtat
            taccgtgaccctaatggggtagagttccgtgacctgatgaaagcctacggcatccgcttt
            gatgtgatagttaatggcaaggcaggaaaattcagcatcatccccacagtcatcaacatt
            ggttctgggctggcgctcatgggtgctggggctttcttctgcgacctggtacttatctac
            ctcatcaggaagagtgagttttaccgagacaagaagtttgagaaagtgaggggtcagaag
            gaggatgccaatgttgaggttgaggccaacgagatggagcaggagcggcctgaggacgaa
            ccactggagagggttcgtcaggatgagcagtcccaagaactggcccagagtggcaggaag
            cagaatagcaactgccaggtgcttttggagcctgccaggtttggcctccgggagaatgcc
            attgtgaacgtgaagcagtcacagatcttgcatccagtgaagacgtag

From here it seems like you're stuck 'screen scraping', e.g.,

> ids <- sub("^ENTRY[[:space:]]+([[:digit:]]+).*", "\\1", records)
> ids
 [1] "113995" "114098" "114099" "114115" "114207" "114493" "114633" "116601"
 [9] "140447" "140448" "140671" "140693" "170546" "170897" "170926" "171140"
[17] "171378" "24173"  "24176"  "24180"  "24215"  "24239"  "24242"  "24244"
[25] "24245"  "24246"  "24260"  "24316"  "24326"  "24329"  "24337"  "24408"
[33] "24409"  "24411"  "24412"  "24414"  "24418"  "24448"  "24598"  "24599"
[41] "24600"  "24611"  "24629"  "24654"  "24655"  "24674"  "24675"  "24680"
[49] "24681"  "24807"  "24808"  "24816"  "24889"  "24896"  "24925"  "24929"
[57] "24938"  "25007"  "25023"  "25031"  "25041"  "25050"  "25107"  "25176"
[65] "25187"  "25229"  "25245"  "25262"  "25267"  "252859" "25302"  "25324"
[73] "25342"  "25369"  "25391"  "25400"  "25439"  "25461"  "25477"  "25505"
[81] "25570"  "25636"  "25637"  "25645"  "25652"  "25668"  "25679"  "25689"
[89] "25706"  "25738"  "257648" "287745" "288057" "290561" "291926" "29241"
[97] "29316"  "29322"  "29337"  "293508"

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
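
The same 'screen scraping' idea used for the ENTRY line above can be extended to the NAME and DEFINITION lines, which is what the original question asks for. The following is only a rough sketch in base R, not part of KEGGSOAP: the helper name `get_field` is made up for illustration, the record is a shortened copy of the one printed above, and the exact spacing inside real records is an assumption (the regular expression accepts any run of whitespace after the field name).

```r
# Shortened stand-in for one element of `records`; real records hold more fields.
record <- paste(
  "ENTRY       113995 CDS R.norvegicus",
  "NAME        P2rx5",
  "DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5",
  "POSITION    10q24",
  sep = "\n"
)

# Hypothetical helper: first line of the named field in each record, or NA.
get_field <- function(records, field) {
  pattern <- paste0("^", field, "[[:space:]]+")
  vapply(records, function(rec) {
    lines <- strsplit(rec, "\n", fixed = TRUE)[[1]]
    hit <- grep(pattern, lines, value = TRUE)
    if (length(hit) == 0) return(NA_character_)
    sub(pattern, "", hit[1])
  }, character(1), USE.NAMES = FALSE)
}

records <- c(record)

# The ENTRY line carries extra columns; keep only the leading ID token and
# add the "rno:" prefix so it matches the IDs from get.genes.by.pathway().
entry <- sub("[[:space:]].*$", "", get_field(records, "ENTRY"))

gene_table <- data.frame(
  id = paste0("rno:", entry),
  name = get_field(records, "NAME"),
  definition = get_field(records, "DEFINITION"),
  stringsAsFactors = FALSE
)
print(gene_table)
```

Only the first line of each field is taken, so multi-line fields such as PATHWAY or DBLINKS would need extra handling.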

ADD COMMENT • link 17.2 years ago Martin Morgan 25k


Martin Morgan 25k

@martin-morgan-1513

Last seen 29 days ago

United States

As a quick follow-up to myself, and to indicate how unreliable my info is on this, from

http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:40
http://www.genome.jp/dbget/dbget_manual.html

'bget' returns at most 100 entries (hence length(records)==100) and additional options embedded in the character string argument to bget influence the type of data returned. Perhaps there are other things I'm missing, too, and there are better alternatives to the screen scraping I mentioned?

Martin
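
Given the 100-entry limit, one way around the missing records might be to split the IDs into batches of at most 100 and query each batch separately. This is a sketch only: `batch_ids` is a made-up helper, the IDs below are placeholders standing in for the 165 elements of csp.genes.rno, and the bget call is left as a comment because it assumes KEGGSOAP and the KEGG server behave as in the transcript above.

```r
# Hypothetical helper: split a vector of IDs into batches of at most `size`.
batch_ids <- function(ids, size = 100) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Placeholder IDs; in the thread these would be csp.genes.rno.
ids <- paste0("rno:", seq_len(165))
batches <- batch_ids(ids)

# One space-separated query string per batch, the form passed to bget above.
queries <- vapply(batches, paste, character(1), collapse = " ")

# Batch sizes: 100 and 65 for 165 IDs.
print(lengths(batches))

# Assumed usage, not run here:
# details <- lapply(queries, bget)
```

Each element of the resulting list could then be split on "///\n" as before and the per-batch records concatenated.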

ADD COMMENT • link 17.2 years ago Martin Morgan 25k


Hi Martin,

Wow, that is great info! I think I can really use the KEGG API to obtain the info I need (e.g. using Perl's SOAP::Lite). Also, I just discovered a very interesting package offered by the Omega project, called `RSPerl`.

"RSPerl provides a bidirectional interface for calling R from Perl and Perl from R."
* http://www.omegahat.org/RSPerl/

Thank you so much for your help!

Elliot Kleiman

ADD REPLY • link 17.2 years ago Elliot Kleiman ▴ 150


Martin Morgan 25k

@martin-morgan-1513

Last seen 29 days ago

United States

Hi Elliot -- Actually, if you're comfortable at that level, then you might take a peek 'under the hood' at the R functions in KEGGSOAP -- basically, I think (I have not explored this) you have access to the complete KEGG SOAP API within R, no need for RSPerl.

Martin

> library(KEGGSOAP)
> get.genes.by.pathway
function (pathway.id)
{
    return(unlist(.SOAP(KEGGserver, "get_genes_by_pathway",
        .soapArgs = list(pathway_id = pathway.id), action = KEGGaction,
        xmlns = KEGGxmlns, nameSpaces = SOAPNameSpaces(version = KEGGsoapns))))
}
<environment: namespace:KEGGSOAP>
> KEGGSOAP:::KEGGserver
[1] "http://soap.genome.jp/keggapi/request_v6.0.cgi"

ADD COMMENT • link 17.2 years ago Martin Morgan 25k
