BiomaRt return value
Tony Chiang ▴ 570
@tony-chiang-1769
Last seen 10.4 years ago
Hi Steffen, Sean, Wolfgang,

I have a question about the return value of the getBM() function. It is a data frame, and in the examples that I have seen, if I want to map from EMBL IDs to Entrez Gene IDs, I still also have to map the EMBL IDs back to themselves so that I know what has mapped to what. Example code follows in case my explanation is not clear:

################
library(biomaRt)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
filters = listFilters(ensembl)
attributes = listAttributes(ensembl)
##Here are my IDs from String
test = c("9606.ENSP00000045065", "9606.ENSP00000158762",
         "9606.ENSP00000174653", "9606.ENSP00000202967",
         "9606.ENSP00000204517", "9606.ENSP00000212015",
         "9606.ENSP00000220616", "9606.ENSP00000222008",
         "9606.ENSP00000222390", "9606.ENSP00000223051")
emblID = sapply(strsplit(test, "\\."), function(x) x[2])
##And the code I am using for the mapping is:
getBM(attributes=c("ensembl_peptide_id", "entrezgene", "ensembl_gene_id",
                   "hgnc_automatic_gene_name"),
      filters="ensembl_peptide_id", values=emblID, mart=ensembl)
##################

So I guess I have two questions. First, would it be a good idea to always return the input in the output data frame, so that we would not have to request the redundant attribute ("ensembl_peptide_id" in my example)?

Second, if you run the code, you will see that ENSP00000045065 did not map at all, so I assume that it is not a valid ensembl_peptide_id (this is a bit strange since I am using EMBL IDs). Is there some way to make that more transparent, maybe a row of NA values?

I realize that these are not terrible things to work around, but would it not make sense to have this? If not, please let me know.
Cheers,
--Tony

> sessionInfo()
R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] biomaRt_2.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.2-1 XML_2.6-0
@wolfgang-huber-3550
Last seen 5 months ago
EMBL European Molecular Biology Laborat…
Hi Tony,

Thanks for these good ideas. Both of these you could implement in a small wrapper function around getBM. Once you find that this is a stable, generally useful function, we'd be happy to accept your patch for the biomaRt package!

Btw, ENSP00000045065 is a valid protein sequence ID with many hits for it in Google, and indeed in the search box at http://www.ebi.ac.uk. The fact that the hsapiens_gene_ensembl mart does not know a mapping of it to an extant gene name could have all sorts of reasons, historical or scientific, which you could explore at the EBI website.

Best wishes
Wolfgang

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
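The "row of NA values" idea can be handled entirely on the R side, after getBM() has returned. What follows is only a minimal sketch of such a post-processing step, not anything provided by biomaRt: the helper name `pad_unmapped` and its arguments are hypothetical, and the mock data frame (with placeholder gene IDs) merely stands in for a real getBM() result so that the example runs without a network connection.

```r
# Hypothetical helper (not part of biomaRt): add an all-NA row for every
# queried ID that is missing from a getBM()-style result.
pad_unmapped <- function(result, ids, key) {
  stopifnot(is.data.frame(result), key %in% names(result))
  # One-column data frame holding every distinct queried ID.
  queried <- data.frame(unique(ids), stringsAsFactors = FALSE)
  names(queried) <- key
  # Left join: keep every queried ID, fill unmatched columns with NA.
  merged <- merge(queried, result, by = key, all.x = TRUE, sort = FALSE)
  # Restore the order of the input vector.
  merged <- merged[order(match(merged[[key]], ids)), , drop = FALSE]
  rownames(merged) <- NULL
  merged
}

# Mock result standing in for the data frame returned by getBM();
# the gene IDs are made-up placeholders, not real annotations.
mock <- data.frame(
  ensembl_peptide_id = c("ENSP00000158762", "ENSP00000174653"),
  ensembl_gene_id    = c("GENE_A", "GENE_B"),
  stringsAsFactors   = FALSE
)
ids <- c("ENSP00000045065", "ENSP00000158762", "ENSP00000174653")
padded <- pad_unmapped(mock, ids, key = "ensembl_peptide_id")
print(padded)
```

With a live mart, the same helper would be applied to the data frame from the getBM() call in the original question, passing `emblID` as `ids`. Because it relies on `merge()`, an ID that maps to several rows keeps all of them.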
Hi Wolfgang,

On Sun, Nov 22, 2009 at 2:12 PM, Wolfgang Huber <whuber@embl.de> wrote:

> thanks for these good ideas. Both of these you could implement in a small
> wrapper function around getBM. Once you find that this is a stable,
> generally useful function, we'd be happy to accept your patch for the
> biomaRt package!

Afraid of this one =). But I am going to do this for myself, so I will see if I can submit something. I guess my original question was whether there was anything untoward about doing this (for example, if you had already tried and found it was not feasible), so I am going to assume that this is not the case.

> Btw, ENSP00000045065 is a valid protein sequence ID with many hits for it
> in Google, and indeed in the search box at http://www.ebi.ac.uk. The fact
> that the hsapiens_gene_ensembl mart does not know a mapping of it to an
> extant gene name could have all sorts of reasons, historical or scientific,
> which you could explore at the EBI website.

Right. I did google for the peptide ID. I was just surprised that it did not map. In fact, using the hsapiens_gene_ensembl mart, I could not map a number of such IDs. I was unaware that there could be "historical" reasons for this as well.

Cheers,
--Tony
Hi Tony,

I want to add that in the past we used to return what was used as input to the query (the filter) as an attribute as well. However, this does not generalize: for some attribute/filter pairs the name is different, e.g. "start_position" in the attribute list and "start" in the filter list, and sometimes a filter is not present as an attribute at all. To make our code more stable we took this out. If a user wants such functionality, then I agree with Wolfgang that it should be a wrapper around getBM that does this.

Cheers,
Steffen
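The naming mismatch described above suggests that such a wrapper should take the filter name and the matching attribute name as separate arguments. The sketch below is one way this might look, under stated assumptions: `get_bm_with_input`, `key_attribute`, and `query` are hypothetical names, not biomaRt API. The `query` argument defaults to `biomaRt::getBM` but can be replaced by a stub, which is what the demonstration does so that it runs offline.

```r
# Hypothetical wrapper (not part of biomaRt): always request the attribute
# that echoes the filter, so the input values appear in the result.
# The filter and its attribute are separate arguments because their
# names can differ (e.g. filter "start", attribute "start_position").
get_bm_with_input <- function(attributes, filter, values, mart,
                              key_attribute = filter,
                              query = biomaRt::getBM) {
  # Put the echo attribute first and avoid requesting it twice.
  attrs <- unique(c(key_attribute, attributes))
  query(attributes = attrs, filters = filter, values = values, mart = mart)
}

# Offline demonstration: a stub that only reports what would be requested.
stub <- function(attributes, filters, values, mart) {
  list(attributes = attributes, filters = filters)
}
res <- get_bm_with_input("ensembl_gene_id", filter = "start", values = 1,
                         mart = NULL, key_attribute = "start_position",
                         query = stub)
print(res$attributes)
```

This does not cover the case where a filter has no corresponding attribute; there the caller has nothing to pass as `key_attribute`, which is the limitation described above.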
Thanks Steffen,

That's the answer I was looking for before I put some serious work into something that is probably an ad hoc thing at best.

Tony

On Mon, Nov 23, 2009 at 10:27 AM, <steffen@stat.berkeley.edu> wrote:

> To make our code more stable we took this out, and if a user wants such
> functionality then I agree with Wolfgang and it should be a wrapper
> around getBM that does this.