Search queries with biomaRt do not align with online queries via Ensembl
@hotz-hans-rudolf-3951
Last seen 4.1 years ago
Switzerland
On 2/28/10 7:16 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:

> Hi Steffen et al,
>
> Quick question about a search query via biomaRt. Here is the code that I am using:
>
> *****
> library(biomaRt)
> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> filters = listFilters(ensembl)
> attributes = listAttributes(ensembl)
> getBM(attributes=c("ensembl_peptide_id", "entrezgene",
>                    "ensembl_gene_id", "hgnc_automatic_gene_name"),
>       filters="hgnc_automatic_gene_name", values="ATF4",
>       mart=ensembl)
> *****

try ' filters="hgnc_symbol" ', eg:

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id", "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="ATF4", mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000384587        468 ENSG00000128272                       NA
2    ENSP00000336790        468 ENSG00000128272                       NA
3    ENSP00000379912        468 ENSG00000128272                       NA

Hans

> For me, this returns an empty data frame. But when I query ATF4 online at
> ensembl, I find what I need. I also looked up ATF4 at genenames.org (HUGO)
> and it seems that ATF4 is a valid hgnc gene name, so the filter should be fine.
> I guess the only other reason that I can see is which dataset I use in the
> useMart function. I am guessing that the online API will search through all
> datasets while I am only specifying a single one? If this is true, do you
> know of a sensible workaround? I have about 150 genes that I would like
> mapped to Ensembl IDs, but using the code above with a vector of gene
> names, I can only map around 25...but if I manually query for some of the
> non-mapped gene names, I get what I am after. If I am wrong about my guess
> about the dataset, can you let me know what you think might be going on?
>
> Tony
>
> > sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-01-16 r50993)
> i386-apple-darwin10.2.0
>
> locale:
> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] hgu133plus2.db_2.3.5 org.Hs.eg.db_2.3.6   Rgraphviz_1.25.1
>  [4] biomaRt_2.3.0        GOstats_2.13.0       RSQLite_0.8-1
>  [7] DBI_0.2-5            Category_2.13.0      AnnotationDbi_1.9.4
> [10] Biobase_2.7.3        RBGL_1.23.0          graph_1.25.5
>
> loaded via a namespace (and not attached):
> [1] annotate_1.25.1   genefilter_1.29.5 GO.db_2.3.5       GSEABase_1.9.0
> [5] RCurl_1.3-1       splines_2.11.0    survival_2.35-8   tools_2.11.0
> [9] XML_2.6-0         xtable_1.5-6
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
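For the batch case mentioned in the question (about 150 symbols, of which only some map), getBM() only returns rows for values that matched, so unmapped symbols drop out silently. The following is a minimal sketch, not a definitive recipe, of querying with the "hgnc_symbol" filter and then listing what did not come back. The helper names unmapped_symbols() and map_symbols(), the mock result table, and the placeholder symbol "NOT_A_GENE" are assumptions made for illustration; the two Ensembl gene IDs are the ones shown in this thread. Attribute and filter names can differ between Ensembl releases, so check listFilters() and listAttributes() first.

```r
# Pure helper: given the symbols that were asked for and the data frame a
# query returned, list the symbols that produced no row at all.
unmapped_symbols <- function(requested, result, column = "hgnc_symbol") {
  setdiff(unique(requested), unique(result[[column]]))
}

# Network-dependent part (hypothetical wrapper). It is only a definition:
# nothing is contacted until map_symbols() is called with a real mart,
# e.g. one created by useMart("ensembl", dataset = "hsapiens_gene_ensembl").
map_symbols <- function(symbols, mart) {
  biomaRt::getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"),
                 filters    = "hgnc_symbol",
                 values     = symbols,
                 mart       = mart)
}

# Offline illustration with a hand-made result table standing in for the
# output of map_symbols(); "NOT_A_GENE" is a made-up placeholder.
requested   <- c("ATF4", "IGHA2", "NOT_A_GENE")
mock_result <- data.frame(
  hgnc_symbol     = c("ATF4", "IGHA2"),
  ensembl_gene_id = c("ENSG00000128272", "ENSG00000211890"),
  stringsAsFactors = FALSE
)
unmapped_symbols(requested, mock_result)
```

Requesting the filter column ("hgnc_symbol") as an attribute as well is what makes the comparison possible: without it, the result cannot be matched back to the input vector.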
GO hgu133plus2 biomaRt • 1.2k views
Tony Chiang ▴ 570
@tony-chiang-1769
Last seen 10.2 years ago
Thanks Hans,

That worked much better. Quick follow-up question then (I guess for anyone who might know the answer): when would we use the hgnc gene names rather than the symbols? It would appear that ATF4 is a valid hgnc gene name, so I thought that the obvious choice would have been to filter based on hgnc_automatic_gene_name, but this is obviously not the case. I guess what I am trying to ask is: how do I know what to use as the filter when there seems to be an obvious candidate to choose, but it is not the correct one?

Cheers,
--Tony
Hi Tony,

ATF4 isn't a valid gene name, it's a HUGO gene symbol. The gene name can be retrieved using the 'description' attribute. So you have to know that ATF4 is a gene symbol, and that Ensembl calls these things hgnc_symbols.

But your question still remains: how does one decide which of the often inscrutable filters/attributes to use to get a set of results? This is compounded by the fact that Ensembl will sometimes change what they call things. For instance, hgnc_symbol was once simply symbol. And for a while there, one had to know that for humans you used symbol, but for mice you used mgi_symbol...

There isn't a quick answer to this question. Steffen added a second column to the output of both listFilters() and listAttributes() that may help (although it is often the same as the first, minus the underscores). What it often comes down to is trial and error, choosing different attributes that might plausibly return what you want.

One strategy I use is to try the shortest possible attribute name that might describe what I want. It seems that the more descriptors are added to a given attribute, the less data there is on the back end. So for instance, something like hgnc_automatic_gene_name would be quite low on a list of attributes that I would explore. OTOH, "curated" might be more useful, so hgnc_curated_gene_name to me is more likely to bear fruit.

> getBM(c("hgnc_symbol","description","hgnc_curated_gene_name"), "hgnc_symbol", "ATF4", mart)
  hgnc_symbol
1        ATF4
  description
1 Cyclic AMP-dependent transcription factor ATF-4 (cAMP-dependent transcription factor ATF-4)(Activating transcription factor 4)(DNA-binding protein TAXREB67)(Cyclic AMP-responsive element-binding protein 2)(cAMP-responsive element-binding protein 2)(CREB-2) [Source:UniProtKB/Swiss-Prot;Acc:P18848]
  hgnc_curated_gene_name
1                   ATF4

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
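The trial and error described above can be narrowed a little by searching both columns of the listFilters() or listAttributes() output for a keyword before running any query. This is only a sketch under assumptions: find_entries() is a hypothetical helper, it relies on the first two columns being the internal name and the human-readable description (as described in this post), and the mock table below, including its description strings, is hand-made for an offline illustration rather than copied from Ensembl.

```r
# Hypothetical helper: keep the rows of a listFilters()/listAttributes()
# style table whose first column (name) or second column (description)
# matches a case-insensitive pattern.
find_entries <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl[[1]], ignore.case = TRUE) |
         grepl(pattern, tbl[[2]], ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Hand-made stand-in for listFilters(ensembl); descriptions are invented
# to mimic "the name minus the underscores".
mock_filters <- data.frame(
  name = c("hgnc_symbol", "hgnc_curated_gene_name",
           "hgnc_automatic_gene_name", "chromosome_name"),
  description = c("HGNC symbol", "HGNC curated gene name",
                  "HGNC automatic gene name", "Chromosome name"),
  stringsAsFactors = FALSE
)

find_entries(mock_filters, "hgnc")

# With a live connection the same call would be, for example:
#   find_entries(listFilters(ensembl), "hgnc")
```

The candidates that come back still have to be tried against a known gene, but the list is usually short enough to check by hand.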
Hi James,

See inline...

On Mon, Mar 1, 2010 at 7:55 AM, James W. MacDonald <jmacdon@med.umich.edu> wrote:

> Hi Tony,
>
> ATF4 isn't a valid gene name, it's a HUGO gene symbol. The gene name can be
> retrieved using the 'description' attribute. So you have to know that ATF4
> is a gene symbol, and that Ensembl calls these things hgnc_symbols.

Yes, I have since figured this out with Hans's e-mail.

> But your question still remains. How to decide which of the often
> inscrutable filters/attributes should one use to get a set of results? This
> is compounded by the fact that Ensembl will sometimes change what they call
> things. For instance, hgnc_symbol was once simply symbol. And for a while
> there, one had to know that for humans you used symbol, but for mice you
> used mgi_symbol...

Not only does the name change, but sometimes it is simply counter-intuitive to what most (and by most, I mean ME) people would do.

> There isn't a quick answer to this question. Steffen added a second column
> to the output of both listFilters() and listAttributes() that may help
> (although often times it is the same as the first, minus the underscores).
> What it often comes down to is trial and error, choosing different
> attributes that might plausibly return what you want.

This was the answer that I dreaded...trial and error!

> One strategy I use is to try the shortest possible attribute name that
> might describe what I want. It seems the more descriptors are added to a
> given attribute, the less data on the back end. So for instance, something
> like hgnc_automatic_gene_name would be quite low on a list of attributes
> that I would explore. OTOH, "curated" might be more useful, so
> hgnc_curated_gene_name to me is more likely to bear fruit.

Sounds like a good strategy. I guess I should use symbols or the shortest possible descriptors and figure out what to do with the unmapped ones downstream.

Thanks Jim!

Tony
On 3/1/10 4:07 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:

> Thanks Hans,
>
> That worked much better. Quick follow up question then (I guess for anyone
> who might know the answer), when would we use the hgnc gene names rather the
> the symbols? It would appear that ATF4 is a valid hgnc gene name

As far as I understand, 'hgnc_symbol' should always work (if the symbol exists). The HGNC assigns (or rather approves) 'symbols', and 'names' refer to the written-out names, see:

http://www.genenames.org/data/hgnc_data.php?hgnc_id=786

Ensembl uses the HGNC symbol as 'Name', see:

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000128272

=> notice the label 'curated'

Hence, for this particular symbol, you can also use the biomart filter "hgnc_curated_gene_name", eg:

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id", "hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="ATF4", mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000384587        468 ENSG00000128272                       NA
2    ENSP00000336790        468 ENSG00000128272                       NA
3    ENSP00000379912        468 ENSG00000128272                       NA

However, if you look at 'IGHA2', see:

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000211890

=> notice the label 'automatic'

Hence, the biomart filter "hgnc_curated_gene_name" will not work, but the biomart filter "hgnc_automatic_gene_name" will work, eg:

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id", "hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="IGHA2", mart=ensembl)
[1] ensembl_peptide_id       entrezgene               ensembl_gene_id
[4] hgnc_automatic_gene_name
<0 rows> (or 0-length row.names)

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id", "hgnc_automatic_gene_name"), filters="hgnc_automatic_gene_name", values="IGHA2", mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2

and 'hgnc_symbol' always works, eg:

> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id", "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="IGHA2", mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2

Now, the follow-up question is: how does Ensembl distinguish between 'curated' and 'automatic'? Well, I am no longer fully familiar with Ensembl, but I assume that the entry for IGHA2 has no support (not yet) from their manual curators...there is also no link back to Vega on the HGNC web page for 'IGHA2', and there is one for 'ATF4'.

I hope this clarifies the situation.

Hans
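The comparison above (same symbol, three different filters) can be automated so that candidate filters are tested side by side. The sketch below is an illustration under assumptions, not part of biomaRt: count_hits() and mock_query() are hypothetical names, and mock_query() simply replays the IGHA2 results reported in this post so the example runs without a connection. The filter names are the ones used in this thread and may not exist in other Ensembl releases.

```r
# Hypothetical helper: run the same value through several candidate filters
# and report how many rows each one returns. `query` is any function of
# (filter, value) that returns a data frame.
count_hits <- function(candidates, value, query) {
  vapply(candidates, function(f) nrow(query(f, value)), integer(1))
}

# Offline stand-in for a live query. It replays what this post reports for
# IGHA2: three peptide rows via hgnc_symbol and hgnc_automatic_gene_name,
# and zero rows via hgnc_curated_gene_name.
mock_query <- function(filter, value) {
  peptides <- c("ENSP00000418606", "ENSP00000374980", "ENSP00000374981")
  hit <- value == "IGHA2" &&
    filter %in% c("hgnc_symbol", "hgnc_automatic_gene_name")
  data.frame(ensembl_peptide_id = if (hit) peptides else character(0),
             stringsAsFactors = FALSE)
}

candidates <- c("hgnc_symbol", "hgnc_curated_gene_name",
                "hgnc_automatic_gene_name")
res <- count_hits(candidates, "IGHA2", mock_query)
res

# With a live mart, `query` could be something like:
#   function(f, v) getBM(attributes = "ensembl_peptide_id",
#                        filters = f, values = v, mart = ensembl)
```

A filter that returns zero rows for a gene known to exist on the Ensembl website is a hint that the symbol lives under a differently named filter, which is the situation the original question ran into.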
Hi Hans,

Yes, that was all very helpful. So my mistake was to assume that a gene name is the actual name of the gene rather than a description of the gene. It is there in the documentation so I should not complain...though very counter-intuitive, I would argue. Thanks for all your help on this!

Cheers,
--Tony