biomart query: ensembl gene id and entrez gene id confusion
Natasha ▴ 440
@natasha-4640
Last seen 10.2 years ago
Dear List,

I want to extract Ensembl gene IDs from Biomart to add to my microarray analysis output. However, some discrepancies between the Entrez gene IDs and the Ensembl gene IDs have me confused.

Array used: Illumina HumanHT12 v4.

As an example, take the GAGE12F, GAGE12G and GAGE12I genes.

Microarray (Illumina HT12 v4) output:

Entrez_Gene_ID  Symbol   Chromosome  Probe_Id      Probe_Type  Cytoband  Definition
26748           GAGE12I  X           ILMN_1691563  A           Xp11.23b  Homo sapiens G antigen 12I (GAGE12I), mRNA.
100008586       GAGE12F  X           ILMN_3242920  S           Xp11.23b  Homo sapiens G antigen 12F (GAGE12F), mRNA.
645073          GAGE12G  X           ILMN_1664660  S           Xp11.23b  Homo sapiens G antigen 12G (GAGE12G), mRNA.

Biomart output:

      entrezgene  ensembl_gene_id  hgnc_symbol
1     100008586   ENSG00000241465  GAGE12I
2     100008586   ENSG00000236362  GAGE12F
3     100008586   ENSG00000215269  GAGE12G
1022  26748       ENSG00000241465  GAGE12I
1023  26748       ENSG00000236362  GAGE12F
1024  26748       ENSG00000215269  GAGE12G
2392  645073      ENSG00000241465  GAGE12I
2393  645073      ENSG00000236362  GAGE12F
2394  645073      ENSG00000215269  GAGE12G

So please help me understand why there are multiple results for each Entrez ID rather than a single unique match. If I merged the two tables on the Entrez ID as shown above, I would get an incorrectly merged file. (I cannot use the Illumina HT12 probe IDs as a filter, as I was informed that in Biomart these are mapped to the HT12 v3 chip.)
R code and sessionInfo:

library(biomaRt)
library(DESeq)
library(gdata)

m_h_a2  # 3995 15 (limma output for a given comparison)

length(unique(m_h_a2$Entrez_Gene_ID))  # 3506
length(unique(m_h_a2$Symbol))          # 3542
length(unique(m_h_a2$Probe_Id))        # 3995

## Non-NA's
mh.ona = na.omit(m_h_a2)  # 3912 17

## Unique ids
mh.u.eg = m_h_a2[match(unique(m_h_a2$Entrez_Gene_ID), m_h_a2$Entrez_Gene_ID), ]  # 3506 15
mh.u.eg = na.omit(mh.u.eg)  # 3505 15

ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

mh_eg.ens <- getBM(attributes = c("entrezgene", "ensembl_gene_id", "hgnc_symbol"),
                   filters = "entrezgene",
                   values = mh.u.eg$Entrez_Gene_ID,
                   mart = ensembl)  # 3305 3

### I would like to merge mh.u.eg with mh_eg.ens

## sessionInfo
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] scatterplot3d_0.3-33 WriteXLS_2.1.0  gdata_2.8.2
[4] DESeq_1.4.1          locfit_1.5-6    lattice_0.19-23
[7] akima_0.5-4          Biobase_2.12.1  biomaRt_2.8.0

loaded via a namespace (and not attached):
 [1] annotate_1.30.0    AnnotationDbi_1.14.1 DBI_0.2-5
 [4] genefilter_1.34.0  geneplotter_1.30.0   grid_2.13.0
 [7] gtools_2.6.2       RColorBrewer_1.0-2   RCurl_1.6-4
[10] RSQLite_0.9-4      splines_2.13.0       survival_2.36-5
[13] tools_2.13.0       XML_3.4-0            xtable_1.5-6

Many Thanks,
Natasha
@james-w-macdonald-5106
Last seen 1 day ago
United States
Hi Natasha,

On 8/23/2011 8:37 AM, Natasha Sahgal wrote:
> So please help me understand, why are there multiple results rather
> than true unique results. If I merge the two, based on the above, I
> would get an incorrectly merged file. (I cannot use the Illumina HT12
> probe ids as a filter, as I was informed that in biomart these are
> mapped to the HT12 v3 chip).

This has to do with the difference between manufacturer mappings and the annotation of genes. When a manufacturer creates a chip, they intend each reporter to interrogate a single transcript, and it is not unheard of for them to ignore other cross-hybridizing transcripts. On the other hand, the annotation of genes, and especially the cross-listing between annotation databases, cannot be so single-minded.

In the case of the GAGE genes, there are multiple transcripts with the same name, and multiple genomic positions for each transcript. If you look at the UCSC Genome Browser at this position:

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=208507527&hgt_doJsCommand=&position=chrX%3A49%2C323%2C231-49%2C333%2C657&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=1&hgtgroup_compGeno_close=0&hgtgroup_neandertal_close=0&hgtgroup_varRep_close=0

you can see that there are transcripts labeled GAGE12C, D, E, F, G, and I that are identical but for the UTR regions. I am actually surprised that the Biomart server didn't return these other GAGE12 transcripts as well.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
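To make the many-to-many mapping concrete, here is a minimal sketch in base R that uses only the rows posted in this thread (no Biomart query is run). It counts how many Ensembl genes each Entrez ID maps to and shows how a merge on Entrez ID alone multiplies rows. The last step, additionally requiring the gene symbol to agree, is an assumption added for illustration and was not suggested in the thread; it only helps when the Illumina symbol and the HGNC symbol match, so treat it as one possible filter rather than a general fix. The object names bm, arr, naive and strict are hypothetical.

```r
# Biomart rows copied from the post: every Entrez ID maps to all three Ensembl genes.
bm <- data.frame(
  entrezgene      = rep(c(100008586L, 26748L, 645073L), each = 3),
  ensembl_gene_id = rep(c("ENSG00000241465", "ENSG00000236362", "ENSG00000215269"), times = 3),
  hgnc_symbol     = rep(c("GAGE12I", "GAGE12F", "GAGE12G"), times = 3),
  stringsAsFactors = FALSE
)

# Illumina annotation rows copied from the post.
arr <- data.frame(
  Entrez_Gene_ID = c(26748L, 100008586L, 645073L),
  Symbol         = c("GAGE12I", "GAGE12F", "GAGE12G"),
  Probe_Id       = c("ILMN_1691563", "ILMN_3242920", "ILMN_1664660"),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl genes per Entrez ID; more than one means an ambiguous mapping.
n_ens <- tapply(bm$ensembl_gene_id, bm$entrezgene, function(x) length(unique(x)))
ambiguous <- names(n_ens)[n_ens > 1]

# Merging on Entrez ID alone pairs each probe with every Ensembl gene listed for that ID.
naive <- merge(arr, bm, by.x = "Entrez_Gene_ID", by.y = "entrezgene")

# Assumed workaround (not from the thread): also require the symbols to agree.
strict <- merge(arr, bm,
                by.x = c("Entrez_Gene_ID", "Symbol"),
                by.y = c("entrezgene", "hgnc_symbol"))

print(n_ens)
print(nrow(naive))
print(strict)
```

Checking nrow() before and after a merge like this is a cheap way to notice that a supposedly one-to-one annotation join has expanded the table.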