BioMart missing IDs
rina ▴ 30
@rina-16738
Last seen 14 months ago
France

Looking at the NAs that came up after mapping Ensembl IDs to Entrez IDs with BioMart, I checked one at random (ENSG00000018607) and found that it is in fact linked to an Entrez ID, yet the mapping did not return it. Any ideas what the reason might be?

This is the code I used:

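    library(biomaRt)

    # Connect to the Ensembl BioMart and select the human gene dataset
    mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

    # Map Ensembl gene IDs (version suffix removed) to Entrez Gene IDs.
    # In more recent Ensembl releases this attribute is named "entrezgene_id".
    genes.entrez <- getBM(
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "entrezgene"),
      values = genes.nodot,
      mart = mart)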

Note that I originally had a data frame of raw expression counts keyed by versioned Ensembl IDs of the form

    [1] "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11" "ENSG00000000971.14" "ENSG00000001036.12" "ENSG00000001084.9"
    [9] "ENSG00000001167.13"

So I removed the dot suffix to do the mapping.
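For reference, a minimal sketch of one way the version suffix could be stripped in base R. The short `genes` vector is only an illustration built from the IDs shown above, and `genes.nodot` is the name used in the `getBM()` call; the exact approach used originally is not shown, so treat this as an assumption.

```r
# Versioned Ensembl gene IDs, as they appear in the count table
# (a short illustrative subset of the IDs listed above)
genes <- c("ENSG00000000005.5", "ENSG00000000419.11", "ENSG00000001084.9")

# Drop only the trailing ".<version>" part; the stable ID itself is untouched
genes.nodot <- sub("\\.[0-9]+$", "", genes)

print(genes.nodot)
```

Anchoring the pattern at the end of the string with `$` means an ID without a version suffix passes through unchanged.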

The results I get after the mapping look like this.

      ensembl_gene_id entrezgene
    1 ENSG00000000005      64102
    2 ENSG00000001561      22875
    3 ENSG00000004478       2288
    4 ENSG00000004799       5166
    5 ENSG00000005022        292
    6 ENSG00000005073       3207

Any help would be much appreciated, as I am pretty new to using R.

biomart ensembl entrez gene identifiers • 2.8k views
@james-w-macdonald-5106
Last seen 1 day ago
United States

Please note that the gene you picked isn't a good example. According to Ensembl, ENSG00000018607 is an unprocessed pseudogene. Entrez Gene, on the other hand, says it's a coding gene. These are not the same thing! So I think it's good that BioMart isn't saying they are. This is actually spelled out on the Ensembl page, which says that there is an overlapping gene in Entrez Gene, but that the two groups differ as to what the underlying thing is.

Mapping between the two annotation services is a fraught enterprise, and I try to avoid doing so if at all possible, because there are any number of little technical details like this and it's not clear who is right. I mean, you have two groups with lots of people who spend lots of time trying to figure this stuff out, and they disagree over a fundamental issue of whether or not this thing is a pseudogene that doesn't get expressed, or a real gene that codes for proteins. And that's just one gene (or not). There is no way for one person to resolve these conflicts, particularly in bulk, using programmatic methods. So you should either simply accept what mappings you get, or just stick with a single annotation service, and be clear about which one you used.
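If you do accept the mappings as they come, it can help to record which IDs have no Entrez match rather than silently dropping them. Here is a rough sketch in base R. The `genes.entrez` data frame is a toy stand-in for a `getBM()` result (the first two rows are copied from the output in the question, the `NA` row mimics the case being asked about), and `ENSG99999999999` is a made-up placeholder ID, not a real gene.

```r
# Toy stand-in for a getBM() result; not fetched from BioMart
genes.entrez <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000001561", "ENSG00000018607"),
  entrezgene      = c(64102L, 22875L, NA_integer_),
  stringsAsFactors = FALSE
)

# The IDs that were queried; the last one is a hypothetical placeholder
# standing in for an ID that produced no row at all
genes.nodot <- c("ENSG00000000005", "ENSG00000001561",
                 "ENSG00000018607", "ENSG99999999999")

# Left join on the queried IDs so that every input ID keeps a row
query  <- data.frame(ensembl_gene_id = genes.nodot, stringsAsFactors = FALSE)
mapped <- merge(query, genes.entrez, by = "ensembl_gene_id", all.x = TRUE)

# IDs without an Entrez match (NA returned, or no row returned)
unmapped <- mapped$ensembl_gene_id[is.na(mapped$entrezgene)]

# IDs that map to more than one Entrez ID appear on several rows
multi <- unique(mapped$ensembl_gene_id[duplicated(mapped$ensembl_gene_id)])

print(unmapped)
print(multi)
```

Keeping `unmapped` and `multi` alongside the results makes it easy to state exactly which genes were affected by the ID conversion.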

That was a great and really helpful answer! I will take your pointers into consideration and try to tweak the workflow in a way that I avoid converting IDs. Much appreciated! 
