Hi,
I'm using the R package biomaRt to map Ensembl gene IDs to HGNC symbols. I find that some Ensembl IDs map to multiple symbols. For example:
library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id",
      values = c("ENSG00000187510", "ENSG00000230417", "ENSG00000276085"), mart = mart)
  ensembl_gene_id hgnc_symbol
1 ENSG00000187510    C12orf74
2 ENSG00000187510     PLEKHG7
3 ENSG00000230417   LINC00595
4 ENSG00000230417   LINC00856
5 ENSG00000276085      CCL3L1
6 ENSG00000276085      CCL3L3
> packageVersion("biomaRt")
[1] ‘2.38.0’
This is unsurprising, given that we don't expect a 1:1 mapping. What is confusing, however, is that if I look up those IDs on the Ensembl website, each one shows exactly one symbol. That is,
ENSG00000187510 -> C12orf74
ENSG00000230417 -> LINC00856
ENSG00000276085 -> CCL3L1
In theory, biomaRt just runs a query against the online Ensembl database, so we should expect the same results given the same version of the database. So I want to know why we get this discrepancy.
Thanks,
Right. But I'm curious about why http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000230417;r=10:78179185-78551355 shows LINC00856 as the Name in the Summary section. Does that imply that Ensembl regards LINC00856 as a more canonical symbol than the other one?
That's a question for EBI/EMBL, no? I'm not sure why you would think anybody at the Bioconductor support site would have any particular insight as to their thinking about what symbol is more or less canonical than any other.
Good idea.
According to Ensembl's reply, when a gene has multiple HGNC symbols, they arbitrarily pick one of them to display in the Summary.
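Since the choice on the website is arbitrary, anyone who needs exactly one row per Ensembl ID has to pick their own rule. Below is a minimal base R sketch of two options. The data frame is re-typed by hand from the getBM() output above so that it runs offline; the names `res`, `collapsed` and `first_symbol` are made up for illustration, and the "alphabetically first" rule is just an assumption, not what Ensembl does.

```r
# getBM() result from the question, typed in by hand (no network needed).
res <- data.frame(
  ensembl_gene_id = c("ENSG00000187510", "ENSG00000187510",
                      "ENSG00000230417", "ENSG00000230417",
                      "ENSG00000276085", "ENSG00000276085"),
  hgnc_symbol = c("C12orf74", "PLEKHG7", "LINC00595", "LINC00856",
                  "CCL3L1", "CCL3L3"),
  stringsAsFactors = FALSE
)

# Option 1: keep every symbol, collapsed into one row per gene.
collapsed <- aggregate(
  hgnc_symbol ~ ensembl_gene_id, data = res,
  FUN = function(x) paste(sort(unique(x)), collapse = ";")
)

# Option 2: keep a single symbol per gene using an explicit,
# reproducible rule (here: the alphabetically first symbol).
ordered <- res[order(res$ensembl_gene_id, res$hgnc_symbol), ]
first_symbol <- ordered[!duplicated(ordered$ensembl_gene_id), ]
```

Note that option 2 keeps LINC00595 for ENSG00000230417, whereas the website shows LINC00856, so any fixed rule like this will not necessarily agree with the Summary page.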