org.Hs.eg.db gives more than one ENTREZID for a gene symbol
harish • 0
@d734c3d2
Last seen 20 months ago
Germany

I have a list of gene symbols, and when I run the R code below I get two different Entrez IDs for each of the two gene symbols TEC and MEMO1. This makes my output list longer than my input list. How can I resolve this, and can a gene symbol really have two Entrez IDs?

library(org.Hs.eg.db)

HGNC_symbol <- c("TEC", "MEMO1")

conversion <- AnnotationDbi::select(org.Hs.eg.db, 
       keys = HGNC_symbol,
       columns = c("ENTREZID", "SYMBOL"),
       keytype = "SYMBOL")
org.Hs.eg.db AnnotationDbi • 2.0k views

OK, I understand the problem now. When I look up the Entrez IDs from the output in NCBI, each one retrieves a different gene, but both genes carry the same gene symbol. One of them is approved by HGNC and the other is not. How can I tell AnnotationDbi to treat my gene symbols as the ones approved by HGNC when I retrieve the Entrez IDs? This is much clearer if you look up MEMO1 in NCBI.

Bottom line: is there a way to specify gene symbols as HGNC gene symbols in AnnotationDbi?

@james-w-macdonald-5106
Last seen 21 hours ago
United States

There is no way to specify the source of gene symbols for an OrgDb. For TEC, one comes from HGNC, and the other comes from OMIM. When we generate the OrgDb packages, we don't distinguish between sources, as they are all (as far as NCBI is concerned) 'real' gene symbols. Unfortunately, gene symbols are not unique, and come from different sources (and get retired regularly), so one would ideally not use them for anything but presenting data to a biologist, for whom the gene symbol is usually the primary ID.

The easy way to get around this is to use mapIds instead.

> z <- mapIds(org.Hs.eg.db, c("TEC", "MEMO1"), "ENTREZID","SYMBOL")
'select()' returned 1:many mapping between keys and columns
> data.frame(ENTREZID = z, SYMBOL = names(z))
      ENTREZID SYMBOL
TEC       7006    TEC
MEMO1     7795  MEMO1

But do note that this is a naive implementation that simply takes the first match for each symbol. Setting multiVals = "list" shows every match:

> mapIds(org.Hs.eg.db, c("TEC", "MEMO1"), "ENTREZID","SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
$TEC
[1] "7006"      "100124696"

$MEMO1
[1] "7795"  "51072"

To follow up on this, we use the gene_info.gz file from NCBI and parse out the gene ID and symbol, which are the second and third columns. There is also an eleventh column called 'Symbol from nomenclature authority':

$ zcat gene_info.gz | awk -F '\t' '{ if($1 == 9606 && ($2 == 7795 || $2 == 51072)) print $2"\t"$3"\t"$11}'
7795    MEMO1   -
51072   MEMO1   MEMO1

## and

$ zcat gene_info.gz | awk -F '\t' '{ if($1 == 9606 && ($2 == 7006 || $2 == 100124696)) print $2"\t"$3"\t"$11}'
7006    TEC     TEC
100124696       TEC     -

## and

$ zcat gene_info.gz | awk -F '\t' '{ if($1 == 9606 && $11 == "-" && $3 != "-") print}' | wc -l
98090
$ zcat gene_info.gz | awk -F '\t' '{ if($1 == 9606 && $11 != "-") print}' | wc -l
43499
$ zcat gene_info.gz | awk -F '\t' '{ if($1 == 9606 && $11 == $3) print}' | wc -l
43462

So there are tons of genes (98090) whose symbols don't come from the nomenclature authority but that people might still want to use, plus 37 where the symbol and the nomenclature symbol don't agree.
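
If you do want only the symbols approved by the nomenclature authority, one option is to build your own lookup table from that eleventh column. The following is only a rough sketch, not something OrgDb does for you. It uses a toy stand-in for gene_info.gz that contains just the four rows quoted above, with columns 4 to 10 filled with "-" placeholders (an assumption made so the example is self-contained). With the real file you would pipe `zcat gene_info.gz` into the same awk filter instead.

```shell
#!/bin/sh
# Toy stand-in for gene_info.gz: only the four human rows quoted in this
# thread. Columns 4-10 are "-" placeholders, not real NCBI data.
tmp=$(mktemp)
{
  printf '9606\t7795\tMEMO1\t-\t-\t-\t-\t-\t-\t-\t-\n'
  printf '9606\t51072\tMEMO1\t-\t-\t-\t-\t-\t-\t-\tMEMO1\n'
  printf '9606\t7006\tTEC\t-\t-\t-\t-\t-\t-\t-\tTEC\n'
  printf '9606\t100124696\tTEC\t-\t-\t-\t-\t-\t-\t-\t-\n'
} > "$tmp"

# Keep human rows (tax ID 9606) that have a nomenclature-authority symbol
# in column 11, and print that symbol followed by the gene ID (column 2).
approved=$(awk -F '\t' '$1 == 9606 && $11 != "-" { print $11 "\t" $2 }' "$tmp")

printf '%s\n' "$approved"
rm -f "$tmp"
```

On the toy rows this keeps 51072 for MEMO1 and 7006 for TEC. Note that it keys the table on column 11 rather than column 3, so for the 37 rows where the two disagree you get the authority's symbol, and any symbol that exists only in column 3 is dropped.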
