On the clusterProfiler GitHub, a question was posted about the use of symbols for mitochondrial genes.
It basically boils down to whether a hyphen in a symbol may cause a problem with org.Hs.eg.db, preventing the symbol from being present in the database.
To illustrate the issue:
A set of 4 'official' mitochondrial symbols:
mito.hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CYB")
When the OrgDb is queried to retrieve the corresponding ENTREZID, nothing is found:
> library(org.Hs.eg.db)
> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "SYMBOL",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.
>
The same happens when ALIAS is used:
> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "ALIAS",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'ALIAS'. Please use the keys method to see a listing of valid arguments.
>
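To separate "the hyphen is the problem" from "these particular symbols are just not stored", one could check the keys directly. This is only a minimal sketch: it assumes org.Hs.eg.db is installed (the guard skips the check otherwise), and it uses nothing beyond keys() from AnnotationDbi plus base R.

```r
# The four HGNC-style symbols from the question.
mito.hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CYB")

# Only run the check when the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # All SYMBOL keys known to the installed OrgDb.
  sym <- AnnotationDbi::keys(org.Hs.eg.db::org.Hs.eg.db, keytype = "SYMBOL")

  # Which of the queried symbols are valid SYMBOL keys?
  print(setNames(mito.hgnc %in% sym, mito.hgnc))

  # Number of stored symbols that contain a hyphen. A non-zero count would
  # suggest that the hyphen as such is not what keeps the MT- symbols out.
  print(sum(grepl("-", sym, fixed = TRUE)))
}
```

The same check can be repeated with keytype = "ALIAS".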
Using a code snippet to retrieve all mitochondrial genes (previously posted on this forum by James):
## check all genes on chromosome M
> z <- unlist(as.list(org.Hs.egCHR))
> mito.egids <- names(z)[z %in% "MT"]
> mito.egids
[1] "4508" "4509" "4511" "4512" "4513" "4514" "4519" "4535" "4536" "4537"
[11] "4538" "4539" "4540" "4541" "4549" "4550" "4553" "4555" "4556" "4558"
[21] "4563" "4564" "4565" "4566" "4567" "4568" "4569" "4570" "4571" "4572"
[31] "4573" "4574" "4575" "4576" "4577" "4578" "4579"
>
> head( AnnotationDbi::select(org.Hs.eg.db, keys = mito.egids, keytype = "ENTREZID",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") ) )
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL                         GENENAME
1     4508   ATP6        ATP synthase F0 subunit 6
2     4509   ATP8        ATP synthase F0 subunit 8
3     4511   TRNC                         tRNA-Cys
4     4512   COX1   cytochrome c oxidase subunit I
5     4513   COX2  cytochrome c oxidase subunit II
6     4514   COX3 cytochrome c oxidase subunit III
>
Mmm, the SYMBOL values are ATP6, ATP8, etc. No MT prefix...
Yet at both NCBI and HGNC the official symbols for all 4 input symbols carry the MT prefix...?
(... and the official symbol is MT-CO1 rather than COX1...)
MT-ATP6
https://www.ncbi.nlm.nih.gov/gene/4508
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7414
MT-ATP8
https://www.ncbi.nlm.nih.gov/gene/4509
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7415
MT-CO1
https://www.ncbi.nlm.nih.gov/gene/4512
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7419
MT-CYB
https://www.ncbi.nlm.nih.gov/gene/4519
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7427
Thus, is this expected behavior, or maybe not? Any insights would be appreciated.
G
BTW, I understand that the OrgDb packages are generated a while before a new Bioconductor release, and at NCBI it is stated that the last info update was on October 28, but ~2 weeks ago the prefix MT was already in use (and maybe even long before then, but I don't know that...).
> sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.20.0 AnnotationDbi_1.68.0 IRanges_2.40.0
[4] S4Vectors_0.44.0 Biobase_2.66.0 BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] crayon_1.5.3 vctrs_0.6.5 httr_1.4.7
[4] cli_3.6.3 rlang_1.1.4 DBI_1.2.3
[7] png_0.1-8 UCSC.utils_1.2.0 jsonlite_1.8.9
[10] bit_4.5.0 Biostrings_2.74.0 KEGGREST_1.46.0
[13] fastmap_1.2.0 GenomeInfoDb_1.42.0 memoise_2.0.1
[16] compiler_4.4.2 RSQLite_2.3.7 blob_1.2.4
[19] pkgconfig_2.0.3 XVector_0.46.0 R6_2.5.1
[22] GenomeInfoDbData_1.2.13 tools_4.4.2 bit64_4.5.2
[25] zlibbioc_1.52.0 cachem_1.1.0
>
Thanks; I did not know which file exactly is being used. Good catch regarding the discrepancy between the 2 web pages. I will get in touch with NCBI regarding this issue.
If you care to know more about the process, there is a GitHub repository with all the code. It's pretty complicated though; the README alone is about War and Peace length.
I have contacted NCBI, and meanwhile got an answer. See below.
It turns out that the official, HGNC-approved symbols are present in the aforementioned file gene_info.gz, namely in column 11 (Symbol_from_nomenclature_authority). I manually looked up the above-mentioned mitochondrial genes (lines 12218018 - 12218026), and indeed the HGNC symbols are present in that column.
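To illustrate how column 11 could be used for a lookup, here is a rough sketch on a toy table. The helper hgnc_to_entrez() is hypothetical (it is not part of any package), and the three rows only reuse values shown earlier in this thread; in particular, putting the OrgDb symbols in the Symbol column is an assumption for illustration, not something checked against the real file.

```r
# Toy stand-in for three rows of gene_info (the real file has more columns
# and many more rows). GeneID and symbols are the ones shown in this thread;
# that Symbol holds the OrgDb values is an assumption.
gene_info <- data.frame(
  GeneID = c("4508", "4509", "4512"),
  Symbol = c("ATP6", "ATP8", "COX1"),
  Symbol_from_nomenclature_authority = c("MT-ATP6", "MT-ATP8", "MT-CO1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map HGNC-approved symbols to Entrez Gene IDs by
# matching against column 11 instead of the Symbol column.
hgnc_to_entrez <- function(symbols, info) {
  idx <- match(symbols, info$Symbol_from_nomenclature_authority)
  setNames(info$GeneID[idx], symbols)
}

mito.hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CYB")
res <- hgnc_to_entrez(mito.hgnc, gene_info)
print(res)  # MT-CYB is NA here only because the toy table has no row for it
```

With the real file, the same helper should work on a data frame read from gene_info.gz, as long as the column is named Symbol_from_nomenclature_authority after import.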
@James: yes, that is indeed a lot of code. Do you happen to know where in the code it is defined which columns are extracted? I noticed that after downloading, the gene_info file is first filtered for the organisms for which Bioconductor provides an OrgDb (getsrc.sh, https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/gene/script/getsrc.sh).
In the srcdb.sql script the relevant annotation information seems to be extracted (https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/2cb9f84f10c836008eb52aea2ae939749f818201/annosrc/gene/script/srcdb.sql#L32-L49), but I wasn't able to find out what exactly is used for default_gene_symbol.
Anyway, do you think it would be possible to include the Symbol_from_nomenclature_authority (column 11) and Full_name_from_nomenclature_authority (column 12) information in the OrgDb as well?
Reply NLM Support: