Question

How to resolve NAs when annotating a diff gene list with org.Hs.eg.db terms?

anthony.nash ▴ 20

@anthonynash-14843

University of Oxford

I'm having trouble resolving NAs when annotating a list of differentially expressed (diff) genes with Entrez IDs, using Ensembl IDs as keys. I would be surprised if this hadn't been asked before, but finding answers and suggestions is half the battle when you're not quite sure what to look for.

I've followed the steps outlined in the Bioconductor RNA-seq DESeq2 tutorial. The reference transcriptome is Homo_sapiens.GRCh38.v100, with coding and non-coding transcripts combined. I have a list of diff genes for each of several compound-treatment experiments. I need Entrez IDs for a chemistry process downstream from here, so I ran what you would expect:

library("AnnotationDbi")
library("org.Hs.eg.db")
# ens.str holds the Ensembl gene IDs of the rows in resAmi
resAmi$entrez <- mapIds(org.Hs.eg.db,
                     keys=ens.str,
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")  # keep only the first match per key

A good proportion of the diff genes in each list are given an Entrez ID of NA. Firstly, why are there NAs? Is it something to do with Ensembl dropping gene mappings after a particular version of their DB? That was a random comment I found on Biostars!

Secondly, I tried annotating with EnsDb.Hsapiens.v86 instead, a complete stab in the dark in an effort to understand more. This resolved some of the NAs, but many of the Entrez values I previously had from org.Hs.eg.db are now different. In fact, I can see two Ensembl ID entries sharing the same Entrez ID, depending on the annotation DB. Which annotation DB is appropriate?

Here's just a snippet of what I'm seeing (sorry about the formatting; note Entrez ID 50831, which appears in both Entrez columns but on different rows):

Symbol | Entrez | Entrez_ens | txbiotype | LFC
NA | NA | 2920 | protein_coding | -26.4976570057622
TAS2R3 | 50831 | 1417 | protein_coding | -16.5810022683443
NA | NA | 50831 | protein_coding | 17.5184870830614
NA | NA | 102724652 | protein_coding | 14.3289350041311
ARHGAP11B | 89839 | NA | processed_pseudogene | -16.7365692557264

The two Entrez columns came from these sources: Entrez = org.Hs.eg.db, Entrez_ens = EnsDb.Hsapiens.v86.

Any information you can spare is greatly appreciated.

annotation rna

updated 4.5 years ago by James W. MacDonald 68k • written 4.5 years ago by anthony.nash ▴ 20

score 3 · Accepted Answer · 2020-10-14

The NA values appear because there isn't an NCBI Gene ID that corresponds to a given Ensembl Gene ID. There can be any number of reasons for that, but the root cause is that you have two different annotation services that are each trying to define what is and isn't a gene, and where the genes reside. They do that in two different ways. If I were to generalize, I would say that NCBI takes a bottom-up approach, where they first gathered information about genes and transcripts and then tried to localize them on the genome, and EBI/EMBL takes a top-down approach, where they take the genome and what they know of genes/transcripts to infer where the genes reside.
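To see how many keys are affected, and whether an unmapped Ensembl ID is known to org.Hs.eg.db at all, something like the following may help. It is only a sketch: the three IDs in `ens.str` are placeholders (two real gene IDs and one made-up ID) standing in for the poster's vector.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Placeholder input (an assumption, not the poster's data): two real
# Ensembl gene IDs (TP53, BRCA1) plus one made-up ID.
ens.str <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000000000")

entrez <- mapIds(org.Hs.eg.db,
                 keys = ens.str,
                 column = "ENTREZID",
                 keytype = "ENSEMBL",
                 multiVals = "first")

# How many keys came back without an NCBI Gene ID?
n_missing <- sum(is.na(entrez))
message(n_missing, " of ", length(entrez), " Ensembl IDs have no Entrez ID")

# Is each Ensembl ID present in org.Hs.eg.db at all?
in_orgdb <- ens.str %in% keys(org.Hs.eg.db, keytype = "ENSEMBL")

# Unmapped IDs that the package has never heard of
unmapped_unknown <- ens.str[is.na(entrez) & !in_orgdb]
print(unmapped_unknown)
```

IDs listed in `unmapped_unknown` have no record in the package at all, which is what you would expect when one service defines a gene that the other does not.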

Because of the different approaches, there can be irreconcilable differences, in which case either service will say that there isn't a corresponding gene/transcript defined by the other. It's all boring and technical, which is why I recommend that people try to refrain from mapping IDs. You will lose genes, and regaining them isn't worth the time and effort.
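Mapping can also blur genes rather than lose them: `multiVals="first"` silently keeps one match when an Ensembl ID has several. A rough way to inspect this is to pull every pairing with `select()` and count in both directions. The keys below are placeholders, so treat this as a sketch rather than a recipe.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Placeholder keys (an assumption); substitute your own Ensembl gene IDs.
ens.str <- c("ENSG00000141510", "ENSG00000012048")

# select() returns one row per Ensembl/Entrez pairing instead of
# collapsing to a single value per key.
map <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = ens.str,
                             columns = c("ENTREZID", "SYMBOL"),
                             keytype = "ENSEMBL")

# Distinct Ensembl/Entrez pairs, ignoring unmapped keys
pairs <- unique(map[!is.na(map$ENTREZID), c("ENSEMBL", "ENTREZID")])

# Ensembl IDs linked to more than one Entrez ID
entrez_per_ens <- table(pairs$ENSEMBL)
multi_entrez <- names(entrez_per_ens)[entrez_per_ens > 1]

# Entrez IDs shared by more than one of the queried Ensembl IDs
ens_per_entrez <- table(pairs$ENTREZID)
shared_entrez <- names(ens_per_entrez)[ens_per_entrez > 1]

print(multi_entrez)
print(shared_entrez)
```

If either vector is non-empty for your keys, the choice of `multiVals` decides which ID you end up with, and a different annotation package can make a different choice.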

Here's an example for NATP, comparing its Ensembl and NCBI entries: same HGNC symbol, same chromosome, really similar location, but each service says it doesn't exist in the other. Why? Who knows? Do you really want to figure that out?
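You can look at what org.Hs.eg.db records for that symbol yourself. This is a sketch only: the result depends on the package version you have installed, so no output is shown, and the guard is there in case the symbol is not a valid key in your version.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

sym <- "NATP"

# select() raises an error when none of the keys is valid, so check first.
if (sym %in% keys(org.Hs.eg.db, keytype = "SYMBOL")) {
  natp <- AnnotationDbi::select(org.Hs.eg.db,
                                keys = sym,
                                columns = c("ENTREZID", "ENSEMBL", "GENENAME"),
                                keytype = "SYMBOL")
  # An NA in the ENSEMBL column means no Ensembl gene is linked to this
  # NCBI Gene ID in the installed annotation package.
  print(natp)
} else {
  message(sym, " is not a SYMBOL key in this version of org.Hs.eg.db")
}
```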

If you need NCBI Gene IDs, then use NCBI gene definitions when aligning/counting reads. If you need Ensembl, do the opposite.
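As one possible illustration of counting against NCBI gene definitions, Rsubread's `featureCounts()` ships with built-in NCBI RefSeq gene annotation, and the resulting count matrix is keyed by NCBI (Entrez) Gene IDs. The sketch below assumes you have reads aligned to the hg38 genome (not the transcriptome-based quantification from the question); the BAM paths are hypothetical, and the paired-end setting is an assumption.

```r
library(Rsubread)

# Hypothetical paths to genome-aligned BAM files; replace with your own.
bam_files <- c("sample1.bam", "sample2.bam")

if (all(file.exists(bam_files))) {
  fc <- featureCounts(files = bam_files,
                      annot.inbuilt = "hg38",  # built-in NCBI RefSeq genes
                      isPairedEnd = TRUE)      # assumption: paired-end data
  # Row names of the count matrix are NCBI Gene IDs, so no ID mapping
  # step is needed afterwards.
  print(head(rownames(fc$counts)))
} else {
  message("BAM files not found; edit 'bam_files' before running.")
}
```

The chromosome names in the BAM files have to match the annotation for reads to be assigned, so check the assignment summary that `featureCounts()` reports before using the counts.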