I have been trying to annotate some RNA-seq data using org.Mm.eg.db. The count matrix I was sent by collaborators has Ensembl gene IDs. However, I have been having a problem with missing gene IDs in the egENSEMBL table when I try to annotate. For example, one gene I am interested in, H19, has the gene_id 14955, but this ID does not seem to be present in egENSEMBL. On the other hand, 14955 is present in egSYMBOL. Is there something basic I am missing? Is there a different table I should be using?
Thank you,
Greg
Sorry, I realize I was very confusing. I'll try to clarify using H19 as an example. When I search for H19 on the Ensembl website, I find that H19’s Ensembl gene ID is ENSMUSG00000000031 and its Entrez gene ID is 14955. When I search through the egENSEMBL table, the Ensembl gene ID ENSMUSG00000000031 is not present. However, when I search through the egSYMBOL table, I find that H19 is present and has the Entrez gene ID 14955. See below for the exact commands I used to search through the tables.
> library(org.Mm.eg.db)
> egENSEMBL <- toTable(org.Mm.egENSEMBL)
Then I wrote this table to a text file and searched for ENSMUSG00000000031.
gene_id   ensembl_id
14679     ENSMUSG00000000001
54192     ENSMUSG00000000003
12544     ENSMUSG00000000028
107815    ENSMUSG00000000037
11818     ENSMUSG00000000049
67608     ENSMUSG00000000056
As you can see, there is no entry for H19 in this table. However, when I search through the egSYMBOL table, I find there is an entry for H19:
> egSYMBOL <- toTable(org.Mm.egSYMBOL)
Then I wrote this table to a text file and searched for 14955.
gene_id   symbol
14944     Gzmg
14945     Gzmk
14950     H13
14955     H19
14957     Hist1h1d
14958     H1f0
14960     H2-Aa
So my question is, why is the gene entry for H19 missing from the egENSEMBL table? Have I done something wrong?
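For reference, the same two lookups can be done inside R without writing the tables to text files. This is only a minimal sketch: the two small data frames are toy stand-ins built from the rows quoted above, not the full tables, and the names `has_h19_ensembl` and `h19_row` are made up for illustration. With the package loaded, the data frames would instead come from `toTable(org.Mm.egENSEMBL)` and `toTable(org.Mm.egSYMBOL)` as in the commands above.

```r
# Toy stand-ins (assumption): only the rows quoted in this thread,
# in place of toTable(org.Mm.egENSEMBL) and toTable(org.Mm.egSYMBOL)
egENSEMBL <- data.frame(
  gene_id    = c("14679", "54192", "12544", "107815", "11818", "67608"),
  ensembl_id = c("ENSMUSG00000000001", "ENSMUSG00000000003",
                 "ENSMUSG00000000028", "ENSMUSG00000000037",
                 "ENSMUSG00000000049", "ENSMUSG00000000056"),
  stringsAsFactors = FALSE
)
egSYMBOL <- data.frame(
  gene_id = c("14944", "14945", "14950", "14955", "14957", "14958", "14960"),
  symbol  = c("Gzmg", "Gzmk", "H13", "H19", "Hist1h1d", "H1f0", "H2-Aa"),
  stringsAsFactors = FALSE
)

# Is the Ensembl ID for H19 anywhere in the Entrez-to-Ensembl table?
has_h19_ensembl <- "ENSMUSG00000000031" %in% egENSEMBL$ensembl_id

# Which symbol does Entrez gene ID 14955 carry?
h19_row <- egSYMBOL[egSYMBOL$gene_id == "14955", ]

print(has_h19_ensembl)
print(h19_row)
```

On these excerpts the first check is FALSE and the second returns the H19 row, which matches what the text search found.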
No, it doesn't say that 14955 is the matching gene. It says something else:
Overlapping RefSeq Gene ID 14955 matches but different biotype of misc_RNA
So you are saying 'these things are the same', and both Ensembl and NCBI are saying, 'well, not really'. This gets back to what the org.Xx.eg.db packages are: simply a reformulation of data from NCBI, without interpretation on our part, in which every mapping starts from NCBI's Gene database. If EBI and NCBI say that the gene is in the same place but is not exactly the same thing, then we won't map 14955 to ENSMUSG00000000031, because NCBI doesn't.
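In practice this means some Ensembl IDs in a count matrix will have no Entrez match, and it helps to flag them rather than lose them silently. Below is a minimal sketch in base R, assuming a toy excerpt of the mapping table (the rows quoted earlier in this thread) and a made-up vector `ensembl_ids` standing in for the row names of the count matrix. With the package loaded, `egENSEMBL` would come from `toTable(org.Mm.egENSEMBL)` instead.

```r
# Toy excerpt (assumption) of the Entrez-to-Ensembl table from this thread
egENSEMBL <- data.frame(
  gene_id    = c("14679", "54192", "12544", "107815", "11818", "67608"),
  ensembl_id = c("ENSMUSG00000000001", "ENSMUSG00000000003",
                 "ENSMUSG00000000028", "ENSMUSG00000000037",
                 "ENSMUSG00000000049", "ENSMUSG00000000056"),
  stringsAsFactors = FALSE
)

# Hypothetical row names of a count matrix, including the H19 Ensembl ID
ensembl_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000031",
                 "ENSMUSG00000000049")

# match() returns NA where an ID has no entry, so unmapped genes are kept
# as NA instead of being dropped. It takes the first hit only, so
# one-to-many mappings would need separate handling.
annotated <- data.frame(
  ensembl_id = ensembl_ids,
  gene_id    = egENSEMBL$gene_id[match(ensembl_ids, egENSEMBL$ensembl_id)],
  stringsAsFactors = FALSE
)

# IDs that NCBI's mapping does not cover
unmapped <- annotated$ensembl_id[is.na(annotated$gene_id)]

print(annotated)
print(unmapped)
```

Genes listed in `unmapped` can then be checked by hand on the Ensembl or NCBI sites, as was done for H19 here.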
And no, you haven't done anything wrong. Like I said before, when you have two different groups doing essentially the same thing, there are bound to be things that are not completely consistent between the two. And if you look at things from Ensembl's standpoint, they agree to disagree as well:
Fair enough. Thank you very much for your help.