It's not clear if you know this or not, but those are Ensembl gene IDs for Chinese hamster. The annotation package you are trying to use is for Humans, which you are aware of, and the fact that there is an 'eg' in the name is intended to indicate that the central key for the underlying database is Entrez Gene, which is what NCBI used to call their Gene IDs.
So you are trying to use an NCBI-based annotation of human genes to annotate Chinese hamster data for which you have Ensembl IDs, which obviously isn't going to work. If you have Ensembl IDs, you should do all your annotations within the EBI/EBML world, rather than trying to do NCBI -> Ensembl mappings. That means you use biomaRt
. The trick here is that you don't know a priori what arguments you need to use for various functions, so you need to figure out what you need.
## load biomaRt and get a 'base' mart. I use the useast mirror because I'm in the Eastern part of the US
> library(biomaRt)
> mart <- useEnsembl("ensembl", mirror = "useast")
## get a listing of the available data sets
> z <- listDatasets(mart)
## what are the mart names for Chinese hamster?
> z[grep("Chinese hamster", z[,2]),]
dataset description
32 cgchok1gshd_gene_ensembl Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
33 cgcrigri_gene_ensembl Chinese hamster CriGri genes (CriGri_1.0)
35 cgpicr_gene_ensembl Chinese hamster PICR genes (CriGri-PICR)
version
32 CHOK1GS_HDv1
33 CriGri_1.0
35 CriGri-PICR
## I know from doing a search at ensembl.org that these are CHOK1GS gene IDs, so we use that
> mart <- useEnsembl("ensembl", "cgchok1gshd_gene_ensembl", mirror = "useast")
> ids <- c("ENSCGRG00001004858", "ENSCGRG00001007305", "ENSCGRG00001010983", "ENSCGRG00001012374", "ENSCGRG00001013169", "ENSCGRG00001013944")
## now we need to know what to look for.
> z <- listAttributes(mart)
> head(z)
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
3 ensembl_transcript_id Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id Protein stable ID feature_page
6 ensembl_peptide_id_version Protein stable ID version feature_page
## I know that ensembl_gene_id is what those IDs are, because I've done this before.
## We want to get the gene symbols though
> z[grep("symbol", z$description, ignore.case = TRUE),]
name description page
45 mgi_symbol MGI symbol feature_page
65 uniprot_gn_symbol UniProtKB Gene Name symbol feature_page
## when you query biomaRt you always want to ask for the thing you are querying on, as the results
## are returned in random order, and you need to be able to re-sort. Plus if there isn't a return value you need
## to have a blank returned.
## the arguments for getBM are, in essence, 'what I want', 'what I have', 'the IDs themselves', 'the mart object'
## so for the first argument (the attributes argument, if you care to know), we ask for both what you want,
## and what you already have.
> getBM(c("mgi_symbol","ensembl_gene_id"), "ensembl_gene_id", ids, mart)
mgi_symbol ensembl_gene_id
1 ENSCGRG00001004858
2 Gpm6a ENSCGRG00001007305
3 Wdr17 ENSCGRG00001010983
4 Spata4 ENSCGRG00001012374
5 Asb5 ENSCGRG00001013169
6 Spcs3 ENSCGRG00001013944
I'll leave it up to you to figure out how to re-order those data to match your DGEList
rownames
, and note that you should read the biomaRt vignette and help pages if you have questions.
I should also note that
mgi_symbol
worked, but that was probably happenstance. I know from long experience that what you really want are theexternal_gene_name
, but that's not what you would naively search for.Thank you so much James. I have just a quick question. As, somebody else done the read alignment to genome, transcript assembly and featureCount, do you think they used Chinese hamster annotation file instead of Human annotation?? OR it's possible to have Chinese hamster and Human annotation the same gene naming.
Thanks
I have no idea. Can you not ask whomever did the alignment?
Thank you James. Is it also possible to extract the chromosome names?
Best, MJ
Yes. At this point you should read the vignette.