Question

How to access gene IDs associated with GO terms in GOSeq

0

Entering edit mode

mico • 0

@mico-15362

Last seen 13 months ago

United States

I did the GO terms enrichment analysis using GOSeq, after that I want to access the esIDs in the significantly enriched terms.

I tried to extract esIDs using org.Hs.eg.db and GO.db, according to the vignette and this post. For example using this category GO:0009063.

library(org.Hs.eg.db)
library(GO.db)
go_id <- "GO:0009063"
allegs <- unique(get(go_id, org.Hs.egGO2ALLEGS))
esids <- unlist(mget(allegs, org.Hs.egENSEMBL))
length(esids)
# [1] 118

However the numDEInCat of this term in GO.wall report is 124, so is there any method to directly access the genes in GO terms in GOSeq?

GOseq ensemblid GO • 3.6k views

ADD COMMENT • link updated 6.4 years ago by James W. MacDonald 68k • written 6.4 years ago by mico • 0

0

Entering edit mode

I found that if using org.Hs.egENSEMBL will return the GRCh38 ESIDs, while my former analyses used hg19 annotations.

> org.Hs.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2018-Apr4
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 2018-Mar28
| GOEGSOURCEDATE: 2018-Apr4
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: 
| GPSOURCEDATE: 2018-Mar26
| ENSOURCEDATE: 2017-Dec04
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Mon Apr  9 20:58:54 2018

Therefore I first got all the entrez ids for each GO term then using biomaRt to retrieve the hg19 ESIDs:

library(org.Hs.eg.db)
library(GO.db)
library(biomaRt)

hg19 <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
                  host = "grch37.ensembl.org", 
                  path = "/biomart/martservice",
                  dataset = "hsapiens_gene_ensembl")

go_id <- "GO:0009063"
allegs <- unique(get(go_id, org.Hs.egGO2ALLEGS))
esids <- getBM(attributes = "ensembl_gene_id",
               filters = "entrezgene",
               values = allegs,
               mart = hg19)
esids <- esids$ensembl_gene_id

If I only use the GO filter in biomaRt to retrieve ESIDs of GO terms, only a few gene ids in the results, I don't know why:

getBM(attributes = "ensembl_gene_id",
          filters = "go",
          values ="GO:0009063",
          mart = hg19)


#  ensembl_gene_id
#1 ENSG00000132199
#2 ENSG00000103507

ADD REPLY • link 6.4 years ago mico • 0

score 2 · Accepted Answer · 2018-11-14

If you want to add extra information to a post, use the ADD COMMENT button, not Add Answer. Because you aren't adding an answer, you are adding a comment.

There are a few misconceptions in your post. First, gene annotations (like mappings between GO and say Ensembl IDs) aren't tied to a genome build, because they have nothing to do with locations of genes on the genome (which is what the genome build provides). What Ensembl IDs there are, and what GO terms there are, and how they interact is quite fluid, being updated regularly (unlike the genome builds, which in general have big releases years apart, with smaller updates between). So if you use an archived version of the Biomart data, you get the mappings that were extant at that point in time (like 2014, which is a long time ago!), and you are missing out on any updates that may have occurred in the intervening time.

When you provide Ensembl IDs to goseq, what happens under the hood is that your IDs are mapped to Entrez Gene IDs and then mapped to the GO IDs. That will affect your results, as you invariably 'lose' genes by doing that, because the two annotation services (EBI and NCBI) don't necessarily agree on what is and isn't a gene. Do note that this is orthogonal to the question at hand, which is what GO terms are appended to a given gene - if you lose data because you can't map between Ensembl IDs to Gene IDs, that has nothing to do with the mapping for those IDs to the GO ID.

If you want to use Ensembl IDs, you are better off providing goseq with a gene2cat data.frame that specifies the Ensembl ID -> GO ID mappings you want to use (based off of data from Biomart, which keeps the mappings within EBI's databases).