biomaRt GO-ID retrieval showing different genes than AmiGO
snamjoshi87 ▴ 40
@snamjoshi87-11184

I am trying to retrieve all genes that match a particular GO-ID using biomaRt:

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_id",
                 values = "GO:0098793",
                 mart = ensembl)

nrow(goGenes)

This returns a value of 53. However, if you look at the AmiGO page for this GO term and filter for M. musculus, you see that there are actually 779 genes (384 when you remove duplicated MGI symbols).

For this GO term, the page shows 591 genes after duplicates are removed. But running the function above with this GO term returns 0 genes.

What am I doing wrong here? Why don't the numbers match up?

biomart go • 3.2k views
Mike Smith ★ 6.6k
@mike-smith
EMBL Heidelberg

This isn't a problem with the biomaRt package per se, as you get back the same values when you query Ensembl BioMart directly.

My instinct is that this query returns only genes that are directly annotated with that GO category. It won't find anything assigned to a child category. Is the same true for AmiGO? A brief look makes me think the list of genes on that site includes those from child nodes in the ontology.

Does the overlap improve if you query Ensembl with the parent term? e.g.

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_parent_term",
                 values = "GO:0098793",
                 mart = ensembl)
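
To check whether the parent-term query closes the gap, it may help to compare the unique MGI symbols from the two results. This is only a minimal sketch: `compareSymbols` is a hypothetical helper name (not part of biomaRt), and it assumes `direct` and `withChildren` are the data frames returned by the `go_id` and `go_parent_term` queries respectively.

```r
# Hypothetical helper: compare unique gene symbols between two getBM() results.
# Assumes both data frames have an "mgi_symbol" column.
compareSymbols <- function(direct, withChildren) {
  # Drop duplicates, NAs, and empty symbols before counting
  clean <- function(x) {
    x <- unique(x)
    x[!is.na(x) & x != ""]
  }
  d <- clean(direct$mgi_symbol)
  w <- clean(withChildren$mgi_symbol)
  list(
    n_direct          = length(d),       # genes annotated directly to the term
    n_with_children   = length(w),       # genes from the parent-term query
    only_via_children = setdiff(w, d)    # symbols found only via child terms
  )
}
```

If the child-term explanation is right, `n_with_children` should land much closer to the de-duplicated AmiGO count than `n_direct` does.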


Sorry for the late response! This worked for me. The only thing to note is that if you pass multiple GO IDs to the values parameter, it does not work the way I intended in the question. Thanks!


What output are you hoping for when you supply multiple GO terms?


I just realized I never really specified in my question what output I wanted. If you run the code you supplied above, you get the genes for all child terms. If you supply multiple GO terms, you get the combined child terms for all of the parent GO terms, which makes sense. But then you have no way of knowing which child term is associated with which parent term; it's all lumped together. Ideally, I could search for a bunch of parent terms and get another column indicating which parent term a given child term belongs to. I got around this by writing a function that accepts GO terms, queries each one separately, and uses rbind() to combine the outputs, with a separate column identifying the original parent GO term. There could be a better way though.
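
A minimal sketch of that workaround, not the exact function I used: `getGenesByParent`, the `parent_go_id` column, and the `queryFun` argument are names made up for illustration. It assumes the `go_parent_term` filter from the answer above is available in the mart, and it issues one query per parent term.

```r
# Hypothetical helper: query each parent GO term separately and record
# which parent term every returned row came from.
# queryFun defaults to biomaRt::getBM; it is a parameter only so that the
# function can be tried out with a stub and no network connection.
getGenesByParent <- function(parentTerms, mart, queryFun = biomaRt::getBM) {
  perTerm <- lapply(parentTerms, function(term) {
    res <- queryFun(attributes = c("mgi_symbol", "go_id"),
                    filters    = "go_parent_term",
                    values     = term,
                    mart       = mart)
    # Tag every row with the parent term that produced it
    # (rep() yields a zero-length vector when the query returns no rows)
    res$parent_go_id <- rep(term, nrow(res))
    res
  })
  # Stack the per-term results into a single data frame
  do.call(rbind, perTerm)
}
```

With the `ensembl` mart from the question, a call would look like `getGenesByParent(c("GO:0098793", ...), mart = ensembl)`, with any other parent terms of interest in place of the dots. Each row then carries the child `go_id` alongside the `parent_go_id` it was retrieved under.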
