Any BioMart query you can do in the browser you can also do via the biomaRt package, although I'll agree that it's not always obvious what the correct options are. Here's a little example trying to get mouse orthologs for three fly genes we're interest in.
First, lets load biomaRt and define that we want to use the fruit fly dataset:
library(biomaRt)
ensembl <- useEnsembl(biomart = "ensembl",
dataset = "dmelanogaster_gene_ensembl")
Now lets define our three genes we're interest in. I'm using what Ensembl refers to as the 'Gene Stable ID' here, which I guess is also the FlyBase ID. You'll have to adapt the filters
argument in final query if your IDs are of a different type.
genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252")
You can now query BioMart using something like:
getBM(mart = ensembl,
filters = "ensembl_gene_id",
values = genes_of_interest,
attributes = c("ensembl_gene_id",
"external_gene_name",
"mmusculus_homolog_ensembl_gene",
"mmusculus_homolog_associated_gene_name")
)
ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene mmusculus_homolog_associated_gene_name
1 FBgn0035252 CG7970 ENSMUSG00000029499 Pxmp2
2 FBgn0036531 CG6244 ENSMUSG00000013822 Elof1
3 FBgn0037375 kat-60L1
What information you retrieve is determined by the attributes
argument. Here we're getting back same ID we used to search and the fly gene name, plus the analogous values for orthologs found in mouse. If you want different information, then you can use the functions listAttributes()
and searchAttributes()
to find what else is available (or you can look on the web interface). Missing values presumably indicate that Ensembl does not have an ortholog mapping between these two species for that gene.
You should also note that I think this may return duplicate matches, because paralogs within one genome may be matched to paralogs in the second genome. e.g. if FlyGeneA & FlyGeneB are paralogs, and MouseGeneA & MouseGeneB are also paralogs then you'll end up with 4 reported parings.
Many thanks for the very useful explanation Mike, I will give it a go with my table. I will come back if I encounter problems. I am a bit of a rookie
Hi again, I used the getBM function using the following code:
library(tidyverse)
library(dplyr)
gene_list <- read_csv("/Users/miguel/Dropbox/RStudio/Martins_Sanz/data/Thermal_Stress_GSA.csv")
gene_ids <- gene_list %>%
select (Gene_ID)
# ****Note that there is a bug and if biomaRt is loaded, select() will not work in dplyr****
# Now,load biomaRt and define that we want to use the fruit fly dataset
library(biomaRt)
fly = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
fly_attributes <- listAttributes (fly)
getBM(mart = fly,
filters = "ensembl_gene_id",
values = gene_ids,
attributes = c("mmusculus_homolog_ensembl_gene", "hsapiens_homolog_ensembl_gene"))
I got the following error message at the end, any thoughts?
Error in `[[<-.data.frame`(`*tmp*`, vIdx, value = c("FBgn0013276", "FBgn0001223", :
replacement has 492 rows, data has 19669
There isn't a bug. This is known as masking, where you can have more than one package defining a function with the same name. You can get around that using the double colon function. Instead of using
select
directly, you would usedplyr::select(Gene_ID).
Or you could just roll old school and use base R:In which case you don't need either tidyverse or dplyr.
Also do note that Mike's code is useful whereas yours is not! The query to the Biomart will return data in random order, so you need to ask for the filter to be returned as well in order to complete your mappings (you will need to probably get rid of some of the duplicated paralogs, as well as re-order the results to conform to your input data).
Anyway, you need to show a self-contained example that does what you show, as this works for me, using the genes that Mike originally used:
Your original code will return this:
And you don't know what the paralogs are matched to.
You could run your code again, and after getting the error, run
traceback()
and show what that output is. But all things equal it's better if you present a self-contained bit of code that people can use to debug.Thanks James, I will troubleshoot my code and when I do so post an update here. I am unfamiliar on how to post properly formatted code in this platform, is there a tutorial available?
When you start composing a response, note that at the top left there is a dropdown that by default says 'Text'. If you want to add code, put it in a separate paragraph block, highlight, and then change to 'Code' using that dropdown. You can also highlight inline code blocks and use 'Inline Code'.
Hi again, I have troubleshooted and got it to work with Mike's example. Here is my code:
It all goes well with the test set, but my data returns an error:
If I follow your advice an run traceback(), the output is the following:
traceback_output
I had to create a link to the output as it has more than 5000 characters. As the reason for using tidyverse, I was thinking of combining the data I get from the Biomart query with the original CSV file using joins.
Thanks again for helping me out on my first project!
You'll probably have to put gene_ids on DropBox so people can troubleshoot.
Ok, I have Dropboxed gene_ids, with the header as a csv file here. Thanks again!
This is because
getBM
is expecting a character vector, butgene_ids
is atibble
.dplyr::select
always returns atibble
even if you're only selecting a single column.The simplest thing to do is forget using
dplyr::select
and use the$
operator to extract the correct column.In the call to
getBM
you could just dovalues = gene_list$Gene_ID
Many thanks Mike, the issue is now resolved.