R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
BiocManager 1.30.4
biomaRt_2.40.3 (latest)
I've been working with the biomaRt package to get mappings between identifiers from different organizations, and I noticed something that looks like a bug to me, but I'm not sure.
I'm trying to get mappings between ensembl_gene_id and entrezgene_id. I started by building a bigger data frame because I needed more information, so I defined names_mart:
library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "scerevisiae_gene_ensembl", host = "ensembl.org")
#Query many attributes at once
names_mart <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "entrezgene_id",
                                   "external_gene_name", "kegg_enzyme", "goslim_goa_accession",
                                   "goslim_goa_description", "description"),
                    mart = mart)
#This contains only 'ensembl_gene_id' and 'entrezgene_id' columns
ensembl2entrez <- names_mart[, 2:3]
#Getting the same info directly from 'getBM'
ensembl2entrez_bio <- getBM(attributes = c("ensembl_gene_id", "entrezgene_id"), mart = mart)
#Keep only unique rows with a non-NA 'entrezgene_id'. Both data frames should then contain the same set of 'entrezgene_id' values, but they do not
ensembl2entrez <- unique(ensembl2entrez[!is.na(ensembl2entrez$entrezgene_id), ])
ensembl2entrez_bio <- unique(ensembl2entrez_bio[!is.na(ensembl2entrez_bio$entrezgene_id), ])
#Number of unique values in each column
apply(ensembl2entrez, 2, function(column) length(unique(column)))
#ensembl_gene_id   entrezgene_id
#           5507            5505
apply(ensembl2entrez_bio, 2, function(column) length(unique(column)))
#ensembl_gene_id   entrezgene_id
#           5804            5801
ensem2entre_intersect <- intersect(ensembl2entrez_bio$entrezgene_id, ensembl2entrez$entrezgene_id)
length(ensem2entre_intersect)
#[1] 5505
ensem2entre_set_diff <- setdiff(ensembl2entrez_bio$entrezgene_id,ensembl2entrez$entrezgene_id)
length(ensem2entre_set_diff)
#[1] 296
I don't understand why the number of unique elements differs between these two biomaRt queries. What could explain this difference?
Thanks for your answer. I'll keep this in mind and query as few attributes as possible at once.
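For anyone landing here, this is a minimal sketch of one way to follow that advice, not an official biomaRt recipe: fetch the attributes in small groups that all include ensembl_gene_id, then combine the results with a full outer join so that genes missing from one result are not dropped. The helper name merge_by_gene and the toy data frames are made up for illustration. In practice each data frame would come from its own small getBM() call.

```r
# Hypothetical helper (not part of biomaRt): full outer join of several
# data frames on a shared key column, keeping rows missing from any table.
merge_by_gene <- function(tables, key = "ensembl_gene_id") {
  Reduce(function(x, y) merge(x, y, by = key, all = TRUE), tables)
}

# Invented stand-ins for the results of separate, small getBM() calls.
gene_ids <- data.frame(ensembl_gene_id = c("geneA", "geneB", "geneC"),
                       entrezgene_id   = c(101L, 102L, 103L),
                       stringsAsFactors = FALSE)
gene_names <- data.frame(ensembl_gene_id    = c("geneA", "geneC"),
                         external_gene_name = c("nameA", "nameC"),
                         stringsAsFactors = FALSE)

combined <- merge_by_gene(list(gene_ids, gene_names))
# geneB is kept, with NA in external_gene_name
print(combined)
```

With all = TRUE, a gene that lacks one annotation still appears in the combined table, so the ID mapping is not affected by the extra attributes.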