biomaRt's getBM returns no genes for an existing GO term in GRCh38, and fewer than expected in GRCh37
A • 0
@a-23294
Last seen 3.4 years ago
Austria

I am trying to get all genes associated with a GO-term using Biomart. I cannot seem to get the genes listed by other services.

Example

Take a small term with 5 genes: GO:0018103, as listed in AmiGO or QuickGO.

Prepare marts

library(biomaRt)

ensembl37 = useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl",
                       host = "https://grch37.ensembl.org")

ensembl38 = useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl",
                       host = "https://ensembl.org")

Find genes

I find 1 gene using https://grch37.ensembl.org

gene.data = getBM(attributes = c('hgnc_symbol'),
                  filters = 'go_parent_term', values = 'GO:0018103', mart = ensembl37, verbose = FALSE); gene.data

>   hgnc_symbol
> 1        DPM3

I get an error when using https://ensembl.org

gene.data = getBM(attributes = c('hgnc_symbol'),
                  filters = 'go_parent_term', values = 'GO:0018103', mart = ensembl38, verbose = FALSE); gene.data


> NULL
> Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : 
>   The query to the BioMart webservice returned an invalid result.
> The number of columns in the result table does not equal the number of attributes in the query.
> Please report this on the support site at http://support.bioconductor.org

My guesses:

  1. ensembl37 simply does not have the other genes (or maybe it only uses certain evidence codes / annotations).
  2. Something changed in the syntax for ensembl38. I went through all the posts I could find via Google, Bioconductor and Biostars, but can't seem to find the solution.

Ideally I would get all 5 genes with the latest annotation, but results from ensembl37 would also be fine.

I appreciate your help! Thanks, A

PS.

> R version 4.0.2 (2020-06-22)
> package.version("biomaRt")
[1] "2.45.9"
@james-w-macdonald-5106
Last seen 8 hours ago
United States

Might just be an intermittent outage:

> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", mart)
  hgnc_symbol ensembl_gene_id
1     DPY19L2 ENSG00000177990
2     DPY19L3 ENSG00000178904
3     DPY19L4 ENSG00000156162
4     DPY19L1 ENSG00000173852
5        DPM3 ENSG00000179085

## And

> oldmart <- useEnsembl("ensembl","hsapiens_gene_ensembl","https://grch37.ensembl.org")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", oldmart)
  hgnc_symbol ensembl_gene_id
1     DPY19L2 ENSG00000177990
2     DPY19L3 ENSG00000178904
3     DPY19L4 ENSG00000156162
4     DPY19L1 ENSG00000173852
5        DPM3 ENSG00000179085

But then again, you should upgrade.

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.48.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.32.0        progress_1.2.2         tidyselect_1.1.1      
 [4] purrr_0.3.4            vctrs_0.3.8            generics_0.1.0        
 [7] stats4_4.1.0           BiocFileCache_2.0.0    utf8_1.2.1            
[10] blob_1.2.1             XML_3.99-0.6           rlang_0.4.11          
[13] pillar_1.6.1           withr_2.4.2            glue_1.4.2            
[16] DBI_1.1.1              rappdirs_0.3.3         BiocGenerics_0.38.0   
[19] bit64_4.0.5            dbplyr_2.1.1           GenomeInfoDbData_1.2.6
[22] lifecycle_1.0.0        stringr_1.4.0          zlibbioc_1.38.0       
[25] Biostrings_2.60.1      memoise_2.0.0          Biobase_2.52.0        
[28] IRanges_2.26.0         fastmap_1.1.0          GenomeInfoDb_1.28.1   
[31] parallel_4.1.0         curl_4.3.1             AnnotationDbi_1.54.1  
[34] fansi_0.5.0            Rcpp_1.0.6             filelock_1.0.2        
[37] cachem_1.0.5           S4Vectors_0.30.0       XVector_0.32.0        
[40] bit_4.0.4              hms_1.1.0              png_0.1-7             
[43] digest_0.6.27          stringi_1.6.2          dplyr_1.0.7           
[46] tools_4.1.0            bitops_1.0-7           magrittr_2.0.1        
[49] RCurl_1.98-1.3         RSQLite_2.2.7          tibble_3.1.2          
[52] crayon_1.4.1           pkgconfig_2.0.3        ellipsis_0.3.2        
[55] xml2_1.3.2             prettyunits_1.1.1      assertthat_0.2.1      
[58] httr_1.4.2             rstudioapi_0.13        R6_2.5.0              
[61] compiler_4.1.0

Thanks James!

For grch38 I can reproduce your call

> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", mart)
  hgnc_symbol ensembl_gene_id
1     DPY19L2 ENSG00000177990
2     DPY19L3 ENSG00000178904
3     DPY19L4 ENSG00000156162
4     DPY19L1 ENSG00000173852
5        DPM3 ENSG00000179085

The problem was that I specified the host argument

> mart2 <- useEnsembl("ensembl","hsapiens_gene_ensembl", host = "https://ensembl.org")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", mart2)
NULL
Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : 
  The query to the BioMart webservice returned an invalid result.
The number of columns in the result table does not equal the number of attributes in the query.
Please report this on the support site at http://support.bioconductor.org

For grch37 I still get only 1

> oldmart <- useEnsembl("ensembl","hsapiens_gene_ensembl","https://grch37.ensembl.org")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", oldmart)
  hgnc_symbol ensembl_gene_id
1        DPM3 ENSG00000179085

So it may be a version problem? I'm not sure which package would be out of date. I am cautious about updating: I once spent some frustrating days figuring out that a third-level dependency had changed a default argument, so I kept getting different results without a single warning.
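To make the discrepancy concrete, here is a minimal sketch that compares the two result tables quoted above. The data frames are typed in by hand from the outputs in this thread rather than fetched live, and the object names (`default_host`, `grch37_host`, `only_in_default`) are made up for illustration.

```r
# Results copied by hand from the outputs quoted in this thread (not fetched live).
default_host <- data.frame(
  hgnc_symbol = c("DPY19L2", "DPY19L3", "DPY19L4", "DPY19L1", "DPM3"),
  ensembl_gene_id = c("ENSG00000177990", "ENSG00000178904", "ENSG00000156162",
                      "ENSG00000173852", "ENSG00000179085"),
  stringsAsFactors = FALSE
)
grch37_host <- data.frame(
  hgnc_symbol = "DPM3",
  ensembl_gene_id = "ENSG00000179085",
  stringsAsFactors = FALSE
)

# Rows present in the first result but absent from the second,
# matched on the Ensembl gene ID rather than the symbol.
only_in_default <- default_host[
  !default_host$ensembl_gene_id %in% grch37_host$ensembl_gene_id, ]
only_in_default$hgnc_symbol
```

With live queries, the same comparison could be run on the data frames returned by getBM() for each mart, which shows at a glance which genes one server is missing.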


Yeah, I think it's a version thing.

> library(biomaRt)
> oldmart <- useEnsembl("ensembl","hsapiens_gene_ensembl","https://grch37.ensembl.org")
> getBM(c("hgnc_symbol","ensembl_gene_id"), "go_parent_term", "GO:0018103", oldmart)
  hgnc_symbol ensembl_gene_id
1        DPM3 ENSG00000179085
> version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          0.2                         
year           2020                        
month          06                          
day            22                          
svn rev        78730                       
language       R                           
version.string R version 4.0.2 (2020-06-22)
nickname       Taking Off Again            
>

This shouldn't be a biomaRt version thing, so that's not good! It's supposed to just be an interface to the Ensembl server, and if you're querying https://grch37.ensembl.org both times, it should at least be receiving the same data back. I'll take a look at what might have happened here, because I don't remember intentionally introducing any changes that would manifest like this.

Mike Smith ★ 6.6k
@mike-smith
Last seen 1 day ago
EMBL Heidelberg

I think you've managed to find a bug in biomaRt. In the current version (I'm not sure how far back this goes yet), if you supply a host argument to useEnsembl() it is ignored, and a new host is constructed based on the mirror, GRCh and version arguments. Leaving these as defaults will simply use www.ensembl.org.

The "correct" result for GRCh37 is the version that contains only the single hit for DPM3, which you can verify by visiting https://grch37.ensembl.org/biomart/ and running the query interactively.

I'd recommend using the argument GRCh = 37 to make sure you are querying the correct server, rather than providing the host. You can also check by looking at the host slot in the Mart object e.g.

mart_with_version <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', GRCh = '37')
mart_with_version@host
## [1] "https://grch37.ensembl.org:443/biomart/martservice"
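
The host check can be folded into the query itself, so that a silently redirected mart fails loudly instead of returning results from the wrong assembly. This is only a sketch: `host_matches()` and `query_go_grch37()` are hypothetical helpers, not part of biomaRt, and the expected host string is taken from the output above. Calling `query_go_grch37()` needs biomaRt installed and network access.

```r
# Hypothetical helper (not part of biomaRt): TRUE when the host recorded in a
# Mart object contains the server name we expected to query.
host_matches <- function(actual_host, expected_host) {
  grepl(expected_host, actual_host, fixed = TRUE)
}

# Hypothetical wrapper: build a GRCh37 mart via the GRCh argument, confirm the
# host slot, then run the GO query from this thread.
query_go_grch37 <- function(go_id = "GO:0018103") {
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "hsapiens_gene_ensembl",
                              GRCh = 37)
  if (!host_matches(mart@host, "grch37.ensembl.org")) {
    stop("Unexpected BioMart host: ", mart@host)
  }
  biomaRt::getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"),
                 filters = "go_parent_term",
                 values = go_id,
                 mart = mart)
}
```

The string comparison is deliberately loose (a substring match) because the host slot also carries the scheme, port and path.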

Thanks for finding and reporting the problem. I'll patch the current devel and release versions of biomaRt to make sure the host argument is respected. However, the fix won't propagate to the 2.45 version you're using; that's now read-only, I'm afraid.
