Question

How to get both hgnc_symbol and all cdna_coding_start values associated?

bastien_chassagnol • 0

My first objective was to retrieve the genomic start and end positions for a set of HGNC symbols, using the "start_position" and "end_position" attributes. Up to that step, no problem:

# define which mart to use
human_ensembl <- biomaRt::useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# generate all possible attributes with pattern "start_position"
biomaRt::searchAttributes(human_ensembl, pattern = "start_position")
               name     description         page
10   start_position Gene start (bp) feature_page
210  start_position Gene start (bp)    structure
250  start_position Gene start (bp)     homologs
2929 start_position Gene start (bp)          snp
2973 start_position Gene start (bp)  snp_somatic
3018 start_position Gene start (bp)    sequences

# generate all possible attributes with pattern "hgnc_symbol"
biomaRt::searchAttributes(human_ensembl, pattern = "hgnc_symbol")
         name description         page
62 hgnc_symbol HGNC symbol feature_page

# check with two genes if we can retrieve both information
gene_positions <- biomaRt::getBM(attributes= c("start_position", "end_position", "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)

head(gene_positions)
  start_position end_position hgnc_symbol
1       30649829     30675633       ABCF1
2       30563598     30589402       ABCF1

My first question: how is the "start_position" attribute chosen in this case, given that it matches six entries in the attribute table?

My second objective was then to retrieve all cDNA coding start positions (which I assume gives one value per mature transcript):

# generate all possible attributes with pattern "cdna_coding_start"
biomaRt::searchAttributes(human_ensembl, pattern = "cdna_coding_start")
                  name             description      page
233  cdna_coding_start       cDNA coding start structure
3025 cdna_coding_start CDS start (within cDNA) sequences
3053 cdna_coding_start       cDNA coding start sequences

# check with two genes if we can retrieve both information
transcript_positions <- biomaRt::getBM(attributes= c("cdna_coding_start",  "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)
Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : 
  Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed

This time, it doesn't work. The attributes come from different pages; however, you can run a query with "hgnc_symbol" and "cdna", even though they also come from different pages. So what is the real reason I can't run this request? And how can I get all cdna_coding_start values for a given hgnc_symbol?

# generate all possible attributes with pattern "^cdna$"
biomaRt::searchAttributes(human_ensembl, pattern = "^cdna$")
     name    description      page
3007 cdna cDNA sequences sequences

# check with two genes if we can retrieve both information
transcript_sequences <- biomaRt::getBM(attributes= c("cdna",  "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)

dim(transcript_sequences)
[1] 24  2

As you can see, cdna and hgnc_symbol come from different pages, yet the request runs without any issue.

Sorry for the long post, here's my session info:

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] bmkanalysis_1.0.1.9001 testthat_3.0.1

biomaRt cdna keys gene_start • 1.2k views

score 1 · Accepted Answer · 2021-01-21

I'll try to address the question about the biomaRt behaviour first, then think about how to get the data you want.

You've run across one of the edge cases of biomaRt, which is that it lets you try to run queries that aren't possible via the standard web interface.

Just for clarity, I'll point out that the 'pages' reported by searchAttributes correspond to the radio button selection you get in the Attributes section of the web interface.

Selecting a page gives you access to the available attributes via the expanding box below it. Some attributes appear on all pages (e.g. Gene stable ID, Gene start); others appear on only a single page. Picking a new page removes any attribute selection you've already made, so it's impossible to pick attributes across pages via the website. However, biomaRt has no such restriction, as you've found. For the most part you get the error you've seen, which is actually thrown by the BioMart server rather than the R package.

  Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed
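To make the page restriction concrete, here is a minimal sketch that checks whether a set of attributes shares at least one page. The table is copied from the searchAttributes output in the question; `common_pages` is a hypothetical helper written for this illustration, not part of biomaRt. With a live connection you could presumably pass the result of `biomaRt::listAttributes(human_ensembl)` instead, since it has the same `name` and `page` columns.

```r
# Attribute/page pairs taken from the searchAttributes output in the question.
attr_table <- data.frame(
  name = c(rep("start_position", 6), "hgnc_symbol", rep("cdna_coding_start", 3)),
  page = c("feature_page", "structure", "homologs", "snp", "snp_somatic",
           "sequences", "feature_page", "structure", "sequences", "sequences")
)

# Hypothetical helper: return the pages on which ALL requested attributes appear.
common_pages <- function(attr_table, wanted) {
  pages_per_attr <- lapply(wanted, function(a) {
    unique(attr_table$page[attr_table$name == a])
  })
  Reduce(intersect, pages_per_attr)
}

# Both attributes are on feature_page, so this query is allowed.
ok_pages <- common_pages(attr_table, c("start_position", "end_position")[1])
shared_ok <- common_pages(attr_table, c("start_position", "hgnc_symbol"))

# No page holds both of these, which is the situation that triggers the server error.
shared_bad <- common_pages(attr_table, c("cdna_coding_start", "hgnc_symbol"))
```

Here `shared_ok` contains only `"feature_page"`, while `shared_bad` is empty. As noted in this answer, an empty result does not always mean the server will reject the query (the sequences page is an exception), so treat this as a rough pre-check only.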

There are a small number of cases where that error isn't triggered. Combining hgnc_symbol with any sequence type is one such case. You can't run that query via the web interface, as HGNC symbol isn't available as an option on the sequences page.

biomaRt should probably have a check that prevents you from submitting the query too, but I know people use this type of thing, and as far as I know they work correctly, so I've never stopped the behaviour.

I'm afraid I don't know why this doesn't trigger the server-side error; I don't know enough about the BioMart internals. As far as I know this only happens with the sequence page, and I've always assumed there are some weird hacks going on server-side since that also triggers the non-standard FASTA return type.

It doesn't sound like you actually want the sequence, just the coding start position of each transcript for your requested genes, so maybe the structure page is what you want.

From there you can request Ensembl Gene and Transcript IDs, and get the position of the cDNA coding region for each exon relative to the start of the transcript. Filtering by the first exon of each transcript should allow you to generate a complete list of coding start positions across the transcripts.
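As a rough sketch of that reduction step, the example below collapses an exon-level table to one coding start per transcript. The data frame is a made-up stand-in shaped like a structure-page result (column names follow the Ensembl attribute names; the IDs and numbers are invented). It assumes that exons without any coding sequence come back with `NA` in `cdna_coding_start`, and it therefore takes the smallest non-missing value per transcript rather than filtering on exon rank 1, because the first exon can be entirely untranslated.

```r
# Toy exon-level table; values are invented for illustration only.
exons <- data.frame(
  ensembl_transcript_id = c("ENST_1", "ENST_1", "ENST_1", "ENST_2", "ENST_2"),
  rank                  = c(1L, 2L, 3L, 1L, 2L),
  cdna_coding_start     = c(NA, 120L, 300L, 45L, 210L)
)

# Keep only exons that contain coding sequence (assumption: non-coding exons are NA).
coding_exons <- exons[!is.na(exons$cdna_coding_start), ]

# One row per transcript: the smallest cDNA coding start, i.e. the first coding exon.
coding_start <- aggregate(
  cdna_coding_start ~ ensembl_transcript_id,
  data = coding_exons,
  FUN = min
)
```

With real data, the `exons` table would come from a getBM call on the structure page requesting attributes such as `ensembl_gene_id`, `ensembl_transcript_id`, `rank` and `cdna_coding_start`.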

One problem is that you still can't return the HGNC symbol in your results. Selecting Gene Name works for your example symbols, but I wouldn't rely on it. When working with Ensembl BioMart it's good to remember that it's very much focused on Ensembl annotation and IDs, so every attribute page lets you include the Ensembl Gene ID in results. I would run two separate queries, one returning the coding region information, and another that generates a mapping table between HGNC symbols and Ensembl Gene IDs. I would then merge these two tables using the Ensembl Gene ID as my key. Be aware that there isn't a perfect one-to-one mapping between different sources of gene annotation, but that approach is about the best you can do if you aren't using Ensembl IDs directly.
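The two-query approach could look something like the sketch below. The merge itself runs on small invented tables so it can be tried offline; `merge_by_gene_id` is a hypothetical helper name, and the IDs and symbols are placeholders. The live queries are wrapped in `if (FALSE)` so they are shown but not executed; they assume the attribute and filter names `ensembl_gene_id`, `ensembl_transcript_id`, `cdna_coding_start` and `cdna_coding_end` are available in the dataset you connect to.

```r
# Hypothetical helper: attach HGNC symbols to coding-region rows via the Ensembl gene ID.
# all.y = TRUE keeps coding rows even when no symbol maps to the gene.
merge_by_gene_id <- function(coding, mapping) {
  merge(mapping, coding, by = "ensembl_gene_id", all.y = TRUE)
}

# Invented stand-ins for the two query results.
mapping <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B"),
  hgnc_symbol     = c("GENE1", "GENE2")
)
coding <- data.frame(
  ensembl_gene_id       = c("ENSG_A", "ENSG_A", "ENSG_C"),
  ensembl_transcript_id = c("ENST_1", "ENST_2", "ENST_3"),
  cdna_coding_start     = c(101L, 57L, 12L)
)

merged <- merge_by_gene_id(coding, mapping)

# Against a live mart, the inputs might be built like this (not run here).
if (FALSE) {
  human_ensembl <- biomaRt::useEnsembl(biomart = "genes",
                                       dataset = "hsapiens_gene_ensembl")
  # Query 1 (feature page): HGNC symbol to Ensembl gene ID mapping.
  mapping <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol"),
    filters = "hgnc_symbol", values = c("ABCF1", "ABO"),
    mart = human_ensembl
  )
  # Query 2 (structure page): coding-region coordinates per exon.
  coding <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id",
                   "cdna_coding_start", "cdna_coding_end"),
    filters = "ensembl_gene_id", values = unique(mapping$ensembl_gene_id),
    mart = human_ensembl
  )
}
```

In the toy result, the row for `ENSG_C` has a missing `hgnc_symbol`, which is how an imperfect mapping between annotation sources would show up after the merge.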