My first objective was to retrieve the genomic positions (start and end) for each hgnc_symbol, using the "start_position" and "end_position" attributes. Up to that step, no problem:
# define which mart to use
human_ensembl <- biomaRt::useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
# list all attributes matching the pattern "start_position"
biomaRt::searchAttributes(human_ensembl, pattern = "start_position")
name description page
10 start_position Gene start (bp) feature_page
210 start_position Gene start (bp) structure
250 start_position Gene start (bp) homologs
2929 start_position Gene start (bp) snp
2973 start_position Gene start (bp) snp_somatic
3018 start_position Gene start (bp) sequences
# list all attributes matching the pattern "hgnc_symbol"
biomaRt::searchAttributes(human_ensembl, pattern = "hgnc_symbol")
name description page
62 hgnc_symbol HGNC symbol feature_page
# check with two genes whether both pieces of information can be retrieved together
gene_positions <- biomaRt::getBM(attributes= c("start_position", "end_position", "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)
head(gene_positions)
start_position end_position hgnc_symbol
1 30649829 30675633 ABCF1
2 30563598 30589402 ABCF1
My first question: how is the "start_position" attribute chosen in that case, given that it matches six entries in the attribute list, each on a different page?
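To make the page question concrete, here is a minimal sketch for checking which pages list a given attribute name. It only inspects the attribute table; it does not show how `getBM()` resolves a name that appears on several pages. `pages_for_attribute` is a hypothetical helper, and it assumes the table has the `name` and `page` columns shown in the output above.

```r
# Hypothetical helper: return the pages on which an attribute name appears exactly.
# `attribute_table` is assumed to look like the result of biomaRt::listAttributes(),
# i.e. a data frame with at least the columns `name` and `page`.
pages_for_attribute <- function(attribute_table, attribute_name) {
  unique(attribute_table$page[attribute_table$name == attribute_name])
}

# Usage (needs network access to Ensembl, so it is left commented out):
# all_attributes <- biomaRt::listAttributes(human_ensembl)
# pages_for_attribute(all_attributes, "start_position")
# pages_for_attribute(all_attributes, "hgnc_symbol")
# biomaRt::attributePages(human_ensembl)  # all page names known to this mart
```

Comparing the pages returned for each attribute in a query should show whether they share at least one page.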
My second objective was to retrieve all cDNA coding start positions (which I assume cover all possible mature transcripts):
# list all attributes matching the pattern "cdna_coding_start"
biomaRt::searchAttributes(human_ensembl, pattern = "cdna_coding_start")
name description page
233 cdna_coding_start cDNA coding start structure
3025 cdna_coding_start CDS start (within cDNA) sequences
3053 cdna_coding_start cDNA coding start sequences
# check with two genes whether both pieces of information can be retrieved together
transcript_positions <- biomaRt::getBM(attributes= c("cdna_coding_start", "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)
Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery, :
Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed
This time it doesn't work. The attributes come from different pages; however, you can run a query with "hgnc_symbol" and "cdna", even though they also come from different pages. So what is the real reason I can't run this query? And how can I get all cdna_coding_start values for a given hgnc_symbol?
# list all attributes matching the pattern "^cdna$"
biomaRt::searchAttributes(human_ensembl, pattern = "^cdna$")
name description page
3007 cdna cDNA sequences sequences
# check with two genes whether both pieces of information can be retrieved together
transcript_sequences <- biomaRt::getBM(attributes= c("cdna", "hgnc_symbol"), filters=c("hgnc_symbol"), values = c("ABCF1", "ABO"), mart=human_ensembl)
dim(transcript_sequences)
[1] 24 2
As you can see, cdna and hgnc_symbol come from different pages, yet the query runs without any issue.
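One workaround that might apply here, as a sketch only: since the error complains about attribute pages, request `cdna_coding_start` together with identifiers from the same "structure" page, fetch the `hgnc_symbol` mapping in a separate query, and join the two in R. This assumes that `ensembl_gene_id` and `ensembl_transcript_id` are available on the "structure" page (something `listAttributes(human_ensembl, page = "structure")` should confirm). Both helper names below are hypothetical.

```r
# Hypothetical helper: attach gene symbols to structure-page rows via the gene ID.
# Rows without a matching symbol are kept, with NA in `hgnc_symbol`.
join_symbols <- function(structure_df, symbol_df) {
  merge(structure_df, symbol_df, by = "ensembl_gene_id", all.x = TRUE)
}

# Hypothetical wrapper: two single-page queries, then an R-side join.
# Assumption: ensembl_gene_id and ensembl_transcript_id exist on the "structure" page.
get_cdna_coding_start <- function(symbols, mart) {
  structure_df <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "cdna_coding_start"),
    filters = "hgnc_symbol",
    values = symbols,
    mart = mart
  )
  symbol_df <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol"),
    filters = "hgnc_symbol",
    values = symbols,
    mart = mart
  )
  join_symbols(structure_df, symbol_df)
}

# Usage (needs network access to Ensembl, so it is left commented out):
# transcript_positions <- get_cdna_coding_start(c("ABCF1", "ABO"), human_ensembl)
```

If the structure-page attributes are reported per exon, as I would expect, the result may contain several rows per transcript, with NA where an exon has no coding sequence.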
Sorry for the long post; here's my session info:
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] bmkanalysis_1.0.1.9001 testthat_3.0.1
Thanks for your suggestions. I hope that one day both nomenclatures (HGNC and Ensembl) will be better integrated, so that you can get all relevant information for a given gene at once.