Mike Smith will probably be along in a bit with a direct answer, but I wonder if you are duplicating work that Johannes Rainer has already done in building the EnsDb packages:
> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("mus musculus","ensdb"))
AnnotationHub with 8 records
# snapshotDate(): 2018-10-24
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'
title
AH53222 | Ensembl 87 EnsDb for Mus Musculus
AH53726 | Ensembl 88 EnsDb for Mus Musculus
AH56691 | Ensembl 89 EnsDb for Mus Musculus
AH57770 | Ensembl 90 EnsDb for Mus Musculus
AH60788 | Ensembl 91 EnsDb for Mus Musculus
AH60992 | Ensembl 92 EnsDb for Mus Musculus
AH64461 | Ensembl 93 EnsDb for Mus Musculus
AH64944 | Ensembl 94 EnsDb for Mus musculus
> musdb <- hub[["AH64944"]]
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
> gns <- transcriptsBy(musdb)
> gns
GRangesList object of length 55341:
$ENSMUSG00000000001
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | tx_id tx_biotype
<Rle> <IRanges> <Rle> | <character> <character>
[1] 3 108107280-108146146 - | ENSMUST00000000001 protein_coding
tx_cds_seq_start tx_cds_seq_end gene_id tx_support_level
<integer> <integer> <character> <integer>
[1] 108109422 108146005 ENSMUSG00000000001 1
tx_id_version tx_name
<character> <character>
[1] ENSMUST00000000001.4 ENSMUST00000000001
$ENSMUSG00000000003
GRanges object with 2 ranges and 8 metadata columns:
seqnames ranges strand | tx_id tx_biotype
[1] X 77837901-77853623 - | ENSMUST00000000003 protein_coding
[2] X 77837902-77853530 - | ENSMUST00000114041 protein_coding
tx_cds_seq_start tx_cds_seq_end gene_id tx_support_level
[1] 77841883 77853483 ENSMUSG00000000003 1
[2] 77841883 77853483 ENSMUSG00000000003 2
tx_id_version tx_name
[1] ENSMUST00000000003.13 ENSMUST00000000003
[2] ENSMUST00000114041.2 ENSMUST00000114041
$ENSMUSG00000000028
GRanges object with 4 ranges and 8 metadata columns:
seqnames ranges strand | tx_id tx_biotype
[1] 16 18807356-18811987 - | ENSMUST00000115585 protein_coding
[2] 16 18780447-18811972 - | ENSMUST00000000028 protein_coding
[3] 16 18780453-18811626 - | ENSMUST00000096990 protein_coding
[4] 16 18810108-18811591 - | ENSMUST00000231819 retained_intron
tx_cds_seq_start tx_cds_seq_end gene_id tx_support_level
[1] 18807356 18811565 ENSMUSG00000000028 2
[2] 18781898 18811565 ENSMUSG00000000028 1
[3] 18781898 18811565 ENSMUSG00000000028 1
[4] <NA> <NA> ENSMUSG00000000028 <NA>
tx_id_version tx_name
[1] ENSMUST00000115585.1 ENSMUST00000115585
[2] ENSMUST00000000028.13 ENSMUST00000000028
[3] ENSMUST00000096990.9 ENSMUST00000096990
[4] ENSMUST00000231819.1 ENSMUST00000231819
...
<55338 more elements>
-------
seqinfo: 117 sequences from GRCm38 genome
That has pretty much everything except the mappings to NCBI IDs, and I would argue that doing that mapping is a non-trivial exercise, given the differences between NCBI and EBI/EMBL.
And if you want to do some tidyverse sorcery on the results, you can always unlist that GRangesList, or convert to a DataFrame, a data.frame, or (shudders) a tibble.
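A minimal sketch of those conversions, assuming the same AH64944 record as in the session above and that the tibble package is installed. The object names flat, tx_df and tx_tbl are made up for illustration. I drop the list names when unlisting because the gene ID is already in the gene_id column, and repeated names can get in the way of building a data.frame.

```r
library(AnnotationHub)
library(ensembldb)

# Same objects as in the session above
hub <- AnnotationHub()
musdb <- hub[["AH64944"]]
gns <- transcriptsBy(musdb)

# Flatten the GRangesList into one GRanges, one row per transcript.
# use.names = FALSE drops the (repeated) gene IDs as names; they are
# still available in the gene_id metadata column.
flat <- unlist(gns, use.names = FALSE)

# The metadata columns on their own, as a DataFrame
tx_meta <- mcols(flat)

# Coordinates plus metadata as a base data.frame
tx_df <- as.data.frame(flat)

# Or (shudders) a tibble
tx_tbl <- tibble::as_tibble(tx_df)
```

From there tx_tbl can go through the usual dplyr verbs, for example grouping on gene_id.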
Or, if you just want a DB to query, you can always run SQL directly against the underlying SQLite DB:
> DBI::dbListTables(dbconn(musdb))
[1] "chromosome" "entrezgene" "exon" "gene"
[5] "metadata" "protein" "protein_domain" "tx"
[9] "tx2exon" "uniprot"
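As a rough example of such a query, here is a sketch that peeks at the entrezgene table from the listing above. I have not checked that table's column names here, so the code lists the fields first and selects everything rather than naming columns. The names con and entrez_head are just for illustration.

```r
library(AnnotationHub)
library(ensembldb)

# Same EnsDb object as in the session above
hub <- AnnotationHub()
musdb <- hub[["AH64944"]]

# Connection to the SQLite database behind the EnsDb
con <- dbconn(musdb)

# See which columns the table has before writing a more specific query
DBI::dbListFields(con, "entrezgene")

# Pull the first few rows; the result is an ordinary data.frame
entrez_head <- DBI::dbGetQuery(con, "SELECT * FROM entrezgene LIMIT 5")
entrez_head
```

Once you know the column names, the same dbGetQuery call takes any SELECT statement, including joins against the gene or tx tables.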