Question

How to filter EnsDb into another EnsDb object instead of GRange?

0

Entering edit mode

Ahdee ▴ 60

@ahdee-8938

Last seen 4 months ago

United States

Hi I'm trying to create an EnsDb object from the EnsDb.Hsapiens.v86 package. I would like to filter only protein_coding as such.

library(EnsDb.Hsapiens.v86)
db=transcripts(EnsDb.Hsapiens.v86, filter = TxBiotypeFilter( "protein_coding" ))

however this result in a filter GRange object instead of a smaller subset of the original EnsDb. Does anyone know how to create a new EnsDb with the above filter?

thanks! Ahdee

ens ensembldb EnsDb.Hsapiens.v86 EnsDb.Rnorvegicus.v75 • 2.7k views

ADD COMMENT • link 3.9 years ago Ahdee ▴ 60

score 0 · Answer 1 · 2021-05-27

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 11 hours ago

United States

The EnsDb packages are simply a SQLite database and associated code to make SQL queries to generate various objects like the GRanges you get from the transcripts function. Subsetting the SQLite database would probably not be worth the trouble. Instead, perhaps you can say what your use-case is, and people can make suggestions as to how you can accomplish it.

ADD COMMENT • link 3.9 years ago James W. MacDonald 68k

0

Entering edit mode

I see thank you. What I want to do is to convert genomic coordinates into transcript coordinates, however I only want to retain transcript type that are protein and not something like mediate decay. For example something like this.

genomeToTranscript( GRanges( c( "5:150412626"  ) ) , EnsDb.Hsapiens.v86)

Right now I have to do this two steps first above and then subset with another annotation with transcript type.

ADD REPLY • link 3.9 years ago Ahdee ▴ 60

score 0 · Answer 2 · 2021-05-28

0

Entering edit mode

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 5 months ago

Italy

You can filter the whole database with:

> library(EnsDb.Hsapiens.v86)
> db <- filter(EnsDb.Hsapiens.v86, filter = ~ tx_biotype == "protein_coding")
> length(transcripts(EnsDb.Hsapiens.v86))
[1] 216741
> length(transcripts(db))
[1] 87694

this is similar to subsetting the whole database to only annotations related to protein coding transcripts.

best, jo

ADD COMMENT • link 3.9 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thanks Jo! Good to know.

ADD REPLY • link 3.9 years ago James W. MacDonald 68k

0

Entering edit mode

Hi Jo, thanks that is super helpful. Since we are on topic is there a way to update the the annotations? I ask because it the current db does not seem to reflect the latest version, p13. For example the gene CD74

> test =ensembldb::transcripts (EnsDb.Hsapiens.v86, filter = list ( TxBiotypeFilter( "protein_coding"  )  ) )
> test[test$gene_id == "ENSG00000019582", ]
GRanges object with 5 ranges and 6 metadata columns:
                  seqnames              ranges strand |           tx_id     tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id         tx_name
                     <Rle>           <IRanges>  <Rle> |     <character>    <character>        <integer>      <integer>     <character>     <character>
  ENST00000377795        5 150401637-150412769      - | ENST00000377795 protein_coding        150402584      150412749 ENSG00000019582 ENST00000377795
  ENST00000353334        5 150401670-150412929      - | ENST00000353334 protein_coding        150402240      150412749 ENSG00000019582 ENST00000353334
  ENST00000518797        5 150401699-150412732      - | ENST00000518797 protein_coding        150402477      150412732 ENSG00000019582 ENST00000518797
  ENST00000524315        5 150401864-150412751      - | ENST00000524315 protein_coding        150401987      150412749 ENSG00000019582 ENST00000524315
  ENST00000009530        5 150402152-150412751      - | ENST00000009530 protein_coding        150402240      150412749 ENSG00000019582 ENST00000009530
  -------
  seqinfo: 288 sequences from GRCh38 genome

is different than what I I get when downloading the latest using getBM and also is not accurately reflecting the latest ensembl website as well. Notice that whereas the current db only has 3 transcript htat are protein_coding and the EnsDb.Hsapiens.v86 has 5 of those transcript instead.

Here is a screen shot of the web sight for this gene.

enter image description here

ADD REPLY • link 3.9 years ago Ahdee ▴ 60

1

Entering edit mode

The name of the package, EnsDb.Hsapiens.v86 indicates the Ensembl build is version 86! If you want an updated version, you need to get it from the AnnotationHub package.

> library(AnnotationHub)

> hub <- AnnotationHub()
snapshotDate(): 2021-05-18

> minihub <- query(hub, c("homo","ensdb"))
> minihub
AnnotationHub with 18 records
# snapshotDate(): 2021-05-18
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

            title                             
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681 | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757 | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773 | Ensembl 91 EnsDb for Homo Sapiens 
  ...       ...                               
  AH78783 | Ensembl 99 EnsDb for Homo sapiens 
  AH79689 | Ensembl 100 EnsDb for Homo sapiens
  AH83216 | Ensembl 101 EnsDb for Homo sapiens
  AH89180 | Ensembl 102 EnsDb for Homo sapiens
  AH89426 | Ensembl 103 EnsDb for Homo sapiens
> ensdb <- minihub[["AH89426"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")

> ensdb2 <- filter(ensdb, filter = list ( TxBiotypeFilter( "protein_coding"  )  ) )
> trans <- transcripts(ensdb2)
> trans[trans$gene_id == "ENSG00000019582", ]
GRanges object with 3 ranges and 9 metadata columns:
                  seqnames              ranges strand |           tx_id
                     <Rle>           <IRanges>  <Rle> |     <character>
  ENST00000009530        5 150400041-150412751      - | ENST00000009530
  ENST00000377795        5 150401637-150412769      - | ENST00000377795
  ENST00000353334        5 150401670-150412750      - | ENST00000353334
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000009530 protein_coding        150402240      150412749
  ENST00000377795 protein_coding        150402584      150412749
  ENST00000353334 protein_coding        150402240      150412749
                          gene_id tx_support_level      tx_id_version
                      <character>        <integer>        <character>
  ENST00000009530 ENSG00000019582             <NA> ENST00000009530.12
  ENST00000377795 ENSG00000019582             <NA>  ENST00000377795.7
  ENST00000353334 ENSG00000019582             <NA> ENST00000353334.11
                  gc_content         tx_name
                   <numeric>     <character>
  ENST00000009530    54.5925 ENST00000009530
  ENST00000377795    58.6116 ENST00000377795
  ENST00000353334    58.5827 ENST00000353334
  -------
  seqinfo: 358 sequences from GRCh38 genome
>

ADD REPLY • link 3.9 years ago James W. MacDonald 68k

0

Entering edit mode

this looks promising, strange for me however is that the latest snapshot I can get is 2019-10-29? thus my latest stops at AH78783 | Ensembl 99 EnsDb for Homo sapiens, instead of 103 like yours? edit: I tried reinstalling AnnotationHub but its still giving me the same data.

  > hub <- AnnotationHub()
snapshotDate(): 2019-10-29

> minihub <- query(hub, c("homo","ensdb"))
> minihub
AnnotationHub with 14 records
# snapshotDate(): 2019-10-29 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
  AH56681 | Ensembl 89 EnsDb for Homo Sapiens
  AH57757 | Ensembl 90 EnsDb for Homo Sapiens
  AH60773 | Ensembl 91 EnsDb for Homo Sapiens
  ...       ...                              
  AH69187 | Ensembl 96 EnsDb for Homo sapiens
  AH73881 | Ensembl 97 EnsDb for Homo sapiens
  AH73986 | Ensembl 79 EnsDb for Homo sapiens
  AH75011 | Ensembl 98 EnsDb for Homo sapiens
  AH78783 | Ensembl 99 EnsDb for Homo sapiens

ADD REPLY • link 3.9 years ago Ahdee ▴ 60

0

Entering edit mode

I would guess that you are using a very old version of R/Bioconductor. I am on the current release, which just came out last week.

ADD REPLY • link 3.9 years ago James W. MacDonald 68k

0

Entering edit mode

I see thanks. I even tried clearing the cache removeCache ( minihub ) but no go. I think I might have to just uprgrade. thanks this is helpful!

ADD REPLY • link 3.9 years ago Ahdee ▴ 60