ensDbFromGtf error from gtf file downloaded from GENCODE.
1
0
Entering edit mode
@mthabisi-moyo-9721
Last seen 26 days ago
United States

I am getting an error when I try and create an ensDb object using a GENCODE .gtf annotation file for GRCh38 downloaded directly from the GENCODE website.

> gtffile <- "/user/Downloads/gencode.v26.annotation.gtf"

> DB <- ensDbFromGtf(gtffile)
Error in `colnames<-`(`*tmp*`, value = c("name", "value")) : 
  length of 'dimnames' [2] not equal to array extent

I tried downloading the .gtf from Ensembl in case the issue was the file but I get a different set of errors:

> gtffile <- "/Users/Bongani/Downloads/Homo_sapiens.GRCh38.88.gtf.gz"
> DB <- ensDbFromGtf(gtffile)
Error in checkValidEnsDb(EnsDb(dbname), verbose = verbose) :
  Provided exon index in transcript does not match with ordering of the exons by chromosomal coordinates for1of the101167transcripts encoded on the + strand!
In addition: Warning messages:
1: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
2: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
3: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
4: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
5: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
6: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

Is there a package I am missing? The errors seem to suggest an error in formatting but I have not modified the .gtf files in any way before passing them to ensDb.

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ensembldb_1.2.2         GenomicFeatures_1.22.13 AnnotationDbi_1.32.3   
 [4] Biobase_2.30.0          GenomicRanges_1.22.4    GenomeInfoDb_1.6.3     
 [7] IRanges_2.4.8           S4Vectors_0.8.11        AnnotationHub_2.2.5    
[10] BiocGenerics_0.16.1    
ensembldb ensdbfromgtf • 3.6k views
ADD COMMENT
1
Entering edit mode
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 10 weeks ago
Italy

Regarding the GENCODE gtf file - for that you might want to use the makeTxDbFromGFF function from the GenomicFeatures package since the ensembldb::ensDbFromGtf works only for gtf files from Ensembl as it requires certain fields/columns to be present in the file.

Regarding the error you get with the release 88 GTF - I have to check that. The database is created, but there seems to be a discrepancy between the index of the exon within the transcript and its chromosomal coordinates for one transcript. It seems to be specific for GTF files for Homo sapiens Ensembl releases 88 and 89. I didn't get any error with the Ensembl 90 GTF.

Note that for newer releases of R/Bioconductor you can also fetch EnsDb databases from AnnotationHub. I am adding EnsDbs for every new Ensembl release. Note that the result below is for Bioconductor 3.6 devel. Bioconductor 3.5 has also EnsDbs in AnnotationHub, but not the most recent releases.

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-09-07
> query(ah, "EnsDb")
AnnotationHub with 282 records
# snapshotDate(): 2017-09-07
# $dataprovider: Ensembl
# $species: Ailuropoda Melanoleuca, Anas Platyrhynchos, Anolis Carolinensis,...
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53185"]]'

            title                                      
  AH53185 | Ensembl 87 EnsDb for Anolis Carolinensis   
  AH53186 | Ensembl 87 EnsDb for Ailuropoda Melanoleuca
  AH53187 | Ensembl 87 EnsDb for Astyanax Mexicanus    
  AH53188 | Ensembl 87 EnsDb for Anas Platyrhynchos    
  AH53189 | Ensembl 87 EnsDb for Bos Taurus            
  ...       ...                                        
  AH57800 | Ensembl 90 EnsDb for Takifugu Rubripes     
  AH57801 | Ensembl 90 EnsDb for Tursiops Truncatus    
  AH57802 | Ensembl 90 EnsDb for Vicugna Pacos         
  AH57803 | Ensembl 90 EnsDb for Xiphophorus Maculatus
  AH57804 | Ensembl 90 EnsDb for Xenopus Tropicalis    

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu/x86_64 (64-bit)
Running under: Linux Mint 18.1

Matrix products: default
BLAS: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRblas.so
LAPACK: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] AnnotationHub_2.9.12 BiocGenerics_0.23.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12                  AnnotationDbi_1.39.3         
 [3] IRanges_2.11.15               bit_1.1-12                   
 [5] xtable_1.8-2                  R6_2.2.2                     
 [7] rlang_0.1.2                   blob_1.1.0                   
 [9] httr_1.3.1                    tools_3.4.1                  
[11] Biobase_2.37.2                DBI_0.7                      
[13] htmltools_0.3.6               yaml_2.1.14                  
[15] bit64_0.9-7                   digest_0.6.12                
[17] tibble_1.3.4                  interactiveDisplayBase_1.15.0
[19] shiny_1.0.5                   S4Vectors_0.15.7             
[21] curl_2.8.1                    memoise_1.1.0                
[23] RSQLite_2.0                   mime_0.5                     
[25] compiler_3.4.1                BiocInstaller_1.27.4         
[27] stats4_3.4.1                  httpuv_1.3.5                 
[29] pkgconfig_2.0.1              

cheers, jo

ADD COMMENT
1
Entering edit mode

Checked into the problematic transcript: transcript ENST00000639671 was assigned to gene ENSG00000141198 in Ensembl release 88, but got re-assigned in Ensembl 90 to ENSG00000166260. Both genes are encoded at around the same region on chromosome 17, but on opposite strands. The order of the exons of ENST00000639671 was (and is) that of a transcript encoded on the reverse strand. With the transcript assigned to the + encoded ENST00000141198 this order did no longer match with the expected order of the chromosomal start coords. Seems to me that this might have been a bug in the Ensembl annotation pipeline. It is fixed since Ensembl 90.

 

ADD REPLY
0
Entering edit mode

Thanks for the feedback!

1) I was able to construct the package using Ensembl 90 without a problem:

>gtffile <- "/user/Downloads/Homo_sapiens.GRCh38.90.gtf.gz"

> DB <- ensDbFromGtf(gtffile)
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
There were 13 warnings (use warnings() to see them)

> EDB <- EnsDb(DB)

> makeEnsembldbPackage(ensdb = DB, version = "0.1.0", maintainer = "Mthabisi Moyo <>",
+                      author = "M Moyo")
Creating package in ./EnsDb.Hsapiens.v90 
[1] TRUE

This created a folder (containing "inst", "man", "R" subfolders as well as the Namespace and description) which I installed in R using devtools::install. I haven't tested it out on a dataset yet but it looks good:

> library(EnsDb.Hsapiens.v90)
> EnsDb.Hsapiens.v90
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Tue Sep 12 13:55:27 2017
|ensembl_version: 90
|ensembl_host: unknown
|Organism: Homo_sapiens
|genome_build: GRCh38
|DBSCHEMAVERSION: 1.0
|source_file: Homo_sapiens.GRCh38.90.gtf.gz
| No. of genes: 58302.
| No. of transcripts: 200310.

2) With regard to fetching EnsDb databases from AnnotationHub, this was the first thing I tried but I kept getting back an empty query:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, "EnsDb")
AnnotationHub with 0 records
# snapshotDate(): 2017-04-25 


> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnsDb.Hsapiens.v90_0.90.0 devtools_1.13.3           ensembldb_2.0.4          
 [4] AnnotationFilter_1.0.0    GenomicFeatures_1.28.4    GenomicRanges_1.28.5     
 [7] GenomeInfoDb_1.12.2       BiocInstaller_1.26.1      AnnotationDbi_1.38.2     
[10] IRanges_2.10.3            S4Vectors_0.14.4          Biobase_2.36.2           
[13] AnnotationHub_2.8.2       BiocGenerics_0.22.0      

I just updated to R version 3.4.1 and I am using Bioconductor version 3.5 (BiocInstaller 1.26.1) but the result is still the same.

ADD REPLY
0
Entering edit mode

That you don't get any results for the query in AnnotationHub is strange. Could you try to delete the .AnnotationHub folder in your home directory and re-run the commands above? Eventually the AnnotationHub database is still the one from your previous installation.

ADD REPLY
0
Entering edit mode

I deleted the .AnnotationHub folder in /Library/Frameworks/R.framework/Versions/3.4/Resources/library/AnnotationHub and the result is unchanged. I still get 0 queries after reinstalling AnnotationHub database using Biocinstaller.

Mthabisi

 

ADD REPLY
0
Entering edit mode

Sorry, I was unclear. I didn't mean to remove the AnnotatioHub folder from the R library. AnnotationHub creates a hidden folder (called ".AnnotationHub" (note the dot before AnnotationHub), usually in the user's home directory. The cache as well as the database is stored there.

You should re-install AnnotationHub with BiocInstaller::biocLite("AnnotationHub"), remove the hidden ".AnnotationHub" folder and try again.

ADD REPLY
1
Entering edit mode

I missed that, thanks for the catch! Deleting the .AnnotationHub folder cleared the cache and did the trick.

I ran ensembldb using both the EnsDb I made from the Ensembl GRCh38 .gtf annotation file (v90) and the one I downloaded through AnnotationHub and both worked well. The only point I would make is with regard to the annotation files from GENCODE. They append the gene/transcript version number to ensembl gene ids. This results in 0 matches when using mapIds(). I fixed this by removing the version numbers from the gene ids in command line after doing the counts. I will probably just download all the fasta and annotation files from ensembl in future.

Mthabisi

ADD REPLY

Login before adding your answer.

Traffic: 566 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6