makeTxDbFromGFF errors: "too many NAs" and .make_splicings
Karl Lundén ▴ 20

Hi,

I'm trying to make TxDb objects for some GFF3 files of Picea abies from ConGenIE. Can you see any obvious reason why there are errors? Are the GFF3 files not compatible with makeTxDbFromGFF, or are some updates needed?

Kind regards

Karl

 

> MYBtestTxDb<-makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz'
Content type 'unknown' length 13520938 bytes (12.9 MB)
==================================================
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
  solving row 58427: range cannot be determined from the supplied arguments (too many NAs)
> traceback()
11: .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
10: solveUserSEW0(start = start, end = end, width = width)
9: IRanges(ans_start, ans_end, names = ans_names)
8: makeGRangesFromDataFrame(df, seqnames.field = "seqid")
7: readGFFAsGRanges(con, version = version, colnames = colnames, 
       filter = list(type = feature.type), genome = genome, sequenceRegionsAsSeqinfo = sequenceRegionsAsSeqinfo, 
       speciesAsMetadata = TRUE)
6: .local(con, format, text, ...)
5: import(FileForFormat(con, format), ...)
4: import(FileForFormat(con, format), ...)
3: import(file, format = format, colnames = colnames, feature.type = GFF_FEATURE_TYPES)
2: import(file, format = format, colnames = colnames, feature.type = GFF_FEATURE_TYPES)
1: makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz")

> txdB_gene2<- makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz'
Content type 'unknown' length 5769965 bytes (5.5 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
  some CDS cannot be mapped to an exon
> traceback()
4: stop(wmsg("some CDS cannot be mapped to an exon"))
3: .make_splicings(exons, cds, stop_codons)
2: makeTxDbFromGRanges(gr, metadata = metadata)
1: makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz")

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.30.0   GenomicFeatures_1.32.0 AnnotationDbi_1.42.1   Biobase_2.40.0         rtracklayer_1.40.3     GenomicRanges_1.32.3  
 [7] GenomeInfoDb_1.16.0    IRanges_2.14.10        S4Vectors_0.18.3       BiocGenerics_0.26.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17                compiler_3.5.0              XVector_0.20.0              prettyunits_1.0.2           bitops_1.0-6               
 [6] tools_3.5.0                 zlibbioc_1.26.0             progress_1.2.0              biomaRt_2.36.1              digest_0.6.15              
[11] bit_1.1-14                  RSQLite_2.1.1               memoise_1.1.0               lattice_0.20-35             pkgconfig_2.0.1            
[16] rlang_0.2.1                 Matrix_1.2-14               DelayedArray_0.6.1          DBI_1.0.0                   GenomeInfoDbData_1.1.0     
[21] httr_1.3.1                  stringr_1.3.1               Biostrings_2.48.0           hms_0.4.2                   bit64_0.9-7                
[26] grid_3.5.0                  R6_2.2.2                    XML_3.98-1.11               BiocParallel_1.14.1         magrittr_1.5               
[31] blob_1.1.1                  Rsamtools_1.32.0            matrixStats_0.53.1          GenomicAlignments_1.16.0    assertthat_0.2.0           
[36] SummarizedExperiment_1.10.1 stringi_1.2.3               RCurl_1.95-4.10             crayon_1.3.4               

 

 


Did you take a look at row 58427, as suggested in the error message? What did you see?

@herve-pages-1542
Seattle, WA, United States

Hi Karl,

The issue with the 1st file (Trinity_kmer10.gff3.gz) is that it contains start/end values greater than 2^31-1. The problem was that, because these values cannot be stored in an R integer vector, makeTxDbFromGFF() was silently coercing them to NAs. I committed a change to rtracklayer (version 1.41.7, this is in BioC 3.8 only) so that makeTxDbFromGFF() now fails early with an informative error message when this happens:

library(rtracklayer)
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("Trinity_kmer10.gff3.gz")
# Import genomic features from the file as a GRanges object ... Error in
# readGFF(filepath, version = version, columns = columns, tags = tags,  : 
#   reading GFF file: line 58427 contains values greater than 2^31-1 
#   (= .Machine$integer.max) in column 4 (start) and/or 5 (end).
#   Bioconductor does not support such GFF files at the moment. Sorry!
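
To locate the offending records yourself, something along these lines may help. It is only a sketch in base R, not what rtracklayer does internally: the two GFF3 records are made up for illustration, and the column names are just the usual GFF3 field names. Reading columns 4 and 5 as doubles keeps the large values intact so they can be compared with .Machine$integer.max.

```r
# Two made-up GFF3 records; the second has coordinates beyond the integer limit.
gff_lines <- c(
  "##gff-version 3",
  "scaffold_1\tdemo\texon\t100\t250\t.\t+\t.\tID=exon1",
  "scaffold_2\tdemo\texon\t3000000000\t3000000500\t.\t+\t.\tID=exon2"
)

# Read start/end (columns 4 and 5) as doubles so nothing is coerced to NA.
gff <- read.delim(text = gff_lines, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE,
                  colClasses = c("character", "character", "character",
                                 "numeric", "numeric", "character",
                                 "character", "character", "character"))
names(gff) <- c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# Flag records whose start or end does not fit in an R integer.
too_big <- gff$start > .Machine$integer.max | gff$end > .Machine$integer.max
bad_rows <- which(too_big)
print(gff[bad_rows, c("seqid", "type", "start", "end")])

# For comparison, this is the kind of coercion that yields NA:
coerced <- suppressWarnings(as.integer(3000000000))
print(coerced)
```

Note that bad_rows indexes data records (comment lines are skipped), so it will not necessarily equal the line number in the file. For a real file, replace text = gff_lines with the path to the (possibly gzipped) GFF3 file.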

The issue with the 2nd file (Pabies01b-gene.gff3.gz) was that CDS features have their Parent set to an exon instead of a transcript. Note that this is very unconventional and deviates from the well-established convention documented at: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I committed a change to GenomicFeatures (version 1.33.4, also in BioC 3.8 only) so that makeTxDbFromGFF() now supports such files:

txdb <- makeTxDbFromGFF("Pabies01b-gene.gff3.gz")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
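
If you want to check how CDS features are parented in a GFF3 file before building the TxDb, a rough base R sketch could look like the following. The four records are made up to mimic the unconventional layout described above, get_tag() is a hypothetical helper written for this example, and the sketch assumes each feature has a single Parent (comma-separated multiple parents are not handled).

```r
# Made-up records mimicking the layout described above: the CDS points at an exon.
gff_lines <- c(
  "scaffold_1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
  "scaffold_1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
  "scaffold_1\tdemo\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=tx1",
  "scaffold_1\tdemo\tCDS\t100\t400\t.\t+\t0\tID=cds1;Parent=exon1"
)
gff <- read.delim(text = gff_lines, header = FALSE, quote = "",
                  stringsAsFactors = FALSE, colClasses = "character")
names(gff) <- c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# get_tag() is a hypothetical helper: extract one tag's value from column 9,
# returning NA where the tag is absent.
get_tag <- function(attributes, tag) {
  pattern <- paste0("(^|;)", tag, "=([^;]*)")
  hit <- regmatches(attributes, regexec(pattern, attributes))
  vapply(hit, function(m) if (length(m) == 3L) m[3L] else NA_character_,
         character(1))
}

id <- get_tag(gff$attributes, "ID")
parent <- get_tag(gff$attributes, "Parent")

# Look up the feature type of each CDS's parent.
is_cds <- gff$type == "CDS"
cds_parent_type <- gff$type[match(parent[is_cds], id)]
print(table(cds_parent_type))
```

In a conventionally structured file the table would show mRNA (or another transcript type) rather than exon.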

Note that in this file, the CDS and exons are actually the same (i.e. same genomic ranges):

> all(cds(txdb) == exons(txdb))
[1] TRUE

Both rtracklayer 1.41.7 and GenomicFeatures 1.33.4 should become available to BioC 3.8 users via BiocManager::install() in the next 24 hours or so.
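
A quick way to tell whether a given installation already includes these fixes is to compare version numbers. The snippet below is a sketch: has_fix() is a hypothetical helper, the minimum versions are the ones mentioned above, and the "reported" versions are those from the sessionInfo() in the question.

```r
# Minimum versions that contain the fixes, as stated above.
required <- c(rtracklayer = "1.41.7", GenomicFeatures = "1.33.4")

# has_fix() is a hypothetical helper: TRUE if `installed` is at least `minimum`.
has_fix <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

# Versions reported in the question's sessionInfo().
reported <- c(rtracklayer = "1.40.3", GenomicFeatures = "1.32.0")
fixed <- mapply(has_fix, reported[names(required)], required)
print(fixed)

# The same check against whatever is installed locally, if the package is present.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  print(has_fix(as.character(utils::packageVersion("rtracklayer")),
                required[["rtracklayer"]]))
}
```

With the versions from the question, both entries of fixed are FALSE, which matches the errors reported there.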

Cheers,

H.

> sessionInfo()
R version 3.5.1 Patched (2018-08-01 r75051)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.5.r75051/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.5.r75051/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.33.4 AnnotationDbi_1.43.1   Biobase_2.41.2        
[4] rtracklayer_1.41.7     GenomicRanges_1.33.14  GenomeInfoDb_1.17.2   
[7] IRanges_2.15.18        S4Vectors_0.19.22      BiocGenerics_0.27.1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19                compiler_3.5.1             
 [3] XVector_0.21.4              prettyunits_1.0.2          
 [5] bitops_1.0-6                tools_3.5.1                
 [7] zlibbioc_1.27.0             progress_1.2.0             
 [9] biomaRt_2.37.8              digest_0.6.18              
[11] bit_1.1-14                  RSQLite_2.1.1              
[13] memoise_1.1.0               lattice_0.20-35            
[15] pkgconfig_2.0.2             rlang_0.2.2                
[17] Matrix_1.2-14               DelayedArray_0.7.47        
[19] DBI_1.0.0                   GenomeInfoDbData_1.2.0     
[21] httr_1.3.1                  stringr_1.3.1              
[23] Biostrings_2.49.2           hms_0.4.2                  
[25] bit64_0.9-7                 grid_3.5.1                 
[27] R6_2.3.0                    XML_3.98-1.16              
[29] BiocParallel_1.15.15        magrittr_1.5               
[31] blob_1.1.1                  Rsamtools_1.99.0           
[33] matrixStats_0.54.0          GenomicAlignments_1.17.3   
[35] assertthat_0.2.0            SummarizedExperiment_1.11.6
[37] stringi_1.2.4               RCurl_1.95-4.11            
[39] crayon_1.3.4

 
