I am trying to build a TxDb (transcript database) object with the GenomicFeatures package. The code and the error output are below. The GFF file was obtained from the NCBI RefSeq Genomes database. Do I need to pre-process the GFF file first, and if so, how, so that it can be used with the package?
library(GenomicFeatures)
# dataDir is the directory that holds the downloaded GFF file
gffmodel <- file.path(dataDir, "GCF_000686985.2_Bra_napus_v2.0_genomic.gff")
(txdb <- makeTxDbFromGFF(gffmodel, format="gff"))
Output:
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in h(simpleError(msg, call)) :
error in evaluating the argument 'table' in selecting a method for function '%in%': subscript contains NAs
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BiocParallel_1.24.1
[2] GenomicAlignments_1.26.0
[3] Rsamtools_2.6.0
[4] SummarizedExperiment_1.20.0
[5] MatrixGenerics_1.2.0
[6] matrixStats_0.57.0
[7] GenomicFeatures_1.42.1
[8] AnnotationDbi_1.52.0
[9] Biobase_2.50.0
[10] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000
[11] BiocManager_1.30.10
[12] BSgenome_1.58.0
[13] rtracklayer_1.50.0
[14] GenomicRanges_1.42.0
[15] GenomeInfoDb_1.26.2
[16] stringr_1.4.0
[17] Biostrings_2.58.0
[18] XVector_0.30.0
[19] IRanges_2.24.1
[20] S4Vectors_0.28.1
[21] BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] progress_1.2.2 tidyselect_1.1.0 purrr_0.3.4
[4] lattice_0.20-41 vctrs_0.3.5 generics_0.1.0
[7] BiocFileCache_1.14.0 blob_1.2.1 XML_3.99-0.5
[10] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.5
[16] dbplyr_2.0.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0
[19] zlibbioc_1.36.0 memoise_1.1.0 biomaRt_2.46.0
[22] curl_4.3 Rcpp_1.0.5 openssl_1.4.3
[25] DelayedArray_0.16.0 bit_4.0.4 hms_0.5.3
[28] askpass_1.1 digest_0.6.27 stringi_1.5.3
[31] dplyr_1.0.2 grid_4.0.2 tools_4.0.2
[34] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
[37] RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[40] pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[43] xml2_1.3.2 prettyunits_1.1.1 assertthat_0.2.1
[46] httr_1.4.2 R6_2.5.0 compiler_4.0.2
Well, so it looks like this GFF file contains an exon with no Parent attribute (ID=id-NC_008285.1:25367..25761-4 on line 1796721). First time ever.
FWIW this is what the GFF specs say about this:
But in this case they've attached the exon to... nothing! This breaks makeTxDbFromGRanges(), which is used internally by makeTxDbFromGFF(). A fix is on its way.
H.
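Until a fix is available, one possible way to spot this kind of record is to look for exon features whose Parent attribute is empty. The following is only a minimal sketch on a made-up GRanges object (the IDs, coordinates and the names gr, orphans and gr_clean are hypothetical); it assumes the object has the same type, ID and Parent metadata columns that rtracklayer::import() produces for a GFF3 file.

```r
library(GenomicRanges)  # also attaches IRanges, S4Vectors and BiocGenerics

# Toy stand-in for the GRanges returned by rtracklayer::import() on a GFF3
# file. All IDs and coordinates are made up for illustration.
gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 1, 1, 200), end = c(100, 100, 50, 250)),
  strand   = "+"
)
gr$type   <- factor(c("gene", "mRNA", "exon", "exon"))
gr$ID     <- c("gene1", "rna1", "exon1", "orphan1")
# Parent is a list column: one character vector per feature, empty if absent.
gr$Parent <- CharacterList(character(0), "gene1", "rna1", character(0))

# An "orphan" exon is an exon feature with zero parents.
is_orphan_exon <- gr$type == "exon" & lengths(gr$Parent) == 0

orphans  <- gr[is_orphan_exon]   # records to inspect
gr_clean <- gr[!is_orphan_exon]  # everything else

# With the real file, the same filter would be applied to the imported object:
#   gr <- rtracklayer::import(gffmodel)
#   txdb <- GenomicFeatures::makeTxDbFromGRanges(gr_clean)
# This is an assumption, not a confirmed workaround: other oddities in the
# file may still cause makeTxDbFromGRanges() to fail.
```

Printing orphans$ID shows which records were set aside, so they can be checked against the original file before deciding whether dropping them is acceptable.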
The file also contains some unusual trans-spliced genes (e.g. ID=gene-BRNAC_p045 at lines 1797501-1797505), which also break makeTxDbFromGRanges(), so the fix will take a little bit longer.
Thank you.