Question

Genomic Features error in TxDbfromGFF

0

Entering edit mode

sieminsk • 0

@3fc834aa

Last seen 3.6 years ago

Canada

I am attempting to generate a taxonomic database with Genomic Features package. I have attached the code and the error output below. The gff file was obtained from NCBI RefSeq Genomes database. Do I have to parse the gff file and if yes how to be able to use it with the package.

gffmodel <- file.path(dataDir, "GCF_000686985.2_Bra_napus_v2.0_genomic.gff")
(txdb <- makeTxDbFromGFF(gffmodel, format="gff"))

Output: 
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': subscript contains NAs

sessionInfo( )

R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] BiocParallel_1.24.1
[2] GenomicAlignments_1.26.0
[3] Rsamtools_2.6.0
[4] SummarizedExperiment_1.20.0
[5] MatrixGenerics_1.2.0
[6] matrixStats_0.57.0
[7] GenomicFeatures_1.42.1
[8] AnnotationDbi_1.52.0
[9] Biobase_2.50.0
[10] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000 [11] BiocManager_1.30.10
[12] BSgenome_1.58.0
[13] rtracklayer_1.50.0
[14] GenomicRanges_1.42.0
[15] GenomeInfoDb_1.26.2
[16] stringr_1.4.0
[17] Biostrings_2.58.0
[18] XVector_0.30.0
[19] IRanges_2.24.1
[20] S4Vectors_0.28.1
[21] BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] progress_1.2.2 tidyselect_1.1.0 purrr_0.3.4
[4] lattice_0.20-41 vctrs_0.3.5 generics_0.1.0
[7] BiocFileCache_1.14.0 blob_1.2.1 XML_3.99-0.5
[10] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.5
[16] dbplyr_2.0.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0
[19] zlibbioc_1.36.0 memoise_1.1.0 biomaRt_2.46.0
[22] curl_4.3 Rcpp_1.0.5 openssl_1.4.3
[25] DelayedArray_0.16.0 bit_4.0.4 hms_0.5.3
[28] askpass_1.1 digest_0.6.27 stringi_1.5.3
[31] dplyr_1.0.2 grid_4.0.2 tools_4.0.2
[34] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
[37] RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[40] pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[43] xml2_1.3.2 prettyunits_1.1.1 assertthat_0.2.1
[46] httr_1.4.2 R6_2.5.0 compiler_4.0.2

GenomicFeatures TxDb • 3.5k views

ADD COMMENT • link 4.0 years ago sieminsk • 0

0

Entering edit mode

Well, so it looks like this GFF file contains an exon with no Parent attribute (ID=id-NC_008285.1:25367..25761-4 on line 1796721). First time ever.

FWIW this is what the GFF specs say about this:

"Orphan" exons CDSs, and other features. Ab initio gene prediction programs call hypothetical exons and CDS's that are attached to the genomic sequence and not necessarily to a known transcript. To handle these features, you may either (1) create a placeholder mRNA and use it as the parent for the exon and CDS subfeatures; or (2) attach the exons and CDSs directly to the gene. This is allowed by SO because of the transitive nature of the part_of relationship.

But in this case they've attached the exon to... nothing! This breaks makeTxDbFromGRanges() which is used internally by makeTxDbFromGFF(). A fix is on its way.

H.

ADD REPLY • link 4.0 years ago Hervé Pagès 16k

0

Entering edit mode

The file also contains some unusual trans-spliced genes (e.g. ID=gene-BRNAC_p045 at lines 1797501-1797505) which also break makeTxDbFromGRanges() so the fix will take a little bit longer.

ADD REPLY • link 4.0 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you.

ADD REPLY • link 4.0 years ago sieminsk • 0

score 0 · Answer 1 · 2021-04-01

This is fixed in GenomicFeatures 1.42.3 (release) and 1.43.8 (devel). Both versions should become available in the next 48h or so thru BiocManager::install().

With GenomicFeatures 1.42.3:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("GCF_000686985.2_Bra_napus_v2.0_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   some transcripts have no "transcript_id" attribute ==> their name
#   ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   the transcript names ("tx_name" column in the TxDb object) imported
#   from the "transcript_id" attribute are not unique
# 3: In .merge_transcript_parts(transcripts) :
#   The following transcripts were dropped because they have multiple parts
#   that could not be merged due to incompatible strands: gene-BRNAC_p045,
#   gene-BrnapMp015, gene-BrnapMp029, gene-BrnapMp070
# 4: In makeTxDbFromGRanges(gr, metadata = metadata) :
#   The following transcripts were dropped because their exon ranks could
#   not be inferred (either because the exons are not on the same
#   chromosome/strand or because they are not separated by introns):
#   gene-BRNAC_p045, gene-BrnapMp015, gene-BrnapMp029, gene-BrnapMp062,
#   gene-BrnapMp070

Please pay attention to the warnings. They give important clues about the problems found in the file.

A small number of transcripts couldn't be imported (see warnings 3 and 4) because they have exons on both strands (trans-splicing). TxDb objects don't support transcripts with trans-splicing.

Then:

txlens <- transcriptLengths(txdb, with.cds_len=TRUE)
head(txlens)
#   tx_id tx_name    gene_id nexon tx_len cds_len
# 1     1    <NA> BrnaMpl_p1     1   2973    2973
# 2     2    <NA> BrnaMpl_p5     1    411     411
# 3     3    <NA> BrnaMpl_p2     1   1077    1077
# 4     4    <NA> BrnaMpl_p3     1    954     954
# 5     5    <NA> BrnaMpl_p4     1   1089    1089
# 6     6    <NA> BrnaMpl_p6     1   3237    3237

tail(txlens)
#         tx_id        tx_name      gene_id nexon tx_len cds_len
# 157024 157024 XM_022715858.1 LOC111214013     4    864     744
# 157025 157025 XM_022715859.1 LOC111214014     1    432     432
# 157026 157026           <NA> LOC111214015     1   1186       0
# 157027 157027 XM_022715860.1 LOC111214016     1    998     906
# 157028 157028 XM_022715861.1 LOC111214017     5    605     420
# 157029 157029 XM_022715862.1 LOC111214018     5    590     480

Cheers,

H.