Problem with makeTxDbFromGFF
@fce3b503

I'm trying to use the function makeTxDbFromGFF and am getting errors. There are three .gff3 files I'm using at https://download.xenbase.org/pub/Genomics/JGI/Xenla9.2/

XENLA_9.2_Xenbase.gff3.gz XENLA_9.2_GCA.gff3.gz XENLA_9.2_GCF.gff3.gz

I don't know what the differences are so I'm trying all three to see which gives me the best result. My R commands are:

TxDb.xlaevis_xenbase <- makeTxDbFromGFF("XENLA_9.2_Xenbase.gff3")
TxDb.xlaevis_GCF <- makeTxDbFromGFF("XENLA_9.2_GCF.gff3")
TxDb.xlaevis_GCA <- makeTxDbFromGFF("XENLA_9.2_GCA.gff3")

The first command, using XENLA_9.2_Xenbase.gff3, gives the following error:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in as.vector(x, mode) : 
  coercing an AtomicList object to an atomic vector is supported only for
  objects with top-level elements of length <= 1

Does anyone know why this is or how to fix it?

The command calling the GCF file works but gives me warnings. The command calling the GCA file works perfectly.

I tried running XENLA_9.2_Xenbase.gff3 through http://genometools.org/cgi-bin/gff3validator.cgi and it tells me that the .gff3 is too large.

The first lines of XENLA_9.2_Xenbase.gff3:

#gff-version 3
#data-version 2017-08-28
#species Xenopus laevis
#genome build 9.2
#genome assembler NCBI
#genome accession GCF_001663975.1
#genome FASTA file ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XL9_2.fa.gz
#RefSeq-Accn converted to Sequence-Name via ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/663/975/GCF_001663975.1_Xenopus_laevis_v2/GCF_001663975.1_Xenopus_laevis_v2_assembly_report.txt
MT  Xenbase gene    2136    2204    .   +   .   ID=gene42065;Alias=XL-9_2-gene42065;Name=mt-trna-phe.L;Dbxref=Xenbase:XB-GENE-22251956;Note=NAR: 1461;Original=RefSeq_rna62470;anticodon=(pos:2166..2168);gbkey=tRNA;product=tRNA-Phe;curie=Xenbase:XB-GENE-22251956;gene_id=Xenbase:XB-GENE-22251956;Ontology_term=SO:0001272
MT  Xenbase tRNA    2136    2204    .   +   .   ID=rna100000;Alias=XL-9_2-rna100000;Name=rna100000;Parent=gene42065;curie=modelID:XL-9_2-rna100000;transcript_id=modelID:XL-9_2-rna100000;Ontology_term=SO:0000253
MT  Xenbase exon    2136    2204    .   +   .   ID=id856313;Alias=XL-9_2-id856313;Parent=rna100000;gbkey=tRNA
MT  Xenbase gene    2205    3023    .   +   .   ID=gene34778;Alias=XL-9_2-gene34778;Name=mt-rnr1.L;Dbxref=Xenbase:XB-GENE-22251886;Original=RefSeq_rna62492;gbkey=rRNA;product=12S ribosomal RNA;curie=Xenbase:XB-GENE-22251886;gene_id=Xenbase:XB-GENE-22251886;Ontology_term=SO:0001637
MT  Xenbase rRNA    2205    3023    .   +   .   ID=rna100001;Alias=XL-9_2-rna100001;Name=rna100001;Parent=gene34778;curie=modelID:XL-9_2-rna100001;transcript_id=modelID:XL-9_2-rna100001;Ontology_term=SO:0000252
MT  Xenbase exon    2205    3023    .   +   .   ID=id733233;Alias=XL-9_2-id733233;Parent=rna100001;gbkey=rRNA
MT  Xenbase gene    3024    3092    .   +   .   ID=gene48202;Alias=XL-9_2-gene48202;Name=mt-trna-val.L;Dbxref=Xenbase:XB-GENE-22251991;Original=RefSeq_rna62471;anticodon=(pos:3054..3056);gbkey=tRNA;product=tRNA-Val;curie=Xenbase:XB-GENE-22251991;gene_id=Xenbase:XB-GENE-22251991;Ontology_term=SO:0001272
MT  Xenbase tRNA    3024    3092    .   +   .   ID=rna100002;Alias=XL-9_2-rna100002;Name=rna100002;Parent=gene48202;curie=modelID:XL-9_2-rna100002;transcript_id=modelID:XL-9_2-rna100002;Ontology_term=SO:0000253
MT  Xenbase exon    3024    3092    .   +   .   ID=id956642;Alias=XL-9_2-id956642;Parent=rna100002;gbkey=tRNA
MT  Xenbase gene    3093    4723    .   +   .   ID=gene44770;Alias=XL-9_2-gene44770;Name=mt-rnr2.L;Dbxref=Xenbase:XB-GENE-22251891;Original=RefSeq_rna62493;gbkey=rRNA;product=16S ribosomal RNA;curie=Xenbase:XB-GENE-22251891;gene_id=Xenbase:XB-GENE-22251891;Ontology_term=SO:0001637
MT  Xenbase rRNA    3093    4723    .   +   .   ID=rna100003;Alias=XL-9_2-rna100003;Name=rna100003;Parent=gene44770;curie=modelID:XL-9_2-rna100003;transcript_id=modelID:XL-9_2-rna100003;Ontology_term=SO:0000252
MT  Xenbase exon    3093    4723    .   +   .   ID=id903228;Alias=XL-9_2-id903228;Parent=rna100003;gbkey=rRNA
MT  Xenbase gene    4724    4798    .   +   .   ID=gene43253;Alias=XL-9_2-gene43253;Name=mt-trna-leu1.L;Dbxref=Xenbase:XB-GENE-22251946;Original=RefSeq_rna62472;anticodon=(pos:4759..4761);gbkey=tRNA;product=tRNA-Leu;curie=Xenbase:XB-GENE-22251946;gene_id=Xenbase:XB-GENE-22251946;Ontology_term=SO:0001272
MT  Xenbase tRNA    4724    4798    .   +   .   ID=rna100004;Alias=XL-9_2-rna100004;Name=rna100004;Parent=gene43253;curie=modelID:XL-9_2-rna100004;transcript_id=modelID:XL-9_2-rna100004;Ontology_term=SO:0000253
MT  Xenbase exon    4724    4798    .   +   .   ID=id878652;Alias=XL-9_2-id878652;Parent=rna100004;gbkey=tRNA
MT  Xenbase gene    4799    5770    .   +   .   ID=gene41609;Alias=XL-9_2-gene41609;Name=nd1.L;Dbxref=GeneID:2642086,Xenbase:XB-GENE-6251959;gbkey=Gene;gene=nd1.L;gene_biotype=protein_coding;curie=Xenbase:XB-GENE-6251959;gene_id=Xenbase:XB-GENE-6251959;Ontology_term=SO:0001217
MT  Xenbase mRNA    4799    5770    .   +   .   ID=rna100005;Alias=XL-9_2-rna100005;Name=rna100005;Parent=gene41609;curie=modelID:XL-9_2-rna100005;transcript_id=modelID:XL-9_2-rna100005;Ontology_term=SO:0000234
MT  Xenbase CDS 4799    5770    .   +   0   ID=cds781946;Alias=XL-9_2-cds781946;Parent=rna100005;gbkey=CDS;protein_id=modelID:XL-9_2-cds781946

sessionInfo( )

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicAlignments_1.32.1 Rsamtools_2.12.0
[3] Biostrings_2.64.1 XVector_0.36.0
[5] openxlsx_4.2.5.1 xlaevis.db_3.2.3
[7] org.Xl.eg.db_3.15.0 ChIPseeker_1.32.1
[9] rtracklayer_1.56.1 TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0
[11] GenomicFeatures_1.48.4 AnnotationDbi_1.58.0
[13] ChIPQC_1.32.2 BiocParallel_1.30.4
[15] DiffBind_3.6.5 SummarizedExperiment_1.26.1
[17] Biobase_2.56.0 MatrixGenerics_1.8.1
[19] matrixStats_0.63.0 GenomicRanges_1.48.0
[21] GenomeInfoDb_1.32.4 IRanges_2.30.1
[23] S4Vectors_0.34.0 BiocGenerics_0.42.0
[25] ggplot2_3.4.0 BiocManager_1.30.19

loaded via a namespace (and not attached):
[1] utf8_1.2.2 tidyselect_1.2.0
[3] RSQLite_2.2.18 htmlwidgets_1.6.1
[5] grid_4.2.2 scatterpie_0.1.8
[7] munsell_0.5.0 codetools_0.2-18
[9] interp_1.1-3 systemPipeR_2.2.2
[11] withr_2.5.0 colorspace_2.0-3
[13] GOSemSim_2.22.0 filelock_1.0.2
[15] rstudioapi_0.14 rJava_1.0-6
[17] DOSE_3.22.1 bbmle_1.0.25
[19] GenomeInfoDbData_1.2.8 mixsqp_0.3-48
[21] hwriter_1.3.2.1 polyclip_1.10-4
[23] bit64_4.0.5 farver_2.1.1
[25] coda_0.19-4 vctrs_0.5.0
[27] treeio_1.20.2 TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
[29] generics_0.1.3 BiocFileCache_2.4.0
[31] R6_2.5.1 apeglm_1.18.0
[33] graphlayouts_0.8.4 invgamma_1.1
[35] RVenn_1.1.0 locfit_1.5-9.7
[37] bitops_1.0-7 cachem_1.0.6
[39] fgsea_1.22.0 gridGraphics_0.5-1
[41] DelayedArray_0.22.0 assertthat_0.2.1
[43] BiocIO_1.6.0 scales_1.2.1
[45] ggraph_2.1.0 enrichplot_1.16.2
[47] gtable_0.3.1 tidygraph_1.2.2
[49] xlsx_0.6.5 rlang_1.0.6
[51] splines_4.2.2 lazyeval_0.2.2
[53] yaml_2.3.6 reshape2_1.4.4
[55] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2 qvalue_2.28.0
[57] tools_4.2.2 ggplotify_0.1.0
[59] ellipsis_0.3.2 gplots_3.1.3
[61] RColorBrewer_1.1-3 Rcpp_1.0.9
[63] plyr_1.8.8 progress_1.2.2
[65] zlibbioc_1.42.0 purrr_1.0.1
[67] RCurl_1.98-1.9 prettyunits_1.1.1
[69] deldir_1.0-6 viridis_0.6.2
[71] ashr_2.2-54 chipseq_1.46.0
[73] ggrepel_0.9.2 magrittr_2.0.3
[75] data.table_1.14.6 TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
[77] DO.db_2.9 truncnorm_1.0-8
[79] mvtnorm_1.1-3 SQUAREM_2021.1
[81] amap_0.8-19 TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
[83] hms_1.1.2 xlsxjars_0.6.1
[85] patchwork_1.1.2 XML_3.99-0.13
[87] emdbook_1.3.12 jpeg_0.1-10
[89] gridExtra_2.3 compiler_4.2.2
[91] biomaRt_2.52.0 bdsmatrix_1.3-6
[93] tibble_3.1.8 shadowtext_0.1.2
[95] KernSmooth_2.23-20 crayon_1.5.2
[97] htmltools_0.5.4 ggfun_0.0.9
[99] ggVennDiagram_1.2.2 tidyr_1.2.1
[101] aplot_0.1.9 DBI_1.1.3
[103] tweenr_2.0.2 dbplyr_2.3.0
[105] MASS_7.3-58.1 rappdirs_0.3.3
[107] boot_1.3-28 ShortRead_1.54.0
[109] Matrix_1.5-3 cli_3.4.1
[111] parallel_4.2.2 igraph_1.3.5
[113] pkgconfig_2.0.3 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[115] numDeriv_2016.8-1.1 TxDb.Celegans.UCSC.ce6.ensGene_3.2.2
[117] xml2_1.3.3 ggtree_3.4.4
[119] yulab.utils_0.0.6 stringr_1.5.0
[121] digest_0.6.31 fastmatch_1.1-3
[123] tidytree_0.4.2 restfulr_0.0.15
[125] GreyListChIP_1.28.1 curl_5.0.0
[127] gtools_3.9.4 rjson_0.2.21
[129] jsonlite_1.8.4 lifecycle_1.0.3
[131] nlme_3.1-160 viridisLite_0.4.1
[133] limma_3.52.4 BSgenome_1.64.0
[135] fansi_1.0.3 pillar_1.8.1
[137] lattice_0.20-45 Nozzle.R1_1.1-1.1
[139] plotrix_3.8-2 KEGGREST_1.36.3
[141] fastmap_1.1.0 httr_1.4.4
[143] GO.db_3.15.0 glue_1.6.2
[145] zip_2.2.2 png_0.1-7
[147] bit_4.0.4 ggforce_0.4.1
[149] stringi_1.7.8 blob_1.2.3
[151] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 latticeExtra_0.6-30
[153] caTools_1.18.2 memoise_2.0.1
[155] dplyr_1.0.10 irlba_2.3.5.1

@james-w-macdonald-5106

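That looks like a weird GFF file. You are getting the error because the Name column in the mcols is a CharacterList rather than a character vector, and some of its elements evidently hold more than one value, so it cannot be coerced to an atomic vector.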

> library(rtracklayer)
> z <- import("https://download.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_Xenbase.gff3.gz")
trying URL 'https://download.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_Xenbase.gff3.gz'
Content type 'application/x-gzip' length 30158361 bytes (28.8 MB)
downloaded 28.8 MB

> library(GenomicFeatures)
## this should be character
> z$Name
CharacterList of length 2063094
[[1]] mt-trna-phe.L
[[2]] rna100000
[[3]] character(0)
[[4]] mt-rnr1.L
[[5]] rna100001
[[6]] character(0)
[[7]] mt-trna-val.L
[[8]] rna100002
[[9]] character(0)
[[10]] mt-rnr2.L
...
<2063084 more elements>

## fix it
> huh <- z$Name
> huh <- sapply(huh, "[", 1)
> head(huh)
[1] "mt-trna-phe.L" "rna100000"     NA              "mt-rnr1.L"    
[5] "rna100001"     NA             
> z$Name <- huh
> zz <- makeTxDbFromGRanges(z)
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported
  from the "transcript_id" attribute are not unique
2: In makeTxDbFromGRanges(z) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  rna10303, rna10687, rna11435, rna1157, rna12115, rna13351, rna13723,
  rna13744, rna14109, rna1411, rna14130, rna14263, rna15160, rna15611,
  rna16129, rna16446, rna16675, rna16967, rna17194, rna17330, rna17409,
  rna18725, rna18950, rna19885, rna20615, rna20833, rna21339, rna21340,
  rna21850, rna22060, rna22155, rna22288, rna22501, rna22622, rna22837,
  rna23221, rna23226, rna23559, rna23665, rna23698, rna23890, rna2393,
  rna24227, rna24494, rna24517, rna24863, rna25442, rna25635, rna26216,
  rna26523, rna27122, rna27630, rna27832, rna2785, rna27873, rna27956,
  rna28032, rna28275, rna2924, rna30023, rna30410, rna31312, rna31398,
  rna3150, rna31758, rna32131, rna32559, rna32621, rna3289, rna33031,
  rna34079, rna34778, rna34788, rna34898, rna35277, rna35695, rna35753,
  rna35953, rn [... truncated]
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were dropped because they have CDSs that
  cannot be mapped to an exon: rna10487, rna11002, rna11593, rna11855,
  rna20002, rna32222, rna36395, rna44170, rna44743, rna55375, rna57920,
  rna6788
4: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS
  (only the first CDS was kept for each exon): rna12620, rna17993,
  rna21044, rna24325, rna25779, rna27854, rna28205, rna30505, rna32622,
  rna32705, rna33396, rna33706, rna35285, rna37985, rna41612, rna43478,
  rna43582, rna43736, rna45095, rna46783, rna47243, rna47992, rna48933,
  rna49492, rna4969, rna51025, rna51046, rna52009, rna52430, rna54960,
  rna55327, rna56345, rna56648, rna59410, rna59538, rna59796, rna59989,
  rna80700, rna81379, rna897, rna9182

That's a lot of warnings, so you might want to either use one of the other two GFF files or double check this one.
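
If you want to see how many features are affected before collapsing the names, here is a minimal base-R sketch of the same idea. It uses a plain list as a toy stand-in for z$Name (the values "geneA" and "geneB" are made up for illustration, they are not from the file), counts the values per element, and keeps the first value or NA where there is none. On the real CharacterList, elementNROWS(z$Name) should give the same per-element counts.

```r
# Toy stand-in for the Name column. Assumption: a plain list here,
# whereas the real object returned by import() is a CharacterList.
name_list <- list("mt-trna-phe.L", "rna100000", character(0), c("geneA", "geneB"))

# Number of values each feature carries; any element with more than one
# value is what prevents coercion to an atomic vector.
n_values <- lengths(name_list)
multi <- which(n_values > 1)

# Keep the first value per feature, and NA where there is no value at all.
first_name <- vapply(
  name_list,
  function(x) if (length(x) > 0) x[[1]] else NA_character_,
  character(1)
)

print(multi)
print(first_name)
```

Taking the first element silently discards any additional names, so it is worth looking at the features indexed by multi before overwriting the column.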

thanks so much...really appreciate it.
