I am trying to construct a TxDb from a gff3 but it fails and I got the following error message:
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in as.vector(x, mode = "character") :
coercing an AtomicList object to an atomic vector is supported only for
objects with top-level elements of length <= 1
What can be the reason? My gff3 is not from a model species.
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.04
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsamtools_1.22.0 Biostrings_2.38.0 XVector_0.10.0 GenomicFeatures_1.22.4 AnnotationDbi_1.32.0
[6] Biobase_2.30.0 GenomicRanges_1.22.1 GenomeInfoDb_1.6.1 IRanges_2.4.1 S4Vectors_0.8.1
[11] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 GenomicAlignments_1.6.1 BiocParallel_1.4.0 tools_3.2.2
[5] SummarizedExperiment_1.0.1 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[9] rtracklayer_1.30.1 futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.7
[13] biomaRt_2.26.0 RSQLite_1.0.0 XML_3.98-1.3
This is the command I used
gtffile <- file.path(dir_csec,"Cryptotermes_secundus.gff")
(txdb <- makeTxDbFromGFF(gtffile, format="gff3", circ_seqs=character()))
This is how the first 10 lines look
##gff-version 3
##_######################
## GENOME
#some more comments...
##
C37016610 GffMM gene 1 234 . - . ID=Csec_G00001;Name=Csec_G00001;alias=csec_17774_G
C37016610 GeneWise transcript 1 234 50.65 - . ID=Csec_G00001_T1;Parent=Csec_G00001;shift=0;alias=csec_17774;evid_id=CG9836-PA_DROME-D4;identical_support_id=CUFF.152.1;nb_exon=1;part_support_id=CUFF.3355.1;source_id=CCG014480.1
C37016610 GeneWise CDS 1 234 . - 0 ID=Csec_G00001_T1_C1;Parent=Csec_G00001_T1
##
C37651885 GffMM gene 29 433 . + . ID=Csec_G00002;Name=Csec_G00002;alias=csec_17775_G
C37651885 GeneWise transcript 29 433 72 + . ID=Csec_G00002_T1;Parent=Csec_G00002;shift=0;alias=csec_17775;evid_id=GB18432-PA_APIME-D6;identical_support_id=CUFF.152.1;nb_exon=1;part_support_id=CUFF.3355.1;source_id=CCG014480.1
C37651885 GeneWise CDS 29 433 . + 0 ID=Csec_G00002_T1_C1;Parent=Csec_G00002_T1
##
C37688897 GffMM gene 75 380 . + . ID=Csec_G00003;Name=Csec_G00003;alias=csec_17776_G
This is how the last 10 lines look
##
scaffold29346 PASA gene 1 3199 . - . ID=Csec_G19035;Name=PASA_cluster_69985,Csec_G19035;alias=PASA_cluster_69985
scaffold29346 PASA transcript 1 3199 . - . ID=Csec_G19035_T1;Parent=Csec_G19035;Name=asmbl_81646;nb_exon=4;size=3199;alias=asmbl_81646
scaffold29346 PASA exon 1 137 . - . ID=Csec_G19035_T1_E4;Parent=Csec_G19035_T1;size=137;exon_index=4
scaffold29346 OrfPred three_prime_UTR 1 1 . - . ID=Csec_G19035_T1_U1;Parent=Csec_G19035_T1
scaffold29346 OrfPred CDS 2 119 . - 0 ID=Csec_G19035_T1_C1;Parent=Csec_G19035_T1;total_size=118
scaffold29346 OrfPred five_prime_UTR 120 137 . - . ID=Csec_G19035_T1_U2;Parent=Csec_G19035_T1
scaffold29346 PASA exon 2342 2468 . - . ID=Csec_G19035_T1_E3;Parent=Csec_G19035_T1;size=127;exon_index=3
scaffold29346 OrfPred five_prime_UTR 2342 2468 . - . ID=Csec_G19035_T1_U3;Parent=Csec_G19035_T1
scaffold29346 PASA exon 2587 2829 . - . ID=Csec_G19035_T1_E2;Parent=Csec_G19035_T1;size=243;exon_index=2
scaffold29346 OrfPred five_prime_UTR 2587 2829 . - . ID=Csec_G19035_T1_U4;Parent=Csec_G19035_T1
scaffold29346 PASA exon 2925 3199 . - . ID=Csec_G19035_T1_E1;Parent=Csec_G19035_T1;size=275;exon_index=1
scaffold29346 OrfPred five_prime_UTR 2925 3199 . - . ID=Csec_G19035_T1_U5;Parent=Csec_G19035_T1
##
The GFF and function call look okay at first glance. You could try using an online GFF validator (e.g. http://genometools.org/cgi-bin/gff3validator.cgi) to make sure the format is okay.
My first thought is that it might be the empty vector passed to the `circ_seqs` arg. The default value for that argument is:
So as long as none of your chromosomes have the same names and are linear (I would guess not..), then you can just leave the argument as is.
Does it work if you leave out the circ_seqs arg?
Do you think you could put the entire file somewhere online? That would make this much easier.