Good evening to everyone,
I'm trying to run the workflow DEXSeq to perform a DE analysis on Olea europaea transcripts.
I think I need some help here... More specifically, I'm stuck at the paragraph 2.4 Preparing the annotation. Apparently, I need to run the script dexseqprepareannotation.py to prepare my .GTF file for the workflow. I'm using the release Oe6 from https://denovo.cnag.cat/olive , which is a .GFF3 instead. This is the exact file I've used.
Maybe I'm missing something, I thought it would make more sense to use the .GFF3 directly, but I first converted it to .GTF and then to a .GFF to stick to the workflow, as follows:
gffread OE6A.newFA.gff3 -T -o OE6A.newFA.gtf
python dexseq_prepare_annotation.py OE6A.newFA.gtf OE6A.newFA.gff
After a few seconds, I get this error:
Traceback (most recent call last):
File "dexseq_prepare_annotation.py", line 127, in <module>
assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
AssertionError: <GenomicFeature: exonic_part 'OE6A112357' at Oe6_s02107: 324809 -> 321700 (strand '-')> starts too early
I tried to remove this specific gene from my .GFF, but there are plenty of them, always making the script outputting the same error, just with a different line. And of course, since I found at least a dozen, this will end up potentially cutting out important information.
Here, an user reported a similar problem. Apparently, it's something related to the way my annotation file was built.
Look at the OE6A112357 annotation retrieved from my .GTF.
Oe6_s02107 CNAG transcript 318359 324919 . + . transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 318359 318658 . + . transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 319380 319449 . + . transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 319924 321129 . + . transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 321690 324919 . + . transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 318600 318658 . + 0 transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 319380 319449 . + 1 transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 319924 321129 . + 0 transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 321690 322262 . + 0 transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107 CNAG transcript 319165 324919 . + . transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 319165 321129 . + . transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 321690 324919 . + . transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 320125 321129 . + 0 transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 321690 322262 . + 0 transcript_id "OE6A112357T2"; gene_id "OE6A112357";
**Oe6_s02107 CNAG transcript 321702 326749 . - . transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 321702 324810 . - . transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107 CNAG exon 325991 326749 . - . transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 322942 324810 . - 0 transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107 CNAG CDS 325991 326230 . - 0 transcript_id "OE6A112357T4"; gene_id "OE6A112357";**
The OE6A112357T4 part is apparently the problem, since that specific transcript is on the - strain, while the others are on the +.
Can anyone help me here?
Can I use my .GFF3 directly for the analysis somehow (I tried and it's not accepted by the scripts), maybe bypassing the problem? If this isn't the case, what may the problem about this script be and how can I solve it?
I'm also attaching my converted .GTF file, as obtained by gffread.
Thanks in advance for your help!
Hello, thanks for your answer! I have no idea how accurate the annotation may be, since it wasn't created by my research team. As far as I know, it's only a draft and a better version for this specific species is still to be created. For sure, it's quite interesting and unusual, as you said as well. Thanks for your suggestion, that's exactly what I did, even though it meant cutting out some information. In the end, I removed the genes which lead to this problem (only 18 of them, they were quite few, fortunately) from my annotation file manually. Id did this by just retrieving their names every time I re-ran the script and got an error, removing the corresponding annotation information from my .gtf file, re-inputing the corrected file and so on, till everything worked fine.