DEXSeq: problem with dexseq_prepare_annotation.py
1
0
Entering edit mode
@stephen-turner-4916
Last seen 6.5 years ago
United States
Alejandro, Simon, Wolfgang, et al.: I'm trying to use the dexseq_prepare_annotation.py script to parse the UCSC hg18 genes.gtf GTF file included with the Illumina igenomes packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an error: Traceback (most recent call last): File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in <module> raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) ) ValueError: Same name found on two chromosomes: <genomicfeature: exonic_part="" 'cfb'="" at="" chr6_qbl_hap2:="" 3167392="" -=""> 3167602 (strand '+')>, <genomicfeature: exonic_part="" 'cfb'="" at="" chr6_cox_hap1:="" 3359983="" -=""> 3360325 (strand '+')> I'm guessing this is because the same gene name is found in two separate places. I'm not entirely sure what these two chromosomal segments refer to, but I removed them from the GTF file and the python script threw another error: Traceback (most recent call last): File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in <module> assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early" AssertionError: <genomicfeature: exonic_part="" 'hist2h3c+hist2h3a'="" at="" chr1:="" 148079388="" -=""> 148078883 (strand '-')> starts too early I'm really unsure what to make of this or how to fix it. The script works without issues with the Ensembl GTF. Any help would be greatly appreciated. Stephen ----------------------------------------- Stephen D. Turner, Ph.D. Bioinformatics Core Director University of Virginia School of Medicine bioinformatics.virginia.edu
• 3.1k views
ADD COMMENT
0
Entering edit mode
Alejandro Reyes ★ 1.9k
@alejandro-reyes-5124
Last seen 5 months ago
Novartis Institutes for BioMedical Reseā€¦
Dear Stephen, The problem is that the exon is finishing (148078883) before it starts (148079388). Usually the gtf files contain the "left most" position as start position independently of the strand, also the UCSC annotations but with a few exceptions as the one you mentioned. The reason of the script complain is that these gtf parsers in HTSeq were written and tested for ENSEMBL annotation files, where these kind of "errors" are basically absent. Assuming that these are simple mistakes in the gtf files, it is easy to solve them just by flipping the positions of the start and ends when start > end, or maybe by deleting this genes from the gtf files. Anyway, these cases are really rare, like ~15 in the human UCSC genome. Alejandro > Alejandro, Simon, Wolfgang, et al.: > > I'm trying to use the dexseq_prepare_annotation.py script to parse the > UCSC hg18 genes.gtf GTF file included with the Illumina igenomes > packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an > error: > > Traceback (most recent call last): > File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in<module> > raise ValueError, "Same name found on two chromosomes: %s, %s" % ( > str(l[i]), str(l[i+1]) ) > ValueError: Same name found on two chromosomes:<genomicfeature:> exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>, > <genomicfeature: exonic_part="" 'cfb'="" at="" chr6_cox_hap1:="" 3359983="" -=""> > 3360325 (strand '+')> > > I'm guessing this is because the same gene name is found in two > separate places. I'm not entirely sure what these two chromosomal > segments refer to, but I removed them from the GTF file and the python > script threw another error: > > Traceback (most recent call last): > File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in<module> > assert l[i].iv.end<= l[i+1].iv.start, str(l[i+1]) + " starts too early" > AssertionError:<genomicfeature: exonic_part="" 'hist2h3c+hist2h3a'="" at=""> chr1: 148079388 -> 148078883 (strand '-')> starts too early > > I'm really unsure what to make of this or how to fix it. The script > works without issues with the Ensembl GTF. Any help would be greatly > appreciated. > > Stephen > > ----------------------------------------- > Stephen D. Turner, Ph.D. > Bioinformatics Core Director > University of Virginia School of Medicine > bioinformatics.virginia.edu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Dear Stephen as you might have guessed, the reason that the script raises these errors is to make you (i.e. users) look at such dodgy GTF records and to verify that these are indeed easily fixable formatting mistakes - instead of just proceeding with some sort of automatic resolution and potentially running into bigger, more obscure errors downstream. I suppose if we have enough confidence on what the typical sources of dodginess are, and how their resolution can be automated, that could be added to the script. Contributions are welcome. Best wishes Wolfgang Apr/18/12 9:30 AM, Alejandro Reyes scripsit:: > Dear Stephen, > > The problem is that the exon is finishing (148078883) before it starts > (148079388). > Usually the gtf files contain the "left most" position as start position > independently of the strand, also the UCSC annotations but with a few > exceptions as the one you mentioned. The reason of the script complain > is that these gtf parsers in HTSeq were written and tested for ENSEMBL > annotation files, where these kind of "errors" are basically absent. > Assuming that these are simple mistakes in the gtf files, it is easy to > solve them just by flipping the positions of the start and ends when > start > end, or maybe by deleting this genes from the gtf files. Anyway, > these cases are really rare, like ~15 in the human UCSC genome. > > Alejandro > > > >> Alejandro, Simon, Wolfgang, et al.: >> >> I'm trying to use the dexseq_prepare_annotation.py script to parse the >> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes >> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an >> error: >> >> Traceback (most recent call last): >> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in<module> >> raise ValueError, "Same name found on two chromosomes: %s, %s" % ( >> str(l[i]), str(l[i+1]) ) >> ValueError: Same name found on two chromosomes:<genomicfeature:>> exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>, >> <genomicfeature: exonic_part="" 'cfb'="" at="" chr6_cox_hap1:="" 3359983="" -=""> >> 3360325 (strand '+')> >> >> I'm guessing this is because the same gene name is found in two >> separate places. I'm not entirely sure what these two chromosomal >> segments refer to, but I removed them from the GTF file and the python >> script threw another error: >> >> Traceback (most recent call last): >> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in<module> >> assert l[i].iv.end<= l[i+1].iv.start, str(l[i+1]) + " starts too early" >> AssertionError:<genomicfeature: exonic_part="" 'hist2h3c+hist2h3a'="" at="">> chr1: 148079388 -> 148078883 (strand '-')> starts too early >> >> I'm really unsure what to make of this or how to fix it. The script >> works without issues with the Ensembl GTF. Any help would be greatly >> appreciated. >> >> Stephen >> >> ----------------------------------------- >> Stephen D. Turner, Ph.D. >> Bioinformatics Core Director >> University of Virginia School of Medicine >> bioinformatics.virginia.edu >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD REPLY

Login before adding your answer.

Traffic: 648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6