dexseq zero length exons
1
0
Entering edit mode
@jose-m-garcia-manteiga-6046
Last seen 7.7 years ago
Italy
Dear Alejandro, I have just analysed my results using DEXSeq 1.9. I had prepared the gff annotation using the version of dexseq_prepare.annotation.py in DEXSeq 1.9 that contained the -r parameter but counted with an old version, as you advised to me, of the dexseq_count.py since the newest included a check for a NH tag, which my bam files (SOAPSplice) did not contain. Everything worked fine but when checking some of the exons I realised that there are some, and some amongst the DEU results too, that showed length equal zero. I have checked and there are a number of those amongst the GFF. I have checked in the original gtf file I got from Ensembl (v72 hg19) for one of those genes and I do not understand why there should be one of length 0. Here you are a histogram of all exons showing very short exons and 0 length ones and an example of the gtf and gff file prepared by dexseq_prepare_annotation.py. Is it normal? How can there be exons with the same start and end in the gff file? How can they have reads assigned and even pass the tests? # exons length exon.length<-fData(ecs)$end - fData(ecs)$start hist(log2(exon.length),breaks=100) The gff file 'grepped' for one example: $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff chr1 dexseq_prepare_annotation.py aggregate_gene 24683489 24743424 . - . gene_id "ENSG00000001460" *chr1 dexseq_prepare_annotation.py exonic_part 24683489 24683489 . - . transcripts "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"* chr1 dexseq_prepare_annotation.py exonic_part 24683490 24683494 . - . transcripts "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24683495 24683526 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; exonic_part_number "003"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24683527 24685109 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0 0000337248"; exonic_part_number "004"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24687341 24687531 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0 0000337248"; exonic_part_number "005"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24695194 24695532 . - . transcripts "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460" Now the gtf used as input for dexseq_prepare_annotation.py for that gene searching for the start and end coordinate: grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489' chr1 hg19_ensGene exon *24683489 24685109* 0.000000 - . gene_id "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1"; It seems that the gtf is OK and there is an exon that starts at that position but nothing at the end. I checked also the ecs object coming from an older analysis of the same dataset using the two py scripts from the same release that gave me a good results but I wanted to repeat to include the -r parameter, and it also contains a lot of these zero length exons. Is it normal? I have further checked within the Genome Browser and it seems that there is another Transcript with start at +1 with respect to that. If this is so it means that zero-length in the end-start difference from the gff file that I found in the fData of the ecs object is indeed 1bp length and just a matter of 0/1 conventions. If this is so, I would be more relieved, but please confirm. In anycase, may I ask how reliable counts of ~75bp read can be to an exon made from 1 pb difference in the start of two transcripts in the annotation? how can they even pass the filter of p-value for DEU analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons? Thanks again for your help Best Jose -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel at hsr.it -------------- next part -------------- A non-text attachment was scrubbed... Name: Exon_length_1.9.pdf Type: application/pdf Size: 5196 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140310="" b3021f2f="" attachment.pdf="">
Annotation DEXSeq Annotation DEXSeq • 1.3k views
ADD COMMENT
0
Entering edit mode
Alejandro Reyes ★ 1.9k
@alejandro-reyes-5124
Last seen 4 months ago
Novartis Institutes for BioMedical Reseā€¦
Dear Jose, Thanks a lot for your detailed report! The cases that you mention are "normal" cases that are originated from our exon bining, these are instances where the annotation file contains cases like the one you just describe (having a transcript initiation site or alternative splice site with single base pair differences). These are exon bins covering a single base pair (the length of the exon bins is fData(ecs)$end - fData(ecs)$start + 1). The reason you can get counts and can achieve significance for DEU for small exon bins is because the reads mapping to multiple exon bins are counted in all the exon bins, so these are cases where the read maps to this small exon bins and many other regions of the genome. I think having differences in these exons can still biologically informative despite its length, but it always depends on the analysis you are doing. Say, if you are interested in large cassette skipping events you could get rid of this small bins and reduce the number of tests. Best regards, Alejandro > Dear Alejandro, > I have just analysed my results using DEXSeq 1.9. > I had prepared the gff annotation using the version of > dexseq_prepare.annotation.py <http: dexseq_prepare.annotation.py=""> in > DEXSeq 1.9 that contained the -r parameter but counted with an old > version, as you advised to me, of the dexseq_count.py since the newest > included a check for a NH tag, which my bam files (SOAPSplice) did not > contain. Everything worked fine but when checking some of the exons > I realised that there are some, and some amongst the DEU results too, > that showed length equal zero. I have checked and there are a number > of those amongst the GFF. I have checked in the original gtf file I > got from Ensembl (v72 hg19) for one of those genes and I do not > understand why there should be one of length 0. > Here you are a histogram of all exons showing very short exons and 0 > length ones and an example of the gtf and gff file prepared by > dexseq_prepare_annotation.py. > Is it normal? How can there be exons with the same start and end in > the gff file? How can they have reads assigned and even pass the tests? > > # exons length > exon.length<-fData(ecs)$end - fData(ecs)$start > hist(log2(exon.length),breaks=100) > > The gff file 'grepped' for one example: > > $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff > > chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.gen e_id > "ENSG00000001460" > > *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.t ranscripts > "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"* > > chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.transc ripts > "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id > "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; > exonic_part_number "003"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > exonic_part_number "004"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > exonic_part_number "005"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.transc ripts > "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460" > > Now the gtf used as input for dexseq_prepare_annotation.py for that > gene searching for the start and end coordinate: > > grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489' > > chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id > "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1"; > > It seems that the gtf is OK and there is an exon that starts at that > position but nothing at the end. > > I checked also the ecs object coming from an older analysis of the > same dataset using the two py scripts from the same release that gave > me a good results but I wanted to repeat to include the -r parameter, > and it also contains a lot of these zero length exons. > > Is it normal? > > I have further checked within the Genome Browser and it seems that > there is another Transcript with start at +1 with respect to that. > > If this is so it means that zero-length in the end-start difference > from the gff file that I found in the fData of the ecs object is > indeed 1bp length and just a matter of 0/1 conventions. If this is so, > I would be more relieved, but please confirm. > > In anycase, may I ask how reliable counts of ~75bp read can be to an > exon made from 1 pb difference in the start of two transcripts in the > annotation? how can they even pass the filter of p-value for DEU > analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons? > > > Thanks again for your help > > Best > > Jose > > > > > > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">
ADD COMMENT
0
Entering edit mode
Dear Alejandro, Thanks for the reply. I will keep them, then. Just to know whether I understood the implications: Is it possible that a DEU coming from a 1bp exon bin is resulting from a +/-1 different transcripition start (if first exon) or a +/-1 alternative splicing that could produce a change in the ORF of the translated protein (same for non 3 multiples +/-)?? If so, do you have any evidence that these are the typical cases? It could turn extremely useful information if the DEU for these exons is trustable to identify these rare events with 'consequences' that could have a deep impact in biology. Thanks again Best Jose 2014-03-10 15:57 GMT+01:00 Alejandro Reyes <alejandro.reyes@embl.de>: > Dear Jose, > > Thanks a lot for your detailed report! > > The cases that you mention are "normal" cases that are originated from > our exon bining, these are instances where the annotation file contains > cases like the one you just describe (having a transcript initiation > site or alternative splice site with single base pair differences). > These are exon bins covering a single base pair (the length of the exon > bins is fData(ecs)$end - fData(ecs)$start + 1). > > The reason you can get counts and can achieve significance for DEU for > small exon bins is because the reads mapping to multiple exon bins are > counted in all the exon bins, so these are cases where the read maps to > this small exon bins and many other regions of the genome. I think > having differences in these exons can still biologically informative > despite its length, but it always depends on the analysis you are > doing. Say, if you are interested in large cassette skipping events you > could get rid of this small bins and reduce the number of tests. > > Best regards, > Alejandro > > > > > Dear Alejandro, > > I have just analysed my results using DEXSeq 1.9. > > I had prepared the gff annotation using the version of > > dexseq_prepare.annotation.py <http: dexseq_prepare.annotation.py=""> in > > DEXSeq 1.9 that contained the -r parameter but counted with an old > > version, as you advised to me, of the dexseq_count.py since the newest > > included a check for a NH tag, which my bam files (SOAPSplice) did not > > contain. Everything worked fine but when checking some of the exons > > I realised that there are some, and some amongst the DEU results too, > > that showed length equal zero. I have checked and there are a number > > of those amongst the GFF. I have checked in the original gtf file I > > got from Ensembl (v72 hg19) for one of those genes and I do not > > understand why there should be one of length 0. > > Here you are a histogram of all exons showing very short exons and 0 > > length ones and an example of the gtf and gff file prepared by > > dexseq_prepare_annotation.py. > > Is it normal? How can there be exons with the same start and end in > > the gff file? How can they have reads assigned and even pass the tests? > > > > # exons length > > exon.length<-fData(ecs)$end - fData(ecs)$start > > hist(log2(exon.length),breaks=100) > > > > The gff file 'grepped' for one example: > > > > $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff > > > > chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.g ene_id > > "ENSG00000001460" > > > > > *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.t ranscripts > > "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"* > > > > chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.tran scripts > > "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id > > "ENSG00000001460" > > > > chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.tran scripts > > "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; > > exonic_part_number "003"; gene_id "ENSG00000001460" > > > > chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.tran scripts > > > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > > exonic_part_number "004"; gene_id "ENSG00000001460" > > > > chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.tran scripts > > > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > > exonic_part_number "005"; gene_id "ENSG00000001460" > > > > chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.tran scripts > > "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460" > > > > Now the gtf used as input for dexseq_prepare_annotation.py for that > > gene searching for the start and end coordinate: > > > > grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489' > > > > chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id > > "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1"; > > > > It seems that the gtf is OK and there is an exon that starts at that > > position but nothing at the end. > > > > I checked also the ecs object coming from an older analysis of the > > same dataset using the two py scripts from the same release that gave > > me a good results but I wanted to repeat to include the -r parameter, > > and it also contains a lot of these zero length exons. > > > > Is it normal? > > > > I have further checked within the Genome Browser and it seems that > > there is another Transcript with start at +1 with respect to that. > > > > If this is so it means that zero-length in the end-start difference > > from the gff file that I found in the fData of the ecs object is > > indeed 1bp length and just a matter of 0/1 conventions. If this is so, > > I would be more relieved, but please confirm. > > > > In anycase, may I ask how reliable counts of ~75bp read can be to an > > exon made from 1 pb difference in the start of two transcripts in the > > annotation? how can they even pass the filter of p-value for DEU > > analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons? > > > > > > Thanks again for your help > > > > Best > > > > Jose > > > > > > > > > > > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6