Dear Alejandro,
I have just analysed my results using DEXSeq 1.9.
I had prepared the gff annotation using the version of
dexseq_prepare.annotation.py in DEXSeq 1.9 that contained the -r
parameter
but counted with an old version, as you advised to me, of
the dexseq_count.py since the newest included a check for a NH tag,
which
my bam files (SOAPSplice) did not contain. Everything worked fine but
when
checking some of the exons I realised that there are some, and some
amongst
the DEU results too, that showed length equal zero. I have checked and
there are a number of those amongst the GFF. I have checked in the
original
gtf file I got from Ensembl (v72 hg19) for one of those genes and I do
not
understand why there should be one of length 0.
Here you are a histogram of all exons showing very short exons and 0
length
ones and an example of the gtf and gff file prepared by
dexseq_prepare_annotation.py.
Is it normal? How can there be exons with the same start and end in
the gff
file? How can they have reads assigned and even pass the tests?
# exons length
exon.length<-fData(ecs)$end - fData(ecs)$start
hist(log2(exon.length),breaks=100)
The gff file 'grepped' for one example:
$ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff
chr1 dexseq_prepare_annotation.py aggregate_gene 24683489 24743424 . -
. gene_id
"ENSG00000001460"
*chr1 dexseq_prepare_annotation.py exonic_part 24683489 24683489 . - .
transcripts "ENST00000374409"; exonic_part_number "001"; gene_id
"ENSG00000001460"*
chr1 dexseq_prepare_annotation.py exonic_part 24683490 24683494 . - .
transcripts
"ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id
"ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24683495 24683526 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409";
exonic_part_number "003"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24683527 24685109 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0
0000337248";
exonic_part_number "004"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24687341 24687531 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0
0000337248";
exonic_part_number "005"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24695194 24695532 . - .
transcripts
"ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460"
Now the gtf used as input for dexseq_prepare_annotation.py for that
gene
searching for the start and end coordinate:
grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489'
chr1 hg19_ensGene exon *24683489 24685109* 0.000000 - . gene_id
"ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name
"STPG1";
It seems that the gtf is OK and there is an exon that starts at that
position but nothing at the end.
I checked also the ecs object coming from an older analysis of the
same
dataset using the two py scripts from the same release that gave me a
good
results but I wanted to repeat to include the -r parameter, and it
also
contains a lot of these zero length exons.
Is it normal?
I have further checked within the Genome Browser and it seems that
there is
another Transcript with start at +1 with respect to that.
If this is so it means that zero-length in the end-start difference
from
the gff file that I found in the fData of the ecs object is indeed 1bp
length and just a matter of 0/1 conventions. If this is so, I would be
more
relieved, but please confirm.
In anycase, may I ask how reliable counts of ~75bp read can be to an
exon
made from 1 pb difference in the start of two transcripts in the
annotation? how can they even pass the filter of p-value for DEU
analysis?
Shouldn't it be a sort of cut off for this 'pseudo'exons?
Thanks again for your help
Best
Jose
--
Jose M. Garcia Manteiga PhD
Research Assistant - Data Analysis in Functional Genomics
Center for Translational Genomics and BioInformatics
Dibit2-Basilica, 3A3
San Raffaele Scientific Institute
Via Olgettina 58, 20132 Milano (MI), Italy
Tel: +39-02-2643-4751
e-mail: garciamanteiga.josemanuel at hsr.it
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Exon_length_1.9.pdf
Type: application/pdf
Size: 5196 bytes
Desc: not available
URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140310="" b3021f2f="" attachment.pdf="">
Dear Jose,
Thanks a lot for your detailed report!
The cases that you mention are "normal" cases that are originated from
our exon bining, these are instances where the annotation file
contains
cases like the one you just describe (having a transcript initiation
site or alternative splice site with single base pair differences).
These are exon bins covering a single base pair (the length of the
exon
bins is fData(ecs)$end - fData(ecs)$start + 1).
The reason you can get counts and can achieve significance for DEU for
small exon bins is because the reads mapping to multiple exon bins are
counted in all the exon bins, so these are cases where the read maps
to
this small exon bins and many other regions of the genome. I think
having differences in these exons can still biologically informative
despite its length, but it always depends on the analysis you are
doing. Say, if you are interested in large cassette skipping events
you
could get rid of this small bins and reduce the number of tests.
Best regards,
Alejandro
> Dear Alejandro,
> I have just analysed my results using DEXSeq 1.9.
> I had prepared the gff annotation using the version of
> dexseq_prepare.annotation.py <http: dexseq_prepare.annotation.py="">
in
> DEXSeq 1.9 that contained the -r parameter but counted with an old
> version, as you advised to me, of the dexseq_count.py since the
newest
> included a check for a NH tag, which my bam files (SOAPSplice) did
not
> contain. Everything worked fine but when checking some of the exons
> I realised that there are some, and some amongst the DEU results
too,
> that showed length equal zero. I have checked and there are a number
> of those amongst the GFF. I have checked in the original gtf file I
> got from Ensembl (v72 hg19) for one of those genes and I do not
> understand why there should be one of length 0.
> Here you are a histogram of all exons showing very short exons and 0
> length ones and an example of the gtf and gff file prepared by
> dexseq_prepare_annotation.py.
> Is it normal? How can there be exons with the same start and end in
> the gff file? How can they have reads assigned and even pass the
tests?
>
> # exons length
> exon.length<-fData(ecs)$end - fData(ecs)$start
> hist(log2(exon.length),breaks=100)
>
> The gff file 'grepped' for one example:
>
> $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff
>
> chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.gen
e_id
> "ENSG00000001460"
>
> *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.t
ranscripts
> "ENST00000374409"; exonic_part_number "001"; gene_id
"ENSG00000001460"*
>
> chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.transc
ripts
> "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id
> "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.transc
ripts
> "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409";
> exonic_part_number "003"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.transc
ripts
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS
T00000337248";
> exonic_part_number "004"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.transc
ripts
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS
T00000337248";
> exonic_part_number "005"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.transc
ripts
> "ENST00000497384"; exonic_part_number "006"; gene_id
"ENSG00000001460"
>
> Now the gtf used as input for dexseq_prepare_annotation.py for that
> gene searching for the start and end coordinate:
>
> grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep
'24683489'
>
> chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id
> "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name
"STPG1";
>
> It seems that the gtf is OK and there is an exon that starts at that
> position but nothing at the end.
>
> I checked also the ecs object coming from an older analysis of the
> same dataset using the two py scripts from the same release that
gave
> me a good results but I wanted to repeat to include the -r
parameter,
> and it also contains a lot of these zero length exons.
>
> Is it normal?
>
> I have further checked within the Genome Browser and it seems that
> there is another Transcript with start at +1 with respect to that.
>
> If this is so it means that zero-length in the end-start difference
> from the gff file that I found in the fData of the ecs object is
> indeed 1bp length and just a matter of 0/1 conventions. If this is
so,
> I would be more relieved, but please confirm.
>
> In anycase, may I ask how reliable counts of ~75bp read can be to an
> exon made from 1 pb difference in the start of two transcripts in
the
> annotation? how can they even pass the filter of p-value for DEU
> analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons?
>
>
> Thanks again for your help
>
> Best
>
> Jose
>
>
>
>
>
>
>
>
>
> --
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
>
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it
> <mailto:garciamanteiga.josemanuel at="" hsr.it="">
Dear Alejandro,
Thanks for the reply. I will keep them, then. Just to know whether I
understood the implications:
Is it possible that a DEU coming from a 1bp exon bin is resulting from
a
+/-1 different transcripition start (if first exon) or a +/-1
alternative
splicing that could produce a change in the ORF of the translated
protein
(same for non 3 multiples +/-)?? If so, do you have any evidence that
these
are the typical cases? It could turn extremely useful information if
the
DEU for these exons is trustable to identify these rare events with
'consequences' that could have a deep impact in biology.
Thanks again
Best
Jose
2014-03-10 15:57 GMT+01:00 Alejandro Reyes <alejandro.reyes@embl.de>:
> Dear Jose,
>
> Thanks a lot for your detailed report!
>
> The cases that you mention are "normal" cases that are originated
from
> our exon bining, these are instances where the annotation file
contains
> cases like the one you just describe (having a transcript initiation
> site or alternative splice site with single base pair differences).
> These are exon bins covering a single base pair (the length of the
exon
> bins is fData(ecs)$end - fData(ecs)$start + 1).
>
> The reason you can get counts and can achieve significance for DEU
for
> small exon bins is because the reads mapping to multiple exon bins
are
> counted in all the exon bins, so these are cases where the read maps
to
> this small exon bins and many other regions of the genome. I think
> having differences in these exons can still biologically informative
> despite its length, but it always depends on the analysis you are
> doing. Say, if you are interested in large cassette skipping events
you
> could get rid of this small bins and reduce the number of tests.
>
> Best regards,
> Alejandro
>
>
>
> > Dear Alejandro,
> > I have just analysed my results using DEXSeq 1.9.
> > I had prepared the gff annotation using the version of
> > dexseq_prepare.annotation.py <http: dexseq_prepare.annotation.py="">
in
> > DEXSeq 1.9 that contained the -r parameter but counted with an old
> > version, as you advised to me, of the dexseq_count.py since the
newest
> > included a check for a NH tag, which my bam files (SOAPSplice) did
not
> > contain. Everything worked fine but when checking some of the
exons
> > I realised that there are some, and some amongst the DEU results
too,
> > that showed length equal zero. I have checked and there are a
number
> > of those amongst the GFF. I have checked in the original gtf file
I
> > got from Ensembl (v72 hg19) for one of those genes and I do not
> > understand why there should be one of length 0.
> > Here you are a histogram of all exons showing very short exons and
0
> > length ones and an example of the gtf and gff file prepared by
> > dexseq_prepare_annotation.py.
> > Is it normal? How can there be exons with the same start and end
in
> > the gff file? How can they have reads assigned and even pass the
tests?
> >
> > # exons length
> > exon.length<-fData(ecs)$end - fData(ecs)$start
> > hist(log2(exon.length),breaks=100)
> >
> > The gff file 'grepped' for one example:
> >
> > $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff
> >
> > chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.g
ene_id
> > "ENSG00000001460"
> >
> >
> *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.t
ranscripts
> > "ENST00000374409"; exonic_part_number "001"; gene_id
"ENSG00000001460"*
> >
> > chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.tran
scripts
> > "ENST00000440416+ENST00000374409"; exonic_part_number "002";
gene_id
> > "ENSG00000001460"
> >
> > chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.tran
scripts
> > "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409";
> > exonic_part_number "003"; gene_id "ENSG00000001460"
> >
> > chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.tran
scripts
> >
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS
T00000337248";
> > exonic_part_number "004"; gene_id "ENSG00000001460"
> >
> > chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.tran
scripts
> >
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS
T00000337248";
> > exonic_part_number "005"; gene_id "ENSG00000001460"
> >
> > chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.tran
scripts
> > "ENST00000497384"; exonic_part_number "006"; gene_id
"ENSG00000001460"
> >
> > Now the gtf used as input for dexseq_prepare_annotation.py for
that
> > gene searching for the start and end coordinate:
> >
> > grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep
'24683489'
> >
> > chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id
> > "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name
"STPG1";
> >
> > It seems that the gtf is OK and there is an exon that starts at
that
> > position but nothing at the end.
> >
> > I checked also the ecs object coming from an older analysis of the
> > same dataset using the two py scripts from the same release that
gave
> > me a good results but I wanted to repeat to include the -r
parameter,
> > and it also contains a lot of these zero length exons.
> >
> > Is it normal?
> >
> > I have further checked within the Genome Browser and it seems that
> > there is another Transcript with start at +1 with respect to that.
> >
> > If this is so it means that zero-length in the end-start
difference
> > from the gff file that I found in the fData of the ecs object is
> > indeed 1bp length and just a matter of 0/1 conventions. If this is
so,
> > I would be more relieved, but please confirm.
> >
> > In anycase, may I ask how reliable counts of ~75bp read can be to
an
> > exon made from 1 pb difference in the start of two transcripts in
the
> > annotation? how can they even pass the filter of p-value for DEU
> > analysis? Shouldn't it be a sort of cut off for this
'pseudo'exons?
> >
> >
> > Thanks again for your help
> >
> > Best
> >
> > Jose
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Jose M. Garcia Manteiga PhD
> > Research Assistant - Data Analysis in Functional Genomics
> > Center for Translational Genomics and BioInformatics
> > Dibit2-Basilica, 3A3
> > San Raffaele Scientific Institute
> > Via Olgettina 58, 20132 Milano (MI), Italy
> >
> > Tel: +39-02-2643-4751
> > e-mail: garciamanteiga.josemanuel@hsr.it
> > <mailto:garciamanteiga.josemanuel@hsr.it>
>
>
--
Jose M. Garcia Manteiga PhD
Research Assistant - Data Analysis in Functional Genomics
Center for Translational Genomics and BioInformatics
Dibit2-Basilica, 3A3
San Raffaele Scientific Institute
Via Olgettina 58, 20132 Milano (MI), Italy
Tel: +39-02-2643-4751
e-mail: garciamanteiga.josemanuel@hsr.it
[[alternative HTML version deleted]]