Hi,
I am trying to understand the counting done by HTseq using Ensemble
GTF file. Here is an example:
GeneA_isoform1: Exon1 Exon2 Exon3
GeneA_isoform2: Exon1 Exon3
GeneA_isoform3:Exon1
Exon4
When counting at gene level, I assume the reads that fall on all exons
(Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
When counting at isoform level,
GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or)
just reads that map to Exon2?
GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because
its exons are common with isoform1 and isoform3?
GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?
Thank you very much for the clarification.
Best Regards,
Nirmala
----------------------------------------------------------------------
--------------------------------------------------------
Contractor
Buiding 35, Room 1A-205
35 Convent Drive,
National Institute of Mental Health/NIH
Bethesda
MD - 20892
Phone# 301-451-4258
[[alternative HTML version deleted]]
Hi
On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all
exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or)
just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because
its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?
Always the latter. This is why htseq-count is not suitable to count at
isoform level.
To explain the rationale behind this:
HTSeq-count is meant to be used for differential expression analysis;
hence the rule that ambiguous mappings are discarded. Consider two
genes
that share part of their sequence, one of them being differentially
expressed, the other not. If we count reads mapping to the shared part
(and hence to both genes), we will wrongly conclude that they are
_both_
differentially expressed. If we discard the reads mapping to the
shared
part, we underestimate both genes' expression but we do so by the same
fraction in all samples so that any inference about expression changes
is still correct.
For counting at gene level, we can afford to discard the rather few
reads that map to shared sequence. (With long reads, there is few such
stretches longer than the read length even between paralogs.) For
isoforms, this becomes untenable, and hence, any attempt of inferring
differential expression at the isoform level is bound to fail if it is
based on simple counting.
Instead, one should either use some method based on Bayesian inference
(e.g. BitSeq) or perform the inference on the exon level (our DEXSeq
approach). See our paper for a discussion why the prefer the latter
and
see Glaus et al.'s paper to learn more about the former.
Simon
Thank you very much for your response Simon.
Best Regards,
Nirmala
----------------------------------------------------------------------
--------------------------------------------------------
Contractor
Buiding 35, Room 1A-205
35 Convent Drive,
National Institute of Mental Health/NIH
Bethesda
MD - 20892
Phone# 301-451-4258
-----Original Message-----
From: Simon Anders [mailto:anders@embl.de]
Sent: Wednesday, December 05, 2012 5:36 AM
To: bioconductor at r-project.org
Subject: Re: [BioC] HT-seq counting - gene vs isoform
Hi
On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all
exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or)
just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because
its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?
Always the latter. This is why htseq-count is not suitable to count at
isoform level.
To explain the rationale behind this:
HTSeq-count is meant to be used for differential expression analysis;
hence the rule that ambiguous mappings are discarded. Consider two
genes that share part of their sequence, one of them being
differentially expressed, the other not. If we count reads mapping to
the shared part (and hence to both genes), we will wrongly conclude
that they are _both_ differentially expressed. If we discard the reads
mapping to the shared part, we underestimate both genes' expression
but we do so by the same fraction in all samples so that any inference
about expression changes is still correct.
For counting at gene level, we can afford to discard the rather few
reads that map to shared sequence. (With long reads, there is few such
stretches longer than the read length even between paralogs.) For
isoforms, this becomes untenable, and hence, any attempt of inferring
differential expression at the isoform level is bound to fail if it is
based on simple counting.
Instead, one should either use some method based on Bayesian inference
(e.g. BitSeq) or perform the inference on the exon level (our DEXSeq
approach). See our paper for a discussion why the prefer the latter
and see Glaus et al.'s paper to learn more about the former.
Simon
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor