HT-seq counting - gene vs isoform


Akula, Nirmala NIH/NIMH [C] ▴ 190

@akula-nirmala-nihnimh-c-5007

Last seen 5.3 years ago

Hi,

I am trying to understand the counting done by HTSeq using an Ensembl GTF file. Here is an example:

GeneA_isoform1: Exon1 Exon2 Exon3
GeneA_isoform2: Exon1 Exon3
GeneA_isoform3: Exon1 Exon4

When counting at gene level, I assume the reads that fall on all exons (Exon1, Exon2, Exon3 and Exon4) are summed up for GeneA.

When counting at isoform level:

GeneA_isoform1 - is it the sum of reads from Exon1, Exon2 and Exon3, or just the reads that map to Exon2?
GeneA_isoform2 - is it the sum of Exon1 and Exon3, or no counts because its exons are shared with isoform1 and isoform3?
GeneA_isoform3 - is it the sum of Exon1 and Exon4, or only Exon4?

Thank you very much for the clarification.

Best Regards,
Nirmala


ADD COMMENT • link updated 12.2 years ago by Simon Anders ★ 3.8k • written 12.2 years ago by Akula, Nirmala NIH/NIMH [C] ▴ 190


Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.5 years ago

Zentrum für Molekularbiologie, Universi…

Hi

On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or) just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?

Always the latter. This is why htseq-count is not suitable for counting at the isoform level.

To explain the rationale behind this: htseq-count is meant to be used for differential expression analysis; hence the rule that ambiguous mappings are discarded. Consider two genes that share part of their sequence, one of them being differentially expressed, the other not. If we count reads mapping to the shared part (and hence to both genes), we will wrongly conclude that they are _both_ differentially expressed. If we discard the reads mapping to the shared part, we underestimate both genes' expression, but we do so by the same fraction in all samples, so that any inference about expression changes is still correct.

For counting at gene level, we can afford to discard the rather few reads that map to shared sequence. (With long reads, there are few such stretches longer than the read length, even between paralogs.)

For isoforms, this becomes untenable, and hence any attempt at inferring differential expression at the isoform level is bound to fail if it is based on simple counting. Instead, one should either use some method based on Bayesian inference (e.g. BitSeq) or perform the inference at the exon level (our DEXSeq approach). See our paper for a discussion of why we prefer the latter, and see Glaus et al.'s paper to learn more about the former.

Simon
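To make the rule above concrete, here is a minimal toy sketch in Python. It is not htseq-count's actual implementation: it assumes every read lies entirely within a single exon, and the read numbers, the `__ambiguous` label, the `shared_fraction` value and the helper names are all made up for illustration. It counts the example from the question at gene and at isoform level, then checks the "same fraction in all samples" argument numerically.

```python
from collections import Counter

# Toy annotation taken from the question: which isoforms contain each exon.
EXON_TO_ISOFORMS = {
    "Exon1": {"GeneA_isoform1", "GeneA_isoform2", "GeneA_isoform3"},
    "Exon2": {"GeneA_isoform1"},
    "Exon3": {"GeneA_isoform1", "GeneA_isoform2"},
    "Exon4": {"GeneA_isoform3"},
}


def gene_of(isoform):
    """Hypothetical helper: 'GeneA_isoform1' -> 'GeneA'."""
    return isoform.split("_")[0]


def count_reads(read_exons, level):
    """Count reads per feature, discarding reads that match more than one feature.

    read_exons: list of exon names, one per read (assumption: a read
                falls entirely within one exon).
    level:      "gene" or "isoform".
    """
    counts = Counter()
    for exon in read_exons:
        isoforms = EXON_TO_ISOFORMS[exon]
        if level == "isoform":
            features = isoforms
        else:
            features = {gene_of(i) for i in isoforms}
        if len(features) == 1:
            counts[next(iter(features))] += 1
        else:
            # Ambiguous read: assigned to no feature at all.
            counts["__ambiguous"] += 1
    return counts


# Made-up read numbers per exon.
reads = ["Exon1"] * 50 + ["Exon2"] * 20 + ["Exon3"] * 30 + ["Exon4"] * 10

gene_counts = count_reads(reads, "gene")
isoform_counts = count_reads(reads, "isoform")

# Fraction argument: dropping a fixed share of a gene's reads in every
# sample leaves the ratio between samples unchanged.
shared_fraction = 0.2  # hypothetical share of reads on shared sequence
true_counts = {"control": 1000, "treated": 2000}
kept_counts = {s: n * (1 - shared_fraction) for s, n in true_counts.items()}
fold_change_true = true_counts["treated"] / true_counts["control"]
fold_change_kept = kept_counts["treated"] / kept_counts["control"]
```

In this toy data all 110 reads are counted for GeneA at gene level, while at isoform level only the reads on Exon2 and Exon4 are assigned (to isoform1 and isoform3 respectively), isoform2 gets nothing, and the 80 reads on Exon1 and Exon3 are discarded as ambiguous. The fold change between the two hypothetical samples is the same with and without the discarded reads, which is the point of the argument for gene-level counting.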

ADD COMMENT • link 12.2 years ago Simon Anders ★ 3.8k


Thank you very much for your response Simon.

Best Regards,
Nirmala

ADD REPLY • link 12.2 years ago Akula, Nirmala NIH/NIMH [C] ▴ 190
