Question

To count or not to count multi-overlapping reads?

Arindam ▴ 80

@ag1805x-15215

Last seen 8 weeks ago

University of Eastern Finland

The featureCounts manuscript recommends that multi-overlapping reads not be counted for RNA-seq, and the reasoning seems logical:

We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined. On the other hand, we recommend that multi-overlap reads or fragments are counted for most ChIP-seq experiments because epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.

How does it handle reads that map to a gene that is located in a region that also has another gene but on the alternate strand? If strand information is provided, I understand this should not be an issue. But what about unstranded sequencing?

I was particularly considering the situations as shown in this figure: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-174/figures/1

RNA-Seq featureCounts Rsubread • 634 views

Updated 10 weeks ago by Gordon Smyth 52k • written 11 weeks ago by Arindam ▴ 80

score 0 · Answer 1 · 2025-02-14

How does it handle reads that map to a gene that is located in a region that also has another gene but on the alternate strand?

Overlapping genes are typically on different strands, but that is irrelevant if the sequencing is unstranded, because there is no way to tell from the alignment which strand the read came from. With unstranded sequencing, reads overlapping two genes on different strands will by default not be counted. With stranded sequencing, they will be counted, because each read can be assigned to the gene on its matching strand.
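To make the rule above concrete, here is a minimal Python sketch of the assignment logic. It is a toy model, not featureCounts' actual implementation. The `Gene` and `Read` classes, the `assign` function, and the coordinates are all hypothetical. The `stranded` and `allow_multi_overlap` switches only loosely mirror the `strandSpecific` and `allowMultiOverlap` arguments of `featureCounts` in Rsubread. For simplicity, the read's `strand` field is assumed to already be the strand of the transcript it came from, which glosses over forward versus reverse stranded protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Read:
    start: int
    end: int
    strand: str  # assumed to be the strand of the originating transcript


def assign(read, genes, stranded=False, allow_multi_overlap=False):
    """Return the names of the genes this read would be counted for."""
    hits = [
        g.name
        for g in genes
        # Closed-interval overlap test between read and gene.
        if g.start <= read.end and read.start <= g.end
        # Strand is only used as a filter when the library is stranded.
        and (not stranded or g.strand == read.strand)
    ]
    # Default behaviour in this sketch: an ambiguous read is not counted.
    if len(hits) > 1 and not allow_multi_overlap:
        return []
    return hits


# Two hypothetical genes that overlap on opposite strands.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 400, 800, "-")]
# A read that falls inside the overlapping region, from the "+" strand.
read = Read(420, 470, "+")

unstranded = assign(read, genes, stranded=False)
stranded = assign(read, genes, stranded=True)
multi = assign(read, genes, stranded=False, allow_multi_overlap=True)

print(unstranded)  # []
print(stranded)    # ['geneA']
print(multi)       # ['geneA', 'geneB']
```

In this toy case the unstranded read is dropped because it hits both genes, the stranded read goes to the gene on its own strand, and allowing multi-overlap counts the read once for each gene.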

Overlapping genes most often involve a pseudogene overlapping a protein-coding gene on the other strand. When I use featureCounts for RNA-seq data, I prefer to restrict the gene annotation to curated RefSeq genes. This has the effect of removing computationally predicted genes, most pseudogenes, and most cases of overlaps. If you're interested, you can see my Rsubread SAF files at https://bioinf.wehi.edu.au/Rsubread/annot/.

In our experiments with mouse stranded and unstranded RNA-seq data, we find that an extra 4% of reads can be assigned to genes with stranded instead of unstranded sequencing when using featureCounts with Gencode annotation. When using strict RefSeq annotation, the difference between stranded and unstranded sequencing is 2.4% extra reads assigned.