Question

Odd behavior of Rsubread::featureCounts when setting `nonOverlap = 0`

0

Entering edit mode

isaac.vock • 0

@45e1494b

Last seen 6 months ago

United States

Hello,

I am running into some perplexing behavior when using featureCounts. I would like to assign reads only to features with which they completely overlap (similar to HTSeq's intersection-strict set operation, and a functionality requested in a previous post). From reading the featureCounts documentation, I suspected that this would be accomplished by setting the nonOverlap parameter, described as "integer giving the maximum number of bases in a read (or a read pair) that are allowed not to overlap the assigned feature." to 0. For the rest of the post, I will be discussing the results when using the following call to featureCounts (where the goal was to assign reads to a gene with which it completely overlapped; the problems discussed here were encountered with 3 different calls to featureCounts with different feature assignments, namely assigning reads to genes, purely exonic regions, and purely exonic regions but assigning to the set of isoforms with which the read was consistent):

featureCounts -T 4 -s 1 -a Hs_genome_chr.gtf -R CORE -p --countReadPairs -g gene_id -t transcript --nonOverlap 0 --primary -o testout.featureCounts <input bam file>

While this worked seemingly as expected on several datasets, I ran into problem when assigning reads from a paired-end library with very small insert sizes and thus lots of instances of reads in a pair overlapping one another. In particular, the vast majority (> 90%) of read pairs from said library were not assigned, with featureCounts listing the reason for failed assignment being "Unassigned_Overlapping_Length". Increasing nonOverlap to 1 causes a small percentage of previously unassigned reads to be successfully assigned, and increasing it to 5 causes the majority of reads (~70%) to be successfully assigned. Below I have some IGV screenshots of examples, in all cases instances where the read completely and unambiguously overlaps an annotated gene that it should be assigned to:

Instances where read is assigned in all cases

Read assigned to gene as expected with nonOverlap = 0

Another read assigned to gene as expected with nonOverlap = 0

Instances where read is not assigned with nonOverlap = 0 but is assigned with nonOverlap = 1

Read assigned to gene as expected with nonOverlap >= 1

Instances where read is not assigned with nonOverlap = 0 and nonOverlap = 1, but is assigned with nonOverlap = 5

Read assigned to gene as expected with nonOverlap = 5

I have set up a Github repository with downsampled bam files containing reads that either 1) were assigned properly from the beginning, 2) were unassigned when nonOverlap = 0, 3) that were unassigned when nonOverlap=0 but assigned when nonOverlap=5 and 4) were unassigned when nonOverlap = 0 or nonOverlap = 1 but that were assigned when nonOverlap = 5.

The reads come from a stranded (FR) paired-end fastq file that was trimmed with fastp and aligned to the human genome with STAR. I am happy to provide any additional details.

Any explanation for why reads that clearly completely overlap a feature are not assigned when nonOverlap = 0 or 1 would be greatly appreciated!

Best, Isaac

featureCounts Rsubread • 1.2k views

ADD COMMENT • link 13 months ago • updated 10 months ago isaac.vock • 0

0

Entering edit mode

This might be caused by your use of transcripts (-t transcript) instead of exons for comparing read coordinates against feature coordinates. Can you try to use exons instead (this normally should be -t exon for most GTF files) to see if you still have the same issue?

You may also try to remove -s 1 and add -O to rule out the possibility that this is caused by stranded counting or multi-overlapping (ie a read overlaps more than one gene).

ADD REPLY • link 12 months ago Wei Shi ★ 3.6k

0

Entering edit mode

Thank you for your response. I can confirm that setting -t exon and/or removing -s 1 and adding -O has no impact on the number of successfully assigned reads.

Best, Isaac

ADD REPLY • link 12 months ago isaac.vock • 0

score 2 · Accepted Answer · 2024-06-11

2

Entering edit mode

Yang Liao ▴ 450

@yang-liao-6075

Last seen 18 days ago

Australia

We looked at the read alignments in the BAM files. We noticed that all alignments of read-pairs have soft-clipped bases in one end or in both ends.

When you specify nonOverlap=0, featureCounts requires all bases in the read-pair to overlap with the feature. But a soft-clipped base does not overlap with the feature, hence preventing the read-pair from being assigned.

Similarly, if the read-pair has 3 bases soft-clipped, and the other bases all overlap with the feature, this read-pair is only assigned to the feature if nonOverlap has a value >= 3.

IGV does not show soft-clipped bases. But if you move the mouse pointer over the read in IGV, you can see the CIGAR string of the read, and the CIGAR string includes "S" segments, which denote soft-clipped bases.

ADD COMMENT • link 10 months ago Yang Liao ▴ 450

0

Entering edit mode

Thank you for looking into this and for your explanation!

As a follow-up, would it be possible in future releases of featureCounts to provide an option to not count soft-clipped bases as non-overlapping? I would like to be able to use featureCounts to count reads overlapping exclusively exonic regions, and to use the exonic counts and total gene counts to infer the intronic counts. Setting nonOverlap=0 when assigning reads to exons is the only way I could think of to ensure that reads from pre-mRNA are not counted towards the exonic read count, but if soft-clipped bases count as non-overlapping, that complicates this strategy.

ADD REPLY • link 10 months ago isaac.vock • 0