This is a follow up to Reads mistakenly being assigned to Unassigned_NoFeatures category when using featureCounts where Wei Shi says:
Your command requires each read to 100% overlap with a feature ('--fracOverlap 1') but the reads you provided do not 100% overlap with any feature. These reads have soft-clipped bases ('S' section in CIGAR) or inserted bases ('I' section in CIGAR), making them have the fraction of overlapping bases being less than 1. This is the reason why they were assigned to the "Unassigned_NoFeatures" category.
I need to count reads fully contained in a feature even if they have soft-clipped and inserted bases. Is there a simple solution to make featureCounts include these reads? I'm thinking to hack the bam file to replace cigar strings containing S or I (but not containing N) with a complete match, like:
samtools view in.bam \
| awk -v FS='\t' -v OFS='\t' '{if($6 !~ "N" && $1 !~ "^@" && ($6 ~ "I" || $6 ~ "S")) {$6 = length($10)"M"} print $0}' > out.sam
featureCounts --fracOverlap 1 -O -f -a genes.gtf -o counts -t gene -g gene_id -p out.sam
which is kind of horrible but probably good enough for me... Any better solution? Is there any plan to make featureCounts ignore S and I when computing fracOverlap?
This is with featureCounts 2.0.1
(Apologies if this is not the right place for this question)
featureCounts does not consider soft-clipped bases when assigning reads to features because it is unknown where these bases map to. Similarly, you cannot know if a read is fully contained in a feature or not if it has soft-clipped bases included in its mapping result. But it seems you assumed that the soft-clipped bases are mapped along with the rest of the read? If that is the case, these bases would not be soft-clipped at the first place.
Anyway, you can try to modify the bam file to mark the soft-clipped bases as matched so as to make featureCounts consider these bases during its assignment. However, you will also need to modify the mapping start position of those reads that have their first matched base changed due to the change you made to their CIGAR strings. Otherwise, those reads may not be counted properly.