Question

using featureCounts (Subread) to count only spliced reads per gene - behavior of --splitOnly with PE data?

0

Entering edit mode

Gregory • 0

@20f4c74f

Last seen 4 months ago

United States

This technically pertains to the command line featureCounts program in subread, but the subread webpage suggests posting here, and I'm assuming the bioconductor version has the same functionality.

I'm trying to count only fragments that span splice junctions (per gene as the meta feature) in featureCounts using the --splitOnly option. The documentation suggests that this will capture reads with an 'N' in the cigar string, or individual reads that have split alignments.

However, I am using paired-end data, and I'd like to make sure that I capture fragments (also using the -p option) where the individual mates are not split, but the whole fragment aligns to more than one exon. It's possible I missed it, but I couldn't find anything in the documentation to suggest what the program does in these cases. I checked some of the counts against splice junction counts from STAR, and in a couple of cases it did appear that STAR was generating substantially higher count numbers. I realize that there is a separate option to count reads/fragments spanning individual splice junctions, but I'm not looking for that granularity here.

Thanks so much for any guidance!

Rsubread • 1.8k views

ADD COMMENT • link 15 months ago Gregory • 0

0

Entering edit mode

The splitOnly option is designed to strictly return split reads, i.e., reads that identify exon junctions rather than simply read-pairs overlapping more than one exon.

I'm not completely sure what output you are wanting from featureCounts. Do you want the rows of the matrix to be genes or exons or exon-exon junctions? To take a specific example, suppose a read-pair has one mate aligned to exon1 of a gene and the second mate aligned to exon3. What feature do you want that read pair to be assigned to?

ADD REPLY • link 15 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the prompt response, Gordon.

I'm looking for count at the gene level - the gene id is the meta feature, so one row per gene in the output. At this point I don't need to retain which reads map to which features within a gene id. I would like that count to include all fragments with evidence of splicing. Your example of a mate pair with one mate aligned to exon1 and the other aligned to exon 3 does describe the other case that I had in mind for counting, but I see the problem; the insert could span an unspliced intron. These are Illumina PE 150bp libraries with maybe 300 - 500bp fragments, so the introns would need to be pretty short for this to occur. But still a problem I think.

I'd guess that the method used with the --splitOnly option provides an unbiased count of splicing events with respect to intron size, which is probably what I need here. If I did want to leverage the PE data to identify splicing over, e.g., introns that exceed a certain length, is there a way to count fragments where the mates map to two separate exons?

ADD REPLY • link 15 months ago Gregory • 0