Question

FeatureCounts Output Counts at Exon Level Using Default Settings, want gene level

1

Entering edit mode

CDSPARKS ▴ 30

@cc7ab61c

Last seen 5 months ago

United States

Hello all,

From my understanding of the featureCounts manual it should, by default, count reads that align to the features (exons) of a meta-feature (gene). However my output file is at the exon level (sorry for the line formatting in the screenshot):

featurecounts Output

I can't tell if this is an issue with featurecounts, or my understanding of how my command should be formatted, or perhaps an issue with the feature IDs in my GTF file.

Here is the code I ran:

featureCounts -T 8 -t exon -g ID -a AmbTr_v01.0.gff.gff3 -o iwgc_counts.txt *.sam

and here is a snippet of my GTF file:

sample of GTF file

Any advice, explanation of what I am missing, or guidance would be greatly appreciated!

GeneExpression • 3.4k views

ADD COMMENT • link 2.0 years ago CDSPARKS ▴ 30

0

Entering edit mode

I don't know if this information will help either troubleshoot the issue or make it more obvious if I am misunderstanding the options, but here is a small update.

I used sed to change "ID" in gene rows to "gene_ID" to distinguish from IDs of other features, so now it looks like: enter image description here

I then re-ran featureCounts with this command:

featureCounts -p -T 32 -t exon -g gene_ID -a AmbTr_geneID_v01.0.gff.gff3 -o feature_counts_newgff.txt *.sam

And from that I get this error: enter image description here

So, with these options should featureCounts be looking for the exon ID at this step or the gene ID?

Am I just massively misunderstanding the command?

I do not think I want to use -t gene because then it will count reads that aligned to introns as well, correct?

How can I get the total counts per gene, but only based on the reads aligned exons?

Please help!

ADD REPLY • link 2.0 years ago CDSPARKS ▴ 30

score 1 · Answer 1 · 2023-04-22

I think I figured it out...so it is looking in the "exon" row of the gtf for the option indicated in -g, but it should reference back to the meta-feature ID, in my case it was "ID" for both and I definitely misunderstood how it would bin into meta-features. I assumed it would count reads based on the exon row but then use the coordinates to go and get the gene_id from the gene row to group into meta-features.

To fix it I once again used sed to change part of the exon row to the actual gene_ID and then used that in the -g option. enter image description here

Then my output is (again sorry for the line formatting): enter image description here

I am still open to any ideas or corrections to this answer if what I am saying is wrong and I just got lucky.

If I am correct in this conclusion, perhaps it could helpful if the manual clearly stated that the meta-feature ID -g is coming from the GTF row that corresponds to the feature specified by -t, rather than getting it from the gene row. Now that I understand what is happening the manual makes more sense, but without this little exercise it was easy to assume/misunderstand. Maybe many people understand immediately but there has to be more people like me that tend to misinterpret manuals that don't explicitly state these types of details.