Question

featureCounts creating fragments and high NoFeatures value

0

Entering edit mode

gregory.l.stone ▴ 10

@gregorylstone-12225

Last seen 6.5 years ago

I apologize in advance for my ignorance--this is my first time using both STAR and featureCounts.

my STAR command is STAR --runThreadN 4 --genomeDir $REF --readFilesIn ${FILE1} ${FILE2} --sjdbGTFfile $GTF --outSAMtype BAM Unsorted --outFileNamePrefix ${OUT_STAR}

my featureCounts command is featureCounts -T 2 -p -t exon -g gene_id -a ${GTF} -o counts.txt ${FILE1}

I am using STAR 2.5.2b and featureCounts 1.5.1

I am using human samples and the hg19 gtf annotation file. I ran featureCounts after getting unreliable results from htseq-count. My samples are paired, and featureCounts was appealing after struggling with htseq-count because I read that featureCounts handles paired-end reads very well.

The output from featureCounts is as follows:

Assigned        17798778
Unassigned_Ambiguity    316936
Unassigned_MultiMapping 3099021
Unassigned_NoFeatures   13022732
Unassigned_Unmapped     0
Unassigned_MappingQuality       0
Unassigned_FragmentLength       0
Unassigned_Chimera      0
Unassigned_Secondary    0
Unassigned_Nonjunction 0
Unassigned_Duplicate    0

These values are from 34237467 total fragments, and corresponds to a success rate of 52.0%.

I am confused because my STAR output is as follows:

Number of input reads |       33385184
                      Average input read length |       175
                                    UNIQUE READS:
                   Uniquely mapped reads number |       31138446
                        Uniquely mapped reads % |       93.27%
                          Average mapped length |       174.80
                       Number of splices: Total |       13176899
            Number of splices: Annotated (sjdb) |       12931737
                       Number of splices: GT/AG |       12993469
                       Number of splices: GC/AG |       93376
                       Number of splices: AT/AC |       7197
               Number of splices: Non-canonical |       82857
                      Mismatch rate per base, % |       0.25%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.62
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.33
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1399714
             % of reads mapped to multiple loci |       4.19%
        Number of reads mapped to too many loci |       6341
             % of reads mapped to too many loci |       0.02%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       2.50%
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Prior to running featureCounts I sorted my bam file by position.

I am confused primarily for 2 reasons. 1) why does featureCounts identify more input fragments (34237467) as were mapped by STAR (31138446). 2) Why is featureCounts returning such a high Unassigned_NoFeatures value?

Any help and advice would be invaluable and much appreciated.

Thanks,

Greg

tl;dr featureCounts assigning high number of no features and reading more input fragments than what was produced by STAR alignemnt.

featurecounts bioconductor differential gene expression star • 6.6k views

ADD COMMENT • link updated 8.2 years ago by Wei Shi ★ 3.6k • written 8.2 years ago by gregory.l.stone ▴ 10

score 1 · Answer 1 · 2017-01-31

1

Entering edit mode

WouterDeCoster ▴ 130

@wouterdecoster-10838

Last seen 10 months ago

Antwerp, Belgium

Based on my understanding:

STAR is reporting percentages on the number of reads received as input of STAR (so in the fastq files). In contrast, featureCounts is reporting percentages on the number of alignments, which is, due to multimapping, higher than the number of reads.

For STAR one fastq is multimapping in 20 locations, for featureCounts this read results in 20 multimapping alignments (as such inflating the number of Unassigned_NoFeatures).

ADD COMMENT • link 8.2 years ago WouterDeCoster ▴ 130

0

Entering edit mode

Thank you for your thoughtful response. That definitely makes sense, I should've thought more closely about what the values meant before just comparing the numbers.

ADD REPLY • link 8.2 years ago gregory.l.stone ▴ 10

score 0 · Answer 2 · 2017-01-31

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

As @WouterDeCoster pointed, featureCounts counts alignments not reads. The total number of alignments will be greater than total number of reads if there are reads in your bam file that were reported multiple times.

The reporting of multi-mapping reads may increase the number in every category in featureCounts counting summary including 'Unassigned_NoFeatures' and 'Assigned'. I've typically seen ~50-70 percent of reads assigned to genes for human/mouse RNA-seq data. So your assignment rate of 52% looks OK to me.

ADD COMMENT • link 8.2 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Great, thank you very much for your help

ADD REPLY • link 8.2 years ago gregory.l.stone ▴ 10

0

Entering edit mode

Also you do not need to sort your bam files by position before you feed them to featureCounts. featureCounts accepts both name-sorted and location-sorted bam files. It processes these files in the same speed and returns you identical counting results.

ADD REPLY • link 8.2 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Dear Dr. Shi,

Would you please let me know how I can interpret the 50% of Unassigned_NoFeatures reads? It is very interesting if we can know what they are. Thank you,

ADD REPLY • link 5.4 years ago luongthang1908 • 0

0

Entering edit mode

Those are the reads that do not overlap any exon in any gene. They could originate from introns or from intergenic regions.

ADD REPLY • link 5.4 years ago Wei Shi ★ 3.6k