Question

featureCount - very less Successfully assigned reads

0

Entering edit mode

amoltej ▴ 10

@amoltej-7192

Last seen 7.6 years ago

Australia

hello all,

I am using featurecount for differential expression analysis. After running feature count I found out there are very less number of reads assigned successfully (33%). where as my SAM file (aligned by STAR) showing 82% mapped reads. I tried both counting by exon and gene feature. Featurecount code that I have used -

featureCounts -M -p -O -t exon -a /home/amol/work/Cuffcompare/cuffcmp.combined.gtf -o Bcyt.count3 /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam

Summery of feature count -

Input files : 1 BAM file ||
|| P /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam ||
|| ||
|| Output file : Bcyt.count3 ||
|| Annotations : /home/amol/work/Cuffcompare/cuffcmp.combined ... ||
|| ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : counted ||
|| Multi-overlapping reads : counted ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
|| ||
|| Load annotation file /home/amol/work/Cuffcompare/cuffcmp.combined.gtf ... ||
|| Features : 84237 ||
|| Meta-features : 16590 ||
|| Chromosomes : 4940 ||
|| ||
|| Process BAM file /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| Found reads that are not properly paired. ||
|| (missing mate or the mate is not the next read) ||
|| 2580055 reads have missing mates. ||
|| Input was converted to a format accepted by featureCounts. ||
|| Total fragments : 5314957 ||
|| Successfully assigned fragments : 1771088 (33.3%) ||
|| Running time : 2.54 minutes

can somebody please tell me whats going wrong?

thank you

amol

featurecounts rsubread • 17k views

ADD COMMENT • link updated 10.3 years ago by James W. MacDonald 68k • written 10.3 years ago by amoltej ▴ 10

0

Entering edit mode

Perhaps I'm being pedantic, but this isn't really a Bioconductor question. It seems like you're using the standalone C version of featureCounts, rather than that in Rsubread. A more appropriate forum might be the Subread mailing list at https://groups.google.com/forum/#!forum/subread.

ADD REPLY • link 10.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron for directing me!

But does this really matters because background is same, process is same. and you are understanding why it is happening. whether it is C version or R version how would it be different?

ADD REPLY • link 10.3 years ago amoltej ▴ 10

0

Entering edit mode

From what I understand, there could be some slight differences with versioning and/or default parameter settings between the Rsubread and standalone C implementations. So, it's worthwhile posting to the appropriate forum.

ADD REPLY • link 10.3 years ago Aaron Lun ★ 28k

score 7 · Answer 1 · 2014-12-23

I think you might have a fundamental misunderstanding of what STAR and featureCounts are doing.

The former is aligning reads to the genome, and the latter is counting which of those aligned reads appear to overlap known genes. It's expected that the percentage of reads that align to the genome will be larger than the percentage that align to genes, as genes make up only about 3% of the genome.

In addition, unless you did polyA selection as part of the sample processing, you likely had lots of transcripts that still contained intronic sequences (in our experience, non-polyA selected mRNA tends to be 'polluted' with about as many intronic reads as exonic reads for Ion Proton data, and about 50% more reads for HiSeq data), which featureCounts isn't going to count. As an example here are results from two bamfiles I have in hand (HiSeq data):

Rsubread featureCounts()

|| Process BAM file ../data/batch1/Sample_7749.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 26203397                                              ||
||    Successfully assigned fragments : 15261620 (58.2%)                      ||
||    Running time : 0.88 minutes                                             ||
||                                                                            ||
|| Process BAM file ../data/batch1/Sample_7752.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 18057482                                              ||
||    Successfully assigned fragments : 11065817 (61.3%)                      ||
||    Running time : 0.56 minutes                                             ||

Now checking overlaps for those two using summarizeOverlaps():

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
> gns <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
> library(Rsamtools)
> library(GenomicAlignments)
> bfl <- BamFileList(fls[1:2])
> olaps <- summarizeOverlaps(gns, bfl)
> colSums(assays(olaps)$counts)
Sample_7749.bam Sample_7752.bam
       22143603        15200876       

> 22143603/26203397*100
[1] 84.50661
> 15200876/18057482*100
[1] 84.18049

So I go from about 60% to 84% if I include introns (modulo differences between the genes mapped by TxDb.Mmusculus.UCSC.mm10.knownGene and the mm10 file that comes with Rsubread). But still there are other aligned reads that are not within the known genes I align against, which can be real genes that we don't know about, or reads with mismatches that match perfectly to some other genomic location, or DNA that magically slipped through the mRNA purification step, etc.