I am trying to count reads over the set of all RepeatMasker-annotated regions in UCSC hg19, mainly to determine whether our RNA-seq data are contaminated by ribosomal or other repetitive sequences. However, featureCounts runs extremely slowly on this annotation. With only two normal-sized RNA-seq samples, featureCounts was still running after 16 hours, at which point I aborted it. In contrast, when I run featureCounts on the set of all human exons grouped by gene (a similar-sized annotation), it works fine. It also works fine on a subset of only 1/500th of the features from my annotation. But if I increase that to 1/50th, featureCounts pauses for several minutes after the counting has apparently finished. I'm now running it on 1/10th of my annotation, and it's still running after one hour (EDIT: Final time, 88 minutes!). These results are summarized in this gist:
Despite having roughly the same number of features or fewer, and fewer meta-features, than the set of all human exons grouped by gene, these test sets take significantly longer, and the extra time is apparently spent after all the read counting is already done. Is there something about my annotation that is triggering a weird edge case in featureCounts? I'm not sure what to do about this.
I can provide my annotations and BAM files if necessary.
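For anyone who wants to reproduce the scaling behaviour, here is a minimal sketch of one way to draw a 1-in-N subset of features from an annotation. It assumes the annotation is a tab-delimited SAF file with a header line (GeneID, Chr, Start, End, Strand); the function name `subsample_saf` and the example rows are hypothetical, and this is not necessarily how the subsets described above were produced.

```python
import io

def subsample_saf(lines, keep_every):
    """Yield the header line, then every `keep_every`-th feature row.

    Assumes a SAF-style annotation: one header line followed by one
    tab-delimited feature per line (GeneID, Chr, Start, End, Strand).
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input, nothing to yield
    yield header
    for i, line in enumerate(it):
        if i % keep_every == 0:
            yield line

# Made-up example rows, only to show the expected shape of the input.
example = io.StringIO(
    "GeneID\tChr\tStart\tEnd\tStrand\n"
    "rep_a\tchr1\t100\t200\t+\n"
    "rep_a\tchr1\t300\t400\t+\n"
    "rep_b\tchr2\t500\t600\t-\n"
    "rep_c\tchr3\t700\t800\t+\n"
)

# Keep every 2nd feature (rows 0 and 2 after the header).
subset = list(subsample_saf(example, keep_every=2))
print("".join(subset), end="")
```

With a real file, the same generator can be fed an open file handle and the result written out with `writelines`. Note that sampling by feature row, as here, can split a meta-feature across kept and dropped rows; sampling whole GeneID groups instead would be a different (and possibly more informative) test.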
Ryan, did you also try RNA-SeQC ( https://www.broadinstitute.org/cancer/cga/rna-seqc )? It does all the estimations (intronic, exonic, intergenic). You can also give it a list of rRNA regions, and it will count reads in those regions as well.