Question

Rsubread::featureCounts extremely slow on some annotations

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

I am trying to do read counting on the set of all RepeatMasker annotated regions in UCSC hg19, mainly in order to determine whether we have any contamination by ribosomal or other repetitive sequences in our RNA-seq data. However, featureCounts is running extremely slowly on this dataset. Specifically, with only two normal-sized RNA-seq samples, featureCounts was still running after 16 hours, at which point I aborted it. In contrast, when I run featureCounts on the set of all human exons grouped by gene (a similar-sized annotation), it works fine. And it works fine when I run it on a subset of only 1/500th of the features from my annotation. But if I increase that to 1/50th, featureCounts pauses for several minutes after being apparently finished. I'm now running it on 1/10th of my annotation, and it's still running after one hour (EDIT: Final time, 88 minutes!). These results are summarized in this gist:

Despite having roughly the same or fewer features and fewer meta-features than the set of all human exons grouped by genes, These test sets are taking significantly longer, and the extra time happens apparently after all the read counting is already done. Is there something about my annotation that is triggering a weird edge case in featureCounts? I'm not sure what to do about this.

I can provide my annotations and bam files if necessary.

rsubread featurecounts • 7.7k views

ADD COMMENT • link updated 14 months ago by Gordon Smyth 52k • written 10.1 years ago by Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan, did you also try rna-seqc ( https://www.broadinstitute.org/cancer/cga/rna-seqc )? it does all the estimations (intronic, exonic, intergenic). you can also give it a list with rRNA regions and it will count reads in these regions as well.

ADD REPLY • link 9.8 years ago mat.lesche ▴ 110

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

Hi Ryan,

Could you try use the C version of featureCounts program (http://subread.sourceforge.net/) to see if you will still get the same problem? This will help diagnose if the problem was caused by R.

Wei

ADD COMMENT • link 10.1 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Ok, I will do this when I get the chance.

ADD REPLY • link 10.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Here is what I get running featureCounts on the command line (via the system function from an R script):

> message("Running command: ", cmd) Running command: '/gpfs/home/rcthomps/src/subread/subread-1.4.6-p2-source/bin/featureCounts' -F SAF -a 'repeat-groups.saf' -o repeat-counts.txt 'star_results/A1/Aligned.out.bam' > system.time({ + retcode <- system(cmd) + })

========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p2

//========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || S star_results/A1/Aligned.out.bam || || || || Output file : repeat-counts.txt || || Annotations : repeat-groups.saf (SAF) || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : no || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || || \\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\ || || || Load annotation file repeat-groups.saf ... || user system elapsed 25.273 0.772 26.221 > message("Command exited with return code: ", retcode) Command exited with return code: 11 >

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

Is this the complete screen output from featureCounts? It sounds you truncated the screen output.

Could you pls run your command directly on a unix shell, instead of running it via the system function in R? This will tell me if the problem was possibly caused by R. You will need to run multiple files to see if you can reproduce the problem you encountered in R.

ADD COMMENT • link 10.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Yes, that's the complete output. featureCounts crashed after that point and exited with return code 11, as shown. I'll try to reproduce it directly at the command line.

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

# Running on one bam file at command-line: segfault

$ time '/gpfs/home/rcthomps/src/subread/subread-1.4.6-p2-source/bin/featureCounts' -F SAF -a 'repeat-groups.saf' -o repeat-counts.txt 'star_results/A1/Aligned.out.bam'

========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p2

//========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || S star_results/A1/Aligned.out.bam || || || || Output file : repeat-counts.txt || || Annotations : repeat-groups.saf (SAF) || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : no || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || || \\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\ || || || Load annotation file repeat-groups.saf ... || Segmentation fault

real 0m35.218s user 0m34.271s sys 0m0.789s

# Running it again on all BAM files instead of one: same result

$ time '/gpfs/home/rcthomps/src/subread/subread-1.4.6-p2-source/bin/featureCounts' -F SAF -a 'repeat-groups.saf' -o repeat-counts.txt star_results/*/Aligned.out.bam

========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p2

//========================== featureCounts setting ===========================\\ || || || Input files : 15 BAM files || || S star_results/A1/Aligned.out.bam || || S star_results/A2/Aligned.out.bam || || S star_results/A3/Aligned.out.bam || || S star_results/B1/Aligned.out.bam || || S star_results/B2/Aligned.out.bam || || S star_results/B3/Aligned.out.bam || || S star_results/C1/Aligned.out.bam || || S star_results/C2/Aligned.out.bam || || S star_results/C3/Aligned.out.bam || || S star_results/D1/Aligned.out.bam || || S star_results/D2/Aligned.out.bam || || S star_results/D3/Aligned.out.bam || || S star_results/E1/Aligned.out.bam || || S star_results/E2/Aligned.out.bam || || S star_results/E3/Aligned.out.bam || || || || Output file : repeat-counts.txt || || Annotations : repeat-groups.saf (SAF) || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : no || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || || \\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\ || || || Load annotation file repeat-groups.saf ... || Segmentation fault

real 0m38.198s user 0m34.895s sys 0m0.850s

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

Can you show the first few lines of your SAF annotation file?

ADD COMMENT • link 10.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Here are the first 10 lines (including header):

Chr     Start   End     Strand  GeneID
chr1    184548979       184550051       -       DNA
chr1    2096978 2097155 -       DNA
chr1    44039967        44040368        -       DNA
chr1    77594531        77594679        +       DNA
chr1    148897677       148897879       -       DNA
chr1    232783806       232783900       -       DNA
chr1    2752511 2752735 +       DNA
chr1    5373663 5374024 -       DNA
chr1    9043634 9043987 +       DNA

When I run it on just these 10 lines as the annotation, I get the following different error:

$ time '/gpfs/home/rcthomps/src/subread/subread-1.4.6-p2-source/bin/featureCounts' -F SAF -a 'repeat-groups-10.saf' -o repeat-counts.txt 'star_results/A1/Aligned.out.bam'

========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p2

//========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || S star_results/A1/Aligned.out.bam || || || || Output file : repeat-counts.txt || || Annotations : repeat-groups-10.saf (SAF) || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : no || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || || \\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\ || || || Load annotation file repeat-groups-10.saf ... || || Features : 10 || featureCounts: readSummary.c:676: register_reverse_table: Assertion `this_block_min_start <= this_block_max_end' failed. Aborted

real 0m0.004s user 0m0.000s sys 0m0.003s

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

The order of columns should be GeneID, Chr, Start, End and Strand for the C version (SourceForge version) of featureCounts program. The columns are allowed to be in any order for featureCounts R function.

ADD COMMENT • link 10.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Ok, after correcting the column order, it looks like I am seeing identical behavior to my original report: featureCounts hangs indefinitely after counting the last sample and before reporting "Read assignment finished".

$ date; time nice '/gpfs/home/rcthomps/src/subread/subread-1.4.6-p2-source/bin/featureCounts' -F SAF -a 'repeat-groups.saf' -o repeat-counts.txt 'star_results/A1/Aligned.out.bam' Wed Apr 29 14:49:09 PDT 2015

========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p2

//========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || S star_results/A1/Aligned.out.bam || || || || Output file : repeat-counts.txt || || Annotations : repeat-groups.saf (SAF) || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : no || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || || \\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\ || || || Load annotation file repeat-groups.saf ... || || Features : 5298130 || || Meta-features : 21 || || Chromosomes : 93 || || || || Process BAM file star_results/A1/Aligned.out.bam... || || Single-end reads are included. || || Assign reads to features... || || Total reads : 35134460 || || Successfully assigned reads : 17361583 (49.4%) || || Running time : 1.37 minutes || || ||
[ featureCounts hangs indefinitely using 100% of 1 CPU ]

My guess is that it has something to do with a large number of features in a small number of meta-features.

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 days ago

Australia/Melbourne

Hi Ryan, thanks for running the command with corrected annotation. Although it is possible that there is a bug in featureCounts that resulted in the problem you encountered, it is also possible that there was a I/O problem with your file system. Is it possible that you can run your command on a different computer/file volume to see if you can reproduce the problem? Or you may send us your annotation file and bam file to let us take a look.

Wei

ADD COMMENT • link 10.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

There's no I/O problem, because using a normal annotation such as all exons grouped by gene from TxDb.Hsapiens.UCSC.hg19.knownGene on the same server works just fine, despite having a similar number of total features to my annotation. When I get to work I'll send you the annotation and a sample bam file that causes the issue.

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

wangjiawen • 0

@wangjiawen-23168

Last seen 4.4 years ago

I met this problem too, when I used the following genome and it's annotation for my paired-end RNAseq data: ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz

A large number of temp files was generated and it seemed that the featurecounts process was frozen. After a long wait I decided to give up running it in R. I switched to run featurecounts in shell command line then the same problem appeared again. I used other RNAseq datasets then it worked. Finally, I changed the temporary directory, then it ran successfully. So I think perhaps there was a limit of the default temporary directory on my server.

ADD COMMENT • link 4.4 years ago wangjiawen • 0

0

Entering edit mode

Can you provide the command you used? Are you using the latest version of Rsubread? featureCounts uses your current working directory as the temporary directory by default. If the temporary directory you used has an I/O issue, then all your datasets will be affected.

ADD REPLY • link 4.4 years ago Wei Shi ★ 3.6k

0

Entering edit mode

I use Rsubread 2.0.1. This is the code: gene <- featureCounts(BAMFiles, useMetaFeatures=TRUE,annot.ext="genes.gtf",isGTFAnnotationFile=TRUE,countMultiMappingReads=FALSE,allowMultiOverlap=TRUE,isPairedEnd=TRUE,requireBothEndsMapped=TRUE,countChimericFragments=FALSE,nthreads=30,tmpDir = "/home/wangjw/data/work/tmp"). I set tmpDir to a customized directory. This is the second time I met this problem. I thinks it's a server-specfic and datasets-specific issue.

ADD REPLY • link 4.4 years ago wangjiawen • 0

0

Entering edit mode

FeatureCounts creates temporary files if there are very many reads in the BAM file having no mapped mates around it in the BAM file. It saves these reads into the temporary files in case the mapped mate is seen later in the BAM file.

Although it takes some time to write the reads into the temporary files, the speed should still be fairly fast, like using a few minutes to count 10 million reads. How many reads are in the BAM file? How long did it take before you terminated the task?

Also, if the working directory and/or the temporary file directory is on a network file system (NFS, Samba, etc), it may take much longer because the network file system needs to synchronize the status of files bewteen servers after each writting.

ADD REPLY • link 4.4 years ago Yang Liao ▴ 450

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

This is currently happening to me now, I am using featurecounts over a cluster and it has made dozens of temp files and has been processing this one .bam file for ~10 hours, while the other .bam files are processed in a few minutes. I will try a different human annotation...

ADD REPLY • link 4.4 years ago Connor • 0

0

Entering edit mode

10 hours is indeed too slow. Is it possible if you can share the BAM file that caused the problem with us?

ADD REPLY • link 4.4 years ago Yang Liao ▴ 450

0

Entering edit mode

difraiadomenico • 0

@739ddafe

Last seen 14 months ago

United States

Same issue here, I am running lots of different RNA-Seq counts on many different organisms. Somehow featureCounts has some problem with some specific .bam files. It is not the annotation since here I am using same annotation, on different .bam files so I guess it is really here the problem. I am still working for a solution

ADD COMMENT • link 14 months ago difraiadomenico • 0

1

Entering edit mode

Adding a answer (that isn't really an answer) to a 9-year-old question isn't the way to get help!

Your problem is clearly quite different to the problem discussed here 9 years ago, because the old problem was associated with a specific annotation whereas you say that your problem isn't caused by annotation. The featureCounts authors will try to help you, but you need to post a question of your own and you need to give more specific information about what the issue is. What exactly is the problem that you are observing and what is unusual about the bam files that it is occuring on? featureCounts is a heavily used program and it is almost always very fast, so there must be something very unusual going on with your files. Have you done any trouble-shooting so far to, for example, rule out problems with your own computing system such as running out of disk space?

Have a read of the posting guide at https://bioconductor.org/help/support/posting-guide/.

ADD REPLY • link 14 months ago Gordon Smyth 52k

score 4 · Accepted Answer · 2015-06-22

Hi Ryan, sorry for my slow response. We analyzed the data you sent to us and found that the long running time of featureCounts was caused by the excessive number of exons in some of the genes included in your annotation. featureCounts concatenates all the exon coordinates into a string before it returns the counting results. Some genes in your annotation contains more than one million exons, causing extremely long running time for the string concatenation.

We have improved the method for concatenating the exon coordinates and now it only takes 2 minutes to complete the read counting for your data.

Latest version can be found in Rsubread 1.19.2. It should be available in a day or two.