Rsubread requests longread mode for shotgun RNAseq data
1
0
Entering edit mode
@mertes-11992
Last seen 21 months ago
Germany

Dear Rsubread developers,

I wanted to use the Rsubread package to count features on shotgun RNA-seq data. But when counting, it fails due too long alignments and requests the longread mode which changes runtime from 10 minutes to 30 hours and from multi core to single core! How can we overcome the problem? I also could not find any problematic read within the bam file.

The expected behavior is that it counts in multi core mode in around 10 minutes since the used bam file is a shotgun RNA-seq sample with 76bp read length.

And it works on other local sequencing data with 150bp read length with not problems.

Thanks in advance for any help. Best regards,

Christian Mertes

> featureCounts(file="HG00356.2.M_111215_6.bam", annot.ext=ranges, isLongRead=FALSE)

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.34.7

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P HG00356.2.M_111215_6.bam                       ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid36210 ...         ||
||    Features : 3                                                            ||
||    Meta-features : 3                                                       ||
||    Chromosomes/contigs : 2                                                 ||
||                                                                            ||
|| Process BAM file HG00356.2.M_111215_6.bam...                               ||
||    Paired-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||                                                                            ||
||    ERROR: Alignment record is too long.                                    ||
||           Please use the long read mode.                                   ||

FATAL Error: The program has to terminate and no counting file is generated.

Error in featureCounts(file = "HG00356.2.M_111215_6.bam", annot.ext = ranges,  : 
  No counts were generated.

Here is an example showing the problem:

# load libraries
library(Rsamtools)
library(Rsubread)

# get example data
download.file(method = "wget", extra="--continue -nv",
    "https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/HG00356.2.M_111215_6.bam",
    "HG00356.2.M_111215_6.bam")

# index file
indexBam("HG00356.2.M_111215_6.bam")

# ranges to count
ranges <- data.frame(
    GeneID=c("1", "2", "3"),
    Chr=c("chr3", "chr3", "chr19"),
    Start= c(119349459, 119380675, 7126379),
    End = c(119349460, 119380676, 7126380),
    Strand=c("*", "*", "*")
)

# count features
featureCounts(file="HG00356.2.M_111215_6.bam", annot.ext=ranges, 
        isLongRead=FALSE)

And my sessionInfo is:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.7 (Nitrogen)

Matrix products: default
BLAS:   /data/nasif12/modules_if12/SL7/i12g/R/3.6.0-Bioc3.9/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.6.0-Bioc3.9/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] Rsubread_1.34.7      Rsamtools_2.0.1      Biostrings_2.52.0    XVector_0.24.0       GenomicRanges_1.36.1
[6] GenomeInfoDb_1.20.0  IRanges_2.18.2       S4Vectors_0.22.1     BiocGenerics_0.30.0 

loaded via a namespace (and not attached):
[1] packrat_0.5.0          bitops_1.0-6           zlibbioc_1.30.0        BiocParallel_1.18.1   
[5] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0         GenomeInfoDbData_1.2.1
Rsubread counting RNA-seq • 2.6k views
ADD COMMENT
0
Entering edit mode

Hi Christian, Can you share your BAM file, please? I hope to look into the BAM file to see what was the cause of the error message. You can send the link to liao (AT) wehi.edu.au Thanks. Cheers, Yang

ADD REPLY
0
Entering edit mode

Please check my example. It contains the link to the file. And thanks in advance!

ADD REPLY
2
Entering edit mode
Yang Liao ▴ 450
@yang-liao-6075
Last seen 6 hours ago
Australia

Hi Christian,

I received the BAM file that you provided. Although the read sequences are short (76bp or so), there are some very long optional fields in many alignments. For example, read "HWI-ST700660:106:C0APYACXX:6:2206:5447:137301" has an optional field, "XA:Z", longer than 100,000 bytes. It is legitimate to have such a long additional field in the BAM file, although featureCounts cannot process it on the multi-threaded mode.

Because this optional field doesn't have any effect on read assignment, a quick solution is using samtools to remove this field:

$ samtools view -h HG00356.2.M_111215_6.bam | sed 's/\tXA\:Z\:[^\t]*//' | samtools view -Sb - > HG00356.2.M_111215_6-reduced.bam

Then you can assign HG00356.2.M_111215_6-reduced.bam using multiple threads; featureCounts should finish very quickly.

Meanwhile we are going to fix this problem and we hope to have featureCounts running fast on such BAM files in the next Rsubread release.

BTW, even on the long-read mode, featureCounts still finished in just 1.77 minutes. It did't use 30 hours to finish:

> f<-featureCounts(file="02OCT2019-BioCon-125078-HG00356.2.M_111215_6.bam", annot.ext=ranges, isLongRead=TRUE)

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.34.7

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o 02OCT2019-BioCon-125078-HG00356.2.M_111215 ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||          Long read mode : yes                                              ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid65035 ...         ||
||    Features : 3                                                            ||
||    Meta-features : 3                                                       ||
||    Chromosomes/contigs : 2                                                 ||
||                                                                            ||
|| Process BAM file 02OCT2019-BioCon-125078-HG00356.2.M_111215_6.bam...       ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 74530006                                             ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 1.77 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\============================================================================//

My computer is an Xeon computer running Centos 6.4. Did you run the program in a real computer or a virtual machine hosted in a real computer? How much memory was available when you ran the program?

Cheers, Yang

ADD COMMENT
0
Entering edit mode

Thanks for the answer!

1) Yes the samtools quick fix helped.

2) Regarding the runtime: I can still reproduce the 30h runtime on the current R version, which is compiled with a local 4.9.4 GCC library on a Scientific Linux 7.7 version. But on the stock GCC libraries and current compiled R version, the counting is done in a few minutes. So I guess its a problem with the local GCC libraries. I was running it on a normal server with 40 cores and 250 Gb memory and a local copy of the BAM file.

So problem solved for now.

ADD REPLY
0
Entering edit mode

Glad to know that the quick fix works for you.

ADD REPLY
0
Entering edit mode

Hi Yang, probably related to this issue. I'm trying to use the featureCounts function in a GTEx v8 sample (e.g. GTEX-1B933-0011-R3b-SM-7P8PL) and I get the following error:

Cannot parse the input BAM file. If the BAM file contains long reads, please run featureCounts on the long-read mode.
No counts were generated.

When I change to isLongRead = TRUE, then I get an error saying that the long read option only works on single-end data, but GTEx is paired-end. I'm using: R 4.1.0 and Rsubread 2.6.1.

Have you had any experience with this? Thanks in advance.

ADD REPLY
0
Entering edit mode

Hi Vicente, the message you had about the long-read mode is unlikely to be relevent to the report from 2 years ago. Because GTEx portal doesn't provide the BAM files to the public, I cannot find what caused the issue.

Can you take a look at the reads in the BAM file, say, how long is the longest read? Are there very many or very long extra tags in the BAM file? Also, featureCounts will report this error message when the BAM file is truncated. Can you use samtools to see if the BAM file is complete (ie see if the samtools view command reports warning messages on the BAM file)?

ADD REPLY

Login before adding your answer.

Traffic: 598 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6