Question

Rsubread requests longread mode for shotgun RNAseq data

0

Entering edit mode

Christian Mertes • 0

@mertes-11992

Last seen 2.3 years ago

Germany

Dear Rsubread developers,

I wanted to use the Rsubread package to count features on shotgun RNA-seq data. But when counting, it fails due too long alignments and requests the longread mode which changes runtime from 10 minutes to 30 hours and from multi core to single core! How can we overcome the problem? I also could not find any problematic read within the bam file.

The expected behavior is that it counts in multi core mode in around 10 minutes since the used bam file is a shotgun RNA-seq sample with 76bp read length.

And it works on other local sequencing data with 150bp read length with not problems.

Thanks in advance for any help. Best regards,

Christian Mertes

> featureCounts(file="HG00356.2.M_111215_6.bam", annot.ext=ranges, isLongRead=FALSE)

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.34.7

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P HG00356.2.M_111215_6.bam                       ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid36210 ...         ||
||    Features : 3                                                            ||
||    Meta-features : 3                                                       ||
||    Chromosomes/contigs : 2                                                 ||
||                                                                            ||
|| Process BAM file HG00356.2.M_111215_6.bam...                               ||
||    Paired-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||                                                                            ||
||    ERROR: Alignment record is too long.                                    ||
||           Please use the long read mode.                                   ||

FATAL Error: The program has to terminate and no counting file is generated.

Error in featureCounts(file = "HG00356.2.M_111215_6.bam", annot.ext = ranges,  : 
  No counts were generated.

Here is an example showing the problem:

# load libraries
library(Rsamtools)
library(Rsubread)

# get example data
download.file(method = "wget", extra="--continue -nv",
    "https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/HG00356.2.M_111215_6.bam",
    "HG00356.2.M_111215_6.bam")

# index file
indexBam("HG00356.2.M_111215_6.bam")

# ranges to count
ranges <- data.frame(
    GeneID=c("1", "2", "3"),
    Chr=c("chr3", "chr3", "chr19"),
    Start= c(119349459, 119380675, 7126379),
    End = c(119349460, 119380676, 7126380),
    Strand=c("*", "*", "*")
)

# count features
featureCounts(file="HG00356.2.M_111215_6.bam", annot.ext=ranges, 
        isLongRead=FALSE)

And my sessionInfo is:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.7 (Nitrogen)

Matrix products: default
BLAS:   /data/nasif12/modules_if12/SL7/i12g/R/3.6.0-Bioc3.9/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.6.0-Bioc3.9/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] Rsubread_1.34.7      Rsamtools_2.0.1      Biostrings_2.52.0    XVector_0.24.0       GenomicRanges_1.36.1
[6] GenomeInfoDb_1.20.0  IRanges_2.18.2       S4Vectors_0.22.1     BiocGenerics_0.30.0 

loaded via a namespace (and not attached):
[1] packrat_0.5.0          bitops_1.0-6           zlibbioc_1.30.0        BiocParallel_1.18.1   
[5] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0         GenomeInfoDbData_1.2.1

Rsubread counting RNA-seq • 2.9k views

ADD COMMENT • link updated 3.9 years ago by Yang Liao ▴ 450 • written 5.6 years ago by Christian Mertes • 0

0

Entering edit mode

Hi Christian, Can you share your BAM file, please? I hope to look into the BAM file to see what was the cause of the error message. You can send the link to liao (AT) wehi.edu.au Thanks. Cheers, Yang

ADD REPLY • link 5.6 years ago Yang Liao ▴ 450

0

Entering edit mode

Please check my example. It contains the link to the file. And thanks in advance!

ADD REPLY • link 5.6 years ago Christian Mertes • 0

score 2 · Accepted Answer · 2019-10-01

Hi Christian,

I received the BAM file that you provided. Although the read sequences are short (76bp or so), there are some very long optional fields in many alignments. For example, read "HWI-ST700660:106:C0APYACXX:6:2206:5447:137301" has an optional field, "XA:Z", longer than 100,000 bytes. It is legitimate to have such a long additional field in the BAM file, although featureCounts cannot process it on the multi-threaded mode.

Because this optional field doesn't have any effect on read assignment, a quick solution is using samtools to remove this field:

$ samtools view -h HG00356.2.M_111215_6.bam | sed 's/\tXA\:Z\:[^\t]*//' | samtools view -Sb - > HG00356.2.M_111215_6-reduced.bam

Then you can assign HG00356.2.M_111215_6-reduced.bam using multiple threads; featureCounts should finish very quickly.

Meanwhile we are going to fix this problem and we hope to have featureCounts running fast on such BAM files in the next Rsubread release.

BTW, even on the long-read mode, featureCounts still finished in just 1.77 minutes. It did't use 30 hours to finish:

> f<-featureCounts(file="02OCT2019-BioCon-125078-HG00356.2.M_111215_6.bam", annot.ext=ranges, isLongRead=TRUE)

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.34.7

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o 02OCT2019-BioCon-125078-HG00356.2.M_111215 ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||          Long read mode : yes                                              ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid65035 ...         ||
||    Features : 3                                                            ||
||    Meta-features : 3                                                       ||
||    Chromosomes/contigs : 2                                                 ||
||                                                                            ||
|| Process BAM file 02OCT2019-BioCon-125078-HG00356.2.M_111215_6.bam...       ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 74530006                                             ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 1.77 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\============================================================================//

My computer is an Xeon computer running Centos 6.4. Did you run the program in a real computer or a virtual machine hosted in a real computer? How much memory was available when you ran the program?

Cheers, Yang