Entering edit mode
I have run a very large bam (186 G) file through the countBam function from the Rsamtools package, and got an unexpected error:
> library(Rsamtools) > countBam(large.bam.file) Error in value[[3L]](cond) : 'countBam' failed: record: -2023186199 error: 0 file: /home/NFS/research_projects/combat_cancer/Lung_cancer_data_set/Mouse/FVB_NJ.bam index:
At first I thought that there was an error in the bam file, but we have checked the integrity of the bam file compared to the source using md5, and additionally running samtools from the command line on the same file just works fine:
$ samtools view -c ./large_bam_file.bam 2271781097
Also, I have used countBam successfully for other smaller bam files. Does anyone have any suggestions as to where this error stems from?
Here's my sessionInfo():
> sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.1 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] snow_0.4-1 futile.logger_1.4.1 [3] CopyhelpeR_1.4.0 DNAcopy_1.46.0 [5] chipseq_1.22.0 ShortRead_1.30.0 [7] GenomicAlignments_1.8.0 SummarizedExperiment_1.2.2 [9] Biobase_2.32.0 Rsamtools_1.24.0 [11] Biostrings_2.40.0 XVector_0.12.0 [13] BiocParallel_1.6.2 GenomicRanges_1.24.0 [15] GenomeInfoDb_1.8.1 IRanges_2.6.0 [17] S4Vectors_0.10.0 BiocGenerics_0.18.0 [19] data.table_1.9.6 gtools_3.5.0 [21] matrixStats_0.50.2 loaded via a namespace (and not attached): [1] zlibbioc_1.18.0 lattice_0.20-33 hwriter_1.3.2 [4] tools_3.3.0 grid_3.3.0 latticeExtra_0.6-28 [7] lambda.r_1.1.7 RColorBrewer_1.1-2 futile.options_1.0.0 [10] bitops_1.0-6 chron_2.3-47
That looks like an integer overflow problem to me. You have more records in the bam file than can be stored in an integer.
Perhaps as a workaround you could count several subsets of the reads (perhaps using ScanBamParam to pick chromosomes) and then combine the individual results as numerics?
Thanks for the suggestion. The maximum number an integer can have on my system is as follows:
Can any of the authors of the Rsamtools package fill us in whether counting reads can only go up to the maximum integer number?
Yes, that's the current limitation; it'll be updated shortly.
Thanks for clarifying that. Will it be implemented in the next Bioconductor release?