Question

CAGEr BAM files size issue

1

Entering edit mode

orzechowska.mag ▴ 10

@orzechowskamag-13472

Last seen 7.7 years ago

Dear Colleagues,

Recently I am analyzing my CAGE data using CAGEr package and I have experienced some difficulties. My project consists of 10 libraries among which 9 .bam files are up to 940 MB and one .bam is 1.5 GB. At the level of getCTSS(cage_bam) for this one huge file I obtain error:

> getCTSS(cage_bam)

Reading in file: ./MCF7_xx.bam...

	-> Filtering out low quality reads...
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : 
  negative length vectors are not allowed

What I have found is that the issue is caused by size of this huge .bam file. My other .bams work perfectly fine. Moreover, I used SAMtools to extract subset of this huge .bam and it worked perfectly up to size of 1.1 GB. The computing power of my PC is really big, so I guess that the issue is located at different point. In addition, I am using the latest versions of R, RStudio, CAGEr and other adjacent packages @ Ubuntu Unity.

I would very appreciate your help how to solve this issue.

Thank you in advance and best regards,

Magda

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.1 CAGEr_1.18.0                      BSgenome_1.44.0                   rtracklayer_1.36.3               
 [5] Biostrings_2.44.1                 XVector_0.16.0                    GenomicRanges_1.28.3              GenomeInfoDb_1.12.2              
 [9] IRanges_2.10.2                    S4Vectors_0.14.3                  BiocGenerics_0.22.0              

loaded via a namespace (and not attached):
 [1] splines_3.4.1              zlibbioc_1.22.0            GenomicAlignments_1.12.1   beanplot_1.2               BiocParallel_1.10.1       
 [6] som_0.3-5.1                lattice_0.20-35            tools_3.4.1                SummarizedExperiment_1.6.3 grid_3.4.1                
[11] data.table_1.10.4          Biobase_2.36.2             matrixStats_0.52.2         Matrix_1.2-10              GenomeInfoDbData_0.99.0   
[16] bitops_1.0-6               RCurl_1.95-4.8             VGAM_1.0-3                 DelayedArray_0.2.7         compiler_3.4.1            
[21] Rsamtools_1.28.0           XML_3.98-1.9

CAGEr BAM software error R biostrings • 2.5k views

ADD COMMENT • link updated 7.7 years ago by Charles Plessy ▴ 180 • written 7.8 years ago by orzechowska.mag ▴ 10

0

Entering edit mode

Dear Magda, thanks for the report; could you suggest a pair of publicly available BAM files where one triggers the bug and the other does not ?

ADD REPLY • link 7.8 years ago Charles Plessy ▴ 180

0

Entering edit mode

Dear Charles, unfortunately I could not find as big file as mine, which generates the same difficulties. I shared via e-mail with you my file, which triggers the bug and one publicly available, which does not (http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageGm12878CellPapAlnRep1.bam).

ADD REPLY • link 7.8 years ago orzechowska.mag ▴ 10

0

Entering edit mode

Hi,

The error message is coming from Biostrings and is admittedly not helpful. I just changed this in Biostrings 2.44.2 and now it says:

> unlist(big_dna_string_set)
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : 
  XStringSet object is too big to be unlisted (would result in an XString
  object of length 2^31 or more)

It looks like the getCTSS() function in the CAGEr package is trying to unlist some XStringSet object containing quality strings that are too big to be concatenated in one single string. Maybe that could be avoided, I don't know, I'm not familiar with the CAGEr package.

H.

PS: Biostrings 2.44.2 should become available via biocLite() in the next 24 hours or so.

ADD REPLY • link 7.7 years ago Hervé Pagès 16k

0

Entering edit mode

Hi,

thank you for reply.

Maybe I should ask Biostrings support how to solve this issue then.

I have some alternative solution that before sorting my .bam I will randomly pick the reads to prepare a bit smaller, digestible file for CAGEr, but ideally I wish to avoid that. This is solution for the worst case scenario.

Best regards!

ADD REPLY • link 7.7 years ago orzechowska.mag ▴ 10

0

Entering edit mode

Hi,

You're using the CAGEr package so I'm not sure exactly how the Biostrings package could help you solve this issue. Unless I add support for long XString objects, which is on the radar but unlikely to happen in the near future. In the mean time, one way to solve the issue would be to revisit the implementation of the getCTSS() function from CAGEr. There should be a way for the function to avoid unlisting the XStringSet object of qualities. Hopefully the CAGEr maintainers will be able to help you with this.

Cheers,

H.

ADD REPLY • link 7.7 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé and Magda,

thanks for the report and the guidance.

getCTSS() loads the BAM files with Rsamtools::ScanBam, and then extracts the qualities from the returned object, in which they are stored as a PhredQuality object. For each read, an average quality is calculated, and that requires the PhredQuality object to be converted, for instance to a matrix to in a IntegerList. Basically:

bam <- scanBam(inputFiles(object)[i], param = param)
qal.avg <- as.integer(mean(as(bam[[1]]$qual, "IntegerList")))

In Magda's BAM files, there are 35,560,858 sequences of 73 nucleotides each, and conversion of the PhredQuality object to matrix or IntegerList fail with the same error.

Unless there are better suggestions, I will process the qualities by batches which are short enough to allow for the conversion (20,000,000 or smaller, if sequence length also matters?).

Have a nice week-end,

-- Charles

ADD REPLY • link 7.7 years ago Charles Plessy ▴ 180

score 1 · Accepted Answer · 2017-08-08

1

Entering edit mode

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 18 months ago

Japan

I have pushed update 1.18.1 to the [Bioconductor release 3.5 branch](http://bioconductor.org/packages/3.5/bioc/html/CAGEr.html), where I [implemented the solution discussed above](https://github.com/Bioconductor-mirror/CAGEr/commit/0a1c85d2e8f99b0104446fb301f013ce2b4dfaad) (parse the qualities by chunks of 1,000,000 to avoid hitting limits on object length). The correction is not yet in the devel branch, where I opportunistically wait for the transition to Git to be completed.

Please let us know if it solves your problem !

ADD COMMENT • link 7.7 years ago Charles Plessy ▴ 180

0

Entering edit mode

The issue is gone now. Thank you very much for your support and solution. I really appreciate it.

All the best!

ADD REPLY • link 7.7 years ago orzechowska.mag ▴ 10