tuning memory consumption of summarizeOverlaps?
@vincent-j-carey-jr-4

The main parameters to summarizeOverlaps are features and reads. I would like to know what one can do to tune the memory consumption of summarizeOverlaps. One could limit the number of features in play, or define a ScanBamParam to limit the scope of reads being processed, or set a yieldSize in the BAM file reference. Does anyone have data on the options here? Are details of the reads, such as length, or the size of the BAM files, additional determinants of memory consumption?
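For context, a minimal sketch of the kind of call in question (file names and the annotation package are hypothetical), marking the knobs mentioned above:

    library(GenomicAlignments)
    library(Rsamtools)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)   # hypothetical annotation choice

    ## 'features': the annotation to count against (could be subset)
    features <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")

    ## 'reads': the BAM files; yieldSize reads them in chunks
    reads <- BamFileList(c("sample1.bam", "sample2.bam"), yieldSize = 1e6)

    ## a ScanBamParam could also be passed via 'param' to restrict the records
    se <- summarizeOverlaps(features, reads, mode = "Union", singleEnd = TRUE)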

rna-seq summarizeoverlaps
@valerie-obenchain-4275

Hi Vince,

summarizeOverlaps is basically two steps: scanBam and findOverlaps.
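Roughly, and as a simplified sketch rather than the actual internal code (it ignores the mode-specific handling of reads hitting more than one feature):

    library(GenomicAlignments)

    gal  <- readGAlignments("sample1.bam")   # the scanBam step (hypothetical file)
    hits <- countOverlaps(features, gal)     # the overlap/counting step, per feature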

scanBam:

- This step is run in parallel, by file, with bplapply. If you have many large files I would reduce the number of workers in BPPARAM so you aren't running with the default (the maximum available); see the sketch after this list.

- If the files are big (i.e., generally more than about 1,000,000 records) it pays to use yieldSize.

- ScanBamParam is useful if you are after a subset of records, but assuming you want to count them all it doesn't provide an advantage. The code already reads in only the minimal information needed to perform overlaps (i.e., it doesn't bring in other fields, flags, etc.).
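A sketch of the first two points (file names hypothetical; with yieldSize set, each file is processed in chunks rather than read whole, and only two files are counted at a time):

    library(GenomicAlignments)
    library(Rsamtools)
    library(BiocParallel)

    bfl <- BamFileList(c("s1.bam", "s2.bam", "s3.bam", "s4.bam"), yieldSize = 1e6)
    se  <- summarizeOverlaps(features, bfl, mode = "Union", singleEnd = TRUE,
                             BPPARAM = MulticoreParam(workers = 2))  # fewer workers than files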


findOverlaps:

The overlap step will be faster with a smaller number of features, so if you really don't need the full annotation then yes, subset it. The new NCList algorithm counts at the C level and the hits are not kept; this has reduced memory consumption considerably when there are many hits. While a smaller annotation may improve performance slightly, I don't think it will affect memory much.
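For example, if the full annotation isn't needed, something along these lines (a sketch assuming a TxDb-derived GRangesList; the annotation package is hypothetical) restricts the features before counting:

    library(GenomicFeatures)
    library(GenomeInfoDb)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)   # hypothetical annotation choice

    features <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")
    ## keep only features on the standard chromosomes
    features <- keepStandardChromosomes(features, pruning.mode = "coarse")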

Because we aren't reading or manipulating sequences, I can't imagine read length plays a role here. Read positions are stored as 'start' and 'end' (or width), i.e., essentially just two integers per read. I don't know if there is much of a difference in finding overlaps on small vs. large ranges. I believe Herve saw a performance difference with many small nested ranges vs. non-nested ranges, but not simply large vs. small.


In my experience with multiple large files, moderating yieldSize and the number of workers has been the most effective way to control memory.

I'm sure Martin and Herve have more to add.

Val
