Hi Vince,
summarizeOverlaps is basically two steps: scanBam and findOverlaps.
scanBam:
- This step is run in parallel, by file, with bplapply. If you have many large files I would reduce the number of workers in BPPARAM so memory isn't maxed out, as it can be with the default.
- If the files are big (i.e., generally > 1,000,000 records) it pays to use yieldSize; see the sketch after this list.
- ScanBamParam is useful if you are after a subset of records, but assuming you want to count all of them it doesn't provide an advantage. The code already reads in the minimal information needed to perform overlaps (i.e., it doesn't bring in other fields, flags, etc.).
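Something along these lines is what I mean (a minimal sketch; the file names, toy features, and worker count are all placeholders you'd swap for your own):

    library(GenomicAlignments)   # attaches Rsamtools (BamFileList) and GenomicRanges
    library(BiocParallel)

    ## Toy annotation; in practice this would come from a TxDb etc.
    features <- GRanges("chr1", IRanges(c(1, 1000), width = 500))

    ## Placeholder BAM files; with yieldSize set, records are processed in
    ## chunks of 1e6 so a whole file is never held in memory at once.
    bfl <- BamFileList(c("sample1.bam", "sample2.bam"),
                       yieldSize = 1000000)

    ## Fewer workers = fewer files read at the same time = less memory.
    se <- summarizeOverlaps(features, bfl, mode = "Union",
                            BPPARAM = MulticoreParam(workers = 2))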
findOverlaps:
The overlap step will be faster with a smaller number of features, so if you really don't need the full annotation then yes, subset it (see the sketch below). The new NCList algorithm counts at the C level and the hits are not kept; this has reduced memory use considerably when there are many hits. So while a smaller annotation may improve speed slightly, I don't think it will affect memory much.
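For example (the TxDb and gene IDs here are made up; 'bfl' is reused from the sketch above):

    library(GenomicFeatures)

    ## 'txdb' stands in for your annotation,
    ## e.g. txdb <- makeTxDbFromGFF("genes.gtf")
    ex <- exonsBy(txdb, by = "gene")     # full annotation, one element per gene
    keep <- c("geneA", "geneB")          # hypothetical genes of interest
    ex_sub <- ex[names(ex) %in% keep]    # smaller feature set for the overlap step
    se <- summarizeOverlaps(ex_sub, bfl, mode = "Union")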
Because we aren't reading or manipulating sequences, I can't imagine read length plays a role here. Read positions are stored as 'start' and 'end' (or width), so each read is essentially just two integers. I don't know that there is much of a difference finding overlaps on small vs large ranges; I believe Herve saw a performance difference with many small nested ranges vs non-nested ones, but not with large vs small per se.
In my experience with multiple large files, moderating yieldSize and the number of workers has been the most effective way to control memory.
I'm sure Martin and Herve have more to add.
Val