First, a clarification; csaw
doesn't do peak calling in any formal sense. Rather, it slides windows across the genome and counts the number of (possibly extended) reads overlapping each window. To do so, it needs access to the position, strand and length of each mapped read. Such information is readily available from the BAM file, which is why this is the default (and only) input format. I toyed around with other input formats, but I didn't feel they were worth supporting directly, given that users could simply convert their files to BAM (or get the original BAM files fresh out of the aligner) prior to entry into csaw
.
Anyway, if windowCounts
were to accept RleList
objects as input, it'd have to figure out how to extract alignment information from the coverage vector. This is impossible for strandedness and length, given that the same coverage profile can be obtained with any combination of forward/reverse and short/long reads. As for the read positions; in theory, if all reads were known to be of the same length and strand (e.g., by supplying different coverage vectors for each combination of length and strand), one could derive the positions from the coverage profile. However, I suspect that the complexity of this procedure would outweigh any memory benefits of using RleList
objects.
Also, as you've noted, the coverage vectors contain the count data for each position. If you cbind
the coverage vectors for all libraries together, you get a count matrix that you can use for statistical analyses. This is equivalent to counting reads in windowCounts
with a window size of 1, a spacing of 1 (i.e., each base is its own window) and no read extension. Obviously, a single-base-resolution analysis will be very memory- and time-intensive, with little gain in information given that the tests for adjacent windows are so highly correlated. We find that a spacing of ~50 bp provides a reasonable compromise between resolution and convenience.