Question

csaw : effect of library size on filtering of uninteresting windows

Vivek.b ▴ 100

@vivekb-7661

Germany

Hi there,

I am using the csaw package to filter the background from my data with the "local enrichment" method. I first tested it on a dataset with low input material (resulting in small library sizes) and found that the method works nicely when I keep all regions with 2-fold enrichment over the local background. But when I tested it on a dataset with higher input material (resulting in roughly 10x larger library sizes), I found that I had to raise the filtering threshold to 6-fold enrichment to keep the bound regions without noise.

I would like to automate this process, so I am wondering: what would be an appropriate way to choose a filtering cutoff from the filter.stats that works for all library sizes?

I managed to use cpm instead of the raw windowCounts and regionCounts to get the filter.stats. But the distribution of filter.stats is still not similar between the two kinds of samples, so I can't use a single cutoff for both. Any ideas?

Thanks

Vivek

csaw • 1.4k views

updated 7.7 years ago by Aaron Lun ★ 28k • written 7.7 years ago by Vivek.b ▴ 100

score 1 · Accepted Answer · 2017-03-22

The only component of filterWindows that does not cancel out with library size is the pseudo-count used in aveLogCPM. This squeezes the filter statistics (i.e., filter in the output) towards zero. The smaller the counts, the stronger the shrinkage, and with good reason: otherwise the function would happily report "large" enrichments for regions with a handful of reads. In practice, this means that a small threshold at a low library size is as stringent (in terms of the number of windows retained) as a large threshold at a larger library size.
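To make the shrinkage concrete, here is a rough numeric sketch in plain Python. It is not csaw or edgeR code: it assumes a simplified statistic, log2((window + p) / (background + p)), with an illustrative pseudo-count p = 2 and a hypothetical true enrichment of 4-fold. The real aveLogCPM also scales the prior count by library size, which this toy model ignores.

```python
import math

def shrunk_log2_fold(window_count, background_count, pseudo_count=2.0):
    """log2 enrichment of a window over its width-scaled local background.

    Simplified model (an assumption, not the csaw implementation): the same
    pseudo-count is added to both counts before taking the ratio.
    """
    return math.log2((window_count + pseudo_count) /
                     (background_count + pseudo_count))

TRUE_FOLD = 4.0  # hypothetical true enrichment, i.e. a log2 fold of 2

# Same underlying enrichment, increasing sequencing depth.
for background in (1, 10, 100, 1000):
    window = TRUE_FOLD * background
    stat = shrunk_log2_fold(window, background)
    print(f"background={background:>5}  window={window:>6.0f}  "
          f"log2 fold={stat:.3f}")
```

With few reads, the statistic sits well below the true log2 fold of 2, and it climbs towards 2 as the counts grow. A fixed cutoff therefore lets through more windows in a deeper library than in a shallow one.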

Now, if all of your libraries are large, the behaviour of the pseudo-count will not make a difference. This is because the amount of shrinkage will approach zero as your counts increase, such that you should get similar results with the same threshold for different (but still large) library sizes. However, for very small libraries, it will affect the results - after all, that's why we use it - so you'll just have to pay attention to those cases.
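As a rough illustration of why only the small libraries need attention, the sketch below measures how much a 10x increase in depth changes the amount of shrinkage. It uses the same kind of simplified model as an assumption (a fixed pseudo-count of 2 added to window and background counts, and a hypothetical 4-fold true enrichment), not the actual csaw calculation.

```python
import math

PSEUDO_COUNT = 2.0  # assumed value, for illustration only

def shrinkage(background_count, true_fold=4.0, pseudo_count=PSEUDO_COUNT):
    """How far the pseudo-count pulls the log2 fold below its true value."""
    window_count = true_fold * background_count
    observed = math.log2((window_count + pseudo_count) /
                         (background_count + pseudo_count))
    return math.log2(true_fold) - observed

def gap_after_10x(background_count):
    """Change in shrinkage when the library becomes 10x deeper."""
    return shrinkage(background_count) - shrinkage(10 * background_count)

small_gap = gap_after_10x(1)     # shallow library vs. one 10x deeper
large_gap = gap_after_10x(1000)  # deep library vs. one 10x deeper

print(f"10x step from a shallow library shifts the statistic by {small_gap:.3f}")
print(f"10x step from a deep library shifts the statistic by {large_gap:.4f}")
```

Under these assumptions, the same 10x difference in depth moves the statistic substantially when counts are small and hardly at all when they are large, which is why one threshold can serve several large libraries but not a mix of large and very small ones.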

Full automation of these analyses would be nice. But then I wouldn't have a job.