csaw: how does unequal sample size affect the filterWindows filtering step?
zhangly811

Hi, I have been using csaw to compare ChIP-Seq samples from two conditions. However, I have 2 samples in one condition and 3 samples in the other, with no control (input). I'm concerned about the independent filtering step done by filterWindows, which averages the signal within each window across all ChIP-Seq samples (in this case, 5 samples) and compares that average to the global background. As far as I can tell, binding events that are present only in the condition with fewer samples are less likely to be kept, or have to be at a higher intensity in order to be kept, which may lead to more false negatives because those windows are never even considered in the downstream DB testing.

For example, say the normalized read counts across my samples are 15, 15, 0, 0, 0, so the average is 6. If the global background is 2 and the desired fold change is 3, then this window has FC = 6/2 = 3 and just meets the cutoff. Now consider another window with reads only from the other condition, say 0, 0, 10, 10, 10: the average is again 30/5 = 6 and FC = 6/2 = 3, so it also passes, even though the per-sample signal (10) is weaker than in the first scenario (15). Does anyone share this concern, or have an idea of how to work around this issue? Thank you for your suggestions!
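To make the concern concrete, here is a small base-R sketch of the two hypothetical windows above. The counts and the background value of 2 are made-up numbers, and the plain averaging only mimics the idea of the filter as I understand it; it is not csaw's actual abundance calculation.

    # Two hypothetical windows: one bound only in the 2-sample condition,
    # one bound only in the 3-sample condition (made-up normalized counts).
    win1 <- c(15, 15, 0, 0, 0)
    win2 <- c(0, 0, 10, 10, 10)
    background <- 2   # assumed global background level

    mean(win1) / background   # 6 / 2 = 3, just meets a fold-change cutoff of 3
    mean(win2) / background   # 6 / 2 = 3, also passes, despite weaker per-sample signal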

Tags: csaw, differential binding analysis, sample size, independent filtering, statistical power

You're asking a "Question" rather than offering a "Tutorial" so I've changed the tag. Note that the "Tutorial" tag is for people advertising a tutorial rather than asking for one.

Aaron Lun

When filterWindows performs independent filtering, the independence refers to the fact that the filter does not affect the distribution of p-values when the null hypothesis is true. In other words, the type I error rate is still controlled among the true null windows/regions after filtering. This is usually the priority in high-throughput statistical analyses, i.e., avoiding significant results where there are none. As a field, I suppose we haven't given type II error rates as much consideration; this may be because genome-wide studies will usually give you a non-zero number of significant results, so we can happily take something home.
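For some intuition on the "independence" part, here is a quick toy simulation (nothing to do with csaw's actual machinery, just normal "log-abundances" and an ordinary t-test) showing that filtering on the average across all samples leaves the null p-values roughly uniform, even with an unbalanced 2-versus-3 design:

    # Toy simulation: 10000 windows, no true differences, unbalanced 2 vs 3 design.
    set.seed(1)
    n.windows <- 10000
    group <- factor(c(1, 1, 2, 2, 2))
    y <- matrix(rnorm(n.windows * 5, mean = 5), ncol = 5)

    # Filter on the average "abundance" across all five samples.
    keep <- rowMeans(y) > 5

    # Welch t-test p-values for the retained windows.
    pvals <- apply(y[keep, , drop = FALSE], 1, function(x)
        t.test(x[group == 1], x[group == 2])$p.value)

    # Still roughly uniform under the null, i.e. close to 5% below 0.05.
    mean(pvals < 0.05)

You should get much the same rejection rate with or without the filter, which is what "independent" means here.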

At any rate, the examples you describe can definitely occur, and they will affect the type II error rate with respect to retaining windows for hypothesis testing. However, these seem to me like the natural consequences of an unbalanced experimental design. Let's take it to the extreme: if you have 100 replicates of one condition and one replicate of the other, it will obviously be easier to detect binding events in the first condition than in the second. Filtering on abundance effectively focuses the analysis on windows that are likely to be binding events/enriched regions/peaks, so it will be subject to the experimental design. If you don't like it... design a balanced experiment.

That being said, you should still do filtering in a DB analysis. If not, you will effectively be testing the entire genome for DB, and this will result in a more drastic loss of power due to the multiple testing correction. Moreover, if you do filter, you should do it in an independent manner, which ensures that your discoveries are reliable. Other filtering procedures (e.g., requiring peaks to be present in at least X samples) are appealing in their simplicity but have unpredictable statistical effects.
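In code terms, the independent global filter would look something like the sketch below. Treat it as a sketch following the global-background approach rather than a drop-in solution: bam.files stands in for your five ChIP BAM files, and the window width, fragment length (ext), bin width and 3-fold cutoff are placeholders that you would set for your own data.

    library(csaw)

    param <- readParam(minq = 20)   # placeholder read-filtering settings

    # Count reads into small windows and into large background bins.
    data <- windowCounts(bam.files, width = 50, ext = 100, param = param)
    binned <- windowCounts(bam.files, bin = TRUE, width = 2000, param = param)

    # Independent filter: average abundance relative to the global background.
    filter.stat <- filterWindows(data, binned, type = "global")
    keep <- filter.stat$filter > log2(3)   # keep windows at least 3-fold above background
    filtered.data <- data[keep, ]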


Thanks Aaron for the detailed explanation!
