I have data from a variant of ChIP-seq and would like to call "peaks" from the data by finding differential-binding between my ChIP samples and the paired input samples. I'm currently using the csaw workflow outlined in the excellent F1000 paper. However, I've encountered a few problems when applying it to my data:
My ChIP samples are a lot more heterogenous than my input samples. To overcome this I tested that the distributions were significantly different using quantro, and then applied smooth quantile normalization to the data. Because edgeR wants raw counts, I passed the normalised data to limma-trend. The P-value distribution for the tested windows shows a "hill" in the middle, rather than the uniform distribution I'd expect:
I think this is related to the problem that my ChIP samples are very heterogenous, and the input samples are very homogenous. According to this DeSeq2 - hill–shaped histogram of pvalues , the hill could be caused by this exact issue:
"My guess is that this often stems from the fact that DESeq2, just
like LIMMA, models one variance parameter per gene (the dispersion). This implicitly assumes that the within group variability is not different between the groups. So if this is not correct, since
e.g. the control group is less variable than the treatment group(s), it might lead to
wrong p-values."
I tried to overcome this by using the fdrtool package to estimate the variance of the null–model from the test statistic, however that produces a histogram with the exact/similar P-values. I also tried applying surrogate variable analysis and/or array quality weighting to the smooth quantile normalised data, but I received a similar hump in the histogram.
When I just use the raw counts with either limma-voom or edgeR quasi-likelihood, <100 windows are called differential, but from looking at the data in a genome browser, the calculated P-value seems too high for windows which show consistent signal amongst my ChIP samples.
The only method which appears to greatly increase the number of differential windows, which by eye seem to correlate with the amount and variation of signal seen in the genome browser, is by creating an rlog CPM matrix of the windows (scaling factors calculated by normOffsets function), then applying smooth quantile normalization to the rlog data, calculating surrogate variables, and using limma-trend on this normalised matrix with surrogate variables in the design.