Question

Hill in P-value distribution for heterogenous ChIP-seq samples

2

Entering edit mode

jma1991 ▴ 70

@jma1991-11856

Last seen 2.1 years ago

Cumbernauld

I have data from a variant of ChIP-seq and would like to call "peaks" from the data by finding differential-binding between my ChIP samples and the paired input samples. I'm currently using the csaw workflow outlined in the excellent F1000 paper. However, I've encountered a few problems when applying it to my data:

My ChIP samples are a lot more heterogenous than my input samples. To overcome this I tested that the distributions were significantly different using quantro, and then applied smooth quantile normalization to the data. Because edgeR wants raw counts, I passed the normalised data to limma-trend. The P-value distribution for the tested windows shows a "hill" in the middle, rather than the uniform distribution I'd expect:

I think this is related to the problem that my ChIP samples are very heterogenous, and the input samples are very homogenous. According to this DeSeq2 - hill–shaped histogram of pvalues , the hill could be caused by this exact issue:

"My guess is that this often stems from the fact that DESeq2, just
like LIMMA, models one variance parameter per gene (the dispersion). This implicitly assumes that the within group variability is not different between the groups. So if this is not correct, since
e.g. the control group is less variable than the treatment group(s), it might lead to
wrong p-values."

I tried to overcome this by using the fdrtool package to estimate the variance of the null–model from the test statistic, however that produces a histogram with the exact/similar P-values. I also tried applying surrogate variable analysis and/or array quality weighting to the smooth quantile normalised data, but I received a similar hump in the histogram.

When I just use the raw counts with either limma-voom or edgeR quasi-likelihood, <100 windows are called differential, but from looking at the data in a genome browser, the calculated P-value seems too high for windows which show consistent signal amongst my ChIP samples.

The only method which appears to greatly increase the number of differential windows, which by eye seem to correlate with the amount and variation of signal seen in the genome browser, is by creating an rlog CPM matrix of the windows (scaling factors calculated by normOffsets function), then applying smooth quantile normalization to the rlog data, calculating surrogate variables, and using limma-trend on this normalised matrix with surrogate variables in the design.

csaw limma sva qsmooth edger • 1.8k views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 28k • written 8.2 years ago by jma1991 ▴ 70

score 1 · Answer 1 · 2017-02-02

I suspect that your ChIP samples are differing systematically in their peak heights, i.e., all peaks go up or down in size in a particular sample compared to one of its biological replicates. This can either reflect variability in actual binding, e.g., due to greater expression of the target protein in one animal; or in the efficiency of immunoprecipitation, which is probably more likely. These systematic changes require some effort for edgeR's model to handle, usually resulting in increased dispersion estimates that reduce detection power. However, there is also some compensation from an increased prior degrees of freedom, which usually increases power; the interplay between these two mechanisms can lead to some funny-shaped p-value distributions.

As for what you can do about it; probably nothing, if your ChIP samples are heterogeneous in the way that I've described above. Put all the fancy statistical methodology to the side for now, and ask yourself: if you have one ChIP library with strong peaks and another library with weak peaks, which one should you trust? For all we know, the levels or binding strength of your target protein in your biological system could just be inherently variable - in which case, your data have successfully captured this variability, and the analysis should model it rather than trying to normalize it out. Of course, you could argue that the variability is technical, mainly driven by differences in immunoprecipitation efficiency, and should be removed prior to further analysis. However, the nature of the ChIP-seq protocol means that binding strength is experimentally confounded with immunoprecipitation efficiency and cannot be separated in the data (unless you also added spike-in chromatin).

I can't speak for what other things qsmooth and sva are doing, but the heterogeneity you've observed would increase the NB dispersion for each window in edgeR. This will result in large DB p-values for each window, which is entirely sensible. Say you have a peak with 200 reads in one ChIP sample, and 500 reads in another replicate ChIP sample - for all edgeR knows, given this level of variability, an additional replicate ChIP sample might contain no reads. Conversely, an additional input sample might contain hundreds of reads (which is not entirely farfetched if you look around repeat regions.) As a result, edgeR cannot confidently call this peak as being differentially bound relative to the input, even though "your eye" tells you that there is a strong binding site here.