Question

csaw with negative controls


chriad ▴ 10

@chriad-10721


Say I'd like to assess 'differential binding' of a histone mark (K9me2) between two conditions (mutant vs wildtype). In other words, I want to know how much the enrichment of the histone mark changes when going from wildtype to mutant.

I have 2 replicates for each condition, each with a matching input (8 libraries).

The simplest thing to do, as I understand it, would be to compute enrichment over input for each paired ChIP and input sample and then compare those fold changes between the two conditions.

In the vignette in section 3.5.3 it says that "These controls are mostly irrelevant when testing for DB between ChIP samples". But this is only true if there is only one input reference for all samples as is the case with the example in the vignette.

In the paper "csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows" it says:

The ... the GLM framework means that csaw can incorporate condition specific controls into a regular DB analysis in several ways...

One approach is to include the controls in the linear model so that the log fold change between conditions for the ChIP samples is compared to that of the controls.

Would that mean passing a contrast of the form (ChIP.mutant - input.mutant) - (ChIP.wt - input.wt) to the linear model (with a two-factor design with ChIP/input and mutant/wt) and then testing for DB?

Another approach is to normalize the ChIP samples to condition specific controls and pass the adjustments to csaw as offsets for GLM fitting.

Does that mean normalizing all the samples (ChIPs and inputs) together (e.g. with a call to normOffsets) and then using these norm.factors in the downstream analysis? Does that mean we can now compare the ChIP libraries (in mutant and wildtype) independently of their respective backgrounds, because we have accounted for the background in the normalization?

I am very confused about how to incorporate input controls beyond just comparing simple enrichments.


score 2 · Answer 1 · 2016-07-12

People generate input libraries mostly because they like getting absolute binding calls, to complement their differential binding calls. While it is certainly possible to incorporate input information into the DB analysis, I generally don't do so routinely. This is because it is difficult to do it in a sensible manner that preserves power:

Subtracting input coverage from ChIP coverage disrupts the mean-variance relationship and interferes with statistical modelling. See "A: DESeq2 for ChIP-seq differential peaks" for a discussion of this.
Fitting a GLM to compare the log-fold changes between ChIP and input across conditions, as suggested in the paper, is a better approach. However, this becomes problematic when changes in chromatin state (that affect input coverage) coincide with changes in binding. If you get a two-fold increase in accessibility, this would negate your ability to detect a genuine two-fold increase in binding at the same location. Such coincidences are more likely than not, as increased accessibility is often associated with increased binding. The normalisation approach runs into similar issues if you try to compute site-specific offsets.
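
To make the second point concrete, here is a small arithmetic sketch in Python. It is not csaw code: the coverage values are made-up numbers for a single window, and the function name is only for illustration. It evaluates the contrast (ChIP.mutant - input.mutant) - (ChIP.wt - input.wt) on log2 coverage, and compares it with a direct ChIP-vs-ChIP log-fold change.

```python
import math


def interaction_log_fc(chip_mut, input_mut, chip_wt, input_wt):
    """log2 of (ChIP/input in mutant) divided by (ChIP/input in wildtype)."""
    return (math.log2(chip_mut) - math.log2(input_mut)) - (
        math.log2(chip_wt) - math.log2(input_wt)
    )


# Hypothetical normalised coverage for one window in wildtype.
chip_wt, input_wt = 40.0, 10.0

# Case 1: ChIP coverage doubles in the mutant, input coverage is unchanged.
binding_only = interaction_log_fc(80.0, 10.0, chip_wt, input_wt)

# Case 2: ChIP coverage doubles AND input coverage doubles
# (e.g. the region became more accessible in the mutant).
binding_and_input = interaction_log_fc(80.0, 20.0, chip_wt, input_wt)

# Direct ChIP-vs-ChIP comparison, ignoring the inputs entirely.
chip_only = math.log2(80.0) - math.log2(chip_wt)

print(binding_only, binding_and_input, chip_only)
```

In the first case the contrast reports roughly a two-fold change, the same as the direct ChIP comparison. In the second case the input change cancels the ChIP change and the contrast is close to zero, even though ChIP coverage genuinely doubled.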

If you are truly worried about spurious DB due to changes in chromatin state, you can take Rory's suggestion and use the GreyListChIP package to identify problematic regions. If I remember correctly, this involves identifying significant peaks in inputs from one or more conditions, and then screening out the offending regions in your DB analysis of the actual ChIP samples. This is probably the safest approach, given that it is difficult to properly remove the effect of changes in chromatin state from the DB statistics.
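
As a rough illustration of the screening step only, the sketch below drops windows that overlap a list of problematic regions. It does not use or reproduce the GreyListChIP API; the intervals, the half-open coordinate convention and the helper names are all assumptions made for the example.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_greylisted(windows, greylist):
    """Keep only the windows that overlap no greylisted region."""
    return [w for w in windows if not any(overlaps(w, g) for g in greylist)]


# Hypothetical regions with significant signal in the inputs.
greylist = [("chr1", 1000, 2000)]

# Hypothetical windows from the DB analysis of the ChIP samples.
windows = [
    ("chr1", 500, 650),    # upstream of the greylisted region
    ("chr1", 1900, 2050),  # overlaps the greylisted region
    ("chr2", 1000, 1150),  # same coordinates, different chromosome
]

kept = drop_greylisted(windows, greylist)
print(kept)
```

Only the second window is removed here. With many windows, a sorted or indexed interval lookup would be more appropriate than this quadratic scan.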

P.S. Another option is to use the input controls for filtering. Your inputs will give a reasonable estimate of the background noise at each site, which can refine the choice of filter threshold. (Note that this isn't done in a condition-specific manner, as otherwise the filtering process wouldn't be independent of the hypothesis testing.) See the user's guide for some examples.
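
A minimal sketch of that idea in Python, assuming the counts for each window have already been scaled for library size. This is not the procedure from the user's guide: the function name, the three-fold threshold and the prior count are hypothetical choices for illustration. The key point it shows is that the ChIP and input averages are taken over all libraries from both conditions, so the filter does not depend on which condition a library belongs to.

```python
import math


def keep_window(chip_counts, input_counts, min_log2_fc=math.log2(3), prior=1.0):
    """Keep a window if average ChIP coverage exceeds average input coverage.

    Averages are taken across ALL ChIP libraries and ALL input libraries,
    regardless of condition. `prior` stabilises the ratio at low counts.
    """
    chip_avg = sum(chip_counts) / len(chip_counts)
    input_avg = sum(input_counts) / len(input_counts)
    log2_fc = math.log2(chip_avg + prior) - math.log2(input_avg + prior)
    return log2_fc >= min_log2_fc


# Hypothetical scaled counts: 2 wildtype + 2 mutant ChIP libraries,
# and the 4 matching inputs.
enriched = keep_window([30, 34, 90, 86], [9, 11, 10, 10])
background = keep_window([12, 10, 14, 12], [9, 11, 10, 10])

print(enriched, background)
```

The first window is retained even though its enrichment is concentrated in two of the four ChIP libraries, because the filter looks only at the overall average and not at the difference between conditions.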