Our advice is: don't do input subtraction. This will break the mean-variance modelling in edgeR
. For example, let's say you have two windows; one with large IP and input counts (I'll call this window X), and another with small IP and input counts (Y). Under any realistic model like the NB model used in edgeR
, the variance of the counts for window X should be higher than that of Y, simply because the counts are larger and have more scope to vary.
Now, assume that subtraction is performed such that the average of the "IP - input" counts is the same between both windows. This means that we end up with a window that has a large variance (i.e., X), at the same mean as a window with a smaller variance (Y). edgeR
cannot model this properly during empirical Bayes shrinkage, as the mean-dispersion trend will only have a single fitted value for the dispersion at any given mean. Should that fitted value be optimized for modelling the variance of X, such that the variance of Y will not be modelled properly; or Y, such that if will fail for X; or try to go for some compromise, in which case neither window will be modelled properly? In fact, the trend fitting itself will be confounded if all of the windows are crammed into a smaller range of mean values after subtraction.
And that's not even considering problems with negative counts. This is mostly problematic if your input and IP signals are of comparable size, such that there is a decent chance that the former is larger than the latter. You could apply a floor of zero to the post-subtraction counts, but then you might end up with substantial zero inflation that affects the accuracy of the NB model. I also note that you've done a scaling approach to adjust for library size prior to subtraction. Doing this properly is not trivial, e.g., classic edgeR
uses a sophisticated quantile-matching strategy in q2qnbinom
to preserve the mean-variance relationship upon library size adjustment.
Check out some discussions at A: DESeq2 for ChIP-seq differential peaks for an equivalent problem in ChIP-seq data. Finally, as an aside, it seems like you're doing a window-based analysis of this data. I suspect that you may already know about csaw
, based on our off-list conversations; it might be worth testing it out on this data set.
You'll want to fix the tag, otherwise the
edgeR
maintainers won't see this. Fortunately for you, I'm here all the time.