I have a ChIP-seq dataset with two conditions. For each condition I have two replicates of both input and the acutal ChIP experiment: 8 samples in total.
In R the design matrix would look like this:
myDesing <- data.frame(sample=1:8, data=c(rep('Input',4), rep('Chip',4)), condition=c('Wt','Kd'))
myDesing$condition <- factor(myDesing$condition, levels=c('Wt','Kd'))
myDesing$data <- factor(myDesing$data, levels=c('Input','Chip'))
> model.matrix(data=myDesing, ~data + condition)
(Intercept) dataChip conditionKd
1 1 0 0
2 1 0 1
3 1 0 0
4 1 0 1
5 1 1 0
6 1 1 1
7 1 1 0
8 1 1 1
I'm interested in how to handle the input data in regards to doing a differential binding analysis with edgeR, DESeq(2) or Voom-Limma.
Is it possible to incorporating the input data into the linear model?
My idea is to make a model that includes both the input and chip libraries and then test the interaction between chip and condition. More specifically for the R code example above this would correspond to testing the interaction dataChip::conditionKd. Does this make sense or does it violate some of the assumptions?
Does this not correspond to measuring the fold change of Input vs WT ( FCiwt ) and the fold change of Input vs KD (FCikd) and then comparing whether there is a difference between FCiwt and FCikd?
I have read this DESeq2 for ChIP-seq differential peaks which discuss whether one can subtract the input reads from before the DE analysis but it is a bit unsatisfactory as the discussion did not result in a good way to incorporate the input data.
In advance thanks for your help.
Kristoffer
Thanks! As i have condition specific input I will try it :-)
Do you think it is a problem for the normalization that the input libraries are 5-20x smaller than the acutal chip-seq libraries meaning 5-20x fewer counts in the peaks identified ?
In and of itself, differences in sequencing depth between samples are not a problem, as they are automatically modelled after normalization. However, in practice, there are (at least) two issues. Computationally, if the counts are too low, it may not be possible to compute sensible normalization factors. You may need to use fairly large regions of the genome - see the binning approach in the csaw package. The problem is exacerbated if you try to test for differences of ChIP/input log-fold changes; if both inputs are zero, you won't reject the null hypothesis, regardless of what happens to the ChIP samples.
From an experimental perspective, large differences in coverage may be symptomatic of other problems, e.g., failed library preparation. This is especially true for input controls, which should have plenty of DNA available; in contrast, low depth for IgG controls can be excused as a good pulldown protocol should result in a lot less DNA. Large differences in DNA concentration during library preparation may also result in other anomalies, e.g., excessive PCR duplicates in low-diversity libraries.