edgeR itself can incorporate alternative normalization schemes fairly easily; the real question is whether the assumptions behind the spike-in process are applicable. IIRC, there are two major assumptions:
- The spike-in antibody (usually against some Drosophila histone mark) is subject to the same technical biases as the actual antibody against your desired target.
- Your spike-in addition is sufficiently accurate so that the ratio of the concentrations of spike-in chromatin to your actual chromatin of interest is constant across samples.
To make it all work, you can align reads to a combined genome containing both your human and spike-in reference sequence. Then it's a simple matter of:
- Identifying enriched regions in the combined genome. The safest way to do so is to pool reads from all samples together for a single round of peak calling.
- Creating the usual DGEList, where each row corresponds to an enriched region in the combined genome.
- Subsetting the DGEList to your regions from the spike-in genome (do not set keep.lib.sizes=FALSE!) and running calcNormFactors().
- Transferring the normalization factors from the subset back to the full DGEList.
Steps 3 and 4 would look something like this, assuming your DGEList is named y and you have a GRanges named locations:
is.spike.in <- as.logical(seqnames(locations) %in% c("I", "II", "III")) # I dunno, whatever the spike-in chromosome names are.
ysub <- y[is.spike.in,]
ysub <- calcNormFactors(ysub)
y$samples$norm.factors <- ysub$samples$norm.factors
Here, the TMM step assumes that any difference in the spike-in coverage is technical and should be removed. The transfer of the normalization factors back to y further assumes that the biases affecting the spike-in chromatin are also applicable to the actual test chromatin.
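If it helps to see what those two assumptions buy you, here is a minimal sketch on simulated counts. Everything in it is made up for illustration: the region numbers, the negative binomial parameters and the two-fold global loss of target signal in the second group are all assumptions, not anything from a real experiment. The point is only that, with the library sizes left untouched during subsetting, the spike-in-derived factors make the effective library sizes track the spike-in coverage, so the global change in the target regions survives normalization.

```r
library(edgeR)

set.seed(1)
n.target <- 500   # hypothetical number of enriched regions in the target genome
n.spike  <- 100   # hypothetical number of enriched regions in the spike-in genome

# Simulated counts for 4 samples (2 per group). The second group has a
# two-fold global loss of target signal; the spike-in signal is unchanged.
target <- matrix(rnbinom(n.target * 4, size = 20,
                         mu = rep(c(200, 200, 100, 100), each = n.target)), ncol = 4)
spike  <- matrix(rnbinom(n.spike * 4, size = 20, mu = 100), ncol = 4)
counts <- rbind(target, spike)
is.spike.in <- rep(c(FALSE, TRUE), c(n.target, n.spike))

y <- DGEList(counts, group = c("A", "A", "B", "B"))

# Subset to the spike-in regions. keep.lib.sizes=TRUE is the default, so the
# subset keeps the library sizes computed from the combined genome.
ysub <- y[is.spike.in, ]
ysub <- calcNormFactors(ysub)
y$samples$norm.factors <- ysub$samples$norm.factors

# Effective library sizes should now be roughly equal across samples,
# because the simulated spike-in coverage is the same in every sample.
eff.lib <- y$samples$lib.size * y$samples$norm.factors

# Average normalized abundance of the target regions in each sample.
target.cpm <- colMeans(cpm(y)[!is.spike.in, ])
```

Comparing target.cpm between the two groups should show the simulated global loss still present, which is exactly what TMM on the target regions alone would have normalized away.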
And that's it. After that, it's just the usual edgeR workflow. Personally I always felt that these assumptions were pretty sketchy, and I would prefer to use the binning approach (see Section 4.1 here for some background). But to each their own.
I'll also add that just adding in yeast DNA is not really all that informative. The main appeal of spike-ins is to capture differences in immunoprecipitation efficiency across samples. If you're just throwing in yeast DNA without an antibody against it, you don't get that information; at that point, you might as well save yourself the trouble and use TMM on the bins, especially given that your TF probably isn't binding enough of the genome to compromise the accuracy of the binning approach.
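For comparison, the mechanics of "TMM on the bins" are much the same as the spike-in transfer, just with large background bins standing in for the spike-in regions. The sketch below uses simulated matrices only; the sequencing depths, bin and region counts, and the names bins, regions and lib.size are all hypothetical. On real data you would count reads into bins and regions from your BAM files first, which this sketch does not attempt.

```r
library(edgeR)

set.seed(2)
depth    <- c(1, 1.5, 0.8, 1.2)   # hypothetical relative sequencing depths
lib.size <- round(1e7 * depth)    # hypothetical total mapped reads per sample

# Simulated counts in large (say, 10 kbp) bins, dominated by background.
n.bins <- 2000
bins <- matrix(rpois(n.bins * 4, lambda = rep(50 * depth, each = n.bins)), ncol = 4)

# Simulated counts in the regions you actually want to test.
n.regions <- 300
regions <- matrix(rnbinom(n.regions * 4, size = 20,
                          mu = rep(100 * depth, each = n.regions)), ncol = 4)

# Compute TMM factors from the bins. Both objects are given the same library
# sizes, so that the factors mean the same thing in each of them.
ybins <- calcNormFactors(DGEList(bins, lib.size = lib.size))

y <- DGEList(regions, lib.size = lib.size, group = c("A", "A", "B", "B"))
y$samples$norm.factors <- ybins$samples$norm.factors
```

Because the simulated background simply scales with depth, the factors come out close to 1 here; with real data they would absorb any composition bias between samples.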
Do you have spike-ins? If you do, can you explain what was spiked-in and what measurements you have on them?
TF was mapped in human cells and spiked in with yeast DNA. Both were sequenced so now I have human and yeast reads aligned and quantified.
You are correct that you should not use TMM normalization directly on ChIP-seq counts when the binding enrichment changes systematically between conditions. Spike-ins are one way to get around this, but you can also normalize to "background" reads using large bins across the genome. The csaw package guide shows how to do this, and this ability is now built into the DiffBind package (the vignette walks through this process in some detail).
For spike-ins, there are a number of protocols and commercial kits for performing ChIP-seq spike-ins, usually using Drosophila chromatin. After sequencing, the reads can be aligned against the Drosophila reference genome (either separately or combined with the target genome). Once you have the alignments, the latest version of DiffBind has built-in support for normalizing to spike-in data and performing differential binding analysis using edgeR. The vignette has a section showing how to do this.
Note that, once the normalization parameters have been set, you can export the edgeR DGEList object from within DiffBind for fine-grained control over the edgeR analysis.
Thank you. I will check on this. I also want to point out that it was not ChIP-seq but the CUT&RUN method that was used.