To preface this,
I have used DiffBind and csaw for most of my experimental work and love these programs, this is just a general question I had and I have seen versions of these questions asked before on this forum - but would like to ask additional questions namely regarding the normalization procedures.
(see: DEseq2 for differential binding analysis ChIP seq and DESeq2 for ChIP-seq differential peaks)
To summarize, I wanted to try a different approach, rather than using Csaw or DiffBind just for my own knowledge and abilities. To avoid posting lines of code, I'll simply summarize but can edit this in the future: I create a merged bam per condition of all replicates using samtools and called peaks with macs3 off this. I generated a union set of peaks between these two conditions using bedtools sort & merge, followed by converting this into a .saf file, and called featureCounts off the individual replicate Bams using this .saf file, which generates a count matrix of the reads within these peaks per replicate.
I preprocess this count matrix to filter out blacklisted regions as well as low counts in majority of the replicates as follows
Olaps <- overlapsAny(CoordinatesOnly %>% GRanges(), blacklist)
FilteredCounts <- RawCounts[ !Olaps, ]
FilteredCounts <- FilteredCounts [apply(FilteredCounts , 1, function(row) length(row[row>5]) > 2),]
I then import this count matrix and proceed as normal using DESeq2 (that is using the default normalization)
DDS <- estimateSizeFactors(DDS)
DDS <- DESeq(DDS, test = "Wald")
Running a Wald test in DESeq2 on this yields results almost identical to those I get with csaw and diffbind, the general trend in sites is the same and the sites are almost all the same as well.
My question is simple, is this a valid approach to this problem? Or should more time be spent on normalization. In the Csaw paper, DiffBind paper, and the Csaw book there is a lot of emphasis placed on normalization strategies. For this ChIP we are not dealing with a WT vs KO condition so global shifts may not be factor but rather a disease and how this particular protein changes its binding with it. How would one deal with global shifts in an approach like this? Should I trust these results or should I be more cautious?
Thanks in advance!
Thank you so much!! Yeah I had thought about this too, regarding the fact that DiffBind is a wrapper for counting peaks in this way in R directly. If I can ask you as well, how would you handle normalization of more complex libraries (complex in the sense of technical variables)? Would it be sufficient to just run the default normalizations as I did? There were some forum posts in the past (I believe in one of the links I attached actually) that said to use a more lenient q-cutoff for peak calling which I did, so my FRiP using the approach outlined above is between 25-30%. Would this become my "new" library size - similar I guess to RNA-seq where the library size is the column sum of reads of a counts file?
Sorry for reviving this 1 month later, but I was reading about size factor estimation, and just wanted to add my findings if anyone stumbles on this in the future. For me, it turns out the default size factor estimation was virtually identical to manually computing this from the total mapped reads (library size post deduplication and retention of only properly mapped pairs, since this was a PE seq).
Here's my code
When plotting volcanos between the two methods for size factor estimation I get the following below, which look pretty similar so I think that the default size estimation captured the normalization from my reads in peaks quite well (is my conclusion anyway).
Curious to hear what anyone else thinks!
That plot looks extremely worrying. You have very unbalanced DE profiles and clearly two populations of peaks on the right. I would make sure that a) proper prefiltering is done, normalization worked well (use MA-plots) and that there is not some other oddities (use PCA to inspect data clutering).
I noticed the dual populations as well, when using csaw this is controlled for. Regarding the DE profile, given this disease condition we expect global sustained accumulation of this histone so I would anticipate a major right skew (csaw normalization on composition bias). Looking at the MA plots, I see what you mean. I dropped my significance for peak detection pretty low to quartile the peaks in terms of expression, so even though in this bulk comparison I filter out low counts, there are probably low enriched peaks retained - maybe that is skewing this.
There are additional artifacts in the data, these were sequenced in 3 separate sequencing batches (counter balanced, and controlled for in the model), but one sample has a particularly low depth, while two samples have particularly high depths, I thought about just removing that one sample. For the record, did not generate this data, I am just analyzing it. Csaw seems to control for all these artifacts quite well.
Perhaps edgeR's upperquartile normalization handles this better, the MA looks a lot more centered at 0
If it was global changes then (in my head) why is there two populations of peaks (that to me indicates some sort of confouder) and also, if things globally go up, shouldn't then this not be removed by the normalization because *seq is inherently relative, hence global changes are cancelled out by the size factors? That MA-plot you show first indicates some sort of non-linear bias and then this second population of peaks that I don't really have an idea what it could be. Interesting though, never seen something like that.
Thanks for your time on this, I have also never seen anything quite like this before neither. The only confounding variable I received in the metadata file was the different sequencing batches, but there may be others (this seq was done 3 years ago now, before my time in this lab). On the PCA, samples cluster first by this sequencing batch (which is expected, I've seen this many times), once correcting with ComBat for visualization its resolved, and modelling for it in the glm, the within sample variability (specifically along PC2) is still quite high. There is a decent probability different brands of antibodies (and certainly different LOT # of these antibodies) were used.
I remember reading in the csaw manual (I think) that systemic differences in binding would cause a global shift on the MAs. I have applied offsets using their normOffsets() function and plotted a loess regression through the MA which seems centered, based on discussions this shift up could also have been a trended bias (Ignore the quality of the graph, I rushed it in ggplot).
Thank you so much for your help again, this was strangest dataset I've worked with, usually this is a bit more straight forward. I'm going to tag Aaron and maybe he has some thoughts as well Aaron Lun
I think for this project I will stick with Csaw, this was just a side project, since its all up on our cluster the execution time is very fast, might have to put my Csaw script up on our cluster as well ...