Hi, I have a question about the definition of consensus peak set. I can think of two options:
A) A single peak set obtained by concatenating and merging multiple peak sets, akin to: cat 1.bed 2.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin > consensus.bed
B) A "pooled" peak set consisting of all the peaks from the two input peak sets, akin to: cat 1.bed 2.bed > consensus.bed
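To make the difference between the two options concrete, here is a minimal pure-Python sketch. It does not use bedtools or DiffBind; the function names and the example peaks are made up for illustration, and the merge rule (overlapping or book-ended intervals are joined) is only meant to mimic the default behaviour of bedtools merge.

```python
def merge_peaks(peaks):
    """Option A: merge overlapping or book-ended intervals per chromosome."""
    merged = []
    # Sorting by (chrom, start, end) plays the role of `sort -k1,1 -k2,2n`.
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of adding a new one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def pool_peaks(*peak_sets):
    """Option B: concatenate peak sets, keeping overlapping intervals as-is."""
    return [peak for peak_set in peak_sets for peak in peak_set]


# Hypothetical peaks standing in for 1.bed and 2.bed.
sample1 = [("chr1", 100, 300), ("chr1", 1000, 1200)]
sample2 = [("chr1", 250, 450), ("chr2", 50, 150)]

pooled = pool_peaks(sample1, sample2)   # option B
consensus = merge_peaks(pooled)         # option A

print(len(pooled), "pooled peaks")
print(len(consensus), "merged peaks:", consensus)
```

In option B the two overlapping chr1 peaks both survive, so reads in the shared region could be assigned to either; in option A they collapse into one wider interval.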
Which one is it?
I ask because I am trying to decide whether it would be a good idea to run IDR on my samples and use the single IDR-thresholded peak set as the "consensus" for differential binding analysis. I would do this if A) is the right answer. However, if it is B), then IDR is somewhat redundant, as DiffBind would simply tell me which of the peaks are significant.
Thanks in advance.
-Wes
I see, thanks Gord. I also welcome any advice from you brilliant folks. That's the first I've heard of that, so I'll give it a go on my data.
This is outside the scope of DiffBind, but do you personally prefer obtaining significant differences through DiffBind without the IDR pipeline? Or do you actually follow Li's advice?
I wanted to add a few things to what Gord said.
The ENCODE standards, including IDR, were developed as part of an effort to identify where binding sites and epigenetic marks are located. The focus of this type of "mapping" exercise is on identifying the locations of binding sites with high confidence.
The goals of a differential analysis are different. We are trying to identify genomic intervals where we have confidence that binding levels have changed. A definitive "map" of high-confidence binding sites is not required to accomplish this. The techniques used in DiffBind should be robust to the inclusion of low-confidence binding sites and noise, so long as there are sufficient replicates to properly power the analysis. Only sites that consistently differ in read density across all the replicates in the sample groups should be identified as being differentially bound with high confidence (low FDR). So choosing a "lenient" consensus set, and not worrying too much about getting a perfect set, is fine.
Secondly, regarding the merging of peaks that overlap in multiple samples: we do this so that the consensus peaks are unique in the bases they cover, so we can uniquely assign reads when counting. There are some downsides to this. One is that the peak intervals tend to get wider the more samples there are, and wider peaks can include more background, which can compromise the analysis. For "punctate" peaks such as transcription factor binding, we recommend re-centering the peaks using the summits parameter in dba.count(). This will identify a consensus "summit" (point of highest coverage) and replace the peak interval with a new one of consistent width centered on the summit. For example, if you specify summits=200, the peak intervals will all be 400bp (200bp upstream and downstream of the summit).
Another disadvantage of merging (and re-centering) is that the consensus peaks can be difficult to relate back to the originally called peaks. The idea is that DiffBind helps identify regions on the genome where we have high confidence that the binding changed; there can then be more detailed analysis of what is going on in these regions (which may involve a complex pattern of enrichment).
-Rory
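The summit re-centering described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not DiffBind's implementation: the helper names and the coverage values are hypothetical, the summits argument simply mirrors the name of the dba.count() parameter, and the exact interval width DiffBind reports may differ from this half-open convention.

```python
def consensus_summit(start, coverage):
    """Return the position of highest coverage within a peak.

    coverage[i] is the (hypothetical) pooled read depth at position start + i.
    """
    best_offset = max(range(len(coverage)), key=lambda i: coverage[i])
    return start + best_offset


def recenter(chrom, summit, summits=200):
    """Replace a peak with a fixed-width interval centered on its summit."""
    # Clamp at 0 so the interval never runs off the start of the chromosome.
    return (chrom, max(0, summit - summits), summit + summits)


# A wide merged peak (chr1:100-450) with made-up coverage that spikes once.
peak_start, peak_end = 100, 450
coverage = [1] * (peak_end - peak_start)
coverage[120] = 40

summit = consensus_summit(peak_start, coverage)
recentered = recenter("chr1", summit, summits=200)
print(summit, recentered)
```

Every re-centered interval has the same width regardless of how wide the merged peak had become, which is the property that limits the amount of background included when many samples are merged.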
This makes a lot of sense, thank you for the insight!
-Wes
Hi,
We (our Bioinformatics Core group) do our best to insist on at least 3 replicates, so IDR, which is designed around pairs of replicates, is not directly applicable. Personally I don't use IDR; I just use DiffBind and the underlying package DESeq2, which does the actual statistical analysis. I don't think anybody in our group uses IDR routinely.
Cheers,
- Gord
Ah, I see. Thanks for the response!
-Wes