Question

Technical details of DiffBind correlation analysis

0

Entering edit mode

meisan406 • 0

@meisan406-7061

Last seen 7.4 years ago

United States

I would like to clarify some technical details which I’ve been trying to search for in the documentations for DiffBind, but seem to have missed it. I am interested to know the details of how the correlation of occupancy (using peak caller score) is generated. Is this a Pearson Correlation Coefficient? Is it purely based on peak caller score, or does it take into account the overlap coordinates of peaks? And if it does take into overlapping of peaks, at what percentage do you consider peaks overlap?

diffbind duplicatecorrelation • 3.5k views

ADD COMMENT • link updated 15 months ago by Rory Stark ★ 5.2k • written 10.4 years ago by meisan406 • 0

score 1 · Answer 1 · 2014-11-18

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 3 months ago

Cambridge, UK

Hello Mei San-

Here is the sequence that occurs when you read in peaksets and plot a correlation heatmap.

First, each peak set is read in separately. The scores are normalized to a 0-1 scale as follows:

> scores <- scores/max(scores)

(if LowerBetter was specified, this is followed by scores <- 1-scores.)

Second, all the peaksets are merged. Any overlapping peak regions are extended to encompass the contiguous peaks, so peaks are consider to overlap if at least one base position overlaps.

Third, each interval in the merged peakset is assigned a score for each sample. If a sample had one peak called that overlaps the merged interval, it receives the (normalized) score for that peak. If a sample has more than one peak called that overlaps a merged interval, it is assigned the maximum value of the overlapping peak scores for that sample. If a sample has no peaks that overlap a merged interval, it is assigned a score of -1.

At this point, there is a binding matrix with rows corresponding to the merged intervals, and columns corresponding to each sample. Each column then is a vector representing the peak scores for that sample (with -1 scores for all the regions that were not identified as peaks for that sample).

To make the correlation heatmap (by calling dba.plotHeatmap() or just plot(), which maps to the same thing), the log2() values of the scores are obtained, then a correlation value is computed for each pair of score vectors. This is a pearson correlation by default, although it is changeable using the distMethod parameter to dba.plotHeatmap().

The whole matrix of correlations is returned silently by dba.plotHeatmap(), so if you assign this to a value, you can see the correlation values.

Hope this help!

Cheers-

Rory

ADD COMMENT • link 10.4 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hello Rory.

Thanks. Definitely did help. Does the affinity correlation have the same sequence?

I am asking mainly because I have a data set that has very low correlation with the occupancy analysis (~0.2) between replicates, but much higher correlation analysis (~0.8) with the affinity analysis and I'm trying to conclude meaningfully what does it mean. Am I right to say that the higher correlation with affinity counts mean that a differential analysis of the read counts is the better way to compare the samples across different contrasts?

ADD REPLY • link 10.4 years ago meisan406 • 0

0

Entering edit mode

The difference could reflect issues with the peak calling. In the occupancy analysis, if a site is not called as a peak for the sample, it has no score (-1), while in the affinity analysis, it will still get a relative score based on the number of reads. when I see a big different in correlation rates (and/or clustering), I prefer to use the read counts, as that reflects the actual signal detected by the sequencing.

-R

ADD REPLY • link 10.4 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Can you please explain what is the score in the occupancy analysis? is it the peak's height? peak's intensity?

ADD REPLY • link 7.8 years ago C T ▴ 140

0

Entering edit mode

In an occupancy analysis, the score is as described in my original answer (normalised peak score as specified in the peak file). The calculation of the original scores depends on the peak caller. For example, the default score for MACS peaks is -log10(p-value) (peak confidence). You can change it to another measure (tags, or peak height) by the ScoreCol column in the sample sheet.

ADD REPLY • link 7.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hello Rory Stark , how does the different sample replicates enter into the binding matrix one after the other? what determines which goes first into the matrix and which the last?

ADD REPLY • link 15 months ago Kalaivani • 0

0

Entering edit mode

They are entered in the matrix (as columns) in the order they appear in the sample sheet. Peaks are also merged in that order.

ADD REPLY • link 15 months ago Rory Stark ★ 5.2k