In 4.4 Peak-calling in the interaction space in the diffHiC users guide, it is suggested to use a "modest" minimum enrichment threshold of 0.5. It is also suggested that most parent bins should have low numbers (<5) of nested bin pairs, and that larger proportions require more aggressive filtering. A more aggressive minimum enrichment of 0 is suggested in the 2nd example.
While the user guide has suggested that a lower threshold is more stringent, instinctually, I would expect a higher minimum enrichment to be more stringent than a lower one, so for my own data set I looked at several thresholds above and below 0.5. At a resolution of 20kb, a lower minimum enrichment actually does result in an increase in nested bin pairs, contrary to what is shown in the user guide example. For the most part, increasing the min enrichment seems to lower the number of nested bins pairs, though some do increase depending on the threshold. However at lower resolutions, the number of bin pairs seems to go up or down depending on how far I go in either direction, but even decreasing the threshold to an extreme negative number cannot produce bin numbers of <5 for most parent bins. It's a bit unclear whether I should be increasing or decreasing the min enrichment threshold or what might be a good value to use.
How exactly is this minimum enrichment threshold used in peak calling? Does the sign (+ vs -) affect the results? What is the best way to determine what threshold to use?
Below are some examples of different thresholds, using the code suggested in the users guide. (I've copied it only once below -the only thing I changed for the output was the min.enrich value or the data set since I used two different resolutions).
Thank you, to anyone who might have some insight.
> peak.keep_1e6_stringent <- filterPeaks(original_1e6, enrichments_1e6, min.enrich=0.5, min.count=5, min.diag=2L)
> sum(peak.keep_1e6_stringent)
[1] 27464
> neighborhood_1e6_stringent <- (2*flank.width + 1) * metadata(original_1e6)$width
> boxed_1e6_stringent <- boxPairs(data[peak.keep_1e6_stringent], reference=neighborhood_1e6_stringent)
> out_1e6_stringent <- tabulate(tabulate(boxed_1e6_stringent$indices[[1]]))
> setNames(round(c(out_1e6_stringent[1:5], sum(out_1e6_stringent[5:length(out_1e6_stringent)]))/sum(out_1e6_stringent)*100, 1), c(1:5, ">5"))
1e6 min.enrich = 0.5:
1 |
2 |
3 |
4 |
5 |
>5 |
15.0 |
14.5 |
15.3 |
15.1 |
13.3 |
40.2 |
1e6 min.enrich= 0.8 (this seems to be the most reasonable threshold, with most parent values <5):
1 |
2 |
3 |
4 |
5 |
>5 |
74.8 |
19.3 |
4.6 |
1.0 |
0.3 |
0.4 |
1e6 min.enrich = 0 (numbers change marginally when the enrichment score is increasingly negative, all the way down to -9):
1 |
2 |
3 |
4 |
5 |
>5 |
8.5 |
6.1 |
5.5 |
5.0 |
4.7 |
74.9 |
20kb min.enrich = 0.5 (only 3 parents are <5, but perhaps that's enough?):
1 |
2 |
3 |
4 |
5 |
>5 |
69.8 |
19.4 |
6.9 |
2.5 |
0.7 |
1.4 |
20kb min.enrich = 0:
1 |
2 |
3 |
4 |
5 |
>5 |
59.7 |
21.2 |
9.0 |
4.5 |
2.4 |
5.5 |
20kb min.enrich = 1 (increasing this:
1 |
2 |
3 |
4 |
5 |
>5 |
92.0 |
6.4 |
1.3 |
0.2 |
0.1 |
0.1 |
20kb min.enrich = -1:
1 |
2 |
3 |
4 |
5 |
>5 |
38.3 |
21.8 1 |
4.0 |
9.2 |
5.7 |
16.9 |
For good measure, the summary of enrichment scores of the data at both resolutions:
> summary(enrichments_1e6)
Min. |
1st Qu. |
Median |
Mean |
3rd Qu. |
Max. |
-9.2950 |
-0.4271 |
-0.1419 |
-0.1761 |
0.1251 |
7.1710 |
> summary(enrichments_20kb)
Min. |
1st Qu. |
Median |
Mean |
3rd Qu. |
Max. |
-6.1190 |
0.2857 |
0.3198 |
0.2440 |
0.3212 |
7.1710 |
Thank you for the response. The wording there is a bit confusing; I assumed the "less than 5" in "Most of these parents should contain low numbers (i.e., less than 5) of nested bin pairs," referred to the absolute numbers appearing for each category (1,2,3,4,5 and >5). If I understand you correctly here, you mean that the proportion between the ">5" category and any of the other categories should not exceed 5? In that case, based on my output above, can I simply use the suggested threshold of 0.5? Or would a larger number in category 1 also indicate the need for changing the threshold? Should I be aiming for a 1:1:1:1:1 ratio between these categories? Or just anything that doesn't exceed a ratio of 5:1 for any given comparison? Thank you
No. In that statement, I meant that most bin pairs should be in the categories 1 to 5 (well, for "less than or equal to 5"). The category number is the number of bin pairs contained within a parent; thus, only a small fraction of parents should be in the ">5" category. A large number in the ">5" category suggests you need to increase the
min.enrich
threshold.Technically I should have said "fewer" rather than "less". I guess I should fix that too.
I see. Could a "large number" simply be considered anything greater than the sum of categories 1-5, or is there some other cut-off I should use? Is it better to make the threshold so that >5 category is close to that number? (will I lose anything using that arbitrary 0.5 threshold on my 20kb data, considering how much smaller the >5 category is?)
If the >5 proportion is less than 5%, I'd be fairly satisfied. Don't sweat the details at this point; the idea is just to cut down the number of bin pairs for downstream differential testing. We're treating peak detection as a filtering step here, so we don't have to be particularly rigorous. Of course, if the aim of your analysis is to formally detect peaks, then more care is required. We're working on it, as I alluded to in my original answer, but it's not a high priority at the moment. diffHic is designed to detect differences, after all, rather than absolute features.
Thank you, I appreciate the advice.