Question

diffHic: minimum counts per bin

0

Entering edit mode

andreucat91 • 0

@andreucat91-8929

Last seen 8.4 years ago

Spain

Hi,

I've loaded my matrix this way:

bin.coords <- import("./1000000.bed")
bin.coords <- as(bin.coords, "GRanges")
raw <- read.table("./1000000.txt")
gi <- GInteractions(raw$V1, raw$V2, bin.coords, mode="reverse")
iset1 <- InteractionSet(as.matrix(raw$V3), gi)
iset1$totals <- colSums(assay(iset))

The I use the average log-CPM corresponding to a count of 5 to filter out pairs of bins prior to differential analysis with edgeR.

ave.ab <- aveLogCPM(asDGEList(iset))
threshold <- aveLogCPM(5, lib.size=mean(iset$totals))
hist(ave.ab, xlab="Average abundance", xlim = c(-5,5), breaks=100)
abline(v = threshold)

count.keep <- ave.ab >= threshold
summary(count.keep) #11% of bins kept

#removing low counts as a function of the distance. In Hi-C data,
#bins within a very short distance have high interaction due to
#they are linearly close, not due to a specific interaction.

trended <- filterTrended(iset)

#plot it
smoothScatter(trended$log.distance, trended$abundances, xlab="Log-Distance", ylab="Normalized abundance")
o <- order(trended$log.distance)
lines(trended$log.distance[o], trended$threshold[o], col="red", lwd=2)

trend.keep <- trended$abundances > trended$threshold
summary(trend.keep) #50% of bins kept

#combining filters counts+distances
final.keep <- trend.keep & count.keep
iset_filtered <- iset[final.keep,]

However, with this, I filter out the 90% of bins. Is it a lot? How can I choose a better threshold?

diffhic edger • 1.6k views

ADD COMMENT • link 8.4 years ago andreucat91 • 0

score 1 · Answer 1 · 2016-12-12

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 16 hours ago

The city by the bay

I assume you mean "filter out bin pairs", not bins.
I don't recall using a "log cpm 5" filter. Do you mean an average log-CPM corresponding to a count of 5?
If you did any pre-filtering in squareCounts, then losing a further 90% of bin pairs is quite a lot. It's hard to tell whether it's caused by your data or your code, though, because you don't show what you did.
There are many filtering strategies described in the user's guide, which can give you a suitable threshold.

ADD COMMENT • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for your reply. I've modified my post.

ADD REPLY • link 8.4 years ago andreucat91 • 0

0

Entering edit mode

On further reflection, 90% is probably fine, as long as you're keeping 100000 bin pairs (or more), which is what I usually get. I tend not to use the trended filter because many of the low-distance interactions might be interesting (e.g., domain boundaries), so I try to hold onto them for downstream differential testing.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you. I wanted to ask another question. I know that the lower the resolution, the best. In very low resolutions, we can have a very sparse matrix and we would filter too much bins. Let's say I have matrices for 20.000, 40.000, 100.000bp. Do you know how to choose the best resolution for differential analysis?

ADD REPLY • link 8.4 years ago andreucat91 • 0

1

Entering edit mode

Firstly, you've got your terminology mixed up here; smaller bin sizes provide higher resolution, because their use improves your capacity to distinguish adjacent events in the interaction space. Secondly, it's not clear that there's a single "best" resolution for the DI analysis. For example, a small bin size might be optimal for detecting changes in looping intensity, while a large bin size would provide more power for detecting changes in large structures like TADs. To overcome this, it is possible to perform the analysis at multiple resolutions and then merge the results. This strategy exploits the potential for different discoveries at each resolution - see Section 7.3 of the diffHic user's guide for details.

Generally, though, I just pick a bin size, do the analysis and see if the results are "good enough". This evaluation is a balance between whether I get enough DIs (usually you get more DIs with a larger bin size, because of larger counts) and whether the results are interpretable (which is better with smaller bin sizes for assigning interactions to genes). For in situ Hi-C data, I've found that 50 kbp bins are a good place to start; large enough to get decently sized counts per bin pair, yet small enough so that you get manageable numbers of genes (< 5) assigned to each anchor region. If you, say, try 20 kbp bin pairs and find that you get much fewer DIs, then those counts are probably too low for analysis.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you. Does it make sense to find bin1 vs bin1 (in the diagonal), differentialy expressed?

ADD REPLY • link 8.4 years ago andreucat91 • 0

0

Entering edit mode

"Differentially interacting". Anyway, self-interactions need to be carefully interpreted because they are more susceptible to technical differences between libraries that haven't been completely removed (e.g., self-ligation, dangling ends). If so, this requires separate normalization from the other bin pairs - see the user's guide for details. In addition, when you look at self-interactions, you need to figure out whether the increase in intensity is due to an increase in interactions with proximal elements (e.g., locus compaction), which might not be that interesting; or an increase in interactions with distal elements that happen to lie in the same bin. Thus, I tend to filter out bin pairs on the diagonal prior to further analysis, because I'm not sure what to make of them.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k