Entering edit mode
Hi,
I've loaded my matrix this way:
bin.coords <- import("./1000000.bed") bin.coords <- as(bin.coords, "GRanges") raw <- read.table("./1000000.txt") gi <- GInteractions(raw$V1, raw$V2, bin.coords, mode="reverse") iset1 <- InteractionSet(as.matrix(raw$V3), gi) iset1$totals <- colSums(assay(iset))
The I use the average log-CPM corresponding to a count of 5 to filter out pairs of bins prior to differential analysis with edgeR.
ave.ab <- aveLogCPM(asDGEList(iset)) threshold <- aveLogCPM(5, lib.size=mean(iset$totals)) hist(ave.ab, xlab="Average abundance", xlim = c(-5,5), breaks=100) abline(v = threshold) count.keep <- ave.ab >= threshold summary(count.keep) #11% of bins kept #removing low counts as a function of the distance. In Hi-C data, #bins within a very short distance have high interaction due to #they are linearly close, not due to a specific interaction. trended <- filterTrended(iset) #plot it smoothScatter(trended$log.distance, trended$abundances, xlab="Log-Distance", ylab="Normalized abundance") o <- order(trended$log.distance) lines(trended$log.distance[o], trended$threshold[o], col="red", lwd=2) trend.keep <- trended$abundances > trended$threshold summary(trend.keep) #50% of bins kept #combining filters counts+distances final.keep <- trend.keep & count.keep iset_filtered <- iset[final.keep,]
However, with this, I filter out the 90% of bins. Is it a lot? How can I choose a better threshold?
Thank you for your reply. I've modified my post.
On further reflection, 90% is probably fine, as long as you're keeping 100000 bin pairs (or more), which is what I usually get. I tend not to use the trended filter because many of the low-distance interactions might be interesting (e.g., domain boundaries), so I try to hold onto them for downstream differential testing.
Thank you. I wanted to ask another question. I know that the lower the resolution, the best. In very low resolutions, we can have a very sparse matrix and we would filter too much bins. Let's say I have matrices for 20.000, 40.000, 100.000bp. Do you know how to choose the best resolution for differential analysis?
Firstly, you've got your terminology mixed up here; smaller bin sizes provide higher resolution, because their use improves your capacity to distinguish adjacent events in the interaction space. Secondly, it's not clear that there's a single "best" resolution for the DI analysis. For example, a small bin size might be optimal for detecting changes in looping intensity, while a large bin size would provide more power for detecting changes in large structures like TADs. To overcome this, it is possible to perform the analysis at multiple resolutions and then merge the results. This strategy exploits the potential for different discoveries at each resolution - see Section 7.3 of the diffHic user's guide for details.
Generally, though, I just pick a bin size, do the analysis and see if the results are "good enough". This evaluation is a balance between whether I get enough DIs (usually you get more DIs with a larger bin size, because of larger counts) and whether the results are interpretable (which is better with smaller bin sizes for assigning interactions to genes). For in situ Hi-C data, I've found that 50 kbp bins are a good place to start; large enough to get decently sized counts per bin pair, yet small enough so that you get manageable numbers of genes (< 5) assigned to each anchor region. If you, say, try 20 kbp bin pairs and find that you get much fewer DIs, then those counts are probably too low for analysis.
Thank you. Does it make sense to find bin1 vs bin1 (in the diagonal), differentialy expressed?
"Differentially interacting". Anyway, self-interactions need to be carefully interpreted because they are more susceptible to technical differences between libraries that haven't been completely removed (e.g., self-ligation, dangling ends). If so, this requires separate normalization from the other bin pairs - see the user's guide for details. In addition, when you look at self-interactions, you need to figure out whether the increase in intensity is due to an increase in interactions with proximal elements (e.g., locus compaction), which might not be that interesting; or an increase in interactions with distal elements that happen to lie in the same bin. Thus, I tend to filter out bin pairs on the diagonal prior to further analysis, because I'm not sure what to make of them.