Question

Best way to get HiCUP-mapped Hi-C data into diffHiC

0

Entering edit mode

e.jacobson ▴ 10

@ejacobson-13378

Last seen 7.8 years ago

I have mapped and filtered my Hi-C reads using HiCUP (https://www.bioinformatics.babraham.ac.uk/projects/hicup/) and it is too computationally intensive to align the reads again with diffHiC (each library is ~400 million 150bp pairs). I was wondering if someone could recommend the best way to get these reads into diffHiC. HiCUP provides conversion scripts to HOMER, fitHiC, GOTHiC, hicpipe, and hicPro. I have also generated raw interaction matrices with HOMER so could input these as suggested in a previous post (https://support.bioconductor.org/p/90184/).

In section 2.5 of the user guide the 'savepairs' function is described as a good entry point from other pipelines, but I'm unclear of the format the input data needs to be in. Should it be one data frame arranged as:

anchor1.id anchor1.pos anchor2.len anchor2.id anchor2.pos anchor2.len raw_counts

or in two dataframes - one with the anchor ids and counts, and one with the alignment information?

So to reiterate the question, what is the most efficient way to get aligned Hi-C reads (either as mapped fragments or in bins) into diffHiC from HiCUP (or HOMER), and how should the data be formatted?

Thanks for your help.

hicup hi-c diffhic homer • 2.9k views

ADD COMMENT • link 7.8 years ago e.jacobson ▴ 10

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 13 hours ago

The city by the bay

If you just want to do the DI analysis, then putting in raw contact matrices as suggested in the above link is fine. If you want to take advantage of other diffHic functionality, you'll need to use savePairs for each sample. This requires a single data frame where each row corresponds to a read pair (so there is no raw_counts field):

anchorX.id indicates the index of the genomic interval to which read X is aligned. Intervals are generally expected to be restriction fragments for conventional Hi-C. This index should match up to the entries of param$fragments, i.e., param$fragments[anchorX.id] should be the interval to which X is aligned. (For DNase-C, the index should point to the chromosome of the seqlengths of param$fragments to which X is aligned.)
anchorX.pos holds the genomic coordinate of the alignment. This should be the 1-based position on the chromosome corresponding to the first aligned base of X.
anchorX.len holds the length of the alignment relative to the reference. This should be multiplied by -1 if the alignment is on the negative strand.

If the data frame is too large to hold in memory, you can run savePairs on each chromosome and merge it all together with mergePairs. I must admit that I don't do this much, so it's not as well-documented as it could be.

ADD COMMENT • link 7.8 years ago Aaron Lun ★ 28k

score 2 · Accepted Answer · 2017-07-20

I ended up getting advice from someone who uses HiCUP and diffHiC together frequently, and they suggested this:

1) .bam files have to be name sorted:

samtools sort -n myData.bam > myData.sorted.bam (check if reads have /1 and /2 at the end; this will cause problems)

2) HiCUP digest file has to be imported into R:

digest <- read.csv(“Digest_GRCm28_BglII.txt, header=T, sep=“\t“)

3) Fragment file has to be generated:

mm.frag<-with(digest,Granges(Chromosome, IRanges(Start,End)))

4) Generate pair.param file:

mm.param<-pairParam(mm.frag)

5) preparePairs(“myData.sorted.bam”, mm.param, file=”mockII.h5”) will create the h5 file needed for counting data into bins

6) input<-c(“mockI.h5”,”mockII.h5”,”treatedI.h5”,”treatedII.h5”) #generating input file

data<-squareCounts/marginCounts… (input, mm.param, width=bin.size) # counting reads into bins

This worked perfectly for me, and I am now able to get all my HiCUP mapped data into diffHiC.