Hi,
I am following the diffHic tutorial using my datasets, consisting in two biological replicates for one condition and four biological replicates for another condition. I have three questions which I am writing below:
1st question:
I am in the step 3.4 of the tutorial ("Counting into single bins"), by which I can asses if there are systematic differences in coverage between libraries for a given bin. When i compared each pair of datasets, I could see that there were some bins, that were always overrepresented in the datasets from one condition versus the datasets in the other condition, indicating, probably, copy number variation. So, I was interested to know, which region were represented by those bins. And I do not know how to do it.
this was my pipeline:
> genome <- "./genome.fasta" > test.frag <- cutGenome(genome, "AAGCTT",4) > test.param <- pairParam(test.frag) > input <- c("wt1.h5", "wt2.h5", "ko1.h5", "ko2.h5", "ko3.h5", "ko4.h5") > bin.size <- 1e5 > data <- squareCounts(input, test.param, width=bin.size, filter=1) > data DIList object for 6 libraries with 105542 pairs across 469 regions > margin.data <- marginCounts(input, test.param, width=bin.size) > margin.data DIList object for 6 libraries with 467 pairs across 469 regions adjc <-cpm(asDGEList(margin.data), log=TRUE, prior.count=5)
adjc contains the log2 value of mapped reads per million for each bin, but only for 467 of the 469, so it seems that two were not considered, I would guess that because they do not have enough counts. The thing is that I do not know which bins were not considered and, because of the shift, I do not know if regions(data)[N] corresponds to adjc[N,] or tho adjc[N-1,] or to adjc[N-2,]...How could I know that? I would also like now which are the bins that were not considered.
2nd question:
"counts(data)" contains the information of mapped read for each pair of bins in each dataset and looks like this:
> head(counts(data)) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 59 3 69 50 57 70 [2,] 73 5 54 66 62 171 [3,] 307 46 281 258 223 439 [4,] 80 6 45 53 40 112 [5,] 106 7 104 116 123 306 [6,] 273 35 212 212 195 334
so, as i understood, each column represents each dataset, and each row is each pair of bins, so 59 is the number of counts between "anchors(data)[1]" and "targets(data)[1]" in the dataset "input[1]" , and 54 is the number of counts between the bins "anchors(data)[2]" and "targets(data)[2]" in the dataset "input[3]". but, Is there an easy way to reconstruct for each dataset the interaction matrix in which the name of the rows and the columns is each bin region and each value inside the matrix is the number of interaction between each pair of bins?
Third question:
Given that my genome was divided in 469 bins, i am expecting (469^2 + 469)/2 interaction pairs, which are 110215 pairs. But my "data" object has only 105542 pairs, so there are almost 5000 interactions missing, which I supposed were filtered by "filter=1", but when I run squareCounts again with "filter=0" I obtain the same number of interactions (105542). Where are the missing interaction pairs? or am i thinking it or doing something wrong?
I hope you can help me!
Cheers,
Raúl