I am a graduate student in Statistics at UC Riverside, and I have mapped out a plan to execute a new analysis method on HiC datasets. However, there is currently no software to implement this new approach, so I will need to write much of it on my own. However, I want to avoid reinventing the wheel wherever possible, so I’d like to ask for your input on how I can start with pre-existing S4 classes.
In order, I would like to do the following:
Read in raw HiC data into R (likely an InteractionSet or hic.table) (I CAN ALREADY DO THIS)
Identify the initial binning size, K, so that I might find a number between the largest observed genomic coordinate and the chromosome size that is proportional to a power of 2:
(I HAVE ALREADY DONE THIS)
LargestObservedGenomicCoord <= K * 2^J <= ChromSize
Bin the Hi-C data into bins of size K Currently, binning seems limited to Kilobase resolutions. Can I be more precise than that?
If not, I can find a way to proceed with the limitation that the initial bin size be at least 1kb, but if I can have bins of arbitrary size K (i.e. 13.407 kB), that would be simpler for my workflow.
Map the 2D upper-triangular matrix coordinates (x,y) to a 1-D "pseudo space-filling curve" coordinate (d) which is also defined on upper-triangular spaces (I HAVE ALREADY DONE THIS with Rcpp)
One issue i may encounter is a limitation on the size of the
d
index values.Seeing as the largest human chromosome is approximately 250 million basepairs in length, the index values (which will be used in sorting and defining “neighboring observations”) can be as large as ~2^55, necessitating 64-bit numbers.
I have already written a function to create these 64bit numbers (using Rcpp) but including them in an InteractionSet object and using InteractionSet methods on these 64-bit number may be a challenge.
Sort the Hi-C object (eg. InteractionSet, hic.table, or other) according to this new 1-dimensional coordinate (d)
I see there are fast sorting/ordering methods for InteractionSets.
However, I will be dealing with 64 bit numbers. I’m not sure how to reorder an Rle/S4Vector object based on a secondary vector of metadata indices.
Perform a 1D wavelet-transform/multiresolution-analysis on the sorted data (NEW OPERATION)
This would involve downsampling (performing successive sums and differences of neighboring pairs in the vectors, and repeating on the resulting sum. For example:
# Input data
A0 <- c(x0, x1, x2, x3, x4, x5, x6, x7)
# Perform first round of differencing , summing, and downsampling
# call this c( d1.1, d1.2, d1.3, d1.4 )
D1 <- c(x0 - x1, x2 - x3, x4 - x5, x6 - x7)
# call this c( a1.1, a1.2, a1.3, a1.4 )
A1 <- c(x0 + x1, x2 + x3, x4 + x5, x6 + x7)
# D2 and A2 are calculated the same as above, but with A1 as the input
# = c( a1.1 - a1.2, a1.3 - a1.4 ) = c( d2.1, d1.2 )
D2 <- c(x0 + x1 - (x2 + x3), x4 + x5 - (x6 + x7))
# = c( a1.1 + a1.2, a1.3 + a1.4 ) = c( a2.1, a1.2 )
A2 <- c(x0 + x1 + (x2 + x3), x4 + x5 + (x6 + x7))
# Typically the result is stored in a list of lists (see this for an example of a 2D wavelet object)
# https://pywavelets.readthedocs.io/en/latest/ref/dwt-coefficient-handling.html
#
# However, I might use a list of InteractionSets??
#
# The original signal, A0, can be recovered without loss from this object.
#
# This could be a new class object, such as a “HicWavelet” object??
my.wavelet.object <- list( D1, D2, D3, …., DJ, AJ)
Thankfully, the size of each successive, downsampled S4Vector is fairly easy to determine to looking at how many none-zero successive runs exist in the current vector. So, memory allocation can be done to avoid major slowdowns that would otherwise be brought about by appending values to a vector of unknown length.
If you’ve made it this far, I would like to thank you! I do not have the requisite knowledge about how to make changes to S4 classes, and any input/guidance you might offer would be extremely helpful. I’ve ready Hadley Wickham’s Advanced R material on S4 classes, but I find it too superficial to get me to a place where I am extending S4 class objects, or modifying a current restrictions on slots.
Please, if you are familiar with any resources I might make use of to train myself up to the point where I can get myself up and running with S4 classes on my own, I would be endlessly grateful. SPECIFICALLY:
How to extend/modify InteractionSet objects to accept 64 bit / double numbers in their slots
How to modify the sorting methods of InteractionSets to work with 64 bit numbers
How to write new functions for use on InteractionSets (such as differencing/summing neighbors and performing the reverse transformation)
How to create a new class of object which will resemble the usual list-of-lists structure of a typical wavelet transform, but instead uses an InteractionSet type object for each entry (i.e. a list-of-InteractionSets).
How to write custom print methods for objects of this type.