Hi,
I have a set of bins in a GrangesList, and a set of regions I would like to mask. My goal is to get a list of vectors, where each vector indicates the fractions of the bins not overlapped with the masked regions, so I can use these vectors for further scaling other things.
I've implemented a small function to achieve my goal. Although it did work on small testing data, it won't work in real and much larger data. Below is the testing data and the function I wrote to illustrate my goal.
The set of bins in a GrangesList, bin size is set as 200bp here.
grp_bin1 <-
GenomicRanges::GRanges(
seqnames = S4Vectors::Rle(c("1"), c(4)),
ranges = IRanges::IRanges(start = seq(from = 1, by = 200, length.out = 4),
end = seq(from = 200, by = 200, length.out = 4)),
strand = S4Vectors::Rle(BiocGenerics::strand(c("+")), c(4)))
grp_bin2 <-
GenomicRanges::GRanges(
seqnames = S4Vectors::Rle(c("2"), c(4)),
ranges = IRanges::IRanges(start = seq(from = 1, by = 200, length.out = 4),
end = seq(from = 200, by = 200, length.out = 4)),
strand = S4Vectors::Rle(BiocGenerics::strand(c("-")), c(4)))
grp_bins <-
GenomicRanges::GRangesList("bin1" = grp_bin1, "bin2" = grp_bin2)
> grp_bins
GRangesList object of length 2:
$bin1
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1-200 +
[2] 1 201-400 +
[3] 1 401-600 +
[4] 1 601-800 +
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
$bin2
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 2 1-200 -
[2] 2 201-400 -
[3] 2 401-600 -
[4] 2 601-800 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
The set of regions I would like to mask
add_mask1 <-
GenomicRanges::GRanges(
seqnames = S4Vectors::Rle(c("1", "2"), c(2, 1)),
ranges = IRanges::IRanges(start = c(51, 251, 151),
end = c(200, 400, 250)),
strand = S4Vectors::Rle(BiocGenerics::strand(c("*")), c(3)))
> add_mask1
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 51-200 *
[2] 1 251-400 *
[3] 2 151-250 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Then function I wrote and the expected output
scale_additional_masks <- function(bins, add_mask, bin_size) {
# Find overlap
ovr <- IRanges::findOverlapPairs(add_mask, bins)
# Compute the percentages of the bins overlap with the masks
ovr_intersect <-
IRanges::pintersect(ovr, drop.nohit.ranges = FALSE)
ovr_val <- GenomicRanges::width(ovr_intersect) / bin_size
# Convert the overlap fractions into vectors
ovr_val_ls <- base::split(ovr_val, names(ovr_val))
add_scale <- lapply(ovr_val_ls, function(x) {
1 - colSums(as.matrix(x))
})
return(add_scale)
}
> scale_additional_masks(bins = grp_bins, add_mask = add_mask1, bin_size = 200)
$bin1
[1] 0.25 0.25 1.00 1.00
$bin2
[1] 0.75 0.75 1.00 1.00
Everthing works fine until I move on to my real data, which has ~18K groups of bins, and ~7M non-overlapped regions need to be masked.
Browse[2]> bins
GRangesList object of length 18101:
$`10000_-`
GRanges object with 250 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 11 644233-644482 -
[2] 11 644483-644732 -
[3] 11 644733-644982 -
[4] 11 644983-645232 -
[5] 11 645233-645482 -
... ... ... ...
[246] 11 705483-705732 -
[247] 11 705733-705982 -
[248] 11 705983-706232 -
[249] 11 706233-706482 -
[250] 11 706483-706732 -
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
...
<18100 more elements>
Browse[2]> add_mask
GRanges object with 7238684 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1-382433 *
[2] 1 382512-382513 *
[3] 1 382529-382530 *
[4] 1 382631-611597 *
[5] 1 611655-611656 *
... ... ... ...
[7238680] X 156010147-156010167 *
[7238681] X 156010554 *
[7238682] X 156011799-156011812 *
[7238683] X 156012646-156012649 *
[7238684] X 156013711-156040895 *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
When I run and debug the function, it pops an error at this line.
ovr <- IRanges::findOverlapPairs(add_mask, bins)
Error in METHOD(x, i) :
Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be
represented as a CompressedList object. Please try to coerce 'x' to a SimpleList object first (with 'as(x,
"SimpleList")').
I cannot convert my mask regions to a SimpleList as suggested by the error
> as(add_mask1,
+ "SimpleList")
Error in getListElement(x, i, ...) :
GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
I also tried to truncated the mask regions, and it seems working ok when the number of rows is around 3e6, but no more than 4e6. When it runs at this scale, it takes more than 100 GB RAM so I have to run it on a server rather than my local computer.
I'm wondering if there is any solutions could help me when data come to this scale? We have high memory node on server (1.5TB) so it might be ok to use high memory for a short time, but I guess it's also my implementation too naive so it make things so costly.
Any suggestions are highly appreciated!
Best,
Yixin
Thanks a lot Michael! Your solution is fantastic and much much faster and memory efficient than mine! I play with the code you provide and I'm very close to get a solution for my real data. The only issue I have for now is:
I notice the vector output by your method is based on chromosomes, I was trying to split it into my original groups but haven't figured out a proper way to do it. What I mean is as followed, in addition to the two groups of bins I have at the beginning, now I have two more on chr1 to make it more like real data.
My naive function will produce list of vector like,
If I understand correctly, to ensure your method works, I need to make sure the seqlengths are the same for my bins and masks, and cover my Granges entirely (in reality I should download the genome version matched seqinfo and apply it to both the bins and masks)? Here I just set them manually.
And then,
Notice that the bin1, bin3 and bin4 are now all in
1
. Do you have any suggestions about how I should do the split so the vectots can be converted back to the form as my naive function?Thanks again for your prompt reply!
It would generally be easiest to flatten the list of window groups and instead use a factor to represent the grouping, as done in the initial answer via
stack()
. If you ensure that the data are sorted in chromosome order viasort()
, you can simplyunlist()
andrelist()
the result via the grouping factor, like:Your solution is amazing!
unlist()
thensplit()
really do the trick. I didn't knowsplit()
can relist the results perfectly via the grouping factor, and get stuck in trying to userelist()
.Highly appreciate your help and indeed learn a lot about how to deal with GRanges objects!