Hi,
I have a list of GRanges objects that needs a very specific kind of duplicate removal, and the removal condition differs for each GRanges in the list. I have reasons for using these conditional rules on my data. For the first list element, I want complete duplicate removal. For the second list element, I need to find the rows that appear more than twice (freq > 2) and keep only one copy of each. For the third list element, I need to find the rows that appear more than three times (freq > 3) and keep two or three copies of each. I am looking for a programmatic, dynamic solution to this data manipulation task. Is there an easy and efficient way to do this and get my specific output?
Edit (thanks to @Martin for editing my reproducible data).
Mini example:
grl <- GRangesList(
    bar = GRanges(
        seqnames = Rle("chr1", 14),
        IRanges(
            c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
            c(14,21,39,61,73,87,153,14,39,73,153,14,87,153)),
        score = c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
    cat = GRanges(
        seqnames = Rle("chr10", 16),
        IRanges(
            c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
            c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124)),
        score = c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
    foo = GRanges(
        seqnames = Rle("chr11", 16),
        IRanges(
            c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
            c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139)),
        score = c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)
Note that in cat, I look for the rows that appear three times and keep each of them only once; if a row appears twice, I do not remove its duplicate. In foo, I look for the rows that appear more than three times and keep two or three copies of each. That is the element-specific duplicate removal I am trying to apply to each GRanges. How can I get this output?
This is my desired output:
grl_expected <- GRangesList(
    bar = GRanges(
        seqnames = Rle("chr1", 7),
        IRanges(
            c(9,19,34,54,70,82,136),
            c(14,21,39,61,73,87,153)),
        score = c(48,6,9,8,4,15,38)),
    cat = GRanges(
        seqnames = Rle("chr10", 12),
        IRanges(
            c(7,21,72,142,7,16,45,100,114,142,16,114),
            c(10,34,78,147,10,17,51,103,124,147,17,124)),
        score = c(53,14,20,4,53,20,11,7,32,4,20,32)),
    foo = GRanges(
        seqnames = Rle("chr11", 11),
        IRanges(
            c(12,12,12,44,58,58,58,118,102,118,118),
            c(36,36,36,49,92,92,92,139,109,139,139)),
        score = c(48,48,48,17,12,12,12,5,11,5,5))
)
Can anyone point me to a way to make this happen? Any ideas?
Best regards,
Jurat
I edited your question to avoid the unnecessary step of constructing a list of GRanges, and removed information about 'strand', which doesn't seem relevant to your question. Please edit your question further so that the seqnames in the result are consistent with the seqnames in the original. Please also remove any elements that do not provide additional insight into your problem. For instance, are all 14 ranges in the first element of your input necessary to illustrate the operation you are trying to perform, or would perhaps just 4 elements be enough?
Dear Martin,
Thanks for editing my question. I am sure about the structure of my data: it is closely simulated from my original problem, and we have reasons for conditionally keeping or removing repeated rows in this particular way. Could you suggest a possible way to do this? Thanks a lot.
Best regards,
Jurat
Also, in this example, it seems the 'starts' are enough to determine duplication (i.e. whenever the 'start' is the same, so are the end, the seqname, and the score). Is that necessarily true? If so, I imagine an efficient way would be, for each set of ranges, to count the number of previous instances of each start with
counted <- ave(start(bar), start(bar), FUN=seq_along)
and then you could subset your ranges with bar[counted<=nbar]. You could run this in a loop (or lapply-like invocation), where you'd regenerate the object being counted (and the relevant n). If equal 'starts' aren't enough to determine overall uniqueness, then you could 'paste' together enough information to determine whether rows are duplicated, and do something similar.
@gavinpaulkelly I think it is better to evaluate each whole row to determine the duplication pattern, and that can be done.
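The counting idea above can be sketched as follows. This is only an illustration under stated assumptions: dedup_mask() and dedup_granges() are hypothetical helper names, not part of GenomicRanges, and the rule is read as "rows occurring more than `threshold` times are cut down to their first `keep` copies, and every other row is left alone" (threshold/keep of 1/1 for bar, 2/1 for cat, and 3/3 for foo, taking "two or three" as three). Unlike a plain counted <= n subset, this leaves rows that appear only twice in cat untouched. The demonstration runs on the start vectors alone so that it needs only base R; dedup_granges() assumes GenomicRanges is attached and builds the key from the whole row.

```r
# Hypothetical helper: returns a logical mask over `key`.
# Rows whose key occurs more than `threshold` times keep only their first
# `keep` occurrences; rows at or below the threshold are all retained.
dedup_mask <- function(key, threshold, keep) {
  key <- as.character(key)
  occurrence <- ave(seq_along(key), key, FUN = seq_along)  # 1, 2, 3, ... within each key
  frequency  <- ave(seq_along(key), key, FUN = length)     # total count of each key
  frequency <= threshold | occurrence <= keep
}

# Hypothetical wrapper for a single GRanges; requires library(GenomicRanges).
# The key pastes together seqname, start, end and score, so that a "row"
# means the full record rather than just its start.
dedup_granges <- function(gr, threshold, keep) {
  key <- paste(as.character(seqnames(gr)), start(gr), end(gr), gr$score, sep = ":")
  gr[dedup_mask(key, threshold, keep)]
}

# Base-R demonstration on the start positions from the example data
bar_start <- c(9,19,34,54,70,82,136,9,34,70,136,9,82,136)
cat_start <- c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114)
foo_start <- c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118)

bar_kept <- bar_start[dedup_mask(bar_start, threshold = 1, keep = 1)]
cat_kept <- cat_start[dedup_mask(cat_start, threshold = 2, keep = 1)]
foo_kept <- foo_start[dedup_mask(foo_start, threshold = 3, keep = 3)]
```

With GenomicRanges attached, something along the lines of GRangesList(Map(dedup_granges, as.list(grl), c(1, 2, 3), c(1, 1, 3))) should apply one rule per list element. Rows are kept in their input order, so the result for foo contains the same starts as the desired output but not in the same order.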