I am trying to extract columns based on two conditions, using the indices from two overlap searches. Here is an example:
df1 = data.frame(chr=c("chr1", "chr1"), start=c(20,21), stop=c(28,29), value1=c(1,2))
df2 = data.frame(chr=c("chr1", "chr1", "chr1"), start=c(20,22, 28), stop=c(22,24,34), value2=c(3,4, 60))
df3 = data.frame(chr=c("chr1", "chr1"), start=c(3,1), stop=c(8,4))
df4 = data.frame(chr=c("chr1", "chr1", "chr2"), start=c(10,1, 1), stop=c(12,2, 2))
df1_all = cbind.data.frame(df1, df3)
df2_all = cbind.data.frame(df2, df4)
Which looks like this:
> df1_all
   chr start stop value1  chr start stop
1 chr1    20   28      1 chr1     3    8
2 chr1    21   29      2 chr1     1    4
> df2_all
   chr start stop value2  chr start stop
1 chr1    20   22      3 chr1    10   12
2 chr1    22   24      4 chr1     1    2
3 chr1    28   34     60 chr2     1    2
I would like to get the rows from data frame df1_all, together with the matching column from df2_all called "value2", but only for rows where both df1 overlaps df2 and df3 overlaps df4. In this case it would be:
 chr start stop value1  chr start stop value1 value2
chr1    21   29      2 chr1     1    4      2      4
I am almost there, but I am still getting something wrong in my real data and I cannot find the bug. I have been trying to find a solution for a long time, so I am coming here for help and a fresh set of eyes on this problem. Can you please help?
This is what I have:
library(GenomicRanges)
df1.gr <- makeGRangesFromDataFrame(df1)
df2.gr <- makeGRangesFromDataFrame(df2)
df3.gr <- makeGRangesFromDataFrame(df3)
df4.gr <- makeGRangesFromDataFrame(df4)
# First overlap
hits1 <- findOverlaps(df1.gr, df2.gr, maxgap = 0)
values1 <- rep(FALSE, nrow(df2_all))
values1[unique(subjectHits(hits1))] <- TRUE
OBJ= data.frame(df1_all[unique(queryHits(hits1)),],
matched.df2 = df2_all[unique(queryHits(hits1)),"value2"])
# Second overlap
hits2 <- findOverlaps(df3.gr, df4.gr, maxgap = 0)
values2 <- rep(FALSE, nrow(df2_all))
values2[unique(subjectHits(hits2))] <- TRUE
ov = values1 & values2
OBJ = OBJ[ov,]
Not sure I understand your example. But I think you could get further using intersect(hits1, hits2), which would find the rows where df1 overlaps df2 and df3 overlaps df4.
Hi, Thank you Michael for your reply.
My problem is trying to add information from df1_all and df2_all only from the intersecting IDs (with the condition that both ranges overlap):
OBJ = data.frame(df1_all[unique(subjectHits(intersect(hits1, hits2))),])
But then how to get the columns in df2_all that match? I have tried in so many ways...
Thanks again
This is basically an inner join, but then reducing the data so that no rows in df1 become repeated. How do you want to reduce the data when one row in df1 overlaps more than one row in df2?
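Leaving the reduction question aside, the inner join itself could be sketched roughly as below, using the toy data from the question. This is only a sketch under some assumptions: both Hits objects must have the same dimensions (here 2 query rows and 3 subject rows) for intersect() to pair them up, and the names `both` and `joined` are made up for illustration. No de-duplication is done, so a df1 row that matches several df2 rows would appear once per match.

```r
library(GenomicRanges)

# Toy data copied from the question
df1 <- data.frame(chr = c("chr1", "chr1"), start = c(20, 21), stop = c(28, 29), value1 = c(1, 2))
df2 <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(20, 22, 28), stop = c(22, 24, 34), value2 = c(3, 4, 60))
df3 <- data.frame(chr = c("chr1", "chr1"), start = c(3, 1), stop = c(8, 4))
df4 <- data.frame(chr = c("chr1", "chr1", "chr2"), start = c(10, 1, 1), stop = c(12, 2, 2))
df1_all <- cbind.data.frame(df1, df3)
df2_all <- cbind.data.frame(df2, df4)

df1.gr <- makeGRangesFromDataFrame(df1)
df2.gr <- makeGRangesFromDataFrame(df2)
df3.gr <- makeGRangesFromDataFrame(df3)
df4.gr <- makeGRangesFromDataFrame(df4)

# Row pairs (i, j) where df1[i, ] overlaps df2[j, ]
hits1 <- findOverlaps(df1.gr, df2.gr)
# Row pairs (i, j) where df3[i, ] overlaps df4[j, ]
hits2 <- findOverlaps(df3.gr, df4.gr)

# Keep only the pairs (i, j) present in both sets of hits
both <- intersect(hits1, hits2)

# queryHits() indexes rows of df1_all, subjectHits() indexes rows of df2_all
joined <- cbind(df1_all[queryHits(both), ],
                value2 = df2_all[subjectHits(both), "value2"])
joined
```

Note that the query index goes with df1_all and the subject index with df2_all; using queryHits() or subjectHits() for both data frames would pick up the wrong rows whenever the two indices differ.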