I have two dataframes one with two (or more) individuals (denoted by sampleID
) denoted by TEST
and the other consisting of just 3 columns called REF
.
I want to add REF
onto TEST
based upon the overlaps between the "chr","start","end"
columns of REF
and the corresponding (in order) columns of TEST
: "Chromosome","Start","End"
.
I want the definition of overlap to be that in order for a REF
interval to overlap a TEST
interval, it must cover >50% of TEST
interval.
I've been playing around with GenomicRanges. However, the findOverlap
function only allows me to specify the minOverlap
as the number of overlapping positions rather than a fraction of the interval covered. So, using GenomicRanges, how can I do this?
Some example input:
REF = structure(list(chr = c("1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
start = c(766101L, 1627918L, 4421060L, 7324468L, 8053732L,
8182462L, 8182584L, 8182584L, 8206130L, 8804237L, 10369546L,
10370541L, 10543836L, 10656324L, 12354307L, 12841928L, 12845863L,
12909237L, 12909965L, 13444908L), end = c(809773L, 1672603L,
4424115L, 7325408L, 8067990L, 8189854L, 8189285L, 8189285L,
8209321L, 8812660L, 10377983L, 10377983L, 10545046L, 10657912L,
12357076L, 12971833L, 12883096L, 12927107L, 12918079L, 13468022L
), Deft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1)), .Names = c("chr", "start", "end", "Deft"
), row.names = c(217L, 568L, 1340L, 1691L, 1804L, 1811L, 1812L,
1813L, 1819L, 1880L, 2017L, 2020L, 2041L, 2049L, 2224L, 2282L,
2284L, 2332L, 2335L, 2424L), class = "data.frame")
TEST = structure(list(sampleID = c("SID1331", "SID1331", "SID1331",
"SID1331", "SID1331", "SID1331", "SID1331", "SID1331", "SID1331",
"SID1331", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337",
"SID1337", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337"
), Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L,
10L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), Start = c(61735L,
7541291L, 32664866L, 45935056L, 55449273L, 117769301L, 117892183L,
167580307L, 172458743L, 172852403L, 129495492L, 129526092L, 198572L,
112837740L, 112847206L, 132105712L, 132126171L, 150442L, 23261653L,
23270278L, 28135989L), End = c(7539746L, 32664843L, 45934562L,
55445562L, 117767918L, 117886211L, 167580003L, 172457651L, 172851634L,
214938359L, 129525791L, 135506704L, 112837593L, 112846803L, 132104437L,
132122974L, 134944770L, 23261612L, 23268885L, 28131521L, 52920414L
)), .Names = c("sampleID", "Chromosome", "Start", "End"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1000L, 1001L, 1002L, 1003L,
1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L), class = "data.frame")
Would it be possible to post pseudocode? I'm having trouble knowing how to aligh/join the ranges in a way that allows me to use pintersect afterwards
I edited the answer to give more detail.
Perfect!, thank you!