Question

How to use GenomicRanges to find overlaps by percentage covered of interval

4

Entering edit mode

khauser130d ▴ 50

@khauser130d-8880

Last seen 9.3 years ago

United Kingdom

I have two dataframes one with two (or more) individuals (denoted by sampleID) denoted by TEST and the other consisting of just 3 columns called REF.

I want to add REF onto TEST based upon the overlaps between the "chr","start","end" columns of REFand the corresponding (in order) columns of TEST: "Chromosome","Start","End".

I want the definition of overlap to be that in order for a REF interval to overlap a TESTinterval, it must cover >50% of TEST interval.

I've been playing around with GenomicRanges. However, the findOverlap function only allows me to specify the minOverlap as the number of overlapping positions rather than a fraction of the interval covered. So, using GenomicRanges, how can I do this?

Some example input:

REF = structure(list(chr = c("1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), 
    start = c(766101L, 1627918L, 4421060L, 7324468L, 8053732L, 
    8182462L, 8182584L, 8182584L, 8206130L, 8804237L, 10369546L, 
    10370541L, 10543836L, 10656324L, 12354307L, 12841928L, 12845863L, 
    12909237L, 12909965L, 13444908L), end = c(809773L, 1672603L, 
    4424115L, 7325408L, 8067990L, 8189854L, 8189285L, 8189285L, 
    8209321L, 8812660L, 10377983L, 10377983L, 10545046L, 10657912L, 
    12357076L, 12971833L, 12883096L, 12927107L, 12918079L, 13468022L
    ), Deft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1)), .Names = c("chr", "start", "end", "Deft"
), row.names = c(217L, 568L, 1340L, 1691L, 1804L, 1811L, 1812L, 
1813L, 1819L, 1880L, 2017L, 2020L, 2041L, 2049L, 2224L, 2282L, 
2284L, 2332L, 2335L, 2424L), class = "data.frame")

TEST = structure(list(sampleID = c("SID1331", "SID1331", "SID1331", 
"SID1331", "SID1331", "SID1331", "SID1331", "SID1331", "SID1331", 
"SID1331", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337", 
"SID1337", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337"
), Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), Start = c(61735L, 
7541291L, 32664866L, 45935056L, 55449273L, 117769301L, 117892183L, 
167580307L, 172458743L, 172852403L, 129495492L, 129526092L, 198572L, 
112837740L, 112847206L, 132105712L, 132126171L, 150442L, 23261653L, 
23270278L, 28135989L), End = c(7539746L, 32664843L, 45934562L, 
55445562L, 117767918L, 117886211L, 167580003L, 172457651L, 172851634L, 
214938359L, 129525791L, 135506704L, 112837593L, 112846803L, 132104437L, 
132122974L, 134944770L, 23261612L, 23268885L, 28131521L, 52920414L
)), .Names = c("sampleID", "Chromosome", "Start", "End"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1000L, 1001L, 1002L, 1003L, 
1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L), class = "data.frame")

genomicranges • 12k views

ADD COMMENT • link updated 9.5 years ago by Michael Lawrence ★ 11k • written 9.5 years ago by khauser130d ▴ 50

score 11 · Accepted Answer · 2015-09-26

11

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

Make your GRanges:

refGR <- makeGRangesFromDataFrame(REF)
testGR <- makeGRangesFromDataFrame(TEST)

Find overlaps:

hits <- findOverlaps(refGR, testGR)

Compute percent overlap and filter the hits:

overlaps <- pintersect(refGR[queryHits(hits)], testGR[subjectHits(hits)])
percentOverlap <- width(overlaps) / width(testGR[subjectHits(hits)])
hits <- hits[percentOverlap > 0.5]

Note that none of the sample data passes the filter. It is not clear what you mean by adding REF onto TEST. One example is merging the "Deft" column from REF into TEST:

TEST$Deft <- NA
TEST$Deft[subjectHits(hits)] <- REF$Deft[queryHits(hits)]

ADD COMMENT • link 9.5 years ago Michael Lawrence ★ 11k

1

Entering edit mode

Would it be possible to post pseudocode? I'm having trouble knowing how to aligh/join the ranges in a way that allows me to use pintersect afterwards

ADD REPLY • link 9.5 years ago khauser130d ▴ 50

1

Entering edit mode

I edited the answer to give more detail.

ADD REPLY • link 9.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Perfect!, thank you!

ADD REPLY • link 6.2 years ago eegonzalezk • 0