Findoverlaps insame sample
2
0
Entering edit mode
julrodr80 ▴ 30
@julrodr80-11991
Last seen 7.1 years ago

I have two GR objects with 1000 samples. Each sample, several cnvs (genomic ranges). I would like to find overlaps between the genomic ranges of the two objets, but also with the condition that the overlap region is in the same individual. 

seqnames   ranges strand | Sample
  <Rle> <IRanges> <character>
154 chr15 [ 22755185,  23228712] 00_017
154 chr15 [ 22876889,  23086929] 00_017
215 chr12 [  8003758,   8126839] 01_006
527 chr10 [135258147, 135379710] 01_025
527 chr10 [135258147, 135267445] 01_025

 thank you very much 

genomicranges granges findoverlaps • 1.2k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 9 hours ago
Seattle, WA, United States

Hi,

The easiest solution is to just call findOverlaps() without worrying about the extra condition, and then to subset the returned Hits object to keep only the hits that satisfy the extra condition. For example, let's say the sample ID is stored in the metadata column sampleID on the 2 GRanges objects query and subject, then you could do something like:

hits <- findOverlaps(query, subject, ...)
hit_is_within_same_sample <-
    mcols(query)$sampleID[queryHits(hits)] ==
    mcols(subject)$sampleID[subjectHits(hits)]
hits <- hits[hit_is_within_same_sample]

A legitimate concern with this approach is that the call to findOverlaps() might be too costly because it returns many hits that will be removed. Depending on the input objects, these "to-be-removed" hits could make the Hits object thousands or millions times bigger than necessary and could consume too much memory. Here are 2 ways to avoid this:

1) Split the 2 GRanges objects by sample ID, and loop over all the sample IDs with mapply()

all_sampleIDs <- union(mcols(query)$sampleID, mcols(subject)$sampleID)
## We split 'query' and 'subject' with factors that have the same levels
## so we get 2 GRangesList objects that are *parallel* to each other (and
## to 'all_sampleIDs'):
query_by_sampleID <- split(query, factor(mcols(query)$sampleID,
                                         levels=all_sampleIDs)
subject_by_sampleID <- split(subject, factor(mcols(subject)$sampleID,
                                             levels=all_sample_IDs)
all_hits <- mapply(findOverlaps, query_by_sampleID, subject_by_sampleID)

all_hits is a list of Hits objects that is parallel to query_by_sampleIDsubject_by_sampleID, and all_sampleIDs. Depending on what you want to do downstream, this might or might not be a convenient data structure.

2) Mangle the seqnames of the 2 GRanges objects with the sample IDs

This is probably much more efficient and should return exactly the same result as the easiest solution above. Do the mangling with something like:

query2 <- GRanges(paste(mcols(query)$sampleID, seqnames(query), sep=":"),
                  ranges(query), strand(query))
subject2 <- GRanges(paste(mcols(subject)$sampleID, seqnames(subject), sep=":"),
                    ranges(subject), strand(subject))

Then call findOverlaps() on query2 and subject2.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode
julrodr80 ▴ 30
@julrodr80-11991
Last seen 7.1 years ago

Hi, 

It works. I've tried the last options.

ADD COMMENT

Login before adding your answer.

Traffic: 567 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6