Getting first or last hit from a Hits object
3
0
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 4 weeks ago
WEHI, Melbourne, Australia
Hi all, Suppose I've created a Hits object by all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') and now I want to get the first (or last) hit for each query. Of course, I could use first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') but I this isn't necessary as all the required information is in the all_hits Hits object. I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions getFirstHit <- function(all_hits){ ans <- rep.int(NA_integer_, queryLength(all_hits)) oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] return(ans) } getLastHit <- function(all_hits){ ans <- rep.int(NA_integer_, queryLength(all_hits)) ans[queryHits(all_hits)] <- subjectHits(all_hits) return(ans) } These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) Now to my questions: (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? Many thanks, Pete -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
IRanges GenomicRanges IRanges GenomicRanges • 2.1k views
ADD COMMENT
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 10.1 years ago
United States
I needed essentially your "last_hits" for bumphunter:::fuzzy.match2 In your notation, it comes out: last_hits_rows <- cumsum(rle(queryHits(all_hits))$lengths) last_hits <- all_hits[last_hits_rows,] Having expended that effort, you could do first_hits_rows <- c(1, 1 + last_hits_rows[-length(query)]) Maybe there's a better way. I have not compared these to anything of Hector's. On Oct 20, 2013, at 4:32 PM, Peter Hickey <hickey at="" wehi.edu.au=""> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States
Thanks for providing this use case. It's actually documented that the Hits are sorted by query. So using that assumption, an Rle-based approach like that provided by Harris is probably the most efficient. But I would use Rle from IRanges instead of the base "rle" class, because it offers more conveniences; for example, the code is pretty intuitive for this use case: queryHits_rle <- Rle(queryHits(all_hits)) first_hits <- all_hits[start(queryHits_rle)] last_hits <- all_hits[end(queryHits_rle)] Michael On Sun, Oct 20, 2013 at 1:32 PM, Peter Hickey <hickey@wehi.edu.au> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I > could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the > all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the > source for findOverlaps-methods.R (actually Hector's branch at > https://github.com/hcorrada/GenomicRanges/blob/master/R /findOverlaps-methods.R) > I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), > subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my > use case (millions of query intervals with tens of thousands of subject > intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to > justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that > using ":::" to access non-exported functions isn't a great idea for use in > package code. If I need to include my own "getFirstHit" would someone be so > kind as to point me to the best advice for avoiding calls to ":::" in my > code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:13}}
ADD COMMENT
0
Entering edit mode
Thanks to Harris, Chuck and Michael for their solutions. Much appreciated. Pete On 21/10/2013, at 3:25 PM, Michael Lawrence wrote: > Thanks for providing this use case. > > It's actually documented that the Hits are sorted by query. So using that assumption, an Rle-based approach like that provided by Harris is probably the most efficient. But I would use Rle from IRanges instead of the base "rle" class, because it offers more conveniences; for example, the code is pretty intuitive for this use case: > > queryHits_rle <- Rle(queryHits(all_hits)) > first_hits <- all_hits[start(queryHits_rle)] > last_hits <- all_hits[end(queryHits_rle)] > > Michael > > > > On Sun, Oct 20, 2013 at 1:32 PM, Peter Hickey <hickey@wehi.edu.au> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:31}}
ADD REPLY
0
Entering edit mode
Charles Berry ▴ 290
@charles-berry-5754
Last seen 5.6 years ago
United States
Peter Hickey <hickey at="" ...=""> writes: > > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, > [...] gr1 <- GRanges(rep(letters,each=10), IRanges(start=sample(1e6,260),width=20)) hit1 <- findOverlaps( gr1, gr1, maxgap=1e5) hit1 # show what hit1 is hit1[!duplicated(queryHits(hit1))] # only the first hit for each query value hit1[!duplicated(queryHits(hit1),fromLast=TRUE)] # only last hits HTH, Chuck
ADD COMMENT

Login before adding your answer.

Traffic: 550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6