Question

IRanges package: findOverlaps on blobs


Fahim Md ▴ 250

@fahim-md-4018

Last seen 10.6 years ago

Hello,

I have a file containing a list of aligned RefSeq identifiers in the format shown below. I want to convert this file into RangedData format in such a way that the "findOverlaps" function is as efficient as possible. I know how to do this when each identifier has just a start and an end coordinate; the IRanges package is very efficient at finding overlapping intervals in that case. But when the structure is in the form of blobs (as in the last three fields below), I am not sure how to convert it into RangedData format or how to subsequently call "findOverlaps".

RefSeqID        targetName  strand  blockSizes        queryStart       targetStart
XM_001065892.1  chr4        +       127,986,          0,127,           124513961,124514706,
XM_578205.2     chr2        -       535,137,148,      0,535,672,       155875533,155879894,155895543,
NM_012543.2     chr1        +       506,411,212,494,  0,506,917,1129,  96173572,96174920,96176574,96177991,

Thanks, I appreciate your help.

--Fahim

convert IRanges • 1.1k views

ADD COMMENT • link updated 13.9 years ago by Steve Lianoglou ★ 13k • written 13.9 years ago by Fahim Md ▴ 250

score 0 · Answer 1 · 2011-05-31


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 20 days ago

United States

Hi,

On Tue, May 31, 2011 at 11:08 AM, Fahim Mohammad wrote:
> I have a file containing a list of aligned Refseq identifier in the
> following format. [...] But when the structure is in the form of blobs (as
> shown below in the last three fields), I am not sure how to convert this
> structure into RangedData format and how to subsequently call the
> "findOverlaps" function.

I guess the answer lies in how you want overlaps to be calculated.

Is each "block" for each RefSeqID treated individually, or do you want one overlap to count for all of them?

If the answer is the latter, you might consider putting the intervals into a GRangesList object, where each element in the list is a GRanges object that has all the ranges for the particular RefSeq ID. If you want them all individually, then you just have to parse the file into a single GRanges object.

If you're asking *how* to parse the file, there are many ways. I'd maybe use read.table to get this thing into its respective columns, then iterate over each row, converting it into a GRanges object that has as many ranges as "blocks" -- look at ?strsplit if you don't know how to do that.

Lastly -- how are these ranges defined? Is the first row supposed to be turned into:

chr4 (124513961) -- (124513961 + 0 + 127)
chr4 (124514706 + 127) -- (124514706 + 127 + 986)

or?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
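A minimal sketch of the parsing route suggested above, not Steve's own code. It assumes the GenomicRanges package is installed, that the table has already been read into a data frame with the column names from the question, and that each block runs from targetStart to targetStart + blockSizes - 1 (the convention the original poster gives in the follow-up reply). The helper split_blocks is hypothetical, and the ranges are built in one vectorized call rather than by iterating over rows.

```r
library(GenomicRanges)

# The three example rows from the question, typed in by hand.
# In practice they would come from read.table() on the real file.
aln <- data.frame(
  RefSeqID    = c("XM_001065892.1", "XM_578205.2", "NM_012543.2"),
  targetName  = c("chr4", "chr2", "chr1"),
  strand      = c("+", "-", "+"),
  blockSizes  = c("127,986,", "535,137,148,", "506,411,212,494,"),
  targetStart = c("124513961,124514706,",
                  "155875533,155879894,155895543,",
                  "96173572,96174920,96176574,96177991,"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "127,986," into c(127L, 986L).
# strsplit() drops the empty field after the trailing comma.
split_blocks <- function(x) lapply(strsplit(x, ","), as.integer)

sizes  <- split_blocks(aln$blockSizes)
starts <- split_blocks(aln$targetStart)
n      <- lengths(sizes)          # number of blocks per alignment

# One flat GRanges holding every block of every alignment.
# width = size gives end = start + size - 1 (assumed convention).
gr <- GRanges(
  seqnames = rep(aln$targetName, n),
  ranges   = IRanges(start = unlist(starts), width = unlist(sizes)),
  strand   = rep(aln$strand, n)
)

# Group the blocks by RefSeqID: one GRanges per identifier.
ids    <- factor(rep(aln$RefSeqID, n), levels = aln$RefSeqID)
refseq <- split(gr, ids)
```

Keeping gr gives the "each block individually" option, while refseq (a GRangesList) gives the "one overlap counts for the whole alignment" option.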

ADD COMMENT • link 13.9 years ago Steve Lianoglou ★ 13k


Hi,

Thanks for the reply.

RefSeqID        targetName  strand  blockSizes        queryStart       targetStart
XM_001065892.1  chr4        +       127,986,          0,127,           124513961,124514706,
XM_578205.2     chr2        -       535,137,148,      0,535,672,       155875533,155879894,155895543,
NM_012543.2     chr1        +       506,411,212,494,  0,506,917,1129,  96173572,96174920,96176574,96177991,

> Is each "block" for each RefSeqID treated individually, or do you want
> one overlap to count for all of them?

Ans: I am looking for the latter option and want one overlap to count for all of the blocks. For example, if I am given the following sequence with its alignment:

UnknownSeq  targetName  strand  blockSizes     queryStart   targetStart
XM_ABCD     chr4        +       100, 200, 500  0,200, 1200  124513961,124514706, 124515900

I am interested in finding which of the three RefSeqIDs above overlaps with the given unknown sequence. The obvious answer is the first RefSeqID (XM_001065892.1).

> Lastly -- how are these ranges defined?
> Is the first row supposed to be turned into:

The intervals on the target genome for each sequence can be found using blockSizes and targetStart. For me, the "queryStart" field above is redundant and won't be used. The intervals in the first row run from targetStart to (targetStart + blockSizes - 1):

chr4 (124513961) -- (124513961 + 127 - 1)
chr4 (124514706) -- (124514706 + 986 - 1)

Thanks again,
Fahim

--
Fahim Mohammad
Bioinformatics Lab
University of Louisville
Louisville, KY, USA
Ph: +1-502-409-1167
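The query described in this reply could be sketched as below. This is only an illustration under stated assumptions: GenomicRanges is installed, the helper blocks is hypothetical, the block coordinates are typed in by hand from the tables in this thread, and each block is taken to span targetStart to targetStart + blockSizes - 1. Both the reference alignments and the unknown sequence are stored as GRangesList objects so that findOverlaps reports one hit per pair of alignments rather than one per pair of blocks.

```r
library(GenomicRanges)

# Hypothetical helper: all blocks of one alignment as a GRanges.
# width = size gives end = start + size - 1 (assumed convention).
blocks <- function(chrom, strand, starts, sizes) {
  GRanges(chrom, IRanges(start = starts, width = sizes), strand = strand)
}

# Reference alignments: one list element per RefSeqID.
refseq <- GRangesList(
  XM_001065892.1 = blocks("chr4", "+",
                          c(124513961, 124514706),
                          c(127, 986)),
  XM_578205.2    = blocks("chr2", "-",
                          c(155875533, 155879894, 155895543),
                          c(535, 137, 148)),
  NM_012543.2    = blocks("chr1", "+",
                          c(96173572, 96174920, 96176574, 96177991),
                          c(506, 411, 212, 494))
)

# The unknown sequence from the example, also as a one-element list.
unknown <- GRangesList(
  XM_ABCD = blocks("chr4", "+",
                   c(124513961, 124514706, 124515900),
                   c(100, 200, 500))
)

# An alignment pair is a hit if any block of one overlaps any block
# of the other (same chromosome, compatible strand by default).
hits <- findOverlaps(unknown, refseq)

# Names of the reference alignments hit by the unknown sequence.
matched <- names(refseq)[subjectHits(hits)]
print(matched)
```

With strand-specific data that should be matched regardless of orientation, findOverlaps also accepts ignore.strand = TRUE.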

ADD REPLY • link 13.9 years ago Fahim Md ▴ 250