I have a table with a start and stop column which defines a set of
ranges. I have another table with a list of genes with associated
position. What I would like to do is subset the gene table so it only
contains genes whose position is within any of the ranges. What is the
best way to do this? The only way I can think of is to construct a long
list of conditions linked by ORs but I am sure there must be a better
way.
Simple example:
Start Stop
1 3
5 9
13 15
Gene Position
1 14
2 4
3 10
4 6
I would like to get out:
Gene Position
1 14
4 6
Any ideas?
Thanks
Dan
--
**************************************************************
Daniel Brewer, Ph.D.
Institute of Cancer Research
Email: daniel.brewer at icr.ac.uk
**************************************************************
The Institute of Cancer Research: Royal Cancer Hospital, a charitable
Company Limited by Guarantee, Registered in England under Company No.
534147 with its Registered Office at 123 Old Brompton Road, London SW7
3RP.
You can use cut (?cut) defining the breaks from your ranges, as they
are non-overlapping.
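A minimal sketch of the cut() idea, using the example data from the question. It assumes integer positions and sorted, non-overlapping, non-adjacent ranges; the 0.5 offsets and the names ranges, genes, bin and inside are illustrative choices, not part of the suggestion above.

```r
# Example data from the original question
ranges <- data.frame(Start = c(1, 5, 13), Stop = c(3, 9, 15))
genes  <- data.frame(Gene = 1:4, Position = c(14, 4, 10, 6))

# Assumption: integer positions, so widening each range by 0.5
# keeps both end points of a range inside the same bin.
breaks <- sort(c(ranges$Start - 0.5, ranges$Stop + 0.5))

# labels = FALSE returns the bin number; positions outside all breaks give NA
bin <- cut(genes$Position, breaks = breaks, labels = FALSE)

# Bins alternate range, gap, range, ..., so odd-numbered bins are the ranges
inside <- !is.na(bin) & bin %% 2 == 1
result <- genes[inside, ]
result
```

If two ranges are adjacent (for example 1-3 and 4-6) the breaks are no longer unique and cut() will complain, so this only suits data like the example.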
Regards,
Carlos J. Gil Bellosta
http://www.datanalytics.com
You would like to avoid loops here, especially nested loops: this is
what apply, sapply etc. are for. Using your syntax:
final.presence = apply(position, 1, function(x) any(x[2]>=place$start &
x[2]<=place$stop))
-
Dr Oleg Sklyar * EMBL-EBI, Cambridge CB10 1SD, UK * +441223494466
On Mon, 2007-10-29 at 12:42 -0500, Artur Veloso wrote:
> Hi Daniel,
>
> I'm very new to R and I'm far from a good programmer, but I think that this
> small script should solve your problem. Well, at least for the example you
> provided it worked. I hope it helps.
>
> Cheers,
>
> Artur
>
> > start <- c(1,5,13)
> > stop <- c(3,9,15)
> > place <- data.frame(start,stop)
> >
> > gene <- c(1,2,3,4)
> > position <- c(14,4,10,6)
> > position <- data.frame(gene,position)
> >
> > range <- list()
> > for(a in 1:dim(place)[1])
> + range[[a]] <- seq(place$start[a],place$stop[a])
> >
> > presence <- NULL
> > final.presence <- NULL
> > for(b in position$position)
> + {
> + for(c in 1:length(range))
> + {
> + presence <- c(presence,b%in%range[[c]])
> + }
> + final.presence <- c(final.presence,as.logical(sum(presence)))
> + presence <- NULL
> + }
> >
> > position[final.presence,]
> gene position
> 1 1 14
> 4 4 6
>
>
In this case you don't gain much if anything by using apply(), which is
just a nice wrapper to a for() loop (and the bad rap that for loops have
in R isn't really applicable these days).
The real gain to be had is from vectorizing the comparison.
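One possible reading of "vectorizing the comparison", as a sketch only: loop over the (few) ranges and compare all gene positions at once in each pass. The data frames reuse the place and position names from Artur's script; inside is a name introduced here for illustration.

```r
# Same data as in Artur's script
place    <- data.frame(start = c(1, 5, 13), stop = c(3, 9, 15))
position <- data.frame(gene = 1:4, position = c(14, 4, 10, 6))

# Start with "not inside any range" for every gene
inside <- logical(nrow(position))

# One pass per range; each comparison handles all genes at once
for (k in seq_len(nrow(place))) {
  inside <- inside |
    (position$position >= place$start[k] & position$position <= place$stop[k])
}

position[inside, ]
```

The for() loop runs once per range rather than once per gene, so with many genes and few ranges nearly all the work is done by the vectorized comparisons.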
Best,
Jim
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
It's about both, and in fact after scrolling down I noticed that we came
up with exactly the same solution :)
-
Dr Oleg Sklyar * EMBL-EBI, Cambridge CB10 1SD, UK * +441223494466
Here is a function that I use for finding overlapping segments. It
takes two data.frames, x and y. Each must have "Chr", "Position", and
"end" columns (often used in conjunction with snapCGH--hence, the
Position rather than "start"). The "shift" parameter is a convenience
for doing "random shift" random distributions of genomic segments.
The function returns the indexes of x and y that overlap.
So, if the first row of the x data.frame overlaps with the first 3 rows
of y, the output will be:
Xindex Yindex
1 1
1 2
1 3
Note that the data.frames can have more than those three columns, but
those three columns MUST be present and named as mentioned.
Hope this helps.
Sean
Attached function below
-----------------------
findOverlappingSegments <-
function(x,y,shift=0) {
  swap <- nrow(x)<nrow(y)
  # want to have larger set first for speed
  if(swap) {
    tmpx <- x
    x <- y
    y <- tmpx
  }
  intersectChrom <- intersect(x$Chr,y$Chr)
  ret <- list()
  for(i in intersectChrom) {
    # rows of y and x that lie on chromosome i
    aindex <- which(y$Chr==i)
    bindex <- which(x$Chr==i)
    a <- y[aindex,]
    b <- x[bindex,]
    # for each (shifted) segment of a, find the rows of b it overlaps;
    # SIMPLIFY=FALSE keeps the result a list even when all lengths match
    overlapsBrow <- mapply(function(Astart, Aend) {
      which((Astart <= b$end & Astart>=b$Position) |
            (Aend <= b$end & Aend>=b$Position) |
            (Astart <= b$Position & Aend>=b$end) |
            (Astart >= b$Position & Aend<=b$end))
    },a$Position+shift,a$end+shift,SIMPLIFY=FALSE)
    tmp1 <- unlist(overlapsBrow)
    xindex <- bindex[tmp1]
    yindex <-
      aindex[rep(1:nrow(a),sapply(overlapsBrow,length,simplify=TRUE))]
    if(swap) {
      ret[[i]]<- cbind(yindex,xindex)
    } else {
      ret[[i]] <- cbind(xindex,yindex)
    }
    colnames(ret[[i]]) <- c('Xindex','Yindex')
  }
  return(do.call(rbind,ret))
}
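As a side note on the overlap test inside the function: for closed intervals with start <= end, the four OR-ed conditions should be equivalent to the shorter test "A starts no later than B ends, and A ends no earlier than B starts". The sketch below checks this on randomly generated intervals; the data and the names four_way, two_way and same are made up for illustration.

```r
set.seed(1)
n <- 1000

# Hypothetical random closed intervals, each with start <= end
Astart <- sample(1:100, n, replace = TRUE)
Aend   <- Astart + sample(0:20, n, replace = TRUE)
Bstart <- sample(1:100, n, replace = TRUE)
Bend   <- Bstart + sample(0:20, n, replace = TRUE)

# The four-way test, written as in findOverlappingSegments
four_way <- (Astart <= Bend & Astart >= Bstart) |
            (Aend <= Bend & Aend >= Bstart) |
            (Astart <= Bstart & Aend >= Bend) |
            (Astart >= Bstart & Aend <= Bend)

# The shorter equivalent test
two_way <- Astart <= Bend & Aend >= Bstart

same <- identical(four_way, two_way)
same
```

Either form gives the same answer; the shorter one just does fewer comparisons per pair.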
Or a more simplistic alternative that will work with the data provided:
> mat <- matrix(c(1,5,13,3,9,15), ncol=2)
> gn <- matrix(c(14,4,10,6), ncol=1)
> a <- apply(gn, 1, function(x) any(x >= mat[,1] & x <= mat[,2]))
> gn[a,]
[1] 14 6
Best,
Jim
Hi Dan,
Are you not telling us something here? Because the problem as stated is
very simple. Say your matrix is called mat:
index <- mat[,1] < 6 & mat[,2] < 15
Or do you have a whole bunch of ranges to test?
Best,
Jim
> pos <- matrix(c(1, 5, 13, 3, 9, 15), ncol=2)
> pos
[,1] [,2]
[1,] 1 3
[2,] 5 9
[3,] 13 15
> gene.pos <- c(14,4,10,6)
> gene.pos
[1] 14 4 10 6
> within <- sapply(gene.pos, function(g) any(apply(pos, 1, function(x)
+ findInterval(g, x)) == 1))
> gene.pos[within]
[1] 14 6
Look at ?findInterval, which does all the work. It returns 1 if within
range in this case.
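findInterval() is itself vectorized, so the sapply()/apply() pair can in principle be avoided. The sketch below is one way to do that, assuming the ranges are sorted by start and do not overlap; the names idx and within.vec are introduced here for illustration.

```r
pos <- matrix(c(1, 5, 13, 3, 9, 15), ncol = 2)   # column 1 = start, column 2 = stop
gene.pos <- c(14, 4, 10, 6, 15)

# For each gene, index of the last range whose start is <= the position
# (0 if the position lies before the first start)
idx <- findInterval(gene.pos, pos[, 1])

# A gene is inside if such a range exists and the position is <= its stop;
# pmax() only guards against indexing with 0
within.vec <- idx > 0 & gene.pos <= pos[pmax(idx, 1), 2]

gene.pos[within.vec]
```

Both end points count as inside here, so a gene sitting exactly on a stop position (15 in this example) is kept.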
-Christos
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of
> Daniel Brewer
> Sent: Monday, October 29, 2007 12:29 PM
> To: bioconductor at stat.math.ethz.ch
> Subject: [BioC] Is a number within a set of ranges?
Good to know the existence of findInterval(). Thanks!
For this particular case though, I would be tempted to keep things simple
by replacing this
any(apply(pos, 1, function(x) findInterval(g, x)) == 1)
by
any(apply(pos, 1, function(x) x[1] <= g && g <= x[2]))
Not only is the latter easier to understand, but with the former, you'll get
wrong results if one of your genes is positioned at one of the Stop
positions:
gene.pos <- c(14,4,10,6,15) # last gene is at a Stop position
# using findInterval() gives:
> within
[1] TRUE FALSE FALSE TRUE FALSE
# using 'x[1] <= g && g <= x[2]' gives:
> within
[1] TRUE FALSE FALSE TRUE TRUE
Note that the "findInterval" method can be fixed by specifying
'rightmost.closed=TRUE' but this doesn't make the code easier to
understand, quite the contrary...
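To make the boundary behaviour concrete, here is a small sketch comparing the two settings on a single range (13 to 15). The variable names rng, g, default and closed are illustrative.

```r
rng <- c(13, 15)              # one range: start 13, stop 15
g   <- c(12, 13, 14, 15, 16)  # positions around and on the boundaries

# Default: intervals are closed on the left and open on the right,
# so a position equal to the stop value falls past the range
default <- findInterval(g, rng)

# rightmost.closed = TRUE treats the last interval as closed on both sides
closed <- findInterval(g, rng, rightmost.closed = TRUE)

data.frame(g, default, closed, inside = closed == 1)
```

With the default setting, position 15 gets code 2 rather than 1, which is why the "== 1" test misses genes located exactly at a Stop position.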
Cheers,
H.
Hi Daniel,
I think you could do something smarter using the "outer" function here.
Let's say your matrix of intervals is "ints" and the Position column of
your genes-position matrix is pos;
then something like this should give you only the positions of those
genes inside those intervals:
pos[which(rowSums(outer(pos,ints[,"Stop"],"<=") &
outer(pos,ints[,"Start"],">=") )>0)]
Maybe there's even a smarter way that I do not know of.
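To see what the outer() expression builds, the sketch below spells out the intermediate logical matrix for the example data: one row per gene position, one column per interval. The dimnames and the names hit and inside are added only for readability.

```r
ints <- cbind(Start = c(1, 5, 13), Stop = c(3, 9, 15))
pos  <- c(14, 4, 10, 6)

# hit[i, j] is TRUE when position i lies inside interval j (end points included)
hit <- outer(pos, ints[, "Start"], ">=") & outer(pos, ints[, "Stop"], "<=")
dimnames(hit) <- list(as.character(pos),
                      paste0(ints[, "Start"], "-", ints[, "Stop"]))
hit

# A gene is kept if it falls in at least one interval
inside <- rowSums(hit) > 0
pos[inside]
```

The matrix has length(pos) * nrow(ints) entries, so this trades memory for speed; for very large tables one of the other approaches may be more practical.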
Regards,
Joern
This is a trivial one-liner:
r = data.frame(Start=c(1,5,13), End=c(3,9,15))
g = data.frame(Gene=c(1,2,3,4), Position=c(14,4,10,6))
index = apply(g, 1, function(x) any(x[2]>=r$Start & x[2]<=r$End))
> index
[1] TRUE FALSE FALSE TRUE
> g[index,]
Gene Position
1 1 14
4 4 6
Best,
Oleg
-
Dr Oleg Sklyar * EMBL-EBI, Cambridge CB10 1SD, UK * +441223494466
Thanks everyone for their ideas. That is marvellous.
Dan