Turning a GRanges Metadata Column into Rle List
1
0
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 15 hours ago
Australia
Anything neater than this available ? g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) chrRanges <- split(g, seqnames(g)) lapply(chrRanges, function(x) { scoreVect <- rep(NA, seqlengths(x)[1]) invisible(mapply(function(st, end, score) scoreVect[st:end] <<- score, start(x), end(x), values(x)[,1])) Rle(scoreVect) }) Then, code to define views on these Rle and calculate view summaries. -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia
• 3.8k views
ADD COMMENT
0
Entering edit mode
@aaron-statham-4434
Last seen 10.2 years ago
On 7 January 2013 16:00, Dario Strbenac <d.strbenac@garvan.org.au> wrote: > Anything neater than this available ? > > g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = > c(20, 10)) > seqlengths(g) <- c(100, 100) > chrRanges <- split(g, seqnames(g)) > lapply(chrRanges, function(x) > { > scoreVect <- rep(NA, seqlengths(x)[1]) > invisible(mapply(function(st, end, score) scoreVect[st:end] <<- score, > start(x), end(x), values(x)[,1])) > Rle(scoreVect) > }) > > Then, code to define views on these Rle and calculate view summaries. > > Hey Dario, If I understand what you want to do correctly (your goal is not obvious) this one liner should work for integer/numeric metadata columns, and even handles overlapping scores gracefully. #define the ranges/scores g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) #Rles of scores coverage(g, weight=values(g)$scores) Cheers, Aaron > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Aaron Statham Postgraduate Scholar, Cancer Epigenetics Garvan Institute of Medical Research Tel: (02) 9295 8393 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 NSW Australia email: a.statham@garvan.org.au [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
> coverage(g, weight=values(g)$scores) That is close to what I was after, although the ranges which aren't present in the GRanges object effectively have missing scores and need to be NA because 0 is a valid score for the ranges which are present. So, if I had a view defined that only partially overlapped with a range in the example, I would want any summary to give NA, rather than taking 0 into the calculation.
ADD REPLY
0
Entering edit mode
One hack would be to add 1 to all of the scores, replace the zeros in the coverage() result with NAs and subtract 1. I've hit something like this a couple times in the past. The coverage method for GRanges could gain a default value argument. Michael On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>wrote: > > coverage(g, weight=values(g)$scores) > > That is close to what I was after, although the ranges which aren't > present in the GRanges object effectively have missing scores and need to > be NA because 0 is a valid score for the ranges which are present. So, if I > had a view defined that only partially overlapped with a range in the > example, I would want any summary to give NA, rather than taking 0 into the > calculation. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Something like coverage(foo, bar, ..., NA.value=-1) ? --t On Jan 7, 2013, at 8:59 AM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > One hack would be to add 1 to all of the scores, replace the zeros in the > coverage() result with NAs and subtract 1. I've hit something like this a > couple times in the past. The coverage method for GRanges could gain a > default value argument. > > Michael > > > > On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac at="" garvan.org.au="">wrote: > >>> coverage(g, weight=values(g)$scores) >> >> That is close to what I was after, although the ranges which aren't >> present in the GRanges object effectively have missing scores and need to >> be NA because 0 is a valid score for the ranges which are present. So, if I >> had a view defined that only partially overlapped with a range in the >> example, I would want any summary to give NA, rather than taking 0 into the >> calculation. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Here's a quick workaround (would be more elegant if weights in coverage() could be NA, but that would probably break other functionality) g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) mask=-1 #placeholder for NA #Define the ranges we want to be NA gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) values(gInv)$scores <- mask g <- c(g, gInv) endoapply(coverage(g, weight="scores"), function(x) { #replace mask with NA runValue(x)[runValue(x==mask)] <- NA x }) Aaron On 8 January 2013 03:59, Michael Lawrence <lawrence.michael@gene.com> wrote: > One hack would be to add 1 to all of the scores, replace the zeros in the > coverage() result with NAs and subtract 1. I've hit something like this a > couple times in the past. The coverage method for GRanges could gain a > default value argument. > > Michael > > > > On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>wrote: > >> > coverage(g, weight=values(g)$scores) >> >> That is close to what I was after, although the ranges which aren't >> present in the GRanges object effectively have missing scores and need to >> be NA because 0 is a valid score for the ranges which are present. So, if I >> had a view defined that only partially overlapped with a range in the >> example, I would want any summary to give NA, rather than taking 0 into the >> calculation. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Aaron Statham Postgraduate Scholar, Cancer Epigenetics Garvan Institute of Medical Research Tel: (02) 9295 8393 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 NSW Australia email: a.statham@garvan.org.au [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
I was thinking you could just do (assuming score is always >= 0, and that the ranges do not overlap, which seems to be the case from the initial code): gr$score <- gr$score + 1L cov <- coverage(gr, weight = "score") cov[cov == 0L] <- NA cov <- cov - 1L Note that it should not be necessary to use endoapply or any other apply function, as it is all implicit in the List API. Not tested; just an illustration. Michael On Mon, Jan 7, 2013 at 2:55 PM, Aaron Statham <a.statham@garvan.org.au>wrote: > Here's a quick workaround (would be more elegant if weights in coverage() > could be NA, but that would probably break other functionality) > > g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores > = c(20, 10)) > seqlengths(g) <- c(100, 100) > > mask=-1 #placeholder for NA > #Define the ranges we want to be NA > gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) > values(gInv)$scores <- mask > g <- c(g, gInv) > > endoapply(coverage(g, weight="scores"), function(x) { > #replace mask with NA > runValue(x)[runValue(x==mask)] <- NA > x > }) > > Aaron > > > On 8 January 2013 03:59, Michael Lawrence <lawrence.michael@gene.com>wrote: > >> One hack would be to add 1 to all of the scores, replace the zeros in the >> coverage() result with NAs and subtract 1. I've hit something like this a >> couple times in the past. The coverage method for GRanges could gain a >> default value argument. >> >> Michael >> >> >> >> On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>> > wrote: >> >>> > coverage(g, weight=values(g)$scores) >>> >>> That is close to what I was after, although the ranges which aren't >>> present in the GRanges object effectively have missing scores and need to >>> be NA because 0 is a valid score for the ranges which are present. So, if I >>> had a view defined that only partially overlapped with a range in the >>> example, I would want any summary to give NA, rather than taking 0 into the >>> calculation. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > Aaron Statham > Postgraduate Scholar, Cancer Epigenetics > Garvan Institute of Medical Research Tel: (02) 9295 8393 > 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 > NSW Australia email: a.statham@garvan.org.au > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Dario, Alternatively you could call coverage() a 2nd time with no weights to find the regions with no coverage, and set them to NA: cvg <- coverage(gr, weight="scores") cvg[coverage(gr) == 0] <- NA H. On 01/07/2013 04:00 PM, Michael Lawrence wrote: > I was thinking you could just do (assuming score is always >= 0, and that > the ranges do not overlap, which seems to be the case from the initial > code): > > gr$score <- gr$score + 1L > cov <- coverage(gr, weight = "score") > cov[cov == 0L] <- NA > cov <- cov - 1L > > Note that it should not be necessary to use endoapply or any other apply > function, as it is all implicit in the List API. > > Not tested; just an illustration. > > Michael > > > On Mon, Jan 7, 2013 at 2:55 PM, Aaron Statham <a.statham at="" garvan.org.au="">wrote: > >> Here's a quick workaround (would be more elegant if weights in coverage() >> could be NA, but that would probably break other functionality) >> >> g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores >> = c(20, 10)) >> seqlengths(g) <- c(100, 100) >> >> mask=-1 #placeholder for NA >> #Define the ranges we want to be NA >> gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) >> values(gInv)$scores <- mask >> g <- c(g, gInv) >> >> endoapply(coverage(g, weight="scores"), function(x) { >> #replace mask with NA >> runValue(x)[runValue(x==mask)] <- NA >> x >> }) >> >> Aaron >> >> >> On 8 January 2013 03:59, Michael Lawrence <lawrence.michael at="" gene.com="">wrote: >> >>> One hack would be to add 1 to all of the scores, replace the zeros in the >>> coverage() result with NAs and subtract 1. I've hit something like this a >>> couple times in the past. The coverage method for GRanges could gain a >>> default value argument. >>> >>> Michael >>> >>> >>> >>> On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac at="" garvan.org.au="">>>> wrote: >>> >>>>> coverage(g, weight=values(g)$scores) >>>> >>>> That is close to what I was after, although the ranges which aren't >>>> present in the GRanges object effectively have missing scores and need to >>>> be NA because 0 is a valid score for the ranges which are present. So, if I >>>> had a view defined that only partially overlapped with a range in the >>>> example, I would want any summary to give NA, rather than taking 0 into the >>>> calculation. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >> -- >> Aaron Statham >> Postgraduate Scholar, Cancer Epigenetics >> Garvan Institute of Medical Research Tel: (02) 9295 8393 >> 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 >> NSW Australia email: a.statham at garvan.org.au >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
> I was thinking you could just do (assuming score is > always >= 0, and that the ranges do not overlap, > which seems to be the case from the initial code): In another scenario, what if I have data on multiple cell lines, such as one of the ENCODE integrated tracks, and I would like to find the maximum value at each base within a specified genomic region ? In this case, the ranges in the supertrack would overlap often. I would imagine making a separate coverage RleList for each cell line, to avoid complications with overlapping ranges. Then, it would be nice to have a pmax function in the API that could take a number of RleList objects, each of the same length, and return one RleList. Are there any plans for pmin and pmax of this style ?
ADD REPLY
0
Entering edit mode
On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
If you want the maximum of the overlap you can use something like this too: ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) over<-findOverlaps(ir,ir.encode.data) query<-queryHits(over) subject<-subjectHits(over) tapply(as.numeric(encode.data[subject,"score"]),query,function(x) max(x,na.rm=TRUE)) slight modification if have different chrs .... This works for getting GERP scores in indels etc Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Martin Morgan <mtmorgan@fhcrc.org> To: D.Strbenac at garvan.org.au Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Mon, 14 Jan 2013 22:17:17 -0800 On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Actually this may be better. than my last post: But making the Rle object of scores using "coverage and weights" (is the only way I could think of). In won't work in cases where the intervals in ir.encode.data are not unique (that is the coverage values must always be one else you get coverage*score instead of just score when you use the weights) Is there a better way to make an Rle of non-contigious scores ??? ### assume encode.data is table with columns start,end,score ### assume you.intervals is table with columns start,end # ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) cov<-coverage(ir.encode.data,weight=as.numeric(ir.encode.data[,"score" ])) ## DANGER HERE myViews<-Views(cov,start=you.intervals[,"start"],end=you.intervals[,"e nd"] ) viewMaxs(myViews) if ir.encode.data and ir are RleLists use regionViews <- RleViewsList(rleList = cov, rangesList =ir ) instesd of the Views I'm not sure which solution is best but the above should use less memory? Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Paul Leo <p.leo@uq.edu.au> Reply-to: <p.leo at="" uq.edu.au=""> To: Martin Morgan <mtmorgan at="" fhcrc.org=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Tue, 15 Jan 2013 16:38:33 +1000 If you want the maximum of the overlap you can use something like this too: ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) over<-findOverlaps(ir,ir.encode.data) query<-queryHits(over) subject<-subjectHits(over) tapply(as.numeric(encode.data[subject,"score"]),query,function(x) max(x,na.rm=TRUE)) slight modification if have different chrs .... This works for getting GERP scores in indels etc Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Martin Morgan <mtmorgan@fhcrc.org> To: D.Strbenac at garvan.org.au Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Mon, 14 Jan 2013 22:17:17 -0800 On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 885 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6