IRanges:::coverage() speedup/enchancement

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 10.6 years ago

United States

Chuck, Thanks for the speedup to coverage for large width IRanges objects. I checked in changes to IRanges 1.5.13 in BioC 2.6 (for use with R-devel) based on the code you submitted. I'll back port this code to BioC 2.5 (R 2.10) in the next few days as well. Just out of curiosity, what is the source of these long width intervals, where do the weights for these intervals come from, and what operations do you perform on the resulting coverage vectors? Echoing Michael's comments, we haven't supported double precision weights in coverage calculations in the past because we hadn't encountered any common use cases for them and there is the workaround Michael mentioned if the need arose. Providing some context for the enhancement request would help motivate us to make a change. :) Cheers, Patrick Michael Lawrence wrote: > On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry <cberry at="" tajo.ucsd.edu="">wrote: > > >> The semantics of the IRanges package and especially the RangedData class >> are very apprpriate for some of the applications I deal with. >> >> Unfortunately, coverage() is too slow to be useful to me. >> >> I wonder if the Biocore Team would consider retooling it to make it >> faster? Below I provide a link to a revised coverage.c that might suffice. >> >> The kind of case I need to handle has width values in 10kbase to 10Mbase >> range. As a toy example, being able to run stuff like >> >> tmp <- coverage( IRanges( start=seq(1,by=1000,length=10000), >> width=1e7 ) ) >> >> quickly is needed. >> >> A revised version of coverage.c is available at >> >> http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/coverage .c/view >> >> It will handle the case above almost instantly (while the existing version >> needs about 8 minutes on my machine) and seems about equal to the >> existing version for cases with width=30. In the cases I've looked at >> gc() reports the same memory usage. >> >> --- >> >> Also, I wonder if the Biocore Team would entertain allowing the 'weight' >> argument of coverage to be of type double? This would help in cases in >> which downweighting of counts of some genomic features is desired. >> >> >> > In many use cases, it's probably sufficient to simply round floating point > numbers to integers after multiplying by a power of 10. That only goes so > far though, so supporting double-precision seems reasonable. The type of the > output will simply depend on the type of the weights. > > > > >> Thanks, >> >> Chuck >> >> -- >> Charles C. Berry (858) 534-2098 >> Dept of Family/Preventive >> Medicine >> E mailto:cberry at tajo.ucsd.edu UC San Diego >> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

Coverage IRanges Coverage IRanges • 1.3k views

ADD COMMENT • link updated 15.4 years ago by Charles Berry ▴ 290 • written 15.4 years ago by Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Charles Berry ▴ 290

@charles-berry-5754

Last seen 6.0 years ago

United States

On Tue, 1 Dec 2009, Patrick Aboyoun wrote: > Chuck, > Thanks for the speedup to coverage for large width IRanges objects. I checked > in changes to IRanges 1.5.13 in BioC 2.6 (for use with R-devel) based on the > code you submitted. I'll back port this code to BioC 2.5 (R 2.10) in the next > few days as well. Thank you so much! > Just out of curiosity, what is the source of these long > width intervals, where do the weights for these intervals come from, and what > operations do you perform on the resulting coverage vectors? Re the source of the intervals: I collaborate with Rick Bushman's group at Upenn (http://microb230.med.upenn.edu/), which studies 'the transfer of genetic information between cells and organisms'. It is of particular interest to know where a mobile DNA element is preferentially sited in a genome. (And not just for scientific reasons, gene therapy constructs can do serious damage if they land in the wrong place, so designing good gene therapy vectors is aided by being able to predict where a vector tends to land.) In our papers (for example Berry et al. Selection of Target Sites for Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): e157. doi:10.1371/journal.pcbi.0020157), it emerged that features like 'gene density' (number of genes per base in a window of a specified width), CpG island site density, and DNASE I site density have substantial effects on integration preference. The length scale for 'width' affects the conclusions, and widths greater than a megabase and perhaps to tens of megabases seem to matter. --------------- Re the weights: In performing the calculations of 'expression density' (counts per base of genes expressed above some threshhold value) using Affy arrays, I had to deal with varying numbers of probesets for different genes. It seemed to me that counting all probesets with equal weight would introduce unwanted variation, so I used an inverse weighting scheme so that the sum of weights for each gene would equal one. Michael's workaround seems reasonable. ---------------- Re the operations performed on coverage: You can browse the paper mentioned above to get the idea, but the operations are just a lookup of the coverage values (with some adjustments made for gaps and chromosome ends) followed by data analysis using those values. In the idiom of the IRanges package (and simplified a bit for easier reading), something like this: ## jurkat is a data.frame describing HIV integration site locations ## (aka 'insertions') and those of in silico controls ('match') > xtabs( ~type, jurkat ) type insertion match 39237 116161 all.chr <- levels(jurkat$Chr) ### locations as a RangedData instance: jurkat.RL <- RangedData(IRanges(start=jurkat$Position,width=1), space=jurkat$Chr, strand=jurkat$Ort,orig.rows=rownames(jurkat)) ### get a 1MBase gene density mapping for geneHuman(): library(GenomicFeatures.Hsapiens.UCSC.hg18) knownGenes <- geneHuman() known.RL <- with( subset(knownGenes, chrom%in%all.chr ), RangedData( IRanges(start=txStart,en=txEnd), space=factor(chrom, all.chr ), strand=factor(strand,levels(jurkat$Ort)))) wider.RL <- known.RL start( wider.RL ) <- start( wider.RL ) - 5e5 end( wider.RL ) <- end( wider.RL ) + 5e5 wider.cover <- coverage( wider.RL, width = seqlengths(Hsapiens)[ names( wider.RL ) ] ) ### now lookup the coverage (aka density) values for the insertions and ### matches: jurkat.wider.cover <- wider.cover[ ranges(jurkat.RL) ] ## now match the coverage values back to the parent data.frame and do some ## statistical analysis: ## simple examples: > table( as.vector( as.vector( jurkat.wider.cover ) ), + jurkat[ jurkat.RL$orig.rows, 'type' ] ) insertion match 0 993 14881 1 568 7490 2 843 6453 3 1039 7292 4 1044 6104 5 1044 5590 6 1045 5086 [snip] > lapply(split(as.vector(as.vector(jurkat.wider.cover)), + jurkat[jurkat.RL$orig.rows,'type']),mean) $insertion [1] 52.6881 $match [1] 22.71713 > Chuck > Echoing Michael's comments, we haven't supported double precision weights in > coverage calculations in the past because we hadn't encountered any common > use cases for them and there is the workaround Michael mentioned if the need > arose. Providing some context for the enhancement request would help motivate > us to make a change. :) > > > Cheers, > Patrick > > > > Michael Lawrence wrote: >> On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry >> <cberry at="" tajo.ucsd.edu="">wrote: >> >> >> > The semantics of the IRanges package and especially the RangedData class >> > are very apprpriate for some of the applications I deal with. >> > >> > Unfortunately, coverage() is too slow to be useful to me. >> > >> > I wonder if the Biocore Team would consider retooling it to make it >> > faster? Below I provide a link to a revised coverage.c that might >> > suffice. >> > >> > The kind of case I need to handle has width values in 10kbase to 10Mbase >> > range. As a toy example, being able to run stuff like >> > >> > tmp <- coverage( IRanges( start=seq(1,by=1000,length=10000), >> > width=1e7 ) ) >> > >> > quickly is needed. >> > >> > A revised version of coverage.c is available at >> > >> > http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/cover age.c/view >> > >> > It will handle the case above almost instantly (while the existing >> > version >> > needs about 8 minutes on my machine) and seems about equal to the >> > existing version for cases with width=30. In the cases I've looked at >> > gc() reports the same memory usage. >> > >> > --- >> > >> > Also, I wonder if the Biocore Team would entertain allowing the 'weight' >> > argument of coverage to be of type double? This would help in cases in >> > which downweighting of counts of some genomic features is desired. >> > >> > >> > >> In many use cases, it's probably sufficient to simply round floating point >> numbers to integers after multiplying by a power of 10. That only goes so >> far though, so supporting double-precision seems reasonable. The type of >> the >> output will simply depend on the type of the weights. >> >> >> >> >> > Thanks, >> > >> > Chuck >> > >> > -- >> > Charles C. Berry (858) 534-2098 >> > Dept of Family/Preventive >> > Medicine >> > E mailto:cberry at tajo.ucsd.edu UC San Diego >> > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego >> > 92093-0901 >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at stat.math.ethz.ch >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901

ADD COMMENT • link 15.4 years ago Charles Berry ▴ 290

0

Entering edit mode

Chuck, Thanks for the URL and paper reference. I'll take a look at them. Also thanks for the code snippets. I always find code to be useful user feedback. One thing I wanted to point out is that I have tried to provide methods for the most commonly used operations in R's base and stats packages for Rle objects so users can avoid coercion to dense vector representations of their data. In particular, there are split() and mean() methods for Rle objects so to get the means for the insertion and match types you could use the code lapply(split(as.vector(jurkat.wider.cover), jurkat[jurkat.RL$orig.rows,'type']), mean) The split operation will create an RleList and the lapply(..., mean) operation will produce the summary results you had before. There is a table method for Rle objects as well, but it currently only support univariate summaries: table(as.vector(jurkat.wider.cover)) I'll add support for bivariate table summaries to my TODO list as well. Cheers, Patrick Charles C. Berry wrote: > On Tue, 1 Dec 2009, Patrick Aboyoun wrote: > >> Chuck, >> Thanks for the speedup to coverage for large width IRanges objects. I >> checked in changes to IRanges 1.5.13 in BioC 2.6 (for use with >> R-devel) based on the code you submitted. I'll back port this code to >> BioC 2.5 (R 2.10) in the next few days as well. > > Thank you so much! > >> Just out of curiosity, what is the source of these long width >> intervals, where do the weights for these intervals come from, and >> what operations do you perform on the resulting coverage vectors? > > Re the source of the intervals: > > I collaborate with Rick Bushman's group at Upenn > (http://microb230.med.upenn.edu/), which studies 'the transfer of > genetic information between cells and organisms'. > > It is of particular interest to know where a mobile DNA element is > preferentially sited in a genome. (And not just for scientific > reasons, gene therapy constructs can do serious damage if they land in > the wrong place, so designing good gene therapy vectors is aided by > being able to predict where a vector tends to land.) > > In our papers (for example Berry et al. Selection of Target Sites for > Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): > e157. doi:10.1371/journal.pcbi.0020157), it emerged that features like > 'gene density' (number of genes per base in a window of a specified > width), CpG island site density, and DNASE I site density have > substantial effects on integration preference. The length scale for > 'width' affects the conclusions, and widths greater than a megabase > and perhaps to tens of megabases seem to matter. > > --------------- > > Re the weights: > > In performing the calculations of 'expression density' (counts per > base of genes expressed above some threshhold value) using Affy > arrays, I had to deal with varying numbers of probesets for different > genes. It seemed to me that counting all probesets with equal weight > would introduce unwanted variation, so I used an inverse weighting > scheme so that the sum of weights for each gene would equal one. > > Michael's workaround seems reasonable. > > ---------------- > > Re the operations performed on coverage: > > You can browse the paper mentioned above to get the idea, but the > operations are just a lookup of the coverage values (with some > adjustments made for gaps and chromosome ends) followed by data > analysis using those values. > > In the idiom of the IRanges package (and simplified a bit for easier > reading), something like this: > > ## jurkat is a data.frame describing HIV integration site locations ## > (aka 'insertions') and those of in silico controls ('match') > >> xtabs( ~type, jurkat ) > type > insertion match > 39237 116161 > > > all.chr <- levels(jurkat$Chr) > > ### locations as a RangedData instance: > > jurkat.RL <- RangedData(IRanges(start=jurkat$Position,width=1), > space=jurkat$Chr, strand=jurkat$Ort,orig.rows=rownames(jurkat)) > > ### get a 1MBase gene density mapping for geneHuman(): > > library(GenomicFeatures.Hsapiens.UCSC.hg18) > knownGenes <- geneHuman() > known.RL <- with( subset(knownGenes, chrom%in%all.chr ), > RangedData( IRanges(start=txStart,en=txEnd), > space=factor(chrom, all.chr ), > strand=factor(strand,levels(jurkat$Ort)))) > wider.RL <- known.RL > start( wider.RL ) <- start( wider.RL ) - 5e5 > end( wider.RL ) <- end( wider.RL ) + 5e5 > wider.cover <- coverage( wider.RL, width = seqlengths(Hsapiens)[ > names( wider.RL ) ] ) > > ### now lookup the coverage (aka density) values for the insertions > and ### matches: > > jurkat.wider.cover <- wider.cover[ ranges(jurkat.RL) ] > > ## now match the coverage values back to the parent data.frame and do > some ## statistical analysis: > > ## simple examples: > >> table( as.vector( as.vector( jurkat.wider.cover ) ), > + jurkat[ jurkat.RL$orig.rows, 'type' ] ) > > > insertion match > 0 993 14881 > 1 568 7490 > 2 843 6453 > 3 1039 7292 > 4 1044 6104 > 5 1044 5590 > 6 1045 5086 > [snip] > >> lapply(split(as.vector(as.vector(jurkat.wider.cover)), > + jurkat[jurkat.RL$orig.rows,'type']),mean) > > $insertion > [1] 52.6881 > > $match > [1] 22.71713 > >> > > Chuck > >> Echoing Michael's comments, we haven't supported double precision >> weights in coverage calculations in the past because we hadn't >> encountered any common use cases for them and there is the workaround >> Michael mentioned if the need arose. Providing some context for the >> enhancement request would help motivate us to make a change. :) >> >> >> Cheers, >> Patrick >> >> >> >> Michael Lawrence wrote: >>> On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry >>> <cberry at="" tajo.ucsd.edu="">wrote: >>> >>> >>> > The semantics of the IRanges package and especially the >>> RangedData class >>> > are very apprpriate for some of the applications I deal with. >>> > > Unfortunately, coverage() is too slow to be useful to me. >>> > > I wonder if the Biocore Team would consider retooling it to >>> make it >>> > faster? Below I provide a link to a revised coverage.c that might >>> > suffice. >>> > > The kind of case I need to handle has width values in 10kbase >>> to 10Mbase >>> > range. As a toy example, being able to run stuff like >>> > > tmp <- coverage( IRanges( start=seq(1,by=1000,length=10000), >>> > width=1e7 ) ) >>> > > quickly is needed. >>> > > A revised version of coverage.c is available at >>> > > >>> http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/coverag e.c/view >>> >>> > > It will handle the case above almost instantly (while the >>> existing > version >>> > needs about 8 minutes on my machine) and seems about equal to the >>> > existing version for cases with width=30. In the cases I've >>> looked at >>> > gc() reports the same memory usage. >>> > > --- >>> > > Also, I wonder if the Biocore Team would entertain allowing the >>> 'weight' >>> > argument of coverage to be of type double? This would help in >>> cases in >>> > which downweighting of counts of some genomic features is desired. >>> > > > >>> In many use cases, it's probably sufficient to simply round >>> floating point >>> numbers to integers after multiplying by a power of 10. That only >>> goes so >>> far though, so supporting double-precision seems reasonable. The >>> type of >>> the >>> output will simply depend on the type of the weights. >>> >>> >>> >>> >>> > Thanks, >>> > > Chuck >>> > > -- >>> > Charles C. Berry (858) 534-2098 >>> > Dept of Family/Preventive >>> > Medicine >>> > E mailto:cberry at tajo.ucsd.edu UC San Diego >>> > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego >>> > 92093-0901 >>> > > _______________________________________________ >>> > Bioconductor mailing list >>> > Bioconductor at stat.math.ethz.ch >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: >>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > > >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive > Medicine > E mailto:cberry at tajo.ucsd.edu UC San Diego > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego > 92093-0901 > >

ADD REPLY • link 15.4 years ago Patrick Aboyoun ★ 1.6k

Login before adding your answer.