On Tue, 1 Dec 2009, Patrick Aboyoun wrote:
> Chuck,
> Thanks for the speedup to coverage for large width IRanges objects.
I checked
> in changes to IRanges 1.5.13 in BioC 2.6 (for use with R-devel)
based on the
> code you submitted. I'll back port this code to BioC 2.5 (R 2.10) in
the next
> few days as well.
Thank you so much!
> Just out of curiosity, what is the source of these long
> width intervals, where do the weights for these intervals come from,
and what
> operations do you perform on the resulting coverage vectors?
Re the source of the intervals:
I collaborate with Rick Bushman's group at Upenn
(
http://microb230.med.upenn.edu/), which studies 'the transfer of
genetic
information between cells and organisms'.
It is of particular interest to know where a mobile DNA element is
preferentially sited in a genome. (And not just for scientific
reasons,
gene therapy constructs can do serious damage if they land in the
wrong
place, so designing good gene therapy vectors is aided by being able
to
predict where a vector tends to land.)
In our papers (for example Berry et al. Selection of Target Sites for
Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11):
e157.
doi:10.1371/journal.pcbi.0020157), it emerged that features like 'gene
density' (number of genes per base in a window of a specified width),
CpG
island site density, and DNASE I site density have substantial effects
on
integration preference. The length scale for 'width' affects the
conclusions, and widths greater than a megabase and perhaps to tens of
megabases seem to matter.
---------------
Re the weights:
In performing the calculations of 'expression density' (counts per
base
of genes expressed above some threshhold value) using Affy arrays, I
had
to deal with varying numbers of probesets for different genes. It
seemed
to me that counting all probesets with equal weight would introduce
unwanted variation, so I used an inverse weighting scheme so that the
sum
of weights for each gene would equal one.
Michael's workaround seems reasonable.
----------------
Re the operations performed on coverage:
You can browse the paper mentioned above to get the idea, but the
operations are just a lookup of the coverage values (with some
adjustments
made for gaps and chromosome ends) followed by data analysis using
those
values.
In the idiom of the IRanges package (and simplified a bit for easier
reading), something like this:
## jurkat is a data.frame describing HIV integration site locations
## (aka 'insertions') and those of in silico controls ('match')
> xtabs( ~type, jurkat )
type
insertion match
39237 116161
all.chr <- levels(jurkat$Chr)
### locations as a RangedData instance:
jurkat.RL <- RangedData(IRanges(start=jurkat$Position,width=1),
space=jurkat$Chr,
strand=jurkat$Ort,orig.rows=rownames(jurkat))
### get a 1MBase gene density mapping for geneHuman():
library(GenomicFeatures.Hsapiens.UCSC.hg18)
knownGenes <- geneHuman()
known.RL <- with( subset(knownGenes, chrom%in%all.chr ),
RangedData( IRanges(start=txStart,en=txEnd),
space=factor(chrom, all.chr ),
strand=factor(strand,levels(jurkat$Ort))))
wider.RL <- known.RL
start( wider.RL ) <- start( wider.RL ) - 5e5
end( wider.RL ) <- end( wider.RL ) + 5e5
wider.cover <- coverage( wider.RL, width = seqlengths(Hsapiens)[
names( wider.RL ) ] )
### now lookup the coverage (aka density) values for the insertions
and
### matches:
jurkat.wider.cover <- wider.cover[ ranges(jurkat.RL) ]
## now match the coverage values back to the parent data.frame and do
some
## statistical analysis:
## simple examples:
> table( as.vector( as.vector( jurkat.wider.cover ) ),
+ jurkat[ jurkat.RL$orig.rows, 'type' ] )
insertion match
0 993 14881
1 568 7490
2 843 6453
3 1039 7292
4 1044 6104
5 1044 5590
6 1045 5086
[snip]
> lapply(split(as.vector(as.vector(jurkat.wider.cover)),
+ jurkat[jurkat.RL$orig.rows,'type']),mean)
$insertion
[1] 52.6881
$match
[1] 22.71713
>
Chuck
> Echoing Michael's comments, we haven't supported double precision
weights in
> coverage calculations in the past because we hadn't encountered any
common
> use cases for them and there is the workaround Michael mentioned if
the need
> arose. Providing some context for the enhancement request would help
motivate
> us to make a change. :)
>
>
> Cheers,
> Patrick
>
>
>
> Michael Lawrence wrote:
>> On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry
>> <cberry at="" tajo.ucsd.edu="">wrote:
>>
>>
>> > The semantics of the IRanges package and especially the
RangedData class
>> > are very apprpriate for some of the applications I deal with.
>> >
>> > Unfortunately, coverage() is too slow to be useful to me.
>> >
>> > I wonder if the Biocore Team would consider retooling it to make
it
>> > faster? Below I provide a link to a revised coverage.c that
might
>> > suffice.
>> >
>> > The kind of case I need to handle has width values in 10kbase to
10Mbase
>> > range. As a toy example, being able to run stuff like
>> >
>> > tmp <- coverage( IRanges(
start=seq(1,by=1000,length=10000),
>> > width=1e7 ) )
>> >
>> > quickly is needed.
>> >
>> > A revised version of coverage.c is available at
>> >
>> >
http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/cover
age.c/view
>> >
>> > It will handle the case above almost instantly (while the
existing
>> > version
>> > needs about 8 minutes on my machine) and seems about equal to
the
>> > existing version for cases with width=30. In the cases I've
looked at
>> > gc() reports the same memory usage.
>> >
>> > ---
>> >
>> > Also, I wonder if the Biocore Team would entertain allowing the
'weight'
>> > argument of coverage to be of type double? This would help in
cases in
>> > which downweighting of counts of some genomic features is
desired.
>> >
>> >
>> >
>> In many use cases, it's probably sufficient to simply round
floating point
>> numbers to integers after multiplying by a power of 10. That only
goes so
>> far though, so supporting double-precision seems reasonable. The
type of
>> the
>> output will simply depend on the type of the weights.
>>
>>
>>
>>
>> > Thanks,
>> >
>> > Chuck
>> >
>> > --
>> > Charles C. Berry (858) 534-2098
>> > Dept of
Family/Preventive
>> > Medicine
>> > E mailto:cberry at tajo.ucsd.edu UC San Diego
>> >
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego
>> > 92093-0901
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at stat.math.ethz.ch
>> >
https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> >
http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> >
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive
Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego
92093-0901