Hi all,

I have a small but important challenge that I've been unable to solve.
I need to aggregate all phastCons scores for 75-100 nt around all *mus*
exon splice sites. I've tried several approaches: downloading the
entire multiz30way phastCons dataset from UCSC (too big to work with
smoothly), intersecting via the UCSC Table Browser and Galaxy (which
unfortunately limits me to 10 million data points), and fetching the
data through rtracklayer (too slow). Can anyone point me towards an
elegant and fast way to fetch data points for many genomic intervals?
With around 22k genes, each with an average of 8 exons times 100 nt, it
seems I need to be able to fetch around 20m data points.
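
To make the size estimate above concrete, here is a minimal sketch in Python. The function names, the 50-nt flank, and the 1-based inclusive coordinate convention are assumptions chosen for illustration, not anything specified in the post; strand is ignored.

```python
def splice_site_windows(exon_start, exon_end, flank=50):
    """Return two (start, end) windows, 1-based and inclusive, one around
    each exon boundary. Each window spans `flank` bases on either side of
    the boundary, so 2 * flank bases in total. Hypothetical helper."""
    left = (max(1, exon_start - flank), exon_start + flank - 1)
    right = (max(1, exon_end - flank + 1), exon_end + flank)
    return [left, right]


def window_size(window):
    """Number of bases in a 1-based inclusive (start, end) window."""
    start, end = window
    return end - start + 1


# Figures quoted in the post.
GENES = 22_000
EXONS_PER_GENE = 8
WINDOW_NT = 100

# One 100-nt window per exon, as in the post's estimate.
per_exon_estimate = GENES * EXONS_PER_GENE * WINDOW_NT

# If every exon contributes a window at both of its boundaries.
per_site_estimate = per_exon_estimate * 2

windows = splice_site_windows(1000, 1200, flank=50)
print(windows, [window_size(w) for w in windows])
print(per_exon_estimate, per_site_estimate)
```

One window per exon gives 17.6 million values, in line with the "around 20m" figure; counting a window at both boundaries of every exon doubles that.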

I need to use the data as background in comparison to select
upregulated exons in an RNA-seq splicing study.

All the best,
JW,
University of Copenhagen
On Tue, Dec 15, 2009 at 5:01 PM, Johannes Waage <johannes.waage@bric.dk> wrote:
> [...]
> Can anyone point me towards an elegant and fast way to fetch
> datapoints for many genomic intervals?
> [...]

Could you do this chromosome-by-chromosome by loading the per-base
data one chromosome at a time from the files into an R vector and then
using normal vector subsetting to get the regions of interest?
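
A minimal sketch of the chromosome-at-a-time idea, written in Python rather than R. It assumes the per-base scores come as fixedStep wiggle text, where each `fixedStep chrom=... start=... step=...` header is followed by one value per line; the function names and the tiny demo data are made up for illustration. Because a chromosome arrives as many separate blocks with gaps between them, the parser has to track the block headers instead of reading one flat column.

```python
import math
from array import array


def load_fixedstep(lines, chrom_length):
    """Parse fixedStep wiggle text for one chromosome into a per-base array.

    Wiggle positions are 1-based, so index i holds the score for position
    i + 1. Bases that no block covers stay NaN.
    """
    scores = array("d", [math.nan]) * chrom_length
    pos = None  # 0-based index where the next value will be stored
    step = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Header fields look like key=value, e.g. start=3000001.
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            pos = int(fields["start"]) - 1
            step = int(fields.get("step", 1))
        else:
            if pos is None:
                raise ValueError("data line before any fixedStep header")
            scores[pos] = float(line)
            pos += step
    return scores


def region_scores(scores, start, end):
    """Scores for the 1-based, inclusive interval [start, end]."""
    return scores[start - 1:end]


# Made-up demo data: two blocks with a gap between them.
demo = """fixedStep chrom=chrDemo start=3 step=1
0.1
0.2
0.3
fixedStep chrom=chrDemo start=8 step=1
0.9
1.0
""".splitlines()

vec = load_fixedstep(demo, chrom_length=10)
window = list(region_scores(vec, 3, 5))
print(window)
```

An array of doubles costs 8 bytes per base, so a full-length mouse chromosome needs on the order of a gigabyte or more; single-precision storage would halve that.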
Alternatively, with a little work, you could probably also build a
little index file and then use random access to get the data from the
files.
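
One way the index-plus-random-access idea could look, again as a hedged Python sketch. It assumes fixedStep wiggle input with step=1 and no blank lines inside a block; `build_index` and `fetch` are hypothetical names. The index records the byte offset of each block's first data line, so a query seeks straight to the overlapping blocks instead of reading the whole file.

```python
import os
import tempfile


def build_index(path):
    """Scan the file once and return (chrom, first_pos, n_values,
    byte_offset) per block, where byte_offset is the first data line."""
    index = []
    current = None  # [chrom, first_pos, n_values, byte_offset]
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b"fixedStep"):
                if current:
                    index.append(tuple(current))
                fields = dict(f.split(b"=", 1) for f in line.split()[1:])
                current = [fields[b"chrom"].decode(),
                           int(fields[b"start"]), 0, fh.tell()]
            elif line.strip() and current:
                current[2] += 1
    if current:
        index.append(tuple(current))
    return index


def fetch(path, index, chrom, start, end):
    """Return {position: score} for the 1-based inclusive [start, end]."""
    out = {}
    with open(path, "rb") as fh:
        for blk_chrom, first, n, offset in index:
            last = first + n - 1
            if blk_chrom != chrom or last < start or first > end:
                continue  # block does not overlap the query
            fh.seek(offset)
            # Lines vary in length, so read from the block start and
            # discard values that fall before the requested interval.
            for pos in range(first, min(last, end) + 1):
                value = float(fh.readline().decode())
                if pos >= start:
                    out[pos] = value
    return out


# Made-up demo file with two blocks.
demo_text = (b"fixedStep chrom=chrDemo start=3 step=1\n0.1\n0.2\n0.3\n"
             b"fixedStep chrom=chrDemo start=8 step=1\n0.9\n1.0\n")
with tempfile.NamedTemporaryFile(delete=False, suffix=".wigFix") as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

index = build_index(demo_path)
result = fetch(demo_path, index, "chrDemo", 4, 8)
os.remove(demo_path)
print(result)
```

The index itself is small enough to save to disk and reuse. Within a block the lookup is still a linear read, so very long blocks would benefit from a fixed-width binary copy of the data.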
Finally, there are probably some tools in the UCSC browser tool chain
that you could download to deal with conservation data fairly quickly.
Sean

On Tue, Dec 15, 2009 at 2:21 PM, Sean Davis <seandavi@gmail.com> wrote:
> [...]
> Finally, there are probably some tools in the UCSC browser tool chain
> that you could download to deal with conservation data fairly quickly.

This may be a decent use case for bigWig support in Bioconductor. The
data is stored in a binary, indexed form, so it should be easy and
efficient to bring subsets into memory/R.

The mappability tracks are another example. Looks like rtracklayer may
be the place for this, at least initially. The mythical common IO
package would be helpful though.

Michael
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

On Tue, Dec 15, 2009 at 7:56 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:
> [...]
> This may be a decent use case for bigWig support in Bioconductor. The
> data is stored in a binary, indexed form, so it should be easy and
> efficient to bring subsets into memory/R.
> [...]

I agree that bigWig support would be a useful addition to the
Bioconductor tool set.
Sean

On Tue, Dec 15, 2009 at 5:21 PM, Sean Davis <seandavi@gmail.com> wrote:
> [...]
> Could you do this chromosome-by-chromosome by loading the per-base
> data one chromosome at a time from the files into an R vector and then
> using normal vector subsetting to get the regions of interest?

OK. I looked at the files and I don't think it will work without some
cleverness. The two methods below are still possible, though.
Sean

> Alternatively, with a little work, you could probably also build a
> little index file and then use random access to get the data from the
> files.
>
> Finally, there are probably some tools in the UCSC browser tool chain
> that you could download to deal with conservation data fairly quickly.
>
> Sean