Entering edit mode
Janet Young
▴
740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…
thanks - that'd be helpful
Janet
On Jul 31, 2013, at 12:47 PM, Michael Lawrence wrote:
> Yea, I was planning on adding a warning.
>
>
> On Wed, Jul 31, 2013 at 11:49 AM, Janet Young <jayoung@fhcrc.org>
wrote:
> Makes sense - thanks. I was also playing with iterating through my
queries 1000 at a time, but I'm sure the downloaded bigWig would be
more efficient (and more palatable to the UCSC people).
>
> From a user point of view, it would be helpful if ucscTableQuery
could issue a warning if we try to query with >1000 ranges, rather
than silently returning a partial result - does that make sense to
you? I spent a little time down a rabbit hole with that one.
>
> Janet
>
>
>
> On Jul 31, 2013, at 11:39 AM, Michael Lawrence wrote:
>
>> Right, the table browser is the limitation. It only allows 1000
query regions, and limits the output for "whole genome" as well to
something like 100,000.
>>
>> If you've got a lot of data to query, it's best to just download
the bigWig or other file and query it locally. These scores should be
available via the FTP as bigWig.
>>
>>
>>
>>
>> On Tue, Jul 30, 2013 at 1:06 PM, Janet Young <jayoung@fhcrc.org>
wrote:
>> I'm guessing this issue might come from the UCSC end: using their
online Table Browser, there is a query limit of 1000 regions.
>>
>> Janet
>>
>> On Jul 30, 2013, at 12:15 PM, Janet Young wrote:
>>
>> > Hi again,
>> >
>> > I think I've found another issue with rtracklyer, where if I run
a query with range specified as a large number of small GRanges, it
seems to only be getting results for the first 1000 ranges. Again,
hopefully the code below will explain all.
>> >
>> > thanks,
>> >
>> > Janet
>> >
>> >
-------------------------------------------------------------------
>> >
>> > Dr. Janet Young
>> >
>> > Malik lab
>> > http://research.fhcrc.org/malik/en.html
>> >
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Avenue N., A2-025,
>> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
>> >
>> > tel: (206) 667 4512
>> > email: jayoung ...at... fhcrc.org
>> >
>> >
-------------------------------------------------------------------
>> >
>> >
>> > ##### works:
>> >
>> > library(rtracklayer)
>> > library(GenomicRanges)
>> >
>> > session <- browserSession("UCSC")
>> > genome(session) <- "hg19"
>> >
>> >
>> > #### make some sample ranges - a large number of small ranges:
>> > numRanges <- 5000
>> > rangeWidths <- 50
>> > myRanges <- GRanges( seqnames=rep("chr1",numRanges),
>> >
ranges=IRanges(start=1:numRanges*rangeWidths*2,width=rangeWidths) )
>> >
>> > #### run a query
>> > query <- ucscTableQuery (session, "cons46way", range=myRanges)
>> > tableName(query) <- "phyloP46wayPrimates"
>> > scores <- track(query)
>> > ### seems to work with no apparent error
>> >
>> > ##### but the number of scores returned is much smaller than the
number of positions queried:
>> > length(scores)
>> > # [1] 41749
>> > sum(width(myRanges))
>> > # [1] 250000
>> >
>> >
>> > #### use findOverlaps to see which of the ranges had scores -
only 837 of the 5000 regions had scores. It's ranges 109-1000 that
have scores. Not a problem that ranges 1-108 had missing scores
(they're probably in an assembly gap, or don't have aligned primate
seqs), but at least some of ranges 1001-5000 should have scores.
>> >
>> > myOverlaps <- findOverlaps(myRanges, scores)
>> > length(unique(queryHits(myOverlaps)))
>> > # [1] 837
>> > range(queryHits(myOverlaps))
>> > # [1] 109 1000
>> >
>> > ### a separate query shows that at least ranges 1001-1010 indeed
should have scores (haven't checked the rest):
>> > query2 <- ucscTableQuery (session, "cons46way",
range=myRanges[1001:1010])
>> > tableName(query2) <- "phyloP46wayPrimates"
>> > scores2 <- track(query2)
>> >
>> > ### every position queried has a score:
>> > sum(width(myRanges[1001:1010]))
>> > # [1] 500
>> > length(scores2)
>> > # [1] 500
>> >
>> > myOverlaps2 <- findOverlaps(myRanges, scores2)
>> > range(queryHits(myOverlaps2))
>> > # [1] 1001 1010
>> >
>> >
>> >
>> > ##################
>> >
>> >
>> > sessionInfo()
>> >
>> > R version 3.0.1 Patched (2013-07-29 r63455)
>> > Platform: x86_64-unknown-linux-gnu (64-bit)
>> >
>> > locale:
>> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> > [9] LC_ADDRESS=C LC_TELEPHONE=C
>> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] parallel stats graphics grDevices utils datasets
methods
>> > [8] base
>> >
>> > other attached packages:
>> > [1] rtracklayer_1.21.9 GenomicRanges_1.13.35 XVector_0.1.0
>> > [4] IRanges_1.19.19 BiocGenerics_0.7.3
>> >
>> > loaded via a namespace (and not attached):
>> > [1] Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1
RCurl_1.95-4.1
>> > [5] Rsamtools_1.13.26 stats4_3.0.1 tools_3.0.1
XML_3.98-1.1
>> > [9] zlibbioc_1.7.0
>> >
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
[[alternative HTML version deleted]]