GenomicFeatures supportedUCSCFeatureDbTracks never returns and has to be terminated with C-c
1
0
Entering edit mode
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 3 months ago
United States

In my hands, supportedUCSCFeatureDbTracks never returns and has to be terminated with C-c.

Any debugging help very welcome.

I've tried with hg19 in addition to dm6.  Below I report my linux setup, but it also fails under windows, R 3.5.1, GenomicFeatures_1.34.1

>  trks<-supportedUCSCFeatureDbTracks('dm6')
  C-c C-c
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.32.0 AnnotationDbi_1.42.1   Biobase_2.40.0         GenomicRanges_1.32.6   GenomeInfoDb_1.16.0    IRanges_2.14.12        S4Vectors_0.18.3       BiocGenerics_0.26.0    devtools_1.13.6       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19                compiler_3.5.0              XVector_0.20.0              prettyunits_1.0.2           bitops_1.0-6                tools_3.5.0                 zlibbioc_1.26.0             progress_1.2.0              biomaRt_2.36.1              digest_0.6.18               bit_1.1-14                  lattice_0.20-35             RSQLite_2.1.1               memoise_1.1.0               pkgconfig_2.0.2             rlang_0.3.0.1               Matrix_1.2-15               DelayedArray_0.6.6          DBI_1.0.0                   GenomeInfoDbData_1.1.0      rtracklayer_1.40.3         
[22] withr_2.1.2                 stringr_1.3.1               httr_1.3.1                  Biostrings_2.48.0           hms_0.4.2                   grid_3.5.0                  bit64_0.9-7                 R6_2.3.0                    BiocParallel_1.14.2         XML_3.98-1.12               blob_1.1.1                  magrittr_1.5                matrixStats_0.54.0          GenomicAlignments_1.16.0    Rsamtools_1.32.2            SummarizedExperiment_1.10.1 assertthat_0.2.0            stringi_1.2.4               RCurl_1.95-4.11             crayon_1.3.4               

 

 

 

 

genomicfeatures bug • 1.1k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

I think you are being too impatient. There are 35 tracks for dm6, and for each track you have to make a couple of queries to UCSC, which does take time. Even if I jam it out with a core for each query it takes a bit of time, mostly in transit (because the elapsed time is almost a minute, but the system time is not even two seconds):

> session <- browserSession()
> genome(session) <- "dm6"
> trx <- trackNames(ucscTableQuery(session))
> library(BiocParallel)
> system.time(bplapply(trx, GenomicFeatures:::isGoodTrack, session, BPPARAM = MulticoreParam(35)))
   user  system elapsed
  0.320   1.376  49.718

But waiting for that does look boring. Maybe it should be parallelized?

ADD COMMENT
0
Entering edit mode

Thanks, James, but I don't understand your response.  I commented on the function supportedUCSCFeatureDbTracks, which by my sights you did not exercise in your code, so, how could it address my issue?

ADD REPLY
0
Entering edit mode
I ran the underlying code, which is normally run serially. I parallelized the slow step to show you that it does run. But since it's normally run serially, it will probably take about 35 minutes to run.
ADD REPLY
0
Entering edit mode

The code underlying supportedUCSCFeatureDbTracks is

> supportedUCSCFeatureDbTracks
function (genome) 
{
    session <- browserSession()
    genome(session) <- genome
    query <- ucscTableQuery(session)
    trx <- trackNames(ucscTableQuery(session))
    supported <- makeWhiteList(session, trx)
    trx[supported]
}

Which looks in part like what you are telling me is the underlying code, but not completely.

When I run the above line by line (adding namespace prefix as needed), I find this line is where things stall

    supported <- GenomicFeatures:::makeWhiteList(session, trx)

Which line is absent from what you are running.

... Ah, but, now I see, makeWhiteList in turn calls isGoodTrack.

Neither makeWhiteList nor isGoodTrack are exposed/documented so I don't know what to expect them to be doing.

It appears now to me that the line

    trx <- trackNames(ucscTableQuery(session))

tells me what I want to know.  Can you advise what further filtering for 'goodTracks' is supposed to provide me?

Thanks!

ADD REPLY
0
Entering edit mode

I think we may be getting off point here. The main point I was trying to make, originally, was that the underlying code does work, and the fact that you were killing the process early had more to do with impatience on your part than some sort of error in the code.

To show that the underlying code does work, I had to expose some of the code and I parallelized it so I didn't have to wait for some amount of time =< 35 minutes in order to show that it does work, for some definition of 'work'.

I'm not sure it's critical for you to understand what isGoodTrack does, but I would imagine that it's not really that hard to infer? You already know that we are getting back a bunch of tracks from UCSC, and then a function called isGoodTrack generates something that can be used to subset that set of tracks to just those that can be used to create a FeatureDb. And the function name is isGoodTrack?

So, the main point here is that

  1. The code does work, and you are killing the process early because you think it's frozen when in fact it is diligently working.
  2. The diligent work that is being done is repeated 35 times, and each time takes about a minute, +/-, which ends up being pretty inefficient, and it may be useful to parallelize if we expect people to be able to use supportedUCSDFeatureDbTracks without getting bored and killing the process. The downside of doing that would be the requirement that GenomicFeatures would have to depend on BiocParallel, for this one function that until now I didn't know existed. So the tradeoff would be to make the GenomicFeatures package more complex to speed up a function that maybe people don't even use?
ADD REPLY

Login before adding your answer.

Traffic: 558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6