Correct way to load a package inside bplapply function
@felixernst-13698
Last seen 5.0 years ago

Hello,

I am trying to figure out the correct way to load a package inside a function called with bplapply. The function is defined inside an S4 method, which is part of a package:

setMethod(
    f = "analyze",
    signature = signature(.Object = "testClass",
                          experimentNo = "numeric"),
    definition = function(.Object,
                          experimentNo){

        # here some stuff happens, which generates a list of inputFiles

        FUN <- function(x,
                        .Object,
                        workDir){
            requireNamespace("tools", quietly = TRUE)
            requireNamespace("Rsamtools", quietly = TRUE)
            requireNamespace("S4Vectors", quietly = TRUE)

            # requireNamespace() loads but does not attach, so md5sum() is qualified
            fileNameReadCache <- paste0(workDir, "cache_", tools::md5sum(x), ".RData")
            if(!file.exists(fileNameReadCache)){
                # Call to an S4 function reading in a Bam file and returning a DataFrame
                resReads <- .summarizeReadData(.Object, x)
                save(resReads, file = fileNameReadCache)
            } else {
                load(fileNameReadCache)
            }
            resReads  # return the data rather than the value of save()/load()
        }
        list <- bplapply(inputFiles,
                         FUN,
                         .Object = .Object,
                         workDir = workDir)

        # do some stuff with the data

    }
)

 

This causes the following output to be displayed in the console several times (once for each worker):

<environment: namespace:base>
            cpu
            elapsed
            transient
              <environment: namespace:base>
            package
            ...
            quietly
            [1]
          e
            [2]
              <environment: namespace:tools>
            files	

I tried setting the log threshold to WARN but this did not help.

bpparam <- bpparam()
bpthreshold(bpparam) <- "WARN"
register(bpparam, default = TRUE)

Does anyone have any advice for me?

Thanks in advance.

Edit:

- added sessionInfo() output

- modified example function

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                  
[5] LC_TIME=German_Germany.1252   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] RPF_0.1.1.9076

loaded via a namespace (and not attached):
  [1] Category_2.42.1            bitops_1.0-6               matrixStats_0.52.2         bit64_0.9-7              
  [5] httr_1.3.1                 RColorBrewer_1.1-2         GenomeInfoDb_1.12.2        Rgraphviz_2.20.0         
  [9] tools_3.4.1                backports_1.1.0            R6_2.2.2                   KernSmooth_2.23-15       
 [13] rpart_4.1-11               Hmisc_4.0-3                DBI_0.7                    lazyeval_0.2.0           
 [17] BiocGenerics_0.22.0        colorspace_1.3-2           nnet_7.3-12                gridExtra_2.2.1          
 [21] DESeq2_1.16.1              bit_1.1-12                 compiler_3.4.1             graph_1.54.0             
 [25] Biobase_2.36.2             htmlTable_1.9              Cairo_1.5-9                xtail_1.1.5              
 [29] DelayedArray_0.2.7         rtracklayer_1.36.4         KEGGgraph_1.38.1           caTools_1.17.1           
 [33] scales_0.4.1               checkmate_1.8.3            genefilter_1.58.1          RBGL_1.52.0              
 [37] stringr_1.2.0              digest_0.6.12              Rsamtools_1.28.0           foreign_0.8-69           
 [41] AnnotationForge_1.18.1     XVector_0.16.0             base64enc_0.1-3            pkgconfig_2.0.1          
 [45] htmltools_0.3.6            limma_3.32.5               htmlwidgets_0.9            rlang_0.1.2              
 [49] RSQLite_2.0                bindr_0.1                  GOstats_2.42.0             gtools_3.5.0             
 [53] BiocParallel_1.10.1        xlsx_0.5.7                 acepack_1.4.1              dplyr_0.7.2              
 [57] RCurl_1.95-4.8             magrittr_1.5               GO.db_3.4.1                GenomeInfoDbData_0.99.0  
 [61] Formula_1.2-2              Matrix_1.2-11              Rcpp_0.12.12               munsell_0.4.3            
 [65] S4Vectors_0.14.3           pathview_1.16.5            stringi_1.1.5              SummarizedExperiment_1.6.3
 [69] zlibbioc_1.22.0            gplots_3.0.1               plyr_1.8.4                 grid_3.4.1               
 [73] blob_1.1.0                 gdata_2.18.0               parallel_3.4.1             lattice_0.20-35          
 [77] Biostrings_2.44.2          splines_3.4.1              xlsxjars_0.6.1             GenomicFeatures_1.28.4   
 [81] annotate_1.54.0            KEGGREST_1.16.1            locfit_1.5-9.1             knitr_1.17               
 [85] GenomicRanges_1.28.4       reshape2_1.4.2             geneplotter_1.54.0         codetools_0.2-15         
 [89] biomaRt_2.32.1             stats4_3.4.1               XML_3.98-1.9               glue_1.1.1               
 [93] latticeExtra_0.6-28        data.table_1.10.4          png_0.1-7                  foreach_1.4.3            
 [97] RDAVIDWebService_1.14.0    gtable_0.2.0               assertthat_0.2.0           ggplot2_2.2.1            
[101] xtable_1.8-2               survival_2.41-3            tibble_1.3.3               rJava_0.9-8              
[105] iterators_1.0.8            GenomicAlignments_1.12.2   AnnotationDbi_1.38.2       memoise_1.1.0            
[109] IRanges_2.10.2             bindrcpp_0.2               cluster_2.0.6              LSD_3.0                  
[113] GSEABase_1.38.0

Can you provide (edit your question) the output of sessionInfo(), and also a completely reproducible example? I can't replicate your problem from the information you provide.


Thanks for the reply. I changed the initial post accordingly.

Do you know how this output is created? I don't recognize its format. It is neither a startup message nor a warning.

To add a bit more context, I switched from using parallel to BiocParallel. I did not change anything inside the functions, and that is when the output started to appear.


Hi Martin,

I did some more digging. It looks to me like this might be output that one could get from selectMethod:
 

     seqinfo
          check.names
            seqnames
            ranges
            strand
            mcols
            seqlengths
            seqinfo
              <environment: namespace:S4Vectors>
            ...
            row.names
            check.names
  silent
  use.names
  length.out
  drop
      recursive
      use.names
    unique
            listData
            rownames
            nrows
            check
              <environment: namespace:S4Vectors>
            x
              <environment: namespace:base>
            mode
            length
              <environment: namespace:S4Vectors>
            ...
            check
              <environment: namespace:S4Vectors>
            disabled
        envir
            envir
              [1]
            [1]
            [2]
          names
              [2]
              [3]
              [4]
              [5]
            names
              <environment: namespace:GenomicRanges>
            Class
            seqnames
            ranges
            strand
            mcols
            seqlengths
            seqinfo
    levels
          levels
            seqnames
            ranges
            strand
            elementMetadata
            seqinfo
              <environment: namespace:IRanges>
            start
            end
            width
            names
      start
      width
      NAMES
      check
          start
          end
          width
              <environment: namespace:IRanges>
            start
            end
            width
            PACKAGE
              <environment: namespace:IRanges>
            x
            argname
              <environment: namespace:S4Vectors>
            value
            x
              <environment: namespace:S4Vectors>
            values
            lengths
            check
            PACKAGE
              <environment: namespace:stats>
            object
            nm
              <environment: namespace:GenomeInfoDb>
            seqnames
            seqlengths
            isCircular
            genome
            seqnames
            seqlengths
            is_circular
            genome
              <environment: namespace:GenomeInfoDb>

This is another example. The output is a mile long (more than 1000 lines), so I don't want to post it here in full. I recognize the names of functions that I use, but apart from that I don't know what else I could add.

Thanks in advance for any help.

@martin-morgan-1513
Last seen 5 months ago
United States

In general I would expect a simple loadNamespace() to be sufficient, and to not produce spurious output (perhaps suppressPackageStartupMessages(loadNamespace()) would be better). So your report is either a bug or something unique to your system. 

You should update to R-3.4.1 and the current version of Bioconductor (3.5). This is because, if it is a bug, it may have already been fixed. And also, bug fixes can only be introduced into the current release of Bioconductor packages.

You should then try to reproduce this with a much simpler example, e.g., running the following code in a new R session

library(BiocParallel)
FUN = function(...) {
    suppressPackageStartupMessages({
        requireNamespace("tools")
    })
}
bpparam <- bpparam()
bpthreshold(bpparam) <- "WARN"
xx = bplapply(1:5, FUN, BPPARAM=bpparam)

This will help to isolate whether the problem is with BiocParallel, or with an interaction with other packages in your session.


 

Thanks for the advice. I will do that in the next couple of days. I tried updating to R 3.4.0 a couple of months ago, but couldn't, since some dependencies were not up to date.

What is your view on using loadNamespace vs requireNamespace? The function producing the weird output is designed to be part of a package, and since I am quite new to that aspect of R, I have read a lot about it. Among other things I read the books by Hadley Wickham, which advise using just requireNamespace. Do you think this makes a difference?

I already tried the suppressPackageStartupMessages approach this morning, and also suppressWarnings. Neither changes the output. Given how often parts of the output are repeated, my guess is that it is generated when a function is called rather than when the package is loaded. The output really is a mile long.
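As a side note on why the suppress* wrappers may have no effect: they only muffle conditions (messages and warnings), not text written directly to the console with cat() or print(). The sketch below illustrates this with a hypothetical function noisy(), invented purely for the demonstration. It does not identify where the output in this thread comes from.

```r
# noisy() is a made-up stand-in for "some function that produces output".
noisy <- function() {
    packageStartupMessage("startup message")  # a condition, can be suppressed
    warning("a warning")                      # a condition, can be suppressed
    cat("printed text\n")                     # plain stdout, not a condition
    invisible(TRUE)
}

# The wrappers silence the message and the warning; the cat() output is
# still written to stdout, where capture.output() picks it up.
out <- capture.output(
    res <- suppressWarnings(suppressPackageStartupMessages(noisy()))
)
```

If out is non-empty after a call like this, the text was printed rather than signalled, so the suppress* functions and the log threshold would not be expected to hide it.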

 


Make sure you are not installing your updated version into the same directory as the old version (check .libPaths() in the old and new version, and make sure that they are either different or any packages installed under 3.3.* are not present when using the path under 3.4.*).
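A rough way to perform this check from within R, using only base functions, is sketched below. The variable names are illustrative, and comparing on the major.minor series (e.g. "3.4") is an assumption about what counts as "the same version" here.

```r
# Library directories searched by the current R session.
libs <- .libPaths()

# For every installed package: its name, where it lives, and the R version
# it was built under.
pkgs <- installed.packages(lib.loc = libs)[, c("Package", "LibPath", "Built"),
                                           drop = FALSE]

# Release series of the running R, e.g. "3.4" for R 3.4.1.
series <- paste(R.version$major,
                strsplit(R.version$minor, ".", fixed = TRUE)[[1]][1],
                sep = ".")

# Packages built under a different series are candidates for reinstallation.
built <- unname(pkgs[, "Built"])
stale <- pkgs[!is.na(built) & !startsWith(built, paste0(series, ".")), ,
              drop = FALSE]
```

An empty stale matrix suggests that no packages from an older R series are visible on the current library paths.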

loadNamespace() and requireNamespace() differ essentially in their return value and messages; they are not functionally different. loadNamespace("faux-package") signals an error, whereas requireNamespace("faux-package") reports the failure (unless quietly = TRUE) and returns FALSE; the latter is easier to recover from (if (!requireNamespace("faux-package")) ...) when there is some sane alternative to using the faux package.
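The difference in how failure is reported can be seen in a small sketch. The package name "fauxpackage" is hypothetical and assumed not to be installed.

```r
# Hypothetical package name, assumed not to exist on this system.
pkg <- "fauxpackage"

# loadNamespace() signals an error, which has to be caught explicitly.
loaded <- tryCatch({
    loadNamespace(pkg)
    TRUE
}, error = function(e) FALSE)

# requireNamespace() reports failure through its return value instead.
available <- requireNamespace(pkg, quietly = TRUE)

if (!available) {
    message("Package '", pkg, "' is not available; using a fallback")
}
```

Inside a package function, the requireNamespace() form is convenient when a fallback exists; when the package is strictly required, letting loadNamespace() fail with its error is equally reasonable.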


The problem exists with 3.4.1 in a fresh install as well. The example function does not produce any output, so the problem has to be connected to something else.

Therefore I modified my function step by step, commented out all the function calls and simplified some things, and ended up with this:

library("BiocParallel", quietly = TRUE)
FUN <- function(x,
                .Object,
                workDir){
    fileNameReadCache <- paste0(workDir, "ReadsORF_Reads_cache_", tools::md5sum(x), ".RData")
    return(gene = data.frame())
}

data <- vector(mode="list", length = length(bamFilesRibo))
data <- bplapply(bamFilesRibo,
                 FUN,
                 .Object = .Object,
                 workDir = workDir)
       

This still produces the output. Prior to the library("BiocParallel") call, there are no additional library, require or similar function calls.

The whole code snippet resides inside an S4 method, which is part of a package. By default I load the following namespaces in the package:

requireNamespace("SummarizedExperiment")
requireNamespace("BiocParallel")
requireNamespace("Biostrings")
requireNamespace("rtracklayer")
requireNamespace("GenomicRanges")

I don't know whether this setup has anything to do with the problem, but if I just copy and paste the function call into the session (with bamFilesRibo <- list("la","la")), no output is produced.


Addition, because of the 5000-character limit:

The output is of course much shorter now, and it appears once for each file loaded:

 <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class
              <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class
              <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class
              <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class
              <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class
              <environment: namespace:base>
            cpu
            elapsed
            transient
          class
              <environment: namespace:tools>
            files
          class
          class
          class

Can you paste your sessionInfo() after loading BiocParallel?


Sorry, I forgot that. I updated the sessionInfo() output in the original post, since there is a 5000-character limit for replies.

The dependencies can be installed using this:
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite("devtools")
devtools::find_rtools()
library(devtools)
biocLite("DESeq2")
install_github("xryanglab/xtail")
biocLite(c('rtracklayer', 'Rsamtools', 'Biostrings', 'GenomicFeatures', 'GenomicAlignments', 'RDAVIDWebService', 'pathview', 'foreach', 'Cairo', 'gplots', 'LSD', 'limma', 'xlsx', 'dplyr'))

 
