Question

Trouble creating an RnBeads annotation for rn6 rat genome build

0

Entering edit mode

richardjacton • 0

@richardjacton-19729

Last seen 5.9 years ago

I've been trying to use the RnBeadsAnnotationCreator (https://github.com/epigen/RnBeadsAnnotationCreator) to create an annotation for the rn6 build of the rat genome. I started with the createAnnotationPackage.rn5.R file and updated it for rn6 see: but i've been unable to get it to build the rn6 annotation.

When I call: createAnnotationPackage("rn6", dest = "~/Documents/myCustomRlibs/RnBeads.rn6") The VCF files apprear to be sucessfully downloaded but contain this warning in the output:

2019-02-04 14:57:56     1.8 WARNING             Could not detect MAF data (CAF field)
2019-02-04 14:57:56     1.8 WARNING             Could not detect MAF data (G5 field); allele frequency is not considered

And then I get this error:

2019-02-04 14:59:09     2.3   ERROR         dbSNP table(s) are too large. MapReduce implemented in ~/Documents/myCustomRlibs/RnBeads.rn6/RnBeads.rn6/temp/split.snps.R
Error in logger.error(paste("dbSNP table(s) are too large. MapReduce implemented in",  : 
  dbSNP table(s) are too large. MapReduce implemented in ~/Documents/myCustomRlibs/RnBeads.rn6/RnBeads.rn6/temp/split.snps.R

I took a look at the source of the rnb.update.dbsnp() function (https://rdrr.io/github/epigen/RnBeadsAnnotationCreator/src/R/snps.R), which appears to be the source of the error, but haven't been able to figure out why it's triggering an error.

Any suggestions would be much appreciated.

SessionInfo() :

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /home/local/software/R/3.4.2/lib64/R/lib/libRblas.so
LAPACK: /home/local/software/R/3.4.2/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Rnorvegicus.UCSC.rn6_1.4.1    
 [2] BSgenome_1.46.0                        
 [3] rtracklayer_1.38.3                     
 [4] RnBeadsAnnotationCreator_0.99.0        
 [5] RnBeads_1.10.8                         
 [6] plyr_1.8.4                             
 [7] methylumi_2.24.1                       
 [8] minfi_1.24.0                           
 [9] bumphunter_1.20.0                      
[10] locfit_1.5-9.1                         
[11] Biostrings_2.46.0                      
[12] XVector_0.18.0                         
[13] SummarizedExperiment_1.8.1             
[14] DelayedArray_0.4.1                     
[15] FDb.InfiniumMethylation.hg19_2.2.0     
[16] org.Hs.eg.db_3.5.0                     
[17] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[18] GenomicFeatures_1.30.3                 
[19] AnnotationDbi_1.40.0                   
[20] reshape2_1.4.3                         
[21] scales_0.5.0                           
[22] Biobase_2.38.0                         
[23] illuminaio_0.20.0                      
[24] matrixStats_0.53.1                     
[25] limma_3.34.9                           
[26] gridExtra_2.3                          
[27] gplots_3.0.1                           
[28] ggplot2_3.1.0                          
[29] fields_9.6                             
[30] maps_3.3.0                             
[31] spam_2.2-1                             
[32] dotCall64_1.0-0                        
[33] ff_2.2-14                              
[34] bit_1.1-14                             
[35] cluster_2.0.7-1                        
[36] MASS_7.3-50                            
[37] GenomicRanges_1.30.3                   
[38] GenomeInfoDb_1.14.0                    
[39] IRanges_2.12.0                         
[40] S4Vectors_0.16.0                       
[41] BiocGenerics_0.24.0                    
[42] doParallel_1.0.14                      
[43] iterators_1.0.10                       
[44] foreach_1.4.4                          

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2         siggenes_1.52.0          mclust_5.4              
 [4] base64_2.0               bit64_0.9-7              xml2_1.2.0              
 [7] codetools_0.2-15         splines_3.4.2            Rsamtools_1.30.0        
[10] annotate_1.56.2          readr_1.3.1              compiler_3.4.2          
[13] httr_1.3.1               assertthat_0.2.0         Matrix_1.2-14           
[16] lazyeval_0.2.1           prettyunits_1.0.2        tools_3.4.2             
[19] bindrcpp_0.2.2           gtable_0.2.0             glue_1.3.0              
[22] GenomeInfoDbData_1.0.0   dplyr_0.7.5              doRNG_1.6.6             
[25] Rcpp_1.0.0               multtest_2.34.0          gdata_2.18.0            
[28] preprocessCore_1.40.0    nlme_3.1-137             stringr_1.3.1           
[31] rngtools_1.3.1           gtools_3.5.0             XML_3.98-1.11           
[34] beanplot_1.2             zlibbioc_1.24.0          hms_0.4.2               
[37] GEOquery_2.46.15         RColorBrewer_1.1-2       memoise_1.1.0           
[40] pkgmaker_0.27            biomaRt_2.34.2           reshape_0.8.7           
[43] stringi_1.2.4            RSQLite_2.1.1            genefilter_1.60.0       
[46] RMySQL_0.10.15           caTools_1.17.1           BiocParallel_1.12.0     
[49] bibtex_0.4.2             rlang_0.3.1              pkgconfig_2.0.2         
[52] bitops_1.0-6             nor1mix_1.2-3            lattice_0.20-35         
[55] purrr_0.3.0              bindr_0.1.1              GenomicAlignments_1.14.2
[58] tidyselect_0.2.5         magrittr_1.5             R6_2.2.2                
[61] DBI_1.0.0                pillar_1.3.1             withr_2.1.2             
[64] survival_2.42-3          RCurl_1.95-4.10          tibble_2.0.1            
[67] crayon_1.3.4             KernSmooth_2.23-15       progress_1.2.0          
[70] data.table_1.11.4        blob_1.1.1               digest_0.6.15           
[73] xtable_1.8-2             tidyr_0.8.2              openssl_1.0.1           
[76] munsell_0.5.0            registry_0.5             quadprog_1.5-5

RnBeads rn6 RnBeadsAnnotationCreator • 1.3k views

ADD COMMENT • link updated 5.9 years ago by mscherer ▴ 50 • written 5.9 years ago by richardjacton • 0

score 0 · Answer 1 · 2019-02-05

Hey Richard,

I just had a look into the code for the annotation creator and found the following function definition:

createAnnotationPackage <- function(assembly,dest=getwd(),maxSNPs=500000L,cleanUp=TRUE)

The parameter maxSNPs accounts for the maximal number of SNPs allowed in any given annotation. If the number is larger than this value, the annotation creator tries to distribute the jobs over a high performance compute cluster using MapReduce (that is the error you get). The error is thrown in lines 304-313 of snps.R

    if (.globals[['SNP_MAX']] != 0 && .globals[['SNP_MAX']] <= max(sapply(snps, nrow))) {
    fname <- normalizePath(fname, '/')
    dname <- normalizePath(file.path(.globals[["DIR.PACKAGE"]], "temp", "qsub"), '/', FALSE)
    txt <- c('suppressPackageStartupMessages(library(RnBeadsAnnotationCreator))',
        paste0('txt <- rnb.split.snps("', fname, '", "', dname, '")'),
        'cat(txt, sep = "\\n", file = "split.snps.log")')
    fname <- file.path(.globals[["DIR.PACKAGE"]], "temp", "split.snps.R")
    cat(paste(txt, collapse = "\n"), file = fname)
    logger.error(paste('dbSNP table(s) are too large. MapReduce implemented in', fname))
}

Further information on this process can be found in the documentation of the createAnnotationPackage function:

#' @param maxSNPs  Maximum number of dbSNP records in preprocessed table for a single chromosome, given as a single
#'                 non-negative \code{integer} value. If a larger table (than this threshold) is found, the annotation
#'                 creator stops and suggests a solution that utilizes the environment of a computational cluster.
#'                 Setting this parameter to \code{0} disables such a test. This parameter is used only when SNP data is
#'                 available for the targeted genome assembly.

So you have two possibilities: Either increase the value for MAX_SNPs, or consider distributing this over a high performance compute cluster, in case so have such available.

Hope that helps, Michael