BSmooth error during the smoothing step for a subset of files
1
0
Entering edit mode
brandon.le • 0
@brandonle-12677
Last seen 7.7 years ago

I have a problem while smoothing using the bsseq package. I imported bismark coverage files that I previously separated based on cytosine context (e.g., CG, CHG, CHH). I was able to processed data for the CG and CHG context  following the manual but when I tried to process the CHH data, I obtained the error below:

> library("bsseq")
> list.files(pattern="*.Gm01.CHH.gz") -> cov.files
> read.bismark(files=cov.files,sampleNames = gsub(".Gm01.CHH.gz","",cov.files), rmZeroCov = TRUE) -> bsseq.data
Assuming file type iscov
Read 13672009 rows and 6 (of 6) columns from 0.373 GB file in 00:00:04
done in 8.2 secs
Read 12227559 rows and 6 (of 6) columns from 0.336 GB file in 00:00:04
done in 7.7 secs
Read 13374971 rows and 6 (of 6) columns from 0.368 GB file in 00:00:04
done in 8.8 secs
Read 13209474 rows and 6 (of 6) columns from 0.365 GB file in 00:00:04
done in 7.9 secs
[read.bismark] Joining samples ... done in 29.9 secs
> bsseq.data
An object of type 'BSseq' with
  14696052 methylation loci
  4 samples
has not been smoothed
> bsseq.data.smoothed <- BSmooth(bsseq.data, mc.cores=8, parallelBy="chromosome", verbose=TRUE)
[BSmooth] preprocessing ... done in 7.6 sec
[BSmooth] smoothing by 'chromosome' (mc.cores = 8, mc.preschedule = FALSE)
Error in BSmooth(bsseq.data, mc.cores = 8, parallelBy = "chromosome",  :
  BSmooth encountered smoothing errors
In addition: Warning message:
In mclapply(1:length(clusterIdx), function(ii) { :
  1 function calls resulted in an error

Since there are more CHH sites in the genome than the other two contexts, I thought the number of methylation loci was too much so I was going to smooth one chromosome at a time and then combine them afterwards. However, I obtained the same error message. I've tried this on several datasets and it always fail on the CHH separated files. Do you have any ideas?

Thank you.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8  
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.24.0       edgeR_3.16.5               limma_3.30.13              bsseq_1.10.0               systemPipeR_1.8.1        
[6] RSQLite_1.1-2              DBI_0.6                    ShortRead_1.32.1           GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0
[11] Biobase_2.34.0             BiocParallel_1.8.1         Rsamtools_1.26.1           Biostrings_2.42.1          XVector_0.14.1           
[16] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10           locfit_1.5-9.1         lattice_0.20-34        GO.db_3.4.0            gtools_3.5.0           digest_0.6.12        
[7] plyr_1.8.4             futile.options_1.0.0   BatchJobs_1.6          backports_1.0.5        ggplot2_2.2.1          zlibbioc_1.20.0      
[13] GenomicFeatures_1.26.3 data.table_1.10.4      annotate_1.52.1        R.oo_1.21.0            R.utils_2.5.0          Matrix_1.2-8         
[19] checkmate_1.8.2        GOstats_2.40.0         splines_3.3.1          stringr_1.2.0          pheatmap_1.0.8         RCurl_1.95-4.8       
[25] biomaRt_2.30.0         munsell_0.4.3          sendmailR_1.2-1        rtracklayer_1.34.2     base64enc_0.1-3        BBmisc_1.11          
[31] fail_1.3               matrixStats_0.51.0     XML_3.98-1.5           AnnotationForge_1.16.1 R.methodsS3_1.7.1      bitops_1.0-6         
[37] grid_3.3.1             RBGL_1.50.0            xtable_1.8-2           GSEABase_1.36.0        gtable_0.2.0           magrittr_1.5         
[43] scales_0.4.1           graph_1.52.0           stringi_1.1.3          hwriter_1.3.2          genefilter_1.56.0      latticeExtra_0.6-28  
[49] futile.logger_1.4.3    brew_1.0-6             rjson_0.2.15           lambda.r_1.1.9         RColorBrewer_1.1-2     tools_3.3.1          
[55] Category_2.40.0        survival_2.40-1        AnnotationDbi_1.36.2   colorspace_1.3-2      
bsseq bsmooth • 2.3k views
ADD COMMENT
0
Entering edit mode

Hi Brandon,

Can you produce and share a minimal CHH file that reproduces the error? I've not had issues specific to CHH files.

Cheers,

Pete

ADD REPLY
0
Entering edit mode

Hi Peter,

I attached a link containing the smallest CHH files that generate the error. Interestingly, when I reduced the same data file to contain 1M or 10M loci, the smoothing step worked without any error. When I increased the number of loci to 12M, the error appeared. This is odd since the number of loci for CHG was > 35M and it worked okay.

https://www.dropbox.com/sh/ohxfb4dujqkbuxe/AACjXl6T1lezDjuDVo1y-8rfa?dl=0

Thank you.

Best,

Brandon

ADD REPLY
0
Entering edit mode

Thanks, Brandon. I was able to reproduce the error using the BS1.12M.CHH.gz file. Ultimately, an error is occurring when locfdr::locfit() is called from within bsseq::BSmooth(). Unfortunately, because it's a little buried and calling some fairly complicated code in locfdr, this means it will take a little work to track down and fix. I'll spend some time on it this week and post back when I have more. 

Thanks for your patience,

Pete

ADD REPLY
0
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 7 weeks ago
WEHI, Melbourne, Australia

Hi Brandon,

This issue is that the default BSmooth() parameters, especially nsh, and maxgap, are tuned for the distribution of CG loci in human/mammalian genomes [1]. You are using these default parameters to analyse CHH loci from what I'm guessing is soybean (based on 'Gm01' in your example), and these CHH loci will have a very different spatial distribution to mammalian CG loci. Basically, what is happening is that locfit errors because it doesn't like this choice of smoothing parameters given the spatial distribution of CHH loci in your data.

But there is good news. I was able to run BSmooth() on the BS1.12M.CHH.gz example by increasing ns to 700 and h to 10000. This was a fairly arbitrary choice of parameters, but demonstrates that by modifying the smoothing parameters we can run BSmooth() on these data.

We do not currently have recommended parameters for BSmooth() for non-mammalian, non-CG loci, so you will need to experiment to see what gives a reasonable looking smooth. I recommend focusing on a single sample and small-ish chromosome (or even a part of a chromosome) and plotting the results of various smooth fits along with the raw data.

Cheers,

Pete

 

[1] The ns and h parameters ultimately influence how parameters in locfit::lp() and locfit::locfit() which are called by BSmooth().  As a user of bsseq, you cannot set these locfit parameters directly.

ADD COMMENT
0
Entering edit mode

Hi Peter,

I will try your suggestion with different ns and h to see what works best for soybean.  Thank you very much for looking into this problem. Are there plans to examine parameters for non-mammalian, non-CG loci in the future? In plants, non-CG loci are highly prevalent.

Best,

Brandon

ADD REPLY
0
Entering edit mode

I'm currently exploring parameter choices for non-CG loci, but in human. I recognise that non-CG loci are prevalent and important for methylation in plants and that it would be good to provide recommendations. The simple reason we don't is because neither Kasper (main developer) nor myself analyse plant data; it'd be great if you can contribute your experience back to further improve bsseq.

ADD REPLY

Login before adding your answer.

Traffic: 1024 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6