Question

ATACseqQC memory overflow issue

0

Entering edit mode

José Luis Lavín ▴ 10

@jose-luis-lavin-5531

Last seen 6.8 years ago

Spain

Dear All,

I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly.

The proccess always crashes when I reach the same part of the workflow shown in the vignette this function

fragSize <- fragSizeDist(bamfile, bamfile.labels)

After a while running a memory overflow happens and the proccess gets killed.

I'm trying to analyze 100bp Illumina Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB.

Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway?

Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions?

Here is my code (up to the step it works):

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

setwd("/path_to_filees/BAM_dedup/")

file.list <- as.list(list.files(".", pattern = "*.bam$"))
file.list

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

bamQC(bamfile, outPath=NULL)

## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

And my Session Info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /opt/R/R-3.4.2/lib/libRblas.so
LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] MotifDb_1.20.0
 [2] phastCons100way.UCSC.hg19_3.6.0
 [3] GenomicScores_1.2.1
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] GenomicFeatures_1.30.0
 [6] AnnotationDbi_1.40.0
 [7] Biobase_2.38.0
 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 [9] BSgenome_1.46.0
[10] rtracklayer_1.38.2
[11] ChIPpeakAnno_3.12.4
[12] VennDiagram_1.6.18
[13] futile.logger_1.4.3
[14] GenomicRanges_1.30.1
[15] GenomeInfoDb_1.14.0
[16] Biostrings_2.46.0
[17] XVector_0.18.0
[18] IRanges_2.12.0
[19] ATACseqQC_1.2.7
[20] S4Vectors_0.16.0
[21] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.10.0           bitops_1.0-6
 [3] matrixStats_0.52.2            bit64_0.9-7
 [5] progress_1.1.2                httr_1.3.1
 [7] tools_3.4.2                   R6_2.2.2
 [9] rGADEM_2.26.0                 splitstackshape_1.4.2
[11] seqLogo_1.44.0                DBI_0.7
[13] lazyeval_0.2.1                colorspace_1.3-2
[15] ade4_1.7-10                   motifStack_1.22.0
[17] prettyunits_1.0.2             RMySQL_0.10.13
[19] bit_1.1-12                    curl_3.1
[21] compiler_3.4.2                graph_1.56.0
[23] DelayedArray_0.4.1            grImport_0.9-0
[25] scales_0.5.0                  randomForest_4.6-12
[27] RBGL_1.54.0                   stringr_1.2.0
[29] digest_0.6.12                 Rsamtools_1.30.0
[31] pkgconfig_2.0.1               htmltools_0.3.6
[33] ensembldb_2.2.0               limma_3.34.0
[35] regioneR_1.10.0               htmlwidgets_0.9
[37] rlang_0.1.4                   RSQLite_2.0
[39] BiocInstaller_1.28.0          shiny_1.0.5
[41] BiocParallel_1.12.0           RCurl_1.95-4.8
[43] magrittr_1.5                  GO.db_3.5.0
[45] GenomeInfoDbData_1.0.0        Matrix_1.2-11
[47] Rcpp_0.12.13                  munsell_0.4.3
[49] stringi_1.1.6                 yaml_2.1.16
[51] MASS_7.3-47                   SummarizedExperiment_1.8.1
[53] zlibbioc_1.24.0               plyr_1.8.4
[55] AnnotationHub_2.10.1          blob_1.1.0
[57] lattice_0.20-35               splines_3.4.2
[59] multtest_2.34.0               MotIV_1.34.0
[61] seqinr_3.4-5                  biomaRt_2.34.1
[63] futile.options_1.0.0          XML_3.98-1.9
[65] data.table_1.10.4-3           lambda.r_1.2
[67] idr_1.2                       httpuv_1.3.5
[69] assertthat_0.2.0              mime_0.5
[71] xtable_1.8-2                  AnnotationFilter_1.2.0
[73] survival_2.41-3               tibble_1.3.4
[75] GenomicAlignments_1.14.1      memoise_1.1.0
[77] interactiveDisplayBase_1.16.0

Thanks in advance!

JL

atac-seq software error atacseqqc • 2.5k views

ADD COMMENT • link updated 6.8 years ago by Haibo.Liu • 0 • written 6.8 years ago by José Luis Lavín ▴ 10

0

Entering edit mode

Hi JL,

Thank you for trying ATACseqQC. Please have a try to remove bamQC from you current script. And have a try. I'm working on bamQC to improve this issue.

Jianhong.

ADD REPLY • link 6.8 years ago Ou, Jianhong ★ 1.3k

0

Entering edit mode

Dear Jianhong,

I have tried the following script (the bamQC command actually worked). I tried your suggestion using only a couple of chromosomes (e.g. 1 and 2) and it crashes at the a point a little bit further...

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

file.list <- as.list(list.files(".", pattern = "*.bam$"))

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

outPath <- "splited"
dir.create(outPath)

bamQC(bamfile, outPath=NULL)

## estimate Library Complexity---------------------------------------------
estimateLibComplexity(readsDupFreq(bamfile))
## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

## ------------------------------------------------------------------------
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath

## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)

seqlev = paste0("chr", c(1:2, "X", "Y"))
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)#######<- Crashes here #################
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)

SessionInfo is the same than the previous but now I am using 96 GB of RAM

Any thoughts/Suggestions or update of the package ?

Thanks in advance for your kind help

Best

JL

ADD REPLY • link 6.8 years ago José Luis Lavín ▴ 10

0

Entering edit mode

Getting also process killed when reading huge bam file ~ 9G with readBamFile function.

ADD REPLY • link 6.2 years ago ZheFrench ▴ 60

0

Entering edit mode

One of the developers, Jianhong, will work on making it work with big bam files. For now, please try to generate the plots using a couple of chromosomes from your BAM files. Thanks!

Best regards,

Julie

ADD REPLY • link 6.2 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Thanks to Jianhong, the memory issue has be fixed in the development version.

Here are the scripts to install the dev version of ATACseqQC.

library(devtools); install_github("jianhong/ATACseqQC")

Please set bigFile=TRUE when using readBamFile. Thanks!

Best regards,

Julie

ADD REPLY • link 6.2 years ago Julie Zhu ★ 4.3k

score 0 · Answer 1 · 2018-02-12

José, Thanks for the feedback! Jianhong, one of the developers will work with you on the coding side. For now, please try to generate the plots using a couple of chromosomes from your BAM files. FYI, we have tested the effects of sequencing depth on the patterns in various diagnostic plots using subsamples of the BAM files from a successful experiment. Our results show that the fragment size distribution and the aggregated signals around TSSs from the subsamples with as low as 2.6 million uniquely mapped reads exhibit similar patterns to that of the full dataset (more than 26 million uniquely mapped reads) and are easily distinguishable from that of the failed experiment. Best regards, Julie From: "José Luis Lavín [bioc]" <noreply@bioconductor.org> Reply-To: "reply+870ae661+code@bioconductor.org" <reply+870ae661+code@bioconductor.org> Date: Monday, February 12, 2018 at 11:10 AM To: "Zhu, Lihua (Julie)" <julie.zhu@umassmed.edu> Subject: [bioc] ATACseqQC memory overflow issue Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User José Luis Lavín<https: support.bioconductor.org="" u="" 5531=""/> wrote Question: ATACseqQC memory overflow issue<https: support.bioconductor.org="" p="" 105904=""/>: Dear All, I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly. The proccess always crashes when I reach the same part of the workflow shown in the vignette this function fragSize <- fragSizeDist(bamfile, bamfile.labels) After a while running a memory overflow happens and the proccess gets killed. I'm trying to analyze 100bp Illumina Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB. Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway? Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions? Here is my code (up to the step it works): library(ATACseqQC) library(ChIPpeakAnno) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(phastCons100way.UCSC.hg19) library(MotifDb) knitr::opts_chunk$set(warning=FALSE, message=FALSE) setwd("/path_to_filees/BAM_dedup/") file.list <- as.list(list.files(".", pattern = "*.bam$")) file.list bamfile <-file.list[[1]] bamfile.labels <- gsub(".bam", "", basename(bamfile)) bamQC(bamfile, outPath=NULL) ## generate fragement size distribution fragSize <- fragSizeDist(bamfile, bamfile.labels) And my Session Info: > sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Scientific Linux 7.4 (Nitrogen) Matrix products: default BLAS: /opt/R/R-3.4.2/lib/libRblas.so LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] MotifDb_1.20.0 [2] phastCons100way.UCSC.hg19_3.6.0 [3] GenomicScores_1.2.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [5] GenomicFeatures_1.30.0 [6] AnnotationDbi_1.40.0 [7] Biobase_2.38.0 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0 [9] BSgenome_1.46.0 [10] rtracklayer_1.38.2 [11] ChIPpeakAnno_3.12.4 [12] VennDiagram_1.6.18 [13] futile.logger_1.4.3 [14] GenomicRanges_1.30.1 [15] GenomeInfoDb_1.14.0 [16] Biostrings_2.46.0 [17] XVector_0.18.0 [18] IRanges_2.12.0 [19] ATACseqQC_1.2.7 [20] S4Vectors_0.16.0 [21] BiocGenerics_0.24.0 loaded via a namespace (and not attached): [1] ProtGenerics_1.10.0 bitops_1.0-6 [3] matrixStats_0.52.2 bit64_0.9-7 [5] progress_1.1.2 httr_1.3.1 [7] tools_3.4.2 R6_2.2.2 [9] rGADEM_2.26.0 splitstackshape_1.4.2 [11] seqLogo_1.44.0 DBI_0.7 [13] lazyeval_0.2.1 colorspace_1.3-2 [15] ade4_1.7-10 motifStack_1.22.0 [17] prettyunits_1.0.2 RMySQL_0.10.13 [19] bit_1.1-12 curl_3.1 [21] compiler_3.4.2 graph_1.56.0 [23] DelayedArray_0.4.1 grImport_0.9-0 [25] scales_0.5.0 randomForest_4.6-12 [27] RBGL_1.54.0 stringr_1.2.0 [29] digest_0.6.12 Rsamtools_1.30.0 [31] pkgconfig_2.0.1 htmltools_0.3.6 [33] ensembldb_2.2.0 limma_3.34.0 [35] regioneR_1.10.0 htmlwidgets_0.9 [37] rlang_0.1.4 RSQLite_2.0 [39] BiocInstaller_1.28.0 shiny_1.0.5 [41] BiocParallel_1.12.0 RCurl_1.95-4.8 [43] magrittr_1.5 GO.db_3.5.0 [45] GenomeInfoDbData_1.0.0 Matrix_1.2-11 [47] Rcpp_0.12.13 munsell_0.4.3 [49] stringi_1.1.6 yaml_2.1.16 [51] MASS_7.3-47 SummarizedExperiment_1.8.1 [53] zlibbioc_1.24.0 plyr_1.8.4 [55] AnnotationHub_2.10.1 blob_1.1.0 [57] lattice_0.20-35 splines_3.4.2 [59] multtest_2.34.0 MotIV_1.34.0 [61] seqinr_3.4-5 biomaRt_2.34.1 [63] futile.options_1.0.0 XML_3.98-1.9 [65] data.table_1.10.4-3 lambda.r_1.2 [67] idr_1.2 httpuv_1.3.5 [69] assertthat_0.2.0 mime_0.5 [71] xtable_1.8-2 AnnotationFilter_1.2.0 [73] survival_2.41-3 tibble_1.3.4 [75] GenomicAlignments_1.14.1 memoise_1.1.0 [77] interactiveDisplayBase_1.16.0 Thanks in advance! JL Post tags: atac-seq, software error You may reply via email or visit ATACseqQC memory overflow issue

score 0 · Answer 2 · 2018-02-21

Please refer to my git hub scripts: https://github.com/seaYali/ATACseqQC/blob/master/ScriptsForATACseqQCCaseStudy.txt for your analysis. Currently, the bottle neck of this package is the BamQC.R function. I did this step using samtools instead for efficinecy. Importantly, for the purpose of QC, You don't need to use the full bam except for BAM QC. Using data from a few chromosomes will be good enough. Using Phastcons information to bin reads into mono-, di- or tri-nucleosome is not necessary, and increase the computing time significantly.