I hope someone can help with this issue.
I have 8 BAM files from an mm9 alignment, each ~4-5 GB in size. When I run summarizeOverlaps on 3 files, it takes 2-3 hours to finish, and it works, although my computer almost freezes up. But when I run summarizeOverlaps on all 8 BAM files together and leave it overnight (as it takes too long to wait), the computer freezes (although it is a 16 GB i7 Mac, so it is supposed to be powerful) and the command never returns anything. I even let it run for 30 hours, and it looked like it was consuming memory (~600 MB of RAM), but I still got nothing. I had to reboot the laptop.
I am making my own TxDb file from the GTF that I used for the alignment, so that the chromosome names match (script is below).
Do you have any tips on how I can get summarizeOverlaps to work on the 8 files and create one SummarizedExperiment (se) object without freezing up the computer? I have been trying to do this for the past 2 weeks, always with the same result.
Any input is appreciated.
Here's the script:

library("DESeq2")
library("GenomicFeatures")
library("Rsamtools")
library("GenomicAlignments")
library("GenomicRanges")

mm9_from_cluster_gtf_txdb <- makeTranscriptDbFromGFF(file="~/Desktop/genes.gtf", format="gtf")
head(seqlevels(mm9_from_cluster_gtf_txdb))
saveDb(mm9_from_cluster_gtf_txdb, file="/Path/To/Libraries/TxDB/mm9_from_cluster_Ensembl_txdb.sqlite")

exonsByGene <- exonsBy(mm9_from_cluster_gtf_txdb, by="gene")
seqinfo(exonsByGene)

fls <- list.files("Path/To/BamFiles", pattern="paired.accepted_hits.bam", full=TRUE)
fls
Experiment <- c(fls[2:8], fls[1])
Experiment

bamLst_experiment <- BamFileList(Experiment, yieldSize=100000)
seqinfo(bamLst_experiment)

se_test_experiment <- summarizeOverlaps(exonsByGene, bamLst_experiment, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE)
# <<< This is the step that freezes the computer when I run the 8 files together.

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicAlignments_1.2.0 Rsamtools_1.18.1 Biostrings_2.34.0 XVector_0.6.0
[5] GenomicRanges_1.18.3 GenomeInfoDb_1.2.2 IRanges_2.0.0 S4Vectors_0.4.0
[9] BiocGenerics_0.12.0 BiocInstaller_1.16.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0 bitops_1.0-6
[6] brew_1.0-6 checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1 digest_0.6.4
[11] fail_1.2 foreach_1.4.2 iterators_1.0.7 RSQLite_1.0.0 sendmailR_1.2-1
[16] stringr_0.6.2 tools_3.1.2 zlibbioc_1.12.0
'Serial' means running one file at a time, with no parallel execution. This is what happens when you set the number of workers to 1. I meant that example more as a useful way to debug or test code and not as what I would recommend for a final run. I should have explained that more clearly.
The output of registered() reports which parallel back-end will be used when code is run with a function from the BiocParallel package. Because you are on a Mac, you can take advantage of shared memory with Multicore, so that appears as your default (vs. a Snow cluster or BatchJobs). When you call register(MulticoreParam(workers = 2)), you are specifying that you want to use 2 workers instead of the default of 8 workers.
You said a previous run of 3 files took 2-3 hours and the memory was taxed (i.e., "computer almost freezes up"). Try running 2 files with a MulticoreParam(workers = 2), leaving yieldSize as is. If that goes OK, then try the 8 files with the same param. Otherwise, try reducing the yieldSize.
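As a rough sketch of the registration step described above (written against a current BiocParallel, so function availability may differ slightly from the 1.0.0 release in the sessionInfo), the back-end can be set and then checked before starting the long run. On Windows, MulticoreParam is not supported and falls back to a single worker.

```r
library(BiocParallel)

# Ask for 2 workers instead of the platform default.
register(MulticoreParam(workers = 2))

# bpparam() returns the back-end that BiocParallel-aware functions
# will use when no BPPARAM argument is supplied.
current <- bpparam()
class(current)
bpnworkers(current)

# For debugging, a single worker processes one file at a time:
# register(MulticoreParam(workers = 1))
```

Checking bpparam() right after register() is a cheap way to confirm the worker count before committing to a multi-hour job.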
Valerie
Thanks, Valerie. I am now setting the workers to 2 and trying 2 files. If that works, I will keep everything as is and try with 8 files, then decrease the yieldSize if it is still slow.
I was wondering: if I run the files one at a time (not in parallel), is there a way to combine the results into a single table that I can input into DESeq2? My ultimate goal is to use DESeq2 to analyze the RNA-Seq data.
Thanks
Using register(MulticoreParam(workers = 1)) with a BamFileList of all 8 files will run one file at a time and return all results together as a SummarizedExperiment object that can be used in DESeq2. The counts will be a matrix (with 8 columns) that you can access with assays(se)$counts, where se is the returned SummarizedExperiment.
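If the files do end up being counted in separate calls, the per-file results can also be joined afterwards. The sketch below is only an illustration with made-up toy counts, not real summarizeOverlaps output: make_se is a hypothetical helper, and the gene and sample names are invented. It assumes a current Bioconductor where the SummarizedExperiment package is installed, and that every object was counted against the same features in the same order.

```r
library(SummarizedExperiment)

# Toy stand-ins for two single-file results (hypothetical values).
genes <- c("geneA", "geneB", "geneC")
make_se <- function(sample_name, counts) {
  m <- matrix(counts, ncol = 1, dimnames = list(genes, sample_name))
  SummarizedExperiment(assays = list(counts = m))
}
se1 <- make_se("sample1", c(10L, 0L, 5L))
se2 <- make_se("sample2", c(7L, 3L, 1L))

# cbind() joins objects that share the same rows, adding one column per sample.
se_all <- cbind(se1, se2)

# Extract the combined count matrix (genes in rows, samples in columns).
counts <- assay(se_all, "counts")
counts
```

Running all files through one BamFileList, as described above, avoids this extra step and is usually the simpler route.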
Valerie
How long should this process take for paired-end BAM files of ~5 GB each? Beyond what point should I assume something has gone wrong and restart the computer? Please advise.
Hussein
I used a 4.4 GB paired-end test file sorted by position. It took < 9.5 minutes and ~0.5 GB of RAM to run summarizeOverlaps with a yieldSize of 100000.
When you run into memory limitations or any performance issue, it's best to break the problem down. Try 1 file (or a subset, see ?ScanBamParam) instead of all 8. Trying out different values for the yieldSize is much easier with 1 file than waiting for all 8 to finish. Once you've got a feel for a good yieldSize and how much memory is used, add another worker. Go from there.
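One way to take a subset, sketched here under some assumptions: the region below is arbitrary, the sequence name "chr1" must be replaced by whatever seqinfo() reports for the BAM files, the file path in the comments is hypothetical, and restricting by region requires an indexed BAM (.bai).

```r
library(Rsamtools)
library(GenomicRanges)

# Restrict reading to one region; "chr1" is an assumed sequence name.
region <- GRanges("chr1", IRanges(start = 1, end = 10e6))
param <- ScanBamParam(which = region)

# The param can then be passed on to a timed single-file test run, e.g.:
# bf <- BamFile("/path/to/one_sample.bam")   # hypothetical path
# system.time(
#   se_one <- summarizeOverlaps(exonsByGene, bf, mode = "Union",
#                               singleEnd = FALSE, ignore.strand = TRUE,
#                               fragments = TRUE, param = param)
# )
```

Timing a single file or region this way gives a baseline for judging whether the full 8-file run is merely slow or actually stuck.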
Valerie
Thanks, Valerie.
I will try all of these and then summarize my findings here so that they may be of use to others. Much appreciated.