qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file.

0

Entering edit mode

Sang Chul Choi ▴ 230

@sang-chul-choi-5066

Last seen 10.5 years ago

Hi, I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. Thank you, SangChul

qrqc qrqc • 2.1k views

ADD COMMENT • link updated 12.7 years ago by Thomas Girke ★ 1.7k • written 12.7 years ago by Sang Chul Choi ▴ 230

0

Entering edit mode

Vince S. Buffalo ▴ 140

@vince-s-buffalo-4618

Last seen 10.4 years ago

United States

Hi SangChul, By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. Best, Vince Sent from my iPhone On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > Hi, > > I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > > I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > > Thank you, > > SangChul > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.7 years ago Vince S. Buffalo ▴ 140

0

Entering edit mode

I have tried to tunn off the option when reading sequences of variable lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not running any more. Is there a way of sidestepping this problem? Thank you, SangChul On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > Hi SangChul, > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > Best, > Vince > > Sent from my iPhone > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > >> Hi, >> >> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. >> >> I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. >> >> Thank you, >> >> SangChul >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.7 years ago Sang Chul Choi ▴ 230

0

Entering edit mode

Hi SangChul, Can you attach your sessionInfo()? I will take a look into this issue. best, Vince On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi <schoi@cornell.edu> wrote: > I have tried to tunn off the option when reading sequences of variable > lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has > 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not > running any more. Is there a way of sidestepping this problem? > > Thank you, > > SangChul > > On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > > > Hi SangChul, > > > > By default readSeqFile hashes a proportion of the reads to check against > many being non-unique. Specify hash=FALSE to turn this off and your memory > usage will decrease. > > > > Best, > > Vince > > > > Sent from my iPhone > > > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi@cornell.edu> wrote: > > > >> Hi, > >> > >> I am using qrqc to plot base quality of a short read fastq file. When > the FASTQ file has short reads of the same length, the readSeqFile could > read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of > memory. I trimmed 3' end of the short reads, which would lead to short > reads of variable length because of different base quality at the 3' end. > Then, I tried to read in this second FASTQ file of reads of variable > length. It used up all of the 16 GB memory, and not using CPUs at all. It > seems there are some efficient code in readSeqFile as mentioned in the > readSeqFile help message. It seems to fall apart when short reads are of > different size. > >> > >> I wish to see how the trimming change the base-quality plots, and this > is a problem. I am wondering if there is a way of sidestepping this > problem. > >> > >> Thank you, > >> > >> SangChul > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Vince Buffalo Statistical Programmer Bioinformatics Core UC Davis Genome Center vincebuffalo.com twitter.com/vsbuffalo [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Vince S. Buffalo ▴ 140

0

Entering edit mode

Thank you for offering help. Appended are the R script and its output mostly from sessionInfo() function. Thank you, SangChul The R script: ====================================================================== ================== library(qrqc) args <- commandArgs(trailingOnly = TRUE) if (length(args) != 3) { cat ("Rscript job-stat.R 1.fq.gz 1.fq.RData sanger\n") quit("yes") } fq.name <- args[1] fq.quality <- args[3] sessionInfo() # fq.file <- readSeqFilefq.name,quality=fq.quality,hash=FALSE) # toplot <- qualPlot(fq.file) # fq.plot <- args[2] # save(list="toplot", file = fq.plot) ====================================================================== ================== Output from the R script above: ====================================================================== =================== Loading required package: reshape Loading required package: plyr Attaching package: 'reshape' The following object(s) are masked from 'package:plyr': rename, round_any Loading required package: ggplot2 Loading required package: methods Loading required package: Biostrings Loading required package: BiocGenerics Attaching package: 'BiocGenerics' The following object(s) are masked from 'package:stats': xtabs The following object(s) are masked from 'package:base': anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, unique Loading required package: IRanges Attaching package: 'IRanges' The following object(s) are masked from 'package:reshape': rename The following object(s) are masked from 'package:plyr': compact, desc, rename Loading required package: biovizBase Loading required package: brew Loading required package: xtable Loading required package: Rsamtools Loading required package: GenomicRanges Loading required package: testthat R Under development (unstable) (2012-04-01 r58897) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] methods stats graphics grDevices utils datasets base other attached packages: [1] qrqc_1.10.0 testthat_0.6 Rsamtools_1.8.5 [4] GenomicRanges_1.8.6 xtable_1.7-0 brew_1.0-6 [7] biovizBase_1.4.2 Biostrings_2.24.1 IRanges_1.14.3 [10] BiocGenerics_0.2.0 ggplot2_0.9.1 reshape_0.8.4 [13] plyr_1.7.1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 [4] bitops_1.0-4.1 BSgenome_1.24.0 cluster_1.14.2 [7] colorspace_1.1-1 DBI_0.2-5 dichromat_1.2-4 [10] digest_0.5.2 evaluate_0.4.2 GenomicFeatures_1.8.1 [13] grid_2.16.0 Hmisc_3.9-3 labeling_0.1 [16] lattice_0.20-6 MASS_7.3-18 memoise_0.1 [19] munsell_0.3 proto_0.3-9.2 RColorBrewer_1.0-5 [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.1 [25] rtracklayer_1.16.1 scales_0.2.1 stats4_2.16.0 [28] stringr_0.6 tools_2.16.0 XML_3.9-4 [31] zlibbioc_1.2.0 ====================================================================== =================== On Jun 5, 2012, at 3:03 PM, Vince S. Buffalo wrote: > Hi SangChul, > > Can you attach your sessionInfo()? I will take a look into this issue. > > best, > Vince > > On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > I have tried to tunn off the option when reading sequences of variable lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not running any more. Is there a way of sidestepping this problem? > > Thank you, > > SangChul > > On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > > > Hi SangChul, > > > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > > > Best, > > Vince > > > > Sent from my iPhone > > > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > > > >> Hi, > >> > >> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > >> > >> I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > >> > >> Thank you, > >> > >> SangChul > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Vince Buffalo > Statistical Programmer > Bioinformatics Core > UC Davis Genome Center > vincebuffalo.com twitter.com/vsbuffalo >

ADD REPLY • link 12.7 years ago Sang Chul Choi ▴ 230

0

Entering edit mode

Hi SangChul, I forgot to mention that you'll also need to turn k-mer hashing off too (kmer=FALSE) if you're working with a machine with limited memory. Can you let me know if this solves your issue? I will likely change these defaults in the next version so as to not accidentally lead to an overuse of memory. best, Vince On Wed, Jun 6, 2012 at 11:14 AM, Sang Chul Choi <schoi@cornell.edu> wrote: > Thank you for offering help. Appended are the R script and its output > mostly from sessionInfo() function. > > Thank you, > > SangChul > > The R script: > > ==================================================================== ==================== > library(qrqc) > args <- commandArgs(trailingOnly = TRUE) > if (length(args) != 3) > { > cat ("Rscript job-stat.R 1.fq.gz 1.fq.RData sanger\n") > quit("yes") > } > fq.name <- args[1] > fq.quality <- args[3] > sessionInfo() > # fq.file <- readSeqFilefq.name,quality=fq.quality,hash=FALSE) > # toplot <- qualPlot(fq.file) > # fq.plot <- args[2] > # save(list="toplot", file = fq.plot) > > ==================================================================== ==================== > > Output from the R script above: > > ==================================================================== ===================== > Loading required package: reshape > Loading required package: plyr > > Attaching package: 'reshape' > > The following object(s) are masked from 'package:plyr': > > rename, round_any > > Loading required package: ggplot2 > Loading required package: methods > Loading required package: Biostrings > Loading required package: BiocGenerics > > Attaching package: 'BiocGenerics' > > The following object(s) are masked from 'package:stats': > > xtabs > > The following object(s) are masked from 'package:base': > > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, > get, intersect, lapply, Map, mapply, mget, order, paste, pmax, > pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, > rownames, sapply, setdiff, table, tapply, union, unique > > Loading required package: IRanges > > Attaching package: 'IRanges' > > The following object(s) are masked from 'package:reshape': > > rename > > The following object(s) are masked from 'package:plyr': > > compact, desc, rename > > Loading required package: biovizBase > Loading required package: brew > Loading required package: xtable > Loading required package: Rsamtools > Loading required package: GenomicRanges > Loading required package: testthat > R Under development (unstable) (2012-04-01 r58897) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] methods stats graphics grDevices utils datasets base > > other attached packages: > [1] qrqc_1.10.0 testthat_0.6 Rsamtools_1.8.5 > [4] GenomicRanges_1.8.6 xtable_1.7-0 brew_1.0-6 > [7] biovizBase_1.4.2 Biostrings_2.24.1 IRanges_1.14.3 > [10] BiocGenerics_0.2.0 ggplot2_0.9.1 reshape_0.8.4 > [13] plyr_1.7.1 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 > [4] bitops_1.0-4.1 BSgenome_1.24.0 cluster_1.14.2 > [7] colorspace_1.1-1 DBI_0.2-5 dichromat_1.2-4 > [10] digest_0.5.2 evaluate_0.4.2 GenomicFeatures_1.8.1 > [13] grid_2.16.0 Hmisc_3.9-3 labeling_0.1 > [16] lattice_0.20-6 MASS_7.3-18 memoise_0.1 > [19] munsell_0.3 proto_0.3-9.2 RColorBrewer_1.0-5 > [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.1 > [25] rtracklayer_1.16.1 scales_0.2.1 stats4_2.16.0 > [28] stringr_0.6 tools_2.16.0 XML_3.9-4 > [31] zlibbioc_1.2.0 > > ==================================================================== ===================== > > > On Jun 5, 2012, at 3:03 PM, Vince S. Buffalo wrote: > > > Hi SangChul, > > > > Can you attach your sessionInfo()? I will take a look into this issue. > > > > best, > > Vince > > > > On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi <schoi@cornell.edu> > wrote: > > I have tried to tunn off the option when reading sequences of variable > lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has > 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not > running any more. Is there a way of sidestepping this problem? > > > > Thank you, > > > > SangChul > > > > On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > > > > > Hi SangChul, > > > > > > By default readSeqFile hashes a proportion of the reads to check > against many being non-unique. Specify hash=FALSE to turn this off and your > memory usage will decrease. > > > > > > Best, > > > Vince > > > > > > Sent from my iPhone > > > > > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi@cornell.edu> wrote: > > > > > >> Hi, > > >> > > >> I am using qrqc to plot base quality of a short read fastq file. When > the FASTQ file has short reads of the same length, the readSeqFile could > read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of > memory. I trimmed 3' end of the short reads, which would lead to short > reads of variable length because of different base quality at the 3' end. > Then, I tried to read in this second FASTQ file of reads of variable > length. It used up all of the 16 GB memory, and not using CPUs at all. It > seems there are some efficient code in readSeqFile as mentioned in the > readSeqFile help message. It seems to fall apart when short reads are of > different size. > > >> > > >> I wish to see how the trimming change the base-quality plots, and > this is a problem. I am wondering if there is a way of sidestepping this > problem. > > >> > > >> Thank you, > > >> > > >> SangChul > > >> _______________________________________________ > > >> Bioconductor mailing list > > >> Bioconductor@r-project.org > > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > -- > > Vince Buffalo > > Statistical Programmer > > Bioinformatics Core > > UC Davis Genome Center > > vincebuffalo.com twitter.com/vsbuffalo > > > > -- Vince Buffalo Statistical Programmer Bioinformatics Core UC Davis Genome Center vincebuffalo.com twitter.com/vsbuffalo [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Vince S. Buffalo ▴ 140

0

Entering edit mode

Hi, Turning off kmer option resolves the issue of variable length reads. Thank you, SangChul ________________________________ From: Vince S. Buffalo [vsbuffalo@gmail.com] Sent: Wednesday, June 06, 2012 2:25 PM To: Sang Chul Choi Cc: bioconductor Subject: Re: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. Hi SangChul, I forgot to mention that you'll also need to turn k-mer hashing off too (kmer=FALSE) if you're working with a machine with limited memory. Can you let me know if this solves your issue? I will likely change these defaults in the next version so as to not accidentally lead to an overuse of memory. best, Vince On Wed, Jun 6, 2012 at 11:14 AM, Sang Chul Choi <schoi@cornell.edu<mailto:schoi@cornell.edu>> wrote: Thank you for offering help. Appended are the R script and its output mostly from sessionInfo() function. Thank you, SangChul The R script: ====================================================================== ================== library(qrqc) args <- commandArgs(trailingOnly = TRUE) if (length(args) != 3) { cat ("Rscript job-stat.R 1.fq.gz 1.fq.RData sanger\n") quit("yes") } fq.name<http: fq.name=""> <- args[1] fq.quality <- args[3] sessionInfo() # fq.file <- readSeqFilefq.name<http: fq.name="">,quality=fq.quality,hash=FALSE) # toplot <- qualPlot(fq.file) # fq.plot <- args[2] # save(list="toplot", file = fq.plot) ====================================================================== ================== Output from the R script above: ====================================================================== =================== Loading required package: reshape Loading required package: plyr Attaching package: 'reshape' The following object(s) are masked from 'package:plyr': rename, round_any Loading required package: ggplot2 Loading required package: methods Loading required package: Biostrings Loading required package: BiocGenerics Attaching package: 'BiocGenerics' The following object(s) are masked from 'package:stats': xtabs The following object(s) are masked from 'package:base': anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int<http: pmax.int="">, pmin, pmin.int<http: pmin.int="">, Position, rbind, Reduce, rep.int<http: rep.int="">, rownames, sapply, setdiff, table, tapply, union, unique Loading required package: IRanges Attaching package: 'IRanges' The following object(s) are masked from 'package:reshape': rename The following object(s) are masked from 'package:plyr': compact, desc, rename Loading required package: biovizBase Loading required package: brew Loading required package: xtable Loading required package: Rsamtools Loading required package: GenomicRanges Loading required package: testthat R Under development (unstable) (2012-04-01 r58897) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] methods stats graphics grDevices utils datasets base other attached packages: [1] qrqc_1.10.0 testthat_0.6 Rsamtools_1.8.5 [4] GenomicRanges_1.8.6 xtable_1.7-0 brew_1.0-6 [7] biovizBase_1.4.2 Biostrings_2.24.1 IRanges_1.14.3 [10] BiocGenerics_0.2.0 ggplot2_0.9.1 reshape_0.8.4 [13] plyr_1.7.1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 [4] bitops_1.0-4.1 BSgenome_1.24.0 cluster_1.14.2 [7] colorspace_1.1-1 DBI_0.2-5 dichromat_1.2-4 [10] digest_0.5.2 evaluate_0.4.2 GenomicFeatures_1.8.1 [13] grid_2.16.0 Hmisc_3.9-3 labeling_0.1 [16] lattice_0.20-6 MASS_7.3-18 memoise_0.1 [19] munsell_0.3 proto_0.3-9.2 RColorBrewer_1.0-5 [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.1 [25] rtracklayer_1.16.1 scales_0.2.1 stats4_2.16.0 [28] stringr_0.6 tools_2.16.0 XML_3.9-4 [31] zlibbioc_1.2.0 ====================================================================== =================== On Jun 5, 2012, at 3:03 PM, Vince S. Buffalo wrote: > Hi SangChul, > > Can you attach your sessionInfo()? I will take a look into this issue. > > best, > Vince > > On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi <schoi@cornell.edu<mailto:schoi@cornell.edu>> wrote: > I have tried to tunn off the option when reading sequences of variable lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not running any more. Is there a way of sidestepping this problem? > > Thank you, > > SangChul > > On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > > > Hi SangChul, > > > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > > > Best, > > Vince > > > > Sent from my iPhone > > > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi@cornell.edu<mailto:schoi@cornell.edu>> wrote: > > > >> Hi, > >> > >> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > >> > >> I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > >> > >> Thank you, > >> > >> SangChul > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Vince Buffalo > Statistical Programmer > Bioinformatics Core > UC Davis Genome Center > vincebuffalo.com<http: vincebuffalo.com=""> twitter.com/vsbuffalo<http: twitter.com="" vsbuffalo=""> > -- Vince Buffalo Statistical Programmer Bioinformatics Core UC Davis Genome Center vincebuffalo.com<http: vincebuffalo.com=""/> twitter.com/vsbuffalo<http: twitter.com="" vsbuffalo=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Sang Chul Choi ▴ 230

0

Entering edit mode

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 11 months ago

United States

Dear Vince, Have you thought about supporting ShortReadQ objects from ShortRead in your package. This way users could random sample reads from large fastq files with FastqSampler() which would reduce the memory requirements and speed things up to generate the really nice and useful quality plots of your package. Right this seems to be only possible by saving things back to files (random sample with ShortRead -> save to file -> reload with qrqc) which is not ideal, but perhaps there is a simpler solution to this already that I missed? Thomas On Fri, Jun 01, 2012 at 08:55:53PM +0000, Vince Buffalo wrote: > Hi SangChul, > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > Best, > Vince > > Sent from my iPhone > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > > > Hi, > > > > I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > > > > I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > > > > Thank you, > > > > SangChul > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.7 years ago Thomas Girke ★ 1.7k

Login before adding your answer.