Question

Sorry for cross post: Trouble creating ShortReadQ object and using writeFastq

0

Entering edit mode

Noah Dowell ▴ 410

@noah-dowell-3791

Last seen 10.6 years ago

Hello All, I am trying to do some preprocessing quality analysis on PacBio reads. To make any plots of quality score or nucleotide frequency one should only look at reads of the same length. One output from PacBio filtering of the raw data is a FASTQ file. The nature of the PacBio reads is that they lack uniformity in their length dimension. I am pretty sure the file is okay because the readSeqFile() function in the QRQC package is able to read the data in and the FASTQSummary object can be used to make plots. Given the non-uniform length of reads though those plots show artifacts. As far as can tell the FASTQSummary object can not be manipulated like a ShortRead or BioStrings object. So I turn to the ShortRead package. I have also successfully used the readFastq() function in ShortRead package to read in Illumina reads of different length and then preprocess or slice-n-dice those reads. I am getting the following error though when I try this on PacBio fastq files: ###################################################################### ############### > dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_25 5/017249_12cells_filtered/data/" > pat1 = "filteredSTD_12cells.fastq" > seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = "filteredSTD_12cells.fastq") Error: Input/Output file(s): /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_1 2cells_filtered/data//filteredSTD_12cells.fastq message: line too long /Users/ndowell/Documents/PacBio/130204_NLD1_XL _12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq:1 7521 ###################################################################### ############### So I tried to manually build a ShortReadQ file using the following (there are generally 4 distinct lines in a FASTQ file): #workaround to error with too long lines seqTemp <- readLines("filteredSTD_12cells.fastq") seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)]) ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)]) qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)]) SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids) #This worked!!: length(SeqClean) #484232 summary(width(SeqClean)) Min. 1st Qu. Median Mean 3rd Qu. Max. 50 1368 2153 2785 3567 26940 #But I get this error when I try to write the file (changing withIds to TRUE or full to TRUE gives same error): > writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", mode = "w", full = FALSE, withIds = FALSE) Error in function (classes, fdef, mtable) : unable to find an inherited method for function ?writeFastq? for signature ?"ShortReadQ", "missing"? #I think something is not quite right with my construction of the ShortReadQ object and specifically the read IDs #I wonder if it is because it is slotting the ids into the wrong place: > head(ids) A BStringSet instance of length 6 width seq [1] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/21/0_5104 [2] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/2430_3193 [3] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/3237_6580 [4] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/6626_9948 [5] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/54/0_4929 [6] 74 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/81/276_2420 ###################################################################### ############### My questions: 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data set in the ShortRead package and then write to a new FASTQ file? 2 Should I just try to add in names to my sequences manually (to the ShortReadQ object) to alleviate the 'mssing' error above? 3. Is there any desire from the developers to merge the id() and names() functions? Any help/hints would be appreciated. I apologize for the lack of a toy-reproducible example; I was unsuccessful in my attempt to create a toy quality score. Thank you in advance for your assistance and time. Best, Noah ###################################################################### ############### > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0 brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3 [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3 [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2 [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6 IRanges_1.16.5 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.3 colorspace_1.2-1 [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3 GenomicFeatures_1.10.1 gtable_0.1.2 [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1 MASS_7.3-23 munsell_0.4 [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3 reshape2_1.2.2 rtracklayer_1.18.2 [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 [26] zlibbioc_1.4.0

Preprocessing ShortRead Preprocessing ShortRead • 1.3k views

ADD COMMENT • link 12.2 years ago Noah Dowell ▴ 410