Entering edit mode
Noah Dowell
▴
410
@noah-dowell-3791
Last seen 10.3 years ago
Hello All,
I am trying to do some preprocessing quality analysis on PacBio reads.
To make any plots of quality score or nucleotide frequency one should
only look at reads of the same length. One output from PacBio
filtering of the raw data is a FASTQ file. The nature of the PacBio
reads is that they lack uniformity in their length dimension. I am
pretty sure the file is okay because the readSeqFile() function in the
QRQC package is able to read the data in and the FASTQSummary object
can be used to make plots. Given the non-uniform length of reads
though those plots show artifacts. As far as can tell the
FASTQSummary object can not be manipulated like a ShortRead or
BioStrings object.
So I turn to the ShortRead package. I have also successfully used the
readFastq() function in ShortRead package to read in Illumina reads
of different length and then preprocess or slice-n-dice those reads.
I am getting the following error though when I try this on PacBio
fastq files:
######################################################################
###############
> dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_25
5/017249_12cells_filtered/data/"
> pat1 = "filteredSTD_12cells.fastq"
> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file =
"filteredSTD_12cells.fastq")
Error: Input/Output
file(s):
/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_
12cells_filtered/data//filteredSTD_12cells.fastq
message: line too long /Users/ndowell/Documents/PacBio/130204_NLD1_X
L_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq:
17521
######################################################################
###############
So I tried to manually build a ShortReadQ file using the following
(there are generally 4 distinct lines in a FASTQ file):
#workaround to error with too long lines
seqTemp <- readLines("filteredSTD_12cells.fastq")
seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)])
ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)])
qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)])
SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids)
#This worked!!:
length(SeqClean) #484232
summary(width(SeqClean))
Min. 1st Qu. Median Mean 3rd Qu. Max.
50 1368 2153 2785 3567 26940
#But I get this error when I try to write the file (changing withIds
to TRUE or full to TRUE gives same error):
> writeFastq(SeqClean, filepath = "cleanReads.fastq", format =
"fastq", mode = "w", full = FALSE, withIds = FALSE)
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function ?writeFastq? for
signature ?"ShortReadQ", "missing"?
#I think something is not quite right with my construction of the
ShortReadQ object and specifically the read IDs
#I wonder if it is because it is slotting the ids into the wrong
place:
> head(ids)
A BStringSet instance of length 6
width seq
[1] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/21/0_5104
[2] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/53/2430_3193
[3] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/53/3237_6580
[4] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/53/6626_9948
[5] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/54/0_4929
[6] 74 @m130205_030957_42152_c100453962550000001523068903201342_s1_
p0/81/276_2420
######################################################################
###############
My questions:
1. Is there any way to leverage the ability to slice-n-dice a FASTQ
data set in the ShortRead package and then write to a new FASTQ file?
2 Should I just try to add in names to my sequences manually (to the
ShortReadQ object) to alleviate the 'mssing' error above?
3. Is there any desire from the developers to merge the id() and
names() functions?
Any help/hints would be appreciated. I apologize for the lack of a
toy-reproducible example; I was unsuccessful in my attempt to create a
toy quality score. Thank you in advance for your assistance and time.
Best,
Noah
######################################################################
###############
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets
methods base
other attached packages:
[1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0
brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3
[7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0
RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3
[13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3
latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2
[19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6
IRanges_1.16.5 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1
cluster_1.14.3 colorspace_1.2-1
[6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3
GenomicFeatures_1.10.1 gtable_0.1.2
[11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1
MASS_7.3-23 munsell_0.4
[16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3
reshape2_1.2.2 rtracklayer_1.18.2
[21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2
tools_2.15.2 XML_3.95-0.1
[26] zlibbioc_1.4.0