I have been using Rsubread to perform alignment for a large number of paired-end gzFASTQ files using the align() function, and have been careful to log all of the output to keep track of summary information and any error/warning messages that may have been produced. Of the alignments which reported a possibly incorrect Phred Offset (using a default of +33), only one was actually incorrect; the remaining alignments, which there were ~4-5 of, (involving ~8-10 fq.gz files), all reported a perceived range of [35,84], which is not a valid score range. I decided to zcat the fq.gz files and use AWK extract every 4th line to a text file, then wrote and ran a script to keep track of the running values for minimum and maximum observed ascii values. In all cases, the range was actually 35-74. My hypothesis (for which I have no evidence) is that the quality scores are being read from a line containing read information, as 'T' is ASCII 84 and is also the highest-ASCII-value nucleotide abbreviation. I have the quality scores for the paired-end reads for one sample, as well as the R console output from a successful reproduction of this possible bug. The former are 333MB each, so I cannot provide them in their entirety; however, I can provide the full output. The files in question were obtained from the PCGC data hub, so anyone with access to said data hub may find them there. I am including the two remote filepaths (the names are slightly different from the ones I have, but they are the same files) so that anyone with an account may retrieve them on their own with ExpeDat (https://pcgc.research.chop.edu/data_expedition), as long as they have a PCGC account. Here are the remote paths:
resrhdxp01.research.chop.edu:/pcgc/public/Other/transcriptome/fastq/PCGC0065312_HS_TX__1-02059__v1_FC427_L8_p3of6_P2.fastq.gz resrhdxp01.research.chop.edu:/pcgc/public/Other/transcriptome/fastq/PCGC0065308_HS_TX__1-02059__v1_FC427_L8_p3of6_P1.fastq.gz
The command I ran to extract every fourth line is
zcat $1 | awk '!(NR%4)'
The script I used to calculate the minimum and maximum values was written in haskell, and the source is short enough that I will include it here:
import System.IO import Control.Monad import Data.Char import Data.List main = getLine >>= l.minmax l t = do let (m,m') = t in putStrLn (show (ord m) ++ "-" ++ show (ord m')) foe <- isEOF when (not foe) (getLine >>= l.(mim t).minmax) minmax (h:x) = foldl' mm (h,h) x where mm (a,b) c | c < a = (c,b) | c > b = (a,c) | otherwise = (a,b) mim (a,b) (c,d) = (min a c, max b d)
If you save this to "phred.hs", then compile it with "ghc phred.hs", you can run
zcat $1 | awk '!(NR%4)' | ./phred | uniq
to view the running cumulative minimum and maximum ascii values. If any information I provided here is insufficiently descriptive or ambiguously worded, or if more information is necessary, I will be happy to provide whatever information I can to help resolve this bug.
Here is the output of my R session:
R version 3.1.1 (2014-07-10) -- "Sock it to Me" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > require(Rsubread) Loading required package: Rsubread > align(index = "genome", + readfile1 = "120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R1.fq.gz", + readfile2 = "120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R2.fq.gz", + input_format = "gzFASTQ", + output_format = "BAM", + output_file = "test.bam", + tieBreakHamming = TRUE, + unique = TRUE, + indels = 5, + nthreads = 1, + PE_orientation = "fr") ========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ Rsubread 1.15.9 //========================== subread-align setting ===========================\\ || || || Function : Read alignment || || Threads : 1 || || Input file 1 : 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R ... || || Input file 2 : 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R ... || || Output file : test.bam (BAM) || || Index name : genome || || Phred offset : 33 || || || || Min read1 votes : 3 || || Min read2 votes : 1 || || Max fragment size : 600 || || Min fragment size : 50 || || || || Max indels : 5 || || # of Best mapping : 1 || || Unique mapping : yes || || Hamming distance : yes || || Quality scores : no || || || \\===================== http://subread.sourceforge.net/ ======================// //====================== Running (11-Nov-2014 14:44:53) ======================\\ || || || Decompress 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R1.fq.gz... || || The input file contains base space reads. || || Decompress 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R2.fq.gz... || || WARNING The specified phred-score offset (33) seems to be incorrect. || || The observed phred-score range is [35,84]. || || || ------- ABORTED ------- > require(Rsubread) > sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.15.9