The trim*
functions remove nucleotides from the beginning or ending of reads only, not from the middle. The easiest to understand is trimEnds()
, which might be used to remove low-quality bases at the beginning and / or end of the reads.
sp <- SolexaPath(system.file('extdata', package='ShortRead')) # path to a fastq
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") # input data
This is what the quality scores look like
> quality(rfq)
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
[3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
[4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
[5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
... ... ...
[252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256] 36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH
and the available 'alphabet' representing the qualities, from low quality to high quaility
> alphabet(quality(rfq))
[1] " " "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
[16] "/" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ":" ";" "<" "="
[31] ">" "?" "@" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
[46] "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "["
[61] "\\" "]" "^" "_" "`" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[76] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y"
[91] "z" "{" "|" "}"
See this wiki entry for description of how these codes translate into probabilities. So to trim tails with quality less than the letter 'L', for instance,
> rfq1 = trimEnds(rfq, "L")
> quality(rfq1)
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2] 32 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZP
[3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
[4] 35 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLO
[5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
... ... ...
[252] 35 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAO
[253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254] 34 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOO
[255] 35 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJS
[256] 35 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHN
The first record is not trimmed, because the last nucleotide has quality 'S' , which is higher than the threshold 'L'. The second record removes ICCK
, all of which are less than L, etc.
The other trim*
functions scan from left to right along the read, calculating a score for each nucleotide in the read. The read is trimmed at the first location where the score falls below a threshold. trimTailw()
function calculates the score based on a sliding window, whereas trimTails()
accumulates a score for all preceding nucleotides.
These functions act on objects; if you'd like to go from an untrimmed fastq file to a trimmed fastq file you'd follow the instructions section 3.2 of the vignette.
Hi, I would like to know if I can use this package to trim reads that have a primer sequence that will be in all my reads due to the methodology followed for the generation of the sequencing library. Its a 4c-seq experiment and the primer sequence won´t allow me to align my reads to the reference genome. Greetings!!
If the primers are always at the same position, e.g., the first 10 nucleotides, use, e.g.,
narrow(srq, start = 11)
. Likely you would iterate through files, reading a chunk, trimming, writing the chunk, and repeating, as demonstrated in section 3.2 of the vignette.Thanks Martin, It was very useful!