Hi,
I'm running some stats on FASTQ files as shown below.
```r
fq <- readFastq(file)
# get quality scores as a matrix (via PhredQuality)
qual <- as(quality(fq), "matrix")
# probability of error per base: P = 10^(-Q/10)
pe <- apply(qual, 2, function(x) 10^(-(x/10)))
# cumulative error per read; the error probability grows towards the end of the sequence
cum_pe <- apply(pe, 1, cumsum)
```

```
> cum_pe[, 1]
  [1] 5.011872e-04 8.992944e-04 1.297402e-03 1.695509e-03 2.093616e-03 2.491723e-03
  [7] 3.286051e-03 3.485578e-03 3.685104e-03 4.083211e-03 4.334400e-03 4.533926e-03
 [13] 8.514998e-03 9.016185e-03 9.414292e-03 1.442616e-02 1.642143e-02 1.681953e-02
.....
[295] 1.226770e+02 1.233080e+02 1.239390e+02 1.245699e+02 1.252009e+02 1.258318e+02
> dim(cum_pe)
[1]   300 56761
```

So I have a matrix with one column per read: 56761 reads of 300 nt each.
How can I trim each read at the point where cum_pe exceeds 1.0? For example, if cum_pe reaches 1 at position 150 of the read, I would like to trim at position 150.
Thanks,
The calculation of pe in the original question can be vectorized.
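Since `^` and `/` operate elementwise on whole matrices in R, the column-wise apply() can be dropped entirely. A minimal sketch (the small qual matrix here is a stand-in for the real quality matrix):

```r
# toy stand-in for the real 300 x 56761 quality matrix
qual <- matrix(c(30, 20, 10, 40), nrow = 2)

# elementwise over the whole matrix, no apply() needed
pe <- 10^(-qual/10)

pe[1, 1]  # 0.001 (Q30)
pe[2, 2]  # 0.0001 (Q40)
```

This computes the same result as the apply() version but in a single vectorized expression.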
I might write the pos-finding function as one of the following.
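A sketch of two such helpers (the names pos1 and pos2 are mine), consistent with the behaviours discussed in the replies: one simply takes the first index where the cumulative error exceeds 1, the other falls back to the read length when no position does.

```r
# first position where cumulative error exceeds 1; NA if none do
pos1 <- function(x) which(x > 1)[1]

# same, but falls back to length(x) when no position exceeds 1
pos2 <- function(x) min(which(x > 1), length(x))

pos1(c(0.2, 0.8, 1.5))   # 3
pos1(c(0.1, 0.2))        # NA
pos2(c(0.1, 0.2, 0.3))   # 3 (nothing exceeds 1, so the full length)
```

Applied column-wise over cum_pe, either gives one trim position per read.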
(Mike's fails if none of x > 1.) One problem might be that early low-quality bases lead to premature trimming.
I think mine will catch instances where none of x > 1, since it'll return min(Inf, length(x)) in that case. But yours is a much more elegant solution.
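Putting the pieces together, a minimal sketch of the trimming step itself (the toy cum_pe matrix and the name trim_at are mine; the final line shows how the reads could then be shortened with narrow() from ShortRead/Biostrings):

```r
# toy stand-in for cum_pe: 4 positions x 3 reads, one column per read
cum_pe <- matrix(c(0.1, 0.5, 1.2, 2.0,
                   0.2, 0.3, 0.4, 0.5,
                   1.5, 2.0, 2.5, 3.0), nrow = 4)

# per read: first position where cumulative error exceeds 1,
# or the full read length if it never does
trim_at <- apply(cum_pe, 2, function(x) min(which(x > 1), length(x)))

trim_at  # 3 4 1

# with ShortRead loaded and fq from the question, the reads would be trimmed as:
# trimmed <- narrow(fq, start = 1, end = trim_at)
```

The narrow() call is vectorized over reads, so one trim position per column of cum_pe is exactly what it needs.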