Hi bioC developers,
I am trying to extract the downstream sequences of peptides that have been matched to proteins. The function to do this is the subseq() function. My call of this function looks like this:
> down.aa <- subseq(ref.proteome[df$acc], start = df$start+nchar(as.character(df$pep)), width = 1)
Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
  solving row 2: 'allow.nonnarrowing' is FALSE and the solved end (106) is > refwidth
This fails because the second peptide in df ends exactly at the end of its matching protein, so subseq() tries to extract a subsequence "out-of-bounds" and throws an error. So I thought, no problem, just wrap subseq() in a tryCatch() statement so that it returns NA in case of an error. Unfortunately, the vectorisation in combination with tryCatch() does not really do what I want:
down.aa <- tryCatch(subseq(ref.proteome[df$acc], start = df$start+nchar(as.character(df$pep)), width = 1), error = function(e){return(NA)})
The call works, but now I get a single NA as soon as one of the subseq() calls throws an error:
> down.aa
[1] NA
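The single NA follows from where tryCatch() sits: it wraps the whole vectorised call, so one failing row discards the entire result. A minimal sketch of the difference, using plain character vectors and a hypothetical helper strict_char_at() as a stand-in for subseq() (neither the helper nor the toy sequences come from the original post):

```r
# Hypothetical stand-in for a strict, vectorised extractor: like subseq(),
# it refuses the whole call if any requested position is out of bounds.
strict_char_at <- function(x, pos) {
  bad <- which(pos > nchar(x))
  if (length(bad) > 0)
    stop("solving row ", bad[1], ": position is > width")
  substr(x, pos, pos)
}

proteins <- c("MKTAYIAK", "MSEQ")  # toy sequences, not real data
pos      <- c(5L, 5L)              # the second position is past the end

# One tryCatch() around the vectorised call: a single NA for everything.
all_or_nothing <- tryCatch(strict_char_at(proteins, pos),
                           error = function(e) NA_character_)

# One tryCatch() per element: only the failing element becomes NA.
per_element <- vapply(seq_along(proteins), function(i) {
  tryCatch(strict_char_at(proteins[i], pos[i]),
           error = function(e) NA_character_)
}, character(1))
```

Here all_or_nothing has length 1, while per_element keeps one entry per input. The per-element loop gives up vectorisation, so it may be slow on large inputs.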
Hi Martin,
yes, that is of course an option, although I feel that solveUserSEW() already does this job quite exhaustively when subseq() is called. What I do not really understand is why subseq() stops when it is called on an AAStringSet and only a single SEW triplet creates an error. Wouldn't it be more convenient if it returned NA for that particular case, with a warning, but processed the rest of the AAStringSet?
Why this all or nothing behavior?
A practical answer is that StringSet objects don't have the concept of an 'NA'.
Hi Martin,
ok let's compare how subseq() and substr() behave:
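For reference, the base R side of such a comparison can be sketched as follows. This only shows substr() on a plain character vector with toy strings (an assumption, not the data from this thread); the subseq() side is the error quoted at the top of the thread.

```r
# Toy strings, not real data: the second one is shorter than position 4.
peptides <- c("ABCDE", "ABC")

# substr() does not complain about the out-of-range request; it returns
# an empty string for that element and processes the rest normally.
out <- substr(peptides, 4, 4)
print(out)
```

The out-of-range element comes back as "" without an error or a warning.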
Don't you think a comparable behavior for subseq() would make it more user-friendly (including a warning)?
But substr on StringSets does behave as substr on a character vector?
If you're asking about my opinion, then I'd rather be alerted to logical errors in my code. It forces me to deal with the issue rather than propagating misunderstanding to later parts of the analysis. Also, I wouldn't want subseq to replicate behavior already available via substr.
Ohhh... I did not know that substr() can process instances of the AAStringSet class, but that is very useful information. Thanks!
In retrospect it seems like the algorithm is to 'capture downstream sequences'. So a precondition is that there are actually downstream sequences. Rather than using the solution in my answer to introduce complexity (e.g., checking for zero-width subsequences), it seems like one should just ensure that the precondition is met, e.g.,
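A minimal sketch of such a precondition check, assuming df has the columns acc, start and pep used in the original call and that ref.proteome is a named AAStringSet. The toy sequences and the helper variables down.start and has.down are made up for illustration:

```r
library(Biostrings)

# Toy data standing in for the poster's objects (assumed shapes).
ref.proteome <- AAStringSet(c(P1 = "MKTAYIAKQR", P2 = "MSEQ"))
df <- data.frame(acc   = c("P1", "P2"),
                 start = c(3L, 2L),
                 pep   = c("TAYI", "SEQ"),
                 stringsAsFactors = FALSE)

# Position of the first residue downstream of each peptide.
down.start <- df$start + nchar(as.character(df$pep))

# Precondition: that position must exist in the matched protein.
has.down <- down.start <= width(ref.proteome[df$acc])

# Call subseq() only on rows that meet the precondition; the rest stay NA.
down.aa <- rep(NA_character_, nrow(df))
down.aa[has.down] <- as.character(
  subseq(ref.proteome[df$acc[has.down]],
         start = down.start[has.down], width = 1)
)
```

Because the result is a plain character vector rather than an AAStringSet, it can hold NA for the terminal peptides, and subseq() itself never sees an out-of-bounds request.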