I am using the Biostrings package in R to translate my NGS reads directly to amino acids. It has been great so far in my analysis, but I need to modify it for a part of the analysis.
I am using a conserved genomic site to begin translating reads in the forward direction from the 5' end. However, if the read is truncated at the 5' end, I now need to start translating from the conserved site at the 3' end end to stay in frame. I believe I need to start reading the string at the -3 position and read each codon forward every -3 from that position? *It is fine if the output is 3' - 5'- I can use the reverse function to reverse the order of the amino acids.
For example:
5' CTTGAGATGCGAATT 3'
5' - - - - AGATGCGAATT 3'
The first one will translate in frame. The second read will not translate in frame. Thanks for any help/advice!
Here is the code that I believe will need to be updated:
"last %d bases were ignored", phase);
return 1;
}
return 0;
}
/*
* --- .Call ENTRY POINT ---
* Return an AAStringSet object.
*/
SEXP DNAStringSet_translate(SEXP x, SEXP base_codes, SEXP lkup, SEXP skipcode)
{
cachedXStringSet X, Y;
cachedCharSeq X_elt, Y_elt;
char skipcode0;
int ans_length, i, errcode;
SEXP ans, width, ans_width;
TwobitEncodingBuffer teb;
X = _cache_XStringSet(x);
skipcode0 = (unsigned char) INTEGER(skipcode)[0];
ans_length = _get_cachedXStringSet_length(&X);
PROTECT(width = NEW_INTEGER(ans_length));
for (i = 0; i < ans_length; i++) {
X_elt = _get_cachedXStringSet_elt(&X, i);
INTEGER(width)[i] = X_elt.length / 3;
}
PROTECT(ans = alloc_XRawList("AAStringSet", "AAString", width));
Y = _cache_XStringSet(ans);
teb = _new_TwobitEncodingBuffer(base_codes, 3, 0);
ans_width = _get_XStringSet_width(ans);
for (i = 0; i < ans_length; i++) {
X_elt = _get_cachedXStringSet_elt(&X, i);
Y_elt = _get_cachedXStringSet_elt(&Y, i);
errcode = translate(&X_elt, &Y_elt, &teb, lkup, skipcode0);
if (errcode == -1) {
UNPROTECT(2);
if (ans_length == 1)
error("%s", errmsg_buf);
else
error("in 'x[[%d]]': %s", i + 1, errmsg_buf);
}
if (errcode == 1) {
if (ans_length == 1)
warning("%s", errmsg_buf);
else
warning("in 'x[[%d]]': %s", i + 1, errmsg_buf);
}
INTEGER(ans_width)[i] = Y_elt.length;
}
UNPROTECT(2);
return ans;
}
Yes, I need 15 nucleotides from the 3' end to remain in any case. That solution is easier. That will work- is that what this does? Thanks!
You'll have to try it.
Also make sure to set
no.init.codon
to TRUE when callingtranslate()
on sequences where the initiation codon has been removed or truncated.no.init.codon
is a new argument totranslate()
:https://github.com/Bioconductor/Biostrings/commit/dcd915c21b96d4d981d4530f913ff9108239e129
I just added it to Biostrings 2.50.1 (BioC 3.8, current release) and Biostrings 2.51.1 (BioC 3.9, current devel). Both versions of Biostrings will become available via
BiocManager::install()
to BioC >= 3.8 users in the next couple of days.Cheers,
H.
Thank you very much for your responses. I have not yet tried this approach because I am playing around with retrieving intact reads specific to a genomic range from my indexed BAM file using the Rsamtools and GenomiRanges packages. However, it seems that most of the Bioconductor tools that analyze the reads in a BAM file are designed to group the reads. I am haplotyping mutations so I need to analyze the reads individually- is there a tool/method that can use the indexed BAM file to translate my subset in frame?