Modify Biostrings script
1
0
Entering edit mode
@gingergeiger22-18200
Last seen 15 months ago
United States

I am using the Biostrings package in R to translate my NGS reads directly to amino acids. It has been great so far in my analysis, but I need to modify it for a part of the analysis.

I am using a conserved genomic site to begin translating reads in the forward direction from the 5' end. However, if the read is truncated at the 5' end, I now need to start translating from the conserved site at the 3' end end to stay in frame. I believe I need to start reading the string at the -3 position and read each codon forward every -3 from that position? *It is fine if the output is 3' - 5'- I can use the reverse function to reverse the order of the amino acids. 

For example: 

5'  CTTGAGATGCGAATT  3'  

5'  - - - - AGATGCGAATT 3' 

The first one will translate in frame. The second read will not translate in frame. Thanks for any help/advice! 

Here is the code that I believe will need to be updated:

                         "last %d bases were ignored", phase);

                return 1;

        }

        return 0;

}

 

/*

 * --- .Call ENTRY POINT ---

 * Return an AAStringSet object.

 */

SEXP DNAStringSet_translate(SEXP x, SEXP base_codes, SEXP lkup, SEXP skipcode)

{

        cachedXStringSet X, Y;

        cachedCharSeq X_elt, Y_elt;

        char skipcode0;

        int ans_length, i, errcode;

        SEXP ans, width, ans_width;

        TwobitEncodingBuffer teb;

 

        X = _cache_XStringSet(x);

        skipcode0 = (unsigned char) INTEGER(skipcode)[0];

        ans_length = _get_cachedXStringSet_length(&X);

        PROTECT(width = NEW_INTEGER(ans_length));

        for (i = 0; i < ans_length; i++) {

                X_elt = _get_cachedXStringSet_elt(&X, i);

                INTEGER(width)[i] = X_elt.length / 3;

        }

        PROTECT(ans = alloc_XRawList("AAStringSet", "AAString", width));

        Y = _cache_XStringSet(ans);

        teb = _new_TwobitEncodingBuffer(base_codes, 3, 0);

        ans_width = _get_XStringSet_width(ans);

        for (i = 0; i < ans_length; i++) {

                X_elt = _get_cachedXStringSet_elt(&X, i);

                Y_elt = _get_cachedXStringSet_elt(&Y, i);

                errcode = translate(&X_elt, &Y_elt, &teb, lkup, skipcode0);

                if (errcode == -1) {

                        UNPROTECT(2);

                        if (ans_length == 1)

                                error("%s", errmsg_buf);

                        else

                                error("in 'x[[%d]]': %s", i + 1, errmsg_buf);

                }

                if (errcode == 1) {

                        if (ans_length == 1)

                                warning("%s", errmsg_buf);

                        else

                                warning("in 'x[[%d]]': %s", i + 1, errmsg_buf);

                }

                INTEGER(ans_width)[i] = Y_elt.length;

        }

        UNPROTECT(2);

        return ans;

}

 

 

Biostrings modify translate • 709 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States

Hi,

There is no need to start translating from the conserved site at the 3' end to stay in frame. When a read is truncated at the 5' end, all you need to do is trim 1 or 2 nucleotides on its 5' end to be in frame again. Or maybe you don't need to trim anything if the truncation preserved the frame. More precisely, assuming that the 3' of your read is in frame (i.e. its last 3 nucleotides form a codon), all you need to do is trim the 5' end by RL %% 3 where RL is the length of the read. If reads is the DNAStringSet containing your reads and Ltrim the integer vector parallel to reads containing the number of nucleotides you want to trim on the left end (i.e. 5' end) of each read, then trim the reads with windows(reads, start=1+Ltrim).

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Yes, I need 15 nucleotides from the 3' end to remain in any case. That solution is easier. That will work- is that what this does? Thanks!

 

ADD REPLY
0
Entering edit mode

You'll have to try it.

Also make sure to set no.init.codon to TRUE when calling translate() on sequences where the initiation codon has been removed or truncated. no.init.codon is a new argument to translate():

  https://github.com/Bioconductor/Biostrings/commit/dcd915c21b96d4d981d4530f913ff9108239e129

I just added it to Biostrings 2.50.1 (BioC 3.8, current release) and Biostrings 2.51.1 (BioC 3.9, current devel). Both versions of Biostrings will become available via BiocManager::install() to BioC >= 3.8 users in the next couple of days.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Thank you very much for your responses. I have not yet tried this approach because I am playing around with retrieving intact reads specific to a genomic range from my indexed BAM file using the Rsamtools and GenomiRanges packages. However, it seems that most of the Bioconductor tools that analyze the reads in a BAM file are designed to group the reads. I am haplotyping mutations so I need to analyze the reads individually- is there a tool/method that can use the indexed BAM file to translate my subset in frame?

ADD REPLY

Login before adding your answer.

Traffic: 490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6