Question about translate function in Biostrings package


li lilingdu ▴ 450

@li-lilingdu-1884

Last seen 6.9 years ago

Dear list,

I'm using the translate() function in the Biostrings package to translate DNA sequences into proteins. It works well when the sequence contains only the base letters A/G/C/T. But when the DNA sequence contains nucleotide ambiguity codes such as "N" or "M", translate() fails. For example:

    translate(DNAString("AACTGTCGMCCC"))
    # Error in translate(DNAStringSet(x)) : not a base at pos 9

    translate(DNAString("AACTGNTCG"))
    # Error in translate(DNAStringSet(x)) : not a base at pos 6

    sessionInfo()
    R version 2.12.1 (2010-12-16)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=Chinese_People's Republic of China.936 LC_CTYPE=Chinese_People's Republic of China.936 LC_MONETARY=Chinese_People's Republic of China.936
    [4] LC_NUMERIC=C LC_TIME=Chinese_People's Republic of China.936

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] Biostrings_2.18.2 IRanges_1.8.9

    loaded via a namespace (and not attached):
    [1] Biobase_2.10.0 tools_2.12.1

--- LiGang


ADD COMMENT • link updated 14.1 years ago by Hervé Pagès 16k • written 14.1 years ago by li lilingdu ▴ 450


Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi LiGang,

It's not clear to me what translate() should do when the input contains ambiguity letters. I can see that for some ambiguities in the input, the output won't be affected. As in your first example, replacing M with either A or C produces the same output:

    > translate(DNAString("AACTGTCGACCC"))
      4-letter "AAString" instance
    seq: NCRP
    > translate(DNAString("AACTGTCGCCCC"))
      4-letter "AAString" instance
    seq: NCRP

So yes, I could add support for this. Otherwise, in general, what to do? Should the output contain letters representing ambiguous amino acids? The problem is that, last time I checked, I was not able to find "official" ambiguity codes for amino acids that would represent all possible ambiguities in the protein sequence resulting from all possible ambiguities in the DNA sequence.

Can you please clarify what your question is?

Thanks,
H.

----- Original Message -----
From: "ligang" <luzifer.li@gmail.com>
To: bioconductor at stat.math.ethz.ch
Sent: Thursday, March 17, 2011 10:23:15 PM
Subject: [BioC] Question about translate funciton in Biostrings package

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
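The observation above, that some ambiguity letters do not change the translated residue, can be sketched in base R. This is only an illustration under stated assumptions, not how Biostrings implements translate(): the helper name translate_fuzzy, the hand-built standard genetic code table, and the choice of "X" for codons that cannot be resolved to a single amino acid are all hypothetical.

```r
# Standard genetic code built by hand (assumption: the standard table is wanted).
# Codons are ordered with T, C, A, G at each position, first base varying slowest.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
genetic_code <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(genetic_code) <- paste0(grid$b1, grid$b2, grid$b3)

# IUPAC nucleotide codes and the bases each one stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Hypothetical helper: translate a DNA string codon by codon. A codon with
# ambiguity letters is expanded into every concrete codon it could represent;
# if all of them encode the same residue, that residue is used, otherwise "X".
translate_fuzzy <- function(dna) {
  starts <- seq(1, nchar(dna) - 2, by = 3)
  codons <- substring(dna, starts, starts + 2)
  aa <- vapply(codons, function(codon) {
    choices <- strsplit(unname(iupac[strsplit(codon, "")[[1]]]), "")
    expanded <- do.call(paste0, expand.grid(choices, stringsAsFactors = FALSE))
    residues <- unique(genetic_code[expanded])
    if (length(residues) == 1) residues else "X"
  }, character(1))
  paste(aa, collapse = "")
}

translate_fuzzy("AACTGTCGMCCC")  # CGM is CGA or CGC, both R
translate_fuzzy("AACTGNTCG")     # TGN may be C, W or a stop
```

Under these assumptions the first call gives "NCRP", matching the two unambiguous translations shown above, and the second gives "NXS". As far as I know, later Biostrings releases added an if.fuzzy.codon argument to translate() for this purpose; check the documentation of your installed version.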

ADD COMMENT • link 14.1 years ago Hervé Pagès 16k


Hi,

Here is my understanding of the situation. In DNA, there are sometimes ambiguities in the nucleic acid sequence. Because an amino acid may have several codons, sometimes switching an A for a C, for example, won't make any big difference. That is where ambiguity letters are used. Each organism has a preferred codon for each amino acid, and that helps to find mutations when another codon for the same amino acid is used.

If you simply want an amino acid sequence, replacing each ambiguity letter with one of its possible nucleotides won't make any difference. For phylogenetic analysis, it does make a difference. From what I know of phylogenetic analysis and of what this package does, I don't think that is what is intended here.

One solution is to manually replace each ambiguity letter with one of its corresponding nucleotides. After that, the function will work well. Another possibility is simply to add a new parameter to it. You say that there is no universal convention for the ambiguity letters, but the user should know what the convention is for their sequence. So, if my understanding is correct, adding new parameters that specify which ambiguity letters may be found, and which nucleotide to replace each one with, should fix the function.

Am I right?

Simon Noël
CdeC

________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Pages, Herve [hpages at fhcrc.org]
Sent: 18 March 2011 01:57
To: ligang
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Question about translate funciton in Biostrings package
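The manual replacement suggested in this reply could look roughly like the following base R sketch. The function name resolve_ambiguity and its replacement argument are hypothetical, introduced only for illustration; they are not part of Biostrings. The caller supplies, for each ambiguity letter, the concrete base to substitute, and the sketch rejects a base that the letter cannot stand for.

```r
# IUPAC ambiguity letters and the bases each one may represent.
iupac <- c(M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Hypothetical helper: substitute each ambiguity letter with a user-chosen base.
# `replacement` is a named character vector, e.g. c(M = "A", N = "C").
resolve_ambiguity <- function(dna, replacement) {
  codes <- names(replacement)
  # Each chosen base must be one of the bases its ambiguity letter stands for.
  ok <- mapply(function(code, base) grepl(base, iupac[[code]], fixed = TRUE),
               codes, replacement)
  if (!all(ok)) {
    stop("replacement not allowed for: ", paste(codes[!ok], collapse = ", "))
  }
  # chartr() maps the i-th character of `old` to the i-th character of `new`.
  chartr(paste(codes, collapse = ""), paste(replacement, collapse = ""), dna)
}

resolve_ambiguity("AACTGTCGMCCC", c(M = "A"))
resolve_ambiguity("AACTGNTCG", c(N = "C"))
```

The resulting strings contain only A/C/G/T, so they could then be passed to translate(DNAString(...)). Note that the choice matters whenever the possible codons encode different residues, as with TGN in the second example.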

ADD REPLY • link 14.1 years ago SimonNoël ▴ 450
