Hi Valerie and James,
Thank you both for your helpful answers. I think I need to give a little more background on what I am trying to achieve, to make it easier to understand. We are studying short polypeptides derived from proteins expressed by a large number of viruses. We are building large systematic assay systems in which we express these polypeptides (approx. 45 aa long) in mammalian cells using viral vectors. We build the libraries using custom microarrays, which can generate 100 000 oligonucleotides (200 bp long) that are then put into the viral vector expression system.
The challenge is this: while 100 000 sounds like a lot, it is actually not that many considering the number of viral strains and proteins we wish to express. Therefore, we need to make sure that we do not have unnecessary redundancy in the library, i.e., two genetic sequences that translate into identical polypeptides. Unfortunately, many viral strains have high genetic diversity while coding for highly conserved proteins. Thus, if we were only to fragment the DNA into pieces of suitable length and remove identical duplicates, we would have many more than 100 000 gene sequences, and identical polypeptides would be expressed at a higher abundance than those that are actually different. In addition, some of the viruses are not mammalian viruses, so there is no guarantee that these DNA sequences would translate efficiently into proteins in mammalian cells.
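To make the redundancy problem concrete, here is a minimal Python sketch of collapsing DNA fragments by the polypeptide they encode rather than by their nucleotide sequence. It assumes the standard genetic code and fragments that are already in frame; the function names `translate` and `unique_peptides` are made up for illustration and are not from any package.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string; a trailing partial codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def unique_peptides(fragments):
    """Keep the first DNA fragment seen for each distinct translated peptide."""
    kept = {}
    for dna in fragments:
        kept.setdefault(translate(dna), dna)
    return kept


# Two different DNA fragments encoding the same peptide, plus one distinct fragment.
fragments = ["TTTCTGCCC", "TTCTTACCA", "ATGGCC"]
print(unique_peptides(fragments))
```

The first two fragments differ at the nucleotide level but both encode FLP, so only one entry per peptide survives.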
So the situation is not at all that I need to figure out the original DNA sequence from an AA sequence (I realise that this would be impossible). Instead, what we need to generate are cDNA sequences that would translate with sufficient efficiency into the target polypeptides in mammalian cells.
For this, I can see one possible process. The first step would be, as James suggests, to translate each amino acid into one codon, based on the human codon frequency table. I would then run a mammalian codon optimisation on the entire generated sequence, similar to what Genscript and other gene synthesis companies offer.
It is a function like this that I was looking for, as I have very little knowledge of codon optimisation principles. The first part of the conversion I can clearly write myself.
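For reference, a minimal sketch of that first step could look like the following in Python. It simply picks the highest-frequency codon for each amino acid from a usage table passed in by the caller. The names `best_codons` and `back_translate` are hypothetical, and the numbers in `PLACEHOLDER_USAGE` are illustrative placeholders covering only three amino acids, not a real human codon frequency table; they would need to be replaced with a complete, trusted table. This does not perform any codon optimisation of the whole sequence.

```python
def best_codons(usage):
    """Map each amino acid to its highest-frequency codon.

    usage: dict of amino acid -> dict of codon -> relative frequency.
    """
    return {aa: max(codons, key=codons.get) for aa, codons in usage.items()}


def back_translate(peptide, usage):
    """Back-translate a peptide using one fixed codon per amino acid."""
    table = best_codons(usage)
    missing = sorted(set(peptide) - set(table))
    if missing:
        raise ValueError(f"no codon usage data for: {', '.join(missing)}")
    return "".join(table[aa] for aa in peptide)


# Placeholder fractions for illustration only (assumption, not measured data).
PLACEHOLDER_USAGE = {
    "F": {"TTT": 0.45, "TTC": 0.55},
    "L": {"TTA": 0.08, "TTG": 0.12, "CTT": 0.13,
          "CTC": 0.20, "CTA": 0.07, "CTG": 0.40},
    "P": {"CCT": 0.29, "CCC": 0.32, "CCA": 0.28, "CCG": 0.11},
}

print(back_translate("FLP", PLACEHOLDER_USAGE))
```

Because the table is an argument, the same function could be reused with a usage table for any host organism.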
I hope that this made my question a little clearer.
Thank you again!
/Tomas
I'm not sure that you are correct that 'it is very straightforward', which is likely why you can't find functionality to do this. Given that each amino acid has a one-to-many association with the codons that encode it, how do you propose doing the reverse mapping?
As an example, let's consider a simple 3 amino acid sequence, FLP. If I go to someplace like (http://www.genscript.com/cgi-bin/tools/codon_freq_table) and get the human codon frequency table, I get
So unless you are going to make an (unwarranted, IMO) simplifying assumption that you can just use the most common AA -> codon mapping, you have 48 different possible sequences (2 codons for F x 6 for L x 4 for P) that could have given rise to a simple 3 amino acid sequence. Obviously, as the amino acid sequence gets longer, the number of possible cDNA sequences that could give rise to it blows up massively.
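As a rough illustration of how quickly this grows, the number of candidate cDNA sequences is the product of the codon counts of each residue. The sketch below assumes the standard genetic code; `n_encodings` is a name made up for this example.

```python
from collections import Counter
from math import prod

# Amino acids of the standard genetic code (NCBI table 1), one letter per codon,
# codons ordered with bases T, C, A, G; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in AMINO_ACIDS if aa != "*")


def n_encodings(peptide):
    """Count the distinct DNA sequences that translate to the given peptide."""
    return prod(DEGENERACY[aa] for aa in peptide)


print(DEGENERACY["F"], DEGENERACY["L"], DEGENERACY["P"])  # 2 6 4
print(n_encodings("FLP"))  # 48
```

Only M and W have a single codon, so every other residue added to the peptide multiplies the count by at least 2.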
It seems that for any reasonably long amino acid sequence you would then either get some massive number of possible cDNA sequences (not likely useful), or a single sequence that has a probability somewhere around 1/<some massive number> of being the right one. Neither outcome seems very useful to me.