Dear All,

I have a large fasta.gz file (645,000 sequences) that I need to translate to amino acids (AA). I am having trouble with the translate() function from Biostrings: it does not seem to handle gaps ('---') and stops with an error. I also need to remove 'X' residues, because the downstream program (TANGO) does not recognize unidentified AA.

I would like gap codons ('---') to be replaced with an empty string ('') during translation, somewhat like an extra option alongside if.fuzzy.codon = 'solve'. Ideally, ambiguous codons that cannot be resolved would also become '' instead of 'X', since I will ultimately have to remove any unknown AA from my input file.

I have searched for terms such as "gap sequences", "unknown aa", "ambiguous aa", "translate() documentation", and "DNAString documentation", but have not been able to find a solution. I would appreciate any pointers or tips.

# Read all 645k sequences and their headers into a DNAStringSet
library(Biostrings)
orthologs = readDNAStringSet('protein_coding_orthologs_dna_cleaned.fasta.gz')

#Loop to translate DNA > AA and output AA sequence in correct input format for TANGO
# for (i in 1:length(orthologs))

for (i in 1:50) {
  name = names(orthologs)[i]
  # Catch translation errors per sequence so the loop keeps going
  tryCatch({
    aa = Biostrings::translate(orthologs[i], if.fuzzy.codon = "solve")
    cat(name, "N N 7 298 0.1", as.character(aa), "\n", file = 'tangoinput.txt', append = TRUE)
  }, error = function(e) {cat("ERROR:", name, conditionMessage(e), "\n")})
}

#ERROR: >header_name not a base at pos 2914 
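
One workaround I am considering, shown here only as a minimal sketch: strip the '-' characters before calling translate(), then drop 'X' from the translated strings. This assumes the gaps always occur as whole, in-frame codons ('---'); if they do not, removing them would shift the reading frame. The toy sequences and the names seq1, seq2, ungapped and aa_clean are made up for illustration and are not from my real data.

```r
library(Biostrings)

# Toy input standing in for the real file (hypothetical sequences)
dna <- DNAStringSet(c(seq1 = "ATGGCC---AAATAA",
                      seq2 = "ATGNNNGGG---TGA"))

# Remove gap characters before translating.
# Assumption: gaps are whole in-frame codons, so the frame is preserved.
ungapped <- DNAStringSet(gsub("-", "", as.character(dna), fixed = TRUE))
names(ungapped) <- names(dna)

# "solve" resolves ambiguous codons where possible and returns 'X' otherwise
aa <- translate(ungapped, if.fuzzy.codon = "solve")

# Drop unresolved residues ('X') from the protein strings.
# Stop codons are still present as '*' at this point.
aa_clean <- gsub("X", "", as.character(aa), fixed = TRUE)

# Build one TANGO-style line per sequence, vectorised instead of looping
tango_lines <- paste(names(aa), "N N 7 298 0.1", aa_clean)
print(tango_lines)
```

On the real data the same steps would be applied to orthologs as a whole, and the lines could then be written in one call with writeLines(tango_lines, 'tangoinput.txt'). translate() may warn about ignored trailing bases if a sequence length is not a multiple of 3 after gap removal.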

sessionInfo( )

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.60.2   GenomeInfoDb_1.28.1 XVector_0.32.0      IRanges_2.26.0      S4Vectors_0.30.0   
[6] BiocGenerics_0.38.0 ggplot2_3.3.5       phylotools_0.2.2    ape_5.5            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7             pillar_1.6.2           compiler_4.1.0         BiocManager_1.30.16   
 [5] bitops_1.0-7           tools_4.1.0            zlibbioc_1.38.0        lifecycle_1.0.0       
 [9] tibble_3.1.3           nlme_3.1-152           gtable_0.3.0           lattice_0.20-44       
[13] pkgconfig_2.0.3        rlang_0.4.11           rstudioapi_0.13        GenomeInfoDbData_1.2.6
[17] withr_2.4.2            dplyr_1.0.7            generics_0.1.0         vctrs_0.3.8           
[21] grid_4.1.0             tidyselect_1.1.1       glue_1.4.2             R6_2.5.0              
[25] fansi_0.5.0            purrr_0.3.4            magrittr_2.0.1         scales_1.1.1          
[29] ellipsis_0.3.2         colorspace_2.0-2       utf8_1.2.2             RCurl_1.98-1.3        
[33] munsell_0.5.0          crayon_1.4.1
