How to run getSeq() on a GRanges object with invalid DNA letters?
1
0
Entering edit mode
Jon Bråte ▴ 260
@jon-brate-6263
Last seen 5 months ago
Norway

Hi,

I have a huge fasta file that I want to subset, i.e. pull out a subset of the entries and write them to a file. I used indexFa() from RSamtools to create an index of the fasta file and then I made a GRanges object with gr = as(seqinfo(fa), "GRanges"). But when I try to get the sequences from the gr object it fails because some entries contain invalid DNA letters. But I would like to simply write the sequences as they were originally. I use this code to get the sequences, where x is a vector of integers representing entries I want to keep: getSeq(fa, gr[x]).

Error message:

Error in value[[3L]](cond) : 
   record 1 (hCoV-19/Slovakia/PKM2021022700521/2021:1-29921) contains invalid DNA letters
  file: GISAID_Download_package_2021.05.25_sequences_cut_names.fasta
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fastmatch_1.1-0      Rsamtools_2.2.3      Biostrings_2.54.0    XVector_0.26.0       GenomicRanges_1.38.0
 [6] GenomeInfoDb_1.22.1  IRanges_2.20.2       S4Vectors_0.24.4     BiocGenerics_0.32.0  lubridate_1.7.10    
[11] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.5          purrr_0.3.4          readr_1.4.0         
[16] tidyr_1.1.3          tibble_3.1.0         ggplot2_3.3.3        tidyverse_1.3.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             assertthat_0.2.1       utf8_1.2.1             R6_2.5.0              
 [5] cellranger_1.1.0       backports_1.2.1        reprex_2.0.0           httr_1.4.2            
 [9] pillar_1.6.0           zlibbioc_1.32.0        rlang_0.4.10           readxl_1.3.1          
[13] rstudioapi_0.13        BiocParallel_1.20.1    RCurl_1.98-1.3         munsell_0.5.0         
[17] tinytex_0.31           broom_0.7.6            compiler_3.6.3         modelr_0.1.8          
[21] xfun_0.22              pkgconfig_2.0.3        tidyselect_1.1.0       GenomeInfoDbData_1.2.2
[25] fansi_0.4.2            crayon_1.4.1           dbplyr_2.1.1           withr_2.4.1           
[29] bitops_1.0-7           grid_3.6.3             jsonlite_1.7.2         gtable_0.3.0          
[33] lifecycle_1.0.0        DBI_1.1.1              magrittr_2.0.1         scales_1.1.1          
[37] cli_2.4.0              stringi_1.5.3          fs_1.5.0               xml2_1.3.2            
[41] ellipsis_0.3.1         generics_0.1.0         vctrs_0.3.7            tools_3.6.3           
[45] glue_1.4.2             hms_1.0.0              colorspace_2.0-0       rvest_1.0.0           
[49] haven_2.3.1
Rsamtools Biostrings • 2.0k views
ADD COMMENT
0
Entering edit mode

Not completely relevant since you said you wanted to print exactly what you have but you could also look at Biostrings::replaceAmbiguities

ADD REPLY
3
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

You don't have to read in as a DNAStringSet. As an example,

> file.copy( "C:/Users/jmacdon/AppData/Roaming/R/win-library/4.1/Rsamtools/extdata/ce2dict1.fa", "tmp.fa")
## edit things manually - here is the result:
> readLines("tmp.fa")
 [1] ">pattern01"                "GCGAAACTAGGAGAGGCT"       
 [3] ">pattern02"                "CTGTTAGCTAATTTTAAAAATQRST"     <- Notice this one
 [5] ">pattern03"                "ACTACCACCCAAATTTAGATATTC" 
 [7] ">pattern04"                "AAATTTTTTTTGTTGCAAATTTGA" 
 [9] ">pattern05"                "TCTTCTTGGCTTTGGTGGTACTTTT"
> fa <- FaFile("tmp.fa")
> indexFa(fa)
class: FaFile 
path: tmp.fa
index: tmp.fa.fai
gzindex: tmp.fa.gzi
isOpen: FALSE 
yieldSize: NA 
> gr <- as(seqinfo(fa), "GRanges")
> getSeq(fa, gr[2])
Error in value[[3L]](cond) : 
   record 1 (pattern02:1-25) contains invalid DNA letters
  file: tmp.fa
> getSeq(fa, gr[2], as="AAStringSet" )
AAStringSet object of length 1:
    width seq                                               names               
[1]    25 CTGTTAGCTAATTTTAAAAATQRST                         pattern02
ADD COMMENT
0
Entering edit mode

That works, thanks! But I don't understand why reading the sequences as an AAStringSet works? I though the AAString class was only the amino acids?

ADD REPLY
0
Entering edit mode

Well, that's true. It is only for the amino acids. But the amino acid code includes all of the capital letters, so by definition you can have any capital letter and it will be OK

> all(LETTERS %in% names(AMINO_ACID_CODE))
[1] TRUE
ADD REPLY
0
Entering edit mode

Note that you don't even need to bother using an AAStringSet object, which works here just by luck, when you can use a BStringSet object. BStringSet objects don't restrict letters: they're the analog of character vectors in base R.

H.

ADD REPLY
1
Entering edit mode

Except a BStringSet isn't one of the choices for getSeq for a FaFile object. Which is why I chose AAStringSet.

> getSeq(fa, gr[2], as = "BStringSet")
Error in match.arg(as) : 
  'arg' should be one of "DNAStringSet", "RNAStringSet", "AAStringSet"

But maybe there is a better way to do what the OP wants than using getSeq?

ADD REPLY
0
Entering edit mode

oops... bummer!

That restriction seems to be coming from Rsamtools::scanFa() which the getSeq() method for FaFile objects is based on.

Doesn't sound like a crazy restriction though, given that

the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences

according to this Wikipedia article.

So I wonder what the Q letter in the OP's file is supposed to represent. First time I see it in a FASTA file containing nucleotide sequences. If this is a legit thing to use in nucleotide sequences, then maybe DNA_ALPHABET in Biostrings should be extended to include this new letter, or at least Rsamtools::scanFa() should support as="BStringSet".

I guess for now we can just wait and see if this new DNA letter causes problems again in the future or if it was a one-time thing.

H.

ADD REPLY
0
Entering edit mode

The OP might have an amino acid FASTA file. It appears to be something from GISAID, which you need to register to get, and I'm not that interested to go through whatever the registration process is, so...

ADD REPLY
0
Entering edit mode

Understandable. Thanks for the clarification!

ADD REPLY

Login before adding your answer.

Traffic: 798 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6