When using R with the Biostrings package from Bioconductor, why would you choose to save your data (from a FASTA file, for example) as a BStringSet instead of a typed StringSet (i.e. DNAStringSet, RNAStringSet, AAStringSet)? In what situations is it better? Also, what conditions does your data have to meet for it to be better not to specify a type?
It's mostly that DNAString, RNAString and AAString objects impose requirements on their content that BString objects don't.
For example, from ?DNAString:
DNAString objects
Description:
A DNAString object allows efficient storage and manipulation of a
long DNA sequence.
Details:
The DNAString class is a direct XString subclass (with no
additional slot). Therefore all functions and methods described
in the XString man page also work with a DNAString object
(inheritance).
Unlike the BString container that allows storage of any single
string (based on a single-byte character set) the DNAString
container can only store a string based on the DNA alphabet (see
below). In addition, the letters stored in a DNAString object are
encoded in a way that optimizes fast search algorithms.
The DNA alphabet:
This alphabet contains all letters from the IUPAC Extended Genetic
Alphabet (see '?IUPAC_CODE_MAP') plus '"-"' (the _gap_ letter),
'"+"' (the _hard masking_ letter), and '"."' (the _not a letter_
or _not available_ letter). It is stored in the 'DNA_ALPHABET'
predefined constant (character vector).
The 'alphabet()' function returns 'DNA_ALPHABET' when applied to a
DNAString object.
And further:
> BString(paste(LETTERS, collapse = ""))
26-letter BString object
seq: ABCDEFGHIJKLMNOPQRSTUVWXYZ
> DNAString(paste(LETTERS, collapse = ""))
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
key 69 (char 'E') not in lookup table
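To make the restriction concrete, here is a minimal sketch of how one might check whether a set of strings fits the DNA alphabet before picking a container. The helpers `fits_dna()` and `store()` and the `seqs` vector are hypothetical names made up for this illustration; they are not part of Biostrings. Only `uniqueLetters()`, `BStringSet()`, `DNAStringSet()` and `DNA_ALPHABET` come from the package.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): TRUE if every letter used in
# `x` belongs to DNA_ALPHABET. The check is case-sensitive, so lowercase
# input would need toupper() first.
fits_dna <- function(x) {
  all(uniqueLetters(BStringSet(x)) %in% DNA_ALPHABET)
}

# Made-up example data.
seqs <- c(a = "ACGTN-", b = "ACGTEX")

fits_dna(seqs["a"])   # only letters from the DNA alphabet
fits_dna(seqs["b"])   # 'E' and 'X' are not in the DNA alphabet

# Hypothetical helper: use the typed container when the data allows it,
# otherwise fall back to the untyped BStringSet.
store <- function(x) if (fits_dna(x)) DNAStringSet(x) else BStringSet(x)

class(store(seqs["a"]))
class(store(seqs))
```

The same pattern would apply to `RNA_ALPHABET` or `AA_ALPHABET` if the data were expected to be RNA or protein.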
Yes, I understand that. I just can't think of any situation where you would have a set of data that did not use the proper lettering system for DNA, RNA, or AA (depending on your sequence). So why does BString exist? It seems to serve no purpose. This is why I was asking for a specific situation where you would use BString, because I can't think of any. I mainly use Biostrings in conjunction with msa, so I am unaware of other functions. I could have been clearer in my original question.
You would use a BString or BStringSet object in any situation where your character data is not a DNA, RNA, or protein sequence, but you still want to use Biostrings tools like matchPattern() or pairwiseAlignment(). Note that the native input data structures for these tools are XString or XStringSet derivatives. This means that, even if these tools seem to accept ordinary character vectors, the first thing they do is convert those vectors into BString or BStringSet objects, because those are the data structures the tools were implemented to work on and optimized for.
For example, if your data is general text and you want to count the occurrences of a set of keywords in it, you could import your data into a BStringSet object and do something like:
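As a minimal sketch of that idea, `vcountPattern()` can be applied once per keyword. The `docs` and `keywords` objects below are made up for illustration, and this is only one possible approach, not necessarily the code this answer originally showed.

```r
library(Biostrings)

# Made-up example data (hypothetical, for illustration only).
docs <- BStringSet(c("the cat sat on the mat",
                     "a dog and a cat",
                     "no pets here"))
keywords <- c("cat", "the")

# vcountPattern() counts the occurrences of one pattern in each element of
# a set; sapply() repeats that for every keyword.
counts <- sapply(keywords, vcountPattern, subject = docs)
counts   # one row per element of docs, one column per keyword
```

The matching is exact and case-sensitive by default, so "The" and "the" would be counted separately unless the text is normalized first.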
Or to simply tabulate the frequency of unique letters in the text:
letterFrequency(text, uniqueLetters(text))
# A D E F R T X a b c d e t
# [1,] 1 0 0 2 0 0 1 4 1 1 1 1 0
# [2,] 0 0 0 0 1 1 1 4 3 2 0 1 0
# [3,] 3 1 1 0 1 0 1 6 1 1 0 5 2
Of course, you can do this with base R and ordinary character vectors, but Biostrings data structures and tools were specifically designed to be more efficient when the data is big.
Hope this helps,
H.