Question

What is the use of the BStringSet?

skigirl618 • 0

@skigirl618-23618

When using R with the Biostrings package from Bioconductor, why would you choose to save your data (from a FASTA file, for example) as a BStringSet instead of a typed StringSet (i.e. DNAStringSet, RNAStringSet, AAStringSet)? In what situations is it better? Also, what conditions does your data have to meet to make it a better option not to specify the type?

BioStrings BStringSet StringSet • 1.9k views

ADD COMMENT • link updated 4.5 years ago by James W. MacDonald 67k • written 4.5 years ago by skigirl618 • 0

score 0 · Answer 1 · 2020-05-29

James W. MacDonald 67k

@james-w-macdonald-5106

United States

It's mostly that DNAString, RNAString and AAString objects have requirements that don't exist for BString objects.

For example, from ?DNAString:

DNAString objects

Description:

     A DNAString object allows efficient storage and manipulation of a
     long DNA sequence.

Details:

     The DNAString class is a direct XString subclass (with no
     additional slot).  Therefore all functions and methods described
     in the XString man page also work with a DNAString object
     (inheritance).

     Unlike the BString container that allows storage of any single
     string (based on a single-byte character set) the DNAString
     container can only store a string based on the DNA alphabet (see
     below).  In addition, the letters stored in a DNAString object are
     encoded in a way that optimizes fast search algorithms.

The DNA alphabet:

     This alphabet contains all letters from the IUPAC Extended Genetic
     Alphabet (see '?IUPAC_CODE_MAP') plus '"-"' (the _gap_ letter),
     '"+"' (the _hard masking_ letter), and '"."' (the _not a letter_
     or _not available_ letter).  It is stored in the 'DNA_ALPHABET'
     predefined constant (character vector).

     The 'alphabet()' function returns 'DNA_ALPHABET' when applied to a
     DNAString object.

And further:

> BString(paste(LETTERS, collapse = ""))
26-letter BString object
seq: ABCDEFGHIJKLMNOPQRSTUVWXYZ
> DNAString(paste(LETTERS, collapse = ""))
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start,  : 
  key 69 (char 'E') not in lookup table
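
The flip side of that restriction is that typed containers support sequence-specific operations. As a minimal sketch (the sequence "ACGTTG" is made up for illustration), reverseComplement() works on a DNAString but has no method for a BString, so the call on the BString fails:

```r
library(Biostrings)

dna <- DNAString("ACGTTG")
reverseComplement(dna)   # defined for DNA (and RNA) containers

# The same letters stored as a BString carry no biological meaning,
# so no reverseComplement() method applies and the call signals an error.
res <- tryCatch(reverseComplement(BString("ACGTTG")), error = function(e) e)
inherits(res, "error")
```

So if the data really is DNA, RNA or protein, the typed container is usually the better choice; the BString container is there for everything else.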

ADD COMMENT • link 4.5 years ago James W. MacDonald 67k

Yes. I understand that. I just can't think of any situation where you would have a set of data that did not use the proper lettering system for DNA, RNA, or AA (depending on your sequence). As such, why does BString exist? It seems to serve no purpose. This is why I was asking for a specific situation where you would use BString, because I can't think of any. I mainly use Biostrings in conjunction with msa, so I am unaware of other functions. I could have been clearer in my original question.

ADD REPLY • link 4.5 years ago skigirl618 • 0

You would use a BString or BStringSet object in any situation where your character data is not DNA, RNA, or protein sequences, but you still want to be able to use Biostrings tools like matchPattern() or pairwiseAlignment(). Note that the native input data structures for these tools are XString or XStringSet derivatives. This means that, even if these tools might seem to accept ordinary character vectors, the first thing they do is convert these ordinary character vectors into BString or BStringSet objects, because those are the data structures that these tools were implemented to work on and optimized for.
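
A small sketch of that conversion, assuming matchPattern() is called with a plain length-1 character vector as the subject (the text is made up): the subject stored in the returned views should be a BString rather than the original character vector.

```r
library(Biostrings)

# Ordinary character string passed as the subject.
m <- matchPattern("bc", "XAaaaabcdeFF")

class(subject(m))   # class of the subject held by the returned views
start(m)            # start position(s) of the match within the subject
```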

For example, if your data is general text and you want to count the occurrences of a set of keywords in it, you could import your data into a BStringSet object and do something like:

library(Biostrings)

text <- BStringSet(c("XAaaaabcdeFF", "abcabcaabTReX", "XttAAAaDeeeeeERabcaaaa"))
keyword <- BStringSet(c("bc", "abcaa", "aaa", "ab"))
vcountPDict(keyword, text)
#      [,1] [,2] [,3]
# [1,]    1    2    1
# [2,]    0    1    1
# [3,]    2    0    2
# [4,]    1    3    1

Or to simply tabulate the frequency of unique letters in the text:

letterFrequency(text, uniqueLetters(text))
#      A D E F R T X a b c d e t
# [1,] 1 0 0 2 0 0 1 4 1 1 1 1 0
# [2,] 0 0 0 0 1 1 1 4 3 2 0 1 0
# [3,] 3 1 1 0 1 0 1 6 1 1 0 5 2

Of course you can do this with base R and ordinary character vectors, but Biostrings data structures and tools were specifically designed to be more efficient when the data is big.
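
For comparison, here is a rough base R sketch of the keyword count on ordinary character vectors. The helper count_keyword() is hypothetical (it is not part of any package) and simply tests every start position, which is fine for small inputs but not written with large data in mind:

```r
txt <- c("XAaaaabcdeFF", "abcabcaabTReX", "XttAAAaDeeeeeERabcaaaa")
keywords <- c("bc", "abcaa", "aaa", "ab")

# Count overlapping, case-sensitive occurrences of one fixed keyword in each
# string by comparing the substring at every possible start position.
count_keyword <- function(keyword, strings) {
  vapply(strings, function(s) {
    n_starts <- nchar(s) - nchar(keyword) + 1L
    if (n_starts < 1L) return(0L)   # keyword longer than the string
    starts <- seq_len(n_starts)
    sum(substring(s, starts, starts + nchar(keyword) - 1L) == keyword)
  }, integer(1), USE.NAMES = FALSE)
}

# One row per keyword, one column per string.
counts <- t(vapply(keywords, count_keyword, integer(length(txt)), strings = txt))
counts
```

The resulting counts match the vcountPDict() matrix shown earlier; what changes is how much work you have to do yourself and how well it scales.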

Hope this helps,

H.

ADD REPLY • link 4.5 years ago Hervé Pagès 16k