Question

Using non-standard characters with XString objects in Biostrings


aravind-j ▴ 20


To facilitate interoperability with other Bioconductor packages, it is suggested that common data structures be reused.

I am developing a package for possible release in Bioconductor which takes nucleotide sequences as input. Using XString objects from Biostrings would be ideal for the input. However, my package functions can deal with non-standard nucleotide bases (I: Inosine; A*: Hydroxyadenine; XC: Cis-azobenzene; XT: Trans-azobenzene; X_L: Locked nucleic acid).

Hence, at present I have avoided XString objects, as they can accommodate only the standard IUPAC nomenclature, and I use a character vector as the sequence input, with suitable downstream checks.

Is this strategy adequate, or is there an alternative approach that sticks to the Bioconductor guidelines?


BioStrings Inosine bases Hydroxyadenine Azobenzene IUPAC • 1.2k views

updated 6.3 years ago by Hervé Pagès 16k • written 6.3 years ago by aravind-j ▴ 20

score 1 · Answer 1 · 2019-01-18

Hi,

The main problem with those codes is that they are not 1-letter codes, so they wouldn't play well with the "1 letter per nucleotide" paradigm that is at the core of nucleotide sequence representation in Biostrings. So the first thing I would recommend is to use 1-letter codes to represent those exotic nucleotide bases.
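To make the 1-letter recoding concrete, here is a minimal sketch. It is written in Python purely as a neutral illustration and is not Biostrings code. The letters O, J and Z are hypothetical choices (picked only because they are not IUPAC nucleotide codes), `to_one_letter` is a made-up helper name, and X_L is left out of the table.

```python
# Hypothetical recoding table: multi-letter codes -> 1-letter codes.
# O, J and Z are illustrative assumptions, not an established convention.
ONE_LETTER = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "I": "I",    # inosine (already a single letter)
    "A*": "O",   # hydroxyadenine
    "XC": "J",   # cis-azobenzene
    "XT": "Z",   # trans-azobenzene
}


def to_one_letter(seq, table=ONE_LETTER):
    """Recode a sequence written with multi-letter codes, one letter per base."""
    # Try longer codes first so that "A*" is not read as a plain "A".
    tokens = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(seq):
        for tok in tokens:
            if seq.startswith(tok, i):
                out.append(table[tok])
                i += len(tok)
                break
        else:
            raise ValueError(f"unrecognised code at position {i}: {seq[i:i + 2]!r}")
    return "".join(out)


recoded = to_one_letter("ACA*GXCTI")
```

After recoding, every base occupies exactly one position, so the length of the string equals the number of bases, which is what the "1 letter per nucleotide" paradigm relies on.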

Another problem, as you found out, is that DNAString/DNAStringSet and RNAString/RNAStringSet objects only allow letters that belong to the predefined alphabets DNA_ALPHABET and RNA_ALPHABET, respectively. Even though it would be possible (at least in theory) to extend these alphabets to support new letters, this is not a change to make lightly, so it would need to be considered very carefully and supported by a strong use case. And even then it might not be the right thing to do.

An important question is: what kind of sequences are you dealing with? If these exotic nucleotides don't show up in DNA or mRNA molecules, then the DNAString/DNAStringSet or RNAString/RNAStringSet classes are probably not the appropriate classes to represent your sequences in the first place. So maybe you have a case where implementing a new specialized concrete XString subclass would be more appropriate (note that XString is a virtual class with currently 4 concrete subclasses: BString, DNAString, RNAString, and AAString). Then you would be free to choose the alphabet you want to support for this new XString subclass. I believe that the Modstrings package (submitted a couple of weeks ago and still pending review, see here) does something like that, i.e. it defines its own XString/XStringSet subclasses, so it is probably a good place to look if you decide to go this route.
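The idea of a sequence class that enforces its own alphabet can be sketched as follows. This is only a conceptual illustration in Python, not how Biostrings or Modstrings implement their classes; the class name `ModifiedNucString` and the letters I, O, J and Z in its alphabet are hypothetical.

```python
class ModifiedNucString:
    """Toy sequence container that rejects letters outside its own alphabet."""

    # Hypothetical alphabet: standard bases plus four assumed 1-letter
    # codes for modified bases (illustrative only).
    ALPHABET = frozenset("ACGTIOJZ")

    def __init__(self, seq):
        bad = sorted(set(seq) - self.ALPHABET)
        if bad:
            raise ValueError(f"letters not in alphabet: {bad}")
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def letter_frequency(self):
        """Count each alphabet letter, including letters that never occur."""
        return {ch: self.seq.count(ch) for ch in sorted(self.ALPHABET)}


s = ModifiedNucString("ACOGJTI")
freq = s.letter_frequency()
```

The point of the design is that validation happens once, at construction, so downstream functions can assume every letter is a known base instead of re-checking their input.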

Another, much simpler option is to just use BString/BStringSet objects. There are no enforced alphabets for these objects but hey, with character vectors you don't get that kind of enforcement either. At least by using BString/BStringSet objects you can take advantage of the efficient internal representation and fast string matching facilities provided by the Biostrings package.

That being said, if using character vectors does the job for you and performance is reasonable (maybe your sequences are short and you don't deal with hundreds of thousands of them; are you dealing with tRNA?), then you might just want to stick to that. You may have a use case where reusing the Biostrings infrastructure is not worth it, and that's ok.

Cheers,

H.