Question

Biostrings matchPattern with lower case

story.benjamin ▴ 20


Hi,

Is it possible to match specifically lower-case nucleotides (e.g. agct)? Repeat-masked genomes can be soft-masked, which leaves the masked regions in lower case, and in certain cases those regions are of interest in their own right compared with the non-masked regions.

Example:

>random
AGAGTAGTagtAGT

Can Biostrings account for this or is everything automatically converted to upper case under the hood for convenience?

biostrings • 1.2k views

updated 5.0 years ago by Hervé Pagès 16k • written 5.0 years ago by story.benjamin ▴ 20

score 3 · Accepted Answer · 2020-04-17

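DNAString and DNAStringSet objects in Biostrings don't keep track of the case: lower-case letters in the input are stored as upper case, so the soft-masking information is lost once the sequence is loaded into one of these containers.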
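
To make this concrete, here is a minimal sketch using the sequence from the question. It only assumes that the Biostrings package is installed; the variable names are illustrative.

```r
library(Biostrings)

# The soft-masked example sequence from the question.
soft_masked <- "AGAGTAGTagtAGT"

# Loading it into a DNAString discards the case information.
dna <- DNAString(soft_masked)
as.character(dna)            # the lower-case "agt" comes back as "AGT"

# With the case gone, "AGT" also matches inside the formerly masked region.
hits <- matchPattern("AGT", dna)
start(hits)
```

The match that starts inside the original lower-case stretch can no longer be told apart from the others.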

Note that we provide "masked genomes" for some organisms (e.g. BSgenome.Hsapiens.UCSC.hg38.masked), where the chromosome sequences carry various masks (e.g. the RepeatMasker mask, among others). You can use one of those if you need string matching tools like matchPattern() to ignore the masked regions.
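
A rough sketch of that approach, assuming the BSgenome.Hsapiens.UCSC.hg38.masked package is installed (it is a large download). The choice of chr22 and the search pattern are arbitrary illustrations, and which masks are active by default may differ between packages, so check with active() before relying on it.

```r
library(BSgenome.Hsapiens.UCSC.hg38.masked)

genome <- BSgenome.Hsapiens.UCSC.hg38.masked
chr22 <- genome$chr22          # a MaskedDNAString

# Show which masks are currently switched on for this chromosome.
active(masks(chr22))

# Switch on the RepeatMasker mask ("RM") so matching skips those regions.
active(masks(chr22))["RM"] <- TRUE

# Hits are only reported in the regions that are not covered by an active mask.
hits <- matchPattern("AGAGTAGT", chr22)
length(hits)
```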

Another approach is to use BString/BStringSet objects instead of DNAString/DNAStringSet objects. Unlike the latter, the former preserve the case. (The BStringSet container is the general-purpose string container in Biostrings, so it is analogous to an ordinary character vector in base R.) Note that some matchPattern() features specific to DNAString/DNAStringSet objects won't work with BString/BStringSet objects (e.g. fixed=FALSE).
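
A minimal sketch of the BString approach, again using the sequence from the question and assuming only that Biostrings is installed. The regular-expression step at the end is plain base R, added here as one possible way to list the soft-masked stretches; it is not a Biostrings feature.

```r
library(Biostrings)

# A BString keeps the sequence exactly as typed, including the case.
soft_masked <- BString("AGAGTAGTagtAGT")

# Matching is case-sensitive, so each pattern only hits its own case.
lower_hits <- matchPattern("agt", soft_masked)
upper_hits <- matchPattern("AGT", soft_masked)
start(lower_hits)
start(upper_hits)

# Base R sketch: locate every lower-case run and store it as ranges.
# (If the sequence had no lower-case letters, gregexpr() would return -1
# and this IRanges() call would need a guard.)
runs <- gregexpr("[acgtn]+", as.character(soft_masked))[[1]]
masked_regions <- IRanges(start = as.integer(runs),
                          width = attr(runs, "match.length"))
masked_regions
```

Keep in mind that on a BString the letters carry no DNA meaning, so IUPAC ambiguity codes in the pattern are treated as literal characters.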

Hope this helps.

H.