Question

How to prevent readDNAStringSet() function from switching masked sequences (lowercase) to uppercase sequence?

0

Entering edit mode

beausoleilmo • 0

@beausoleilmo-13527

Last seen 7.1 years ago

I'm using the package Biostrings

library("Biostrings")

DNAString("TATCAAATACTCAAGCACtaaggaaacaggaaaatct")

will return

37-letter "DNAString" instance seq: TATCAAATACTCAAGCACTAAGGAAACAGGAAAATCT

Why is aaggaaacaggaaaatct not staying in lowercase?

Is there a way to prevent the transformation to uppercase? I need to keep the lowercase letters.

biostrings mask repeatmasker DNA • 2.8k views

ADD COMMENT • link updated 3.9 years ago by ivantsers • 0 • written 7.1 years ago by beausoleilmo • 0

score 0 · Answer 1 · 2017-11-10

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 20 hours ago

United States

The simple answer is 'Because that is how it is designed to work'. From ?DNAString, which you should always check first:

Details:

     The DNAString class is a direct XString subclass (with no
     additional slot).  Therefore all functions and methods described
     in the XString man page also work with a DNAString object
     (inheritance).

     Unlike the BString container that allows storage of any single
     string (based on a single-byte character set) the DNAString
     container can only store a string based on the DNA alphabet (see
     below).  In addition, the letters stored in a DNAString object are
     encoded in a way that optimizes fast search algorithms.

And

> DNA_ALPHABET
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."

Since there are no lower case letters in the DNA alphabet, you can't have them in a DNAString. But do note that:

> BString("TATCAAATACTCAAGCACtaaggaaacaggaaaatct")
  37-letter "BString" instance
seq: TATCAAATACTCAAGCACtaaggaaacaggaaaatct

Where the BString class inherits from XString, just like DNAString, so it should probably suffice?

ADD COMMENT • link 7.1 years ago James W. MacDonald 67k

0

Entering edit mode

The thing is that I want to read a big fasta file, not just individual sequences.

readDNAStringSet(filepath = "~/Desktop/UCSC/my.genome.fa.gz")

But this one is not taking the masked sequences... is there a readDNAStringSet that can keep the masked sequences?

ADD REPLY • link 7.1 years ago beausoleilmo • 0

0

Entering edit mode

The first place you should be looking is the help page for the function you are trying to use! I already tried to nudge you in that direction, so let me be more direct. Reading the help page first will save you lots of time, and that is what you should always do. If it seems to people who might help that you are using the support site as a stand in for doing your own homework, you will quickly find any help you might get dwindling down to nothing.

So do you see anything in the help page for readDNAStringSet that might be applicable to your situation, particularly in relation to what I have already told you about BStrings?

ADD REPLY • link 7.1 years ago James W. MacDonald 67k

0

Entering edit mode

I actually look at help pages. I haven't found anything related to that. I also found somebody else, on another forum, asking for the same question and they said to ask the question here. Now, it seems that there are no easy way to do want I'm trying to aim.

This is my solution:

I used the raw genome that is masked, and then I needed to transform it in a file that was readable by R.

# Create one big line
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g' ~/Desktop/UCSC/my.fasta1.fa > ~/Desktop/UCSC/my.fasta2.fa
# brew install gnu-sed # Because I'm on a mac 
gsed  's/>[A-Z]\{2\}[0-9]\{6\}/\n&\n/g;s/^\n//' ~/Desktop/UCSC/my.fasta2.fa > ~/Desktop/UCSC/my.fasta3.fa

Then from this, I imported my file in R and did whatever I wanted. I actually merged that file with the one from the readDNAStringSet function to benefit from both worlds. Thanks for your help.

# I actually found a way to print the masked sequences, so here they are... 
load("~/Desktop/UCSC/whateverIdidtothefastafileinbetween.RData",
     verbose = TRUE)

df = merge(df,
      ucsc.genome.masked, 
      by.x = "seq_name", 
      by.y = "scaffolds.names")

If you have another solution, I'd be happy to see it.

ADD REPLY • link 7.0 years ago beausoleilmo • 0

0

Entering edit mode

For those who find this thread in Google: The function for reading multi-lined fasta containing both lower- and uppercase letters actually exists in Biostings. It is readBStringSet(). This function will keep lowercase letters in a sequence. To write a BStringSet object in a file one may mere use writeBstringSet(). Maybe these functions were not presented in earlier versions of Biostings. My version is 2.58.0.

ADD REPLY • link 3.9 years ago ivantsers • 0