Subsetting a DNAStringSet object with many DNAStrings
dr ▴ 10
@dr-9473

Hi,

I have a Biostrings DNAStringSet object containing many DNAStrings, and I want to truncate each of them so that it runs from position 1 to the smaller of its own length and a fixed cutoff.

So far I'm using a for loop for this, as in this example:

library(dplyr)  # provides the %>% pipe
set.seed(1)
# Build 100 random DNA sequences of varying length
seq.set <- lapply(1:100, function(s) paste(sample(c("A","C","G","T"), as.integer(abs(rnorm(1,500,1000))), replace = TRUE), collapse = "")) %>%
  unlist() %>%
  Biostrings::DNAStringSet(.)

# Truncate each sequence to at most 650 bases, one element at a time
for (s in seq_along(seq.set))
  seq.set[s] <- Biostrings::subseq(seq.set[s], 1, min(650, Biostrings::width(seq.set[s])))

My real DNAStringSet holds ~200,000 DNAStrings, so this loop takes quite a while. Is there a faster solution?

dr ▴ 10
@dr-9473

It seems the way to go is to use lapply to find the end point of each DNAString in the DNAStringSet, and then pass those end points as the end argument in a single call to Biostrings::subseq:

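library(dplyr)  # provides the %>% pipe
set.seed(1)
# Build 100 random DNA sequences of varying length
seq.set <- lapply(1:100, function(s) paste(sample(c("A","C","G","T"), as.integer(abs(rnorm(1,500,1000))), replace = TRUE), collapse = "")) %>%
  unlist() %>%
  Biostrings::DNAStringSet(.)

# One end position per sequence: the smaller of 650 and the sequence width
seq.set.ends <- lapply(seq_along(seq.set), function(i) min(650, Biostrings::width(seq.set[i]))) %>% unlist()
# A single subseq() call truncates the whole set at once
seq.set <- Biostrings::subseq(seq.set, start = rep(1, length(seq.set)), end = seq.set.ends)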
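
The lapply over individual elements can probably be dropped as well, since width() already returns one width per sequence and pmin() takes the element-wise minimum. A minimal sketch, assuming Biostrings is installed; the names lens, cutoff, ends and trimmed are illustrative and not from the original post, and the random sequences are generated slightly differently from the example above:

```r
library(Biostrings)

set.seed(1)
# Build 100 random DNA sequences of varying length
lens <- as.integer(abs(rnorm(100, 500, 1000)))
seq.set <- DNAStringSet(vapply(
  lens,
  function(n) paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = ""),
  character(1)
))

cutoff <- 650L  # illustrative cutoff, matching the 650 used above

# width() is vectorized, so pmin() yields every end position in one call
ends <- pmin(width(seq.set), cutoff)

# start = 1L is recycled across all sequences
trimmed <- subseq(seq.set, start = 1L, end = ends)
```

This avoids subsetting the DNAStringSet once per element, which is likely where most of the time goes for ~200,000 sequences.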
@herve-pages-1542

Try heads(x, cutoff), where x is your DNAStringSet and cutoff is the maximum length to keep (650 in your example).

H.
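
A small sketch of the heads() suggestion on a toy input, assuming heads() is made available by attaching Biostrings (it appears to come from one of its dependencies rather than from Biostrings itself). The three short sequences and the cutoff of 5 are made up for illustration, and the result is cross-checked against an explicit subseq()/pmin() call:

```r
library(Biostrings)  # attaching Biostrings is assumed to make heads() available

# Toy input: one sequence longer than the cutoff, one shorter, one equal
x <- DNAStringSet(c("ACGTACGTAC", "ACG", "ACGTA"))
cutoff <- 5L

# Keep at most the first `cutoff` letters of each sequence
trimmed <- heads(x, cutoff)

# Cross-check against the explicit subseq()/pmin() formulation
expected <- subseq(x, start = 1L, end = pmin(width(x), cutoff))
stopifnot(identical(as.character(trimmed), as.character(expected)))
```

Sequences shorter than the cutoff are returned unchanged, so no separate length check is needed.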
