how to extract part of the DNA sequences from a DNAstringset
2
1
Entering edit mode
XIA.PAN ▴ 20
@xiapan-12407
Last seen 5.3 years ago

Dear all,

I have a large DNAstringset from amplicons and only need the random region in each sequence from this set. 

I got the position (or index) of the part that I needed use vmattchpattern function.

hits=vmatchPattern("AAANNNNNNNNAAA",sread,max.mismatch=1,fixed=FALSE)

hits.frame=as.data.frame(hits)

The start and end position is a bit different from sequence to sequence. how can I apply these index (start, end) back to the DNAstringset and get the random region as a new DNAstringset.

Thank you

XIA 

 

DNAstringset • 6.4k views
ADD COMMENT
3
Entering edit mode
@herve-pages-1542
Last seen 14 days ago
Seattle, WA, United States

Hi XIA,

You can use extractAt to extract the sequences corresponding to the hits:

extractAt(sread, hits)

See ?extractAt for more information.

If you have off-limits hits, you can either get rid of them with:

hits <- as(hits, "CompressedIRangesList")
hits <- hits[start(hits) >= 1 & end(hits) <= width(sread)]

or trim them with:

hits <- as(hits, "CompressedIRangesList")
hits <- restrict(hits, 1, width(sread))

In both cases the reason I first turned hits into a CompressedIRangesList object is because CompressedIRangesList objects support the kind of operations that we need to perform (i.e. subsetting with [ in the first case, and restrict() in the 2nd case). Unfortunately, MIndex objects don't support these operations. 

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi H,

The difference with Mikes' below is that not every sequence contains the random region in my initial DNAStringSet. Yours will contain those rows without a hit, which will be useful if need mapping back to the original set

Thank you.

XIA    

ADD REPLY
2
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 1 hour ago
EMBL Heidelberg

How about something like this?  First we'll set up an example with 3 DNA sequences, and we're going to look for the pattern TACG in each of them, just like you've already done.

dna_initial <- DNAStringSet(c("ACGTACGGACGT",
                              "GTACGTACGGAC",
                              "AAAAGGGAAAAG"))

hits = vmatchPattern("TACG", dna_initial, max.mismatch=0, fixed=FALSE)
hits.frame = as.data.frame(hits)​
The pattern appears once in the first string, twice in the second, and not at all in the third.  No idea if that's possible in your data, but it's nice to get a solution that applies to a general case.
> hits.frame
  group group_name start end width
1     1       <NA>     4   7     4
2     2       <NA>     2   5     4
3     2       <NA>     6   9     4

Now we'll use mapply() to call the subseq() function on three vectors (the set of strings to be subset, the start positions and the ends).  The other things to note is that we use hits.frame$group to select the strings we're going to process.  This has the effect of dropping any strings that weren't found, and replicating those that have our substring multiple times.

dna_list <- mapply(subseq, dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)

The result from this is a list of DNAStrings, which we can pass to DNAStringSet() to get back a single set.

dna_regions <- DNAStringSet(dna_list)
> dna_regions
  A DNAStringSet instance of length 3
    width seq
[1]     4 TACG
[2]     4 TACG
[3]     4 TACG

The result in our case isn't very interesting, but it's found the three instances of the sequence we're looking for and extracted them from the original data. 

ADD COMMENT
0
Entering edit mode

Hi Mike,

Thank you. I have another problem is that some of the sequences starts at '0' as I have set mismatch=1.

How can I do a filter step, deleting the rows in hits.frame if the start=0

Thank you

 

ADD REPLY
0
Entering edit mode

Do you really want to remove the entries that start with 0?  If you want to do that you could this:

hits.frame <- hits.frame[ which(hits.frame$start > 0) , ]

This finds all the rows where the start value greater than 0, and only returns those. 

Perhaps you would rather change the 0 entries to 1, it depends what you're trying to achieve.  You can do that with the following:

hits.frame$start <- max(1, hits.frame$start)

This has the effect of setting any start value less than 1, to 1.  You might want to do something similar with your end values if they also go out of range.

ADD REPLY
0
Entering edit mode

Hi Martin,

Thank you. The hits.frame filter worked, but if use mapply as below, returns only one sequence

dna_list <- mapply(subseq, dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)

so I used  dna_list <- subseq(dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)

and got the new DNAStringset.

 

ADD REPLY

Login before adding your answer.

Traffic: 730 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6