Question

PDict in GenomicRanges

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 16 hours ago

Seattle, WA, United States

Dear Haiying, On 08/04/2012 05:33 AM, Haiying Kong wrote: > Dear Dr. Pages, > > I am trying to create a PDict object with a DNAStringSet to use on > countPDict function. The sequences in the DNAStringSet are having > different lengths. It seems like this causes a problem. > Should I classify the sequences in the DNAStringSet with their > lengths and create multiple number of PDict objects? That would work but you could also preprocess only the rectangular DNAStringSet obtained by trimming the sequences to the length of the shortest. This implies 2 things: (1) the shortest sequence is not too short otherwise performance will degrade a lot (depending on the number of "patterns" in your DNAStringSet and the length of the "subject", should be at least 10 nucleotides for 1 million patterns to match against Human chr1), (2) you want to do exact matching. If those 2 conditions are satisfied, then it's very easy: > dna <- DNAStringSet(c("AACAT", "TCCAGAAAA", "GTCCACA")) > dna A DNAStringSet instance of length 3 width seq [1] 5 AACAT [2] 9 TCCAGAAAA [3] 7 GTCCACA > pdict <- PDict(dna, tb.end=min(width(dna))) > pdict TB_PDict object of length 3 and variable width (preprocessing algo="ACtree2"): - with NO head - with a Trusted Band of width 5 - with a tail of variable width (min=0 / max=4) Here is what the above means. The original DNAStringSet was split into 3 pieces: Trusted head \ Band / tail | | |AACAT| |TCCAG|AAAA |GTCCA|CA Only the "Trusted Band" (which is rectangular) is actually preprocessed (i.e. stored in the Aho-Corasick tree). Then: > countPDict(pdict, DNAString("GTCCACAACATTCCAGAAAGGTCCACA")) [1] 1 0 2 Should take less than 3 minutes on Human chr1 with 1 million patterns, a min pattern length >= 10 nucleotides (and an average pattern length of 25). This drops to less than 1 minute on Human chr1 with 1 million patterns, a min pattern length >= 21 nucleotides (and an average pattern length of 85). Note that if you call matchPDict() or countPDict() with a non-zero max.mismatch value, then the mismatches will be allowed on the tail *only*. Finally, note that if you don't have too many patterns (< 10000), you can avoid the preprocessing entirely by calling matchPDict() or countPDict() directly on your DNAStringSet object: > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA")) [1] 1 0 2 > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), max.mismatch=1) [1] 1 1 2 > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), max.mismatch=2) [1] 4 2 3 > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), max.mismatch=2, with.indels=TRUE) [1] 5 2 3 In that case, mismatches are allowed anywhere and you can even allow indels if you want. Of course, this will be *much* slower than when preprocessing with PDict(). Please let me know if you have any further questions about this. Cheers, H. > > Thank you very much in advance. > > Best Regards, > > Haiying -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Preprocessing Cancer Preprocessing Cancer • 1.6k views

ADD COMMENT • link 12.7 years ago Hervé Pagès 16k

score 0 · Answer 1 · 2012-08-04

Haiying, Let's keep this on the Bioconductor mailing list so others can give advice. On 08/04/2012 03:41 PM, Haiying Kong wrote: > Dear Dr. Pag?s, > > Thank you very much for your explanation. It is very helpful, indeed. > > Could you please help me by any chance for the other part of the work? > I need to split a big SFF file into a number of small SFF files. > I was able to read in the big SFF file as a big SFFContainer with the > function readSFF in the Bioconductor package:R453Plus1Toolbox. I also > fond the index to split the big SFFContainer. For example, the index > (5,11,17,22,25,29,31,33,37) for the first small SFFContainer. > I couldn't figure out how to extract the > (5,11,17,22,25,29,31,33,37)th reads from the big SSFContainer and > construct a small SFFContainer and save it as an SFF file. Could you > please tell me by any chance how to do this part of work? Are there any > Bioconductor functions you could recommend? I'm not familiar with SFF files or with the R453Plus1Toolbox package but it doesn't seem like the package is providing something like a writeSFF() function. If all you want to do is split the file into smaller files, I would check some 3rd party tools like DNA Baser SFF Workbench, which, according to their website, can "Cut a SFF file in smaller pieces". Don't know what kind of license they have though and what platforms they support in addition to Windows. Anyway other people on this list might have better suggestions. Cheers, H. > > Thank you very much in advance. > > Best Regards, > > Haiying > > On Sun, Aug 5, 2012 at 6:34 AM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Dear Haiying, > > > On 08/04/2012 05:33 AM, Haiying Kong wrote: > > Dear Dr. Pages, > > I am trying to create a PDict object with a DNAStringSet to > use on > countPDict function. The sequences in the DNAStringSet are having > different lengths. It seems like this causes a problem. > Should I classify the sequences in the DNAStringSet with their > lengths and create multiple number of PDict objects? > > > That would work but you could also preprocess only the rectangular > DNAStringSet obtained by trimming the sequences to the length of > the shortest. This implies 2 things: (1) the shortest sequence > is not too short otherwise performance will degrade a lot (depending > on the number of "patterns" in your DNAStringSet and the length of > the "subject", should be at least 10 nucleotides for 1 million patterns > to match against Human chr1), (2) you want to do exact matching. > > If those 2 conditions are satisfied, then it's very easy: > > > dna <- DNAStringSet(c("AACAT", "TCCAGAAAA", "GTCCACA")) > > dna > A DNAStringSet instance of length 3 > width seq > [1] 5 AACAT > [2] 9 TCCAGAAAA > [3] 7 GTCCACA > > > pdict <- PDict(dna, tb.end=min(width(dna))) > > pdict > TB_PDict object of length 3 and variable width (preprocessing > algo="ACtree2"): > - with NO head > - with a Trusted Band of width 5 > - with a tail of variable width (min=0 / max=4) > > Here is what the above means. The original DNAStringSet was > split into 3 pieces: > > Trusted > head \ Band / tail > | | > |AACAT| > |TCCAG|AAAA > |GTCCA|CA > > Only the "Trusted Band" (which is rectangular) is actually > preprocessed (i.e. stored in the Aho-Corasick tree). > Then: > > > countPDict(pdict, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA")) > [1] 1 0 2 > > Should take less than 3 minutes on Human chr1 with 1 million patterns, > a min pattern length >= 10 nucleotides (and an average pattern length > of 25). > > This drops to less than 1 minute on Human chr1 with 1 million patterns, > a min pattern length >= 21 nucleotides (and an average pattern length > of 85). > > Note that if you call matchPDict() or countPDict() with a non- zero > max.mismatch value, then the mismatches will be allowed on the tail > *only*. > > Finally, note that if you don't have too many patterns (< 10000), > you can avoid the preprocessing entirely by calling matchPDict() > or countPDict() directly on your DNAStringSet object: > > > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA")) > [1] 1 0 2 > > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"), > max.mismatch=1) > [1] 1 1 2 > > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"), > max.mismatch=2) > [1] 4 2 3 > > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"), > max.mismatch=2, with.indels=TRUE) > [1] 5 2 3 > > In that case, mismatches are allowed anywhere and you can even allow > indels if you want. Of course, this will be *much* slower than when > preprocessing with PDict(). > > Please let me know if you have any further questions about this. > > Cheers, > H. > > > > Thank you very much in advance. > > Best Regards, > > Haiying > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319