BioStrings questions

0

Entering edit mode

rcaloger ▴ 500

@rcaloger-1888

Last seen 10.2 years ago

European Union

Hi, In my onechannelGUI package I am developing a section related to Affymetrix exon array analysis, creating few functions that allow the association of exon-level Probe Selection Region (PSR) to refseq 1st question: I have implemented a function that blast a list of PSR sequences over all refseq. However, I would like to know if there is any way of doing something similar using the Biostring package. I tried the pairwiseAlignment function but it is quite slow compared to blast. 2nd question: there is any way of merging two DNAStringSets ? Cheers Raffaele -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit Dipartimento di Scienze Cliniche e Biologiche c/o Az. Ospedaliera S. Luigi Regione Gonzole 10, Orbassano 10043 Torino tel. ++39 0116705417 Lab. ++39 0116705408 Fax ++39 0119038639 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it Info: http://publicationslist.org/raffaele.calogero

probe oneChannelGUI probe oneChannelGUI • 2.1k views

ADD COMMENT • link 16.5 years ago rcaloger ▴ 500

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 10.6 years ago

United States

Raffaele, The pairwiseAlignment function uses an O(nm) {with n and m being the length of the two sequences being aligned} dynamic programming algorithm that is designed to find an optimal alignment and as you have discovered isn't intended for use with a long reference sequence. Do your PSR sequences map nearly exactly to a location on your reference sequence and are these sequences of equal length? If so, see the matchPDict function. It matches a pattern dictionary consisting of equal length fragments to a reference sequence. The pseudo code looks something like: psrPDict <- PDict(PSRDNAStringSet) matchPDict(psrPDict, refseq) To answer your second question, the append function should get you what you want: > append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", "TTTACCC"))) A DNAStringSet instance of length 4 width seq [1] 3 AAA [2] 2 GA [3] 4 ACTG [4] 7 TTTACCC Patrick rcaloger wrote: > Hi, > In my onechannelGUI package I am developing a section related to > Affymetrix exon array analysis, creating few functions that allow the > association of exon-level Probe Selection Region (PSR) to refseq > > 1st question: > I have implemented a function that blast a list of PSR sequences over > all refseq. > However, I would like to know if there is any way of doing something > similar using the Biostring package. > I tried the pairwiseAlignment function but it is quite slow compared > to blast. > > 2nd question: > there is any way of merging two DNAStringSets ? > > Cheers > Raffaele > > >

ADD COMMENT • link 16.5 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

rcaloger ▴ 500

@rcaloger-1888

Last seen 10.2 years ago

European Union

Dear Patrick, many thanks for the quick answer. Unfortunately PSR can be located in any position of the refseq and the sequence length can be very different. Therefore, I cannot apply your suggestion. Cheers Raffaele Patrick Aboyoun wrote: > Raffaele, > The pairwiseAlignment function uses an O(nm) {with n and m being the > length of the two sequences being aligned} dynamic programming > algorithm that is designed to find an optimal alignment and as you > have discovered isn't intended for use with a long reference sequence. > Do your PSR sequences map nearly exactly to a location on your > reference sequence and are these sequences of equal length? If so, see > the matchPDict function. It matches a pattern dictionary consisting of > equal length fragments to a reference sequence. The pseudo code looks > something like: > > psrPDict <- PDict(PSRDNAStringSet) > matchPDict(psrPDict, refseq) > > To answer your second question, the append function should get you > what you want: > > > append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", > "TTTACCC"))) > A DNAStringSet instance of length 4 > width seq > [1] 3 AAA > [2] 2 GA > [3] 4 ACTG > [4] 7 TTTACCC > > > Patrick > > > rcaloger wrote: >> Hi, >> In my onechannelGUI package I am developing a section related to >> Affymetrix exon array analysis, creating few functions that allow the >> association of exon-level Probe Selection Region (PSR) to refseq >> >> 1st question: >> I have implemented a function that blast a list of PSR sequences over >> all refseq. >> However, I would like to know if there is any way of doing something >> similar using the Biostring package. >> I tried the pairwiseAlignment function but it is quite slow compared >> to blast. >> >> 2nd question: >> there is any way of merging two DNAStringSets ? >> >> Cheers >> Raffaele >> >> >> > -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit Dipartimento di Scienze Cliniche e Biologiche c/o Az. Ospedaliera S. Luigi Regione Gonzole 10, Orbassano 10043 Torino tel. ++39 0116705417 Lab. ++39 0116705408 Fax ++39 0119038639 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it Info: http://publicationslist.org/raffaele.calogero

ADD COMMENT • link 16.5 years ago rcaloger ▴ 500

0

Entering edit mode

Hi Raffaele, The PSR do not need to be of equal length with PDict/matchPDit. An important limitation though is that indels are not supported for now (but will be in the very near future). Another limitation is that you need to define a Trusted Band on your PDict object i.e. a vertical band of constant width where mismatches are not allowed. For example: psrPDict <- PDict(PSRDNAStringSet, start=5, end=15) will set a Trusted Band of width 11 that covers nucleotides 5 to 15 of each PSR. Then use the 'max.mismatch' arg of matchPDict() to control the max number of mismatches that you want to allow *outside* the Trusted Band for each PSR (the nucleotides that are *inside* must match exactly). Note that even if you want to perform exact matching, you need to define a Trusted Band if your PSR are not of equal length. It's important to try to define the widest possible Trusted Band since the performance of the underlying algo depends directly on how wide it is. Typically the performance will be good if the width is >= 12 and will start to drop significantly for smaller values. Of course if you want to use you PDict object for exact matching only, then you want to set the Trusted Band to the max possible width i.e. to min(width(PSRDNAStringSet)). Hope this helps, H. Quoting rcaloger <raffaele.calogero at="" gmail.com="">: > Dear Patrick, > many thanks for the quick answer. Unfortunately PSR can be located in > any position of the refseq and the sequence length can be very > different. Therefore, I cannot apply your suggestion. > Cheers > Raffaele > > Patrick Aboyoun wrote: >> Raffaele, >> The pairwiseAlignment function uses an O(nm) {with n and m being >> the length of the two sequences being aligned} dynamic programming >> algorithm that is designed to find an optimal alignment and as you >> have discovered isn't intended for use with a long reference >> sequence. Do your PSR sequences map nearly exactly to a location on >> your reference sequence and are these sequences of equal length? >> If so, see the matchPDict function. It matches a pattern >> dictionary consisting of equal length fragments to a reference >> sequence. The pseudo code looks something like: >> >> psrPDict <- PDict(PSRDNAStringSet) >> matchPDict(psrPDict, refseq) >> >> To answer your second question, the append function should get you >> what you want: >> >>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", "TTTACCC"))) >> A DNAStringSet instance of length 4 >> width seq >> [1] 3 AAA >> [2] 2 GA >> [3] 4 ACTG >> [4] 7 TTTACCC >> >> >> Patrick >> >> >> rcaloger wrote: >>> Hi, >>> In my onechannelGUI package I am developing a section related to >>> Affymetrix exon array analysis, creating few functions that allow >>> the association of exon-level Probe Selection Region (PSR) to refseq >>> >>> 1st question: >>> I have implemented a function that blast a list of PSR sequences >>> over all refseq. >>> However, I would like to know if there is any way of doing >>> something similar using the Biostring package. >>> I tried the pairwiseAlignment function but it is quite slow >>> compared to blast. >>> >>> 2nd question: >>> there is any way of merging two DNAStringSets ? >>> >>> Cheers >>> Raffaele >>> >>> >>> >> > > > -- > > ---------------------------------------- > Prof. Raffaele A. Calogero > Bioinformatics and Genomics Unit > Dipartimento di Scienze Cliniche e Biologiche > c/o Az. Ospedaliera S. Luigi > Regione Gonzole 10, Orbassano > 10043 Torino > tel. ++39 0116705417 > Lab. ++39 0116705408 > Fax ++39 0119038639 > Mobile ++39 3333827080 > email: raffaele.calogero at unito.it > raffaele[dot]calogero[at]gmail[dot]com > www: http://www.bioinformatica.unito.it > Info: http://publicationslist.org/raffaele.calogero > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 16.5 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, many thanks for the informations. After digging into the man pages of the Biostrings package I found the countPatter function. It is nice because it returns me the simple presence of the pattern allowing a certain number of mismatches score.tmp <- countPattern(PSR.seq , DNAString(as.character(exs.seq[j])), max.mismatch=3) I found it very useful if a limited number of sequences are queried, however, when I work with the full refseq fasta file blast is much faster. Cheers Raffaele hpages at fhcrc.org wrote: > Hi Raffaele, > > The PSR do not need to be of equal length with PDict/matchPDit. > An important limitation though is that indels are not supported > for now (but will be in the very near future). > Another limitation is that you need to define a Trusted Band on > your PDict object i.e. a vertical band of constant width where > mismatches are not allowed. For example: > > psrPDict <- PDict(PSRDNAStringSet, start=5, end=15) > > will set a Trusted Band of width 11 that covers nucleotides > 5 to 15 of each PSR. > Then use the 'max.mismatch' arg of matchPDict() to control the max > number of mismatches that you want to allow *outside* the Trusted > Band for each PSR (the nucleotides that are *inside* must match > exactly). > Note that even if you want to perform exact matching, you need > to define a Trusted Band if your PSR are not of equal length. > It's important to try to define the widest possible Trusted Band > since the performance of the underlying algo depends directly > on how wide it is. Typically the performance will be good if the > width is >= 12 and will start to drop significantly for smaller > values. > Of course if you want to use you PDict object for exact matching > only, then you want to set the Trusted Band to the max possible > width i.e. to min(width(PSRDNAStringSet)). > > Hope this helps, > H. > > > Quoting rcaloger <raffaele.calogero at="" gmail.com="">: > >> Dear Patrick, >> many thanks for the quick answer. Unfortunately PSR can be located in >> any position of the refseq and the sequence length can be very >> different. Therefore, I cannot apply your suggestion. >> Cheers >> Raffaele >> >> Patrick Aboyoun wrote: >>> Raffaele, >>> The pairwiseAlignment function uses an O(nm) {with n and m being >>> the length of the two sequences being aligned} dynamic programming >>> algorithm that is designed to find an optimal alignment and as you >>> have discovered isn't intended for use with a long reference >>> sequence. Do your PSR sequences map nearly exactly to a location on >>> your reference sequence and are these sequences of equal length? >>> If so, see the matchPDict function. It matches a pattern >>> dictionary consisting of equal length fragments to a reference >>> sequence. The pseudo code looks something like: >>> >>> psrPDict <- PDict(PSRDNAStringSet) >>> matchPDict(psrPDict, refseq) >>> >>> To answer your second question, the append function should get you >>> what you want: >>> >>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", >>>> "TTTACCC"))) >>> A DNAStringSet instance of length 4 >>> width seq >>> [1] 3 AAA >>> [2] 2 GA >>> [3] 4 ACTG >>> [4] 7 TTTACCC >>> >>> >>> Patrick >>> >>> >>> rcaloger wrote: >>>> Hi, >>>> In my onechannelGUI package I am developing a section related to >>>> Affymetrix exon array analysis, creating few functions that allow >>>> the association of exon-level Probe Selection Region (PSR) to refseq >>>> >>>> 1st question: >>>> I have implemented a function that blast a list of PSR sequences >>>> over all refseq. >>>> However, I would like to know if there is any way of doing >>>> something similar using the Biostring package. >>>> I tried the pairwiseAlignment function but it is quite slow >>>> compared to blast. >>>> >>>> 2nd question: >>>> there is any way of merging two DNAStringSets ? >>>> >>>> Cheers >>>> Raffaele >>>> >>>> >>>> >>> >> >> >> -- >> >> ---------------------------------------- >> Prof. Raffaele A. Calogero >> Bioinformatics and Genomics Unit >> Dipartimento di Scienze Cliniche e Biologiche >> c/o Az. Ospedaliera S. Luigi >> Regione Gonzole 10, Orbassano >> 10043 Torino >> tel. ++39 0116705417 >> Lab. ++39 0116705408 >> Fax ++39 0119038639 >> Mobile ++39 3333827080 >> email: raffaele.calogero at unito.it >> raffaele[dot]calogero[at]gmail[dot]com >> www: http://www.bioinformatica.unito.it >> Info: http://publicationslist.org/raffaele.calogero >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit Dipartimento di Scienze Cliniche e Biologiche c/o Az. Ospedaliera S. Luigi Regione Gonzole 10, Orbassano 10043 Torino tel. ++39 0116705417 Lab. ++39 0116705408 Fax ++39 0119038639 Mobile ++39 3333827080 email: raffaele.calogero at unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it Info: http://publicationslist.org/raffaele.calogero

ADD REPLY • link 16.5 years ago rcaloger ▴ 500

0

Entering edit mode

Hi Raffaele, Quoting rcaloger <raffaele.calogero at="" gmail.com="">: > Hi Herve, > many thanks for the informations. > After digging into the man pages of the Biostrings package I found the > countPatter function. > It is nice because it returns me the simple presence of the pattern > allowing a certain number of mismatches > > score.tmp <- countPattern(PSR.seq , > DNAString(as.character(exs.seq[j])), max.mismatch=3) > > I found it very useful if a limited number of sequences are queried, > however, when I work with the full refseq fasta file blast is much > faster. If you have a big number of query sequences, using countPattern() in a loop won't be very efficient I agree. In that case it's better to use countPDict(). countPDict() is to matchPDict() what countPattern() is to matchPattern(). Use countPDict() the same way you would use matchPDict(). The returned object is different though: it returns an integer vector instead of an MIndex object. With a lot of query sequences, chances are that it will be faster than BLAST: it takes less than 1 minute to count all the exact matches of a set of tens of thousands of 25-mers in Human chormosome 1. Cheers, H. > Cheers > Raffaele > > hpages at fhcrc.org wrote: >> Hi Raffaele, >> >> The PSR do not need to be of equal length with PDict/matchPDit. >> An important limitation though is that indels are not supported >> for now (but will be in the very near future). >> Another limitation is that you need to define a Trusted Band on >> your PDict object i.e. a vertical band of constant width where >> mismatches are not allowed. For example: >> >> psrPDict <- PDict(PSRDNAStringSet, start=5, end=15) >> >> will set a Trusted Band of width 11 that covers nucleotides >> 5 to 15 of each PSR. >> Then use the 'max.mismatch' arg of matchPDict() to control the max >> number of mismatches that you want to allow *outside* the Trusted >> Band for each PSR (the nucleotides that are *inside* must match >> exactly). >> Note that even if you want to perform exact matching, you need >> to define a Trusted Band if your PSR are not of equal length. >> It's important to try to define the widest possible Trusted Band >> since the performance of the underlying algo depends directly >> on how wide it is. Typically the performance will be good if the >> width is >= 12 and will start to drop significantly for smaller >> values. >> Of course if you want to use you PDict object for exact matching >> only, then you want to set the Trusted Band to the max possible >> width i.e. to min(width(PSRDNAStringSet)). >> >> Hope this helps, >> H. >> >> >> Quoting rcaloger <raffaele.calogero at="" gmail.com="">: >> >>> Dear Patrick, >>> many thanks for the quick answer. Unfortunately PSR can be located in >>> any position of the refseq and the sequence length can be very >>> different. Therefore, I cannot apply your suggestion. >>> Cheers >>> Raffaele >>> >>> Patrick Aboyoun wrote: >>>> Raffaele, >>>> The pairwiseAlignment function uses an O(nm) {with n and m being >>>> the length of the two sequences being aligned} dynamic >>>> programming algorithm that is designed to find an optimal >>>> alignment and as you have discovered isn't intended for use with >>>> a long reference sequence. Do your PSR sequences map nearly >>>> exactly to a location on your reference sequence and are these >>>> sequences of equal length? If so, see the matchPDict function. >>>> It matches a pattern dictionary consisting of equal length >>>> fragments to a reference sequence. The pseudo code looks >>>> something like: >>>> >>>> psrPDict <- PDict(PSRDNAStringSet) >>>> matchPDict(psrPDict, refseq) >>>> >>>> To answer your second question, the append function should get >>>> you what you want: >>>> >>>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", "TTTACCC"))) >>>> A DNAStringSet instance of length 4 >>>> width seq >>>> [1] 3 AAA >>>> [2] 2 GA >>>> [3] 4 ACTG >>>> [4] 7 TTTACCC >>>> >>>> >>>> Patrick >>>> >>>> >>>> rcaloger wrote: >>>>> Hi, >>>>> In my onechannelGUI package I am developing a section related to >>>>> Affymetrix exon array analysis, creating few functions that >>>>> allow the association of exon-level Probe Selection Region >>>>> (PSR) to refseq >>>>> >>>>> 1st question: >>>>> I have implemented a function that blast a list of PSR sequences >>>>> over all refseq. >>>>> However, I would like to know if there is any way of doing >>>>> something similar using the Biostring package. >>>>> I tried the pairwiseAlignment function but it is quite slow >>>>> compared to blast. >>>>> >>>>> 2nd question: >>>>> there is any way of merging two DNAStringSets ? >>>>> >>>>> Cheers >>>>> Raffaele >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> >>> ---------------------------------------- >>> Prof. Raffaele A. Calogero >>> Bioinformatics and Genomics Unit >>> Dipartimento di Scienze Cliniche e Biologiche >>> c/o Az. Ospedaliera S. Luigi >>> Regione Gonzole 10, Orbassano >>> 10043 Torino >>> tel. ++39 0116705417 >>> Lab. ++39 0116705408 >>> Fax ++39 0119038639 >>> Mobile ++39 3333827080 >>> email: raffaele.calogero at unito.it >>> raffaele[dot]calogero[at]gmail[dot]com >>> www: http://www.bioinformatica.unito.it >>> Info: http://publicationslist.org/raffaele.calogero >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > > > -- > > ---------------------------------------- > Prof. Raffaele A. Calogero > Bioinformatics and Genomics Unit > Dipartimento di Scienze Cliniche e Biologiche > c/o Az. Ospedaliera S. Luigi > Regione Gonzole 10, Orbassano > 10043 Torino > tel. ++39 0116705417 > Lab. ++39 0116705408 > Fax ++39 0119038639 > Mobile ++39 3333827080 > email: raffaele.calogero at unito.it > raffaele[dot]calogero[at]gmail[dot]com > www: http://www.bioinformatica.unito.it > Info: http://publicationslist.org/raffaele.calogero

ADD REPLY • link 16.5 years ago Hervé Pagès 16k

0

Entering edit mode

The Bioinformatics and Genomics Unit <http: www.bioinformatica.unito.it=""/> (B&Gu) is pleased to inform that we have organized a Data Analysis Workshop entitled: GeneChip^® EXON 1.0 ST Array Practical Data Analysis Course. This hands-on course is suitable to biologists who are new to Exon <http: www.affymetrix.com="" products="" arrays="" specific="" exon.affx=""> array technology. The course is based on the use of Affymetrix^® and Bioconductor open-source softwares. However, R coding skill is not required since all the analyses are done using graphical interfaces. Basic knowledge of microarray technology is required. Knowledge of statistics is not necessary prior to attending the course. The GeneChip^® EXON 1.0 ST Data Analysis Course will last three full days, from November the 19th to November the 21. The course is limited to 20 participants. Please complete the registration before *2nd November 2008* Further information and logistics at: http://www.bioinformatica.unito.it/exon.array.course.pdf Course web page: http://www.bioinformatica.unito.it/exon.array.course.html Course registration page: http://www6.unito.it/exonCourse/ The course is organized by B&Gu and supported by Affymetrix^® . All registration and course queries should be directed to the Course Secretariat: Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit Dipartimento di Scienze Cliniche e Biologiche c/o Az. Ospedaliera S. Luigi Regione Gonzole 10, Orbassano, 10043 Italy Tel. ++39 0116705417 Fax ++39 0119038639 Mobile ++39 3333827080 email: raffaele.calogero@unito.it www: www.bioinformatica.unito.it <http: www.bioinformatica.unito.it=""/> For more information on Affymetrix products and applications, innovations and events, click here <http: www.affymetrix.com="" index.affx=""> ---------------------------------------- Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit Dipartimento di Scienze Cliniche e Biologiche c/o Az. Ospedaliera S. Luigi Regione Gonzole 10, Orbassano 10043 Torino tel. ++39 0116705417 Lab. ++39 0116705408 Fax ++39 0119038639 Mobile ++39 3333827080 email: raffaele.calogero@unito.it raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it Info: http://publicationslist.org/raffaele.calogero [[alternative HTML version deleted]]

ADD REPLY • link 16.5 years ago rcaloger ▴ 500

Login before adding your answer.