DNA motif sequence prediction - finding a method to compare with

Faheem Mitha ▴ 10

@faheem-mitha-5718

Last seen 10.4 years ago

Hi,

I've developed a method for motif sequence search, and I'm trying to find a method to compare it with, because reviewers like to see how your method compares with what is out there. However, I am having some difficulty finding such a method. To be clear, this is not a de novo motif discovery method, but it is related. So I'm asking the Bioconductor community for help. I'd like to know of methods implemented in software that I can use directly, either in Bioconductor or elsewhere. Here are more details about what I have done.

I've analyzed two [RSS](http://en.wikipedia.org/wiki/Recombination_signal_sequences) data sets, each of which is a collection of RSS sequences. The fasta files for these data sets are at [human 12 RSS](http://www.itb.cnr.it/rss/stats/HS12RSS.fasta) and [mouse 12 RSS](http://www.itb.cnr.it/rss/stats/MM12RSS.fasta).

The main purpose of the analysis is to predict whether sequences not in this family belong to the family. So I used a cross-validation method. I divided each data set into 5 parts, and used 4 of the 5 parts as a training set in turn. (The number 5 here is a bit arbitrary, but since I wanted to include the results per training set, I didn't want the number to be too large.) After fitting a model to the training set, I then used this model for prediction as follows.

The RSS in each data set are contained in gene segments, typically one or two RSS per gene segment. The gene segments are often much larger than the RSS. These are 12RSS, so each RSS is of length 28. I took all the gene segments I could find that contained an RSS, and selected from them all contiguous sequences of length 28. The current total number of these sequences is 449905 for one data set and 624400 for the other. The corresponding numbers of RSS are 118 and 201. Note that the sequences in these sets are not necessarily all distinct.

I then used the model derived from the training set to calculate p-values for all these approximately 500,000 sequences, omitting the RSS sequences that were in the training set. (I'm leaving out some details here, but I don't think it is important how exactly I calculated the values.)

Then I ranked the sequences in order of decreasing p-values. The hope was that the remaining RSS sequences would rank highly in this ranking, and in the event they did.

Now I'd like to find an algorithm, already implemented in software, that can perform a similar procedure on the same data in a reasonable amount of time, so I can compare the results. Please let me know if you know of any such thing, either in Bioconductor or in some other software package. Also, please CC me on any reply. Thanks.

Regards, Faheem Mitha
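The evaluation described in the post above can be sketched roughly as follows. This is only an illustration of the general procedure (all windows of length 28, a 5-way split, ranking held-out RSS among the candidates), not the poster's actual code. The names `windows`, `five_fold`, and `held_out_ranks` are made up for this sketch, and `score` stands in for whatever model is fitted on the training set, which the post does not specify.

```python
import random

WINDOW = 28  # length of a 12RSS, as stated in the post


def windows(segment, width=WINDOW):
    """Return every contiguous subsequence of the given width."""
    return [segment[i:i + width] for i in range(len(segment) - width + 1)]


def five_fold(items, k=5, seed=0):
    """Shuffle items and yield (training, held_out) pairs, one per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


def held_out_ranks(candidates, held_out, train, score):
    """Rank candidates by decreasing score; return 1-based ranks of held-out RSS.

    `score` is a hypothetical callable (sequence -> number) fitted on `train`.
    Candidates that also appear in the training set are dropped first.
    """
    train_set = set(train)
    pool = [c for c in candidates if c not in train_set]
    ranked = sorted(pool, key=score, reverse=True)
    position = {}
    for rank, seq in enumerate(ranked, start=1):
        position.setdefault(seq, rank)  # duplicates keep their best rank
    return [position[h] for h in held_out if h in position]
```

Any existing tool that assigns a score to a fixed-length sequence could be plugged in as `score`, which would make the resulting ranks directly comparable across methods.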

ADD COMMENT • link updated 12.0 years ago by Paul Shannon ▴ 750 • written 12.0 years ago by Faheem Mitha ▴ 10

Paul Shannon ▴ 750

@paul-shannon-5161

Last seen 10.4 years ago

Faheem,

Two places to look with Bioconductor, to get you acquainted with what we currently offer:

http://www.bioconductor.org/packages/release/BiocViews.html#___MotifDiscovery

http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/

- Paul

On Jan 17, 2013, at 5:32 AM, Faheem Mitha wrote:

> [...]
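For readers looking for a simple comparison point, one widely used baseline for scoring fixed-length candidates is a log-odds position weight matrix (PWM). The sketch below is a generic illustration and is not taken from any of the packages linked above. The function names, the pseudocount of 1, and the uniform background of 0.25 are assumptions made for the example, and it expects uppercase sequences over A, C, G, T only.

```python
import math

ALPHABET = "ACGT"


def fit_pwm(sequences, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length sequences.

    The pseudocount and uniform background are illustrative assumptions.
    """
    width = len(sequences[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in ALPHABET})
    return pwm


def pwm_score(pwm, seq):
    """Sum the per-position log-odds for one candidate sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))
```

Fitting the matrix on the training RSS and scoring every length-28 window would give a ranking that can be set beside the one from the method under review.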

ADD COMMENT • link 12.0 years ago Paul Shannon ▴ 750
