matching sRNA sequences with whole data

chawla ▴ 190

Hi,

I want to know a faster method of obtaining the frequency of only perfect matches between a data sequence set and a target sequence file. Both are sets of nucleotide sequences, but in large numbers. I tried:

    for (i in 1:100)
    #for (i in 1:nrow(urfreq))
    {
      pos1 <- which(glr4[,1] == urfreq[i,1])
      pos2 <- which(glr5[,1] == urfreq[i,1])
      pos3 <- which(glr6[,1] == urfreq[i,1])
      if (length(pos1) > 0)
      {
        urfreq[i,2] <- length(pos1)
      }
      if (length(pos2) > 0)
      {
        urfreq[i,3] <- length(pos2)
      }
      if (length(pos3) > 0)
      {
        urfreq[i,4] <- length(pos3)
      }
    }

Since the target data file is huge, this piece of code takes 22 min for only 100 sequences, while I need to find the frequency of over 3 million sequences in the data of the three samples (glr4, glr5 and glr6). Is there any package/function for such matching?

Thanks,
Konika

updated 13.4 years ago by Valerie Obenchain ★ 6.8k • written 13.4 years ago by chawla ▴ 190

Valerie Obenchain ★ 6.8k

Hi Konika,

The "Biostrings BSgenome Overview" link on this page is a great summary of string matching:

http://bioconductor.org/help/course-materials/2011/BioC2011/

Specifically, I think the vmatchPattern() and matchPDict() functions will be most helpful to you.

Valerie

ADD COMMENT • link 13.4 years ago Valerie Obenchain ★ 6.8k
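
The `==` comparison in the question tests whole-string equality, so if a "perfect match" means that a read is identical to the query over its full length, the counting can possibly be done without any loop. The following is a minimal base R sketch of that idea, not the Biostrings approach suggested above (vmatchPattern() and matchPDict() search for patterns inside longer subject sequences). The toy data frames and the helper name `count_exact` are hypothetical stand-ins, and it is assumed that column 1 of each object holds the sequences as character strings.

```r
# Toy stand-ins for the objects in the question (hypothetical values).
urfreq <- data.frame(seq = c("ACGT", "TTGA", "GGCC"),
                     glr4 = 0L, glr5 = 0L, glr6 = 0L,
                     stringsAsFactors = FALSE)
glr4 <- data.frame(seq = c("ACGT", "ACGT", "GGCC", "AAAA"),
                   stringsAsFactors = FALSE)
glr5 <- data.frame(seq = c("TTGA", "CCCC"), stringsAsFactors = FALSE)
glr6 <- data.frame(seq = c("GGCC", "GGCC", "GGCC"), stringsAsFactors = FALSE)

# Hypothetical helper: tabulate the reads of one sample once, then look up
# every query in that table by name. Queries that never occur get 0.
count_exact <- function(queries, reads) {
  tab <- table(reads)                        # one pass over the sample
  counts <- as.integer(tab[match(queries, names(tab))])
  counts[is.na(counts)] <- 0L                # query absent from the sample
  counts
}

# Fill the three count columns in one vectorised call each.
urfreq$glr4 <- count_exact(urfreq$seq, glr4[, 1])
urfreq$glr5 <- count_exact(urfreq$seq, glr5[, 1])
urfreq$glr6 <- count_exact(urfreq$seq, glr6[, 1])

print(urfreq)
```

Each sample is scanned once by `table()` instead of once per query, which is where the loop in the question spends its time. Unlike that loop, this sketch writes an explicit 0 for queries with no match rather than leaving the cell unchanged.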
