advice on Biostrings


Rafael A. Irizarry ★ 2.3k

@rafael-a-irizarry-205

Last seen 10.2 years ago

Hi, I'm using Biostrings to count base content as well as the content of pairs of bases. I'm using the following snippet of code:

    ### pmseq is a vector of character strings (not all of the same nchar)
    tmp <- sapply(pmseq, function(x) {
      y = DNAString(x)
      c(alphabetFrequency(y)[2:5],  ## count A, T, G, C
        length(matchDNAPattern("GC", y)) + length(matchDNAPattern("CG", y)))  ## count GC or CG
    })

It is painfully slow. strsplit and grep were much faster for the first part (counting bases), but using grep for the second part was not straightforward.

Any suggestions?

-r

Biostrings Biostrings • 1.8k views

Updated 18.8 years ago by Martin Morgan 25k • written 18.8 years ago by Rafael A. Irizarry ★ 2.3k


Hervé Pagès 16k

@herve-pages-1542

Last seen 16 hours ago

Seattle, WA, United States

Hi Rafael,

Comparing the speed of

> grep("GC", y, fixed=TRUE)

with the speed of

> matchDNAPattern("GC", DNAString(y))

is a little unfair, since the former stops searching as soon as it finds the first occurrence of "GC" (which is likely to happen very early since nchar("GC") is only 2, probably within the first 10 or 20 letters of a random TGCA string), while the latter processes the entire string in order to count the total number of occurrences of "GC". So yes, grep is faster and is all you need as long as you are not interested in counting the matches (or retrieving their offsets). On my system:

> library(Biostrings)
> y <- scan(file="bigrandomTGCA.txt", what="")
Read 1 item
> nchar(y)
[1] 10000000
> system.time(grep("GC", y, fixed=TRUE))
[1] 0.01 0.00 0.00 0.00 0.00
> dy <- DNAString(y)
> system.time(length(matchDNAPattern("GC", dy)))
[1] 0.07 0.01 0.08 0.00 0.00

Now, if you need to count the number of matches, using length(strsplit()) might be faster than matchDNAPattern() on small strings (nchar < 5000), but it will definitely be __much__ slower on big strings:

> nchar(y2)
[1] 18314
> system.time(length(strsplit(y2, "A", fixed=TRUE)[[1]]))
[1] 0.08 0.00 0.09 0.00 0.00
> dy2 = DNAString(y2)
> system.time(length(matchDNAPattern("A", dy2)))
[1] 0.02 0.00 0.01 0.00 0.00

Don't even try strsplit() on a 10-million-character string: it will take forever and you won't be able to interrupt it with CTRL-C...

Regards,
H.
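To make the difference between testing for a match and counting matches concrete, here is a minimal base-R sketch. It is not from the original post: the helper name `count_fixed` and the toy sequences are invented for illustration. Note that `gregexpr()` with `fixed = TRUE` reports non-overlapping matches, which is sufficient for "GC" because that pattern cannot overlap itself.

```r
# Toy sequences, invented for illustration.
x <- c("GCGC", "AATT", "CGCGC")

# grepl() only answers "is there at least one match?" for each string.
has_gc <- grepl("GC", x, fixed = TRUE)

# Hypothetical helper: count all non-overlapping fixed matches per string.
count_fixed <- function(pattern, x) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so count only positive positions.
  vapply(hits, function(h) sum(h > 0L), integer(1))
}

n_gc <- count_fixed("GC", x)

print(has_gc)
print(n_gc)
```

The presence test can stop early, whereas the count has to scan each string to the end, so timing one against the other compares different amounts of work.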

Written 18.8 years ago by Hervé Pagès 16k


Martin Morgan 25k

@martin-morgan-1513

Last seen 4 months ago

United States

Actually, this function (matchDNAPattern) in Biostrings uses one of those fiendishly clever algorithms coded in C, and as a consequence is *very* fast: on my puny Windows laptop, searching a random sequence of 10 million bp for "GC" returns more than half a million matches in about a tenth of a second:

> seq <- paste( dna[ runif( 10000000, min=1, max=5 )], collapse="" )
> system.time( res <- length(matchDNAPattern( "GC", seq ))); res
[1] 0.13 0.00 0.12 NA NA
[1] 626393

Maybe your speed issues are related to R, or to memory management, or 'user error' ;) ? Here are some suggestions:

* If the match pattern "GC" is a character string, it is converted to a DNAString by matchDNAPattern. Short-circuit this step with

GC <- DNAString("GC")

outside the sapply, and then

matchDNAPattern( GC, y )

This seems to actually make quite a bit of difference:

> seqs <- lapply( 1:1000, function(i) paste( dna[ runif( 100, min=1, max=5 )], collapse="" ))
> system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( "GC", s ))))
[1] 27.52 0.01 27.61 NA NA
> GC <- DNAString("GC")
> system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( GC, s ))))
[1] 16.53 0.00 16.56 NA NA

* Make sure your function is actually doing what you want. My mistake in exploring this was to write

> seq <- paste( dna[ runif( 10000000, min=1, max=5 )])

(i.e., without the 'collapse' argument). This (tried to) create a vector of 10 million single characters. The sapply would then have looped over these, rather than over the single string I intended!

* Do the calculations in parallel:

> library(snow)
> cl <- makeCluster(4, "MPI")
> clusterEvalQ(cl, library(Biostrings))
> system.time(res <- parSapply( cl, seqs, function(s) length(matchDNAPattern( GC, s ))))
[1] 0.01 0.00 4.56 0.00 0.00

!

* Perhaps your R is spending a lot of time swapping virtual memory to disk. Try garbage collection

gc()

before performing the analysis. If this doesn't help, and it seems like memory really is an issue, then perhaps reading the strings in a chunk at a time would be necessary. You might also try separating the sapply into three separate calls, one for each function. I doubt this is really a problem, unless you're dealing with *very* large collections of sequences.

* I found that analyzing some number of bp in a few strings is much faster than analyzing the same total number of bp in many strings, i.e., the bottleneck is actually in sapply. There are apparently tricks for speeding up sapply, though my exploration in the present context didn't provide any dramatic results.

Hope these suggestions help...

Martin
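The 'collapse' pitfall described in this answer is easy to reproduce on a small scale. The following is a sketch under stated assumptions rather than a replay of the session above: the definition of `dna` is a guess (the post never shows it), `sample()` is swapped in for the `runif()` indexing, and the sequence length is kept tiny.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Assumed definition; the original post does not show how 'dna' was created.
dna <- c("A", "C", "G", "T")

# Draw 20 random bases (sample() replaces the runif() indexing used above).
bases <- sample(dna, 20, replace = TRUE)

# Without 'collapse', paste() returns one element per base.
as_vector <- paste(bases)

# With collapse = "", paste() joins everything into a single string.
as_string <- paste(bases, collapse = "")

print(length(as_vector))  # number of elements sapply() would loop over
print(length(as_string))  # a single sequence
print(nchar(as_string))   # its length in bases
```

Checking `length()` and `nchar()` on the input before timing anything is a cheap way to confirm that a loop runs once per sequence and not once per base.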

Written 18.8 years ago by Martin Morgan 25k


Thanks, everybody. This made it much faster:

    tmpseq <- DNAString(pmseq)
    GC <- DNAString("GC")
    CG <- DNAString("CG")
    tmp <- sapply(1:length(tmpseq), function(i) {
      y = tmpseq[i]
      c(alphabetFrequency(y)[2:5],
        length(matchDNAPattern(GC, y)),
        length(matchDNAPattern(CG, y)))
    })
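For readers on a current Biostrings release, the per-sequence loop can probably be dropped altogether. The sketch below is not from the thread. It assumes that `DNAStringSet()`, `alphabetFrequency()` and `dinucleotideFrequency()` are available in the installed version (`matchDNAPattern()` as used above may not exist in recent versions), and `pmseq` here is a made-up stand-in for the poster's data. The code is skipped if Biostrings is not installed.

```r
# Made-up stand-in for the poster's vector of probe sequences.
pmseq <- c("GCGCAT", "AATT", "CGCGC")

if (requireNamespace("Biostrings", quietly = TRUE)) {
  seqs <- Biostrings::DNAStringSet(pmseq)

  # One row per sequence; keep only the four base columns.
  base_counts <- Biostrings::alphabetFrequency(seqs, baseOnly = TRUE)[
    , c("A", "C", "G", "T")
  ]

  # Overlapping dinucleotide counts: one row per sequence, one column per pair.
  dinuc <- Biostrings::dinucleotideFrequency(seqs)
  gc_or_cg <- dinuc[, "GC"] + dinuc[, "CG"]

  print(cbind(base_counts, GC_or_CG = gc_or_cg))
} else {
  message("Biostrings is not installed; skipping the sketch.")
}
```

Selecting columns by name avoids depending on the position of each base in the frequency vector, which is what the `[2:5]` index in the original code relies on.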

Reply written 18.8 years ago by Rafael A. Irizarry ★ 2.3k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 3 months ago

EMBL European Molecular Biology Laborat…

Hi Rafa,

To count symbols in character vectors, matchprobes::basecontent is fast:

library(matchprobes)
v = c("AAACT", "GGGTT", "ggAtT")
bc = basecontent(v)
print.default(bc)
bc[,"C"]+bc[,"G"]

And if there is interest, I'd be happy to amend the C code to also count pairs of bases (or you could; it is not terribly complicated).

Cheers
Wolfgang
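If matchprobes is not at hand, a similar table can be approximated in base R without looping over the sequences. This is only a sketch, not the implementation behind `basecontent()`: the helper `count_by_removal` is a hypothetical name, folding case with `toupper()` is an assumption suggested by the mixed-case example above, and the pair count is exact only for patterns such as "GC" and "CG" that cannot overlap themselves.

```r
v <- c("AAACT", "GGGTT", "ggAtT")

# Hypothetical helper: count a fixed pattern by measuring how much shorter
# each string becomes once every occurrence of the pattern is deleted.
count_by_removal <- function(pattern, x) {
  x <- toupper(x)  # assumption: treat lower-case bases like upper-case ones
  (nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))) %/% nchar(pattern)
}

# One column per base, one row per sequence.
bc <- sapply(c("A", "C", "G", "T"), count_by_removal, x = v)
gc_content <- bc[, "C"] + bc[, "G"]

# The same helper extends to pairs of bases.
pairs <- count_by_removal("GC", "GCGCAT") + count_by_removal("CG", "GCGCAT")

print(bc)
print(gc_content)
print(pairs)
```

Because `gsub()` and `nchar()` are vectorized over the whole character vector, the only R-level loop is the one over the four bases.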

Written 18.8 years ago by Wolfgang Huber ★ 13k
