Question

muscle memory usage out of control with many alignments

0

Entering edit mode

jean.peccoud • 0

@jeanpeccoud-8564

Last seen 9.7 years ago

When doing multiple muscle alignments (in a loop) in R (v. 3.0.3) under OS X 10.9.5 64 bits, the memory usage of R keeps increasing until the system has to stop the App.

This is muscle 3.8.31-4

I'm doing 100,000+ alignments of pairs of sequences (length 1000 bp or less) in a loop. I don't save the alignments, I just output each alignment in an object that I process.

The command goes like this (simplified)

for (i in 1:100000) {

aln = muscle(sequences, quiet =T)

#do stuff with aln

}

The "sequences" object is just a pair of dna sequences. Note that the "aln" object should be purged at every iteration. So I don't see any reason why memory usage goes out of control. I have 32 GB or ram in my Mac. After approx 70000 steps, the system could not compress memory any longer and forced R to stop. R was using 80 gb+ of virtual memory.

And it's not the stuff I do with the aln object that causes the issue.

Thanks for your help.

Jean

muscle • 3.1k views

ADD COMMENT • link 9.7 years ago jean.peccoud • 0

score 0 · Answer 1 · 2015-08-06

0

Entering edit mode

jean.peccoud • 0

@jeanpeccoud-8564

Last seen 9.7 years ago

Of note : processing speed decreases as memory usage increases to ridiculous levels ( > 20 gb for R).

That also happens when the output of muscle is written to a file instead of an object.

But I see that internally, the muscle() function writes and reads to file using read.table and write.table (even if muscle() returns and object), which are notorious for being extremely slow. The function should use writeChar or readChar, or at least readLines. Or better, there shouldn't be any disk I/O for that, since MUSCLE can read from stdin write to stdout. But I don't think the memory issue has anything to do with disk I/O.

Anyway, I skipped the muscle package and used system calls from R to do my alignments, which is not only faster, but also much less memory hungry. For one, I can capture the output of MUSCLE from std.out without having to read an alignment from disk. (However, I haven't found a way to align my R sequences without putting them in a file, any advice would be appreciated :-)

For my uses, the muscle package is not suited. :(

ADD COMMENT • link 9.7 years ago jean.peccoud • 0

1

Entering edit mode

Hi Jean,

All this is valuable feedback for Alex, the maintainer of the muscle package. Note that you seem to be using an old version of the package though. The package used to be on CRAN before it became a Bioconductor package. The current version of the package is 3.10.0 and is part of BioC 3.1, the current version of Bioconductor (requires R-3.2). Your version of muscle (3.8.31-4) indicates that you're using a version of muscle that you got from CRAN (it's probably archived by now). As Steve suggested, please always use the current version of Bioconductor and only report problems for this version (together with your sessionInfo).

Furthermore, as Steve mentioned, the msa package (another Bioconductor package for multiple sequence alignments), also includes the MUSCLE algo, in addition to the ClustalW and ClustalOmega algos. You could try this out. I know that the authors of msa have put some special efforts in avoiding memory leaks in some of the algorithms. Would be interesting to know if you get better results with it.

Finally let me mention that, if you're aligning pairs of sequences, you could also try the pairwiseAlignment() function from the Biostrings package. pairwiseAlignment() is vectorized on the pattern and subject so it can align your 100,000+ pairs in a single call. According to the timings I get on a modest laptop, this should not take more than 20 minutes and not use more than 1.5 Gb of RAM.

Cheers,

H.

ADD REPLY • link 9.7 years ago Hervé Pagès 16k

score 0 · Answer 2 · 2015-08-06

Although you mention which version of muscle you are using, it's not clear to me if this is the package version, or the version of muscle that is in the package, or what ...

I'm curious if you've tried using MUSCLE through the bioconductor msa package, or the bioconductor muscle package?

I'm guessing from your R version number that you are not, since both the msa and muscle packages were recent additions to bioconductor, and your R version would not be using the latest bioconductor version, by default.

You could try to upgrade your R and Bioc installation anyway to see if this problem persists ... it would be good to know for posterity, even though it seems you have worked around your immediate problem.

score 0 · Answer 3 · 2015-08-09

0

Entering edit mode

jean.peccoud • 0

@jeanpeccoud-8564

Last seen 9.7 years ago

Thanks for the answer. I think I have installed muscle from the bioconductor muscle package.

I'd like to upgrade R, but last I tried, I got issues. For instance the anova.glm() function was not found (!) even though the default stat package was loaded. I had to revert to a previous version.

I used the pairwiseAlignment function from biostring once. I thought it was slower than muscle, but I may try it again.

Jean

ADD COMMENT • link 9.7 years ago jean.peccoud • 0

1

Entering edit mode

anova.glm() is an example of an S3 'method' defined on the generic 'anova' -- invoke anova() with the first object an instance of class 'glm', and you'll end up with anova.glm. You can see other methods on anova with

> methods(anova)
[1] anova.glm*     anova.glmlist* anova.lm*      anova.lmlist*  anova.loess*  
[6] anova.mlm*     anova.nls*    
see '?methods' for accessing help and source code

The source code to the method does not have to be visible to the user, just registered with the generic; the * is indicating those methods that have been defined in this way. The source code is available, e.g., getAnywhere("anova.glm") or stats:::anova.glm. I notice that anova.glm was not asterisked in the R-3.0 release, but is in R-3.1 (and later), so in R-3.0 you would have seen anova.glm without having to do anything special. If there are other problems with anova.glm, then they'd be worthy of a separate post to this support site or to the R-devel mailing list.

ADD REPLY • link 9.7 years ago Martin Morgan 25k

0

Entering edit mode

Aaaah. Now I see the problem I had with anova.glm.

Thanks.

Jean

ADD REPLY • link 9.7 years ago jean.peccoud • 0

score 0 · Answer 4 · 2015-08-09

0

Entering edit mode

jean.peccoud • 0

@jeanpeccoud-8564

Last seen 9.7 years ago

Indeed the pairwiseAlignment() function does exactly what I want. I will use it.

Thanks for the help. Jean

ADD COMMENT • link 9.7 years ago jean.peccoud • 0