DESeq2 rlog function takes too long
bharata1803 ▴ 60
@bharata1803-7698
Last seen 5.8 years ago
Japan

Hello,

I have a fairly large read count matrix from TCGA: 577 samples and 18,522 genes. Running DESeq2 to calculate log fold changes did not take that long, around 3-4 hours. After that, I wanted to use the rlog function to get log-transformed gene expression, but it ran for almost 24 hours without finishing. I cancelled it because I assumed something had gone wrong. I have an Intel® Core™ i7 CPU 975 @ 3.33GHz × 8 with 24 GB of RAM. I know that R cannot use multiple cores to run DESeq2. Is there any suggestion for how to optimize this process?

@mikelove
Last seen 1 day ago
United States

In the vignette and the workflow, I suggest using the VST instead of the rlog when there are hundreds of samples:

Note on running time: if you have many samples (e.g. 100s), the rlog function might take too long, and the variance stabilizing transformation might be a better choice. The rlog and VST have similar properties, but the rlog requires fitting a shrinkage term for each sample and each gene which takes time.

EDIT (Oct 2017): the code snippet below is no longer necessary, as the speedup is implemented in the function vst(), since DESeq2 version 1.12.
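For readers on DESeq2 1.12 or later, a minimal sketch of calling vst() might look like the following. The data are simulated with makeExampleDESeqDataSet() purely for illustration (the gene and sample counts are arbitrary assumptions, not the TCGA matrix from the question), and blind = TRUE is just one possible choice.

```r
# Sketch only: simulated counts stand in for the real TCGA matrix.
library(DESeq2)

set.seed(1)
# makeExampleDESeqDataSet() ships with DESeq2; n = genes, m = samples.
# The sizes here are arbitrary and much smaller than 18,522 x 577.
dds <- makeExampleDESeqDataSet(n = 5000, m = 40)

# vst() fits the dispersion trend on a subset of genes (nsub, default 1000)
# and then transforms every gene, which is what keeps it fast.
vsd <- vst(dds, blind = TRUE)

# The transformed values are stored in the assay slot of the result.
vst_mat <- assay(vsd)
dim(vst_mat)
```

With a real dataset, `dds` would instead come from something like DESeqDataSetFromMatrix() applied to the count matrix and sample table.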

In addition to this suggestion, here is a snippet of code to speed up the VST even more.

I keep planning to add this to DESeq2 as a proper function, but haven't done so yet.

 


Thank you for your code. I will try that. 

@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

You probably already know this, but the rpkm() and cpm() functions in the edgeR package compute log-transformed gene expression very quickly when called with log=TRUE. These compute a simple but effective regularized log transformation.
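As a rough illustration of the edgeR route, the sketch below computes log2 counts per million on a simulated matrix. The count matrix, its dimensions, and the object names are assumptions made up for the example; prior.count = 2 is spelled out explicitly only to make the regularization visible.

```r
# Sketch only: simulated negative binomial counts, not real data.
library(edgeR)

set.seed(1)
counts <- matrix(
  rnbinom(2000 * 20, mu = 100, size = 2),
  nrow = 2000,
  dimnames = list(paste0("gene", 1:2000), paste0("sample", 1:20))
)

y <- DGEList(counts = counts)
y <- calcNormFactors(y)  # TMM normalization factors for the library sizes

# log = TRUE returns log2-CPM; prior.count damps the log of small counts.
logcpm <- cpm(y, log = TRUE, prior.count = 2)
dim(logcpm)
```

rpkm() works the same way but additionally needs a gene.length argument.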


Thanks for the suggestion. This is the first time I have worked with this much data. I will try your suggestion.


For the purpose of leaving breadcrumbs, a similar function in DESeq2 is normTransform, which divides out library size factors, adds a pseudocount, and log2 transforms. This was added when plotPCA was added to BiocGenerics, so that DESeq2::plotPCA could easily be run on a matrix of log normalized counts, for comparing various transformation options.
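A small sketch of normTransform on simulated data follows, together with the equivalent manual calculation. The dataset and variable names are illustrative assumptions; size factors are estimated explicitly first so that the manual version uses the same ones.

```r
# Sketch only: simulated data from DESeq2's own example generator.
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 20)
dds <- estimateSizeFactors(dds)

# Defaults are f = log2 and pc = 1 (the pseudocount).
ntd <- normTransform(dds)

# The same transformation written out by hand.
manual <- log2(counts(dds, normalized = TRUE) + 1)
all.equal(unname(assay(ntd)), unname(manual))

# The result can be passed to plotPCA; returnData = TRUE gives the
# coordinates as a data frame instead of drawing the plot.
pca_df <- plotPCA(ntd, intgroup = "condition", returnData = TRUE)
head(pca_df)
```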

Joseph Bundy ▴ 20
@joseph-bundy-9123
Last seen 5.6 years ago
United States

Hi there,

I've been encountering similar problems with long wait times on certain R functions (especially those in DEXSeq and WGCNA), and I have only 60 samples. If waiting around on R is a problem you face often, you might give the Intel MKL libraries a look, discussed here: http://brettklamer.com/diversions/statistical/faster-blas-in-r/  They speed up certain calculations and allow some calculations in R to use multiple cores.

The easiest way to get the libraries is to simply download Revolution R (which is free, and automatically recognized by RStudio):
https://mran.revolutionanalytics.com/download/#download

I gave it a try at my PI's suggestion, and it has cut down some of the analysis times considerably. Just make sure you install both Revolution R AND the MKL library. Just to be clear, as I realize I sound a bit like a salesman, I am not an employee of Revolution Analytics. I just downloaded and used their library because it was advertised as doing mathematical calculations more efficiently and enabling multi-threaded calculations (which I have confirmed by watching the task manager).

Unfortunately, the MKL libraries aren't going to help you with your memory (RAM) management, which I suspect is why you're getting an error when doing the rlog transformation. Could you give more information about the error? If you already have one 577 by 18,522 cell matrix in the R workspace, I can't imagine that you have much room for another one. Monitor your memory usage in the task manager next time you try the transformation and see if it's at capacity.

If it is indeed at capacity, you can try to better manage which objects you keep in the R environment with the rm() and gc() functions. rm() removes an object, which you specify by name as an argument, from the R environment, and gc() prompts R to release unused memory for subsequent calculations. You might also go through your code and make sure that you're not generating too many redundant objects to begin with (if you're like me, you have a lot of them). My current Windows installation has 128 GB of RAM, and even with all that I've still had to remove certain objects to make room for others (which is admittedly mostly due to my sloppy programming and not the system's fault).
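To make the rm() and gc() suggestion concrete, here is a minimal base R sketch. The object name big_counts and its size are hypothetical stand-ins for a large intermediate object; the size estimate assumes the matrix is stored as doubles at 8 bytes per value (an integer matrix would need half that).

```r
# Stand-in for a large intermediate object (hypothetical name and size).
big_counts <- matrix(0, nrow = 2000, ncol = 100)

# Report how much memory a single object occupies.
print(object.size(big_counts), units = "MB")

# Rough size in MiB of an 18,522 x 577 matrix of doubles, to compare
# against the RAM reported by the task manager.
est_mib <- 18522 * 577 * 8 / 1024^2
est_mib

# Drop the object, then run garbage collection so the memory can be reused.
rm(big_counts)
invisible(gc())
exists("big_counts")
```

Calling ls() before and after is another quick way to see which objects are still held in the workspace.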

If you still don't have the RAM to run your analysis, I'd recommend simply installing more if your board will support it.  
 
