DESeq2 for big data sets
ajajoo
@ajajoo-21737

I used DESeq2 for around 2,500 subjects with parallel = TRUE, which used about 40 processors. It took about 1.5 weeks to get through estimating dispersions, fitting the model, testing, and so on, all the way to the final dispersion estimates. It also displayed a message about replacing outliers for ** genes. After that, however, the program kept running with only a single R process (whereas the other steps would open 40 R processes) for another week or so; at that point R was occupying 27 GB of RAM. Any idea why that stage takes so long and what is happening there? Unfortunately there was a power shutdown, so the run stopped. Before running it again, I would like to know what I can do to maybe make that stage run faster.
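For context, a minimal sketch of the kind of parallel setup described above (the DESeqDataSet dds, the design, and the worker count are assumptions based on the post, not code taken from it):

    library(DESeq2)
    library(BiocParallel)

    # register a multicore backend with 40 workers, matching the post
    register(MulticoreParam(workers = 40))

    # dds is assumed to be a DESeqDataSet built earlier, e.g. via
    # DESeqDataSetFromMatrix(countData, colData, design = ~ condition)
    dds <- DESeq(dds, parallel = TRUE)
    res <- results(dds, parallel = TRUE)

With a backend registered this way, DESeq() and results() split the per-gene work across the registered workers.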

deseq2

@mikelove

As I've mentioned before on the support site, I myself use limma-voom for hundreds of samples. The negative binomial GLM is an expensive operation and requires convergence for every row.
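As a rough illustration of that route, a minimal limma-voom sketch (the count matrix counts, the sample table coldata, and the ~ condition design are placeholders, not objects from this thread):

    library(edgeR)   # DGEList, calcNormFactors
    library(limma)   # voom, lmFit, eBayes, topTable

    # counts: genes x samples matrix; coldata: one row of covariates per sample
    dge <- DGEList(counts = counts)
    dge <- calcNormFactors(dge)

    design <- model.matrix(~ condition, data = coldata)

    # voom estimates the mean-variance trend and returns precision weights,
    # so each gene is then fit with a fast weighted linear model rather than a GLM
    v   <- voom(dge, design)
    fit <- lmFit(v, design)
    fit <- eBayes(fit)
    topTable(fit, coef = ncol(design), number = 10)  # last design column = condition effect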

But another factor, maybe less relevant for you than for others in the hundreds-of-samples regime, is that Constantin Ahlmann-Eltze has improved the efficiency of DESeq2 on large sample sizes roughly 10-fold in the development branch (you can already access it on GitHub), which will be released in October 2019.
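If you want to try the development branch before the release, installing from GitHub would look roughly like this (the mikelove/DESeq2 repository path is an assumption, and a devel package generally expects a current R/Bioconductor setup):

    # install BiocManager if needed, then pull the development version from GitHub
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("mikelove/DESeq2")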

Thank you for replying.
