Question

DESeq2 machine sizing best practices for a very large data set

1

aedavids ▴ 20

@aa611017

Last seen 2.7 years ago

United States

I want to perform differential expression analysis on a data set containing 17,000 samples. The Salmon quant.sf files total about 1.5 TB.

Based on my naive understanding of R and R packages, I believe I will need to run on a single very large machine; that is to say, I cannot take advantage of a cluster of machines.

I read the section in the vignette on 'Using parallelization'.

Is there a rule of thumb for machine sizing?

I plan to run my analysis in either AWS or GCP so I should be able to access a very large machine.

Can you recommend a Docker image?

Any suggestions for how much SSD, memory, swap, CPU, ... I should use and what the run time is likely to be?

Should I consider porting a bare-bones version to something like Apache Spark so I can throw a lot of machines at the problem?

Kind regards

Andy

DESeq2 • 1.9k views

ADD COMMENT • link updated 3.0 years ago by ariel ▴ 20 • written 3.2 years ago by aedavids ▴ 20

0

Running DESeq with 1000 samples

If you ask me, memory is the only limiting factor here; with limma-voom, things are single-threaded. You can estimate the requirement by running dummy matrices with increasing sample counts (10, 20, ... 100, 200, ... 500, ...) on a local machine using limma-voom and collecting memory statistics. That should give you an estimate. A Docker image might make sense because you would not need to install anything; alternatively, simply install the required packages with conda. It is just a DE analysis after all: as long as you have enough memory you will be fine, and once you have the results table you can go back to any standard laptop for downstream analysis.
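As a rough sketch of how such pilot measurements could be turned into a sizing estimate, the Python snippet below fits a straight line through (sample count, peak memory) pairs and extrapolates to the full data set. It assumes memory grows roughly linearly with the number of samples at a fixed gene count, which you should check against your own measurements. The function names are hypothetical, and the pilot numbers are synthetic placeholders, not real benchmark results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sample counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def extrapolate_peak_memory_gb(sample_counts, peak_memory_gb, target_samples):
    """Extrapolate peak memory (GB) to target_samples, assuming linear growth."""
    intercept, slope = fit_line(sample_counts, peak_memory_gb)
    return intercept + slope * target_samples


# SYNTHETIC placeholder values for illustration only, NOT measurements.
# Replace them with the peak memory you observe in your own pilot runs.
pilot_samples = [100, 200, 500]
pilot_peak_gb = [1.0, 1.5, 3.0]

estimate = extrapolate_peak_memory_gb(pilot_samples, pilot_peak_gb, 17000)
print(f"Extrapolated peak memory at 17,000 samples: {estimate:.1f} GB")
```

If the pilot points clearly bend away from a straight line, a linear extrapolation will mislead, so it is worth adding a few larger pilot runs before committing to a machine size.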

ADD REPLY • link 3.2 years ago ATpoint ★ 4.6k

0

Hi Michael

These are bulk samples.

I like your idea of running a couple of subsets to get a rough idea of the required resources. Why would you use limma-voom for the resource estimation instead of DESeq2?

Thanks

Andy

ADD REPLY • link 3.2 years ago aedavids ▴ 20

0

DESeq2 with many samples

ADD REPLY • link 3.2 years ago ATpoint ★ 4.6k

3

I often recommend and use limma-voom for large bulk datasets; it is much faster than GLM-based methods.

ADD REPLY • link 3.2 years ago Michael Love 43k

0

Any recommendations for normalization, variance stabilization, or other additional steps?

ADD REPLY • link 3.0 years ago ariel ▴ 20

0

What exactly is the problem? Both normalization and vst scale well with larger datasets. They should take a minute or less even for hundreds of samples.

ADD REPLY • link 3.0 years ago ATpoint ★ 4.6k

0

I was asking in terms of limma-voom. But when I decided to actually read about what limma-voom does, I found out that it applies a different kind of transformation to make the data approximately normal. Terrible idea for small data sets of count data, but probably decent for large ones. I have on the order of 10^4 genes and 10^4 samples. I may just give DESeq2 a shot anyway, as I can spin up a pretty hefty machine on GCP.
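For a sense of scale, here is a back-of-the-envelope Python sketch of how much memory one dense matrix of that size occupies. It assumes 4 bytes per value for an integer count matrix and 8 bytes per value for a double-precision matrix, which are R's storage sizes for those types. The helper name is hypothetical, and this covers a single matrix only: an actual analysis holds several matrices of this shape at once, so peak usage will be some multiple of these figures.

```python
def dense_matrix_gib(n_genes, n_samples, bytes_per_value):
    """Size in GiB of one dense n_genes x n_samples matrix."""
    return n_genes * n_samples * bytes_per_value / 2**30


n_genes = 10**4    # order of magnitude quoted in the post
n_samples = 10**4  # order of magnitude quoted in the post

int_gib = dense_matrix_gib(n_genes, n_samples, 4)  # integer counts
dbl_gib = dense_matrix_gib(n_genes, n_samples, 8)  # double-precision values

print(f"integer matrix: {int_gib:.2f} GiB")
print(f"double matrix:  {dbl_gib:.2f} GiB")
```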

ADD REPLY • link 3.0 years ago ariel ▴ 20

score 0 · Answer 1 · 2021-09-22

0

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Is this bulk or single cell?

ADD COMMENT • link 3.2 years ago Michael Love 43k