DESeq2 machine sizing best practices for a very large data set
aedavids ▴ 20
@aa611017
Last seen 2.7 years ago
United States

I want to perform differential expression analysis on a data set containing 17,000 samples. The Salmon quant.sf files total about 1.5 TB.

Based on my naive understanding of R and R packages, I believe I will need to run on a single very large machine; that is to say, I cannot take advantage of a cluster of machines.

I read the section in the vignette on 'Using parallelization'.

Is there a rule of thumb for machine sizing?

I plan to run my analysis in either AWS or GCP so I should be able to access a very large machine.

Can you recommend a Docker image?

Any suggestions for how much SSD, memory, swap, CPU, ... I should use and what the run time is likely to be?

Should I consider porting a bare-bones version to something like Apache Spark so I can throw a lot of machines at the problem?

Kind regards

Andy


Running DESeq with 1000 samples

In my view, memory is the only limiting factor here; with limma-voom things are single-threaded, so CPU count matters little. You can estimate the requirement by running dummy matrices of increasing size (for example 10, 20, ..., 100, 200, ..., 500 samples) through limma-voom on a local machine and recording peak memory use. That should give you an estimate for the full data set. A Docker image might make sense, since you then do not need to install anything; alternatively, simply install the required packages with conda. It is just a DE analysis after all: as long as you have enough memory you will be fine, and once you have the results table you can go back to any standard laptop for downstream analysis.
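As a rough illustration of the subset idea, the sketch below fits a straight line through peak-memory readings from a few small runs and extrapolates to the full 17,000 samples. The readings in `measurements` are hypothetical placeholders, not real benchmarks, and the assumption that memory grows linearly with sample count is just that, an assumption; the helper names `fit_line` and `predict` are made up for this example.

```python
# Extrapolate peak memory from small benchmark runs using an ordinary
# least-squares line: memory ~ intercept + slope * n_samples.
# ASSUMPTION: memory grows roughly linearly with the number of samples.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(intercept, slope, n_samples):
    """Predicted peak memory for a run with n_samples samples."""
    return intercept + slope * n_samples


# HYPOTHETICAL readings: (samples in subset, peak memory in GB).
# Replace these with the numbers you actually record.
measurements = [(100, 1.2), (200, 2.1), (500, 4.9)]

sizes = [m[0] for m in measurements]
peak_gb = [m[1] for m in measurements]
intercept, slope = fit_line(sizes, peak_gb)

estimate_gb = predict(intercept, slope, 17_000)
print(f"Estimated peak memory at 17,000 samples: {estimate_gb:.1f} GB")
```

If the readings curve upward rather than falling on a line, the linear fit will underestimate, so it is worth adding a generous safety margin or benchmarking a larger subset before committing to a machine size.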


Hi Michael

These are bulk samples.

I like your idea of running a couple of subsets to get a rough idea about the required resources. I am not sure why you would want to use limma-voom to do the resource estimation instead of using DESeq2?

thanks

Andy


I often recommend and use limma-voom for large bulk datasets; it is much faster than GLM-based methods.


Any recommendations for normalization, variance stabilization, or other additional steps?


What exactly is the problem? Both normalization and vst scale well with larger datasets. Should take a minute or less even for hundreds of samples.


I was asking in terms of limma-voom. But when I actually read about what limma-voom does, I found that it applies a different kind of transformation to make the data approximately normal. Terrible idea for small count data sets, but probably decent for large ones. I have on the order of 10^4 genes and 10^4 samples. I may just give DESeq2 a shot anyway, as I can spin up a pretty hefty machine on GCP.
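For a sense of scale, here is a back-of-the-envelope calculation of what one dense genes-by-samples matrix of that size costs in memory. It assumes 8 bytes per value (an R double; integer counts take 4). The `n_matrices` figure is a hypothetical multiplier standing in for the several same-shaped matrices and temporary copies an analysis may hold at once, not a documented number for any package.

```python
# Memory footprint of dense genes x samples matrices.
# ASSUMPTIONS: 8 bytes per value (double precision), no sparsity,
# and a made-up multiplier for how many such matrices are alive at once.

def dense_matrix_bytes(n_genes, n_samples, bytes_per_value=8):
    """Bytes needed to store one dense n_genes x n_samples matrix."""
    return n_genes * n_samples * bytes_per_value


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


n_genes = 10_000     # order of magnitude quoted in the post above
n_samples = 10_000   # order of magnitude quoted in the post above
n_matrices = 5       # HYPOTHETICAL: matrices/copies held simultaneously

one_matrix = dense_matrix_bytes(n_genes, n_samples)
print(f"One matrix: {to_gib(one_matrix):.2f} GiB")
print(f"{n_matrices} matrices: {to_gib(n_matrices * one_matrix):.2f} GiB")
```

The point of the arithmetic is only that the matrices themselves are modest at this size; the real peak depends on how many intermediate copies the chosen method creates, which is what benchmarking on subsets would reveal.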

@mikelove
Last seen 3 days ago
United States

Is this bulk or single cell?
