I've been using DESeq2 to find differentially expressed genes in an RNA-seq dataset. There are about 700 case and 300 control samples.
The design formula is "~ CONDITION + age + sex". DESeq2 takes a really long time to finish: about 16 h with 8 cores and 32 GB of memory.
Here is the code I am running:
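library(DESeq2)
library(BiocParallel)

# DESeq2 expects integer counts, hence the rounding
dds <- DESeqDataSetFromMatrix(countData = round(cts),
                              colData = covarianceTable,
                              design = ~ CONDITION + age + sex)
# Register 8 worker processes for the parallel steps below
register(MulticoreParam(8))
## This step takes a long time to run
dds_subset <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(8))
# List the coefficient names that can be passed to lfcShrink
resultsNames(dds_subset)
resLFC <- lfcShrink(dds_subset, coef = "CONDITION_Case_vs_Control", type = "apeglm",
                    parallel = TRUE, BPPARAM = MulticoreParam(8))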
The lfcShrink function also takes a long time.
So my questions are:
Is this the expected running time for DESeq2 on an expression matrix of 58,000 genes x 1,000 samples?
Is there any way to reduce the time needed to run DESeq2 on my dataset?
(I have tried using only protein-coding genes (~20,000 genes). It still takes a long time, about 12 h.)
I would really appreciate any suggestions. Many thanks in advance!
Best,
Ruifeng
The benefits of DESeq2 mainly kick in when sample size (and therefore per-gene information) is limited. With 1000 samples I would simply use limma-voom. For reference, see e.g. these threads:
DESeq2 with many samples
Running DESeq with 1000 samples
DESeq2 taking long time to run with 270 samples in 10 groups.
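For illustration, here is a minimal sketch of what a limma-voom run with the same design could look like. It is not taken from the threads above. The toy `cts` and `covarianceTable` objects are simulated stand-ins for the ones in the question, and the coefficient name "CONDITIONCase" assumes CONDITION is a factor with Control as the reference level, so check colnames(design) on real data.

```r
library(edgeR)   # DGEList, filterByExpr, calcNormFactors
library(limma)   # voom, lmFit, eBayes, topTable

# Toy stand-ins for the question's objects (simulated, hypothetical data)
set.seed(1)
n_genes <- 500
n_samples <- 40
cts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))
covarianceTable <- data.frame(
  CONDITION = factor(rep(c("Control", "Case"), each = n_samples / 2),
                     levels = c("Control", "Case")),
  age = round(runif(n_samples, 30, 80)),
  sex = factor(sample(c("F", "M"), n_samples, replace = TRUE)),
  row.names = colnames(cts)
)

# Same design as in the DESeq2 call
design <- model.matrix(~ CONDITION + age + sex, data = covarianceTable)

dge <- DGEList(counts = round(cts))

# Drop genes with too few counts to be informative
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# TMM normalization factors, then voom precision weights
dge <- calcNormFactors(dge)
v <- voom(dge, design)

# Per-gene linear models with empirical Bayes moderation
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Assumed coefficient name; verify with colnames(design)
res <- topTable(fit, coef = "CONDITIONCase", number = Inf)
head(res)
```

On real data, replace the simulated block with your own `cts` and `covarianceTable`. The voom and lmFit steps fit ordinary weighted linear models, which is why this tends to scale far better with sample count than the per-gene GLM fitting in DESeq2.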
Yes, I agree with this as well for bulk RNA-seq.