I've been trying to use DEseq2 to test for differential gene expression on a dataset of RNAseq sample from control and schizophrenia cases.This dataset is roughly ~200 control + ~200 schizophrenia cases/samples. Even if only use "diagnosis" in the design formula ("~ diagnosis"), DEseq2 takes a really long time to run. So far I've used a 72h limit in our cluster computing system, and within that time DEseq2 does not run to completion.
Another detail is this: For some reason, I've found that for this particular data I have that for every gene there's at least one sample with a 0 read count. Upon researching a bit on google, i've found that I could use the parameter type = "iterate" in the estimateSizeFactors function. So that, if "ds" is my DEseqDataset object, the comands I'm using to run DEseq2 are:
ds = estimateSizeFactors(ds, type = "iterate")
dds1 = DESeq( ds, parallel = TRUE)
My questions are:
Would DEseq2 be expected to take this long for a dataset this large ?
How can i improve the time needed for DEseq2 to run?.
Is it "normal" / "expected" that for every gene there's at least one sample with a read count of 0 for that gene ?
Any ideas/insights you might have that could help me improve this will be very well appreciated !
... and many thanks in advance !
Best,
-Mathias.
One more comment: no, it's not normal that every sample would have a 0 for every gene in bulk RNA-seq. There are housekeeping genes that you'd expect to have positive counts for all samples. There may be failed samples that are being included and should be excluded based on QC. I recommend using MultiQC on FASTQC output.