Hello!
I'm working on comparing tumor and healthy samples with DESeq2. My analysis only needs the adjusted fold change values that DESeq2 computes.
The input data contains a huge number of samples (I'm working with TCGA and GTEx data), and I need to run the calculation many times.
Calls to DESeq() take a very, very long time to complete.
Is there a way to speed the process up, given that I only need the adjusted fold change rather than the whole DESeq2 output?
The full script that I run on the data is here at this permalink.
The specific code snippet is here:
run_deseq <- function(case_frame, control_frame) {
  # Make a dummy metadata frame
  metadata <- data.frame(
    row.names = c(colnames(case_frame), colnames(control_frame)),
    status = factor(
      c(rep("case", ncol(case_frame)), rep("control", ncol(control_frame))),
      levels = c("control", "case")
    )
  )
  # Merge the case and control frames by row name
  data <- rowmerge(case_frame, control_frame, all = FALSE) # inner merge
  # Reorder the columns so they match the metadata rows
  data <- data[, row.names(metadata)]
  # The data must be un-logged (it is stored as log2(x + 1))
  data <- round((2 ^ data) - 1)
  DESeq2::DESeqDataSetFromMatrix(data, metadata, ~ status) |>
    DESeq2::DESeq() -> dsq_obj
  DESeq2::results(dsq_obj, name = "status_case_vs_control") -> dsq_res
  cat("DESeq finished running!\n")
  as.data.frame(dsq_res) |> tibble::rownames_to_column("gene_id")
}
I'm not posting sessionInfo() since nothing is actually broken; the code runs fine, it is just slow.
Thank you!
Update: on Mike Love's suggestion, I tried the glmGamPoi option in the fitType argument, but it seems to make the runtime even longer, so it does not look suitable.
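For reference, a minimal sketch of how that option is passed (assuming dds is the DESeqDataSet built in run_deseq above; this fit type needs the glmGamPoi package installed):
# glmGamPoi dispersion fitting instead of the default parametric fit
dds <- DESeq2::DESeq(dds, fitType = "glmGamPoi")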
The DESeq() function can be parallelized via BiocParallel; see ?DESeq and the BPPARAM argument. Beyond that, what do you define as "very long"? On simulated data with 50000 genes and 1000 samples as below, DESeq2 runs in about 3.5 min on my machine with a single core (i7 12700K) for a design with a single covariate. Does it run a lot longer for you? Would running a linear approach such as voom be an option?

Unrelated to this: GTEx and TCGA are completely different studies/batches, so combining them in a single quantitative analysis is highly questionable.
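A minimal sketch of that, assuming dds is an existing DESeqDataSet (the worker count here is arbitrary):
library(BiocParallel)
# Pick a parallel backend; MulticoreParam works on Linux/macOS, SnowParam on Windows
bp <- MulticoreParam(workers = 4)
dds <- DESeq2::DESeq(dds, parallel = TRUE, BPPARAM = bp)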
Thanks for the reply! I'm already running the task in parallel, so that is not really a viable solution. A single run takes approximately as long as you pointed out, but the problem is that I need to run many DESeq() calls, so the overall runtime is very long. I'm now considering using voom, as you pointed out, as well as the other ideas that have come up.

I am aware of this, and I'm already working on other data, but I wanted to use something easy and quick for a first run.
What do you mean by "adjusted"? Do you just mean the log2 fold changes calculated by doing inter-sample normalisation and then computing the fold change? Or do you mean the stabilised, shrunken LFCs that DESeq2 is capable of calculating? If it's the former, then you don't need to run DESeq() at all, and if it's the latter, then your code isn't calculating them.

To normalise data using DESeq2's inter-sample normalisation, you can use dds <- estimateSizeFactors(dds) and then recover the normalised counts with counts_data <- counts(dds, normalized=TRUE). No call to DESeq() necessary. To get the stabilised, shrunken values, you need to use lfcShrink on your results object.

I notice that you are "unlogging" your data. This suggests to me that the data you have isn't counts. To run DESeq2 you need counts. It's not enough for them to be integers; they need to be negative-binomially distributed (or approximately so). I suspect DESeq() is taking so long to run because it's struggling to fit the distribution, since your data isn't really count data.
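A minimal sketch of both options, assuming dds is a DESeqDataSet built as in run_deseq above (the apeglm shrinkage type is just one choice and needs the apeglm package):
# Option 1: inter-sample (median-of-ratios) normalisation only; no DESeq() call needed
dds <- DESeq2::estimateSizeFactors(dds)
counts_data <- DESeq2::counts(dds, normalized = TRUE)

# Option 2: stabilised, shrunken log2 fold changes; these do need the fitted model
dds <- DESeq2::DESeq(dds)
res_shrunk <- DESeq2::lfcShrink(dds, coef = "status_case_vs_control", type = "apeglm")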
It could also be memory; this matrix is ~200 MB. Things could likely also be sped up a lot by reasonable filtering.
See ?edgeR::filterByExpr
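A minimal sketch of that filtering step, assuming data and metadata as built in run_deseq, applied before DESeqDataSetFromMatrix:
# Keep only genes with enough counts across the case/control groups
keep <- edgeR::filterByExpr(data, group = metadata$status)
data <- data[keep, ]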
Filtering is something I don't currently do, but probably should. I'll try it out! Thanks.
Hello! Thanks for taking the time to reply.
It's the latter. I was 100% sure that it did calculate them, but now that you point it out, I see that I'm completely wrong. Since our datasets are very large, I believe that shrinkage of the LFCs would not have a big effect, but the error is there nonetheless.
The original data is from UCSC Xena (here). The data was log-transformed by Xena, but it's actually estimated counts from RSEM, which I believe are the correct input for DESeq2. But since the data comes from two very different consortia, I wonder if the marked batch effect between the two would cause DESeq2 to "struggle" as you said.

In any case, thanks for pointing out my error.