Hi all,
I am trying to use DESeq2 to simulate time course RNA seq dataset.
Setup details:
Let's assume we have a time course dataset with 1000 genes, 8 time points and 3 replicates per time point. Here each time point is a 'condition' amounting to a total of 8 conditions. The total number of samples are 8x3 = 24.
Simulation approach:
- Feed a count matrix of size (1000x24) into DESeq.
- Get one mean per gene per sample (hence 1000x24 = 24k mean values) and one dispersion estimate per gene (hence 1000 dispersion values).
- Take a gene (let's say BRCA1) and a given replicate (let's say replicate 1).
- Take the 8 Negative Binomial distributions corresponding to each of the 8 time points of BRCA1 replicate 1.
- Sample values from each of the 8 distributions to create new time courses for BRCA1 replicate 1.
Questions:
- After doing mean/dispersion estimation, how to get the mean values(per gene per sample, mu_ij) and dispersion values (per gene, alpha_i)?
- It'd also be nice to hear from other members of the community (most of whom I believe are way more experienced in DESeq than me) their thoughts on such an approach to simulating time course RNA seq data and if anyone is aware of a better approach to simulating time courses.
DESeq model that I have referred to:
K_ij ~ NB(mu_ij, alpha_i)
mu_ij = s_j q_ij
log2(q_ij) = x_j. beta_i
Thanks!
Indeed, what I am trying to do is simulate multiple time courses of count data for a gene, all originating from the same ground truth (true transcript level time course for the gene). Therefore, I thought it would be a good idea to sample from the NB distribution NB(mu_ij, alpha_i) for the i th gene at each time point to simulate "new" time courses of "counts" for the gene.
However, I have been unable to confidently extract mu_ij and alpha_i from DESeq. I found that calling assays(dds) returns a few lists, one of them labelled "mu". Are these the mu_ij 's I am looking for?
Also, it doesn't seem to return anything that looks like the alpha_i values. Seems like I am missing something very trivial here. Could you let me know how to access these 2 parameters?
alpha_i is dispersions(dds), see here.
I'm not sure I'd go about it that way with the mu_ij. I'd just simulate time profiles from scratch. But again, this is up to you. I'm not writing any time course simulation functions.
Hi Micheal, thanks for pointing me towards the relevant link(s). I had a follow up question about this and felt it would be more appropriate to ask it here rather than as a separate question.
I checked the source code of the makeExampleDESeqDataSet function. I see that you pass mu=mu and size = 1/dispersion as 2 of the arguments.
Does that mean that in my case, K_ij ~ NB(n=<number of samples to draw>, mu=mu_ij, size = 1/alpha_i)?
Doesn't this (and rnbinom's usage in the makeExampleDESeqDataSet function) directly contradict the inference from the mean-dispersion plot that mean counts and variance have a strong correlation? Even in case of a simple (non-time course) 2 condition system, wouldn't it be better to have per gene per sample measures of variance fed into the rnbinom function, while sampling read counts?
I also read in your paper that:
Var K_ij = μ_ij + (α_i*(μ_ij^2)) models the variance of read counts across replicates. Do you also take this into account in the makeExampleDESeqDataSet function?
Thanks
Here's the line you're referencing:
https://github.com/mikelove/DESeq2/blob/master/R/core.R#L421
So 'mu' is a matrix and 'dispersion' is a vector. And we plug this into the 'rnbinom' function.
"Doesn't this (and rnbinom's usage in the makeExampleDESeqDataSet function) directly contradict the inference from the mean-dispersion plot that mean counts and variance have a strong correlation?"
No. See how 'dispersion' is constructed a few lines above directly from the expected mean for the first condition group. But even if we modeled dispersion as constant across all genes, mean and variance will be related, more on this below.
"wouldn't it be better to have per gene per sample measures of variance fed into the rnbinom function, while sampling read counts?"
The DESeq2 model (and nearly all other Bioc packages which use count distributions) is that there is a constant dispersion for each gene. Note that dispersion is not variance. Variance depends on the sample-and-gene-specific mean and the gene-specific dispersion.
Our simulation function directly generates the data according to the model presented in the paper.
Var = mu + alpha * mu^2 is built into the Negative Binomial. If you use 'rnbinom', then you obtain data that has this variance relationship.