Hi,
I apologise for the naive question but after reading the DESeq2 paper and vignette, I am still not sure I fully understand how the method works.
My questions are as follows:
As one dispersion parameter is calculated per gene, does the calculation ignore the group membership of each sample, and is this also true for the mean parameter?
Is a single negative binomial model fit per gene, so this assumes that the distribution of counts for each condition e.g. genotype is the same, or have I misunderstood this?
What is the reason that you can't just perform a t-test comparison? Is the reason for generating a negative binomial model so that you can estimate the variance you would expect if you collected more samples?
How does the Wald test work to identify DEGs? Does it essentially look at where the samples from a given condition lie within the NB distribution for that gene?
Thanks for any explanations/clarifications. My lack of stats training is really hindering me here!
Best wishes,
Lucy
Thank you. With regards to question 1, I am still unclear as to how you can have a single NB model or equation per gene if the mean parameter is sample-specific. I am imagining one curve that shows the distribution of counts in a given gene, but if the mean is sample-specific, then should I be imagining one curve per sample per gene?
Would you be able to provide a theoretical example of an equation for a gene (say with the design formula ~condition) and how you would plug in the relevant values to get back to the original counts. I understand how the 1s and 0s in the design matrix switch on and off particular values in the equation.
We don't have a single NB model per gene, right? Look again at equation 1:
The count for gene i and sample j is given as a NB with a gene and sample specific mean
mu_ij
and dispersionalpha_i
.I see, so why don't you end up with separate beta coefficients for each sample/gene combination rather than beta coefficients that represent the fold change of a given gene in e.g. group B vs. group A?
I don't follow. For gene i, we have a single column vector
beta_i
which gives the intercept and then log fold changes associated with covariates. E.g. if the design is~batch + condition
, and each have two levels, then thebeta_i
column vector would have three elements, e.g.:Intercept, batch2, conditionB
...You may want to consult with a textbook on how linear models are represented, it's typical to write:
So
beta_0
andbeta_1
are shared across the samples j. We could rewrite this as: