DESeq2 Dispersion Shrinkage - more samples is better?
1
0
Entering edit mode
rrcutler ▴ 70
@rrcutler-10667
Last seen 4.8 years ago

Hello,

Reading more on how dispersion shrinkage is working, I wanted to clarify what my understanding is as well as address a couple of questions. I have read both the DESeq2 and DDS papers that talk about dispersion shrinkage. 

 

My understanding:

The point of dispersion shrinkage is to share information (dispersion estimates) across genes with similar average expression levels. This is under the assumption that genes with similar expression levels should have similar dispersions and are thus subject to shrinkage once a curve is fitted that represent an accurate estimate for dispersion levels. The larger a dispersion value, the larger the difference in expression has to be in order for a gene to be called DE. As the number of replicates for each condition increases, the amount of dispersion shrinkage per gene decreases as we are then able to estimate the dispersion parameter from the data without shrinkage. 

 

Questions:

1) Biologically, how can we make the assumption that genes with similar expression levels have similar dispersions? Namely, what is the evidence for this?

2) It is recommended for most cases to include all samples in the experiment when running DESeq2 in order to more accurately estimate the dispersion parameter for each gene. Is this because this gives DESeq2 more points to use in order to fit the dispersion estimate?

3) If 2 is true, then including more samples should not effect the magnitude of shrinkage for each gene's dispersion estimate? This is only affected when more replicates per condition are included.

4) Dispersion estimates are made gene-wise. Is this gene-wise estimate specific to the genes within a group of replicates? For example, in two different conditions (1& 2), the dispersion estimate for gene A in condition 1 will not necessarily be the same as the dispersion estimate for gene B in condition 2 (especially if they are expressed at different levels).

 

Any comments and help on understanding this would be greatly appreciated.

Thanks,

-R

 

deseq2 DDS • 5.5k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

These are good questions. Let me start with the question about the mean of the dispersion prior. In the DESeq2 paper we assume a prior on the dispersion parameter which is dependent on the mean of normalized counts for a gene, so we write alpha_tr(mu-bar), where the 'tr' stands for 'trend'. The functional form of alpha_tr can be either parametric, or if we fail to find a good fit for the parametric form, we fall back on semi-parametric (loess).

The motivation for the having the mean of the dispersion prior be a function of the mean isn't from biology or first principles. I would say our motivation is empirical, or you could similarly say it is motivated by data science. If you take a large bulk RNA-seq dataset, where we have a good number of samples for estimating the dispersion, and calculate an MLE for the dispersion, and then plot over the sample mean of normalized counts, you get a plot like this one below. The "middle" for the dispersion values is not flat across the dynamic range of counts. (This is the set of mean and dispersion pairs from the Pickrell dataset in the DESeq2 paper, and I've cut the bottom at 1e-3, note that there are some genes with sample variance less than sample mean whose dispersion estimate goes toward 0, see Supplementary Figure 1 for more details on this plot).

If we assume that the dispersion is roughly similar for different groups of samples, then we have better estimates of that parameter by putting all the samples together for dispersion estimation. We have a FAQ in the vignette about when this is a good idea or when not.

The amount of shrinkage is reduced when you increase the residual degrees of freedom = number of samples - number of parameters to estimate. So if you have n samples and p groups, and using a simple design of ~group that translates to (n - p).

The dispersion value for gene i is alpha_i (see Equation 1 in DESeq2 paper). So it's the same for all samples and all groups for a given gene. It links the variance of counts to the expected value by Var K_ij = mu_ij + alpha_i mu_ij^2. So for a given group of samples, you use X and beta to figure out q_ij, then you multiply by s_ij to account for sequencing depth and other biases, giving you mu_ij. The variance of K_ij for this sample is then determined by alpha_i and mu_ij.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thanks for your helpful replies, as always. 

To expand/summarize on your explanation of the mean of the dispersion prior... (please correct me if I'm wrong) Despite having small sample sizes, we want to estimate a dispersion parameter which is closest to the true population (if our number of replicates was to be sufficiently high) in order to more accurately detect SDE genes. By pooling all of the dispersion estimates from all the samples included in the experiment (assuming their dispersion is similar) and using similarly expressed genes to fit the mean of the dispersion prior to this pooled dispersion estimate and then shrink dispersions gene-wise accordingly. To me, this seems to be a clever way around having a small number of replicates to estimate dispersion, and instead to use samples from other groups to share dispersion estimate of genes that are similarly expressed; ghost replicates. 

As an example, if I had 2 conditions with 3 replicates and each sample contained 3 genes, then the dispersion plot would show points for 6 genes in the experiment (average of each gene between 3 replicates) and would then get a dispersion estimate for each gene as described above?

I think it would be great if we can pair these empirical observations with any biological meaning. Why aren't these lowly expressed genes as tightly regulated between replicates when compared to highly expressed? Although I can see there being exceptions to this 'rule' - should these genes then be influencing the mean of the dispersion prior?

What is concerning to me is that the gene dispersion values alpha_i is the same for the gene in all samples. So this would mean that even though gene A has 10 counts in condition 1 and gene A has 1000 counts in condition 2, the dispersion value would be the same be the same for this gene? I say in this case, dispersion values for these genes should be unique to their condition as their expression is very different.  Say in an experiment over a developmental time course, the expression of a particular gene changes throughout the samples not due to treatment effect - is there only one dispersion value for this gene?

 

ADD REPLY
0
Entering edit mode

You just get a single dispersion estimate per gene. So if you have 2 conditions with 3 replicates each, and 'n' genes, you have 'n' dispersion estimates. And the plotDispEsts() plot will have 'n' points.

Note that the up-tick on the left side of the plot is only partially related to variance. As we explain in the Methods section of DESeq2, another part of the up-tick has to do with estimation error. See the section starting with "Some caution is warranted to disentangle true underlying dependence from effects of estimation bias that can create a perceived dependence of the dispersion on the mean."

Regarding "Why aren't these lowly expressed genes as tightly regulated between replicates when compared to highly expressed", I would say, these plots depend highly on the experiment, tissue, population, etc. There are some plots where the dependence is fairly flat. When the dependence is flat, the parametric fit fails and the loess is substituted. In addition, we have fitType="mean" if the user can tell by eye that the dependence is flat. Also, because of the way we fit the mean, the prior is only influenced by genes with similar mean, so it's a fairly "safe" statistical procedure.

Regarding: "even though gene A has 10 counts in condition 1 and gene A has 1000 counts in condition 2, the dispersion value would be the same be the same for this gene" Note that the dispersion is not the variance. See the formula I gave above.

ADD REPLY
0
Entering edit mode

Hi Michael,

Since there is a single dispersion estimate per gene across the whole sample, I assume the dispersion value is plotted with the average expression of a gene across all samples. So is including more samples, in order to better estimate the dispersion, essentially just including more expression data per gene that will result in an average representative of a gene's expression across the whole experiment? I previously thought including more samples would introduce more data points to the dispersion vs mean plot in oder to better fit the mean of the dispersion prior.

 I previously thought the dispersion value estimate was gene-wise per group of samples, which makes sense to me given that the expression of a gene can change enormously in different groups.  Taking the average expression of a gene across all samples in an experiment seems inaccurate to calculate the dispersion for each gene given that this does not seem to capture the different expression levels of a gene in different groups. 

As I have mentioned before, I am especially concerned with these estimates being made this way as the different groups I am using come from different stages in development where gene expression values for the same genes are enormously different. My worry is that genes that may be DE in stages where they are highly expressed, may not show up if they are lowly expressed in other stages. This would be due to a large dispersion estimate that is due to the averaging of these low and high expression levels that does not seem to reflect the true dispersion within each group of the experiment. So would you recommend to use all samples when their expression profiles are very different due to different times in development?

I now see the relationship between dispersion and variance by plotting these values for my data, which was not what I expected. A lower dispersion value, the higher that variance? Nonetheless, since the two are closely related, if a dispersion parameter is the same across genes where their expression levels are very different, then it does not seem that the dispersion nor variance is estimated accurately because of loss of group resolution. 

Dispersion over mean u

Variance over mean u

ADD REPLY
0
Entering edit mode

Regarding this question: "So is including more samples, in order to better estimate the dispersion, essentially just including more expression data per gene that will result in an average representative of a gene's expression across the whole experiment?...Taking the average expression of a gene across all samples in an experiment seems inaccurate to calculate the dispersion for each gene given that this does not seem to capture the different expression levels of a gene in different groups."

It's useful for rephrase this in terms of the terms from the DESeq2 paper, so I can be very specific. The x-axis above is mu-bar_i = sum_j K_ij / s_ij, the mean of normalized counts across all samples. 

The question can be rephrased:

Q: Is mu-bar_i (a single estimate of the average expression per gene i) used to set the mean parameter for the negative binomial distribution for counts when fitting the dispersion parameter (either the initial maximum likelihood or the final maximum posterior estimate of dispersion)?

No, mu-bar_i isn't used. This would be a really bad estimate for the counts. As I said above, we use our estimate of mu_ij, which we write mu-hat_ij, or sometimes mu-vec_i, a vector of fitted mean values for all samples for gene i. The estimates mu-hat_ij (again, there is a different mu-hat_ij for each sample j in gene i) incorporates the normalization factor s_ij and the fitted fold changes between groups. So the most important thing to appreciate is: Large differences between groups will not increase the dispersion estimate. This is the same approach as for any other GLM-based method.

ADD REPLY
0
Entering edit mode

Thanks Michael, this has really helped me understand how the dispersion estimation works now in DESeq2.

Knowing that mu-hat_ij is a vector incorporated with other metrics to set the mean count is a bit confusing, especially when you end up with one dispersion parameter per gene across all samples. 

To address my primary concern, I would like to know if it is reasonable to load all samples in the experiment into DESeq2 when they have come from different developmental stages where global gene expression is very different? You pointed out that large differences will not increase the dispersion estimate, although I do not think a common dispersion parameter for a gene across all samples is appropriate. 

Best,

-R

ADD REPLY
0
Entering edit mode

You need a separate value mu_ij for each sample and each gene. This is a critical component of all RNA-seq methods. Really the DESeq2 paper formula (which are also provided in the vignette) are the best place to dive deep into the methods.

Yes, it's fine to load all samples even though the global gene expression levels are different. However, we do have an FAQ on this topic which provides further guidance on when to combine or split data. Have you seen that part of the vignette?

ADD REPLY
0
Entering edit mode

Hi Michael,

Yes, I have read this part in the FAQ. The clustering density of the within group samples looks to be about the same across developmental stages, although the stages are clearly separated by the 2nd principle component. However, I'm still not sure why it is reasonable to use a single dispersion parameter for a single gene across the conditions where that same gene may be at very different expression levels. In the vignette, it reads, "the final dispersion value will incorporate the within-group variability across all groups," which I interpret as the average of the within-group variability across all groups, assuming that all groups have the same within-group variability. Although this seems to be getting at a global measure of variability, which DESeq2 does not have, only gene-wise. How can we check to make sure that this global variability within-groups is not due to specific gene-wise variabilities that change between groups (the within-group variability between two different groups is caused by different genes)? My main concern here is that the final dispersion value for a gene may be misrepresented between different groups despite a similar spread on a PCA plot. 

 Can you explain why large differences in expression between groups will not increase the dispersion estimate? This seems to be very important in regards to the above question and when using groups from different developmental stages in DESeq2.  

I have read recently from this paper by Yu et al., where they estimate a dispersion parameter for each separate condition, as this might lead to other misestimations that use the dispersion parameter when dispersion parameters vary by condition. This is also able to be used in the framework of a GLM, it appears. 

While DSS does not use GLMs, it looks like their use of the dispersion parameter is specific to each group being compared in their formulation of the wald test.

-R

 

ADD REPLY
0
Entering edit mode

I don’t know any more ways to explain that a single dispersion estimate still means that the groups of samples will have different means and therefore variances.

ADD REPLY
0
Entering edit mode

Just to add, it sounds like you prefer a model with more parameters. That is fine, it’s obviously up to you what method to use. Our reason for limiting the number of parameters is to have reliable performance, but like I said, if you prefer another method that is your choice. I don’t have much more to say in this thread though so I may stop at this point.

ADD REPLY
0
Entering edit mode

Thanks for all your expertise Michael, I really appreciate it.

ADD REPLY
0
Entering edit mode

As a side note, I have always been confused as to the relationship between the mean-dispersion and mean-SD. As you have shown in your plot, it is obvious that dispersion decreases as the mean increases. However, I have read that count data is heteroscedastic which means that the variance is greater with the mean. How does these seemingly opposite observations complement each other? Is this because count data is heteroscedastic until it is fit to the Negative Binomial model and there after we observe the mean-dispersion relationship?  

ADD REPLY

Login before adding your answer.

Traffic: 953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6