Question

gene dispersion, what does it mean?


tonja.r ▴ 80


United Kingdom

Every RNA-seq analysis tool talks about the dispersion of a gene. As far as I understand, it is not simply the variance of the normalized counts for a given gene; it seems to be something more complicated.

DESeq defines the dispersion as follows:

But what would a dispersion of 0.19 or a dispersion of 0.80 tell me? Can I still interpret it as a variance of a gene?

gener rnaseq • 29k views

ADD COMMENT • link updated 2.1 years ago by Gordon Smyth 52k • written 9.4 years ago by tonja.r ▴ 80


That looks like the DESeq2 paper. We define the dispersion parameter in the first sentence of the section in the main text where we introduce it:

"Within-group variability, i.e., the variability between replicates, is modeled by the dispersion parameter alpha, which describes the variance of counts via..."

The dispersion parameter links the variance and mean of the count for the negative binomial distribution.
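
For reference, the standard negative binomial mean-variance relation that this sentence alludes to can be written as below. The symbols are chosen here for illustration: K is the count for a gene in one sample, \mu its mean, and \alpha the dispersion.

```latex
\operatorname{Var}(K) = \mu + \alpha\,\mu^{2}
\qquad\Longleftrightarrow\qquad
\frac{\operatorname{Var}(K)}{\mu^{2}} = \frac{1}{\mu} + \alpha
```

The first term is the Poisson (sampling) part and the second is the extra variability captured by the dispersion, so \alpha = 0 reduces to a Poisson distribution.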

ADD REPLY • link 9.4 years ago Michael Love 43k


Aaron Lun ★ 28k


The city by the bay

Check out Section 2.8.2 of the edgeR user's guide (v3.12.0). Briefly, the square root of the negative binomial dispersion for a gene is the biological coefficient of variation (BCV) across your replicate samples. This refers to the standard deviation divided by the mean, for the fraction of all cDNA fragments corresponding to that gene. Roughly speaking, this is interpretable as the proportion of expression attributable to variability.

Larger values for the BCV indicate that the gene expression is more variable relative to the mean. This is arguably more useful than the variance itself. Highly expressed genes will usually have a larger absolute variance because the counts are bigger, but the relative variability and BCV are often lower. For example, if you have a highly expressed gene, then even if it has a large absolute variance, that might not matter if the variance contributes to a very small wobble around the large mean expression across replicates. Conversely, a small variance might be substantial for lowly expressed genes because the wobble would be relatively large compared to the mean.

ADD COMMENT • link 9.4 years ago Aaron Lun ★ 28k


"Briefly, the square root of the negative binomial dispersion for a gene is the biological coefficient of variation (BCV) across your replicate samples. This refers to the standard deviation divided by the mean, for the fraction of all cDNA fragments corresponding to that gene."

So, according to edgeR, to calculate the dispersion for a gene, I just need to compute (sd(gene)/mean(gene))^2. Right?

I do not think I understood your example:

"For example, if you have a highly expressed gene, then even if it has a large absolute variance, that might not matter if the variance contributes to a very small wobble around the large mean expression across replicates."

Could you please give a small example with numbers and, let's say, 4 replicates: 2 treated and 2 untreated?

ADD REPLY • link 9.4 years ago tonja.r ▴ 80


No, you use the estimateDisp function. We've already set it up for you, so don't try to do it yourself. This is not a simple calculation; it maximizes the adjusted profile likelihood to obtain an unbiased estimate of the NB dispersion, with some EB shrinkage to stabilize the estimate in the presence of limited replicates. Of course, you don't need that level of understanding to interpret the value of the dispersion; all you need to know is "bigger dispersion = more variable".

As for the example, imagine you have a gene with counts of 1000, 1010, 990, 1000. Now imagine you have another gene with counts of 10, 20, 0, 10. They have the same variance, but which one would you consider to be more variable?

Edit: Obviously, this is a rhetorical question. The first gene has fluctuations of 1% around the mean. The second gene has fluctuations of 100%. It's clear that the second one is more variable, and this will be reflected in the BCV/dispersion.
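
The arithmetic behind this example can be checked with a short sketch. It is written in Python purely for illustration (the thread itself is about R packages), and the names `gene_high`, `gene_low` and `summarize` are made up here. It only compares raw summary statistics; it does not estimate a dispersion.

```python
from statistics import mean, variance

# The two illustrative genes from the example above (four samples each).
gene_high = [1000, 1010, 990, 1000]
gene_low = [10, 20, 0, 10]


def summarize(counts):
    """Return (mean, sample variance, largest relative deviation from the mean)."""
    m = mean(counts)
    v = variance(counts)  # sample variance with an n - 1 denominator
    max_rel_dev = max(abs(c - m) for c in counts) / m
    return m, v, max_rel_dev


mean_high, var_high, rel_high = summarize(gene_high)
mean_low, var_low, rel_low = summarize(gene_low)

print(f"high gene: mean={mean_high}, variance={var_high:.2f}, largest deviation={rel_high:.0%}")
print(f"low gene:  mean={mean_low}, variance={var_low:.2f}, largest deviation={rel_low:.0%}")
```

Both genes have the same sample variance (about 66.7), yet the largest deviation is 1% of the mean for the first gene and 100% of the mean for the second.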

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k


How do you define a "fluctuation" (how do you calculate it?), and what do you mean by "more variable"? If the second gene is more variable, does that mean that it is differentially expressed?

ADD REPLY • link 9.4 years ago tonja.r ▴ 80


The first gene has a mean of 1000 across all samples. The second gene has a mean of 10. In each gene, one sample has an increase of 10, and another sample has a decrease of 10. This represents a 1% change relative to the mean for the first gene, and a 100% change for the second gene. So, clearly, the second gene is more variable across your samples; its expression doubles in one sample and is completely absent in another, while the first gene has fairly stable expression across all samples. I should also stress that this example has nothing to do with differential expression. The numbers above don't have any relation to the dispersion parameter, either; the example is just to show that the absolute variance of the counts is less important than the variance relative to the mean (which is quantified by the biological coefficient of variation, which in turn is represented as the square-root of the dispersion).

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k

score 9 · Accepted Answer · 2015-11-29


Gordon Smyth 52k


WEHI, Melbourne, Australia

The most complete explanation of what the dispersion means from a scientific point of view is probably in the edgeR glm paper:

http://nar.oxfordjournals.org/content/40/10/4288

See the first section of Results in conjunction with the first section of Methods. That article characterized sqrt(dispersion) as the "biological coefficient of variation (BCV)", and that is the terminology we have used since in the edgeR articles and documentation. The BCV is the relative variability of expression between biological replicates.

If you estimate dispersion = 0.19, then sqrt(dispersion) = BCV = 0.44. This means that the expression values vary up and down by 44% between replicates.
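
As a quick check of that conversion for the two values in the original question, here is a minimal Python sketch. The helper name `bcv_from_dispersion` is hypothetical and not part of edgeR or DESeq2.

```python
import math


def bcv_from_dispersion(dispersion):
    """Biological coefficient of variation = square root of the NB dispersion."""
    return math.sqrt(dispersion)


# The two dispersion values asked about in the question.
for dispersion in (0.19, 0.80):
    bcv = bcv_from_dispersion(dispersion)
    print(f"dispersion = {dispersion:.2f} -> BCV = {bcv:.2f}")
```

So a dispersion of 0.19 corresponds to a BCV of about 0.44, and a dispersion of 0.80 to a BCV of about 0.89.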

An important point, that is easy to miss, is that the BCV measures the relative variability of true expression levels, not the variability of measured expression levels. The BCV represents the relative variability that you would observe if you were able to measure the true expression levels perfectly in each RNA sample, even though one can't actually do that. It represents the variability that remains after the Poisson variability from sequencing has been removed.

To repeat, BCV does not represent the variability between observed expression levels. It is the variability of true expression levels. You cannot measure BCV using an undergraduate formula from the observed counts or RPKM values.
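
One way to see why a naive formula on the observed counts does not recover the BCV: under the negative binomial model, the squared coefficient of variation of the observed counts is 1/mean + dispersion, where the 1/mean term is the Poisson sequencing noise. The Python sketch below evaluates that relation for a few mean counts. It is an illustration only, with a hypothetical helper `total_cv` and arbitrarily chosen mean counts; it is not how edgeR or DESeq2 estimate dispersion.

```python
import math


def total_cv(mean_count, dispersion):
    """CV of observed NB counts: sqrt(Poisson part + biological part)."""
    return math.sqrt(1.0 / mean_count + dispersion)


dispersion = 0.19            # value from the question
bcv = math.sqrt(dispersion)  # CV of the true expression levels

# Mean counts chosen arbitrarily to span low to high expression.
for mean_count in (5, 50, 5000):
    cv = total_cv(mean_count, dispersion)
    print(f"mean count {mean_count:>4}: observed CV = {cv:.3f}, BCV = {bcv:.3f}")
```

For low counts the observed CV is well above the BCV because of the Poisson term, and the two only converge when counts are large.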

Afternote: I have just noticed that you asked the same question at the same time on Biostars:

https://www.biostars.org/p/167688/

I also see that you have previously posted a number of questions about edgeR on Biostars but not on Bioconductor, and some of those questions went unanswered. Please be aware that the edgeR authors try hard to answer questions on Bioc, but we don't have the time or resources to monitor every possible forum.

ADD COMMENT • link 9.4 years ago • updated 2.1 years ago Gordon Smyth 52k


Just a small question: the true expression levels are estimated once the Poisson variability is removed, right?

ADD REPLY • link 9.4 years ago tonja.r ▴ 80


No, it doesn't have anything to do with estimating expression values.

ADD REPLY • link 9.4 years ago Gordon Smyth 52k


Sorry, I have single-cell sequencing data with 209 cells at one time point. For two genes, I want to know which one actually has the higher expression across the 209 cells at this time point.

Gene A: mean = 2.71, sd = 1.05, biological coefficient of variation (BCV) = 0.37

Gene B: mean = 2.27, sd = 0.922, biological coefficient of variation (BCV) = 0.406

So, which gene is really expressed more within the 209 cells at this time point? I was not able to relate the BCV concept to my question.

ADD REPLY • link 6.9 years ago AZ ▴ 30


Are you using the zinbwave DESeq2 integration? If not, you should. Without it, DESeq2 does a bad job with bimodal data that can’t be fit with an NB.

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#recommendations-for-single-cell-analysis

ADD REPLY • link 6.9 years ago Michael Love 43k


I am confused about which replicates the BCV refers to. For example, I have Treatment A (3 replicates) and Treatment B (3 replicates). Which treatment's replicates does the BCV describe the variation between?

ADD REPLY • link 5.6 years ago vm.higareda ▴ 10
