Question

DESeq variance stabilisation and clustering

Timothy Hughes ▴ 20

@timothy-hughes-4553

Last seen 10.5 years ago

We wish to perform clustering on expression data and are therefore interested in the variance-stabilizing transformation of DESeq. I understand the purpose of the transformation, namely to produce values whose variances are approximately the same, but why is it necessary to do this when computing the distance between two values? Or, put another way, in what way does hierarchical clustering make assumptions about similar variances?

I believe I have the answer, but it would be nice if someone could confirm this.

When doing clustering, one is often effectively trying to minimize the variance within a cluster, even if this is not explicitly defined. If we consider that the observations being clustered are random variables with a variance, then we should explicitly account for this variance and use a variance-stabilising transformation. This avoids the need to account for the variance in the clustering process itself.

The intuition would be that, given 3 observations:

A (high var)----------------B-----------------------------------------C (low var)

one may choose to cluster B and C if C's variance is very much lower than A's, even though the observed distance between B and C is greater than the distance between B and A.

Any help much appreciated.

--
Tim Hughes PhD (http://digitised.info)
Medical Genetics Department
Oslo University Hospital Ullevål
Kirkeveien 166
0407 Oslo
Norway

Genetics Clustering PROcess DESeq • 1.6k views

Updated 13.9 years ago by Timothy Hughes ▴ 30 • written 13.9 years ago by Timothy Hughes ▴ 20

score 0 · Answer 2 · 2011-03-23

Hi Timothy

On 03/23/2011 10:47 AM, Timothy Hughes wrote:
> We wish to perform clustering on expression data and therefore are
> interested in the variance-stabilizing transformation of DESeq. I understand
> what the purpose of the transformation is namely to produce values whose
> variances are approximately the same, but why is it necessary to do this
> when computing the distance between two values? Or put another way, in what
> way does hierarchical clustering make assumptions about similar variances?
>
> I believe I have the answer, but it would be nice if someone could confirm
> this.
>
> When doing clustering one is often effectively trying to minimize the
> variance within a cluster even if this is not explicitly defined. If we
> consider that the observations being clustered are random variables with a
> variance then we should explicitly account for this variance and use a
> variance stabilising transformation. This avoids the need for trying to
> account for the variance in the clustering process.
[...]

When talking about clustering, it is important to be clear about what you are clustering: samples or genes?

In the DESeq vignette, I am clustering samples, i.e., I want to see which samples are similar to each other, hoping to find that replicate samples appear more similar than samples from different conditions. For this, I need a measure of distance between samples. To compare two samples, one usually takes the two vectors with the expression values of all genes in the respective samples and calculates the distance between these vectors. If one uses Euclidean distance, one calculates, for each gene, the difference in expression between the two samples, squares all these differences, adds up the squares and takes the square root. You want all genes to have roughly equal influence on the distance, and for this, all genes should have roughly equal variance.

If you use raw counts, the top ten or so most strongly expressed genes have so much more variance than the rest that all the other genes have hardly any influence. DESeq's VST rectifies this. So, my motivation for adding the VST to DESeq was to give the user a possibility to calculate distances between samples.

You seem to be talking about clustering genes, not samples, however. I had not yet thought about this application, but I think your explanation goes the right way. As strong genes have strong variance in all samples, all samples will contribute equally to any measure of distance between two genes. So we don't have the issue I just discussed, that different components influencing the distance have unequal weight. However, the variance of the distance measure itself is now vastly different between weak and strong genes. Two strong genes which actually behave similarly will not cluster together because their large values amplify the noise contributions to the distance, while two weak genes will always have a small distance because their small expression values also make their distance appear small. Again, the VST changes the scales such that typical distances (as differences, not ratios) between genes become independent of overall expression strength.

Simon
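The sample-clustering argument can be illustrated with a small sketch. This is not DESeq's VST: it assumes Poisson-like counts and uses a plain square root as a stand-in variance-stabilising transformation. The count table, the sample names and the helper functions `euclidean` and `nearest` are all invented for illustration.

```python
import math

# Invented toy counts (not real data): rows are genes, columns are samples.
# A1/A2 are replicates of one condition, B1/B2 replicates of another.
samples = ["A1", "A2", "B1", "B2"]
counts = [
    [10000, 10300, 10290, 9990],  # one strong gene: large noise, no condition effect
    [4, 5, 36, 34],               # weak genes that differ between conditions
    [5, 4, 30, 38],
    [3, 4, 40, 36],
    [4, 3, 35, 31],
    [6, 5, 33, 37],
]

def euclidean(matrix, i, j):
    """Euclidean distance between sample columns i and j."""
    return math.sqrt(sum((row[i] - row[j]) ** 2 for row in matrix))

def nearest(matrix, name):
    """Name of the sample closest to `name` in Euclidean distance."""
    i = samples.index(name)
    others = [j for j in range(len(samples)) if j != i]
    return samples[min(others, key=lambda j: euclidean(matrix, i, j))]

# Assumption: a square root roughly stabilises the variance of Poisson-like
# counts. It stands in here for DESeq's VST, which is computed differently.
stabilised = [[math.sqrt(x) for x in row] for row in counts]

print(nearest(counts, "A1"))      # B2: noise in the strong gene decides
print(nearest(stabilised, "A1"))  # A2: the replicate is now the closest sample
```

On the raw counts, the noise in the single strong gene outweighs the consistent differences in the five weak genes, so A1 ends up closest to a sample from the other condition. After the transformation, all genes contribute on a comparable scale and the replicates pair up.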