Question

How to compute the stabilized variance in DESeq?

0

Entering edit mode

Peng Yu ▴ 940

@peng-yu-3586

Last seen 10.3 years ago

Hi Simon, I see the resVarA & resVarB columns, which is the ratio between the stabilized variance and the actual variance. To get the stabilized variance, is the following R code correct? var(values_A)*resVarA -- Regards, Peng

• 868 views

ADD COMMENT • link updated 12.7 years ago by Simon Anders ★ 3.8k • written 12.7 years ago by Peng Yu ▴ 940

score 0 · Answer 1 · 2012-04-10

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.4 years ago

Zentrum für Molekularbiologie, Universi…

Dear Peng On 04/09/2012 11:46 PM, Peng Yu wrote: > Hi Simon, > > I see the resVarA& resVarB columns, which is the ratio between the > stabilized variance and the actual variance. To get the stabilized > variance, is the following R code correct? > > var(values_A)*resVarA No, for several reasons: - assuming that values_A is a matrix, 'var' gives you a single value computed over all values. You want 'rowVars' - resVarA/B was the ratio of per-gene variance estimate and fitted variance. Both are estimates, neither is an 'actual' value in the sense of a population value. You might be confusing things with the 'varianze-stabilizing transformation', because there is no quantity in DESeq that we call the 'stabilized variance'. Hence, I don't know what you are actually looking for. - You are using an outdated version of DESeq. In the current version, the resVarA/B columns have been removed. See the new vignette for details and reasons for this. See also the description of the new 'fitInfo' function in the vignette, which might give you what you want. Simon

ADD COMMENT • link 12.7 years ago Simon Anders ★ 3.8k

0

Entering edit mode

On Tue, Apr 10, 2012 at 3:50 AM, Simon Anders <anders at="" embl.de=""> wrote: > Dear Peng > > > On 04/09/2012 11:46 PM, Peng Yu wrote: >> >> Hi Simon, >> >> I see the resVarA& ?resVarB columns, which is the ratio between the >> >> stabilized variance and the actual variance. To get the stabilized >> variance, is the following R code correct? >> >> var(values_A)*resVarA > > > No, for several reasons: > > - assuming that values_A is a matrix, 'var' gives you a single value > computed over all values. You want 'rowVars' > > - resVarA/B was the ratio of per-gene variance estimate and fitted variance. > Both are estimates, neither is an 'actual' value in the sense of a > population value. You might be confusing things with the > 'varianze-stabilizing transformation', because there is no quantity in DESeq > that we call the 'stabilized variance'. Hence, I don't know what you are > actually looking for. The purpose is to plot the means and variances (of the normalized count per sample) of a few genes, so that people can easily see the changes or no changes in these genes. I understand that the variances of the normalized count per sample is not directly used in the statistical test (as in (14) of the GB paper, the variances of the sum of normalized count are used). I should have used 'fitted variance' if that is the terminology. Essentially, the fitted mean vs dispersion curve is fitted. Based on the mean of each gene (of course of the same condition), the fitted dispersion for that gene is computed. Using this fitted dispersion and the mean, the "fitted variance" is computed. Right? (Just to double check my understand, as I read this long time ago.) Given the following formula, shown in the vignette. v = s ? + ? s^2 ?^2 To make sure, v means the variance of the raw counts (i.e., (3) in the GB paper). So the fitted variance of the normalized count for a given condition (when the size factors are close to 1 (which is usual), the distribution of the normalized count (hence its mean and variance) of the same gene across samples in the same condition can be consider the same) should be just ? + ? ?^2? (If using maximum sharingMode, the variance will be the actual sample variance, if it is above the fitted value (roughly speaking), right?) > - You are using an outdated version of DESeq. In the current version, the > resVarA/B columns have been removed. See the new vignette for details and > reasons for this. It is hard for me find which paragraph/page you referred, would you please point it to me? > See also the description of the new 'fitInfo' function in the vignette, > which might give you what you want. The fitted mean dispersion function is now a parametric function. The vignette and ?estimateDispersions do not say why this is better than locfit. It seems that given tens of thousands of genes in a typical genome, locfit is more likely going to be better than parametric. Is there a rational why parametric becomes the default? Otherwise, to be backward compatible, I think that locfit should still be the default. > cds=makeExampleCountDataSet() > cds=estimateSizeFactors( cds ) > cds=estimateDispersions( cds ) > conditions(cds) A1 A2 B1 B2 B3 A A B B B Levels: A B > x$dispFunc function (q) coefs[1] + coefs[2]/q <environment: 0x10878b270=""> attr(,"coefficients") asymptDisp extraPois 0.2202342 0.6951945 attr(,"fitType") [1] "parametric" There are two conditions A and B, according to (3) in the GB paper, v_i,rho(j) should be specific to the condition rho. Why there is only one dispFunc? Should there be two such functions? Or you actually assumed v_i (without dependence on rho)? Also, since there is already a "maximum" option of for sharingMode (which, I think, was not available in the original version), why not add an additional option to allow shrinkage (as in usual Bayesian analysis)? -- Regards, Peng

ADD REPLY • link 12.7 years ago Peng Yu ▴ 940