I have been using DESeq (and edgeR) for differential expression analysis.
In my current dataset I compare 3 biological replicates of a control vs.
3 biological replicates of a mutant.
The 4 top genes by adjusted p-value, in both DESeq and edgeR, have a
very high variance.
(The reason is that these genes are located on chrY and only one
replicate of the mutant was male.)
My question is: how can genes with such a high variance in their
counts yield such small p-values?
Is there any way to avoid this? I think these are false positives.
Attached you can find the combined result table of DESeq and edgeR for
the top 100 genes.
The problem occurs for the first 4 genes. The raw counts are given in
columns P-U (P-R: Mutant, T-U: Control).
Dear Steffen
On 2011-12-02 13:53, Steffen Priebe wrote:
> I have been using DESeq (and edgeR) for differential expression
> analysis.
> In my current dataset I compare 3 biological replicates of a control
> vs. 3 biological replicates of a mutant.
> The 4 top genes by adjusted p-value, in both DESeq and edgeR, have a
> very high variance.
> (The reason is that these genes are located on chrY and only one
> replicate of the mutant was male.)
>
> My question is: how can genes with such a high variance in their
> counts yield such small p-values?
> Is there any way to avoid this? I think these are false positives.
>
> Attached you can find the combined result table of DESeq and edgeR
> for the top 100 genes.
> The problem occurs for the first 4 genes. The raw counts are given
> in columns P-U (P-R: Mutant, T-U: Control).
Short answer: I suppose you used version 1.4.x of DESeq. In the new
release (DESeq version 1.6.x), we made some major changes, which should
cause the problem to disappear.
Longer answer: In the old version, the data frame returned by
'nbinomTest' contained, next to the p values, two vectors of "variance
residuals", labeled "resVarA" and "resVarB". The vignette explained that
p values should be considered unreliable if the variance residuals were
too large, and advised disregarding such hits. Your Y chromosome genes
certainly had such large values in resVarA or resVarB, and you should
have removed them because of this.
These variance residuals are the ratio of the per-gene estimate of the
variance (which is very imprecise with few samples) to the fitted
value obtained by sharing data across genes (which is stable but may be
misleading for genes that behave very differently from other genes of
similar expression range). Previously, we used only the fitted
dispersion values for the test and left it to the user to filter out
those hits for which the two values disagreed too much. Many users
overlooked the need for this last step; others found the solution
unsatisfactory, as it turned out to be hard to advise on a good
threshold for filtering on variance residuals.
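As a rough illustration of the ratio just described (a sketch in Python, not DESeq code): the fitted mean-variance relation, the dispersion value 0.1, the counts, and the threshold are all made-up assumptions, chosen only to show why a gene expressed in a single replicate gets a large variance residual.

```python
from statistics import mean, variance


def fitted_variance(mu, alpha=0.1):
    """Hypothetical fitted relation: negative binomial variance mu + alpha * mu^2."""
    return mu + alpha * mu * mu


def variance_residual(counts, alpha=0.1):
    """Per-gene sample variance divided by the fitted variance at the same mean."""
    mu = mean(counts)
    if mu == 0:
        return 0.0  # no counts at all: nothing to compare
    return variance(counts) / fitted_variance(mu, alpha)


# Made-up normalized counts for the three mutant replicates.
mutant_counts = {
    "autosomal_gene": [100, 110, 90],
    "chrY_gene": [0, 0, 300],  # expressed only in the single male replicate
}

residuals = {gene: variance_residual(c) for gene, c in mutant_counts.items()}

THRESHOLD = 10.0  # arbitrary; the text notes a good threshold is hard to advise
flagged = [gene for gene, r in residuals.items() if r > THRESHOLD]
print(residuals, flagged)
```

With these numbers the autosomal gene has a residual well below 1, while the chrY-like gene has a residual far above it, which is the kind of disagreement the old workflow expected the user to filter on.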
The new version solves the issue with a pragmatic and simple approach
that works surprisingly well: DESeq now simply uses the maximum of the
two values (the per-gene estimate and the fitted value). See the updated
vignette for more details on this topic. This costs power but avoids the
need for filtering. In our experience, the power cost is surprisingly
low for typical data sets, which, in our view, justifies the use of such
a simple method, at least for now.
You can switch back to the old behaviour, using the 'sharingMode'
argument to the 'estimateDispersions' function. This can be useful to
see how this 'maximum rule' influences your result.
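The effect of the 'maximum rule' can be sketched numerically. This is an illustration in Python under assumed numbers, not the DESeq implementation: the function names, the mode labels, and the dispersion values are hypothetical, and the variance formula is the usual negative binomial form mu + alpha * mu^2.

```python
def dispersion_for_test(per_gene, fitted, sharing_mode="maximum"):
    """Pick the dispersion used in the test (mode labels are illustrative)."""
    if sharing_mode == "maximum":
        return max(per_gene, fitted)  # conservative: never below the fit
    if sharing_mode == "fit-only":
        return fitted  # old behaviour: ignore the per-gene estimate
    raise ValueError(f"unknown sharing mode: {sharing_mode}")


def nb_variance(mu, alpha):
    """Negative binomial variance for mean mu and dispersion alpha."""
    return mu + alpha * mu * mu


# Made-up values: a gene whose per-gene dispersion is far above the fit,
# as would happen for a chrY gene expressed in one replicate only.
fitted, per_gene, mu = 0.1, 2.9, 100.0

var_old = nb_variance(mu, dispersion_for_test(per_gene, fitted, "fit-only"))
var_new = nb_variance(mu, dispersion_for_test(per_gene, fitted, "maximum"))
print(var_old, var_new)
```

Under the fit-only choice the test assumes a much smaller variance than the gene actually shows, which is how a small p value can arise; under the maximum rule the larger per-gene value is used instead. For a gene whose per-gene estimate lies below the fit, both choices give the fitted value.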
edgeR, with its empirical Bayes approach (implemented in its function
'estimateTagwiseDisp'), should typically give p values somewhere
between DESeq's results with the 'maximum' and the 'fit-only' sharing
modes. However, at least in your case, edgeR seems to have stayed too
close to the fitted values (or: to the 'common dispersion', in edgeR's
terminology), as you wrote that it also gave you p values for your
high-variance genes that you considered implausibly low.
Simon
Dear Steffen,
> Date: Fri, 02 Dec 2011 13:53:42 +0100
> From: "Steffen Priebe" <steffen.priebe at hki-jena.de>
> To: <bioconductor at r-project.org>
> Subject: [BioC] DESeq variance question
>
> I have been using DESeq (and edgeR) for differential expression
> analysis. In my current dataset I compare 3 biological replicates of
> a control vs. 3 biological replicates of a mutant. The 4 top genes by
> adjusted p-value, in both DESeq and edgeR, have a very high variance.
> (The reason is that these genes are located on chrY and only one
> replicate of the mutant was male.)
Replicates should be representative of the same population, so I would
remove the male mutant from the experiment, or else remove all X and Y
chromosome genes from the analysis. In our in-house analyses, we have
tended to do the latter when faced with your situation.
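The second option can be sketched as follows. This is a minimal Python illustration, not part of any analysis package: the gene names, counts, and the gene-to-chromosome mapping are made up, and in practice the mapping would come from your annotation.

```python
SEX_CHROMOSOMES = {"chrX", "chrY"}


def drop_sex_chromosome_genes(counts, gene_chromosome):
    """Return a count table without genes annotated to chrX or chrY.

    Genes missing from the annotation are kept.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if gene_chromosome.get(gene) not in SEX_CHROMOSOMES
    }


# Hypothetical count table: three mutant columns, then three control columns.
counts = {
    "geneA": [100, 110, 90, 50, 55, 45],
    "geneY1": [0, 0, 300, 0, 0, 0],
    "geneX1": [20, 22, 40, 21, 19, 20],
}
gene_chromosome = {"geneA": "chr2", "geneY1": "chrY", "geneX1": "chrX"}

filtered = drop_sex_chromosome_genes(counts, gene_chromosome)
print(sorted(filtered))
```

The filtered table would then be passed to the differential expression analysis in place of the full one.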
More generally, this is exactly the issue that tagwise dispersion
estimation in edgeR is intended to combat. In our experience, filtering
so that genes are expressed in at least three libraries (for a 3 vs 3
study) and passing a reasonably low prior.n to estimateTagwiseDisp()
will give a satisfactory topTags gene list.
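The expression filter mentioned above might look like this. It is a Python sketch rather than edgeR code, and it assumes that "expressed" means a count of at least `min_count` (a hypothetical parameter, default 1); the counts are made up.

```python
def expressed_in_at_least(row, min_libraries=3, min_count=1):
    """True if the gene reaches min_count in at least min_libraries libraries."""
    return sum(1 for c in row if c >= min_count) >= min_libraries


# Made-up counts: three mutant libraries followed by three control libraries.
counts = {
    "geneA": [100, 110, 90, 50, 55, 45],
    "geneY1": [0, 0, 300, 0, 0, 0],  # seen only in the one male replicate
    "geneB": [3, 0, 5, 0, 4, 0],
}

kept = {gene: row for gene, row in counts.items() if expressed_in_at_least(row)}
print(sorted(kept))
```

A gene seen in only one library, like the chrY case described in the question, does not pass this filter and so never reaches the test.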
You don't say whether you used tagwise dispersion estimation.
> My question is: how can genes with such a high variance in their
> counts yield such small p-values? Is there any way to avoid this? I
> think these are false positives.
>
> Attached you can find the combined result table of DESeq and edgeR
> for the top 100 genes. The problem occurs for the first 4 genes. The
> raw counts are given in columns P-U (P-R: Mutant, T-U: Control).
Note that we have not seen your attachments, which are removed by the
list server. Nor do we know what version of the software you are using.
If you post again, please give the output of sessionInfo() and the code
for your edgeR analysis.
Best wishes
Gordon
Dear Simon and Steffen,
> Date: Sat, 03 Dec 2011 20:36:10 +0100
> From: Simon Anders <anders at embl.de>
> To: Steffen Priebe <steffen.priebe at hki-jena.de>,
>     bioconductor at r-project.org
> Subject: Re: [BioC] DESeq variance question
>
> Dear Steffen
>
> On 2011-12-02 13:53, Steffen Priebe wrote:
>> I have been using DESeq (and edgeR) for differential expression
>> analysis. In my current dataset I compare 3 biological replicates of
>> a control vs. 3 biological replicates of a mutant. The 4 top genes by
>> adjusted p-value, in both DESeq and edgeR, have a very high variance.
>> (The reason is that these genes are located on chrY and only one
>> replicate of the mutant was male.)
>>
>> My question is: how can genes with such a high variance in their
>> counts yield such small p-values? Is there any way to avoid this? I
>> think these are false positives.
>>
>> Attached you can find the combined result table of DESeq and edgeR
>> for the top 100 genes. The problem occurs for the first 4 genes. The
>> raw counts are given in columns P-U (P-R: Mutant, T-U: Control).
...
> edgeR, with its empirical Bayes approach (implemented in its function
> 'estimateTagwiseDisp'), should typically give p values somewhere
> between DESeq's results with the 'maximum' and the 'fit-only' sharing
> modes. However, at least in your case, edgeR seems to have stayed too
> close to the fitted values (or: to the 'common dispersion', in edgeR's
> terminology)
Common dispersion is not edgeR terminology for DESeq's "fitted values",
and (in the current Bioconductor release) edgeR moderates towards a
local prior rather than towards the common dispersion. By default, edgeR
does not fit any model to the dispersion, hence does not have fitted
values. Instead it uses a prior based on locally weighted likelihood.
> as you wrote it also gave you p values for your high-variance genes
> that you considered implausibly low.
>
> Simon
We don't actually know whether tagwise dispersion was used in the edgeR
analysis, nor have we seen the gene list (at least I haven't). Without
knowing either the analysis or the results, it would seem premature to
draw conclusions about the behaviour of estimateTagwiseDisp.
Best wishes
Gordon