Hi list,
Please ignore my last email with the two figure attachments; some
calculations were not right, sorry about this.
There actually is some overlap among the different ways of getting DE
genes with DESeq. I had a typo in my previous code:
filter=dat[rowSums(dat >= 16), ] --> filter=dat[rowSums(dat) >= 16, ],
which is why no overlap was found.
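To make the typo concrete: rowSums(dat >= 16) counts, per row, how many entries are at least 16 and then uses those counts as row indices, while rowSums(dat) >= 16 builds the intended logical mask on the row totals. The same distinction can be sketched in Python/numpy (illustration only, not the original R; the matrix values are made up):

```python
import numpy as np

# Toy count matrix: 4 genes x 2 samples (values are invented).
dat = np.array([[20,  5],
                [ 1,  2],
                [10,  9],
                [16, 30]])

# Buggy version (analogue of rowSums(dat >= 16)): counts how many
# entries in each row are >= 16, then uses those counts as ROW INDICES.
counts = (dat >= 16).sum(axis=1)   # integer counts per row
buggy = dat[counts]                # fancy-indexes rows by the counts: nonsense

# Intended version (analogue of rowSums(dat) >= 16): a boolean mask
# keeping rows whose total count is at least 16.
keep = dat.sum(axis=1) >= 16
filtered = dat[keep]
print(filtered)
```

The buggy form silently "works" in both languages because integer indexing is legal, which is why the error only showed up as a missing overlap.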
But I still couldn't get any DE genes after adjusting the p-values
with edgeR. I would expect some DE genes, not 0, and there should be
some overlap between edgeR's results and those of DESeq or similar
tools.
To follow up on Simon's suggestions:
>> > filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ]
>> Try instead something like:
>> filter=dat[rowSums(dat) >= 16, ]
>>
>> - How does your filter affect the variance functions? Do the plots
>> generated by 'scvPlot()' differ between the filtered and the unfiltered
>> data set?
>> - If so, are the hits that you get at expression strengths where the
>> variance functions differ? Are they at the low end, i.e., where the
>> filter made changes?
The scvPlots for the unfiltered and the filtered data are similar;
the differences are at the low end (at the beginning of the scvPlot).
>> - Have you tried what happens if you filter after estimating variance?
>> The raw p values should be the same as without filtering, but the
>> adjusted p values might get better.
Using DESeq, I tested three approaches; the overlap among the
resulting gene sets now looks normal.
(A) no filter
    genes: 51939
    DE genes (p-value < 0.1): 2385
    DE genes (adjusted p < 0.1): 106
(B) filtering before estimating variance (rowSums >= 16)
    genes after filtering: 20923
    DE genes (p-value < 0.1): 2403
    DE genes (adjusted p < 0.1): 197 (105 of (A)'s 106 genes among them)
(C) filtering and adjusting after estimating variance (rowSums >= 16)
    genes after filtering: 20923
    DE genes (p-value < 0.1): 2247
    DE genes (adjusted p < 0.1): 139 (106 shared with (A), 138 with (B))
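The jump from 106 adjusted hits in (A) to 139 in (C) is what Simon predicted: filtering after the variance step leaves the raw p-values untouched, but the Benjamini-Hochberg correction scales each p-value by roughly m/rank, and filtering shrinks m. A minimal sketch of that effect in Python (the p-values below are invented, and bh_adjust is a hand-rolled stand-in for R's p.adjust(..., method = "BH"), not DESeq code):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up, with monotonicity)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

# 2 "interesting" genes among many uninteresting ones (made-up numbers).
pvals = [0.0004, 0.0008] + [0.2 + 0.01 * k for k in range(60)]

all_adj = bh_adjust(pvals)            # adjust over all 62 tests
filtered_adj = bh_adjust(pvals[:10])  # adjust over only 10 kept genes

# Same raw p-value, smaller adjusted p-value after filtering,
# simply because fewer hypotheses are being corrected for.
print(all_adj[0], filtered_adj[0])
```

The design point: the filter must not change which raw p-values the surviving genes get, only how many genes share the multiple-testing burden.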
Using edgeR, I tested both the common dispersion and the tagwise
dispersion it provides, but no DE genes were found after adjusting the
p-values.
(D) common dispersion
    genes: 51939
    DE genes (p-value < 0.1): 1771
    DE genes (adjusted p < 0.1): 0
(E) tagwise dispersion
    genes: 51939
    DE genes (p-value < 0.1): 880 (88.5% overlap with (D))
    DE genes (adjusted p < 0.1): 0
(F) filtering before exactTest (rowSums >= 16, tagwise dispersion)
    genes after filtering: 20923
    DE genes (p-value < 0.1): 1771
    DE genes (adjusted p < 0.1): 0
>> > 2. For EdgeR
>>
>> DESeq and edgeR are sufficiently similar that any correct answer
>> regarding filtering should apply to both.
>>
>> > 2) I got 800 DE genes with p.value < 0.1, but got 0 DE genes after
>> > adjusting p.value; is this possible? Then, can I use the *unadjusted*
>> > p.value to get DE genes?
>> > To adjust pvalue, I used:
>> > nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>>
>> Of course, this is possible. (Read up on the "multiple hypothesis
>> testing problem" if this is unclear to you.) Note also, though, that you
>> used an FDR of .1 in your DESeq code but of .05 here.
Actually, I tried both 0.1 and 0.05, as well as the other adjustment
methods provided in edgeR, and always got 0 DE genes after adjustment.
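Simon's point about multiple testing can be made concrete: if the p-values are roughly uniform (i.e. the data look like pure noise), about 10% of them fall below 0.1 by chance alone, yet none survives the BH correction; that is exactly the "many raw hits, zero adjusted hits" pattern. A Python sketch with idealised synthetic p-values (bh_reject is a hypothetical helper, not edgeR code):

```python
def bh_reject(pvals, alpha):
    """Number of Benjamini-Hochberg discoveries at FDR level alpha."""
    m = len(pvals)
    best = 0
    # Find the largest rank k with p_(k) <= (k/m) * alpha; reject that many.
    for k, p in enumerate(sorted(pvals), start=1):
        if p <= k / m * alpha:
            best = k
    return best

m = 10000
# Evenly spaced "p-values": an idealised uniform null distribution.
pvals = [(i + 0.5) / m for i in range(m)]

raw_hits = sum(p < 0.1 for p in pvals)   # raw "hits" at p < 0.1
print(raw_hits, bh_reject(pvals, 0.1))   # -> 1000 0
```

So 0 adjusted hits does not mean the adjustment is broken; it means the raw hits are no more numerous than chance would produce, and unadjusted p-values should not be used to call DE genes.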
Best,
Xiaohui
>> On May/21/11 5:16 PM, Wolfgang wrote:
>> Hi Martin,
>>
>> this is a very good question, and it needs more investigation. I'll have
>> a closer look into this and report back in this place. Two comments
>> though already:
>>
>> - That the dispersion parameter is estimated from the data need not by
>> itself be problematic. (It is not problematic in the microarray case
>> that the t-test variances are estimated from the data.)
>>
>> - The situation with limma is different: there, the problem is that
>> limma's Bayesian model, which assumes that gene-level error variances
>> σ²_1, ..., σ²_m follow a scaled inverse χ² distribution, no longer
>> fits the data when the data are filtered for genes with low overall
>> variance. Due to the way that limma is implemented, the posterior
>> degrees-of-freedom estimate of that distribution then becomes infinite,
>> gene-level variance estimates will be ignored (leading to an unintended
>> analysis based on fold change only), and type I error rate control is
>> lost. See Fig. 2B+C in the paper, and the text on p.3.
>>
>> So, what we need to do with DESeq is
>> - simulate data according to the null model
>> - see if & how filtering affects the estimated mean-dispersion relationship
>> - see if & how it affects the type I error globally and locally (for
>> particular ranges of the mean).
>>
>> Another point is how much the filtering improves power - that will be
>> related to how many genes can actually be removed by the filtering step,
>> which depends on the data.
>>
>> Best wishes
>> Wolfgang
>>
>>
>>
>> On May/21/11 4:36 PM, Martin Morgan wrote:
>> > On 05/21/2011 07:07 AM, Wolfgang Huber wrote:
>> > > Hi Xiaohui
>> > >
>> > > to follow up on the filtering question:
>> > >
>> > > - the filter that Xiaohui applied is invalid; it will distort the
>> > > null-distribution of the test statistic and lead to invalid p-values.
>> > > This might explain the discrepancy.
>> > >
>> > > - the filter that Simon suggested is OK and should provide better
>> > > results.
>> > >
>> > > - I'd also be keen to hear about your experience with this.
>> > >
>> > > A valid filtering criterion does not change the null distribution of
>> > > the subsequently applied test statistic (it can, and in fact should,
>> > > change the alternative distribution(s)). In practice, this means
>> > > choosing a filter criterion that is statistically independent, under
>> > > the null, from the test statistic, and in particular, that it does not
>> > > use the class labels. Details in the below-cited PNAS paper.
>> >
>> > Was wondering whether, since the dispersion parameter is estimated
>> > from the data, in some strict sense the filtering and testing
>> > procedures are not independent under the null anyway? For the same
>> > reason that one would not want to use a variance filter before a
>> > limma-style analysis, if I understand correctly.
>> >
>> > Martin
>> >
>> > >
>> > > Best wishes
>> > > Wolfgang
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On May/21/11 11:02 AM, Simon Anders wrote:
>> > > > Hi Xiaohui
>> > > >
>> > > > I agree that it is worrying to get such different results from your
>> > > > two approaches of using DESeq. Here are a few suggestions on how you
>> > > > might investigate this (and I'd be eager to hear about your findings):
>> > > >
>> > > > - Bourgon et al. (PNAS, 2010, 107:9546) have studied how
>> > > > pre-filtering affects the validity and power of a test. They stress
>> > > > that it is important that the filter is blind to the sample labels
>> > > > (actually: even permutation invariant). So what you do here is not
>> > > > statistically sound:
>> > > >
>> > > > > filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ]
>> > > >
>> > > > Try instead something like:
>> > > >
>> > > > filter=dat[rowSums(dat) >= 16, ]
>> > > >
>> > > > - How does your filter affect the variance functions? Do the plots
>> > > > generated by 'scvPlot()' differ between the filtered and the
>> > > > unfiltered data set?
>> > > >
>> > > > - If so, are the hits that you get at expression strengths where the
>> > > > variance functions differ? Are they at the low end, i.e., where the
>> > > > filter made changes?
>> > > >
>> > > > - Have you tried what happens if you filter after estimating
>> > > > variance? The raw p values should be the same as without filtering,
>> > > > but the adjusted p values might get better.
>> > > >
>> > > > To be honest, I'm currently a bit at a loss which one is more
>> > > > correct: filtering before or after variance estimation. Let's hear
>> > > > what other people on the list think.
>> > > >
>> > > > > 2. For EdgeR
>> > > >
>> > > > DESeq and edgeR are sufficiently similar that any correct answer
>> > > > regarding filtering should apply to both.
>> > > >
>> > > > > 2) I got 800 DE genes with p.value < 0.1, but got 0 DE genes
>> > > > > after adjusting p.value; is this possible? Then, can I use the
>> > > > > *unadjusted* p.value to get DE genes?
>> > > > > To adjust pvalue, I used:
>> > > > > nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>> > > >
>> > > > Of course, this is possible. (Read up on the "multiple hypothesis
>> > > > testing problem" if this is unclear to you.) Note also, though, that
>> > > > you used an FDR of .1 in your DESeq code but of .05 here.
>> > > >
>> > > > Simon
>> > > >