Xiaohui Wu @xiaohui-wu-4141
Thanks, Simon and Wolfgang.
To follow up on Simon's suggestions:
> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
> Try instead something like:
> filter=dat[rowSums(dat) >= 16, ]
>
> - How does your filter affect the variance functions? Do the plots
> generated by 'scvPlot()' differ between the filtered and the unfiltered
> data set?
I tried both filter=dat[rowSums(dat) >= 16, ] and
filter=dat[rowSums(dat) >= 8, ]; the scvPlots for the cutoffs 8 and 16
are almost the same, but the plots with and without filtering are
different. The two plots are attached. To me they look different, but I
don't know what exactly the differences are; could you look into the plots?
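For reference, this is roughly how I compared the two variance functions (a self-contained sketch on simulated counts; it assumes the legacy DESeq package whose newCountDataSet()/estimateVarianceFunctions()/scvPlot() calls are used elsewhere in this thread — the real analysis uses my own `dat` and `group` instead of the simulated ones):

```r
library(DESeq)  # legacy DESeq, as used in this thread

# simulated stand-in for the real count matrix and condition factor
set.seed(1)
dat   <- matrix(rnbinom(8000, mu = 10, size = 1), ncol = 4)
group <- factor(c("A", "A", "B", "B"))

# fit size factors and variance functions on a given count matrix
fitCDS <- function(counts) {
  cds <- newCountDataSet(counts, group)
  cds <- estimateSizeFactors(cds)
  estimateVarianceFunctions(cds)
}

# side-by-side SCV plots: unfiltered vs. label-blind filter
par(mfrow = c(1, 2))
scvPlot(fitCDS(dat));                       title("unfiltered")
scvPlot(fitCDS(dat[rowSums(dat) >= 16, ])); title("filtered")
```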
> - If so, are the hits that you get at expression strengths where the
> variance functions differ? Are they at the low end, i.e., where the
> filter made changes?
Sorry, I don't know how to find out where the variance functions
change. Can we see this from the scvPlots?
> - Have you tried what happens if you filter after estimating variance?
> The raw p values should be the same as without filtering, but the
> adjusted p values might get better.
How do I filter the CountDataSet? Or could I just adjust the p values
(using some adjustment function?) on the unfiltered results of
nbinomTest()?
d <- newCountDataSet( dat, group )
d <- estimateSizeFactors( d )
sf <- sizeFactors( d )
d <- estimateVarianceFunctions( d )
... Then, how could I get the CountDataSet with filtered counts and
variance?
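One way to realize "filter after estimating variance" without subsetting the CountDataSet at all might be to run nbinomTest() on the full object and only then subset the result table and re-run p.adjust() on the surviving raw p-values. A sketch under that assumption, with simulated counts standing in for the real `dat` and `group` (it assumes legacy DESeq's nbinomTest() returns `pval`/`padj` columns in the row order of the count matrix):

```r
library(DESeq)  # legacy DESeq, as used in this thread

# simulated stand-in for the real count matrix and condition factor
set.seed(1)
dat   <- matrix(rnbinom(8000, mu = 10, size = 1), ncol = 4,
                dimnames = list(paste0("gene", 1:2000), NULL))
group <- factor(c("A", "A", "B", "B"))

d <- newCountDataSet(dat, group)
d <- estimateSizeFactors(d)
d <- estimateVarianceFunctions(d)   # variance estimated on ALL genes

res <- nbinomTest(d, "A", "B")      # raw p-values for all genes

# filter afterwards (label-blind), then re-adjust only the kept genes
keep          <- rowSums(dat) >= 16
res.filt      <- res[keep, ]
res.filt$padj <- p.adjust(res.filt$pval, method = "BH")
```

Because fewer hypotheses enter p.adjust(), the BH-adjusted values of the kept genes can only get smaller or stay the same, which is the power gain Simon describes.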
> > 2. For EdgeR
>
> DESeq and edgeR are sufficiently similar that any correct answer
> regarding filtering should apply to both.
>
> > 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
> > adjusting p.value, is this possible? Then, can I use the *unadjusted*
> > p.value to get DE genes?
> > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH")< 0.05)
>
> Of course, this is possible. (Read up on the "multiple hypothesis
> testing problem" if this is unclear to you.) Note also, though, that you
> used an FDR of .1 in your DESeq code but of .05 here.
Actually, I tried 0.1 and 0.05, and the other adjustment methods
provided in edgeR; I always get 0 DE genes after adjustment.
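That many genes at raw p < 0.1 can collapse to zero after adjustment is easy to reproduce with base R's p.adjust(). For instance, p-values that are individually small but spread out almost uniformly all receive the same large BH-adjusted value:

```r
# 800 raw p-values, a fair number individually below 0.1 ...
p <- (1:800) / 1000            # 0.001, 0.002, ..., 0.800
sum(p < 0.1)                   # 99 genes with raw p < 0.1

# ... but Benjamini-Hochberg adjusts every one of them to 0.8,
# because p_(i) * m / i = (i/1000) * (800/i) = 0.8 for all i
p.bh <- p.adjust(p, method = "BH")
sum(p.bh < 0.05)               # 0 genes survive
```

This is exactly the pattern expected when the p-value distribution is close to uniform, i.e. when there is little signal relative to the number of tests.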
> On May/21/11 5:16 PM, Wolfgang wrote:
> Hi Martin,
>
> this is a very good question, and it needs more investigation. I'll have
> a closer look into this and report back in this place. Two comments
> though already:
>
> - That the dispersion parameter is estimated from the data need not by
> itself be problematic. (It is not problematic in the microarray case
> that the t-test variances are estimated from the data.)
>
> - The situation with limma is different: there, the problem is that
> limma's Bayesian model, which assumes that gene-level error variances
> σ²_1, ..., σ²_m follow a scaled inverse χ² distribution, no longer
> fits the data when the data are filtered for genes with low overall
> variance. Due to the way that limma is implemented, the posterior
> degrees-of-freedom estimate of that distribution then becomes infinite,
> gene-level variance estimates will be ignored (leading to an unintended
> analysis based on fold change only), and type I error rate control is
> lost. See Fig. 2B+C in the paper, and the text on p.3.
>
> So, what we need to do with DESeq is
> - simulate data according to the null model
> - see if & how filtering affects the estimated mean-dispersion relationship
> - see if & how it affects the type I error globally and locally (for
> particular ranges of the mean).
>
> Another point is how much the filtering improves power - that will be
> related to how many genes can actually be removed by the filtering step,
> which depends on the data.
>
> Best wishes
> Wolfgang
>
>
>
> On May/21/11 4:36 PM, Martin Morgan wrote:
>> On 05/21/2011 07:07 AM, Wolfgang Huber wrote:
>>> Hi Xiaohui
>>>
>>> to follow up on the filtering question:
>>>
>>> - the filter that Xiaohui applied is invalid, it will distort the
>>> null-distribution of the test statistic and lead to invalid p-values.
>>> This might explain the discrepancy.
>>>
>>> - the filter that Simon suggested is OK and should provide better
>>> results.
>>>
>>> - I'd also be keen to hear about your experience with this.
>>>
>>> A valid filtering criterion does not change the null distribution of the
>>> subsequently applied test statistic (it can, and in fact should, change
>>> the alternative distribution(s)). In practice, this means choosing a
>>> filter criterion that is statistically independent, under the null, from
>>> the test statistic, and in particular, that it does not use the class
>>> labels. Details in the below-cited PNAS paper.
>>
>> Was wondering whether, since the dispersion parameter is estimated from
>> the data, in some strict sense the filtering and testing procedures are
>> not independent under the null anyway? For the same reason that one
>> would not want to use a variance filter before a limma-style analysis,
>> if I understand correctly.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>>
>>>
>>> On May/21/11 11:02 AM, Simon Anders wrote:
>>>> Hi Xiaohui
>>>>
>>>> I agree that it is worrying to get such different results from your two
>>>> approaches of using DESeq. Here are a few suggestions how you might
>>>> investigate this (and I'd be eager to hear about your findings):
>>>>
>>>> - Bourgon et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
>>>> affects the validity and power of a test. They stress that it is
>>>> important that the filter is blind to the sample labels (actually: even
>>>> permutation invariant). So what you do here is not statistically sound:
>>>>
>>>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>>>>
>>>> Try instead something like:
>>>>
>>>> filter=dat[rowSums(dat) >= 16, ]
>>>>
>>>> - How does your filter affect the variance functions? Do the plots
>>>> generated by 'scvPlot()' differ between the filtered and the unfiltered
>>>> data set?
>>>>
>>>> - If so, are the hits that you get at expression strengths where the
>>>> variance functions differ? Are they at the low end, i.e., where the
>>>> filter made changes?
>>>>
>>>> - Have you tried what happens if you filter after estimating variance?
>>>> The raw p values should be the same as without filtering, but the
>>>> adjusted p values might get better.
>>>>
>>>> To be honest, I'm currently a bit at a loss which one is more correct:
>>>> filtering before or after variance estimation. Let's hear what other
>>>> people on the list think.
>>>>
>>>>> 2. For EdgeR
>>>>
>>>> DESeq and edgeR are sufficiently similar that any correct answer
>>>> regarding filtering should apply to both.
>>>>
>>>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>>>>> adjusting p.value, is this possible? Then, can I use the *unadjusted*
>>>>> p.value to get DE genes?
>>>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
>>>>> "BH")< 0.05)
>>>>
>>>> Of course, this is possible. (Read up on the "multiple hypothesis
>>>> testing problem" if this is unclear to you.) Note also, though, that you
>>>> used an FDR of .1 in your DESeq code but of .05 here.
>>>>
>>>> Simon
>>>>