Entering edit mode
Ken recently posted a question to the list regarding computation of t-
and
F-statistics (and corresponding unadjusted p-values) in multtest. Here
are
some suggestions.
Ken's data set consists of the following:
m: a 22277x6 matrix of probeset expressions
m.cl=c(0,0,0,1,1,1): the corresponding class label
m_pt: the 22277x6 matrix m with columns rearranged for paired tests
m_pt.cl=c(0,1,0,1,0,1): the corresponding class label for paired tests
(presumably m_pt=m[,c(1,4,2,5,3,6)])
He is doing two sample tests, but wants to run the F tests (typically
used to compare k>2 groups) for practice.
Ken computes the test statistics for each possible test available
using
the function mt.teststat: regular and rank-based (nonpara argument)
for
each value of the test argument.
teststat_t_para <- mt.teststat(m,m.cl,test="t",nonpara="n")
teststat_t_nonpara <- mt.teststat(m,m.cl,test="t",nonpara="y")
teststat_te_para <- mt.teststat(m,m.cl,test="t.equalvar",nonpara="n")
teststat_te_nonpara <-
mt.teststat(m,m.cl,test="t.equalvar",nonpara="y")
teststat_w_para <- mt.teststat(m,m.cl,test="wilcoxon",nonpara="n")
teststat_w_nonpara <- mt.teststat(m,m.cl,test="wilcoxon",nonpara="y")
teststat_f_para <- mt.teststat(m,m.cl,test="f",nonpara="n")
teststat_f_nonpara <- mt.teststat(m,m.cl,test="f",nonpara="y")
teststat_pt_para <- mt.teststat(m_pt,m_pt.cl,test="pairt",nonpara="n")
teststat_pt_nonpara <-
mt.teststat(m_pt,m_pt.cl,test="pairt",nonpara="y")
teststat_bf_para <-
mt.teststat(m_pt,m_pt.cl,test="blockf",nonpara="n")
teststat_bf_nonpara <-
mt.teststat(m_pt,m_pt.cl,test="blockf",nonpara="y")
These look OK, except that for the "blockf" tests, Ken might want to
use m
and m_bf.cl=c(0,1,2,0,1,2), since presumably he is thinking about the
two groups as two blocks (each with 3 observations). As written, with
m_pt, the "blockf" test corresponds to looking at three blocks each
with
two observations. Note that in the way Ken has this now, the square of
the
paired t-statistic is equal to the block F-statistic (for three blocks
of
two observations).
Next, Ken wants to use tabled distributions (with the functions pt,
pwilcox, and pf) to compute marginal, unadjusted p-values for each
gene
based on the above test statistics. Assuming two-tailed tests, he
proposes the following (which isn't quite right):
rawp0_teststat_t_para <- 2 * (1 - pt(abs(teststat_t_para), df =5))
rawp0_teststat_t_nonpara <- 2 * (1 - pt(abs(teststat_t_nonpara)),
df=5)
rawp0_teststat_te_para <- 2 * (1 - pt(abs(teststat_te_para)), df=5)
rawp0_teststat_te_nonpara <- 2 * (1 - pt(abs(teststat_te_nonpara)),
df=5)
rawp0_teststat_w_para <- 2 * (1 - pwilcox(abs(teststat_w_para)), df=5)
rawp0_teststat_w_nonpara <- 2 * (1 - pwilcox(abs(teststat_w_nonpara)),
df=5)
rawp0_teststat_f_para <- 2 * (1 - pf(abs(teststat_f_para)), df=5)
rawp0_teststat_f_nonpara <- 2 * (1 - pf(abs(teststat_f_nonpara)),
df=5)
rawp0_teststat_pt_para <- 2 * (1 - pt(abs(teststat_pt_para)), df=5)
rawp0_teststat_pt_nonpara <- 2 * (1 - pt(abs(teststat_pt_nonpara)),
df=5)
rawp0_teststat_bf_para <- 2 * (1 - pf(abs(teststat_bf_para)), df=5)
rawp0_teststat_bf_nonpara <- 2 * (1 - pf(abs(teststat_bf_nonpara)),
df=5)
Here are some suggetions:
1. Typo: For the nonpara ones, the parens are in the wrong place. They
should be the same as for the para ones, eg:
rawp0_teststat_t_nonpara <- 2 * (1 - pt(abs(teststat_t_nonpara),
df=5))
2. Degrees of freedom: For t-tests with equal variance (t.equalvar),
df=n1+n2-2=4. For unequal variances, the df formula is more
complicated
(Welch - or Satterthwaite - approximation). For the paired t-test,
df=2.
For F-tests, you need two dfs (see ? pf).
For pwilcox, you just give sample sizes in each group, not df (see ?
pwilcox).
3. You don't need abs() for F-tests: the statistics are always
positive
since they are sums of squares. (It doesn't hurt though).
With the correct rawp0 values in hand, Ken could use the function
mt.rawp2adjp to adjust for multiple comparisons.
Let me add that while this is a good exercise in understanding
marginal
p-values, the multtest package is designed to compute these without
using
these tabled distributions (and making the corresponding assumptions).
The
functions MTP and mt.minP (or mt.maxT) provide p-values (raw and
adjusted) through bootstrap and permutation resampling. These p-values
account for correlations between genes and do not make assumptions
about
the parametric forms of the distributions for each test statistic.
Hence,
one does not have to guess degrees of freedom or distributions. The
imput
is simply the data and class labels.
Best,
Katie