Entering edit mode
Mark W Kimpel
▴
830
@mark-w-kimpel-2027
Last seen 10.2 years ago
I'm confused about how the J-G statistic of Jiang and Gentleman (2007)
is
calculated. To check the results I was getting, I calculated the
statistic
with my own code and came up with a very different answer to that
obtained
using gsealmPerm (which uses GSNormalize). Before proceeding to my
code, I
also note that it looks like the statistic is calculated slightly
differently in the paper of Oron, et al (2008) and in the GSEAlm
vignette.
The former does not subtract t statistics from the mean t stat of the
entire
dataset (t.ref) but the latter does. Am I interpreting that correctly
and,
if so, why is it presented differently in the 2 sources?
In the code below, which is a self-contained example, I construct an
artificial ExpressionSet with a fold change difference between groups
A and
B of phenotype y in probesets corresponding to a GeneSet. I then
calculate
the statistic using gsealmPerm and then using parametric assumptions
(which
should hold considering how the dataset was constructed) using my own
code.
As can be seen, very different answers are obtained.
What am I misunderstanding?
Here is the code followed by my output and sessionInfo(). First come a
few
necessary functions with a hack of mine to output the
observedStatistic from
gsealmPerm:
###################################################
gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm =
TRUE,
...)
{
nSamp = ncol(eSet)
if (formula == "") {
nvar <- 0
} else {
xvarnames <- all.vars(formula)
nvar <- length(xvarnames)
}
obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm)
if (nvar > 0) {
observedStats = GSNormalize(obsRaw$tstat[2, ], incidence =
mat,
...)
}
else {
observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat,
fun2 = identity, ...)
}
perm.eset = eSet
i <- 1L
if (nvar > 0) {
permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm)
while (i < (nperm + 1)) {
if (nvar >= 2) {
splitter = pData(eSet)[, xvarnames[2]]
if (nvar > 2)
splitter = as.list(pData(eSet)[, xvarnames[2:nvar]])
label.perm = unsplit(lapply(split(1:nSamp, splitter),
sample), splitter)
pData(perm.eset)[, xvarnames[1]] <-
pData(eSet)[label.perm,
xvarnames[1]]
}
else if (nvar == 1) {
pData(perm.eset)[, xvarnames[1]] <-
pData(eSet)[sample(1:nSamp),
xvarnames[1]]
}
temp.results <- lmPerGene(eSet = perm.eset, formula =
formula,
na.rm = na.rm)
permMat[, i] <- temp.results$tstat[2, ]
i <- i + 1L
}
permMat <- GSNormalize(permMat, incidence = mat, ...)
rownames(permMat) = rownames(mat)
}
else if (nvar == 0) {
permMat <- matrix(0, nrow = nrow(mat), ncol = nperm)
rownames(permMat) = rownames(mat)
for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat),
incidence = mat[, sample(1:ncol(mat))], fun2 = identity,
...)
}
output <- data.frame(pvalFromPermMat(observedStats, permMat),
observedStats)
}
#####################################3
pvalFromPermMat <- function(obs, perms) { #needed because this is an
internal function of the GSEAlm package
N <- ncol(perms)
pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2)
dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper"))
tempObs <- rep(obs, ncol(perms))
dim(tempObs) <- dim(perms)
pvals[ , 1] <- rowSums(perms <= tempObs)/N
pvals[ , 2] <- rowSums(perms >= tempObs)/N
pvals
}
#########################
## mat: A 0/1 incidence matrix with each row representing a gene set
## and each column representing a gene. A 1 indicates
## membership of a gene in a gene set.
makeGSmat <- function(eset, geneSet.lst){
amat <- matrix(0, nrow= length(geneSet.lst), ncol =
length(featureNames(eset)))
for (i in 1:nrow(amat)){
amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]])
}
rownames(amat) <- names(geneSet.lst)
amat
}
####################################################
###########
# simulate non-parametric GSEA
require(GSEABase); require(GSEAlm)
set.seed(123)
eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12)
rownames(eset.mat) <- paste("A", 1:12000, sep = "_")
gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max =
300))),
sep = "_"), setIdentifier = "test.gs")
eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5
df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6)))
adf <- as(df, "AnnotatedDataFrame")
exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf)
#
gs.lst <- listtest.gs = gs)
#####################################
amat <- makeGSmat(exprsSet, gs.lst)
#################################################3
GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y,
mat =
amat, nperm = 1000)
GSEAResults.nonpara
###################################################
# simulate parametric GSEA
gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,]
pos <- match(geneIds(gs), names(gene.ts))
gene.ts[pos]
t.ref <- mean(gene.ts)
summation <- 0
GSEAResults.para <- {
for (i in 1:length(gene.ts[geneIds(gs)])){
summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref)
}
GSEAResults.para <- summation/length(gene.ts[geneIds(gs)])
GSEAResults.para
}
GSEAResults.para
GSEAResults.nonpara
GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
####################################################
> GSEAResults.para
A_121
-1.083127
> GSEAResults.nonpara
Lower Upper observedStats
test.gs 0 1 -7.435567
> GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
[1] -0.1582036
> sessionInfo()
R version 2.9.0 Under development (unstable) (2008-10-28 r46793)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US
.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N
AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI
FICATION=C
attached base packages:
[1] splines tools stats graphics grDevices utils
datasets
[8] methods base
other attached packages:
[1] GSEAlm_1.3.0 GSEABase_1.5.0 affycoretools_1.15.0
[4] annaffy_1.15.0 KEGG.db_2.2.5 gcrma_2.15.1
[7] matchprobes_1.15.0 biomaRt_1.17.0 GOstats_2.9.0
[10] Category_2.9.7 genefilter_1.23.0 survival_2.34-1
[13] RBGL_1.19.0 annotate_1.21.1 xtable_1.5-4
[16] GO.db_2.2.5 RSQLite_0.7-1 DBI_0.2-4
[19] AnnotationDbi_1.5.5 limma_2.17.3 affy_1.21.0
[22] Biobase_2.3.3 graph_1.21.0
loaded via a namespace (and not attached):
[1] affyio_1.11.2 cluster_1.11.11 preprocessCore_1.5.1
[4] RCurl_0.92-0 XML_1.98-1
>
------------------------------------------------------------
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Home
Skype: mkimpel
"The real problem is not whether machines think but whether men do."
-- B.
F. Skinner
******************************************************************
[[alternative HTML version deleted]]