Hi Paul.
Some are in "MArrayLM" object, others require a couple commands.
For example, using the data example in lmFit help:
------------
sd <- 0.3*sqrt(4/rchisq(100,df=4))
y <- matrix(rnorm(100*6,sd=sd),100,6)
rownames(y) <- paste("Gene",1:100)
y[1:2,4:6] <- y[1:2,4:6] + 2
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
# Ordinary fit
fit <- lmFit(y,design)
fit <- eBayes(fit)
------------
you get recreate what you want in the linear model by the following
1) Residual standard error
> lm.first<-lm(y[1,]~-1+design)
> sqrt(anova(lm.first)["Residuals","Mean Sq"])
[1] 0.2990779
> fit$sigma[1]
Gene 1
0.2990779
2) R-Squared
> sst<-rowSums(y^2)
> ssr<-sst-fit$df.residual*fit$sigma^2
> (ssr/sst)[1]
Gene 1
0.964451
> summary(lm.first)$r.squared
[1] 0.964451
...
4) The F-statistic will change since the variances are moderated,
causing both the statistic to change and the degrees of freedom to
change.
Cheers,
Mark
On 01/06/2007, at 5:38 AM, Paul Shannon wrote:
> A summary of an lm result includes some readily-understood
> goodness of fit information:
>
> 1) Residual standard error
> 2) Multiple R-Squared
> 3) Adjusted R-Squared
> 4) F-statistic
>
> With limma, and eBayes, I deduce (incorrectly?) that efit$F and efit
> $F.p.value
> convey information similar to number 4 above. How about the first
> three
> measures -- is there any way to get equivalent information for the
> linear
> model eFit produces?
>
> Thanks!
>
> - Paul
Hi Mark,
Thanks for the very helpful explanation and -- especially -- the
example code.
Two questions remain:
1) You say, "The F-statistic will change since the variances are
moderated,
causing both the statistic to change and the degrees of freedom to
change."
I am afraid I don't understand this. In what circumstances does
the F-statistic
change?
2) I am interested in these goodness-of-fit measures because I want to
search
through my data (and the eBayes output) to find genes whose
behavior is
very nicely modeled by different coefficients, like this
a) devise a variety of model formualae based on biological
intuition
and examinations of the data
b) fit those models, one at a time
c) identify different subsets of genes based on their fit to
each formula
I remain a bit puzzled by the absence of these statistics from the
eBayes output. Does
their absence suggest that my 3-step procedure (2a-c, above) is not
common practice? And
would _that_ suggest that my 3-step procedure is not such a great
idea? Are there better,
more commonly used ways in limma to look for particular effects and
influences
of the experimental factors?
I'll be grateful for any comments or suggestions.
Cheers,
- Paul
Hi Paul.
> 1) You say, "The F-statistic will change since the variances are
> moderated,
> causing both the statistic to change and the degrees of freedom
to
> change."
> I am afraid I don't understand this. In what circumstances does
the
> F-statistic
> change?
What I mean is that the F statistic you get from limma is moderated
and is
different from the *classical* F test (which is what you'd get from
the
'lm' or 'anova' commands in R). Have a read of Gordon's paper for the
details:
http://www.bepress.com/sagmb/vol3/iss1/art3/
> 2) I am interested in these goodness-of-fit measures because I want
to
> search
> through my data (and the eBayes output) to find genes whose
behavior is
> very nicely modeled by different coefficients, like this
>
> a) devise a variety of model formualae based on biological
> intuition
> and examinations of the data
> b) fit those models, one at a time
> c) identify different subsets of genes based on their fit to
each
> formula
Can you give an example of what you mean here? I'm not sure what
"Very
nicely modeled by different coefficients" and "variety of model
formulae
based on biological intuition" are really referring to. Do you have a
designed experiment where you wish to look for differences between,
say
treatments A, B, C, etc.?
> I remain a bit puzzled by the absence of these statistics from the
eBayes
> output. Does
> their absence suggest that my 3-step procedure (2a-c, above) is not
common
> practice? And
> would _that_ suggest that my 3-step procedure is not such a great
idea?
> Are there better,
> more commonly used ways in limma to look for particular effects and
> influences
> of the experimental factors?
Common practice would be to make contrasts to the effect you want to
assess the significance of (if I understand what you are asking).
There
are several examples in the limma user's guide:
http://bioconductor.org/packages/2.0/bioc/html/limma.html
Cheers,
Mark
Hi Mark,
Thanks very much for your reply of a couple of days ago. You kindly
ask:
> Can you give an example of what you mean here? I'm not sure what
"Very
> nicely modeled by different coefficients" and "variety of model
formulae
> based on biological intuition" are really referring to. Do you
have a
> designed experiment where you wish to look for differences
between, say
> treatments A, B, C, etc.?
In brief: with lm (which I turned to at Gordon's suggestion) I can
trawl my
data for specific expression patterns, using different model formulae
and
stringent goodness-of-fit measures. But I am stumped when I try to do
the same with limma. I'd like to benefit from the refinements offered
by limma,
eBayes, in particular.
Here's the longer story.
The experiment is a direct 2-color design. There are 9 reference
samples (call them
J1-J9) and 3 'signal' samples (M1-M3). A brief description of the
actual study may be
helpful:
Malaria parasites exhibit a 'binding phenotype' when infecting
primigravida
(women pregnant for the first time). The parasites attach to CSA
exposed
in the placental lining and cause lots of harm to mother and fetus.
We hope to
elucidate the mechanisms of that binding by finding genes strongly
up-regulated in
primigravida. So our basic question is: what genes behave markedly
differently in primigravid parasites when compared to juvenile
parasites?
The 28 microarray slides consist of 14 dye-swap pairs, for a total of
28 slides:
M J
--- ----------
1 1 2 3 4 5
2 1 3 5 6 7 9
3 4 7 8
My simplest linear model treats every M/J comparison identically. And
in this simple
comparison we found two genes (call them gene1 and gene2), each with a
logFC ratio ~= 6,
and with very small residuals. A great signal: these genes were
uniformly up-regulated in
all comparisons.
I then examined genes a little further down in the limma topTable
results. I found a
gene3 whose statistics weren't bad (adj.P.Val=e-7, B=7.9), but a plot
showed a very
interesting pattern: like gene1 and gene2, in had a great signal in
all comparisons except
those involving M3. This doesn't seem like random variation; this
seems like another gene
which might be contributing (in 2 primigravida out of 3) to the
binding phenotype.
So here's where my biological intuition come in: multiple mechanisms
may be involved in
CSA binding. gene1 and gene2 are perhaps always involved. Parasites
from M3 do not seem
to require gene3 to bind to the placenta, but maybe M1 and M2 do. My
evidence so far is
scant, but I wish to explore the data to test the hypothesis.
I followed Gordon suggestion and used bare-bones lm () to model each
row in my MA object.
I was happy to see that I found small residuals, and a high R-squared
when I modeled gene3
like this:
gene3.lm <- lm (ratio ~ 1 + M3)
My simpler model for gene1 and gene2 would be simply
gene12.lm <- lm (ratio ~ 1)
Now I want to explore all the genes in the 28 comparisons to find
(possibly) other
striking and biologically suggestive behavior. Perhaps there are good
R-squared values
for
lm (ratio ~ 1 + M1) or
lm (ratio ~ 1 + M2)
which might (if we're lucky) reveal other pieces of the multiple,
conditional
contributors to the binding phenotype.
I'd like to do this sort of exploring in limma itself, rather than
stick with what is, for
me, the more transparent but less rich lm. I sense that a judicious
use of makeContrasts,
contrast.fit, and decideTests may allow me to do this kind of
exploration. But I pore
over the manual, experiment for hours, and (it's embarrassing, but
true) end up confused
and feeling rather a dunce.
Got any advice?
Cheers!
- Paul
Hello,
Has anybody used the Windows XP Professional x64 Edition
http://www.microsoft.com/windowsxp/64bit/overview.mspx
to run the BioConductor packages?
If so are there any configurations that one must be concerned about?
Any
advice is appreciated.
Thanks,
Sandhya
Hi Sandhya,
We've used R/bioconductor on XP 64bit and things work just fine. Keep
in
mind that this will be a 32bit version of R. There are currently no
tools that can compile a 64 bit version of R on windows. This means
you
will not be able to take advantage of the speed and available memory
increase of a 64-bit machine.
Also, please do not use an existing message to reply to when posting
on
the list. Some email clients treat that as an actual reply and it also
messes up the list archives. You should create a new message instead.
On Tue, 2007-06-05 at 16:36 -0400, Xirasagar, Sandhya wrote:
> Hello,
>
> Has anybody used the Windows XP Professional x64 Edition
> http://www.microsoft.com/windowsxp/64bit/overview.mspx
> to run the BioConductor packages?
>
> If so are there any configurations that one must be concerned about?
Any
> advice is appreciated.
>
> Thanks,
>
> Sandhya
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Thanks for this helpful information and also advice on not using an
existing message to reply to post a message. I was not aware of the
repercussions.
Sandhya
-----Original Message-----
From: Francois Pepin [mailto:fpepin@cs.mcgill.ca]
Sent: Wednesday, June 06, 2007 10:46 AM
To: Xirasagar, Sandhya
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Windows XP Professional x64 Edition
Hi Sandhya,
We've used R/bioconductor on XP 64bit and things work just fine. Keep
in
mind that this will be a 32bit version of R. There are currently no
tools that can compile a 64 bit version of R on windows. This means
you
will not be able to take advantage of the speed and available memory
increase of a 64-bit machine.
Also, please do not use an existing message to reply to when posting
on
the list. Some email clients treat that as an actual reply and it also
messes up the list archives. You should create a new message instead.
On Tue, 2007-06-05 at 16:36 -0400, Xirasagar, Sandhya wrote:
> Hello,
>
> Has anybody used the Windows XP Professional x64 Edition
> http://www.microsoft.com/windowsxp/64bit/overview.mspx
> to run the BioConductor packages?
>
> If so are there any configurations that one must be concerned about?
Any
> advice is appreciated.
>
> Thanks,
>
> Sandhya
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Paul.
Thanks for the explanation, I see what you are doing. I would call
what you are doing 'model selection' and its true, limma doesn't do
that. But, you can probably fit a 'full' model and just look for the
differences that you describe, all within limma.
For example, you could fit a model:
~ -1 + M1 + M2 + M3
where M1-M3 are binary indicators of the signal samples. So this
fits a model with no intercept and a different mean for each 'signal'
sample. If I understand your problem correctly, you are looking for
genes that are non-zero in each of coefficients for M1-M3 (like you
gene1 and gene2). But, you are also interested in genes which have
non-zero in M1,M2 and maybe not so in M3 (your gene3). These are
just contrasts. So, you should be able to look for everything you
are interested in, by constructing contrasts on M1-M3.
Alternatively, you could fit all the possible models you are
interested in and filter all the topTables. There are not that many.
Just one other note ....
> I was happy to see that I found small residuals, and a high R-
> squared when I modeled gene3
> like this:
I think you'll find that small residuals (or at least small relative
to the signal) and high R-squares correspond to large (in magnitude)
t-statistic or large Fs. So, everything you need is in limma.
Hope that helps.
Cheers,
Mark