Entering edit mode
Tony McBryan
▴
60
@tony-mcbryan-3847
Last seen 10.2 years ago
Hello list,
I'm using limma to analyse some agilent microarrays for an experiment
that had a large number of variables and I'm running into a wall
trying
to decide on the best model to use.
Briefly (and where R commands have been abbreviated for clarity);
I'm using the AgiMicroRna package, reading in the arrays and the
target
file. Background correcting by normexp + offset and then quantile
normalisation and converting to log values. Then I create a model
using
model.matrix and fit it using lmFit.
The bit I'm having trouble with is deciding the appropriate model that
I
should use. We have 36 arrays in total. Split evenly by gender
(m/f),
age (young/old), treatment (a,b,c) as well as three hybridisation
batches, three experimental batches, and 12 cell strains.
(Targets.txt
pasted below for reference).
If I choose a model as such:
groups<- model.matrix(~treatment + young.old + hyb + batch + sex)
The results seem to broadly make sense from a biological standpoint.
However, if I use:
groups<- model.matrix(~young.old + sex + treatment + hyb + batch +
strain + treatment:young.old + young.old:sex + treatment:sex +
batch:young.old)
which would include all the variables plus interaction effects then
the
results make less sense and generally don't overlap with the results
of
the first model.
Thus, my suspicion would be that the latter model is too complex and
is
overfitting. Sensible?
Looking at the R-Squared stats for the fit of the larger model:
rsquared<- function(fit, data)
{
sst<-rowSums(data^2)
ssr<-sst-fit$df.residual*fit$sigma^2
rsquared<- ssr/sst
rsquared
}
> summary(rsquared(fit,data))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9762 0.9998 0.9999 0.9998 1.0000 1.0000
Seems to show that the model is fitting to the data virtually 100%
which
seems quite high to me.
Looking at one probe individually supports this too:
> summary(lm(dat2[1,]~-1+groups))
Call:
lm(formula = dat2[1, ] ~ -1 + groups)
<--Snip-->
Residual standard error: 0.1418 on 22 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 1.913e+04 on 14 and 22 DF, p-value:< 2.2e-16
However, when I look at the simpler model:
> summary(rsquared(simplerfit,dat2))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9645 0.9997 0.9999 0.9997 1.0000 1.0000
The R squared is largely the same; which I would interpret as the
additional interaction terms are adding little to the model. Would I
be
correct in this interpretation?
Now, I would want to do something akin to a stepwise regression to
determine the appropriate model I should be using:
I can do this for a single probe as so:
> step(lm(dat2[1,]~young.old + sex + treatment + hyb + batch + strain
+ treatment:young.old + young.old:sex + treatment:sex +
batch:young.old))
Which suggests the correct model to use is:
~young.old + sex + treatment + batch + strain + young.old:sex
However, that has obviously been computed for just a single probe and
I'll get different results if I use a different probe as the basis for
the model. I can't find any obvious way to do a similar analysis for
the fit produced by lmFit to optimise the model for the entire set of
probes. (i.e. find the best overall model).
TL;DR:
Broadly I'm looking for a way to minimise AIC (or another similar
goodness of fit metric) by model selection for a produced by lmFit.
Can
anyone offer any suggestions on approaches which can do this?
Tony McBryan
---
targets.txt
> FileName hyb strain treatment batch sex young-old
> 40398_2.txt 1 S030 Ctrl 1 m old
> 40398_3.txt 1 S030 Rot 3d 1 m old
> 40398_4.txt 1 S030 Rot 3h 1 m old
> 40399_2.txt 1 JS14 Rot 3d 1 m young
> 40399_3.txt 1 JS14 Ctrl 1 m young
> 40399_4.txt 1 JS14 Rot 3h 1 m young
> 40400_2.txt 1 S175 Ctrl 1 f old
> 40400_3.txt 1 S175 Rot 3d 1 f old
> 40400_4.txt 1 S175 Rot 3h 1 f old
> 40401_1.txt 1 JS15 Rot 3d 1 f young
> 40401_2.txt 1 JS15 Rot 3h 1 f young
> 40401_3.txt 1 JS15 Ctrl 1 f young
> 40396_2.txt 2 S386 Rot 3h 1 m old
> 40396_3.txt 2 S386 Rot 3d 1 m old
> 40396_4.txt 2 S386 Ctrl 1 m old
> 40397_1.txt 2 JS20 Ctrl 1 m young
> 40397_2.txt 2 JS20 Rot 3h 1 m young
> 40397_4.txt 2 JS20 Rot 3d 1 m young
> 40415_2.txt 2 S569 Ctrl 3 f old
> 40415_3.txt 2 S569 Rot 3d 3 f old
> 40415_4.txt 2 S569 Rot 3h 3 f old
> 40437_2.txt 2 JS10 Rot 3d 2 f young
> 40437_3.txt 2 JS10 Rot 3h 2 f young
> 40437_4.txt 2 JS10 Ctrl 2 f young
> 40402_2.txt 3 S591 Rot 3d 2 m old
> 40402_3.txt 3 S591 Ctrl 2 m old
> 40402_4.txt 3 S591 Rot 3h 2 m old
> 40403_2.txt 3 JS30 Rot 3d 2 m young
> 40403_3.txt 3 JS30 Rot 3h 2 m young
> 40403_4.txt 3 JS30 Ctrl 2 m young
> 40404_2.txt 3 S534 Rot 3h 3 f old
> 40404_3.txt 3 S534 Ctrl 3 f old
> 40404_4.txt 3 S534 Rot 3d 3 f old
> 40414_2.txt 3 JS05 Rot 3h 3 f young
> 40414_3.txt 3 JS05 Rot 3d 3 f young
> 40414_4.txt 3 JS05 Ctrl 3 f young
[[alternative HTML version deleted]]