Yannick Wurm
Hello Jim & List,
How would you go about doing model selection using two-color data
(since your response variable is actually a ratio)?
I'm actually surprised that no "formal/mathematical" linear model is
written out in the limma User's Guide... any comments?
Kind regards,
Yannick
On 2009-09-03 13:33:0, Jim MacDonald wrote:
> Hi Andre,
>
> If you want to do model selection, then limma is probably not the tool
> for the job.
>
> Instead, what I would do would be to choose some (one, five, ten,
> whatever) genes and use lm() for the model selection process. That way
> you can do all the conventional model selection steps, and once you are
> satisfied with the model you have chosen you can go back to limma and
> fit the model on all the genes.
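The single-gene approach suggested above could look roughly like the following sketch. It uses only base R and simulated data: the variable names (diseaseOutcome, labellingKit), the effect sizes, and the noise level are all assumptions made up for illustration, not values from the original dataset. It compares a model without the technical factor against one that includes it, using an F-test for nested models and AIC.

```r
# Minimal sketch: model selection for ONE gene with lm().
# All data below are simulated; names and effect sizes are hypothetical.
set.seed(1)
n <- 90
someTable <- data.frame(
  diseaseOutcome = factor(rep(c("control", "case"), each = n / 2)),
  labellingKit   = factor(rep(c("kitA", "kitB", "kitC"), times = n / 3))
)

# Simulated log-expression for a single gene, with a disease effect
# and a (technical) labelling-kit effect.
kitEffect <- c(kitA = 0, kitB = 0.8, kitC = -0.5)
expr <- 8 +
  1.0 * (someTable$diseaseOutcome == "case") +
  unname(kitEffect[as.character(someTable$labellingKit)]) +
  rnorm(n, sd = 0.3)

# Nested models: without and with the technical factor.
fitSmall <- lm(expr ~ diseaseOutcome, data = someTable)
fitFull  <- lm(expr ~ diseaseOutcome + labellingKit, data = someTable)

# F-test for the extra terms, plus AIC as a second criterion.
cmp <- anova(fitSmall, fitFull)
print(cmp)
print(AIC(fitSmall, fitFull))
```

Once a model has been settled on for a handful of genes this way, the same formula would be passed to model.matrix() and the resulting design matrix to lmFit() for all genes.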
>
> Best,
>
> Jim
>
> Andre J. Aberer wrote:
> > Dear list members,
> >
> > short version of my question:
> > How can I determine whether introducing additional explanatory
> > variables improves the quality of a linear model (in limma)? Is
> > there an equivalent to feature selection (as in machine learning) for
> > choosing the explanatory variables?
> >
> > The complete story:
> > We are analysing a dataset of about ninety single-channel microarray
> > chips, and we want to search for differentially expressed genes and
> > enriched gene sets. The chips are annotated with information (at least
> > 20 factors, could be extended to 50) such as the organ from which the
> > RNA was extracted, the experimenter who did the lab work, the labelling
> > kit she used, and a large number of features describing e.g. the
> > genotype of the individual or different aspects of the disease.
> >
> > We would like to build one linear model (that is, one design matrix)
> > with all of the factors of interest mentioned above as explanatory
> > variables in order to test various contrasts. Of course, we have to
> > include all the variables that we might want to test in the linear
> > model. But what about the "technical" factors like the "labelling kit"
> > that was used? One might never want to test a contrast using this
> > explanatory variable, yet the net chip intensity could be influenced
> > by a technical factor like this. So how can I determine whether it
> > makes sense to include this variable?
> >
> > I am using the standard procedure as described in the limma guide:
> > designMatrix <- model.matrix(~0 + var1 + var2, data=someTable)
> > fitBoth <- lmFit(eset, designMatrix)
> > where var1 and var2 are variables like "diseaseOutcome" and
> > "labellingKit".
> >
> > We thought that maybe an anova table could help us here by showing us
> > the influences of var1 and var2. As far as I have read
> > (e.g. http://data.princeton.edu/R/linearModels.html), the anova
> > function can simply be applied to an lm object or used to compare two
> > lm instances. Of course, in that case it is applied to only one linear
> > model, not one per gene as in the limma setting.
> > So, if I try anova on one or two limma fit objects (MArrayLM), R
> > complains that there is no applicable method, and other anova variants
> > (like anova.lm) do not work either. The same holds when I want to do
> > an anova for just one extracted linear model for one gene
> > (like anova(lmFit[1,])).
> >
> > Our ultima ratio so far is to build one design matrix with and another
> > one without a certain explanatory variable. Then we would determine
> > the top DEGs and, for each DEG, compare its two fitted linear models
> > in an anova table. Finally, we could check for how many of the top
> > DEGs the additional variable makes a difference.
> > However, this does not seem to be the golden path... or are we
> > completely on the wrong track?
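The per-gene comparison described in the paragraph above can be sketched in base R as follows. This is only an illustration under stated assumptions: the expression matrix is simulated, the factor levels and the set of "affected" genes are invented, and plain lm() fits stand in for the limma fits because anova() has no method for MArrayLM objects. For each gene it fits the model with and without var2, extracts the F-test p-value for the extra terms, and counts the genes for which the additional variable matters after multiple-testing adjustment.

```r
# Minimal sketch: per-gene nested-model comparison with lm() and anova().
# All data are simulated; var1/var2 levels and effect sizes are hypothetical.
set.seed(2)
nGenes  <- 50
nArrays <- 30
someTable <- data.frame(
  var1 = factor(rep(c("a", "b"), each = nArrays / 2)),
  var2 = factor(rep(c("kit1", "kit2", "kit3"), times = nArrays / 3))
)

# Genes in rows, arrays in columns (same layout as an expression matrix).
exprs <- matrix(rnorm(nGenes * nArrays, mean = 8, sd = 0.3),
                nrow = nGenes,
                dimnames = list(paste0("gene", seq_len(nGenes)), NULL))

# Assume the first 10 genes are shifted on arrays labelled with kit2.
affected <- 1:10
isKit2 <- someTable$var2 == "kit2"
exprs[affected, isKit2] <- exprs[affected, isKit2] + 1.5

# For every gene: p-value of the F-test comparing the two nested models.
pExtra <- apply(exprs, 1, function(y) {
  small <- lm(y ~ var1, data = someTable)
  full  <- lm(y ~ var1 + var2, data = someTable)
  anova(small, full)[["Pr(>F)"]][2]
})

# Adjust for testing many genes, then count genes where var2 matters.
pAdj    <- p.adjust(pExtra, method = "BH")
nHelped <- sum(pAdj < 0.05)
print(nHelped)
```

In a real analysis the rows of exprs would be restricted to the top DEGs, and the count would be read as a rough indication of whether the extra variable is worth keeping in the design matrix.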
> >
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826