Entering edit mode
Tom Wenseleers
▴
60
@tom-wenseleers-3741
Last seen 10.2 years ago
Dear all,
I was just analyzing some data from a differential (dual label LC/MS)
single factor peptidomics experiment using both a per-feature mixed
model approach (using lme4's lmer function, using a model of the form
intensity/abundance ~ (1|run)+label+treatment, and testing
significance and calculating least square means of the differential
expression ratios for the different treatments using lmerTest and
lsmeans, using a Satterthwaite df approximation) and an empirical
Bayes standard deviation shrinkage method as implemented in limma (by
putting my data in an RGLimma object). To my surprise, I am finding
more statistically significantly differentially expressed features
using the mixed model approach than using limma (independent of the
types of normalizations I use in limma, and I did check that the sd vs
abundance was flat). Before, I have also observed the same phenomenon
in some microarray datasets.
I am wondering if this would point towards a mixed model approach
having superior statistical power than limma, even for this very
simple design. Could this be the case? I only did 8 replicate runs in
this case (corresponding to arrays in 2-color microarray exps), so not
terribly many... Could it be that limma has superior statistical power
for the analysis of relatively small designs with few experimental
replicates (where the SD shrinkage could be beneficial), but that for
larger number of replicates and also for complex designs (in
particular when there are multiple sources of dependencies in the
data) a mixed model approach would work better?
Also, would a mixed model approach in general not be much easier to
specify (just requiring the model formula to be specified, as opposed
to all the contrasts, which becomes pretty tedious for complex
designs) and statistically be more flexible and powerful for
multifactorial experiments (e.g. easily allowing for treatment
interaction effects to be tested, and also allowing for multiple
crossed random factors, e.g. to take into account several dependencies
in the data, e.g. owing to spatial correlations in the microarray
slides/print tip effects, other random factors, e.g. due to the use of
samples with multiple genetic backgrounds or the presence of repeated
measures factors, etc). For repeated measures designs I would also
think that mixed model fits obtained using nlme could be even more
flexible, e.g. allowing for various custom error covariance structures
(e.g. to take into account temporal autocorrelations, etc); in fact,
even lmer supports unstructured covariance models (which would allow
variances to be different in your different groups, which I think
could be quite common). Model selection, on the other hand, would
appear a little more tricky, but also feasible, e.g. by minimizing the
AIC of a model that is fit over all features/genes combined (as in htt
p://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/vie
wer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b).
Finally, an important advantage that I would see of using this
approach is that it would readily generalize to RNAseq data if one
would use generalized mixed models (glmer's) with a Poisson error
structure (correcting for the total nr of mapped reads, etc and other
important covariates) (or, alternatively, using a binomial error
structure, to analyze the proportion of reads mapping to a particular
gene). Overdispersion in the data in this approach could readily be
taken into account by including an observation-level random factor in
the model, cf. https://stat.ethz.ch/pipermail/r-sig-mixed-
models/2011q1/015867.html. The advantage here again would be that
complex designs, e.g. with multiple crossed random factors, could
readily be taken into account (to my understanding, edgeR or DEseq do
not allow for this, right?). Would anyone perhaps have any thoughts
why not to go for such a generic approach?
Would there perhaps be any market for a package to analyze
differential expression in limma RGLimma and ESet data (or
CountDataSet or DGEList objects for RNAseq data) using the approach I
outline above? If there is enough interest, I would be keen to develop
it, as it would seem pretty straightforward to do...
Cheers,
Tom Wenseleers
______________________________________________________________________
_________________
Prof. Tom Wenseleers
* Lab. of Socioecology and Social Evolution
Dept. of Biology
Zoological Institute
K.U.Leuven
Naamsestraat 59, box 2466
B-3000 Leuven
Belgium
* +32 (0)16 32 39 64 / +32 (0)472 40 45 96
* tom.wenseleers@bio.kuleuven.be
http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm
[[alternative HTML version deleted]]