Entering edit mode
Arne.Muller@aventis.com
▴
620
@arnemulleraventiscom-466
Last seen 10.5 years ago
Hello,
I'm trying to analyze some toxicology data, and I got some problems
with lm
and rlm for which I'm looking for help. The experiment is a 3 factor
design:
- the value (the y vector) is a response for a treatment with a drug
- a dose factor with 5 levles (a drug in 5 concentrations)
- a time factor with 2 levels (time of treatment)
- a batch factor with 3 levels corresponding to the laboratories that
analyzed the data
Within each dose, time and batch I've 3 "technical" replicates. I'm
looking
for genes for which the expression is altered mainly by the dose at
any of
the time points.
Ideally I'd just like to merge the 3 replicates produced by each lab
into one
group with then 9 replicates (i.e. omitting the batch factor).
Unfortunately
the batch factor is very strong and introduces quite a large variance
within
each time and dose level.
I know that sometimes one laboratory is quite a strong outlier, but
the
"extreme" lab changes from gene to gene! When looking at a stripchart
for the
dose (within one timepoint) and merged batches the datapoints are
scattered,
and often 2 or 3 of the replicates from one lab are >2 standard
deviations
away from the mean.
I'm doing a simple lm to quantify the main effects (creating an lm for
each
gene).
fit <- lm(value ~ dose + time + batch, d)
> summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.772575997 0.04040316 167.6249085 2.174127e-99
dose010mM -0.038755967 0.04435994 -0.8736704 3.850501e-01
dose025mM -0.028481675 0.04598919 -0.6193123 5.375628e-01
dose050mM -0.069415506 0.05087185 -1.3645171 1.764316e-01
dose100mM 0.008306756 0.04435994 0.1872580 8.519573e-01
time24h 0.015798762 0.02976119 0.5308511 5.970699e-01
batchOLD 0.134205422 0.03534714 3.7967832 2.930010e-04
batchPRG 0.040506343 0.03837539 1.0555291 2.945274e-01
> anova(fit)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
dose 4 0.06086 0.01521 0.8180 0.517660
time 1 0.00524 0.00524 0.2818 0.597070
batch 2 0.27889 0.13944 7.4968 0.001068 **
Residuals 76 1.41362 0.01860
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> anova(rfit)
Analysis of Variance Table
To minumize the batch effect, e.g. to find true influences of the dose
I
thought it may be a good idea to exclude obvious outliers. I've done
this by
excludung data point >2 SD from the mean within a dose at a timepoint.
I
guess that's not realy elegant ... .
Would it be apporpiate to use a robust lm (rlm or lqs) to "downgrade"
the
outliers in importance? Could a robutsly fitted linear model be used
for an
anova, e.g. do weight the data within the groups? E.g.
rfit <- rlm(value ~ dose + time + batch, d)
Why are there no p-values given for the t-statistics?
> summary(rfit)$coefficients
Value Std. Error t value
(Intercept) 6.781613686 0.04215968 160.8554538
dose010mM -0.038060320 0.04628847 -0.8222418
dose025mM -0.042400425 0.04798856 -0.8835528
dose050mM -0.078787526 0.05308349 -1.4842192
dose100mM 0.008569384 0.04628847 0.1851300
time24h 0.005742874 0.03105505 0.1849256
batchOLD 0.140608264 0.03688384 3.8121911
batchPRG 0.031820089 0.04004375 0.7946331
... and no p-values for the F-stats/F-values either ...
> anova(rfit)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
dose 4 0.07794 0.01949
time 1 0.00165 0.00165
batch 2 0.29484 0.14742
Residuals 1.42161
lqs doesn't have any implementation for anova, is it theoretically not
possible to use the model for an anova?
As an alternative I'm thinking about using the "weights" option for
lm. Maybe
the distance from the mean in units of standard deviations
w <- abs(mean(i) - i) / sd(i)
w <- 1/w # to give small weight to outliers
I'd be happy to discuss these options with you.
kind regards + thanks a lot for your comments,
Arne
--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com