Dear all,
I am struggling with using limma in the case of many missing values. Here is a small example:
df = data.frame(
g1=c(5,5,6,6,NA,NA),
g2=c(5,5,NA,NA,6,6),
g3=c(NA,NA,5,5,6,6),
treat=c("T1","T1","T2","T2","T3","T3"),
class=c("A","B","A","B","A","B"))
conts = list(treat="contr.sum", class="contr.treatment")
X = t(df[,1:3])
D = model.matrix( ~ class + treat, data=df, contrasts=conts)
Fit = lmFit(X, D)
These are the results:
> Fit$coefficients
(Intercept) classB treatT2 treatT3
g1 5 2.220446e-16 1 NA
g2 5 2.220446e-16 NA 1
g3 6 3.330669e-16 -1 NA
Suppose I am interessted in the comparison of T2-T1. The column "treatT2" of "Fit$coefficients" looks fine for gene 1 and gene 2. In the case of gene 3 the baseline level T1 can not be estimated and the coefficient treatT2 seems to be the comparison of T2-T3. I find this confusing when computing many thousand models and applying toptable(Fit, coef="treatT2") afterwards.
I changed the design matrix by setting conts = list(treat="contr.sum", class="contr.sum") Now, I get the results:
> Fit$coefficients
(Intercept) class1 treat1 treat2
g1 6.0 -1.665335e-16 -1.0 NA
g2 5.5 -1.665335e-16 -0.5 NA
g3 5.0 -1.110223e-16 -1.0 NA
However, I expected:
Intercept treat1 treat2
g1 5.5 -0.5 0.5
g2 5.5 -0.5 NA
g3 5.5 NA -0.5
Is it appropriate to use limma in such scenarios? Did I something wrong or misinterpreted the results? I think, replacing the NAs by mean intensities would be a solution in this case.
Regards,
Hans-Ulrich
> sessionInfo()
R version 2.6.2 (2008-02-08)
x86_64-pc-linux-gnu
locale:
C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_2.12.0
loaded via a namespace (and not attached):
[1] rcompgen_0.1-17
Dear Gordon,
thank you very much for your reply!
I am sorry for that. I wrote
conts = list(treat="contr.sum", class="contr.treatment")
in my posting but actually usedconts = list(treat="contr.treatment", class="contr.treatment")
.Yes. This behaviour makes sense for lm(), where only one model is fitted and the user knows that a coefficient can not be estimated due to NAs.
I found it confusing how limma stores and returns the results. Consider an even shorter version of the example in my previous mail (attached at the end of this mail). The design matrix is modeled such that the coefficient "treatT2" is the difference between T2 and T1 (T1 is the baseline). The column "treatT2" of Fit$coefficients stores the estimated difference T2-T1 for gene 1 but T2-T3 for gene 3. Calling topTable() with coef="treatT2" to compare T2 and T1 is misleading.
The problem is that often a user will not notice when limma drops a column of the design matrix for a few of many thousand genes. Maybe a warning message should be printed?
About 10%-15% of the spots are flagged by the image analysis software (GenePix) mainly due to saturation. Perhaps the scanner was poorly adjusted. Probably it is a good idea to include them in the analysis with low weights?
Best wishes, Hans-Ulrich
Here is the example:
I find it hard to accept that a behaviour can be sensible for lm() but not lmFit(), since they're both fitting linear models. On the other hand, the behaviour may not be sensible for lm() either. See below.
I understand your point. However it is hard to think of another strategy for accommodating missing values in lmFit() that would be universally applicable and interprettable. Can you suggest one?
Note that lm() has the same problem. Using lm() you could fit the model for gene3 either by
or by
and both are wrong, in the sense that the T2 and T3 coefficients cannot be interpretted as the T2-T1 and T3-T1 contrasts.
Note that you can use limma as it is now to get exactly the correct answer in all cases. You could proceed by
Now if you want to compare T2 to T1, you could remove T3 and compute the contrast by
which is exactly correct. Similarly to compare T3 to T1
Maybe. On one hand, I'm reluctant to introduce a warning a message in a situation in which lm() does not. One the other hand, I agree with you that linear model formulae and missing values don't mix when one of the coefficients becomes inestimable. The only way to get the correct answer is to use the group-mean parametrisation and form contrasts explicitly.
Yes, definitely, it would be far better to include them, whether or not you use low weights, in fact it would be wrong not to do so. Marking these spots as NA is incorrect, because linear model functions such as lmFit() have to assume that they are missing-at-random, which they are not. They are actually the largest intensities in the dataset.
Similar remarks apply to other spots flagged by GenePix because they are faint. Again it would be incorrect to convert them to NA or to give them zero weight. They do contain information.
Best wishes
Gordon
I have now added a warning message to lmFit() in this situation.
Best wishes
Gordon