Question

limma batch question

0

Entering edit mode

jalali4 • 0

@jalali4-11943

Last seen 8.2 years ago

Hello all,

I have a few questions about the linear model in limma.

I am doing limma/voom on my expression results, looking for differences between treatment1 and treatment2. I have several batches, so my design is model.matrix(~batch + treatment). My understanding is that with this design, limma first compares treatment1 and treatment2 samples within each batch and then somehow combines the results of different batches together. My first questions:

1. I assume that the linear model gives different batches equal weights when combining the results from each batch, even if batches contain different number of samples, am I right?

Additionally, most of my batches contain both treatment1 and treatment2 samples. Let's call these dual-treatment batches. However, a few batches unfortunately only contain treatment1 samples (or only treatment2 samples). Let's call these uni-treatment batches. Additional questions:

2. Are uni-treatment batches simply dismissed in the linear model calculation with the above design? I don't get any error messages when running lmFit and I get meaningful results from topTable. But I wonder how linear model can run without dismissing the uni-treatment batches. For example, if one batch has 0 treatment1 samples and 3 treatment2 samples, how can limma make a meaningful comparison between treatment1 and treatment2 within that batch?

3. If I manually remove samples in uni-treatment batches ahead of my design (so I end up with fewer batches, all dual-treatment), my topTable results are slightly different from my results in question 2 (slightly different logFC values, moderate differences in p values, and more noticeable differences in AveExpr values for each gene). Overall, for FDR<0.05, I get 826 genes vs 906 genes comparing the two methods. If the linear model simply dismisses uni-treatment batches in its calculation (I don't see how it could not) then I'd think the results under question 2 and question 3 should be the same.

Sorry for the long questions and thanks for your thoughts and answers.

Ali

limma batch effect • 1.8k views

ADD COMMENT • link 8.2 years ago jalali4 • 0

0

Entering edit mode

Thank you Aaron very much for that explanation, which helped my understanding. My earlier understanding came from a simplistic statement in limma vignette under "Blocking" which said "In this type of analysis, the treatments are compared only within each block."

I also thought of another item that potentially adds to the difference I see between results in questions 2 and 3, and that is the normalization factor calculation as well as voom transformation, which are affected when I remove samples from the analysis.

ADD REPLY • link 8.2 years ago jalali4 • 0

0

Entering edit mode

The comment about blocking in the limma User's Guide is correct and does not imply separate analyses for each block. Treatments are compared within each block, and these comparisons are combined, but the combining is done automatically as part of fitting a single linear model. That is how linear models work. If I may say so, it was actually your interpretation rather than the original statement that was too "simplistic".

Yes, normalization changes will affect downstream results if you remove some of the samples.

PS. I have moved you answer to be a comment. For future reference, if you want to reply to Aaron, please use "ADD COMMENT" or "ADD REPLY" rather than posting an answer to your own question.

ADD REPLY • link 8.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you Gordon. Point taken. And thanks for clarifying the commenting function.

ADD REPLY • link 8.2 years ago jalali4 • 0

score 6 · Accepted Answer · 2017-01-28

Firstly, your concept of how lmFit works is wrong. Linear models are not fitted piece-wise to each batch in an additive model. Rather, the whole thing is fitted simultaneously by minimizing the (weighted) sum of squares, i.e., the squared difference between each observation and its fitted value from the model. There is no combining step, because all (relevant) observations are used to estimate each coefficient in the first place.

Now, as for your first question; the answer is no. Large batches with many samples will contribute more to the coefficient estimates than small batches, because large batches contribute more to the sum of squares. Thus, all other things being equal, it is more important to get good estimates for many observations in large batches than for one or two observations in a smaller batch. I should point out that this shouldn't matter if the additive model is correct, because the expected difference between treatments should be the same in each batch. Otherwise, if you expect differences in the treatment effect between batches, you should be using an interaction model.

As for your second question; a batch with only one treatment condition does not contribute to estimation of the treatment coefficients. This is because all of the information in that batch is used to estimate the corresponding batch coefficient. Consider a situation where you have a systematic increase in expression for all samples in this batch, relative to equivalent treatment samples in all other batches. Which one would improve the fit - increasing the treatment coefficient, which would increase the sum of squares in all other batches; or just increasing the batch coefficient for this batch, which does not affect the model fit for other batches?

Finally, for your third question; samples in single-treatment batches will still be used to estimate the variance, provided that there's more than 2 of them per batch. This will increase the residual degrees of freedom and improve power for hypothesis testing. So, these samples are not totally useless, but I doubt you intended to obtain them for the sole purpose of variance estimation. This is where good experimental design comes in, to ensure you get the maximum value for the data you intend to collect.