Question

Batch effects in limma

0

Entering edit mode

smt8n • 0

@smt8n-9982

Last seen 8.4 years ago

Dear all,

I realize most facets of my question have been discussed here, but I am having hard time pulling all the answers together and generalizing.

I have 36 iPSC, 9 primary CM, and 18 biopsy samples. My quality control plots showed batch effects in the data and I added 3 batches to the design matrix, using, e.g. this:

A: Limma, blocking batch effect

discussion (and reviewing others on the topic)

After I added batches to the design matrix and got the "Coefficients not estimable" message, I found an answer from Gordon Smyth: "You can't expect to be able to estimate more things than your experiment contains information about. In this case, your batches are entirely confounded with patients+treatment, and hence you can't estimate coefficients for them." in one of the posts (and the differential expression was the same with and without the "+batches" in the matrix).

There are three questions I have here:

1) Does it mean that if I have ALL (36+9+18) samples in the design matrix, adding "batches" explicitly is introducing too many variables and there is no way to explicitly include batch effects?

2) In this case, does the regression on the samples implicitly take care of the batch effect?

3a) (if the answer to (2) is "Yes" ) Since there are a lot of discussions on batch effect inclusion, it should be needed for something. In what cases is it needed?

3b) (if the answer to (2) is "No") What can I do?

Thank you. Any help is greatly appreciated.

Slava

Thank you for the comment. Added upon request:

the sample table is the following:

<caption>Sample table</caption>

iPSC0

iPSC3

iPSC7

iPSC10

iPSC14

iPSC20

iPSC28

iPSC35

iPSC45

iPSC60

iPSC90

iPSC120

Prim

Fetal

LV

Heart1

Sorry, I lumped biological replicates together, but the table still did not fit. There should be 2 more columns: "Heart2" and "Heart diseased"

There are 3 biological replicates for all iPSC entries, 9 replicates for "Prim" entry, 6 replicates for "Fetal" and 3 replicates for "LV", "Heart1", "Heart2", and "Heart diseased".

The code that I used to generate the original (before attempting a batch correction) design matrix:

    RNAs <- factor(c("iPSC0","iPSC0","iPSC0",
                     "iPSC3","iPSC3","iPSC3",
                     "iPSC7","iPSC7","iPSC7",
                     "iPSC10","iPSC10","iPSC10",
                     "iPSC14","iPSC14","iPSC14",
                     "iPSC20","iPSC20","iPSC20",
                     "iPSC28","iPSC28","iPSC28",
                     "iPSC35","iPSC35","iPSC35",
                     "iPSC45","iPSC45","iPSC45",
                     "iPSC60","iPSC60","iPSC60",
                     "iPSC90","iPSC90","iPSC90",
                     "iPSC120","iPSC120","iPSC120",
                     "Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim",
                     "Fetal","Fetal","Fetal","Fetal","Fetal","Fetal",
                     "Heart1","Heart1","Heart1",
                     "LV","LV","LV",
                     "HeartDis","HeartDis","HeartDis",
                     "Heart2","Heart2","Heart2")
                   )
###    print(RNAs)
    design <- model.matrix(~0 + RNAs)
    colnames(design) <- levels(RNAs)

My attempt to account for batches (gse is an eset read from the series_matrix file with getGEO()) :

    colstring <- substring(pData(gse)[,"source_name_ch1"],1,4)

batches <- factor(colstring, levels=unique(colstring))
#    print(batches)
    designBC <- model.matrix(~0 + RNAs + batches)
    colnames(designBC)[1:nlevels(RNAs)] <- levels(RNAs)

Thank you!

Addition #2

Sorry about misunderstanding. I think the table below is what you requested:

RNAs Batches

iPSC0	iPSC
iPSC0	iPSC
iPSC0	iPSC
iPSC3	iPSC
iPSC3	iPSC
iPSC3	iPSC
iPSC7	iPSC
iPSC7	iPSC
iPSC7	iPSC
iPSC10	iPSC
iPSC10	iPSC
iPSC10	iPSC
iPSC14	iPSC
iPSC14	iPSC
iPSC14	iPSC
iPSC20	iPSC
iPSC20	iPSC
iPSC20	iPSC
iPSC28	iPSC
iPSC28	iPSC
iPSC28	iPSC
iPSC35	iPSC
iPSC35	iPSC
iPSC35	iPSC
iPSC45	iPSC
iPSC45	iPSC
iPSC45	iPSC
iPSC60	iPSC
iPSC60	iPSC
iPSC60	iPSC
iPSC90	iPSC
iPSC90	iPSC
iPSC90	iPSC
iPSC120	iPSC
iPSC120	iPSC
iPSC120	iPSC
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Prim	Prim
Fetal	Comm
Fetal	Comm
Fetal	Comm
Fetal	Comm
Fetal	Comm
Fetal	Comm
Heart1	Comm
Heart1	Comm
Heart1	Comm
LV	Comm
LV	Comm
LV	Comm
HeartDis	Comm
HeartDis	Comm
HeartDis	Comm
Heart2	Comm
Heart2	Comm
Heart2	Comm

Thank you!

limma batch effect • 3.4k views

ADD COMMENT • link updated 9.0 years ago by Aaron Lun ★ 28k • written 9.0 years ago by smt8n • 0

0

Entering edit mode

The answers to this kind of question depend on what your sample table looks like. Can you show the full table and the code that you're using to generate your design matrix? (Note that there is an "insert table" button in the edit bar when editing your question that you can use to add your sample table.)

ADD REPLY • link 9.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Can you make a data frame with one row per sample, with a column for "RNAs" and a second column for "batches"? This is the table I'm asking you to include.

ADD REPLY • link 9.0 years ago Ryan C. Thompson ★ 7.9k

score 1 · Answer 1 · 2016-04-11

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 6 hours ago

The city by the bay

Your groups are fully nested within your batch. It is impossible to account for the batch effect in the model, as you cannot distinguish (uninteresting) differences between batches from (interesting) differences between groups. You should instead use one-way layout with only the groupings, as you've done to generate design in your code. Any batch effect will be absorbed into the group-specific coefficients, so it will be implicitly modelled. All you have to do is to remember to only compare groups in the same batch. Comparisons between groups in different batches can but should not be formulated when you're running makeContrasts.

More generally, I wonder whether it is sensible to include all of these groups in the same analysis. Some of these samples seem to be very different - for example, iPSCs are probably cultured and of low variability, while the biopsy samples are likely to be more variable. If you're not going to test for DE between iPSCs and the biopsies, then it's unnecessary to analyze them together. In fact, mashing everything into a single analysis is unlikely to be appropriate, as you'll make normalization and variance modelling more difficult, which would outweigh any increase in residual d.f. from more samples.

Finally, as to your question 3, you need to include batch effects when you want to block on a non-confounding batch factor. I think you'll find plenty of examples in the limma and edgeR user's guides, but here's one:

batch <- factor(rep(1:3, 3))
condition <- factor(rep(LETTERS[1:3], each=3))
design <- model.matrix(~batch + condition)

Here, we have three batches, and each batch contains the same three groups. Blocking on batch allows us to test for DE between groups, accounting for the batch effect. Unlike your example, this is possible as the groups are not nested within batches.

ADD COMMENT • link 9.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thank you for the answer. Unfortunately, it brings up a problem.

First a quick question: do I understand correctly that you recommend the design <- model.matrix(~0 + RNAs) approach that will implicitly take care of the batches? Just making sure since I do not know the "one-way layout" term.

The big problem is that the whole analysis was started to compare some of the iPSC batch samples to samples from the other batches. Is there any way I could do this?

Thank you

Slava

ADD REPLY • link 9.0 years ago smt8n • 0

1

Entering edit mode

If there is any batch effect, then it will be absorbed into the estimates of the group-specific coefficients in design. It doesn't "take care of them" in the sense that the batch effects won't get removed (which is impossible for your experimental design, for reasons discussed above). However, the batch effects won't compromise the fit of the model either, which ensures that your dispersion estimates and various other statistics for comparisons within batches are valid.

If you want to compare between groups in different batches... well, that's simply not possible when your groups are nested in the batches. However, before we get too agitated, I would ask how you decided that batch effects were present in the first place. Did you do a MDS plot and see that the three sets of samples clustered separately? If so, that doesn't sound like a batch effect to me; I would expect separation between clusters just due to biological differences. On the other hand, if your batch effects are defined based on some technical reason, e.g., you prepared them on separate days/at sequencing centres, then you're out of luck. In such cases you'd have to go back to the drawing board and get the experimental design right.

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

I looked at box plots of intensities and at MDS plots. Just in case I am crazy, I am posting them below. Unfortunately, the batch effects seem too pronounced to be figments of my imagination.

ADD REPLY • link 9.0 years ago smt8n • 0

1

Entering edit mode

The clustering in the MDS plot is to be expected and may represent genuine biological differences. Now, I don't deal with microarray data much, but the intensity plots look strange to me. Why are all the boxplots exactly the same within each putative batch? This should only be possible if you performed, say, quantile normalization; but in that case, why are the boxplots different between samples in different batches? This suggests that you normalized within each batch, which is not a sensible strategy if you want to compare samples between different batches.

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for noticing the normalization issue. This is a data set from GEO that did not have regular raw files posted. Most of what was posted was normalized. That is what I used and what you saw. After your post I grabbed the only file with non-normalized data that there was, reshaped it using perl. I am not quite sure whether I treated it correctly (no negative controls, only p-values and neqc could not be used because of the issue discussed here: https://support.bioconductor.org/p/49954/), but even the raw data did not look like having batches. After running the through backgroundCorrect and normalizeBetweenArrays I saw graphs without any hint of batch effects. Thanks again.

ADD REPLY • link 9.0 years ago smt8n • 0