Question

Interpretation of DE analysis results of an Illumina microarray dataset, and possible batch effect size problem

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 16 months ago

Italy

Dear community,

i would like to address an important question regarding an illumina microarray dataset, i'm currently re-analyzing. It's about the experimantal design of the dataset with GSE32091, which has also been publiced.

In detail, there are 3 different conditions (Irradiated, bystander, control) in 3 different time points( 4h, 8h, 26h) with 3 or 4 biological replicates. The problem is that in some samples there are 3 bystander replicates in 4h and 4 bystander control replicates in 4 hour. Below, im posting the code of an MDS plot and a basic limma DE analysis of some contrasts:

> head(filtered.2$targets)
  Assay.Name Sample.and.Data.Relationship.Format time Batch
1  GSM795538                     irradiated_ctrl   4h     1
2  GSM795539                     irradiated_ctrl   4h     2
3  GSM795540                     irradiated_ctrl   4h     3
4  GSM795541                          irradiated   4h     1
5  GSM795542                          irradiated   4h     2
6  GSM795543                          irradiated   4h     3

> plotMDS(filtered$E,labels=as.factor(paste0(targets$Sample.and.Data.Relationship.Format, ".", targets$time)),col= as.numeric(as.factor(targets$Batch)))

# here is the link to the MDS plot

https://www.dropbox.com/s/ah4v0pem30irjvz/MDSplot.image.png?dl=0

## Continue with the DE analysis

....
samples <- as.factor(paste0(targets$Sample.and.Data.Relationship.Format, ".", targets$time)) # to make a factor including both conditions and 3 time points

batch <- filtered.2$targets$Batch # filtered.2 the eSet
design <- model.matrix(~0+samples+batch)

fit <- lmFit(filtered.2, design)

> table(batch)
batch
 1  2  3  4 
12 12 12  6

con <- makeContrasts(test=samplesbystander.4h-samplesbystander_ctrl.4h,levels=design) # one specific example for one specific contrast

fit2 <- contrasts.fit(fit, contrasts=con)
ebayes <- eBayes(fit2, trend=TRUE)

top2 <- topTable(ebayes, coef=1, number=nrow(ebayes), adjust.method="fdr", sort.by="none")..

To conclude, most of the contrasts implemented, don't return any DE, after controlling FDR < 0.05. Thus, from a first naive inspection, there are not any DE genes for most of the comparisons. So, my main questions are the following:

1. This is very obvious-evident from the MDS plot i posted above ? In which, we can see that in the majority of the time points, the control samples group together with either of the conditions tested (bystander and/or irradiated) ?

2. Moreover, could the different batch size as illustrated in some of the experimental groups above --even with a relatively small replicate difference-- hamper further the DE analysis, regarding the DE genes ?

(Just to add a small note: from the relative publication, the authors used a cutoff of FDR <0.15 combined with a nonimal p-value <0.05, ending with a relatively low number of DE genes...)

Any comments or suggestions would be grateful !!

limma illumina microarray multifactorial design batch effect • 2.0k views

ADD COMMENT • link updated 8.8 years ago by Gordon Smyth 52k • written 8.8 years ago by Konstantinos Yeles ▴ 80

score 1 · Answer 1 · 2016-06-19

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

You are correct that the MDS plot shown is consistent with a finding of no DE, since each condition and its corresponding control cluster together. However, keep in mind that the MDS plot only shows the first 2 dimensions, and it's possible that the controls and conditions separate on the 3rd or higher dimensions, so you may wish to also inspect the next few dimensions as well.

The different batch sizes are not a major issue here. Including batch in your model ensures that this batch will only contribute to the logFC estimation when samples from that batch are present in both conditions being compared.

By the way, there's a shortcut for the factor(paste0(...)) pattern for combining factors: interaction(..., drop=TRUE).

ADD COMMENT • link 8.8 years ago Ryan C. Thompson ★ 7.9k

2

Entering edit mode

Indeed, even if a lot of the spread across the plot is due to a batch effect, I would expect that each control sample would be "shifted" in a consistent direction relative to the non-control sample in the same batch, if there were a systematic control/non-control change in expression across all batches. This does not seem to be the case, though you'd get a clearer picture of what's going on if you performed removeBatchEffect on the expression matrix prior to running plotMDS.

Another thing to try would be treatment (i.e., bystander/irradiated)-specific blocking factors for the batch effects. This would model differences in how the bystander/irradiated samples respond to the batch effect, e.g., the irradiated samples might not be affected much by batch but the bystander samples show a greater variation. Adding specific blocking might improve the model fit, reduce the sample variance and increase power. Or maybe it won't, because you'll lose residual d.f.; you'll have to test it to find out.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

score 1 · Answer 2 · 2016-06-20

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

I don't actually see any evidence of a batch effect in this experiment, either from the MDS plot or from the description of the experiment. You seem to have created the Batch factor based on the ordering of the samples on the GEO website, but there may be no meaning to how the samples have been ordered.

That being the case, including the Batch factor in the model will only reduce power to detect DE genes.

ADD COMMENT • link 8.8 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

thank you also for your comment! However,i didn't just create the Batch factor based only on the GEO ordering, but also from the experimental design of the paper, in which it mentions 3-4 biological replicates for each condition ? So, if I'm not mistaken, you imply that the above ordering of the samples would not be valid ? But if this the case, i could not make any hypotheses or conclusions, because I can only utilize and stick with the available information on the GEO and authors paper -on the other hand, do you refer mostly to the MDS plot, in which no sign of an actual batch effect is present, and thus i could try the same approach above with limma, but instead not include the batch variable in my design ?

ADD REPLY • link 8.8 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

A batch effect has nothing to do with the ordering of the samples in GEO. A batch effect is present when different sets of samples are treated differently in a way that has nothing to do with the experimental design. For example, if half the samples were run together on one day, and the other half were run together on a different day, that would be two batches. The only variables listed in GEO for this dataset are treatment and time point (combined into a single "Treatment group" column), which are both experimental effects. No batch effects are listed.

ADD REPLY • link 8.8 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Dear Ryan,

thank you for your notification !! perhaps it is my mistake to not also post some information about this specific dataset that is not present in GEO:

in the following link of arrayExpress, there is the analytical SDRF with all the information about this experiment-in there, i found the information about the biological replicates information, which made me include them to inspect the above discussed possibility:

https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-32091/?query=GSE32091

ADD REPLY • link 8.8 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

Unless those different replicates were treated differently somehow, such as being done on different days, or each set of replicates coming from a different cell line, they don't constitute a batch effect. I don't see anything in the table that suggests any difference between the replicates, however, so I think the numbering of the replicates here is completely arbitrary and has no meaning in terms of the experimental design.

ADD REPLY • link 8.8 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Dear Ryan,

I agree that in this particular dataset the numbering of replicates maybe is arbitary and also unfortunately the description of the experimental design is not very informative-but from my naive point of view, also it is a point of consideration for either including this numbering as a batch variable(which can be also seen partially from the MDS plot above--without of course implying any severe batch effect),or totally ignore it--at any case however, the authors used the same cell line in 4 different cultures-thus each different culture gave rise to each samples belonging to each condition, which can also be considered as a different batch (in the way that every different culture resulted in "different samples" each time). I hope i desribe my notion clearly.

ADD REPLY • link 8.8 years ago Konstantinos Yeles ▴ 80

2

Entering edit mode

Basically, ask yourself: do all the samples in your "Batch 1" have something in common that is different from all the samples in the other batches? If all the samples in Batch 1 were derived from a single cell culture, and all the samples from Batch 2 were derived from a different cell culture, and so on, then it would be a legitimate batch effect. But again, I don't see any indication that this is the case, and the fact that the ArrayExpress sample table says "replicate 1/2/3/4" instead of "batch 1/2/3/4", and the fact that this attribute is in the identification section rather than the "Sample attributes" or "Variables" sections of the table, all argue that this is not intended to represent a batch effect. Lastly, there is no clear separation between the "batches" in the MDS plot. So I see nothing to justify the conclusion that this "batch" represents anything other than the order the samples were listed in GEO. Unless the paper (which I don't have access to) specifically says something about this batch effect, there's no reason to include it effect in your model, and as Gordon says, doing so is just throwing away statistical power for no benefit. And even if the paper does say that there are real batches, the MDS plot shows that the batches had minimal effect, and you'd probably still get better power by ignoring it.

ADD REPLY • link 8.8 years ago Ryan C. Thompson ★ 7.9k