Question

How to incorporate batch effect that is only present in one level into the design?

0

Entering edit mode

tim.meese • 0

@timmeese-23719

Last seen 6 months ago

Belgium

Hi

I want to perform a differential expression on RNA-Seq data. During the data exploration, the PCA plot indicated that a batch effect was present (plot not shown). I obtained additional information about the experiment and indeed, the samples were processed by two different persons. The metadata of the experiment is shown in this table:

   sample treatment lab_tech
1 sample1   control    tech1
2 sample2   control    tech1
3 sample3   control    tech1
4 sample4    treatA    tech2
5 sample5    treatA    tech2
6 sample6    treatA    tech1
7 sample7    treatA    tech1

My first idea was to perform a differential expression analysis between samples4/5 and samples6/7. The genes that are called differentially expressed are probably due to the batch effect. Therefore, I could use that list to "correct" the results of differential expression analysis of control vs treatA. But then I started wondering if the batch effect couldn't be modelled by including it into the design? However, I do not know how to formulate a correct design. I am not even sure if this is possible. Does anyone want to help?

Thanks in advance.

linear model batch effect DESeq2 design • 3.8k views

ADD COMMENT • link updated 4.7 years ago by James W. MacDonald 68k • written 4.7 years ago by tim.meese • 0

score 2 · Answer 1 · 2020-08-20

Edit: it would help to understand if this is your entire dataset (n=7)? - sometimes people only post a small example. As per James's answer, tech2 is nested in Treatment. For all intents and purposes, assuming n=7, you could simply drop tech2 and proceed with just tech1's samples. There is probably some other approach whereby you could find 'anchor' genes whose expression is constant across tech1's samples, and then use these to adjust the data for tech2, but this is difficult given a sample size of 5 and 2 (for tech1 and tech2 respectively).

------------

Hey tim.meese,

There have been many questions on this topic across various web-forums over the years. It would have helped to have seen your PCA bi-plot and to understand the amount of variation contributed by the PC(s) (principal component(s)) along which you are observing a batch difference.

Nevertheless, generally, for dealing with batch effects, the broadly-accepted procedure is to not directly modify your input raw counts in order to adjust for batch differences. Instead, one can include batch in the design formula, which means that one is simply modeling and adjusting for the effect size of batch. In your case, you would need:

~ lab_tech + treatment

Then, when you derive test statistics for treatment, these (the statistical inferences) will be adjusted for the effects of lab_tech.

--------

If, later, you then wish to directly modify your transformed (rlog, vst, or logCPM) expression levels for downstream analyses like clustering, PCA, 'machine learning' stuff, et cetera, you could use limma::removeBatchEffect() to model and directly eliminate the batch effect in these. See Why after VST are there still batches in the PCA plot?.

Other related posts:

Kevin

score 0 · Answer 2 · 2020-08-20

I'm pretty sure that since tech2 is nested in treatment, both the treatment effect and the tech-specific effect get absorbed into the tech2 factor level, so it's the same as just dropping the data from tech2, except for the reduction in degrees of freedom. As an example:

> set.seed(0xabeef)
> z <- rnorm(7) + c(0,0,0,2,2,0,0) ## add a tech-specific offset
> trt <- factor(rep(c("control","treat"), c(3,4)))
> tech <- factor(rep(c(1,2,1), c(3,2,2)))
> d.f <- data.frame(z = z, trt = trt, tech = tech)
> summary(lm(z~tech + trt, d.f))

Call:
lm(formula = z ~ tech + trt, data = d.f)

Residuals:
      1       2       3       4       5       6       7 
 1.5649 -0.5580 -1.0069  1.4282 -1.4282 -0.4599  0.4599 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3608     0.8305   0.434    0.686
tech2         0.7866     1.4384   0.547    0.614
trttreat      0.8010     1.3131   0.610    0.575

Residual standard error: 1.438 on 4 degrees of freedom
Multiple R-squared:  0.2702,    Adjusted R-squared:  -0.09475 
F-statistic: 0.7404 on 2 and 4 DF,  p-value: 0.5327

> summary(lm(z~trt, d.f, subset = tech == 1))

Call:
lm(formula = z ~ trt, data = d.f, subset = tech == 1)

Residuals:
      1       2       3       6       7 
 1.5649 -0.5580 -1.0069 -0.4599  0.4599 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3608     0.6829   0.528    0.634
trttreat      0.8010     1.0798   0.742    0.512

Residual standard error: 1.183 on 3 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  -0.1267 
F-statistic: 0.5503 on 1 and 3 DF,  p-value: 0.512