limma - complex design with paired samples, sex and batch effect
3
4
Entering edit mode
aec ▴ 90
@aec-9409
Last seen 4.6 years ago

Hi,

I have a very complex paired design to test for differential expression with limma. I have a control group and intervention group (more than 100 RNA-seq samples). The intervention group received a treatment but not the control. So my expression data is structured like this:

Pre1: control pre-treatment

Pre2: intervention pre-treatment

Post1: control post-treatment

Post2: intervention post-treatment

I would like to compare:

Post1-Pre1

Post2-Pre2

Pre2-Pre1

Post2-Post1

But taking into account batch effects (batch1 and batch2) and sex (male/female) . But I am not sure if including batch effect because Pre and Post samples coming from the same individual fall in the same batch.

This is my limma code so far:

head(info)
PATIENT TREATMENT GROUP GENDER BATCH
11 Pre 2 Female BATCH1
11 Post 2 Female BATCH1
12 Pre 2 Male BATCH1
12 Post 2 Male BATCH1
13 Pre 2 Female BATCH1
13 Post 2 Female BATCH1
14 Pre 2 Male BATCH1
14 Post 2 Male BATCH1
15 Pre 2 Male BATCH1
15 Post 2 Male BATCH1
16 Pre 2 Male BATCH2
16 Post 2 Male BATCH2
17 Pre 2 Female BATCH2
17 Post 2 Female BATCH2
18 Pre 2 Male BATCH2
18 Post 2 Male BATCH2
19 Pre 2 Female BATCH2
19 Post 2 Female BATCH2
20 Pre 2 Female BATCH2
20 Post 2 Female BATCH2
21 Pre 2 Female BATCH2
21 Post 2 Female BATCH2
22 Pre 2 Male BATCH2
22 Post 2 Male BATCH2
23 Pre 2 Female BATCH2
23 Post 2 Female BATCH2
24 Pre 1 Female BATCH2
24 Post 1 Female BATCH2
25 Pre 2 Male BATCH2
25 Post 2 Male BATCH2
26 Pre 2 Female BATCH2
26 Post 2 Female BATCH2
27 Pre 2 Female BATCH2
27 Post 2 Female BATCH2
28 Pre 2 Female BATCH2
28 Post 2 Female BATCH2
29 Pre 1 Female BATCH2
29 Post 1 Female BATCH2
30 Pre 2 Female BATCH2
30 Post 2 Female BATCH2
31 Pre 2 Female BATCH2
31 Post 2 Female BATCH2
32 Pre 2 Female BATCH2
32 Post 2 Female BATCH2
33 Pre 2 Male BATCH2
33 Post 2 Male BATCH2
34 Pre 1 Female BATCH2
34 Post 1 Female BATCH2
       
y=DGEList(counts=counts)
A<-rowSums(y$counts)
isexpr<-A>500
y=y[isexpr,keep.lib.size=FALSE]
dim(y)

y=calcNormFactors(y)
sex=as.factor(info$SEX)
batch=as.factor(info$BATCH)
Treat=factor(paste(info$TREATMENT, info$GROUP, sep="."))
design=model.matrix(~0+Treat+sex+batch)
colnames(design)
v=voom(y,design)
cor=duplicateCorrelation(v,design,block=info$PATIENT)

fit=lmFit(v,design,block=info$PATIENT, correlation=cor$consensus)
cm=makeContrasts(CONTROLvsINTERVENTIONforPRE= TreatPre.2-TreatPre.1, CONTROLvsINTERVENTIONforPOST= TreatPost.2-TreatPost.1, PrevsPostforCONTROL= TreatPost.1-TreatPre.1, PrevsPostforINTERVENTION=TreatPost.2-TreatPre.2,levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2)

summary(decideTests(fit2))

Pre1Pre2_table=topTable(fit2,coef="CONTROLvsINTERVENTIONforPRE",n=Inf, sort="p", p=0.05)
Post1Post2_table=topTable(fit2,coef="CONTROLvsINTERVENTIONforPOST",n=Inf, sort="p", p=0.05)
Pre1Post1_table=topTable(fit2,coef="PrevsPostforCONTROL",n=Inf, sort="p", p=0.05)
Pre2Post2_table=topTable(fit2,coef="PrevsPostforINTERVENTION",n=Inf, sort="p", p=0.05)

 

limma paired samples complex-design • 6.0k views
ADD COMMENT
4
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 16 hours ago
The city by the bay

Looks fine to me. Including sex and batch as fixed effects in design probably doesn't hurt, as you're not losing a lot of residual d.f. from two extra terms. In fact, if you're using duplicateCorrelation, it might even help. Those terms will absorb systematic differences in expression between sexes and batches, which ensures that the remaining random effects across patients are more homogeneous.

The alternative would be to block directly on the patient ID in the design matrix, which would then obviate the need to block on sex and batch. However, this renders your desired DE comparisons impossible as you can't compare directly across blocking levels, e.g., if you wanted to compare post-treatment expression between the different levels of GROUP. You'd be restricted to comparing within GROUP (your first two comparisons) or looking for differential treatment effects between groups, i.e., (Post2 - Pre2) - (Post1 - Pre1).

Some other comments:

  • Setting isexpr<-A>500 is not robust as the aggressiveness of the filter depends on the number of libraries. If you were to copy-paste this code elsewhere, or if you got new libraries and added them to the analysis, or if you found out some of your existing libraries were rubbish and threw them out, then the strength of the filter would change. You could make it safer by just switching to the equivalent threshold in terms of the rowMeans.

  • You can set robust=TRUE in eBayes to protect against genes with outlier variances, of which there may be a few in such a large data set. This ensures that those outliers won't distort the estimates of the EB statistics.

ADD COMMENT
0
Entering edit mode

You said that blocking directly on patient in the design matrix makes impossible to compare directly across blocking levels, but this is actually what we did in the last posts I compared Post2-Post1 and obtained 13 DE genes.

ADD REPLY
0
Entering edit mode

You are not comparing Post2 to Post1 expression in those posts. Rather, you are comparing the Post2 log-FC (before/after treatment) to the Post1 log-FC. This is possible as you have computed those log-FCs within patients, such that the blocking factor for each patient cancels out. You can't compare absolute expression between patients when they're blocked in the design, because the blocking factors for different patients would not cancel out and will absorb DE.

ADD REPLY
0
Entering edit mode
aec ▴ 90
@aec-9409
Last seen 4.6 years ago

Aaron thanks for your comment.

The thing is that if I run this model: ~0+Treat+sex+batch, the number of DE genes is only 2 for Post2-Post1 and I would expect many more DE genes because controls did not received the treatment but the intervention group did.

See results:

   CONTROLvsINTERVENTIONforPRE CONTROLvsINTERVENTIONforPOST PrevsPostforCONTROL

-1                           0                            0                   0
0                        23776                        23774               23776
1                            0                            2                   0
   PrevsPostforINTERVENTION
-1                     1165
0                     21349
1                      1262

If I do not take into account the batch effect, the model would be : ~0+Treat+sex and these are the results:

   CONTROLvsINTERVENTIONforPRE CONTROLvsINTERVENTIONforPOST PrevsPostforCONTROL
-1                          92                          893                   0
0                        23603                        21861               23776
1                           81                         1022                   0
   PrevsPostforINTERVENTION
-1                     1455
0                     20658
1                      1663

As you can see I do get more DE genes for Post2-Post1, but it is true that also the number of DE genes for Pre2-Pre1 also increases and that is not desirable as we expect no differences. 

My question would be which of the two models is more appropiate? Is it really necessary to account for batch effect if the same individual pre and post comes from the same batch?

The other question is, when we compare Pre2-Pre1 or Post2-Post1 we do not have paired samples, so is it necessary to account for the patient covariate for these two comparisons? 

 

ADD COMMENT
0
Entering edit mode

I'd check what your samples are doing with respect to the batch, e.g., on a MDS plot. I suspect there may be a strong difference between samples in different batches. If so, this needs to be modelled because it will otherwise distort the variance estimates. This is true even when you're comparing within batches via duplicateCorrelation (in which case, a systematic difference between patients in different batches will distort the correlation estimate).

For similar reasons, you need to account for the patient covariate when comparing between treatments. This is because samples from the same patient are correlated - ignoring that will underestimate the standard error of the log-fold change.

Edit: I should add that getting more DE genes when ignoring a batch effect is pretty weird. Usually, the batch effect will inflate the variance/increase the correlation, which should result in fewer DE genes for that comparison. Check if all the control or intervention samples are in one batch, because if that's the case, the samples in the other batch will not be used in the DE comparison.

ADD REPLY
0
Entering edit mode

Aaron, you are right, Batch1 has only intervention samples, anyone coming from controls. But Batch2 has both control and intervention samples. How should I proceed with such a case?

Regarding the MDS plot I can clearly see the sex and batch separation.

Many thanks for your comments.

ADD REPLY
0
Entering edit mode

I would definitely block on batch in the design matrix. You don't want a situation where a gene is DE between intervention and control just because of a batch effect (e.g., in Batch 1 that results in up/downregulation in many of the intervention samples). Now, it's unfortunate that you get fewer DE genes, but that's the way it is. Perhaps a more relevant comparison would account for pre-treatment differences; try out the (Post2 - Pre2) - (Post1 - Pre1) contrast that I described before, which tests for a differential treatment effect between control and intervention.

ADD REPLY
0
Entering edit mode

Aaron, The contrast you suggested gives no DE genes. Here is part of the code:

fit=lmFit(v,design,block=info$PATIENT, correlation=cor$consensus)
cm=makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect= (TreatPost.2-TreatPre.2)-(TreatPost.1-TreatPre.1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2)

Result:

CONTROLvsINTERVENTIONforTreatment_Effect
-1                                        0
0                                     23776
1                                         0

This is not expected, right? Is there something I am doing wrong? I corrected by sex and batch effect as well.

 

ADD REPLY
0
Entering edit mode

The code looks fine to me. The result is quite strange, though; you get around 2000 DE genes for PrevsPostforINTERVENTION and no DE genes for PrevsPostforCONTROL, yet the comparison between the two log-fold changes doesn't yield any DE genes. Try blocking on patient in the design matrix:

design <- model.matrix(~ patient + group:treatment)

... and do a contrast between the coefficients representing the treatment log-fold change between control and intervention. (Dropping of a coefficient to obtain full rank will probably be required, probably with the second-last coefficient.) Also drop each coefficient individually, as a reference. The aim is to ensure that you're not getting penalised by duplicateCorrelation for between-block comparisons. 

ADD REPLY
0
Entering edit mode

I think this is your suggested code:

design <- model.matrix(~ patient + treatment:group)
colnames(design)

[1] "(Intercept)"          "patient 10"           "patient100"
  [4] "patient101"           "patient102"           "patient103"
  [7] "patient104"           "patient105"           "patient106"
 [10] "patient107"           "patient108"           "patient109"
 [13] "patient 11"           "patient110"           "patient111"
 [16] "patient112"           "patient113"           "patient114"
 [19] "patient115"           "patient116"           "patient117"
 [22] "patient118"           "patient119"           "patient 12"
 [25] "patient120"           "patient121"           "patient122"
 [28] "patient123"           "patient124"           "patient125"
 [31] "patient126"           "patient127"           "patient128"
 [34] "patient129"           "patient 13"           "patient130"
 [37] "patient 14"           "patient 15"           "patient 16"
 [40] "patient 17"           "patient 18"           "patient 19"
 [43] "patient  2"           "patient 20"           "patient 21"
 [46] "patient 22"           "patient 23"           "patient 24"
 [49] "patient 25"           "patient 26"           "patient 27"
 [52] "patient 28"           "patient 29"           "patient  3"
 [55] "patient 30"           "patient 31"           "patient 32"
 [58] "patient 33"           "patient 34"           "patient 35"
 [61] "patient 36"           "patient 37"           "patient 38"
 [64] "patient 39"           "patient  4"           "patient 40"
 [67] "patient 41"           "patient 42"           "patient 43"
 [70] "patient 44"           "patient 45"           "patient 46"
 [73] "patient 47"           "patient 48"           "patient 49"
 [76] "patient  5"           "patient 50"           "patient 51"
 [79] "patient 52"           "patient 53"           "patient 54"
 [82] "patient 55"           "patient 56"           "patient 57"
 [85] "patient 58"           "patient 59"           "patient  6"
 [88] "patient 60"           "patient 61"           "patient 62"
 [91] "patient 63"           "patient 64"           "patient 65"
 [94] "patient 66"           "patient 67"           "patient 68"
 [97] "patient 69"           "patient  7"           "patient 70"
[100] "patient 71"           "patient 72"           "patient 73"
[103] "patient 74"           "patient 75"           "patient 76"
[106] "patient 77"           "patient 78"           "patient 79"
[109] "patient  8"           "patient 80"           "patient 81"
[112] "patient 82"           "patient 83"           "patient 84"
[115] "patient 85"           "patient 86"           "patient 87"
[118] "patient 88"           "patient 89"           "patient  9"
[121] "patient 90"           "patient 91"           "patient 92"
[124] "patient 93"           "patient 94"           "patient 95"
[127] "patient 96"           "patient 97"           "patient 98"
[130] "patient 99"           "treatmentPost:group1" "treatmentPre:group1"
[133] "treatmentPost:group2" "treatmentPre:group2"

v=voom(y,design)
fit=lmFit(v,design)
cm=makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect= (treatmentPost:group2 - treatmentPre:group2)-(treatmentPost:group1 - treatmentPre:group1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2)

Which gives these error messages:

Coefficients not estimable: treatmentPre:group1 treatmentPre:group2
Warning message:
Partial NA coefficients for 23776 probe(s)
Coefficients not estimable: treatmentPre:group1 treatmentPre:group2
Warning message:
Partial NA coefficients for 23776 probe(s)
Error in makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect = (treatmentPost:group2 -  :
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: patient 10,patient 11,patient 12,patient 13,patient 14,patient 15,patient 16,patient 17,patient 18,patient 19,patient  2,patient 20,patient 21,patient 22,patient 23,patient 24,patient 25,patient 26,patient 27,patient 28,patient 29,patient  3,patient 30,patient 31,patient 32,patient 33,patient 34,patient 35,patient 36,patient 37,patient 38,patient 39,patient  4,patient 40,patient 41,patient 42,patient 43,patient 44,patient 45,patient 46,patient 47,patient 48,patient 49,patient  5,patient 50,patient 51,patient 52,patient 53,patient 54,patient 55,patient 56,patient 57,patient 58,patient 59,patient  6,patient 60,patient 61,patient 62,patient 63,patient 64,patient 65,patient 66,patient 67,patient 68,patient 69,patient  7,patient 70,patient 71,patient 72,patient 73,patient 74,patient 75,patient 76,patient 77,patient 78,patient 79,patient  8,patient 80,patient 81,patient 82,patient 83,patient
In addition: Warning message:
In makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect = (treatmentPost:group2 -  :
  Renaming (Intercept) to Intercept
Execution halted

 

ADD REPLY
0
Entering edit mode

You need to drop coefficients 132 and 134. Also, you need to rename the coefficients so that they don't have spaces or colons in them.

ADD REPLY
0
Entering edit mode

I think I got your point:

design <- model.matrix(~ patient + treatment:group)
colnames(design) <- make.names(colnames(design), unique=TRUE)
design=design[,c(-132,-134)]
colnames(design)

 [1] "X.Intercept."         "patient.10"           "patient100"
  [4] "patient101"           "patient102"           "patient103"
  [7] "patient104"           "patient105"           "patient106"
 [10] "patient107"           "patient108"           "patient109"
 [13] "patient.11"           "patient110"           "patient111"
 [16] "patient112"           "patient113"           "patient114"
 [19] "patient115"           "patient116"           "patient117"
 [22] "patient118"           "patient119"           "patient.12"
 [25] "patient120"           "patient121"           "patient122"
 [28] "patient123"           "patient124"           "patient125"
 [31] "patient126"           "patient127"           "patient128"
 [34] "patient129"           "patient.13"           "patient130"
 [37] "patient.14"           "patient.15"           "patient.16"
 [40] "patient.17"           "patient.18"           "patient.19"
 [43] "patient..2"           "patient.20"           "patient.21"
 [46] "patient.22"           "patient.23"           "patient.24"
 [49] "patient.25"           "patient.26"           "patient.27"
 [52] "patient.28"           "patient.29"           "patient..3"
 [55] "patient.30"           "patient.31"           "patient.32"
 [58] "patient.33"           "patient.34"           "patient.35"
 [61] "patient.36"           "patient.37"           "patient.38"
 [64] "patient.39"           "patient..4"           "patient.40"
 [67] "patient.41"           "patient.42"           "patient.43"
 [70] "patient.44"           "patient.45"           "patient.46"
 [73] "patient.47"           "patient.48"           "patient.49"
 [76] "patient..5"           "patient.50"           "patient.51"
 [79] "patient.52"           "patient.53"           "patient.54"
 [82] "patient.55"           "patient.56"           "patient.57"
 [85] "patient.58"           "patient.59"           "patient..6"
 [88] "patient.60"           "patient.61"           "patient.62"
 [91] "patient.63"           "patient.64"           "patient.65"
 [94] "patient.66"           "patient.67"           "patient.68"
 [97] "patient.69"           "patient..7"           "patient.70"
[100] "patient.71"           "patient.72"           "patient.73"
[103] "patient.74"           "patient.75"           "patient.76"
[106] "patient.77"           "patient.78"           "patient.79"
[109] "patient..8"           "patient.80"           "patient.81"
[112] "patient.82"           "patient.83"           "patient.84"
[115] "patient.85"           "patient.86"           "patient.87"
[118] "patient.88"           "patient.89"           "patient..9"
[121] "patient.90"           "patient.91"           "patient.92"
[124] "patient.93"           "patient.94"           "patient.95"
[127] "patient.96"           "patient.97"           "patient.98"
[130] "patient.99"           "treatmentPost.group1" "treatmentPost.group2"

v=voom(y,design)
fit=lmFit(v,design)
cm=makeContrasts(CONTROLvsINTERVENTIONforPost= (treatmentPost.group2 - treatmentPost.group1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2) 

summary(decideTests(fit2))

CONTROLvsINTERVENTIONforPost
-1                            9
0                         23763
1                             4

Still very few DE genes comparing Post2-Post1 but this design does not include batch and sex effects. Should I add them too?

ADD REPLY
1
Entering edit mode

No need to include batch and sex effect, as you're blocking on the patient. Each patient is nested within a batch and sex, so the latter two would be redundant. Anyway, there's not a lot of DE genes. One possible explanation is that you just don't have a lot of control samples (I assume this is group 1), such that lots of weak DE can be picked up in the intervention pre-post comparison (e.g., by dropping treatmentPost.group2)  but not in the equivalent comparison for the control (e.g., by dropping treatmentPost.group1). Similarly, this means that there's not enough evidence for looking at differential treatment effects between intervention and control. If so, I suspect that trying different analyses won't do much to help, because the experimental design is inherently limited.

ADD REPLY
0
Entering edit mode

Yes, I have different amount of samples for group1 and group2.

43 for group1 (control)

87 for group2 (intervention)

Thanks for all your comments Aaron.

 

ADD REPLY
0
Entering edit mode

one last question, if here we are comparing Post2-Post1 why should we block on patient? Patients from group1 are different from group2...

ADD REPLY
0
Entering edit mode

What makes you think that you shouldn't block on the patient term (either in the design matrix or via duplicateCorrelation)? You need to account for patient-specific effects on expression, regardless of what comparison you're making.

ADD REPLY
0
Entering edit mode

Ok I see. It's difficult to me to see the difference between blocking via duplicateCorrelation o directly in the design matrix. What is the advantage of each methodology?

ADD REPLY
0
Entering edit mode

If you want to compare between different levels of the blocking factor, use duplicateCorrelation. If not, then you can use either method - in such cases, duplicateCorrelation is more powerful but can also be slightly liberal. This is because it treats the patient-specific effects as being randomly sampled from some homogeneous distribution, whereas the same effects are not considered to be related when you block on patient in the design matrix. As such, duplicateCorrelation makes better use of the information in the data set as it doesn't have to estimate each patient-specific effect separately; however, this comes at the cost of having to estimate the correlation (related to the variability of the patient effects) which is not easy to do with low numbers of patients, when there are systematic differences between patients, or in the presence of outlier patients. These are rather esoteric considerations so, for all intents and purposes, the practical choice is driven by the comparisons you want to do.

ADD REPLY
0
Entering edit mode

Clear, thanks a lot.

ADD REPLY
0
Entering edit mode
aec ▴ 90
@aec-9409
Last seen 4.6 years ago

Aaron, I would like to know how sex and batch affect the results. How to test if they are significant? 

ADD COMMENT
0
Entering edit mode

We do this informally by just seeing if they increase or decrease the number of DE genes when we include them in the model. However, if you're seeing separation by sex/batch in the MDS plot, I would suggest that you keep them in your model, regardless of how the numbers of DE genes change. In such cases, it's clear there's an effect that should be modelled.

ADD REPLY

Login before adding your answer.

Traffic: 565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6