Hi,
I have performed a RNA-seq experiment that has quite a complex design. I have 40 bee individuals belonging to 4 treatment groups (1 control group and 3 different treatments, each of which I would like to compare to the control). Each treatment has 10 biological replicates (10 individual bees). For each individual the tissue of interest was dissected in three regions, so I have 120 RNA-seq samples. The 40 individuals were collected from ten hives (for each hive I have 4 individuals, one per treatment group), and the samples were always processed in batches by hive (RNA, library prep, sequencing lane), so now I have just one batch effect of these ten batches/hives. Finally, through other methods I have realized that while performing the experiment the individuals were transitioning between two behavioural states (these two behavioural groups are not perfectly balanced across treatments and hives/batches but it is not too bad), and I would expect some differential expression between them (which does seem to show up in exploratory ordination plots). This is how my design matrix looks like
Sample Sample_ID Treatment Region Hive Behaviour
1 1 H1B1_6 MD OL H1 nurse
2 2 H1B1_6 MD MB H1 nurse
3 3 H1B1_6 MD AL H1 nurse
4 4 H1B3_5 Bi OL H1 nurse
5 5 H1B3_5 Bi MB H1 nurse
6 6 H1B3_5 Bi AL H1 nurse
7 7 H1B5_3 CLA OL H1 nurse
8 8 H1B5_3 CLA MB H1 nurse
9 9 H1B5_3 CLA AL H1 nurse
10 10 H1B6_4 CLH OL H1 nurse
11 11 H1B6_4 CLH MB H1 nurse
12 12 H1B6_4 CLH AL H1 nurse
13 13 H2B1_4 Bi OL H2 nurse
14 14 H2B1_4 Bi MB H2 nurse
15 15 H2B1_4 Bi AL H2 nurse
16 16 H2B3_3 CLH OL H2 nurse
17 17 H2B3_3 CLH MB H2 nurse
18 18 H2B3_3 CLH AL H2 nurse
19 19 H2B4_3 CLA OL H2 nurse
20 20 H2B4_3 CLA MB H2 nurse
21 21 H2B4_3 CLA AL H2 nurse
22 22 H2B5_1 MD OL H2 forager
23 23 H2B5_1 MD MB H2 forager
24 24 H2B5_1 MD AL H2 forager
25 25 H3B1_3 CLH OL H3 nurse
26 26 H3B1_3 CLH MB H3 nurse
27 27 H3B1_3 CLH AL H3 nurse
28 28 H3B3_4 Bi OL H3 forager
29 29 H3B3_4 Bi MB H3 forager
30 30 H3B3_4 Bi AL H3 forager
31 31 H3B5_2 CLA OL H3 nurse
32 32 H3B5_2 CLA MB H3 nurse
33 33 H3B5_2 CLA AL H3 nurse
34 34 H3B6_2 MD OL H3 forager
35 35 H3B6_2 MD MB H3 forager
36 36 H3B6_2 MD AL H3 forager
37 37 H4B3_4 CLA OL H4 nurse
38 38 H4B3_4 CLA MB H4 nurse
39 39 H4B3_4 CLA AL H4 nurse
40 40 H4B4_7 MD OL H4 nurse
41 41 H4B4_7 MD MB H4 nurse
42 42 H4B4_7 MD AL H4 nurse
43 43 H4B5_2 Bi OL H4 forager
44 44 H4B5_2 Bi MB H4 forager
45 45 H4B5_2 Bi AL H4 forager
46 46 H4B6_2 CLH OL H4 nurse
47 47 H4B6_2 CLH MB H4 nurse
48 48 H4B6_2 CLH AL H4 nurse
49 49 H5B1_3 CLA OL H5 nurse
50 50 H5B1_3 CLA MB H5 nurse
51 51 H5B1_3 CLA AL H5 nurse
52 52 H5B2_3 Bi OL H5 nurse
53 53 H5B2_3 Bi MB H5 nurse
54 54 H5B2_3 Bi AL H5 nurse
55 55 H5B4_4 CLH OL H5 nurse
56 56 H5B4_4 CLH MB H5 nurse
57 57 H5B4_4 CLH AL H5 nurse
58 58 H5B5_6 MD OL H5 forager
59 59 H5B5_6 MD MB H5 forager
60 60 H5B5_6 MD AL H5 forager
61 61 H6B2_5 CLA OL H6 forager
62 62 H6B2_5 CLA MB H6 forager
63 63 H6B2_5 CLA AL H6 forager
64 64 H6B4_3 CLH OL H6 forager
65 65 H6B4_3 CLH MB H6 forager
66 66 H6B4_3 CLH AL H6 forager
67 67 H6B5_4 Bi OL H6 nurse
68 68 H6B5_4 Bi MB H6 nurse
69 69 H6B5_4 Bi AL H6 nurse
70 70 H6B6_2 MD OL H6 nurse
71 71 H6B6_2 MD MB H6 nurse
72 72 H6B6_2 MD AL H6 nurse
73 73 H7B1_3 CLA OL H7 forager
74 74 H7B1_3 CLA MB H7 forager
75 75 H7B1_3 CLA AL H7 forager
76 76 H7B2_3 CLH OL H7 nurse
77 77 H7B2_3 CLH MB H7 nurse
78 78 H7B2_3 CLH AL H7 nurse
79 79 H7B3_4 MD OL H7 forager
80 80 H7B3_4 MD MB H7 forager
81 81 H7B3_4 MD AL H7 forager
82 82 H7B4_9 Bi OL H7 nurse
83 83 H7B4_9 Bi MB H7 nurse
84 84 H7B4_9 Bi AL H7 nurse
85 85 H8B5_3 MD OL H8 nurse
86 86 H8B5_3 MD MB H8 nurse
87 87 H8B5_3 MD AL H8 nurse
88 88 H8B1_3 Bi OL H8 nurse
89 89 H8B1_3 Bi MB H8 nurse
90 90 H8B1_3 Bi AL H8 nurse
91 91 H8B2_3 CLH OL H8 nurse
92 92 H8B2_3 CLH MB H8 nurse
93 93 H8B2_3 CLH AL H8 nurse
94 94 H8B3_3 CLA OL H8 forager
95 95 H8B3_3 CLA MB H8 forager
96 96 H8B3_3 CLA AL H8 forager
97 97 H9B1_3 CLA OL H9 forager
98 98 H9B1_3 CLA MB H9 forager
99 99 H9B1_3 CLA AL H9 forager
100 100 H9B3_5 CLH OL H9 forager
101 101 H9B3_5 CLH MB H9 forager
102 102 H9B3_5 CLH AL H9 forager
103 103 H9B4_7 Bi OL H9 forager
104 104 H9B4_7 Bi MB H9 forager
105 105 H9B4_7 Bi AL H9 forager
106 106 H9B5_3 MD OL H9 forager
107 107 H9B5_3 MD MB H9 forager
108 108 H9B5_3 MD AL H9 forager
109 109 H10B1_6 CLH OL H10 forager
110 110 H10B1_6 CLH MB H10 forager
111 111 H10B1_6 CLH AL H10 forager
112 112 H10B2_4 MD OL H10 nurse
113 113 H10B2_4 MD MB H10 nurse
114 114 H10B2_4 MD AL H10 nurse
115 115 H10B5_5 Bi OL H10 nurse
116 116 H10B5_5 Bi MB H10 nurse
117 117 H10B5_5 Bi AL H10 nurse
118 118 H10B6_3 CLA OL H10 nurse
119 119 H10B6_3 CLA MB H10 nurse
120 120 H10B6_3 CLA AL H10 nurse
Because the project is about analysing the effects of some specific treatments on a more peripheral tissue than where the treatment was applied, I do not expect the treatment to have had a very strong effect, but I would like to identify these minor effects in a good detail if they are there.
Ideally I would like to know:
1) What genes are DEGs between the control and each treatment overall (across all three tissue regions). 2) Are there any region specific effects of treatment, so the same pairwise comparisons as before, but by specific region (interaction between treatment and region?). 3) Are there any DEGs dependent on behavioural state (again both overall and region-specific). 4) Is there an interaction between treatment and behavioural group, i.e. specific genes may differ between treatments and control only for one of the two behavioural groups. And possibly also by region (interaction between region, behaviour and treatment?)
I first tried to analyze my data by splitting the dataset by region (so 40 samples each in three separate analyses), but like this I do not seem to get any DEG in any pair-wise comparison between treatments. So I then tried to model the whole dataset of 120 samples in one go, and like this I seem to find a few genes of interest, but I am unsure if what I have been doing is indeed correct and recommended, and how to then include and get results for specific comparisons for interaction terms. This is my code so far:
library(edgeR)
y=DGEList(counts=count.table)
Hive=factor(Design$Hive)
Region=factor(Design$Region)
Behav=factor(Design$Behaviour)
Treatment <- relevel(Design$Treatment, "MD")
Treat=factor(Treatment)
design=model.matrix(~0+Treat+Hive+Region+Behav)
#Filter low expressed genes
keep <- filterByExpr(y, design, min.count = 10, min.total.count = 20)
y <- y[keep,,keep.lib.sizes=FALSE]
dim(y)
#Now normalize counts
y=calcNormFactors(y, method = "TMM")
colnames(design)<-make.names(colnames(design))
v=voom(y,design,plot=TRUE)
cor=duplicateCorrelation(v,design,block=Design$Sample_ID)
fit=lmFit(v,design,block=Design$Sample_ID, correlation=cor$consensus)
head(coef(fit))
cm=makeContrasts(MDvsCLH= TreatCLH-TreatMD, MDvsCLA= TreatCLA-TreatMD, MDvsBi= TreatBi-TreatMD, ForagervsNurse =Behavnurse, levels=make.names(colnames(coef(fit)))) # To test treatment and behaviour effect
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2, robust=TRUE)
summary(decideTests(fit2))
MDvsCLH_table=topTable(fit2,coef="MDvsCLH",n=Inf, sort="p", p=0.05)
MDvsCLA_table=topTable(fit2,coef="MDvsCLA",n=Inf, sort="p", p=0.05)
MDvsBi_table=topTable(fit2,coef="MDvsBi",n=Inf, sort="p", p=0.05)
ForagervsNurse_table=topTable(fit2,coef="ForagervsNurse",n=Inf, sort="p", p=0.05)
MDvsCLH_table
MDvsCLA_table
MDvsBi_table
ForagervsNurse_table
In essence, I have tried to include Treatment, Region, Behaviour and Hive (the technical batches) as Fixed effects and the individuals from which the three regions came from as a Random effect. I would have included also Hive as a random effect instead of a fixed effect but I am not sure if it is possible to include two different random effects?
Like this I do get some DEGs for pairwise comparisons between treatments and also for the behavioural group comparison (more so than treatment, so, as I was expecting, that has a stronger effect) and plotting the lcpm counts of individual DEGs shows that at least a few are consistently and convincingly up-regulated in treatments compared to the control group (MD) across the three regions.
Would this approach be valid and therefore I should trust that I am not getting false positive results? And if I then want to include interactions and get treatment comparisons for all the specific regions and behaviours how could I do that?
Should I approach the different questions with separate models or should I do that all at once somehow?
Thank you very much for your kind help with this.
Best, Joanito
Dear Aaron,
Thank you so much, this was really really helpful and got me through the main questions I wanted to ask! I have a few remaining doubts that I hope you could clarify.
When I run the analysis by first creating a "group" factor that contains region, treatment and behaviour as you suggested and then test for overall treatment effects, I get less DEGs for pairwise treatment comparisons than when I used the formula:
and then run the following contrasts:
So what would be the difference of this previous approach I used with the one you suggested (apart from more easily allowing me to test all contrasts I am interested in)? I implemented your suggestion like this:
If you were to spell out the model that you suggested how would that look like? Do I get less DEGs because it also includes interactions and we then end up with less degrees of freedom?
If this is the case, then I assume that having the behaviour factor in the model will also reduce the power to test for treatment effects (but at the same time controlling for possible confounding effects of behaviour)? Is there a trade-off here? I am just wondering what would be the best thing to do because the treatment effect is clearly very small, but the few DEGs I recover seem to be absolutely non random and to make a lot of sense, but a few disappear depending on whether I use your approach and enter the behaviour factor or not. I am just wondering what would be the best way to characterize them in most detail.
Thanks a lot again for your kind help!
More residual d.f. but stronger assumptions.
Don't know what you're asking for here.
Yes, which reduces power compared to the previous model. However, the previous model also assumed that all of the interesting biological effects (behaviour, region and treatment) were additive. If this is not true, one can expect systematic biases in the estimation of the coefficients, possibly manifesting as an increased number of (false positive) DEGs.
Yes.
Yes.
If you want to look at interaction effects - or even if you don't, but interactions are present - you cannot use an additive model.