Dear list,
I want to start off my thanking the edgeR developers, for making, maintaining and supporting edgeR. I have learned a lot by reading the helpful support provided through these lists.
I have read the edgeR manual as well as multiple posts about this topic, but nevertheless I am still not 100% sure whether my approach is correct. I apologise if I have missed something obvious, and am gladly pointed elsewhere if this is the case.
I have performed RNAseq on 80 samples in a fairly complicated full factorial experimental design with three factors. I am interested in both the main effects of these factors, and in their interaction. Despite reading the manual and similar posts, I would be very grateful for some input as to whether I am implementing my experimental design properly in edgeR.
I've tried to simplify my data as much as possible for these questions, and inspired by a very helpful earlier answer by Dr. Gordon Smyth I am using simulated data for 2 genes, rather than real data. My full code (incl sessionInfo() is below this email. My main question is whether I am implementing my models correctly.
(part of my simplication is also to not bother with normalization, and not estimate dispersions--it's about the model definition).
These are my 3 experimental factors:
A - 2 levels (high, low)
B - 2 levels (control, stress)
C - 4 levels (1, 2, 3, 4)
Thus, there are 16 unique combinations of factor levels, and for each combination my sample size is 5. This is a summary of my samples, below this email in the full code is the full list of 80 samples in relation to the three factors:
> A <- as.factor(rep(c("high","low"),40))
> B <- as.factor(rep(rep(c("control","stress"),each=2),20))
> C <- as.factor(rep(rep(1:4,each=4),5))
> samples <- data.frame(A, B, C)
> table(samples$A, samples$B, samples$C)
, , = 1
control stress
high 5 5
low 5 5
, , = 2
control stress
high 5 5
low 5 5
, , = 3
control stress
high 5 5
low 5 5
, , = 4
control stress
high 5 5
low 5 5
This is my simulated expression data set (2 genes, 80 samples)
> y <- matrix(rpois(8000,lambda=200),2,80)
> lib.size <- rep(10000,80)
> offset <- log(lib.size)
Firstly, I am interested in the main effects:
- Which genes are affected by A, controlling for B and C?
- same for main effects of B, and C (C has 4 levels, but I'm not interested in specific effects of one particular level, but instead I want to know whether there is any effect of C)
Secondly, I am also interested in the interaction between A and B, in the following two ways:
- In general, which genes are affected by the interaction between A and B? In other words, which genes respond differently to B depending on A?
- Specifically, and following up on the general interactive effect of AxB, I am additionally interested in which genes are affected by B within each level of A. So, for A="high", which genes are affected by B? And for A="low", which genes are affected by B?
I am certainly aware of the advise to use contrasts to specify models rather than factors, but in addition to using contrasts I would like to be able to do this using factors.
So, as I have understood it from the examples in the manual and from several posts on the BioC mailing list / support site, I should fit one model for testing the main effects, and separate models for testing each of the interactive effects. Previously, I had fitted a single, full model with all main and interactive terms, and I thought I could use it to test all terms. However, after reading more online about edgeR, I now believe that this was incorrect and that I should fit separate models. Am I correct in this?
This would be the model for the main effects
> design_main <- model.matrix(~A+B+C)
> fit_main <- glmFit(y,design_main,offset=offset,dispersion=0)
The following gives me the coefficients I need for the glmLRT:
> colnames(design_main)
[1] "(Intercept)" "Alow" "Bstress" "C2" "C3" "C4"
Now I use glmRT and the appropriate coefficients to test for each of the main effects:
> glmLRT(fit_main,coef=2)$table # main effect A
logFC logCPM LR PValue
1 -0.058139094 14.3079 6.5320502 0.01059478
2 0.008096458 14.3039 0.1263495 0.72224749
(Note that I have only two genes in my data set.)
> glmLRT(fit_main,coef=3)$table # main effect B
logFC logCPM LR PValue
1 -0.007894384 14.3079 0.1204581 0.7285382
2 -0.022490557 14.3039 0.9749272 0.3234544
> glmLRT(fit_main,coef=4:6)$table # main effect C
logFC.C2 logFC.C3 logFC.C4 logCPM LR PValue
1 -0.020330032 -0.04569516 -0.009950886 14.3079 2.2265738 0.5267331
2 0.004687591 0.01758968 -0.001083918 14.3039 0.4258204 0.9348584
Is this approach correct?
For the AxB interaction effect, I fit a different model:
> design_AxB <- model.matrix(~A*B+C)
> fit_AxB <- glmFit(y,design_AxB,offset=offset,dispersion=0)
> colnames(design_AxB)
[1] "(Intercept)" "Alow" "Bstress" "C2" "C3" "C4" "Alow:Bstress"
I want to know the effect of the interaction, so I test coefficient 7 which gives me the following result:
> glmLRT(fit_AxB,coef=7)$table
logFC logCPM LR PValue
1 0.09015271 14.3079 3.9252329 0.0475666
2 -0.02830344 14.3039 0.3859811 0.5344195
Now I want to test the effect of B within each level of A. My question is whether I can do this from my previous model, and if so, how. Or should I make a new model like this:
> design_B_within_A <- model.matrix(~A+A:B+C)
> colnames(design_B_within_A)
[1] "(Intercept)" "Alow" "C2" "C3" "C4"
"Ahigh:Bstress" "Alow:Bstress"
> fit_B_within_A <- glmFit(y,design_B_within_A,offset=offset,dispersion=0)
Effect of B within level "high" for A:
> glmLRT(fit_B_within_A,coef=6)$table
logFC logCPM LR PValue
1 -0.052064376 14.3079 2.67201634 0.1021266
2 -0.008299669 14.3039 0.06619957 0.7969528
Effect of B within level "low" for A:
> glmLRT(fit_B_within_A,coef=7)$table
logFC logCPM LR PValue
1 0.03808834 14.3079 1.373675 0.2411815
2 -0.03660311 14.3039 1.294709 0.2551820
I suspect this latter approach is correct but I am not 100% certain.
I had previously fitted one single, full factorial model, under the impression that I could use the same model to test all main and interactive effects. I now believe this is incorrect, but I it'd be great to have some input on this. I proceeded as follows:
> design_AxBxC <- model.matrix(~A*B*C)
> fit_AxBxC <- glmFit(y,design_AxBxC,offset=offset,dispersion=0)
Use the following coefficients to test specific (main and interactive) effects:
> colnames(design_AxBxC)
[1] "(Intercept)" "Alow" "Bstress" "C2"
"C3" "C4" "Alow:Bstress" "Alow:C2"
[9] "Alow:C3" "Alow:C4" "Bstress:C2" "Bstress:C3"
"Bstress:C4" "Alow:Bstress:C2" "Alow:Bstress:C3" "Alow:Bstress:C4"
Use coefficient 2 to test main effect of A:
> glmLRT(fit_AxBxC,coef=2)$table
logFC logCPM LR PValue
1 -0.1117345 14.3079 3.021610 0.082161879
2 0.1906238 14.3039 8.851755 0.002928072
Similarly, I used coefficient 3 for main effect B, 4:6 for main effect C, coefficient 7 to test two-way interaction effect of AxB, coefficients 8:10 to test AxC, and finally coefficients 14:16 to test three-way interaction effect of AxBxC:
I now suspect that this approach is incorrect, any thoughts?
Many thanks for any help provided, and apologies if I didn't manage to explain myself clearly.
All the best,
Vicencio
(UCL, UK)
full code here: https://www.dropbox.com/s/o61puleaoz7gk1m/BioC_edgeR_question.doc?dl=0
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.10.0 limma_3.24.0
loaded via a namespace (and not attached):
[1] tools_3.2.0 KernSmooth_2.23-14 splines_3.2.0
>
Dear Aaron,
many thanks for your quick response. The interactions are indeed easier to implement using the contrast approach. However, there is one thing that I want from my data (which I should have mentioned in my post) that I am unsure how to do in your contrast approach:
[start quote]
I'd probably do a comparison between levels for A at each combination of levels for B and C. Assuming you have a
fit
object for the abovedesign
matrix:If you want to summarize these results into a single DE list, then it gets a bit complicated. For example, if you want to find genes affected by A regardless of B or C, then I'd intersect DE lists from all of the contrasts to find the common set that changes in all comparisons.
[end quote]
What I would really like to have is not a list of 'DE' genes (produced by an arbitrary p value cut-off), but, for each gene, a fold change value for the main effect of A (and also for each gene a single fold change value for the main effect of B). (I want to use the whole distribution of fold change values as a transcriptome-wide measure of response to the main effect). I am unsure how to do this using the contrast approach that you explain, whereas a traditional factor approach would seem more intuitive to me (if I can get it defined correctly).
I would really like to analyse my data using a factor rather than contrast parametrization (I apologise for being stubborn, and I understand if you're tired of explaining something the easy way and people still want to do it the hard way). The main reason (in addition to the specific issue above) is that I initially got two very different results using those two approaches, and to me that is very suspicious. I actually understand the factor approach more intuitively (at least I thought I did!) which is why stuck with that. However, I recently discovered that I likely made a mistake there, hence my post. I would have much more confidence in my data if I could get to the same results (same fold change, LR, p values) using the contrast approach and the factor approach.
Perhaps I can simplify my data even more and summarize my problem as follows.
Given again
> y <- matrix(rpois(8000,lambda=200),2,80)
> lib.size <- rep(10000,80)
> offset <- log(lib.size)
> A <- as.factor(rep(c("high","low"),40))
> B <- as.factor(rep(rep(c("control","stress"),each=2),20))
but now omitting factor C
> samples <- data.frame(A, B)
To test main and interaction effects using the factor approach, should I a) fit separate models for main & interaction effects, or b) fit one full model, and test the main and interaction effects from that single model?
Factor approach a:
> # first additive model for main effects; then test them
> design_main <- model.matrix(~A+B)
> fit_main <- glmFit(y,design_main,offset=offset,dispersion=0)
> glmLRT(fit_main,coef=2)$table# main effect A
logFC logCPM LR PValue
1 -0.01593154 14.29505 0.4861903 0.4856317
2 0.02354142 14.29469 1.0613029 0.3029182
> glmLRT(fit_main,coef=3)$table# main effect B
logFC logCPM LR PValue
1 -0.01882826 14.29505 0.6790606 0.4099104
2 -0.01412465 14.29469 0.3820663 0.5364994
>
> # fit full model for A x B interaction, then test general interaction effect
> design_AxB <- model.matrix(~A*B)
> fit_AxB <- glmFit(y,design_AxB,offset=offset,dispersion=0)
> glmLRT(fit_AxB,coef=4)$table
logFC logCPM LR PValue
1 -0.04221268 14.29505 0.85326183 0.3556312
2 0.01182093 14.29469 0.06689629 0.7959106
>
> # fit different model to test specific effect of B at each level of A
> design_B_within_A <- model.matrix(~A+A:B)
> fit_B_within_A <- glmFit(y,design_B_within_A,offset=offset,dispersion=0)
> glmLRT(fit_B_within_A,coef=3)$table# effect of B at A="high"
logFC logCPM LR PValue
1 0.002160532 14.29505 0.004495505 0.9465431
2 -0.020083399 14.29469 0.383059954 0.5359701
> glmLRT(fit_B_within_A,coef=4)$table# effect of B at A="low"
logFC logCPM LR PValue
1 -0.040052151 14.29505 1.52782690 0.2164389
2 -0.008262469 14.29469 0.06590267 0.7973987
Factor approach b:
> # fit full model, then test all (main & interactive) effects from this model
> design_AxB <- model.matrix(~A*B)
> fit_AxB <- glmFit(y,design_AxB,offset=offset,dispersion=0)
> colnames(design_AxB)
[1] "(Intercept)" "Alow" "Bstress" "Alow:Bstress"
> glmLRT(fit_AxB,coef=2)$table# main effect A
logFC logCPM LR PValue
1 0.005036218 14.29505 0.02445111 0.8757426
2 0.017659990 14.29469 0.30008936 0.5838264
> glmLRT(fit_AxB,coef=3)$table# main effect B
logFC logCPM LR PValue
1 0.002160532 14.29505 0.004495505 0.9465431
2 -0.020083399 14.29469 0.383059954 0.5359701
> glmLRT(fit_AxB,coef=4)$table# general interaction effect
logFC logCPM LR PValue
1 -0.04221268 14.29505 0.85326183 0.3556312
2 0.01182093 14.29469 0.06689629 0.7959106
My hunch is that approach a is correct, and b is not. The main effects of A and B are different in the two approaches, and also in approach b I am unsure how to test for a specific effect of B at each level of A. Furthermore, approach a gave me the exact same results for the two main effects and the general interaction effect as computing the anovas 'manually' for each gene separately using the this code:
gene1.model <- glm(y[1,] ~ A*B, family="poisson", offset=offset) # glm for gene 1
anova(gene1.model, test="Chi") # anova table for gene 1
The reason I am not satisfied with my approach a is that I am not sure whether it is correct, and and particular whether I am getting the B within A effects right (i.e. test for a specific effect of B at each level of A).
Again, many thanks for your help.
All the best,
Vicencio
The traditional factor approach is superficially more intuitive for examining the main effect of A or B, but it actually just papers over the problem. What happens if you have a gene that is, e.g., up in high-control against high-stress, but down in low-control against low-stress? I don't think it's sensible or safe to talk about the main effect of B (control/stress) if that effect changes depending on its interaction with A (high/low).
Nonetheless, something akin to that comparison can be done in the one-way layout. The following looks at the "main effect" of A (run it through
glmLRT
as before).This tests whether the average across all "high" groups is equal to the average across all "low" groups, to examine the overall effect of factor A.
As for the interaction model (i.e., your "b" approach); dropping the second coefficient tests for differences between "high" and "low" within the "control" condition. Dropping the third coefficient tests for differences between "control" and "stress" within the "high" condition. Check out the
design_AxB
matrix if you're not convinced. In both cases, you're not testing for "main effects" of either factor.While
design_main
does test for "main effects", the sensibility of doing so is questionable. You can also expect to get inflated dispersions if you use this design matrix inestimateDisp
, because you're assuming that the effects are additive.Thanks for detailing how to test for an 'average' effect of A over all values of B, using your contrast approach. I think this is the way I will take my data forward.
[quote] The traditional factor approach is superficially more intuitive for examining the main effect of A or B, but it actually just papers over the problem. What happens if you have a gene that is, e.g., up in high-control against high-stress, but down in low-control against low-stress? I don't think it's sensible or safe to talk about the main effect of B (control/stress) if that effect changes depending on its interaction with A (high/low). [end quote]
I agree, and this is why I want to, after looking at the main (average) effect of B, then examine the specific effects of B within each level of A. Basically, I want to analyse how much of the whole-transcriptome response to B is driven by changes irrespective of A, and then compare that with changes dependent on the value of A. This sounds abstract, but consider A to be age and B some kind of food treatment. Some genes will only be affected by food (B) at one specific age A1 or A2, while other genes will be affected irrespective of age (A). The latter group might be very different kinds of genes ('general food response genes') from the former ones ('old people-specific food response genes' or whatever). I want to get a sense of how much of the overall response is in either category.
The reasons I am not satisfied with just producing gene lists based on a p value cut-off, and then intersecting them, are
1) While it's easy to determine whether a gene is significantly DE to produce a list of DE genes, I am not comfortable with calling an effect non-significant based on e.g. FDR>10%, and then producing a list of 'non-DE genes'.
2) I am also interested in the magnitude of transcriptional responses, not just their statistical (non)significance. Therefore, computing (for each gene) fold change due to the 'average' effect of B, and then comparing that to fold change due to the A1-specific effect of B and the A2-specific effect of B would be a valuable analysis to get at my goal of analysis general vs. condition (age)-dependent responses.
I'm happy to do this in a contrast set up, as long as I'm sure I'm doing it correctly, so many thanks for the pointers.
[quote] As for the interaction model (i.e., your "b" approach); dropping the second coefficient tests for differences between "high" and "low" within the "control" condition. Dropping the third coefficient tests for differences between "control" and "stress" within the "high" condition. Check out the
design_AxB
matrix if you're not convinced. In both cases, you're not testing for "main effects" of either factor. [end quote]Also thanks for looking at my incorrect approach 'b', which I embarked upon initially after getting confused by the design_AxB matrix. For coefficient 2, the matrix has a '1' for all samples with A=low, and a '0' for all samples with A=high. I incorrectly assumed this to mean that dropping this coefficient would compare the 1s with the 0s.
[quote] While
design_main
does test for "main effects", the sensibility of doing so is questionable. You can also expect to get inflated dispersions if you use this design matrix inestimateDisp
, because you're assuming that the effects are additive. [end quote]I tried this approach because I came across this post:
A: Testing main effects and interactions in edgeR without contrasts
where Gordon Smyth implemented a 2x3 design both in edgeR, and in the standard R function glm(), and demonstrated that they're equivalent. I suppose this was just to demonstrate the equivalent of approaches, not to champion this approach in general, right? (due to the deviance inflation you mention).
Many thanks again, your help is much appreciated!