Hi, I am analyzing RNA-seq data of a tree infected with 2 pathogens. The experiment lay-out is like this:
- condition 1(C): tree + water (control)
- condition 2(P1): tree + pathogen1
- condition 3(P1_P2): tree + pathogen1 + pathogen2
I inherited this code and I succeeded using it to call DEGs for condition 2 (contrast = 2-1; 2401 DEGs) and 3 (contrast = 3-1; 5209 DEGs), but when I tried calling DEGs for differences between condition 2 and 3 (contrast = either 2-3 or 3-2), the result is always 0. I expect there are some differences between the two conditions given that, when compared to control, the number of DEGs found in the two conditions are quite different. May you please help enlighten if I have use wrong codes or this is what I can expect if there is no difference between the two conditions?
I also wonder if I really need the line "colnames(design) <- levels(group)" in my code. I tried reading up about a function of colnames, but I don't understand what it actually does.
All suggestions and advice are appreciated. Thanks a lot in advance for your help! :)
Code should be placed in three backticks as shown below
> rawCountTable <- read.table("counts.txt", header=TRUE, sep="\t", row.names="Geneid")
> sampleInfo <- read.table("sampleInfo.csv", header=TRUE, sep=",", row.names="file")
> group <- factor(sampleInfo$condition)
> y <- DGEList(counts=rawCountTable,group=sampleInfo$condition)
> keep <- filterByExpr(y)
> y <- y[keep,,keep.lib.sizes=FALSE]
> y <- calcNormFactors(y, method="TMM")
> design <- model.matrix(~0+group)
$design
C P1 P1_P2
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 1 0
6 0 0 1
7 0 0 1
8 0 0 1
> colnames(design) <- levels(group)
> y <- estimateDisp(y,design,robust=TRUE)
> y$common.dispersion
> fit <- glmQLFit(y,design,robust=TRUE)
> my.contrasts <- makeContrasts(vsP1=P1-C, vsP1_P2=P1_P1-C, vsdiff=(P1_P2-C)-(P1-C), vsdiff2=(P1-C)-(P1_P2-C)), levels=design)
> my.contrasts
Contrasts
Levels P1_P2 P1 vsdiff vsdiff2
control -1 -1 0 0
P1 0 1 -1 1
P1_P2 1 0 1 -1
> qvsdiff <- glmQLFTest(fit, contrast=my.contrasts[,"vsdiff"])
> toptags.qvsdiff <- topTags(qvsdiff, n=nrow(qvsdiff$table),adjust.method="fdr")
> sig005.qvsdiff <- toptags.qvsdiff$table[toptags.qvsdiff$table$FDR<0.05,]
> sum(toptags.qvsdiff$table$FDR < 0.05)
0
It hard to be sure what you've done because the output you show doesn't correspond to the code. The design matrix you show is not as output by
model.matrix
and the contrast matrix you show is not produced by themakeContrasts
call that you show. Can you please run the code again and regenerate the output?Hi Prof. Smyth, Thanks again for your reply. I am sorry for the previous code. I was trying to make the factor names to be easily recognised, but it seems I am not that smart. I am sorry for this. Anyway, here is the more or less original code and in this case, A = control, B = P1 and C = P1_P2. I I tried rerunning it again to be sure that it works with some of the results. :)
Thanks a lot again for your time and have a great day!
OK, thanks. Your code seems fine but do you appreciate that
vsBACA=(B-A)-(C-A)
andvsCABA=(C-A)-(B-A)
are the same contrast (just with signs reversed) and that both are the same asB-C
? The control groupA
just cancels out of both contrasts, so it is unnecessary.Hi, I am sorry, but I don't understand why they both are B-C. From the contrast matrix, one is B-C and another is C-B. I read the edgeR manual (section 3.3.1) where they use DrugvsPlacebo.2h = (Drug.2h-Drug.0h)-(Placebo.2h-Placebo.0h). I guess the reason the control group A is unnecessary in my case is that I am working with only one control? And for my learning, may you explain me how to achieve the contrast C-B, please?
Thank you very much and have a great day!
B-C and C-B are the same thing, just with reversed signs for the logFCs. They have exactly the same t-statistics, p-values and so on. There is no need to do both.
Hi Prof. Smyth, thanks so much for the explanation. It is clear now for the B-C and C-B. I however don't understand why the control group A is unnecessary. I understand that (B-A)-(C-A) would detail DE genes (after contrasting with group A) that is significant in B but not in C, whereas B-C would detail ANY genes that is differentially expressed in B but not in C. I am sorry for having too many questions, but I hope you can help to correct me.
Thanks again and have a great day!
No, you are over-interpretting the contrasts, adding on interpretations that they don't have. The B-C contrast simply finds genes that are differentially expressed between B and C, nothing more and nothing less. The null hypothesis for B-C is that muB - muC = 0 where muB and muC are the expected log-expressions for the two groups.
You can see that A cancels out just by elementary math: (B-A)-(C-A) = B - A - C + A = B - C + A - A = B - C.
Aha, it is crystal clear now. I just followed the section 3.3.1 in the edgeR manual without being aware that it's another situation there. Thanks so much for your time and the explanation! Have a great day! :)
Please don't add random comments on 11 month old posts. I've already noted in your other post that this isn't a Bioconductor question, and you should ask for help elsewhere.