Dear all,
I have a question on how to design the model matrix to properly analyze my RNAseq data.
Here the summary of my data: I have two conditions (young and old). In each condition I have 3 biological replicates (sample) and each biological replicate is replicated twice (technical replicate), in total 12 libraries. My technical replicates are not different runs or different sequencing machines, they are two libraries generated from one single biological replicate. All 12 samples were sequenced together. So it is different from a batch effect.
Here is the structure of my target
> target
condition sample technical
1 young 1 a
2 young 1 b
3 young 2 a
4 young 2 b
5 young 3 a
6 young 3 b
7 old 4 c
8 old 4 d
9 old 5 c
10 old 5 d
11 old 6 c
12 old 6 d
My aim: I want to know the differentially expressed genes between young and old.
I think I need to design the model.matrix taking into account the technical replicates nested within sample nested within condition (if it was a standard glm I would write it like this: ~condition/sample/technical) but I am not sure how to do it in edgeR
I tried
design<-model.matrix(~condition + technical, data=target)
fit <-glmFit(d,design)
lrt<-glmLRT(fit,coef="condition")
but I do not think this is correct because my technical replicates should be nested within sample.
From the edgeR userguide, I got the following suggestion:
design<-model.matrix(~condition + condition:technical, data=target)
fit <-glmFit(d,design)
lrt<-glmLRT(fit,coef="condition")
but still I am missing the sample term which I think it is necessary to give as input for the full nested design.
The following two options will not work as they give errors at the dispersion estimate step:
design<-model.matrix(~condition + condition:sample:technical, data=target)
design<-model.matrix(~condition:technical, data=target)
Any idea is really really welcome!
Thanks for your help!
Silvia
Hi Aaron,
thank you very much for your suggestion!
I am trying to do what you suggested and it looks like it's working. I thought about pooling the technical replicates, but I was not sure at which point I was supposed to pool.
And this is what I get when I check the NB dispersions
summary(d.temp$trended)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05544 0.26690 0.45460 0.42990 0.61060 0.75630
I still think that I am probably loosing some information by pooling the technical replicates, but it is more conservative.
Thanks again
Silvia
Okay, those dispersions are pretty high. Make sure your
design.temp
is constructed usingtarget$sample
; my original post was incorrectly usingtarget$technical
. The other possibility is that you're getting high technical dispersions for low-abundance genes, so filtering might help; but I doubt it, because even the lowest trended dispersion is still quite high.Anyway, regarding your last comment; if the counts were truly Poisson-distributed between replicates, you wouldn't lose any information. This is because the expression for the likelihood (conditional on the mean) is effectively the same regardless of whether you pool the counts or keep them separate. As a result, you will obtain the same coefficients and dispersion estimates from either approach. Of course, pooling is more convenient as it allows us to plug it in to the existing framework.
Hi Aaron, thanks again for your comment. Yes, my design.temp is constructed using target$sample. Indeed the dispersions are quite high, this is why I don' t think pooling is the best option. However, at the moment it looks like it is the best I can do using edgeR. When I look at the heatmap of the significant differentially expressed genes that I get after pooling the technical replicates (heatmap generated using the Trinity command analyze_diff_expr.pl and the matrix.TMM_normalized.FPKM generated with Trinity), I can see that the technical replicates are rather different and not clustering as I would expect them to do...
By the way, do you have any suggestion (or a link) on how I could generate my heatmap for the differentially expressed genes, after pooling the replicates? I mean a heatmap where I would see only my 6 samples, 3 per condition.
Thank you very much!
Silvia
Well, the presence of these kinds of large technical overdispersion is a troubling issue. How are you getting your counts? If you're counting de novo transcripts assembled with Trinity, there might be extra variability introduced by assembly process, e.g., different assemblies formed between libraries.