Dear all,
I have a question on how to used the nested matrix to analyse my RNAseq data.
Here is the description of my data. I have one continuous factor measuring the level of protein A by immunostaining. The second factor is categorical representing 3 different regions (BS, HP, CX) of the brain.
The key questions we want to ask are:
-
Is there any genes differentially expressed according to the level of protein A measured by immunostaining .
-
If there regional differences for genes which are affected / in response by protein A (which is more interested to us).
Here is what the meta data looks like,
>head(meta)
protein |
regions |
---|---|
9.28453159356615 |
BS |
10.9531507793882 |
BS |
11.1858664750967 |
HP |
8.85570185738541 |
CX |
11.2412805317039 |
HP |
10.3046406236311 |
BS |
9.82840322124053 |
CX |
10.4616254492949 |
HP |
10.2353912875853 |
HP |
9.94273889586413 |
CX |
6.92469668752532 |
HP |
3.13141359991371 |
BS |
5.59111480089629 |
BS |
5.75868326124029 |
CX |
3.48120813508149 |
BS |
3.25876039462462 |
BS |
6.13335849940239 |
CX |
6.78688204460976 |
CX |
6.57238798262957 |
CX |
7.53613331510434 |
HP |
I have been trying three designs:
> design1a=model.matrix(~0+protein+protein:regions,meta)
> head(design1a)
protein protein:regionsBS protein:regionsCX protein:regionsHP
1 8.820494 0.000000 8.820494 0.00000
2 10.576712 0.000000 10.576712 0.00000
3 9.722720 0.000000 0.000000 9.72272
4 10.042670 10.042670 0.000000 0.00000
5 8.459912 8.459912 0.000000 0.00000
6 9.694727 0.000000 9.694727 0.00000
> design1b=model.matrix(~protein+protein:regions,meta)
> head(design1b)
(Intercept) protein protein:regionsCX protein:regionsHP
1 1 8.820494 8.820494 0.00000
2 1 10.576712 10.576712 0.00000
3 1 9.722720 0.000000 9.72272
4 1 10.042670 0.000000 0.00000
5 1 8.459912 0.000000 0.00000
6 1 9.694727 9.694727 0.00000
> design2=model.matrix(~regions+protein:regions,meta)
> head(design2)
(Intercept) regionsCX regionsHP regionsBS:protein regionsCX:protein regionsHP:protein
1 1 1 0 0.000000 8.820494 0.00000
2 1 1 0 0.000000 10.576712 0.00000
3 1 0 1 0.000000 0.000000 9.72272
4 1 0 0 10.042670 0.000000 0.00000
5 1 0 0 8.459912 0.000000 0.00000
6 1 1 0 0.000000 9.694727 0.00000
For design 1a, it seems easy for me to understand, as i can estimate coefficient 2 for the main effect of protein A, and estimate glmQLFTest(fit_design1a, coef=3:5)
for any regional differences in response to protein A. But this design matrix is not full rank.
So design 1b is probably more correct. But i am not sure how to interpret design 1b.
If I try to used glmQLFTest(fit_design1b, coef=3:4),
does it give me those genes which are most variable across regions in response to protein A, or does it give me genes which are most different from the base level regionBS in response to protein A?
And design2 , which i can no longer measure the main effect of protein A, but if am only interested in the regional effects in response to protein A ( or regional differences of protein A), can I still use this design and ignore the main effects, only measure the nested coeffcients. glmQLFTest(fit_design2, coef=4:6)?
Dear Aaron,
Thank you very much for the detail explanation. And designY is what i've been looking for.
I still have a question related to this design.
1. "set
coef=4:6
, which will do an ANODEV to test for whether there is any effect (or more strictly speaking, association) between protein A staining and each gene's expression".I am a little confused here. So wouldn't a simple design also test this (association of proteinA staining with each gene expression ? For example, using
model.matrix(~protein).
My initial interpretation of setting coef=4:6 was that this will test whether there is any differences between regions in how proteinA associates with each gene expression. Or maybe I misunderstood here?
Your simple design is probably insufficient, as it fails to account for region-to-region differences in the intercept or gradient of the relationship between gene expression and protein A staining. As a result, the variance will be inflated which will reduce your power to detect a significant relationship. At worst, there might be a non-zero gradient in each region (potentially of differing signs), but the net gradient is zero when all regions are combined together in the simple model. This would result in a complete failure to detect a non-zero effect in each tissue.
Dear Aaron,
Thank you very much for clarifying the differences. So I followed the instruction and performed the following step:
So by setting coef =4:6 , i am testing if there is any genes significantly associated with protein A after accounts the variations between different regions. For example, can I interpret the above example Ctsd is significantly associated with proteinA , and with one unit changes in the immunostaining of protein A , there is a 0.5483 changes in Ctsd expression in the BS area?
Also does a significant p-values here indicate there is at least one region out of three are significantly associated with protein A , or does it indicate all three regions are significantly associated with protein A?
And lastly, if I want to know which genes are significantly different between regions in their association with protein A, is it correct to use contrast to do pairwise comparison between all levels (coefficients 4:6) and then pool all the p-values to do a BH p-adjust?
Question 1: Yes. Though the ANODEV will only tell you that there is a significant association in any region; it doesn't tell you that the association in the BS region is significant. If you want that, you'll have to test
coef=4
directly.Question 2: A rejection means that at least one region has a non-zero gradient.
Question 3: You can still do an ANODEV with the comparisons between coefficients, e.g.,