Hi everyone,
I am trying to fit a multi factor comparison using EdgeR
I want to analyse expression data from samples collected from different populations at different latitudes (3 populations at both latitudes) which were subjected to a treatment.
I want to know the effect of latitude (lat), treatment (treat) and their interaction on the expression data.
My data looks like this:
sample pop lat treat
1 1 N 1
2 1 N 1
3 1 N 2
4 1 N 2
5 2 N 1
6 2 N 1
7 2 N 2
8 2 N 2
9 3 N 1
10 3 N 1
11 3 N 2
12 3 N 2
13 4 S 1
14 4 S 1
15 4 S 2
16 4 S 2
17 5 S 1
18 5 S 1
19 5 S 2
20 5 S 2
21 6 S 1
22 6 S 1
23 6 S 2
24 6 S 2
When I want to fit a model using this design
design <- model.matrix(~lat+pop+treat+lat:treat)
the matrix is not of full rank because pop is nested in lat.
I have read in another post (DESeq2 paired multifactor test) and in the Edge R user guide (section 3.5) that this can be solved by creating another population variable (pop.nested) like this:
sample pop lat treat pop.nested
1 1 N 1 1
2 1 N 1 1
3 1 N 2 1
4 1 N 2 1
5 2 N 1 2
6 2 N 1 2
7 2 N 2 2
8 2 N 2 2
9 3 N 1 3
10 3 N 1 3
11 3 N 2 3
12 3 N 2 3
13 4 S 1 1
14 4 S 1 1
15 4 S 2 1
16 4 S 2 1
17 5 S 1 2
18 5 S 1 2
19 5 S 2 2
20 5 S 2 2
21 6 S 1 3
22 6 S 1 3
23 6 S 2 3
24 6 S 2 3
Yet, how does EdgeR handle this data? It seems statistically not sound to me as now population 1 and 4 are assumed to be same in the model?
Could someone comment on this?
Thanks a lot!
Hi, I have another question related to this answer. Can I also compare latN.treat20 to latS.treat20 using this design? So compare the gene expression within the 20 treatment between the N and S latitude? How would you go about to test this? Thanks in advance!
Where did the 20 come from? Your post only has values of 1 or 2 for
treat
.Sorry, I used the wrong numbers. So using the above-mentioned design, my question is: Can I also compare latN_treatment1 to latS.treatment1 using this design? So compare the gene expression within the treatment 1 between the N and S latitude? How would you go about to test this? Would this involve making a subset of the data? Thanks in advance!
Remember,
latN_treatment2
andlatS_treatment2
represent the log-fold change between treatments at each latitude; you can't have a separate coefficient likelatN_treatment1
orlatS_treatment1
in this design, because the log-fold change is already computed in treatment 2 relative to treatment 1.More generally, if you want to directly compare expression between the N/1 and S/1 groups, you can't do this with this design because the different latitudes belong in different
pop
levels. You've already blocked onpop
, so a direct comparison is simply not possible. Rather, you'll have to subset the data to only take the treatment 1 samples, and then pool the two treatment 1 samples at eachpop
level, such that there would be one pooled library perpop
level. These can be treated as direct replicates and modelled with a simple one-way layout based on latitude.Allright, thanks!
I'm still wondering though: could I also make the dummy variable "pop.nested" (as below) and then make a design like this: design = ~0 +lat + lat:pop.n
Would this be a valid alternative for pooling my samples within the populations?
No. This doesn't properly account for correlations for samples from the same
pop
. Consider a gene with no latitude effect; in this situation, your design implies that samples 1, 2, 13 and 14 should be independent replicates. However, 1 and 2 are from the samepop
and will be correlated, as will 13 and 14. Failure to account for this means that the degrees of freedom in the model - and the confidence of downstream inferences - will be overstated. This is why I recommended pooling, to avoid worrying about these correlations.Thanks for your reply! Do you have a suggestion on how to pool the data within populations? Just add up the counts? Or would you first do a normalisation procedure? In EdgeR would this mean adding up the rows of the samples within populatons after y <- calcNormFactors(y) ?
Run
sumTechReps
beforecalcNormFactors
.