I would suggest that you arrange your data so that each row contains all the counts for both alleles of each SNP, for all treated/untreated samples from all individuals. This means that you would get 12 counts per gene in total (4 counts per individual). You could then set up your design like:
allele <- factor(rep(c("alt", "ref"), 6), level=c("ref", "alt"))
individual <- factor(rep(1:3, each=4))
treatment <- factor(rep(c("T", "T", "U", "U"), 3))
design <- model.matrix(~0 + individual:treatment + allele:treatment)
This design is a bit complicated, so bear with me. If we look at colnames(design)
, we get:
[1] "individual1:treatmentT" "individual2:treatmentT" "individual3:treatmentT"
[4] "individual1:treatmentU" "individual2:treatmentU" "individual3:treatmentU"
[7] "treatmentT:allelealt" "treatmentU:allelealt"
The first six coefficients represent the blocking factors for each individual/treatment combination, i.e., the baseline log-expression of the reference allele in each individual/treatment combination. The last two coefficients represent the log-fold change of the alternative allele over the reference (i.e., allele-specific expression) in the treated or untreated samples. You can drop either of these two coefficients individually to test for ASE in each treatment type, or you can compare them to each other (e.g., by setting the contrast
argument to c(0,0,0,0,0,0,-1,1)
in glmLRT
) to test for differential ASE between treatment types.
For the sake of completeness, I will also describe a more sophisticated model:
design <- model.matrix(~0 + individual:treatment + individual:allele + allele:treatment)
If we look at the colnames(design)
, we get:
[1] "individual1:treatmentT" "individual2:treatmentT" "individual3:treatmentT"
[4] "individual1:treatmentU" "individual2:treatmentU" "individual3:treatmentU"
[7] "individual1:allelealt" "individual2:allelealt" "individual3:allelealt"
[10] "treatmentU:allelealt"
The first six coefficients are as previously described; the next three coefficients represent the alternative/reference ASE in the treated sample of each individual; and the last coefficient represents the log-fold change in the ASE in the untreated over treated samples across individuals. In other words, the last coefficient can be dropped to test for differential ASE upon treatment. This model is more sophisticated as it allows for individual-specific ASE that might have inflated the dispersions in the previous model (thus leading to a decrease in power). However, the cost is that this model provides fewer residual d.f., which will also reduce power.
Hi Aaron, thank you very much. Where can I read more about the designing the model from
glm
? I would like to get thorough concepts in creating adesign.
Here, are we comparing the ratios of ref/alt as a pairwise manner between treated vs untreated ?The
edgeR
user's guide has several good examples about parametrisation, so that's a good place to start. For this particular analysis, you have several pairings; pairing between treated and untreated per individual, and pairing between alternative and reference. The second model uses the pairing between treated and untreated, whereas the first model does not. Both models use the pairing between alternative and reference within each treatment/individual combination.