Dear All,
This is my first post on Bioconductor so if I am not clear please have some patience. I am working on an RNA-seq (human) dataset with 6 paired samples pre and post treatment and 2 independent controls. Unfortunately I don't have a bioinformatics background so I am trying to grasp the design matrix concept. I can do simple design differential expression analysis with EdgeR, but in this case I have a situation with paired samples, 2 independent samples and spike-in controls that I would like to use with RUVseq to remove variability. This means in my design matrix I have to account for the W_1 as a covariate.
How to design a matrix for this purpose? If it was only the paired samples it would be an additive model as described on the edger vignette and I would do something like:
design = model.matrix(~patient+W_1+Treat, data = pData(set1))
with W_1 from RUVseq with spike-in controls.
But in this case I also have 2 healthy controls that are completely independent from the patients pre and post treatment. So I would like to design a matrix so that I can compare differential expression within the patients pre and post treatment, between patients pre treatment and healthy controls and between patients post-treatment and healthy controls. So in the end my question is how to design a matrix and after fitting the model how to get the differential expression for each case.
If you also have any good examples to understand the design matrix and how to get the differential expression with glmLRT(fit) in edger that would be greatly appreciated.
Thanks for your kind help,
Chris
To follow up on James' comments:
If you're going to fit the second proposed model with edgeR, make sure you only use one sample from each patient. For example, for half of the patients, take only the "pre" sample, and for the other half, take the "post" sample. This avoids correlations between samples that come from the same patient (which would not be modelled properly here, as there is no
patient
blocking factor in the second model). Or alternatively, you set up two separate models, one with the two controls + all post samples for the control/post comparison; and another model with controls + pre samples, for the control/pre comparison.You can see how this getting complicated, so I would recommend using
duplicateCorrelation
instead. This makes life a lot easier, as you can just use a single model for all of your comparisons between post/pre/controls. It also ensures that you use all of the information in the data for each of your contrasts, albeit at the cost of some anticonservativeness.Thanks for the reply! So using two separate models is still fine from the statistical point of view. I was just concerned about the difference between having all the samples in one model versus two different models , but I guess they answer different questions. The first model would address what difference is there in patients pre and post treatment and the second would address what differences pre and post patients have with normal healthy cell transcriptome.