Hello,
I have a question about the best DESeq2 experimental design matrix for my dataset.
I am working with 75 RNAseq samples from the same tissue, classified into subtypes as follows:
- 5x normal tissue (group A), which I use as the reference/denominator in the analyses.
- 22x early-stage disease (group B)
- 12x late-stage disease for each of 4 different subtypes (48 samples total; groups C-F)
The raw counts for all 75 samples are stored in one matrix.
Currently we are mostly interested in the genes specific to each late-stage disease subtype (e.g. the gene expression signatures associated with C, D, E and F).
Right now I've used the design
ddsMat <- DESeqDataSetFromMatrix(countData = counts.raw, colData = metadata, design = ~ subtype)
Where subtype is one of A-F. I then extract the subtype-specific results from the comparison of the subtype to normal (that is C vs A, D vs A, E vs A and F vs A):
res <- results(ddsMat, contrast=c("subtype", "C", "A"))
And so on for groups D, E and F.
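To make the extraction step concrete, here is a minimal sketch of how the four contrasts against normal can be looped over. It runs on simulated counts, not my real data: the toy matrix, the gene and sample names, and the list name res.list are all made up for illustration, and only the group sizes and the ~ subtype design match what I described above.

```r
library(DESeq2)

# Simulated stand-in for the real data (NOT the actual counts):
# 200 genes, 75 samples with the same group sizes as in my dataset.
set.seed(1)
n.genes <- 200
subtype <- factor(rep(c("A", "B", "C", "D", "E", "F"),
                      times = c(5, 22, 12, 12, 12, 12)))
gene.means <- 2^runif(n.genes, 5, 10)  # per-gene mean, recycled down each column
counts.raw <- matrix(
  rnbinom(n.genes * length(subtype), mu = gene.means, size = 5),
  nrow = n.genes,
  dimnames = list(paste0("gene", seq_len(n.genes)),
                  paste0("s", seq_along(subtype)))
)
metadata <- data.frame(subtype = subtype, row.names = colnames(counts.raw))

# Same design as in the post; "A" is the first factor level, so it is the reference
ddsMat <- DESeqDataSetFromMatrix(countData = counts.raw,
                                 colData = metadata,
                                 design = ~ subtype)
ddsMat <- DESeq(ddsMat)

# One results table per late-stage subtype, each compared with normal (A)
late.subtypes <- c("C", "D", "E", "F")
res.list <- lapply(late.subtypes, function(s) {
  results(ddsMat, contrast = c("subtype", s, "A"))
})
names(res.list) <- late.subtypes
```

Each element of res.list (for example res.list[["C"]]) is then the C vs A table, and likewise for D, E and F.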
However, looking at count plots for the top genes in each contrast shows that I'm mostly finding genes differentially expressed between normal tissue and late-stage disease as a whole (all of groups C-F), not the genes specific to C, D, E or F, which is what I'm after.
My first question: is there a better design that I could use to pick out the subtype-specific genes? For example, would adding a "stage" factor with the levels "normal", "early" and "late", and then using the following design, help to extract the subtype-specific differences?
design = ~ subtype + stage + subtype:stage
(apologies if the syntax is wrong!)
My second question is regarding how DESeq2 handles data not included in the analysis. As I said above, we have 75 samples, but right now I'm focused on analysing the late-stage (groups C,D,E,F) and normal (group A) samples. Is there any problem with leaving the early-stage samples (group B) in the matrix, in terms of how DESeq2 deals with the filtering, normalisation and significance testing steps?
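To make the second question concrete: the alternative to leaving group B in would be to subset the object before running DESeq(), roughly as sketched below. This again uses simulated counts rather than my real data, and the names dds.all, dds.noB and res.noB are just illustrative.

```r
library(DESeq2)

# Simulated stand-in for the real data (NOT the actual counts)
set.seed(2)
n.genes <- 200
subtype <- factor(rep(c("A", "B", "C", "D", "E", "F"),
                      times = c(5, 22, 12, 12, 12, 12)))
gene.means <- 2^runif(n.genes, 5, 10)
counts.raw <- matrix(
  rnbinom(n.genes * length(subtype), mu = gene.means, size = 5),
  nrow = n.genes,
  dimnames = list(paste0("gene", seq_len(n.genes)),
                  paste0("s", seq_along(subtype)))
)
metadata <- data.frame(subtype = subtype, row.names = colnames(counts.raw))

dds.all <- DESeqDataSetFromMatrix(countData = counts.raw,
                                  colData = metadata,
                                  design = ~ subtype)

# Remove the early-stage samples (group B), then drop the now-empty factor level
dds.noB <- dds.all[, dds.all$subtype != "B"]
dds.noB$subtype <- droplevels(dds.noB$subtype)

# Size factors and dispersions are now estimated from the remaining 53 samples only
dds.noB <- DESeq(dds.noB)
res.noB <- results(dds.noB, contrast = c("subtype", "C", "A"))
```

What I'd like to know is whether the results from an object like dds.noB should be preferred over those from the full 75-sample object.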
Thanks in advance!
(For reference I'm using R-3.1.2 and DESeq2_1.6.2)
Thank you very much for the suggestions! I will, of course, try both approaches and see which gives the more sensible results.