Question

DESeq2 design matrix for 6 groups

m.fletcher ▴ 20

Hello,

I have a question about the best DESeq2 experimental design matrix for my dataset.

I am working with 75 RNAseq samples from the same tissue, but classified in various subtypes as follows:

5x normal tissue (group A), which I use as the reference/denominator in the analyses.
22x early-stage disease (group B)
12x each of 4 different late-stage disease subtypes (48 samples total) (groups C-F)

The raw counts for all 75 samples are stored in one matrix.

Currently we are mostly interested in the genes specific to each late-stage disease subtype (e.g. the gene expression signatures associated with C, D, E and F).

Right now I've used the design

ddsMat <- DESeqDataSetFromMatrix(countData = counts.raw, colData = metadata, design = ~ subtype)

Where subtype is one of A-F. I then extract the subtype-specific results from the comparison of the subtype to normal (that is C vs A, D vs A, E vs A and F vs A):

res <- results(ddsMat, contrast=c("subtype", "C", "A"))

And so on for groups D, E and F.
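For readers who want to see this workflow end to end, here is a minimal sketch. It assumes DESeq2 is installed and uses simulated counts from makeExampleDESeqDataSet() as a stand-in for the real matrix; the object names (dds, late, res_list) are illustrative and not from the post. Only the group sizes follow the description above.

```r
library(DESeq2)

# Simulated stand-in data (NOT the poster's data): 500 genes x 75 samples
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 75)

# Assign the group sizes described above: 5 A, 22 B, 12 each of C-F
dds$subtype <- factor(rep(c("A", "B", "C", "D", "E", "F"),
                          times = c(5, 22, 12, 12, 12, 12)))
dds$subtype <- relevel(dds$subtype, ref = "A")  # normal tissue as reference
design(dds) <- ~ subtype

# Fit the model once, then extract one results table per late-stage subtype
dds <- DESeq(dds)
late <- c("C", "D", "E", "F")
res_list <- lapply(setNames(late, late), function(g) {
  results(dds, contrast = c("subtype", g, "A"))  # factor name is a quoted string
})

# Number of genes in each results table
sapply(res_list, nrow)
```

Note that the first element of the contrast vector is the name of the factor as a character string, followed by the numerator and denominator levels.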

However, looking at count plots for the top genes in each contrast shows that I'm mostly finding genes differentially expressed between normal tissue and (all the groups of) late-stage disease - not the genes specific to C/D/E/F, which is what I'm after.

My first question is: is there a better design matrix that I could use for this comparison? For example, would it help to add a "stage" factor with the levels "normal", "early" and "late", and then use the following design to extract the subtype-specific differences?

design = ~ subtype + stage + subtype:stage

(apologies if the syntax is wrong!)

My second question is regarding how DESeq2 handles data not included in the analysis. As I said above, we have 75 samples, but right now I'm focused on analysing the late-stage (groups C,D,E,F) and normal (group A) samples. Is there any problem with leaving the early-stage samples (group B) in the matrix, in terms of how DESeq2 deals with the filtering, normalisation and significance testing steps?

Thanks in advance!

(For reference I'm using R-3.1.2 and DESeq2_1.6.2)

Tags: rnaseq, deseq2, design and contrast matrix

updated 10.2 years ago by Michael Love • written 10.2 years ago by m.fletcher

score 2 · Accepted Answer · 2015-01-23

"I'm mostly finding genes differentially expressed between normal tissue and (all the groups of) late-stage disease - not the genes specific to C/D/E/F, which is what I'm after."

I'd recommend using a design of ~ subtype and then one of the following strategies for building results tables, depending on how you interpret the statement above. First, note that you can use the listValues argument of results() to form a contrast between one level and a combination of several other levels.

The following table tests whether subtype C differs from the average of A, D, E and F, where each of the four levels is given equal weight. However, this does not guarantee that, for example, A and C will have a large difference.

results(dds, contrast=list("subtypeC", c("subtypeA","subtypeD","subtypeE","subtypeF")), listValues=c(1, -1/4))
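To make the weighting concrete, here is a small base-R sketch using made-up log2 group means for a single gene (the numbers are purely illustrative, not from the dataset). It shows that weights of 1 and -1/4 compute C minus the equal-weight average of A, D, E and F, and that this value can be large even when C and A are nearly identical.

```r
# Hypothetical log2 mean expression of one gene per subtype (made-up values)
log2_mean <- c(A = 5.0, C = 5.1, D = 2.0, E = 2.0, F = 2.0)

# Same weights as listValues = c(1, -1/4): +1 for C, -1/4 for each of A, D, E, F
w <- c(A = -1/4, C = 1, D = -1/4, E = -1/4, F = -1/4)

# Contrast estimate: C minus the equal-weight average of the other four levels
c_vs_rest <- sum(w * log2_mean[names(w)])

# Simple pairwise difference C vs A, for comparison
c_vs_a <- unname(log2_mean["C"] - log2_mean["A"])

print(c(c_vs_rest = c_vs_rest, c_vs_a = c_vs_a))
```

When using the list form with real data, the names in the list need to match entries of resultsNames(dds). Which names are available can depend on the DESeq2 version and settings, so it is worth checking before copying the call.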

If you want to enforce a large difference between A and C, then I'd recommend building two results tables and taking the intersection of the genes that pass your FDR threshold in each. The first set would be defined by the simple contrast=c("subtype","C","A") and the second set by:

results(dds, contrast=list("subtypeC", c("subtypeD","subtypeE","subtypeF")), listValues=c(1, -1/3))

The combination of these two results tables would enforce both conditions: C vs A is significant, and C vs the average of D, E and F is significant.
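The intersection step can be sketched in base R as follows. The two data frames are hypothetical stand-ins for the results tables (only a padj column is used), and the 0.1 threshold and all object names are assumptions for illustration.

```r
# Hypothetical stand-ins for the two results tables; in practice these would
# come from results() with the two contrasts described above.
res_C_vs_A <- data.frame(padj = c(0.001, 0.20, 0.01, NA, 0.03),
                         row.names = paste0("gene", 1:5))
res_C_vs_DEF <- data.frame(padj = c(0.02, 0.001, 0.50, 0.01, 0.04),
                           row.names = paste0("gene", 1:5))

fdr <- 0.1  # assumed FDR threshold

# which() drops NA adjusted p-values, which DESeq2 can report for filtered genes
sig_vs_A <- rownames(res_C_vs_A)[which(res_C_vs_A$padj < fdr)]
sig_vs_DEF <- rownames(res_C_vs_DEF)[which(res_C_vs_DEF$padj < fdr)]

# Genes significant in both comparisons
c_specific <- intersect(sig_vs_A, sig_vs_DEF)
print(c_specific)
```

Using which() rather than a bare logical index avoids NA rows appearing in the gene lists.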

"My second question is regarding how DESeq2 handles data not included in the analysis. As I said above, we have 75 samples, but right now I'm focused on analysing the late-stage (groups C,D,E,F) and normal (group A) samples. Is there any problem with leaving the early-stage samples (group B) in the matrix, in terms of how DESeq2 deals with the filtering, normalisation and significance testing steps?"

Keeping the extra samples in the dataset is usually better for inference, even if they are not used in the contrasts, because they help improve the dispersion estimation steps.