Question

DESeq2: with multiple factors and interaction terms won't show all effects

1

Entering edit mode

mat.lesche ▴ 110

@matlesche-6835

Last seen 11 months ago

Germany

Hello,

I'm running DESeq2 for my expression analysis and I'm having trouble with the design formula. I have 32 mice samples and 3 different factors and each factor has 2 levels. So the last factor (tissue) has always 4 biological replicates. Here is an overview:

<caption>coldata</caption>

row_names	treatment	type	tissue
s1	A	control	layer_one
s2	A	control	layer_one
s3	A	control	layer_one
s4	A	control	layer_one
s5	A	control	layer_two
s6	A	control	layer_two
s7	A	control	layer_two
s8	A	control	layer_two
s9	A	changed	layer_one
s10	A	changed	layer_one
s11	A	changed	layer_one
s12	A	changed	layer_one
s13	A	changed	layer_two
s14	A	changed	layer_two
s15	A	changed	layer_two
s16	A	changed	layer_two
s17	B	control	layer_one
s18	B	control	layer_one
s19	B	control	layer_one
s20	B	control	layer_one
s21	B	control	layer_two
s22	B	control	layer_two
s23	B	control	layer_two
s24	B	control	layer_two
s25	B	changed	layer_one
s26	B	changed	layer_one
s27	B	changed	layer_one
s28	B	changed	layer_one
s29	B	changed	layer_two
s30	B	changed	layer_two
s31	B	changed	layer_two
s32	B	changed	layer_two

What I'd like to compare is the following interactions: treatment:type + treatment:type:tissue. In special I'm interested in treatmentA.typechanged vs treatmentB.typechanged, treatmentAtypecontrol vs treatmentBtypecontrol, treatmentBtypechanged.tissuelayer_one vs treatmentBtypechanged.tissuelayer_two, treatmentBtypecontrol.tissuelayer_one vs treatmentBtypecontrol.tissuelayer_two

I run the following DESeq:

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ treatment + treatment:type + treatment:type:tissue)

dds <- DESeq(dds, parallel=T)

As a result I get

resultsNames(dds)

[1] "Intercept" "treatment_A_vs_B" "treatmentA.typecontrol" "treatmentB.typecontrol"
[5] "treatmentA.typecontrol.tissuelayer_one" "treatmentB.typecontrol.tissuelayer_one" "treatmentA.typecontrolchanged.tissuelayer_two" "treatmentB.typecontrolchanged.tissuelayer_two"

I was wondering if this is the right way to go? However resultsNames doesn't show all effects. Am I doing something wrong?

Another thing I could do is to run DESeq several times but exclude samples. For example for the first comparison I would run the sample from treatmentA.typecontrol.tissuelayer_one vs treatmentA.typecontrol.tissuelayer_two etc. but I would prefer to have all samples in one DESeq run.

In another post I read that it is also possible to combine two factors (merging the columns) which creates one single factor and run DESeq with that. In my case it would be a factor with the following levels treatmentA-typecontrol, treatmentA-typechanged, treatmentB-typecontrol, treatmentB-typechanged. Would this be possible?

Thanks for the help!

best

Mathias

deseq2 multiple factor design contrast interactions • 11k views

ADD COMMENT • link updated 10.2 years ago by Michael Love 43k • written 10.2 years ago by mat.lesche ▴ 110

score 2 · Answer 1 · 2015-02-19

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

hi Mathias,

I would recommend combining the factors into a single factor, and then this will greatly simplify your desired comparison of individual groups. You can use the contrast argument to compare whichever groups defined by the combined factor.

Also the beta prior calculation is not implemented for second order interactions, so it should be giving an error here. What version of DESeq2 are you using?

As a side, typically I would expect one would include base terms as well in the model, for example type and tissue, if you are including them as interactions.

ADD COMMENT • link 10.2 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

thanks for the answer. I thought the same and started to run some analysis. For the first one I combined the first two factor into one and ended up with two factors:

coldata$treatment_type <- factor(paste0(coldata$treatment, "-", coldata$type))

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ treatment_type + treatment_type:tissue)
dds <- DESeq(dds, parallel=T)
resultsNames(dds)

I did the following comparison:

result<- results(dds, contrast=list("treatment_typeA.control.tissuelayer_one", "treatment_typeA.changed.tissuelayer_one"))
sum(result$padj<0.1, na.rm=T)

and the result were 2600 differentially expressed genes.

Afterwards, I followed your example and combined all three factors into one:

coldata$treatment_type_tissue <- factor(paste0(coldata$treatment, "-", coldata$type, "-", coldata$tissue))
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ treatment_type_tissue)
dds <- DESeq(dds, parallel=T)
resultsNames(dds)

result <- results(dds, contrast=list("treatment_type_tissueA.control.layer_one", "treatment_type_tissueA.changed.layer_one"))
sum(result$padj<0.1, na.rm=T)

Here I have 6086 differentially expressed genes. Now I'm wondering which way is the better one? Because these are the same comparisons for me.

If I have two factors (the first example), then I can do the following comparison: "treatment_typeA.changed" vs "treatment_typeB.changed". How would I do this with your example of one factor? I have the following results:

> resultsNames(dds)
[1] "Intercept"                                "treatment_type_tissueA.changed.layer_one" "treatment_type_tissueA.changed.layer_two"
[4] "treatment_type_tissueA.control.layer_one" "treatment_type_tissueA.control.layer_two" "treatment_type_tissueB.changed.layer_one"
[7] "treatment_type_tissueB.changed.layer_two" "treatment_type_tissueB.control.layer_one" "treatment_type_tissueB.control.layer_two"

would it be:

res <- results(dds, contrast=c(0, -1, -1, 0, 0, 1, 1, 0, 0))

And to answer your question. I'm using DESeq2 v 1.6.3 and there was no error at all.

Thanks for helping me!

ADD REPLY • link 10.2 years ago mat.lesche ▴ 110

1

Entering edit mode

hi Mat,

There must be a bug in the code which isn't giving an error with betaPrior=TRUE and second order interactions. I'll have to hunt this down.

Then, you should use the design which combines all factors into one. As 6000 is a huge number, you might want to filter on log fold change for more biologically meaningful tables of results, see the lfcThreshold argument of results().

Your first design above is strange because it doesn't include the tissue as a main effect, but only in the interaction.

To get the average of two coefficients over the average of two others, you can use the list-type of contrast, and listValues:

results(dds, contrast=list( c("A_changed_one","A_changed_two"),
c("B_changed_one","B_changed_two")), listValues=c(1/2, -1/2))

which is equivalent to writing out the contrast with -1/2 and 1/2 in the right spots, but is safer in using the names of the coefficients. (You have to use the resultsNames(dds) versions in the list, I just shortened them for my convenience here.).

ADD REPLY • link 10.2 years ago Michael Love 43k

0

Entering edit mode

Hey Michael,

thx again. Yes, you are completely right, I forgot the tissue. I included the tissue as main effect in my first design and the number of genes increased to 2649.

However, I followed your suggestion and combined all three factors. Your explanation was really helpful.

It was the first time that I tried to do an analysis with three factor and interaction terms and it made my realise that I need to understand this better (also how to use the lfc threshold). Can you point me to a good how-to? I already read the vignette. Also can you give some guidance on how to evaluate if one should keep samples from conditions, even though one doesn't need them for the comparison. Or if it is better to do separate comparisons (e.g. one factor with conditions A, B, C and I do one DESeq with A vs B without C in the dataset and B vs C without A in the dataset, ...)

ADD REPLY • link 10.2 years ago mat.lesche ▴ 110

0

Entering edit mode

hi,

So, I've now fixed this in devel (there was an exception in the logic that let you run 2nd order interactions with beta prior because the factors had two levels). Now it recommends either grouping variables as you have done or betaPrior=FALSE.

You should generally keep samples from all conditions, as it helps to estimate the dispersion. The only exception is if some groups have very different within group variance than others. You can check this with the PCA plot as in the vignette. If the groups are similarly spread (or just not an extreme difference), it's best to keep all the samples from all conditions in the analysis, vs doing separate comparisons.

ADD REPLY • link 10.2 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

Regarding the issue about keeping the sample. Here is my PCA:

http://www.biotec.tu-dresden.de/sharing/t/uploadyl6T3Z/pca.test.pdf

At the moment I would either keep all the samples for the comparisons or built two data sets from the original and I would separate it by the factor "tissue (layer_one, layer_two) because withing group variance is different between tissue layer_one and layer_two. I tend to do the separation of the data set. What do you think?

Thanks again. Like I said I don't have so much experience with so many samples and conditions (at least for me it is more than usual)

ADD REPLY • link 10.2 years ago mat.lesche ▴ 110

0

Entering edit mode

hi Mat,

Layer two appears to me to have a lot more within-group variance. So I'd recommend splitting into two datasets for making within layer comparisons. Of course, if you want to compare any groups across layer, then you need to run it as one dataset. The treatment effect is in the same direction for both layers, it's just that layer two has a lot more variability within each treatment group.

ADD REPLY • link 10.2 years ago Michael Love 43k

0

Entering edit mode

Hey Michael,

the more I get into this the more question come up. :( Regarding the splitting. I ran deseq2 with all samples and did the following to get the sample only for one tissue

coldata$dmt <- factor(paste0(coldata$treatment, "-", coldata$type, "-", coldata$tissue))
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ dmt)
dds <- DESeq(dds, parallel=T)

dds_layerone <- dds[, dds$tissue == "layer_one"]
dds_layerone$dmt <-droplevels(dds_layerone$dmt)
dds_layerone <- DESeq(dds_layerone, parallel=T)

Deseq uses the already existing size factors for this second run and I get for one comparison 2311 differentially expressed genes. If I do the same comparison but this time I run deseq only on the layer_one samples, which means new size factor estimation, I get 2266 differential expressed genes.

What's the best way to do it? Run deseq with all samples and remove the ones you don't need and use droplevels or run deseq only on the samples one is interested?

thanks again. I hope that will be last question for this problem :)

best

Mathias

ADD REPLY • link 10.2 years ago mat.lesche ▴ 110

1

Entering edit mode

hi Mathias,

I would re-estimate size factors, but it shouldn't make a big difference (note that 50/2200 ~ 2% fluctuation is not surprising to small changes in model parameters, p-values are tail probabilities which fluctuate a lot even with small changes to model parameters).

You can re-estimate size factors with:

dds <- estimateSizeFactors(dds)

ADD REPLY • link 10.2 years ago Michael Love 43k

0

Entering edit mode

Hey Michael,

I'll do it! Again thanks a lot for the help. It was nice go through such an analysis with you. Now I do understand everything a bit more :)

best

mathias

ADD REPLY • link 10.2 years ago mat.lesche ▴ 110

0

Entering edit mode

posted twice...

ADD REPLY • link 10.2 years ago Michael Love 43k