Hello,
I'm running DESeq2 for my expression analysis and I'm having trouble with the design formula. I have 32 mice samples and 3 different factors and each factor has 2 levels. So the last factor (tissue) has always 4 biological replicates. Here is an overview:
<caption>coldata</caption>
row_names | treatment | type | tissue |
---|---|---|---|
s1 | A | control | layer_one |
s2 | A | control | layer_one |
s3 | A | control | layer_one |
s4 | A | control | layer_one |
s5 | A | control | layer_two |
s6 | A | control | layer_two |
s7 | A | control | layer_two |
s8 | A | control | layer_two |
s9 | A | changed | layer_one |
s10 | A | changed | layer_one |
s11 | A | changed | layer_one |
s12 | A | changed | layer_one |
s13 | A | changed | layer_two |
s14 | A | changed | layer_two |
s15 | A | changed | layer_two |
s16 | A | changed | layer_two |
s17 | B | control | layer_one |
s18 | B | control | layer_one |
s19 | B | control | layer_one |
s20 | B | control | layer_one |
s21 | B | control | layer_two |
s22 | B | control | layer_two |
s23 | B | control | layer_two |
s24 | B | control | layer_two |
s25 | B | changed | layer_one |
s26 | B | changed | layer_one |
s27 | B | changed | layer_one |
s28 | B | changed | layer_one |
s29 | B | changed | layer_two |
s30 | B | changed | layer_two |
s31 | B | changed | layer_two |
s32 | B | changed | layer_two |
What I'd like to compare is the following interactions: treatment:type + treatment:type:tissue. In special I'm interested in treatmentA.typechanged vs treatmentB.typechanged, treatmentAtypecontrol vs treatmentBtypecontrol, treatmentBtypechanged.tissuelayer_one vs treatmentBtypechanged.tissuelayer_two, treatmentBtypecontrol.tissuelayer_one vs treatmentBtypecontrol.tissuelayer_two
I run the following DESeq:
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ treatment + treatment:type + treatment:type:tissue)
dds <- DESeq(dds, parallel=T)
As a result I get
resultsNames(dds)
[1] "Intercept" "treatment_A_vs_B" "treatmentA.typecontrol" "treatmentB.typecontrol"
[5] "treatmentA.typecontrol.tissuelayer_one" "treatmentB.typecontrol.tissuelayer_one" "treatmentA.typecontrolchanged.tissuelayer_two" "treatmentB.typecontrolchanged.tissuelayer_two"
I was wondering if this is the right way to go? However resultsNames doesn't show all effects. Am I doing something wrong?
Another thing I could do is to run DESeq several times but exclude samples. For example for the first comparison I would run the sample from treatmentA.typecontrol.tissuelayer_one vs treatmentA.typecontrol.tissuelayer_two etc. but I would prefer to have all samples in one DESeq run.
In another post I read that it is also possible to combine two factors (merging the columns) which creates one single factor and run DESeq with that. In my case it would be a factor with the following levels treatmentA-typecontrol, treatmentA-typechanged, treatmentB-typecontrol, treatmentB-typechanged. Would this be possible?
Thanks for the help!
best
Mathias
Hi Michael,
thanks for the answer. I thought the same and started to run some analysis. For the first one I combined the first two factor into one and ended up with two factors:
I did the following comparison:
and the result were 2600 differentially expressed genes.
Afterwards, I followed your example and combined all three factors into one:
Here I have 6086 differentially expressed genes. Now I'm wondering which way is the better one? Because these are the same comparisons for me.
If I have two factors (the first example), then I can do the following comparison: "treatment_typeA.changed" vs "treatment_typeB.changed". How would I do this with your example of one factor? I have the following results:
would it be:
And to answer your question. I'm using DESeq2 v 1.6.3 and there was no error at all.
Thanks for helping me!
hi Mat,
There must be a bug in the code which isn't giving an error with betaPrior=TRUE and second order interactions. I'll have to hunt this down.
Then, you should use the design which combines all factors into one. As 6000 is a huge number, you might want to filter on log fold change for more biologically meaningful tables of results, see the lfcThreshold argument of results().
Your first design above is strange because it doesn't include the tissue as a main effect, but only in the interaction.
To get the average of two coefficients over the average of two others, you can use the list-type of contrast, and listValues:
which is equivalent to writing out the contrast with -1/2 and 1/2 in the right spots, but is safer in using the names of the coefficients. (You have to use the resultsNames(dds) versions in the list, I just shortened them for my convenience here.).
Hey Michael,
thx again. Yes, you are completely right, I forgot the tissue. I included the tissue as main effect in my first design and the number of genes increased to 2649.
However, I followed your suggestion and combined all three factors. Your explanation was really helpful.
It was the first time that I tried to do an analysis with three factor and interaction terms and it made my realise that I need to understand this better (also how to use the lfc threshold). Can you point me to a good how-to? I already read the vignette. Also can you give some guidance on how to evaluate if one should keep samples from conditions, even though one doesn't need them for the comparison. Or if it is better to do separate comparisons (e.g. one factor with conditions A, B, C and I do one DESeq with A vs B without C in the dataset and B vs C without A in the dataset, ...)
hi,
So, I've now fixed this in devel (there was an exception in the logic that let you run 2nd order interactions with beta prior because the factors had two levels). Now it recommends either grouping variables as you have done or betaPrior=FALSE.
You should generally keep samples from all conditions, as it helps to estimate the dispersion. The only exception is if some groups have very different within group variance than others. You can check this with the PCA plot as in the vignette. If the groups are similarly spread (or just not an extreme difference), it's best to keep all the samples from all conditions in the analysis, vs doing separate comparisons.
Hi Michael,
Regarding the issue about keeping the sample. Here is my PCA:
http://www.biotec.tu-dresden.de/sharing/t/uploadyl6T3Z/pca.test.pdf
At the moment I would either keep all the samples for the comparisons or built two data sets from the original and I would separate it by the factor "tissue (layer_one, layer_two) because withing group variance is different between tissue layer_one and layer_two. I tend to do the separation of the data set. What do you think?
Thanks again. Like I said I don't have so much experience with so many samples and conditions (at least for me it is more than usual)
hi Mat,
Layer two appears to me to have a lot more within-group variance. So I'd recommend splitting into two datasets for making within layer comparisons. Of course, if you want to compare any groups across layer, then you need to run it as one dataset. The treatment effect is in the same direction for both layers, it's just that layer two has a lot more variability within each treatment group.
Hey Michael,
the more I get into this the more question come up. :( Regarding the splitting. I ran deseq2 with all samples and did the following to get the sample only for one tissue
Deseq uses the already existing size factors for this second run and I get for one comparison 2311 differentially expressed genes. If I do the same comparison but this time I run deseq only on the layer_one samples, which means new size factor estimation, I get 2266 differential expressed genes.
What's the best way to do it? Run deseq with all samples and remove the ones you don't need and use droplevels or run deseq only on the samples one is interested?
thanks again. I hope that will be last question for this problem :)
best
Mathias
hi Mathias,
I would re-estimate size factors, but it shouldn't make a big difference (note that 50/2200 ~ 2% fluctuation is not surprising to small changes in model parameters, p-values are tail probabilities which fluctuate a lot even with small changes to model parameters).
You can re-estimate size factors with:
Hey Michael,
I'll do it! Again thanks a lot for the help. It was nice go through such an analysis with you. Now I do understand everything a bit more :)
best
mathias
posted twice...