Could you kindhearted to help me again?
In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.
So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc
the levels of con:groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc
After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.
I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h. After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.
So my questions are:
1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”
Thank you very much fro your gudance! Could you kindhearted to point out how to compare the resistance group (groupingNILR.B) to the susceptible group ((groupingNILR.A+groupingNILS.B)/2)?
Well, you've almost written it yourself:
... where you replace
x
with the time point of interest. You'll want to set up comparisons at each time point as separate contrasts, just in case you have a situation where a gene is up in one time point and down in another time point. Such genes would be missed if you summarized samples from all time points into one DE log-fold change.On a similar note, the contrast I've written above will only test for differences between NILR.B and the average of the other two groups. It is possible to get situations where the reported log-fold change is positive, but the log-fold change between NILR.B and one of the other groups is negative. To protect against this, I tend to prefer ANOVA-like comparisons:
I would then focus on significant genes where the log-fold changes are in the same direction, as these are the ones showing a consistent qualitative effect in the resistant group over both of the susceptible groups. If you want to be more rigorous, you can do the NILR.B/NILR.A and NILR.B/NILS.B comparisons separately and intersect the DE lists to obtain genes with significant differences between resistant and both susceptible groups. I think this is probably unnecessarily conservative, though.
Thank you for your response. I really appreciate the advice on ANOVA-like comparisons you have given. From your opinion, it is not sensible to make pairwise comparisons between the resistance group VS the susceptible group which defining each treatment combination as a group. Do you have some good idea about gain the global differential expression genes (needn't set up comparisons at each time point as separate contrasts) involved in the interaction with pathogen, such as resistance or related genes?
Why isn't it sensible? I don't recall saying anything like that.
If you want to find globally DE genes, you can set up an ANOVA-like comparison involving all time points:
This will give you genes that are DE between the resistant and susceptible groups at any time point. That said, getting DE genes at a time point of zero mightn't be particularly interesting, because inoculation shouldn't have any effect at the start of the time course. A more sophisticated DE comparison might be something like:
This will correct the DE at the later time points for any (uninteresting) DE at time point zero. You can then focus on genes where the log-fold changes are of the same sign, as these have a consistent association with resistance against susceptibility for both susceptible groups and across all time points.
So appreciate for your advise! Thank you very much!