Hi Everyone
I have a DESeq2 design problem: I have counts for reads mapping to the maternal and paternal alleles, for wildtype and knockdown samples. The design looks like this, where TF = transcription factor, rep = replicate and KD = knockdown. I have a matching control for each TF.
sampleName | Sample | Allele |
Control_1_repA_mat | Control_1 | Maternal |
Control_1_repB_mat | Control_1 | Maternal |
Control_1_repC_mat | Control_1 | Maternal |
Control_2_repA_mat | Control_2 | Maternal |
Control_2_repB_mat | Control_2 | Maternal |
Control_2_repC_mat | Control_2 | Maternal |
KD_TF1_1_repA_mat | KD_TF1_1 | Maternal |
KD_TF1_1_repB_mat | KD_TF1_1 | Maternal |
KD_TF1_1_repC_mat | KD_TF1_1 | Maternal |
KD_TF2_2_repA_mat | KD_TF2_2 | Maternal |
KD_TF2_2_repB_mat | KD_TF2_2 | Maternal |
KD_TF2_2_repC_mat | KD_TF2_2 | Maternal |
Control_1_repA_pat | Control_1 | Paternal |
Control_1_repB_pat | Control_1 | Paternal |
Control_1_repC_pat | Control_1 | Paternal |
Control_2_repA_pat | Control_2 | Paternal |
Control_2_repB_pat | Control_2 | Paternal |
Control_2_repC_pat | Control_2 | Paternal |
KD_TF1_1_repA_pat | KD_TF1_1 | Paternal |
KD_TF1_1_repB_pat | KD_TF1_1 | Paternal |
KD_TF1_1_repC_pat | KD_TF1_1 | Paternal |
KD_TF2_2_repA_pat | KD_TF2_2 | Paternal |
KD_TF2_2_repB_pat | KD_TF2_2 | Paternal |
KD_TF2_2_repC_pat | KD_TF2_2 | Paternal |
Now I want to see, for each TF knockdown, the differential expression between the maternal and paternal alleles. But I also want to exclude the genes which show differential expression between the maternal and paternal alleles in the controls. Earlier I was splitting the samples into Control and Test, running DESeq2 on each with the design ~ sample + allele, and then removing the genes that are differentially expressed in both control and test, but it's not giving me the expected results.
So is it a better strategy not to split the samples and to use the same design formula? Also, after running DESeq, how should I extract the differences? Should I extract the Mat over Pat difference for each sample (Control and KD) and then again simply remove the common differentially expressed genes, or is there a better way (i.e. of constructing the design matrix or extracting results) that takes care of this?
Thanks in advance
Thanks Michael for the reply. I saw the section on dealing with interactions in the manual but didn't realize that it covers the same situation that I have. In that case, is it better to split the input by KD and its matching control, or is it not important?
Sorry, I missed that in my first pass. How are the controls for TF1 and TF2 different?
Each TF has one matching control as the knock-down was performed by different people. They used different scrambled siRNA sequences.
If you want to control each TF with its matching control, this is possible as well.
Create column data which looks like this (update: note that all columns should be factors)
Then use a design of
~ TF + TF:condition + TF:allele + TF:condition:allele
The two interaction terms for TF:condition:allele are tests for differences in Paternal vs Maternal, controlling for differences in that TF's control. You will use results(dds, name=...) to extract each one separately.
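As a rough illustration of the layout described above, here is a minimal sketch in base R. The column data is hypothetical (the replicate labels and the level names TF1/TF2, Control/KD, Maternal/Paternal are assumptions based on the sample names in the question), and it only uses model.matrix to show which coefficients the design produces; it does not run DESeq2 itself.

```r
# Hypothetical column data: 2 TFs x 2 conditions x 2 alleles x 3 replicates.
coldata <- expand.grid(
  replicate = c("repA", "repB", "repC"),
  condition = c("Control", "KD"),
  TF        = c("TF1", "TF2"),
  allele    = c("Maternal", "Paternal"),
  stringsAsFactors = FALSE
)

# Every design variable is a factor; the first level is the reference level.
coldata$TF        <- factor(coldata$TF, levels = c("TF1", "TF2"))
coldata$condition <- factor(coldata$condition, levels = c("Control", "KD"))
coldata$allele    <- factor(coldata$allele, levels = c("Maternal", "Paternal"))

design <- ~ TF + TF:condition + TF:allele + TF:condition:allele
mm <- model.matrix(design, data = coldata)
design_cols <- colnames(mm)
print(design_cols)

# The three-way columns, one per TF, are the terms of interest.
three_way <- grep("conditionKD:allelePaternal", design_cols, value = TRUE)
print(three_way)

# In DESeq2 the corresponding coefficients would be pulled out one at a time,
# e.g. results(dds, name = ...), using the exact names from resultsNames(dds).
```

With factor columns there should be one three-way term per TF. DESeq2 reports coefficient names with "." in place of ":", so the names to pass to results() should look something like "TFTF1.conditionKD.allelePaternal", but it is safest to copy them from resultsNames(dds).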
So when I made the design matrix like the one above, I found that resultsNames(dds) looks like this:
"Intercept" "TF" "TF.conditionKD"
"TF.allelePaternal" "TF.conditionKD.allelePaternal"
Then I extracted results with name = "TF.conditionKD.allelePaternal". But maybe this is the combined effect from the two TFs?
I was wondering whether it would be better to just divide my input into two separate data frames and run DESeq on them separately with the first design matrix you suggested above.
Can you post your column data and your design which produced these resultsNames?
I was thinking it would look like my example in the comment above.
My Design:
where row.names(design) = colnames(featureCount.Result). Condition, allele and TF are factors.
Then I run the following:
It runs without any trouble.
colData(ase.deseq)
The TF column needs to be a factor.
dds$TF <- factor(dds$TF)
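To see why this matters, here is a small base R sketch. It assumes, hypothetically, that TF had been stored as the numbers 1 and 2; the column data below is made up for illustration. A numeric TF is treated as a single slope, so the design collapses to one term per interaction, while a factor TF gives one term per TF.

```r
# Hypothetical column data in which TF was read in as a number (1 or 2).
coldata <- expand.grid(
  condition = factor(c("Control", "KD")),
  allele    = factor(c("Maternal", "Paternal")),
  TF        = c(1, 2)
)
design <- ~ TF + TF:condition + TF:allele + TF:condition:allele

# Numeric TF: a single coefficient per term, shared across both TFs.
numeric_cols <- colnames(model.matrix(design, data = coldata))
print(numeric_cols)

# Factor TF: separate coefficients for each TF level.
coldata$TF <- factor(coldata$TF)
factor_cols <- colnames(model.matrix(design, data = coldata))
print(factor_cols)
```

The numeric version has a single "TF:conditionKD:allelePaternal" column, which is the same pattern as the resultsNames output posted earlier in the thread.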
Thanks a lot, it was my mistake.
One last thing: I ran DESeq with this design and got 40 significant genes for TF1. Then I ran it with the previous design that I mentioned (i.e. splitting the input by TF and fitting the following design formula you suggested).
and I got 35 genes for TF1 (34 of them in common between the two).
Can you please tell me where this difference comes from, and which one is the better strategy? I assume splitting by TF and running DESeq separately is also correct, as it just means I am treating TF1 and TF2 as separate experiments.
Small fluctuations are expected when you change the samples involved in the analysis.
p-values are tail probabilities and therefore very sensitive to small changes in parameter estimates.
Remember also that the set of genes with FDR < alpha is not the "true set of DE genes", but at most a set enriched with the genes for which you had power to detect DE. By changing the samples you change the power as well.
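A toy illustration of this sensitivity, using base R's p.adjust with the Benjamini-Hochberg method. The p-values below are invented for the example and are not from this dataset; the point is only that nudging one p-value near the threshold changes the size of the significant set.

```r
alpha <- 0.05

# Made-up p-values for 8 genes from one hypothetical analysis.
p_full <- c(0.001, 0.004, 0.012, 0.020, 0.030, 0.2, 0.5, 0.8)

# The same genes from a second hypothetical analysis, where one
# p-value has shifted slightly (0.030 -> 0.033).
p_split <- p_full
p_split[5] <- 0.033

# Count genes passing the FDR threshold in each analysis.
n_full  <- sum(p.adjust(p_full,  method = "BH") < alpha)
n_split <- sum(p.adjust(p_split, method = "BH") < alpha)
print(c(full = n_full, split = n_split))
```

Here the gene that drops out has an adjusted p-value that moves from just under to just over alpha, which is the kind of borderline change that can account for a 40 versus 35 gene difference.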
I see. Alright, thanks a lot, Michael.