Hello All,
I am performing a DEG analysis using DESeq2 and would really appreciate your advice on the results(contrast) section of my analysis.
I have 5 disease categories(RM, RP, SP_progress, SP_stable and healthy) for each category we extracted monocytes and treated these monocytes with different concentration of a chemical (10nm,50nm,100nm,200nm and 0nm/none). The goal is to perform DE analysis within each group by comparing monocytes with different treatment compositions (Eg: For disease category RM, I am interested in comparing 10nm vs 100nm, 10nm vs 50nm, 100nm vs none..etc). My code is as follows:
ddsComplete <- DESeqDataSetFromMatrix(countData=FilteredCounts, colData=phenoFile,design=~0+Monocytes:Disease)
ddsComplete <- DESeq(ddsComplete)
PhenoFile:
head(phenoFile) Position Disease Monocytes 1 A1 RM None 2 B1 RM 10nm 3 C1 RM 50nm 4 D1 RP 10nm 5 E1 RP 50nm 6 F1 RP 100nm
Results:
resultsNames(ddsComplete)
[1] "Monocytes100nm.DiseaseHealthy"
[2] "Monocytes10nm.DiseaseHealthy"
[3] "Monocytes50nm.DiseaseHealthy"
[4] "MonocytesNone.DiseaseHealthy"
[5] "Monocytes200nm.DiseaseHealthy"
[6] "Monocytes100nm.DiseaseRM"
[7] "Monocytes10nm.DiseaseRM"
[8] "Monocytes50nm.DiseaseRM"
[9] "MonocytesNone.DiseaseRM"
[10] "Monocytes200nm.DiseaseRM"
[11] "Monocytes100nm.DiseaseRP"
[12] "Monocytes10nm.DiseaseRP"
[13] "Monocytes50nm.DiseaseRP"
[14] "MonocytesNone.DiseaseRP"
[15] "Monocytes200nm.DiseaseRP"
[16] "Monocytes100nm.DiseaseSP_progress"
[17] "Monocytes10nm.DiseaseSP_progress"
[18] "Monocytes50nm.DiseaseSP_progress"
[19] "MonocytesNone.DiseaseSP_progress"
[20] "Monocytes200nm.DiseaseSP_progress"
[21] "Monocytes100nm.DiseaseSP_stable"
[22] "Monocytes10nm.DiseaseSP_stable"
[23] "Monocytes50nm.DiseaseSP_stable"
[24] "MonocytesNone.DiseaseSP_stable"
[25] "Monocytes200nm.DiseaseSP_stable"
In order to extract results for e.g. comparing 100nm vs 200nm for Disease category RP, I wrote the following contrast:
RP100_vs_200 = results(ddsComplete, contrast=c('Monocytes','100nm.DiseaseRP','200nm.DiseaseRP'))
>RP100_vs_200
log2 fold change (MLE): Monocytes 100nm.DiseaseRP vs 200nm.DiseaseRP
Wald test p-value: Monocytes 100nm.DiseaseRP vs 200nm.DiseaseRP
DataFrame with 16770 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
A1BG 0.77580621 -0.4489085 0.5952707 -0.75412502 0.45077415 0.9999971
A1BG-AS1 0.14957383 0.3799384 1.7488089 0.21725552 0.82800922 0.9999971
A1CF 0.01171596 -0.3450653 10.0207498 -0.03443507 0.97253022 0.9999971
A2M 0.46024794 -1.5849584 0.7909503 -2.00386595 0.04508442 0.9999971
A2M-AS1 0.01496816 -0.3450659 10.0207498 -0.03443514 0.97253016 0.9999971
... ... ... ... ... ... ...
ZYG11A 1.3799655 0.1101611 0.4711617 0.2338074 0.8151345 0.9999971
ZYG11B 29.3911744 -0.2092063 0.5674405 -0.3686841 0.7123632 0.9999971
ZYX 0.8827047 0.2417958 0.6352289 0.3806436 0.7034677 0.9999971
ZZEF1 0.4399459 -0.3553409 0.7771330 -0.4572459 0.6474943 0.9999971
ZZZ3 0.5939009 -0.5578059 0.6982475 -0.7988655 0.4243684 0.9999971
Could someone please help me understand how the contrast works? For the above result(RP100_vs_200) does the contrast function internally subsets groups and then generates table based upon the inputs given to it? If I was to subset my countData and phenoFile to just include disease category RP and re-run the analysis using design=~Monocytes, would the results for 100nm_vs_200nm be similiar? I expect the results to be little different for RP since the estimates would be different for subset RP considering the sample size.
Hope the question is clear! Thank you!!!
Micheal, thank you for your response. I am uploading a PCA plot of my full data below. You can clearly see from the PCA that the condition "None" (untreated) is clearly separating regardless of the disease category. I tried investigating how the results looked when all samples were used in the model vs disease specific subsets, surprisingly the results looked very different. I understand the baseMean, p-values and the lfcSE to be different but even the log2FoldChange are very different. I suspect that this could be due to "None" condition, but I am not very sure about this. What are your thoughts on this? Should this dataset be analyzed by disease subset?
https://i.imgur.com/rdWzMKc.png
"even the log2FoldChange are very different"
Can you make a plot of the LFC for a category using the two analysis approaches: first with all samples using the interaction term, and second with only samples from that category in the 'dds'?
You can subset to just the genes with small padj, to filter out noisy LFCs (another option would be to shrink LFCs across the range, but this would require you to change the design a bit, and also I don't know which version of DESeq2 you are using).
If you have res1 and res2:
Micheal,
I have attached the plot below.
This is my overall code to generate the plot: (My data has technical replicates which I have collapsed together)
Instead of using the padj I have used pval to generate the plot, most my padj value where 0.99 for the RP : 200nm vs 50nm comparison.
https://i.imgur.com/gFi64yq.png
Can you confirm that the counts in
ddsRM
are the same as inddsAll[, ddsAll$Disease == "RM"]
? I'm concerned that something may be going wrong during the subsetting procedure, which mixes up the samples. A safer way to subset a DESeqDataSet (or any such matrix-like Bioconductor object) is simply:Then you can reassign a design with:
You are right it was indeed the subsetting. The plot now shows a straight line. Thank you very much Micheal.
Here is the plot: https://i.imgur.com/1sXmMYZ.png