Hi,
I've a small question concerning differential expression analysis regarding 12 samples divided into 2 cell lines (cell 1 & 2), two conditions: tumor (T) and control (C):
Cell | Condition | Patient number (p) |
Cell1 | T | 1 |
Cell1 | T | 2 |
Cell1 | T | 4 |
Cell2 | T | 1 |
Cell2 | T | 2 |
Cell2 | T | 4 |
Cell1 | C | 3 |
Cell1 | C | 5 |
Cell1 | C | 6 |
Cell2 | C | 3 |
Cell2 | C | 5 |
Cell2 | C | 6 |
In order to capture the differences among Cell1 and Cell2 (considering Cell2 and condition C as reference levels), I used the following model:
design = ~ Cell+Condition+Cell:Condition
but the DEGs (significant genes) I am getting here via interaction term (Cell:Condition) is not considering patient. Therefore, please help me to design a model which can perform a differential expression analysis by doing the following:
(i) Cell1/Cell2: T samples vs. Cell1/Cell2 C samples including patient information.
I like to obtain DEGs between Cell 1 and Cell2 cell lines considering (i) comparison.
Thank you,
Here's that section:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups
Note that the terminology is different, your "condition" is called "group" in the vignette, and your "cell" is called "condition" in the vignette. Otherwise you should be able to apply the recommendation straight away to your data.
Hi Michael:
Thank you for your suggestion and the link. I have applied the model as suggested in vignettes by creating an additional column i.e. (ind.n)
ds <- DESeqDataSetFromMatrix(countData=counts, colData=samples, design=~cell+cell:ind.n+cell:cnd) here cnd=condition
ds$cell <- relevel(ds$cell, "Cell2")
ds$cnd <- relevel(ds$cnd, "C")
ds=DESeq(ds)
This model has returned the following resultsNames:
cell_X_vs_Y, cellX.ind.np2, cellY.ind.np2, cellX.ind.np3, cellY.ind.np3, cellX.cndT, cellY.cndT
Consider X as cell1 and Y as cell2.
But this model is not giving the results for patient1 samples (T vs. C). How can I get this?
To obtain the differences among two cell lines considering patients information, do I need to combine all the significant genes obtained by results name ending with .np2 & .np3. Is this correct?
Thank you!
Note that you're not following the example exactly. It should be (using your terms) condition:individual, not cell:individual. This is because it is across condition (C or T) that the individuals are nested.
Furthermore, once you use the condition:individual, note that the model controls for patient differences, but it's letting you estimate the Y vs X effect for C and for T, or to contrast that effect between T and C. But the differences seen in individual patients are used as replication to evaluate the significance. You will not get Y vs X coefficients for each patient. You would need more replication to do a statistical analysis at that level.
Michael - thank you for your reply.
I want the exact interaction what you have mentioned in your response i.e. Y vs X effect for C and for T, or to contrast that effect between T and C, specially contrast between T(Y/X) and C(Y/X).
Can you please check my the below approach, if I am doing correct?
Revised EDF (experimental design file):
As per your suggestion, I am using the following design:
##################################################################################
ds<- DESeqDataSetFromMatrix(countData=counts, colData=edf, design = ~cnd+cnd:ind.n + cnd:cell)
ds$cnd <- relevel(ds$cnd, "C")
ds$cell <- relevel(ds$cell, "X")
ds = DESeq(ds)
resultsNames(ds)
[1] "Intercept" "cnd_T_vs_C" "cndC.ind.np2" "cndT.ind.np2" "cndC.ind.np3" "cndT.ind.np3" "cndC.cellX" "cndT.cellX"
res.model1 <- results(ds, contrast=list("cndT.cellX", "cndC.cellX")
#######################################################################
is this the correct design to fetch DEGs for the contrast between T(Y/X) and C(Y/X)?
Thank you
If you relevel() like in the code above, then the resultsNames() should have given you "cndC.cellY" instead of "cndC.cellX". Can you try that again and see if you get "cndC.cellY". I'm guessing you may have added in the relevel() code later but the copy/paste of resultsNames() was from before?
But aside from this issue, if you use
You will get 1) the Y vs X effect in C, 2) the Y vs X effect in T, and 3) the difference between the Y vs X effect in T compared to C, all while controlling for baseline patient differences.
Michael - thank you very much for your diligent feedback and detailed explanation which is very much appreciated.
Regarding relevel(), you correctly guessed that I copied/pasted before adding relevel(), therefore, the interested names are as you mentioned i.e. cndC.cellY and cndT.cellY.
Again, thank you very much!
Hi Michael:
After getting DEGs using contrast = list("cndT.cellY", "cndC.cellY"), I have notified some unusual observations :
1. the fold change (or logfc) doesn't match with the expression pattern of several genes for two different groups. For example, say gene-x has down patterns in group X but more up expression in Y, the expected fold should be up whereas contrast is giving opposite results (negative fold) and vice verse.
2. With FDR <5%, there is no signal gene found significant. But at pv<0.05, I found ~600 genes. Upon comparing the top significant genes (p-value) using their relative expression values in three patients for a given groups [C or T], genes are not at all having similar patterns see below table:
X and Y are two cell lines
C and T are control and Tumor
3. When I performed a t-test on the relative expression values, I got ~2000 genes with pv<0.05 with perfect patterns among patients of two different groups (up up up in C, down down down in T or vice versa), but not with DESeq2. Is there any bug in my code mentioned in my previous messages?
Looking forward to learn your suggestions.
Thank you
I see you have "X/Y" in the tables above, although if you read my explanation above, the coefficients specify a log fold change for Y/X. Is this the confusion? You determine what the reference level is, which is the denominator of the fold change. Earlier you said the reference (denominator) should be X. It can be whatever you want, but you should pick one and go with it for the whole analysis.
You shouldn't be looking at p-values, only adjusted p-values.
You can expect 5% of a null dataset to have a p-value less than 0.05. So looking at the number of small p-values isn't necessarily interesting, and hence statisticians have come up with the methods for controlling FDR, and an adjusted p-value to produce such sets.
A t-test on estimated fold changes ignores the data distribution, and does not share parameter estimates across genes. It is very much not recommended by me. It's not surprising to me that you get different results with a method which is totally different.
Thank you for your reply, Michael.
Regarding X/Y or Y/X both fold-change values are not matching with the relative expression values. Could it be because of higher FDR values?
I do agree with your comment on FDR cut-off.
I guess I would double check how you are double-checking the coefficients, and walk through the 3 tables I described above. The contrast is simply the difference between LFCs for table 1 and 2, so there should be no surprises when you get to table 3.