Hi,
I am using DESeq2 1.18.1 on a dataset including primary tumor and metastasis samples from the same patients, so I am doing a paired analysis. My data look like this:
> samples
                 condition  patient
007_RL1_S25_L005 primary    Patient7
008_RL1_S23_L005 metastasis Patient7
030_RL1_S11_L003 primary    Patient25
031_RL1_S12_L003 metastasis Patient25
045_RL1_S4_L001  primary    Patient37
046_RL1_S30_L006 metastasis Patient37
018_RL1_S5_L001  primary    Patient42
017_RL1_S9_L002  metastasis Patient42
024_RL1_S14_L003 primary    Patient44
024_RL1_S18_L004 metastasis Patient44
015_RL1_S19_L004 primary    Patient45
013_RL1_S20_L004 metastasis Patient45
004_RL1_S6_L002  primary    Patient47
006_RL1_S7_L002  metastasis Patient47
034_RL1_S4_L001  primary    Patient48
035_RL1_S13_L003 metastasis Patient48
I used the following commands to run DESeq2 with the aim of finding differentially expressed genes between primary tumors and metastasis:
dds <- DESeqDataSetFromMatrix(countData = count.keep, colData = samples, design = ~ patient + condition)
dds$condition <- relevel(dds$condition, ref = "primary") # re-order the levels to set "primary" as control
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)
resLFC <- lfcShrink(dds, coef = "condition_metastasis_vs_primary")
Besides running this, I also ran DESeq2 after releveling dds$patient. I don't think this is really necessary; I just wanted the levels in the same order as they appear in the samples table above:
dds$patient <- factor(dds$patient, levels = c("Patient7", "Patient25", "Patient37", "Patient42", "Patient44", "Patient45", "Patient47", "Patient48"))
After running DESeq(), results() and lfcShrink() again, however, I compared the log2FoldChange and adjusted p-values from the two analyses, and I noticed that both are very similar for most genes but differ considerably for a handful of them. I am having trouble uploading a plot showing this, so I paste here the differences (just the head and tail of the sorted vector, since it has ~28,000 elements) between the run after releveling the patient factor (resLFC2) and the run before releveling it (resLFC):
log2FoldChange:
> logfc <- resLFC2$log2FoldChange - resLFC$log2FoldChange
> head(sort(logfc), 50)
 [1] -0.4640070 -0.3845596 -0.3427803 -0.3230904 -0.2984851 -0.2928294 -0.2905385 -0.2904418 -0.2888848 -0.2872439 -0.2821799 -0.2641823
[13] -0.2439860 -0.2363212 -0.2319339 -0.2268218 -0.2254644 -0.2229968 -0.2120390 -0.2090975 -0.2087127 -0.2020668 -0.1998705 -0.1901468
[25] -0.1799651 -0.1772388 -0.1750045 -0.1690204 -0.1687915 -0.1642568 -0.1639595 -0.1630091 -0.1618522 -0.1600158 -0.1594862 -0.1567698
[37] -0.1563773 -0.1550373 -0.1543865 -0.1541377 -0.1540670 -0.1508587 -0.1483248 -0.1473214 -0.1462534 -0.1456059 -0.1450134 -0.1447601
[49] -0.1384240 -0.1377154
> tail(sort(logfc), 50)
 [1] 0.2789096 0.2840764 0.2930939 0.2957986 0.3028408 0.3037882 0.3040117 0.3097307 0.3122050 0.3126521 0.3143998 0.3178907 0.3192067
[14] 0.3216394 0.3231158 0.3240794 0.3274379 0.3276931 0.3362521 0.3374602 0.3390317 0.3398524 0.3454407 0.3467025 0.3493566 0.3493887
[27] 0.3499525 0.3572250 0.3595307 0.3599138 0.3622678 0.3635893 0.3637432 0.3665189 0.3678655 0.3743606 0.3778559 0.3789713 0.3844315
[40] 0.3849069 0.3895188 0.3930795 0.4083101 0.4354175 0.5025769 0.5609633 0.5774954 0.5995166 0.7007780 0.7019946
Adjusted p-values:
> apv <- resLFC2$padj - resLFC$padj
> head(sort(apv), 50)
 [1] -2.785914e-01 -2.644752e-01 -1.552267e-01 -8.497233e-02 -7.797622e-02 -7.545974e-02 -2.573009e-02 -3.442260e-03 -3.368228e-03
[10] -2.775045e-03 -3.353486e-04 -1.141852e-06 -1.141852e-06 -3.516378e-07 -1.549470e-18  3.567823e-08  2.735471e-07  3.343793e-07
[19]  4.276062e-07  4.276062e-07  4.709782e-07  4.709782e-07  1.753099e-06  1.753099e-06  1.812765e-06  1.830827e-06  2.321229e-06
[28]  2.512533e-06  2.895689e-06  2.895689e-06  2.895689e-06  3.307283e-06  3.307283e-06  3.307283e-06  3.307283e-06  3.716460e-06
[37]  4.933032e-06  4.933032e-06  4.933032e-06  4.933032e-06  5.185253e-06  5.999162e-06  7.144691e-06  7.144691e-06  7.144691e-06
[46]  7.144691e-06  7.144691e-06  7.550917e-06  7.829693e-06  7.829693e-06
> tail(sort(apv), 50)
 [1] 0.01177972 0.01178042 0.01178478 0.01178478 0.01178577 0.01178577 0.01178577 0.01178747 0.01178747 0.01178807 0.01179446 0.01179949
[13] 0.01179949 0.01180295 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523 0.01180523
[25] 0.01180523 0.01180523 0.01180553 0.01180574 0.01180574 0.01188054 0.01189941 0.01191584 0.01205369 0.01217532 0.01345537 0.01498878
[37] 0.01628600 0.04005408 0.04870757 0.04890803 0.05299741 0.06693309 0.08588581 0.10006774 0.11252920 0.17237000 0.24872782 0.54833178
[49] 0.59861983 0.64759282
As you can see, the differences for some genes are quite striking. Since lfcShrink() should return the log2FoldChange and p-values for the last variable in the design formula ("condition_metastasis_vs_primary"), I didn't expect the results to change just by releveling the patient factor.
Do you have any explanation for this? Am I missing something?
Thanks in advance for your help!
Luisa
Hi Michael,
thanks for your quick reply!
I checked what you suggested by looking at the adjusted p-values for the genes with a difference in log2FoldChange greater than |0.2| between the two analyses, to select the subset with the most extreme changes. This subset includes 175 genes, of which 15 have an adjusted p-value < 0.05. I should note that of these 175 genes, only 1 would have dropped out of my list of differentially expressed genes, since I first select genes with adjusted p-value < 0.05 and then apply an additional filter, keeping only genes with |log2FoldChange| > 1. Nonetheless, it would be good to know which log2FoldChange and p-values I should trust more.
Regarding your other suggestion, I tried type="apeglm", but it only works with the dds after releveling the patient factor. If I don't relevel, I get this error:
Do you have further suggestions? Or should I just pick one approach (relevel or not), and consider that the genes I might miss by using one or the other are probably not strong candidates anyway, since they don't come up as differentially expressed in both cases?
Thanks again.
Luisa
I'd prefer to go with one of the new estimators, and drop the old LFC shrinkage method for this dataset.
It's too bad that apeglm hits a problem there. I'd guess it has to do with genes with small counts. Two options: (1) try filtering out genes that have zeros for nearly all samples, since these apparently trip up apeglm on this dataset, or (2) use type="ashr".
To filter out the genes with nearly all zeros, try:
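The original filter code was not preserved in this thread; a minimal sketch of one such filter, assuming the DESeqDataSet is named dds as above (the cutoff of 3 nonzero samples is an illustrative assumption, not the original suggestion):

```r
## Hedged sketch: keep genes with a nonzero count in at least 3 samples.
## The threshold "3" is an assumption and should be tuned to the dataset;
## counts() is the DESeq2 accessor for the raw count matrix.
keep <- rowSums(counts(dds) > 0) >= 3
dds <- dds[keep, ]
```

A stricter variant (e.g. requiring counts >= 10 in several samples) is also common; the point is simply to remove the nearly-all-zero genes before calling lfcShrink(type = "apeglm").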
Hi Michael,
as you suggested, I removed the genes with low counts. type="apeglm" gave the same error again, but type="ashr" worked, and the log2FoldChanges from the two analyses are now definitely more consistent than before.
Different shrinkage estimators, however, don't explain the differences in adjusted p-values, right? What could those be due to?
Thanks,
Luisa
Are the adjusted p-value differences for genes with high adjusted p-values? For this kind of design, the Wald test can give somewhat different results depending on whether or not the reference level has all zeros. The LRT is more robust to the refactoring for this kind of design.
No, the adjusted p-value differences appear all along the range from 0 to 1. With the LRT, more genes show considerable differences in adjusted p-values (with the Wald test it was only 14 genes), but since these big differences concern only genes with high p-values, it actually gets better!
In the future I'll have to analyze other datasets with the same design: do you think the LRT is more robust in all cases, or does it depend on the features of each dataset?
Thanks for your very useful advice!
Luisa
The LRT is more robust to re-factoring when controlling for something like patient where there are many groups, and when some of those groups have all zeros for both conditions while the other groups have high counts for either or both conditions.
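For reference, a minimal sketch of running the LRT for this paired design, assuming dds was built with design = ~ patient + condition as above (the reduced formula drops condition, so the test asks whether condition improves the fit over patient alone):

```r
## LRT sketch: compare the full model (~ patient + condition) against the
## reduced model (~ patient) to test the effect of condition.
dds <- DESeq(dds, test = "LRT", reduced = ~ patient)
resLRT <- results(dds, alpha = 0.05)
```

Note that with the LRT the p-values come from the likelihood ratio test, while the log2FoldChange column still reports the condition coefficient from the full model fit.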
Ok, I'll keep that in mind, thank you very much!
Luisa