Enter the body of text here
Hi, there:
I used DESeq2 a while back, and just recently I started to use it again for a project, and I bet there might be some updates or changes, since I am trying to understand why I got different DEGs results with discrepancy values use two ways of model settings that supposed to be no difference to derive DEGs unless my codes have some issues.
Code should be placed in three backticks as shown below
##the first way: the main code for setting the model and contrast as below
> show(head(tar3,4))
DepMap_ID Lines Types RAS
AUTO_GANGLIA.Mut.CHP212 ACH.000120 CHP212 AUTO_GANGLIA Mut
AUTO_GANGLIA.Mut.SKNSH ACH.000149 SKNSH AUTO_GANGLIA Mut
AUTO_GANGLIA.Mut.SKNAS ACH.000260 SKNAS AUTO_GANGLIA Mut
AUTO_GANGLIA.WT.MHHNB11 ACH.000078 MHHNB11 AUTO_GANGLIA WT
...more lines not shown
> mydata3<-mydata3[,tar3$DepMap_ID];
> show("all(colnames(mydata3)==tar3$DepMap_ID)")
[1] "all(colnames(mydata3)==tar3$DepMap_ID)"
> show(all(colnames(mydata3)==tar3$DepMap_ID))
[1] TRUE
tar3$TypesRas<-paste(tar3$Types,tar3$RAS,sep=".");
tar3$TypesRas <- factor(tar3$TypesRas)
mydata3<-round(mydata3)
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = mydata3,
colData = tar3,
design = ~TypesRas)
colData(dds)$TypesRas <- factor(colData(dds)$TypesRas,
levels=c("AUTO_GANGLIA.Mut", "AUTO_GANGLIA.WT",
"BILE.Mut","BILE.WT",......
"URINARY.Mut", "URINARY.WT"));
dds$TypesRas <- relevel(dds$TypesRas, ref = "AUTO_GANGLIA.WT")
dds<- DESeq(dds);
show(resultsNames(dds)) # lists the coefficients
# [1] "Intercept"
#[2] "TypesRas_AUTO_GANGLIA.Mut_vs_AUTO_GANGLIA.WT"
#[3] "TypesRas_BILE.Mut_vs_AUTO_GANGLIA.WT"
# [4] "TypesRas_BILE.WT_vs_AUTO_GANGLIA.WT"
# [5] "TypesRas_BREAST.Mut_vs_AUTO_GANGLIA.WT"
#[6] "TypesRas_BREAST.WT_vs_AUTO_GANGLIA.WT"
........(more here not shown)
#[33] "TypesRas_URINARY.Mut_vs_AUTO_GANGLIA.WT"
#[34] "TypesRas_URINARY.WT_vs_AUTO_GANGLIA.WT"
##set up wanted contrasts
myConL<-list();
kk<-1;
myConL[[kk]]<-c("TypesRas","AUTO_GANGLIA.Mut", "AUTO_GANGLIA.WT");
names(myConL)[kk]<-"AUTO_GANGLIA.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("TypesRas","BILE.Mut", "BILE.WT");
names(myConL)[kk]<-"BILE.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("TypesRas","BREAST.Mut", "BREAST.WT");
names(myConL)[kk]<-"BREAST.Mut_WT";
....
##for question here, only retrieve DEGs for the first three contrasts
for(i in 1:3){
res <- results(dds, contrast=myConL[[i]]);
res<-as.data.frame(res[order(res$pvalue),]);
}
##the 2nd way: the main code for setting the model and contrast as below
> show(head(tar3,4))
DepMap_ID Lines Types RAS
AUTO_GANGLIA.Mut.CHP212 ACH.000120 CHP212 AUTO_GANGLIA Mut
AUTO_GANGLIA.Mut.SKNSH ACH.000149 SKNSH AUTO_GANGLIA Mut
AUTO_GANGLIA.Mut.SKNAS ACH.000260 SKNAS AUTO_GANGLIA Mut
AUTO_GANGLIA.WT.MHHNB11 ACH.000078 MHHNB11 AUTO_GANGLIA WT
...more lines not shown
> mydata3<-mydata3[,tar3$DepMap_ID];
> show("all(colnames(mydata3)==tar3$DepMap_ID)")
[1] "all(colnames(mydata3)==tar3$DepMap_ID)"
> show(all(colnames(mydata3)==tar3$DepMap_ID))
[1] TRUE
mydata3<-round(mydata3)
dds <- DESeqDataSetFromMatrix(countData = mydata3,
colData = tar3,
design = ~Types+RAS+Types:RAS);
dds$RAS <- relevel(dds$RAS, ref = "WT")
dds$Types <- relevel(dds$Types, ref = "AUTO_GANGLIA")
dds<- DESeq(dds)
show(resultsNames(dds)) # lists the coefficients
#[1] "Intercept" "Types_BILE_vs_AUTO_GANGLIA"
# [3] "Types_BREAST_vs_AUTO_GANGLIA"
......(more here not shown)
#[17] "Types_URINARY_vs_AUTO_GANGLIA" "RAS_Mut_vs_WT"
#[19] "TypesBILE.RASMut" "TypesBREAST.RASMut"
#[21] "TypesCNS.RASMut" "
.......
#[33] "TypesUPPER_DIGEST.RASMut" "TypesURINARY.RASMut"
##set up wanted contrasts
myConL<-list();
kk<-1;
myConL[[kk]]<-c("RAS_Mut_vs_WT");
names(myConL)[kk]<-"AUTO_GANGLIA.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("RAS_Mut_vs_WT","TypesBILE.RASMut");
names(myConL)[kk]<-"BILE.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("RAS_Mut_vs_WT","TypesBREAST.RASMut");
names(myConL)[kk]<-"BREAST.Mut_WT";
...
##for question here, only retrieve DEGs for the first three contrasts
for(i in 1:3){
res <- results(dds, contrast=myConL[[i]]);
res<-as.data.frame(res[order(res$pvalue),]);
}
# include your problematic code here with any corresponding output
For the above two ways of contrast setting for the model, when I retreived the DEGs, say, adjusted p-value <0.05, I not only get different numbers of DEGs, but also when I compared some genes that were DEGs from both setting ways for the same contrast: for example, for contrast AUTO_GANGLIA.Mut_WT, I got the following stats for the same gene:
the first way of model setting:
IDs baseMean log2FoldChange lfcSE stat pvalue padj
KRT8P14..ENSG00000214282. 8.109762759 17.28097998 1.586815863 10.8903499 1.28E-27 4.30E-23
the 2nd way of model setting:
IDs baseMean log2FoldChange lfcSE stat pvalue padj
KRT8P14..ENSG00000214282. 8.109762759 11.83411087 1.588355537 7.450542774 9.30E-14 1.37E-09
As you can see, the same gene got different stats for log2FoldChange, pvalue, and padj, although baseMean is the same, and lfcSE is very close but different.
I checked other contrasts, I had similar observations.
Can someone or Michael can point out anything wrong with my codes or any reason(s) that led to this difference that I observed?
# please also include the results of running the following in an R session
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.28.1 SummarizedExperiment_1.20.0
[3] Biobase_2.50.0 MatrixGenerics_1.2.1
[5] matrixStats_0.59.0 GenomicRanges_1.42.0
[7] GenomeInfoDb_1.26.7 IRanges_2.24.1
[9] S4Vectors_0.28.1 BiocGenerics_0.36.1
[11] limma_3.46.0 reshape_0.8.8
[13] reshape2_1.4.4 ggrepel_0.9.1
[15] ggplot2_3.3.5
loaded via a namespace (and not attached):
[1] locfit_1.5-9.4 Rcpp_1.0.6 lattice_0.20-44
[4] assertthat_0.2.1 utf8_1.2.1 R6_2.5.0
[7] plyr_1.8.6 RSQLite_2.2.7 pillar_1.6.1
[10] zlibbioc_1.36.0 rlang_0.4.11 annotate_1.66.0
[13] blob_1.2.1 Matrix_1.3-4 splines_4.0.2
[16] BiocParallel_1.24.1 geneplotter_1.66.0 stringr_1.4.0
[19] RCurl_1.98-1.3 bit_4.0.4 munsell_0.5.0
[22] DelayedArray_0.16.3 compiler_4.0.2 pkgconfig_2.0.3
[25] tidyselect_1.1.1 tibble_3.1.2 GenomeInfoDbData_1.2.4
[28] XML_3.99-0.6 fansi_0.5.0 crayon_1.4.1
[31] dplyr_1.0.7 withr_2.4.2 bitops_1.0-7
[34] grid_4.0.2 xtable_1.8-4 gtable_0.3.0
[37] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1
[40] scales_1.1.1 stringi_1.6.2 cachem_1.0.5
[43] XVector_0.30.0 genefilter_1.70.0 ellipsis_0.3.2
[46] generics_0.1.0 vctrs_0.3.8 RColorBrewer_1.1-2
[49] tools_4.0.2 bit64_4.0.5 glue_1.4.2
[52] purrr_0.3.4 survival_3.2-11 fastmap_1.1.0
Just for people's info and discussion purpose, I did try different variants that used interaction terms (the second way of model and contrast setting that I posted initially) and also I tried to avoid the way of DESeq2 specific way of setting contrasts that I used as posted initally for the second way of setting, but used more general ways instead as shown in example below. It seems that I can get rough close but always slightly different numbers of DEGs for the same contrast (with almost majority of DEGs are overlapped between these variants and the results of the first way of setting), but the issue is: many stats of the DEGs such as log2FoldChange, pvalue, padj are very different for the same DEGs of the same contrast, although basemMean and lfcSE are identical. As curiosity, I did the same exercise using edgeR, and I can get exactly the same numbers of DEGs between the two ways of model and contrast settings for the same contrast, not only the ranking or order of DEGs (if sorted by p-values etc) are the same but every statistics of the DEGs are exactly the same (e.g., logFC, logCPM, F,PValue,FDR). Just wonder if the DESeq2 have some minor issues for such a bit more complicated model and contrast settings or the way to derive the log2FoldChange etc would be subjected to change if changing the way of model and contrast settings, for which the authors of DESeq2 may have the best answers or what I did is not what intended to be in DESeq2.
Below I just give some examples of how I setting up the models and contrasts:
dds <- DESeqDataSetFromMatrix(countData = mydata3, colData = tar3, design = ~0+RAS+RAS:Types)
dds$RAS <- relevel(dds$RAS, ref = "WT") .....
[1] "RASWT" "RASMut"
[3] "RASWT.TypesBILE" "RASMut.TypesBILE"
[5] "RASWT.TypesBREAST" "RASMut.TypesBREAST"
........
[33] "RASWT.TypesURINARY" "RASMut.TypesURINARY"
myConL<-list();
kk<-1;
myConL[[kk]]<-c(-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"AUTO_GANGLIA.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c(-1,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"BILE.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c(-1,1,0,0,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"BREAST.Mut_WT";
for(i in 1:3){
res <- results(dds, contrast=myConL[[i]]);
currTypes<-gsub(".Mut_WT","",names(myConL)[i])
}
this is one of the example variants I tried, which has most overlapping DEGs for the same contrast with those from my original first model/contrast setting using combined strings in my initial post. just give one example of rough idea: 1834 vs 1846 (overlapped 1830) for BILE. From the numbers, one would say not bad at all, but the issue is: for the same contrast, many statistics including log2FoldChange for the same gene are very different, which is bothering me. for example, one gene for the two settings look like:
2 AL035541.1 ENSG00000227297. AL035541.1..ENSG00000227297. 65.83473
log2FoldChange lfcSE stat pvalue padj
2 -42.03294 1.684618 -24.95103 2.081179e-137 3.59555e-133
23 AL035541.1 ENSG00000227297. AL035541.1..ENSG00000227297. 65.83473
log2FoldChange lfcSE stat pvalue padj
23 -24.60017 1.684421 -14.60452 2.628189e-48 3.948225e-45
Here the LFC is not finite. Because we use convergence to stop for infinite LFC you get a different number. I’d recommend to just stick with a sensible coding of reference levels.
Hi, Michael:
Thx for the input. Can you be more specific on what you suggested: "I’d recommend to just stick with a sensible coding of reference levels"? Do you mean that in my second way of model setting, try different ways? Indeed, Idid try: design = ~0+RAS+RAS:Types, design =~RAS+RAS:Types design = ~0+RAS+Types+RAS:Types, or design = ~RAS+Types+RAS:Types, and also tried dds$RAS <- relevel(dds$RAS, ref = "WT") with or without dds$Types <- relevel(dds$Types, ref = "AUTO_GANGLIA"), still the similar observations. Let me know if you have alternative ways or what you really meant here.
Also not just log2FoldChange, other stats including p-values except baseMean are all different, which seem with larger scheme of difference in handling contrasts in models than just LFC-related shrinkage or convergence issue.
Alternatively, in both ways of my model/contrast settings, DESeq2 handles the same contrast in different ways that may have led to different final results. Since when I tried the similar ways setting up models in edgeR, I did not see any difference for the same contrast in the result stats from different model/contrast settings (the same way as I did in DESEq2), as observed in results of DESeq2. This seems suggesting some potential handling issues or limitation in DESeq2 for relatively complicated model settings unless my codes have some issues.
Also I did try lfcShrink(dds, type="apeglm") function. however, I did find that for my model and contrast setting, it is kind of hard for the ways of my both ways of model settings in my initial post to get this function to work for all of my contrasts, since ‘apeglm’ requires use of ‘coef’ but not "contrast=" , which is really limiting my ability to use lfcShrink() function unless I have to set up models for each pair of contrast.
This brings up my alternative question for your expertise opinions: my model setting is kind of including all contrasts in the same roof of one global model as I showed (for both ways of my model settings), although I can set up a model per contrast that I am interested in. Do you think which way is the better at least from point of DEseq2? I read from somewhere that setting up a global model for all contrasts altogether usually is better than setting one model for each contrast. Do you agree on this? This would be of course a big topic, since it may be related to how different in the different contrast in data etc. But at least for this case setting up a model per contrast,
I shall be able to use lfcShrink(dds, type="apeglm") for each contrast by using "coef=".
I would really appreciate your expertise input on these aspects. Thx so much in advance!
I mean to set the reference levels from the start to a meaningful baseline like WT and then use this throughout.
Note that for lfcShrink you can use ashr and contrast, this combination works well.
I don’t have a preference on your last question.
Hi, Michael:
Really appreciated your input and advice!
For the first question, I did used "WT" for my second way of modle setting although I did try with or without dds$Types <- relevel(dds$Types, ref = "AUTO_GANGLIA"). But for my first way of mode setting, since I used combined string, the reference level has to be "AUTO_GANGLIA.WT" as I did as below: dds$TypesRas <- relevel(dds$TypesRas, ref = "AUTO_GANGLIA.WT"). These efforts seem not able to make the consensus results in DESeq2 so far between the two ways of model settings as I had showed, although I was able to make consensus results in edegR between the two ways of model settings. My guess is my first way of model setting may have to be preferentially used in DESeq2. Hope you have any advice on this aspect.
I will try ashr with my contrast as your suggested for situations using contrast in the model.
Thx again for your great input!
I would set reference levels for all factors in the design.
Thx for the advice about the reference levels, in which I did do as you suggested! In fact, I even tried more options just want to see why I had discrepancy between the resuts of my first way of model setting vs second way of mdoel settngs.
Any advice on the discrepancy between the resuts of my first way of model setting vs second way of mdoel settngs that my post originally asked about?
Since these efforts seem not able to make the consensus results beyond just LFC in DESeq2 but also for other stats including p-values etc that are all different, which were observed so far between the two ways of model settings for DESeq2 as I had showed, although I was able to make complete consensus results in edegR between the two ways of mbodel settings
Therefore, Just to make it clear since too many questions have been asked in this post, hereby I reiterate below the two ways of model and contrast settings and corresponding results that I did in DESeq2: (I only listed the critical code here; and for the 2nd way of model setting, I only showed the options I tried, which gives the most overlapped DEGs with the 1st way for the same contrast)
the first way: the main code for setting the model and contrast as below:
tar3$TypesRas<-paste(tar3$Types,tar3$RAS,sep=".");
tar3$TypesRas <- factor(tar3$TypesRas)
dds <- DESeqDataSetFromMatrix(countData = mydata3, colData = tar3, design = ~TypesRas)
colData(dds)$TypesRas <- factor(colData(dds)$TypesRas, levels=c("AUTO_GANGLIA.Mut", "AUTO_GANGLIA.WT", "BILE.Mut","BILE.WT",...... "URINARY.Mut", "URINARY.WT"));
dds$TypesRas <- relevel(dds$TypesRas, ref = "AUTO_GANGLIA.WT")
dds<- DESeq(dds);
show(resultsNames(dds)) # lists the coefficients
[1] "Intercept"
[2] "TypesRas_AUTO_GANGLIA.Mut_vs_AUTO_GANGLIA.WT"
[3] "TypesRas_BILE.Mut_vs_AUTO_GANGLIA.WT"
[4] "TypesRas_BILE.WT_vs_AUTO_GANGLIA.WT"
[5] "TypesRas_BREAST.Mut_vs_AUTO_GANGLIA.WT"
[6] "TypesRas_BREAST.WT_vs_AUTO_GANGLIA.WT"
.....
[33] "TypesRas_URINARY.Mut_vs_AUTO_GANGLIA.WT"
[34] "TypesRas_URINARY.WT_vs_AUTO_GANGLIA.WT"
myConL<-list();
kk<-1;
myConL[[kk]]<-c("TypesRas","AUTO_GANGLIA.Mut", "AUTO_GANGLIA.WT");
names(myConL)[kk]<-"AUTO_GANGLIA.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("TypesRas","BILE.Mut", "BILE.WT");
names(myConL)[kk]<-"BILE.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c("TypesRas","BREAST.Mut", "BREAST.WT");
names(myConL)[kk]<-"BREAST.Mut_WT"; ....
for(i in 1:3){
res <- results(dds, contrast=myConL[[i]]);
res<-as.data.frame(res[order(res$pvalue),]);
}
the 2nd way: the main code for setting the model and contrast as below
tar3$Types <- factor(tar3$Types)
tar3$RAS <- factor(tar3$RAS)
dds <- DESeqDataSetFromMatrix(countData = mydata3, colData = tar3, design = ~0+RAS+RAS:Types);
dds$RAS <- relevel(dds$RAS, ref = "WT")
dds$Types <- relevel(dds$Types, ref = "AUTO_GANGLIA")
dds<- DESeq(dds)
show(resultsNames(dds)) # lists the coefficients
[1] "RASWT" "RASMut"
[3] "RASWT.TypesBILE" "RASMut.TypesBILE"
[5] "RASWT.TypesBREAST" "RASMut.TypesBREAST"
...... [31] "RASWT.TypesUPPER_DIGEST" "RASMut.TypesUPPER_DIGEST" [33] "RASWT.TypesURINARY" "RASMut.TypesURINARY"
myConL<-list();
kk<-1;
myConL[[kk]]<-c(-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"AUTO_GANGLIA.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c(-1,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"BILE.Mut_WT";
kk<-kk+1;
myConL[[kk]]<-c(-1,1,0,0,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
names(myConL)[kk]<-"BREAST.Mut_WT"; ......
for(i in 1:3){
res <- results(dds, contrast=myConL[[i]]);
res<-as.data.frame(res[order(res$pvalue),]);
}
When I compared the DEGs between the two ways of settings for the same contrasts, I did see difference between,
for contrast: "AUTO_GANGLIA.Mut_WT": 1479 vs 873 at padj<0.05 with overalpped 849
for contrast: "BILE.Mut_WT": 1834 vs 1846 at padj<0.05 with overalpped 1830
for contrast: "BREAST.Mut_WT": 1763 vs 1784 at padj<0.05 with overalpped 1758
Considering the numbers of overalapped genes, these are rather closed results except for the first contrast "AUTO_GANGLIA.Mut_WT": But still very different in terms of the actual statistics of these DEGs as shown below:
Below I showed the top DEGs from both ways of model seetings for the same contrast "BILE.Mut_WT" here FYI (I took out some anotation and baseMean columns)
first way of model setting:
1 AP005229.1 48.05928 1.654480 29.04796 1.632705e-185 5.641484e-181
2 AL035541.1 -42.03294 1.684618 -24.95103 2.081179e-137 3.595550e-133
3 HTR3E -44.55963 1.895428 -23.50900 3.299678e-122 3.800459e-118
4 AC243772.3 -40.67986 1.734676 -23.45098 1.291749e-121 1.115845e-117
second way of model setting:
1 AP005229.1 48.00919 1.654480 29.01769 3.936211e-185 1.360040e-180
2 AC022031.1 -39.05947 1.573384 -24.82513 4.800262e-136 8.292933e-132
3 HTR3E -41.68154 1.895425 -21.99060 3.542587e-107 4.080116e-103
4 MS4A14 39.83830 1.931396 20.62668 1.581351e-94 1.365971e-90
I notiiced that log2FoldChange,stat, pvalue and padj are different for the same genes, which make the ranking different. I undertsand as you mentioned log2FoldChange may be impacted by convergence to stop for infinite LFC (which may be impacted more or less by different model setting?), and so maybe p-value and padj as well, I guess?
But why contrast: "AUTO_GANGLIA.Mut_WT" did the worst? could be due to more unbalanced data since much less mut samples in this tissue type than BILE and BREAST? But different model settings all used the same data.
Also what would be the possible reasons I can not get exactly results between the two ways of model and contrast settings like I did using edgeR is still the question that I am wondering.
Any input for these aspects would hghly appreciated!
What is the correlation of the LFC for your two contrasts of mut vs wt in auto ganglia? Does the scatterplot of log2 fold changes look like these are the same contrast?
The above changes in bile mut vs wt are what I would expect with running equivalent models but with different matrices. Due to how the models converge the values will not be exactly the same. E.g. for AP005229 the test statistic varies by 0.1%, and AC022031 by 0.5%. Even seeing higher differences in LFC like in HTR3E which is ~6% would be expected when the LFC is not finite (the algorithm is stopping as the coefficient goes to positive or negative infinity). All of these are examples you provide are cases where the LFC is not finite.
Thanks a lot for the insights. Also I plotted the scatter plots for log2FC as shown. B0 and BD, also G0 and GD are indeed from the same contrast but different ways of model settings (0 vs D), and they look much better than between B(BILE) and G(auto ganglia). This is promising. Just wonder why not exactly identical. Your interpretation kind of making sense. Thx again!
For the horizontal lines, I think it is that DESeq2 "zeros" out the LFC when you are doing a two group comparison with
contrast
, when the two groups under comparison have all zero counts. So this approach (using two group comparison andcontrast
) is preferred for setting those LFC to exactly zero. For complex reasons, those LFC can drift away from zero when you have complex designs, and then the Wald statistic can even be nonzero. In the case of a multi-group design, with some groups having all zeros, the LRT is superior to Wald test, because it doesn't have the same issue undefined LFC and SE. So in short, you can either use the two groupcontrast
approach here which sets those LFC to exactly zero, or you can create a model matrix and perform LRT in DESeq2 by providing full and reduced matrices tofull
andreduced
arguments ofDESeq()
withtest="LRT"
.Thx so much for the diagnosis and advice. Indeed, in my data, some genes have all 0 for all sampes of a contrast for both "Mut" and "WT" groups, although for some other contrasts we have samples with non-zero counts for the same genes. So when you say: "So in short, you can either use the two group contrast approach here which sets those LFC to exactly zero,", Do you mean I just set up model and contrast for subset of the data e.g.,just using the sampes of these two groups of contrast (AUTO_GANGLIA.Mut, and AUTO_GANGLIA.WT) for the contrast AUTO_GANGLIA.Mut_WT for example, and deal with each contrast one model at a time, is my undersatnding correct?
also when you mentioned " or you can create a model matrix and perform LRT in DESeq2 by providing full and reduced matrices to full and reduced arguments of DESeq() with test="LRT". " I think you mean use DESeq(dds, test="LRT") instead of default test="Wald", if I use a global model to cover all contrasts in the same model as shown in the two ways of my model settings as I posted? I am just not sure what "full and reduced matrices to full and reduced arguments" you meant here.
In edgeR, I did use their filterbyExpr(y,group=Group) to get rid of those cases, but for DESeq2, since it is generally not recommended for filtering as I am aware of, although we can do something like:
keep <- rowSums(counts(dds) >= 3) >= 3;
dds <- dds[keep,]
but I did not do that yet according to the general not filtering advice to help something like dispersion assessment as I understood. So based on what you just mentoned, shall I do that here for this dataset? since in fact in some of the contrasts, e.g., for contrast AUTO_GANGLIA.Mut_WT where I actually got one group (say the Mut group) with only three samples, but only one sample with non-zero count and two other samples are with 0 count for a gene X, and the other group (i.e. WT group) has all samples as 0 counts for gene X, and indeed this gene X was picked up by DESeq2 as a DEG, which indeed I do not like, and so maybe the filtering like mentioned above would help this situation, Do you agree?
Thx again in advance!
Q1: No I meant just use your results from above that you called "the first way". Not sub-setting, as you will have worse dispersion estimates that way.
Don't worry about the LRT suggestion if you are unfamiliar because you are set with "the first way".
I would recommend filtering if you have a heterogenous dataset like this. I would use
counts(dds) >= 10
and then perhaps>= 6
samples. If you have a gene with all zeros except for three samples, it's hard to assess the LFC anyway. Filtering will help here. It doesn't hurt dispersion estimation.Thx so much for your great advice! Really appreciated your kind follow-up along the way!
Hi, Michael:
Thx a lot for the great help of my previous questions.
Sorry to bother you again since I want to follow up a few questions that are relevant to what I asked before in terms of model settings but this time is related to the derived DEGs that I intended to validate the DEGs results with their TPM values and raw reads values at analysis level.
In a similar model setting that I have asked you previously with multiple groups under the roof of the same model, then in one of the contrasts (GA vs A) that I am checking for the corresponding DEGs, I have the top 5 DEGs with the following statistics from output of results(dds, contrast=…) sorted by pvalue:
SYMBOL baseMean log2FoldChange lfcSE stat pvalue padj
Gm27177 155.421521 9.904237824 2.070825923 4.782748 1.73E-06 0.05207
Entpd4b 614.582446 0.669852004 0.158722096 4.220282 2.44E-05 0.367374
Xist 5543.798443 8.487655879 2.083273481 4.074192 4.62E-05 0.463483
Hs3st3b1 190.0771603 1.324582889 0.339567831 3.90079 9.59E-05 0.721803
Bid 737.1657088 -0.480701471 0.126551375 -3.79847 0.000146 0.876846
Then below we pull out these genes’ original Raw Read (expected counts of RSEM) with two samples for A and two samples for GA:
SYMBOL A.14 A.17 GA.20 GA.23
Gm27177 0 1 499 442
Entpd4b 369 459 638 659
Xist 52 61 38924 34
Hs3st3b1 71 70 195 152
Bid 727 737 482 553
Then below we also pull out these genes’ TPM values with two samples for A and two samples for GA:
SYMBOL A.14 A.17 GA.20 GA.23
Gm27177 0 0.02 10.37 9.18
Entpd4b 5.08 5.64 9.86 9.42
Xist 0.25 0.14 166.29 0.11
Hs3st3b1 0.69 0.58 2.04 1.44
Bid 17.66 15.28 12.94 13.11
If we look at the first table, no single gene with padj<0.05, although all with decent pvalue, however, if we checked the TPM values and even raw reads of these 5 genes between the GA vs A groups, they look not bad at all except for Xist gene although only two samples in each group. I understand this is multiple-test issue as for the high-throughput RNAseq data. And unfortunately, we only have two samples in each group of this contrast even under the same roof of model with many other groups and many other samples.
The result of this contrast apparently shows no DEGs at padj 0.05 level. Also even the PCA plot kind of support this trend that there shall be not much DEGs between GA and A groups. My questions are: Do we really believe in the judgement of the used method for the obtained result: no single DEGs at padj 0.05 even with quite some genes showed actual raw read and TPM data with obvious difference? Do you have experience that even in such situation, you might be able to validate experimentally in some cases that these genes indeed expressed differentially given the TPM and raw reads between the groups. Do you think other data under the same roof may have impacted DESeq2 to give these genes bad padj in general?
For the last question, indeed, I used only the subset of data to set up the model (only used the two samples for A and two samples for GA for the contrast GA vs A) and Top DEGs sort by pvalue shown below:
SYMBOL baseMean log2FoldChange lfcSE stat pvalue padj
Gm16867 465.599518 8.882917552 0.787147 11.28495 1.56E-29 2.31E-25
Gm12671 140.0266586 10.59748901 1.514742 6.996235 2.63E-12 1.95E-08
Gm27177 238.9063234 9.924949846 1.468001 6.760861 1.37E-11 6.79E-08
Tmem267 179.963209 -1.373050954 0.316497 -4.33827 1.44E-05 0.049749
H19 1758.066822 0.817829696 0.19 4.304356 1.67E-05 0.049749
Gm38378 13.71554853 7.245713788 1.800326 4.024667 5.71E-05 0.121064
Comparing to above DEGs result, only one gene GM27177 is in common. But this setting create more DEGs at padj<0.05 as seen from above. And the reads and TPM values also kind of support.
Raw Reads:
SYMBOL A.14 A.17 GA.20 GA.23
Gm16867 4 0 726 1120
Gm12671 0 0 346 203
Gm27177 0 1 499 442
Tmem267 254 275 99 99
H19 1321 1268 2270 2157
Gm38378 0 0 30 24
TPM values:
SYMBOL A.14 A.17 GA.20 GA.23
Gm16867 0.06 0 12.38 17.39
Gm12671 0 0 21.02 11.16
Gm27177 0 0.02 10.37 9.18
Tmem267 9.89 8.78 3.96 3.76
H19 29.92 25.32 55.49 50.34
Gm38378 0.01 0 0.72 0.53
Any advices would be really appreciated!
The p-values from DESeq2 are small. I don't know what you are expecting a procedure to do here. DESeq2 is a statistical inference procedure while you are just eyeballing TPM values and thinking that the p-value should somehow be lower than the very low values that are given. I don't have a response for you here.
Note that the top three have FDR < 0.5. This means that, even given the very low amount of data provided, the model thinks about 1.5 out of these 3 would validate.
No I think the p-values are small and you need more data.
I'll be working on BioC conference for the next week, but I think you may want to discuss power tradeoffs with local bioinformaticians to calibrate expectations here.
Thanks a lot for your kind and prompt response! Apreciated very much!
Indeed, I was just wondering if there exist general validatable alpha level that applied to your DEGs lists in your experience, although I noticed that the default DESeq2 alpha level is 0.1.
Also as mentioned in my last question, although sure as you suggested that we need more data to convince the results, but for the given available data, when I just used the subset of data to set up the model in this case, I seem to be able to derive more DEGs for the same contrast comparing to when I used all data to set up the model in the same roof. I know you mentioned before you have no preference between these two ways of model settings. But I just want to understand the underlying reasons that at least may help us to understand what the DESeq2 did behind the scene in these two scenarios and determine whether to proceed with these DEGs from biology point of views. My guess is the difference may be due to the dispersion assessment of subset vs whole set of data?