Hi. I have a problem about different DEG result from analysis same data using DESeq2 and edgeR. Here is two raw all DEG result without fiter based on |logFC| and padj/FDR
# DESeq2 23000 gene
baseMean log2FoldChange lfcSE stat pvalue padj
AT1G61780 849.0853127 -8.903695371 0.466024423 -19.10564109 2.27E-81 4.93E-77
AT5G37130 4294.715279 3.273271654 0.182620448 17.92390553 7.67E-72 8.34E-68
AT2G22330 1462.260281 -2.042049359 0.190228279 -10.73473076 6.99E-27 5.07E-23
AT4G39950 1767.094459 -1.715318859 0.189821306 -9.036492751 1.62E-19 8.79E-16
AT4G07850.1 157.2221618 -3.419929682 0.384430341 -8.896097194 5.78E-19 2.52E-15
AT4G33467 164.1142641 -3.886305717 0.469515739 -8.277263985 1.26E-16 4.57E-13
AT1G29025 139.9963086 3.812475165 0.494958859 7.702610213 1.33E-14 4.13E-11
AT5G39100 5286.22104 -1.419454643 0.184684843 -7.685820976 1.52E-14 4.13E-11
AT5G40780 308.4253424 1.599004982 0.21056858 7.593749184 3.11E-14 7.51E-11
AT4G37060 453.042214 -1.607152702 0.222616086 -7.219391609 5.22E-13 1.14E-09
# edgeR 19000gene
logFC logCPM F PValue FDR
AT1G61780 -8.845660521 5.249802313 1871.270675 7.25E-11 1.38E-06
AT5G37130 3.283848639 7.594636928 310.2048595 9.66E-08 0.000922328
AT2G22330 -2.030610109 6.033101299 138.9722858 2.23E-06 0.014165433
AT4G39950 -1.70625125 6.305968363 104.290019 6.66E-06 0.031780368
AT5G40780 1.608654845 3.798079714 86.26828826 1.36E-05 0.051882335
AT4G07850.1 -3.404115495 2.81708925 78.44634595 1.94E-05 0.061573085
AT4G27410 -1.360683624 3.788481722 73.59416568 2.45E-05 0.06682167
AT4G37060 -1.596523553 4.346113775 70.67441336 2.84E-05 0.067857601
AT5G39100 -1.410359242 7.886167204 65.86720875 3.68E-05 0.078056
AT3G42670 -1.25508234 3.044920624 62.2533565 4.52E-05 0.086254526
As you can see. The logFC result is very similar for every gene. But the padj and FDR are very different! If I use |logFC|>1 & padj/FDR < 0.05,then DESeq2 will remain about 400 gene, but edgeR only leave 4 gene. I dont'know why ? Here is the process code :
# DESeq2
suppressPackageStartupMessages(library(DESeq2))
count <- read.csv("demo.csv",row.names = 1, stringsAsFactors =F)
head(count)
# WT1 WT2 WT3 Al1 Al2 Al3
# AT1G01010 1301 1648 1121 1941 2307 1980
# AT1G01020 784 987 730 699 779 1033
# AT1G01030 36 54 33 30 28 39
# AT1G01040 1796 1950 1531 1433 2068 2463
# AT1G01046 13 10 6 8 9 27
# AT1G01050 1549 2030 1305 1332 1386 2593
condition <- factor(c(rep("wt",3), rep("Al", 3)), levels = c("wt", "Al"))
coldata <- data.frame(row.names = colnames(count), condition)
coldata
# condition
# WT1 wt
# WT2 wt
# WT3 wt
# Al1 Al
# Al2 Al
# Al3 Al
dds <- DESeqDataSetFromMatrix(count,coldata,design = ~ condition)
dds
# class: DESeqDataSet
# dim: 33610 6
# metadata(1): version
# assays(1): counts
# rownames(33610): AT1G01010 AT1G01020 ... ATMG01400 ATMG01410
# rowData names(0):
# colnames(6): WT1 WT2 ... Al2 Al3
# colData names(1): condition
dds <- dds[rowSums(counts(dds)) >= 6,] # filter rowSum of count <= 6
dds2 <- DESeq(dds)
res <- results(dds2)
resOrder <- res[order(res$padj),]
dim(resOrder)
#[1] 23086 6
write.csv(resOrder, "DESeq2_result_all.csv")
deg <- subset(resOrder, padj <= 0.05 & abs(log2FoldChange) >= 1) # get most significant deg
# dim(deg)
# [1] 443 6
write.csv(deg,"DESeq2_result_significant.csv")
# edgeR prcess code
suppressPackageStartupMessages(library(edgeR))
data <- read.csv("demo.csv",row.names = 1,stringsAsFactors = F)
head(data)
# WT1 WT2 WT3 Al1 Al2 Al3
# AT1G01010 1301 1648 1121 1941 2307 1980
# AT1G01020 784 987 730 699 779 1033
# AT1G01030 36 54 33 30 28 39
# AT1G01040 1796 1950 1531 1433 2068 2463
# AT1G01046 13 10 6 8 9 27
# AT1G01050 1549 2030 1305 1332 1386 2593
group <- factor(c(rep("wt",3),rep("Al",3)),levels = c("wt","Al"))
group
# [1] wt wt wt Al Al Al
# Levels: wt Al
y <- DGEList(data,group)
keep <- rowSums(cpm(y$counts)>1) >=2
y <- y[keep,,keep.lib.sizes=F]
y <- calcNormFactors(y)
design2 <- model.matrix(~group)
design2
# (Intercept) groupAl
# 1 1 0
# 2 1 0
# 3 1 0
# 4 1 1
# 5 1 1
# 6 1 1
# attr(,"assign")
# [1] 0 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
y <- estimateDisp(y,design2)
fit <- glmQLFit(y,design2)
qlf <- glmQLFTest(fit,coef = 2)
b <- topTags(qlf,n=nrow(fit$counts))
dim(b$table)
#[1] 19477 5
head(b$table)
write.csv(b,file = "edgeR_allDiff.csv")
edgeR_sigDiff.gene <- subset(b$table,FDR <0.05 & abs(logFC)>1)
dim(edgeR_sigDiff.gene)
#[1] 4 5
write.csv(edgeR_sigDiff.gene,file = "edgeR_sigDiff.csv")
Can someone give valueable suggestion or ideas! Thanks a lot!
You say that these are results without filtering, but the edgeR pipelines all assume that you have filtered on expression. You can't omit the filtering step.
No one will be able to help you unless provide the code that produced the output you show.
Thank your for suggestion! I have edit my question! And I know your No.1 point. What I want to say is "without filter raw deg result based on |logFC| and padj/FDR. Not "without filter count" .I know both edgeR and DESeq2 would filter low count in the process.