Hi, there:
I used edgeR a while back, and just recently I started to use it again for a RNAseq project, and I bet there might be some updates or changes, since I am trying to understand why I got different DEGs results with very different numbers of DEGs use two ways of model settings:
- setting the model and contrasts with all data together in the same model
- setting the model and contrasts with only subset of data per contrast
total 24 samples in my dataset are divided into 12 groups, but each group has only 2 samples unfortunately. Here I only discuss 3 contrasts pairwise beteween 2 of the 3 groups (H.A, H.G, H.GA) as shown in code below:
In listed three contrasts (H.G_A, H.GA_A, H.G_GA,), for the first two of the contrasts I saw reasonably similar numbers DEGs in between the two ways of model settings, but for the 3rd contrast (H.G_GA,) , I saw almost no DEGs at all in the 2nd way, but saw quite a lot DEGs in the first way. The details are described below.
The other two contreasts have much simiar results of DEGs in terms of numbers of DEGs despite minor difference, which are not shown here.
Does anyone have insights why this is happening? Which way is the better way? Or how can we tell which way to choose based on each particular dataset? Any advice would be highly appreciated! Not sure if the community has any experience or solution for such situations. I know the numbers of the samples in my dataset are not great for each group (only 2, although I have many groups), and maybe seeing all the data would help assess the variance of the data, but did not expect that including all data in the same roof of model would make such a big difference in DEGs output for the same contrast comparing to including just subset of the data in the model
Looking at the total genes for contrast H.G_GA, it seems that 2nd way filtered out more genes than the first way, although I used keep <- filterByExpr(y,group=Group) in both ways. I did check and find that most of the DEGs from the first way for contrast H.G_GA are indeed in the total gene list of 2nd way after filtering (filterByExpr), but were not called as DEGs by the 2nd way for whatever reasons.
Code should be placed in three backticks as shown below
################################################################################################################
#### the first way: the main code for setting the model and contrasts with all data together in the same model
library(reshape)
library(edgeR)
library(org.Mm.eg.db)
##total 24 samples
> show(tar5)
Types Replicate Status Types.Stat
Ctrl.A.13 Ctrl 1 A Ctrl.A
Ctrl.A.16 Ctrl 2 A Ctrl.A
......#more rows for Ctrl not shown
H.A.14 H 1 A H.A
H.A.17 H 2 A H.A
H.G.8 H 1 G H.G
H.G.11 H 2 G H.G
H.GA.20 H 1 GA H.GA
H.GA.23 H 2 GA H.GA
.....##more rows for other groups not shown
> show(head(mydata5))
....##more columns for more samples not shown
Ctrl.GA.22 Ctrl.WT.1 Ctrl.WT.4 H.A.14 H.A.17 H.G.8
ENSMUSG00000000001.4_Gnai3 8593 9090 8040 8949 9795 6949
ENSMUSG00000000003.15_Pbsn 0 0 0 0 0 0
ENSMUSG00000000028.15_Cdc45 1126 1023 705 949 1041 555
ENSMUSG00000000031.16_H19 6654 4537 8610 1321 1268 2707
ENSMUSG00000000037.16_Scml2 166 99 111 65 47 42
ENSMUSG00000000049.11_Apoh 9 10 6 0 9 1
H.G.11 H.GA.20 H.GA.23 H.WT.2 H.WT.5
ENSMUSG00000000001.4_Gnai3 8979 8749 8435 10465 9567
ENSMUSG00000000003.15_Pbsn 0 0 0 0 0
ENSMUSG00000000028.15_Cdc45 539 766 1048 902 683
ENSMUSG00000000031.16_H19 3471 2270 2157 3324 5896
ENSMUSG00000000037.16_Scml2 46 61 72 59 66
ENSMUSG00000000049.11_Apoh 7 3 12 9 8
....##more columns for more samples not shown
> show("all(colnames(mydata5)==rownames(tar5)")
[1] "all(colnames(mydata5)==rownames(tar5)"
> show(all(colnames(mydata5)==rownames(tar5)))
[1] TRUE
rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]),
ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE);
Group <- factor(tar5$Types.Stat)
y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group)
o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$SYMBOL)
y <- y[!d,]
idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL)
y <- y[idfound,];
egSYMBOL<- toTable(org.Mm.egSYMBOL)
m <- match(y$genes$SYMBOL, egSYMBOL$symbol)
y$genes$EntrezGene <- egSYMBOL$gene_id[m]
keep <- filterByExpr(y,group=Group)
y <- y[keep, , keep.lib.sizes=FALSE]
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL
y <- calcNormFactors(y)
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)
y<- estimateDisp(y,design, robust=TRUE)
fit <- glmQLFit(y,design,robust=TRUE);
my.contrasts <- makeContrasts(
H.GA_A=H.GA-H.A, H.G_GA=H.G-H.GA,H.G_A=H.G-H.A, levels=design)
for(i in 1:length(colnames(my.contrasts))){
currContrast<-colnames(my.contrasts)[i];
currTypes<-sapply(strsplit(currContrast, split="\\."), function(x) x[1])
show(currContrast)
qlf <- glmQLFTest(fit, contrast=my.contrasts[,currContrast])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}
#######################################################################
##The 2nd way: the main code for setting the model and contrasts with only subset of data per contrast as below:
mydata5T<-mydata5;
tar5T<-tar5;
AllTypes<-unique(tar5T$Types)
ConL<-list()
SamL<-list();
relevelRefL<-list();
kk<-1;
ConL[[kk]]<-"G-GA";
SamL[[kk]]<-c("G","GA")
relevelRefL[[kk]]<-"GA";
kk<-kk+1;
ConL[[kk]]<-"G-A";
SamL[[kk]]<-c("G","A")
relevelRefL[[kk]]<-"A";
kk<-kk+1;
ConL[[kk]]<-"GA-A";
SamL[[kk]]<-c("GA","A")
relevelRefL[[kk]]<-"A";
for(ii in 1:length(AllTypes)){
currTypes<-AllTypes[ii];
for(kk in 1:length(ConL)){
tar5<-tar5T[tar5T$Types==AllTypes[ii] & tar5T$Status %in% SamL[[kk]],]
mydata5<-mydata5T[,rownames(tar5)]
rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]),
ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE);
Group <- factor(tar5$Status);
Group<- relevel(Group, ref=relevelRefL[[kk]]);
y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group)
o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$SYMBOL)
y <- y[!d,]
idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL)
y <- y[idfound,];
egSYMBOL<- toTable(org.Mm.egSYMBOL)
m <- match(y$genes$SYMBOL, egSYMBOL$symbol)
y$genes$EntrezGene <- egSYMBOL$gene_id[m]
keep <- filterByExpr(y,group=Group)
y <- y[keep, , keep.lib.sizes=FALSE]
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL
y <- calcNormFactors(y)
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)
y <- estimateDisp(y,design, robust=TRUE)
fit <- glmQLFit(y,design,robust=TRUE);
currContrast <- makeContrasts(ConL[[kk]], levels=design)
qlf <- glmQLFTest(fit, contrast=currContrast)
show(currTypes)
show(ConL[[kk]])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}
}
# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session
##The output from the first way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H.G_GA"
[1] "summary(decideTests(qlf))"
1*H.G -1*H.GA
Down 528
NotSig 14280
Up 574
##The output from the 2nd way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H"
[1] "G-GA"
[1] "summary(decideTests(qlf))"
-1*GA 1*G
Down 0
NotSig 14249
Up 1
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
[4] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.38.0
[7] edgeR_3.32.1 limma_3.46.0 reshape_0.8.8
[10] reshape2_1.4.4 ggrepel_0.9.1 ggplot2_3.3.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 pillar_1.6.2 compiler_4.0.2 plyr_1.8.6
[5] tools_4.0.2 statmod_1.4.36 bit_4.0.4 memoise_2.0.0
[9] RSQLite_2.2.7 lifecycle_1.0.0 tibble_3.1.3 gtable_0.3.0
[13] lattice_0.20-44 pkgconfig_2.0.3 rlang_0.4.11 DBI_1.1.1
[17] fastmap_1.1.0 withr_2.4.2 dplyr_1.0.7 stringr_1.4.0
[21] generics_0.1.0 vctrs_0.3.8 bit64_4.0.5 locfit_1.5-9.4
[25] grid_4.0.2 tidyselect_1.1.1 glue_1.4.2 R6_2.5.0
[29] fansi_0.5.0 blob_1.2.2 purrr_0.3.4 magrittr_2.0.1
[33] splines_4.0.2 scales_1.1.1 ellipsis_0.3.2 assertthat_0.2.1
[37] colorspace_2.0-2 utf8_1.2.2 stringi_1.7.3 munsell_0.5.0
[41] cachem_1.0.5 crayon_1.4.1
Dear Dr. Smyth:
Thank you so much for your insightful comments, which are very much expected, convincing, and perfect answer to me, and very helpful as always. I have been offered great advices from you years back and still as always, you are the greatest in the world!! Really appreciated and very happy in R world we have you!!
Best wishes and have a greart weekend!! Ming
If it was the "perfect answer" then the idea of this question-and-answer-forum is that you will accept or upvote the answer you received so that readers can then see that the question has an accepted answer. If you don't accept the answer then you are giving the message that you are unsatisfied with the answers so far and are still waiting for a new answer that better resolves your question. You have received a lot of help on this forum over the years but it seems that you have never upvoted or accepted any answers that people have provided for you.