Question

EdgeR question: why I got discrepancy results for DEGs when using different ways setting up the models

0

Entering edit mode

Ming ▴ 380

@ming-yi-6363

Last seen 2.6 years ago

United States

Hi, there:

I used edgeR a while back, and just recently I started to use it again for a RNAseq project, and I bet there might be some updates or changes, since I am trying to understand why I got different DEGs results with very different numbers of DEGs use two ways of model settings:

setting the model and contrasts with all data together in the same model
setting the model and contrasts with only subset of data per contrast

total 24 samples in my dataset are divided into 12 groups, but each group has only 2 samples unfortunately. Here I only discuss 3 contrasts pairwise beteween 2 of the 3 groups (H.A, H.G, H.GA) as shown in code below:

In listed three contrasts (H.G_A, H.GA_A, H.G_GA,), for the first two of the contrasts I saw reasonably similar numbers DEGs in between the two ways of model settings, but for the 3rd contrast (H.G_GA,) , I saw almost no DEGs at all in the 2nd way, but saw quite a lot DEGs in the first way. The details are described below.

The other two contreasts have much simiar results of DEGs in terms of numbers of DEGs despite minor difference, which are not shown here.

Does anyone have insights why this is happening? Which way is the better way? Or how can we tell which way to choose based on each particular dataset? Any advice would be highly appreciated! Not sure if the community has any experience or solution for such situations. I know the numbers of the samples in my dataset are not great for each group (only 2, although I have many groups), and maybe seeing all the data would help assess the variance of the data, but did not expect that including all data in the same roof of model would make such a big difference in DEGs output for the same contrast comparing to including just subset of the data in the model

Looking at the total genes for contrast H.G_GA, it seems that 2nd way filtered out more genes than the first way, although I used keep <- filterByExpr(y,group=Group) in both ways. I did check and find that most of the DEGs from the first way for contrast H.G_GA are indeed in the total gene list of 2nd way after filtering (filterByExpr), but were not called as DEGs by the 2nd way for whatever reasons.

Code should be placed in three backticks as shown below

################################################################################################################
#### the first way: the main code for setting the model and contrasts with all data together in the same model

library(reshape)
library(edgeR)
library(org.Mm.eg.db)

##total 24 samples 
> show(tar5)
           Types Replicate Status Types.Stat
Ctrl.A.13   Ctrl         1      A     Ctrl.A
Ctrl.A.16   Ctrl         2      A     Ctrl.A
......#more rows for Ctrl not shown
H.A.14         H         1      A        H.A
H.A.17         H         2      A        H.A
H.G.8          H         1      G        H.G
H.G.11         H         2      G        H.G
H.GA.20        H         1     GA       H.GA
H.GA.23        H         2     GA       H.GA
.....##more rows for other groups not shown

> show(head(mydata5))
....##more columns for more samples not shown
                            Ctrl.GA.22 Ctrl.WT.1 Ctrl.WT.4 H.A.14 H.A.17 H.G.8
ENSMUSG00000000001.4_Gnai3        8593      9090      8040   8949   9795  6949
ENSMUSG00000000003.15_Pbsn           0         0         0      0      0     0
ENSMUSG00000000028.15_Cdc45       1126      1023       705    949   1041   555
ENSMUSG00000000031.16_H19         6654      4537      8610   1321   1268  2707
ENSMUSG00000000037.16_Scml2        166        99       111     65     47    42
ENSMUSG00000000049.11_Apoh           9        10         6      0      9     1
                            H.G.11 H.GA.20 H.GA.23 H.WT.2 H.WT.5 
ENSMUSG00000000001.4_Gnai3    8979    8749    8435  10465   9567   
ENSMUSG00000000003.15_Pbsn       0       0       0      0      0     
ENSMUSG00000000028.15_Cdc45    539     766    1048    902    683  
ENSMUSG00000000031.16_H19     3471    2270    2157   3324   5896   
ENSMUSG00000000037.16_Scml2     46      61      72     59     66    
ENSMUSG00000000049.11_Apoh       7       3      12      9      8    
....##more columns for more samples not shown

> show("all(colnames(mydata5)==rownames(tar5)")
[1] "all(colnames(mydata5)==rownames(tar5)"
> show(all(colnames(mydata5)==rownames(tar5)))
[1] TRUE

rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]),
ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE);

Group <- factor(tar5$Types.Stat)
y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group)

o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$SYMBOL)
y <- y[!d,]
idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL)
y <- y[idfound,];
egSYMBOL<- toTable(org.Mm.egSYMBOL)

m <- match(y$genes$SYMBOL, egSYMBOL$symbol)
y$genes$EntrezGene <- egSYMBOL$gene_id[m]

keep <- filterByExpr(y,group=Group)
y <- y[keep, , keep.lib.sizes=FALSE]
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL

y <- calcNormFactors(y)

design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)

y<- estimateDisp(y,design, robust=TRUE)
fit <- glmQLFit(y,design,robust=TRUE);

my.contrasts <- makeContrasts(
H.GA_A=H.GA-H.A, H.G_GA=H.G-H.GA,H.G_A=H.G-H.A, levels=design)

for(i in 1:length(colnames(my.contrasts))){
currContrast<-colnames(my.contrasts)[i];
currTypes<-sapply(strsplit(currContrast, split="\\."), function(x) x[1])
show(currContrast)
qlf <- glmQLFTest(fit, contrast=my.contrasts[,currContrast])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}

#######################################################################
##The 2nd way: the main code for setting the model and contrasts with only subset of data per contrast as below:

mydata5T<-mydata5;
tar5T<-tar5;

AllTypes<-unique(tar5T$Types)

ConL<-list()
SamL<-list();
relevelRefL<-list();
kk<-1;
ConL[[kk]]<-"G-GA";
SamL[[kk]]<-c("G","GA")
relevelRefL[[kk]]<-"GA";
kk<-kk+1;
ConL[[kk]]<-"G-A";
SamL[[kk]]<-c("G","A")
relevelRefL[[kk]]<-"A";
kk<-kk+1;
ConL[[kk]]<-"GA-A";
SamL[[kk]]<-c("GA","A")
relevelRefL[[kk]]<-"A";

for(ii in 1:length(AllTypes)){
currTypes<-AllTypes[ii]; 
for(kk in 1:length(ConL)){
tar5<-tar5T[tar5T$Types==AllTypes[ii] & tar5T$Status %in% SamL[[kk]],]
mydata5<-mydata5T[,rownames(tar5)]
rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]),
ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE);

Group <- factor(tar5$Status);
Group<- relevel(Group, ref=relevelRefL[[kk]]);

y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group)

o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$SYMBOL)
y <- y[!d,]
idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL)
y <- y[idfound,];
egSYMBOL<- toTable(org.Mm.egSYMBOL)
m <- match(y$genes$SYMBOL, egSYMBOL$symbol)
y$genes$EntrezGene <- egSYMBOL$gene_id[m]
keep <- filterByExpr(y,group=Group)
y <- y[keep, , keep.lib.sizes=FALSE]
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL
y <- calcNormFactors(y)

design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)

y <- estimateDisp(y,design, robust=TRUE)

fit <- glmQLFit(y,design,robust=TRUE);
currContrast <- makeContrasts(ConL[[kk]], levels=design)
qlf <- glmQLFTest(fit, contrast=currContrast)

show(currTypes)
show(ConL[[kk]])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}
}

# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

##The output from the first way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H.G_GA"
[1] "summary(decideTests(qlf))"
       1*H.G -1*H.GA
Down             528
NotSig         14280
Up               574


##The output from the 2nd way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H"
[1] "G-GA"
[1] "summary(decideTests(qlf))"
       -1*GA 1*G
Down           0
NotSig     14249
Up             1


sessionInfo( )

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] org.Mm.eg.db_3.12.0  AnnotationDbi_1.52.0 IRanges_2.24.1
 [4] S4Vectors_0.28.1     Biobase_2.50.0       BiocGenerics_0.38.0
 [7] edgeR_3.32.1         limma_3.46.0         reshape_0.8.8
[10] reshape2_1.4.4       ggrepel_0.9.1        ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.6.2     compiler_4.0.2   plyr_1.8.6
 [5] tools_4.0.2      statmod_1.4.36   bit_4.0.4        memoise_2.0.0
 [9] RSQLite_2.2.7    lifecycle_1.0.0  tibble_3.1.3     gtable_0.3.0
[13] lattice_0.20-44  pkgconfig_2.0.3  rlang_0.4.11     DBI_1.1.1
[17] fastmap_1.1.0    withr_2.4.2      dplyr_1.0.7      stringr_1.4.0
[21] generics_0.1.0   vctrs_0.3.8      bit64_4.0.5      locfit_1.5-9.4
[25] grid_4.0.2       tidyselect_1.1.1 glue_1.4.2       R6_2.5.0
[29] fansi_0.5.0      blob_1.2.2       purrr_0.3.4      magrittr_2.0.1
[33] splines_4.0.2    scales_1.1.1     ellipsis_0.3.2   assertthat_0.2.1
[37] colorspace_2.0-2 utf8_1.2.2       stringi_1.7.3    munsell_0.5.0
[41] cachem_1.0.5     crayon_1.4.1

RNAseq edgeR DEGs • 1.3k views

ADD COMMENT • link updated 3.1 years ago by Gordon Smyth 52k • written 3.7 years ago by Ming ▴ 380

score 0 · Answer 1 · 2021-08-06

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

edgeR is designed to analyse datasets as a whole rather than to chop them up into little pieces. It is a universal property of anova in general, not just of edgeR. that it is more powerful to analyse all the groups together rather than to do pairwise comparisons with subsets of the data. It is obviously unrealistic to expect to estimate the variance reliably from just n=2 samples, which is the reason that comparing n=2 vs n=2 loses so much power and gives fewer DE genes.

ADD COMMENT • link 3.7 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Dr. Smyth:

Thank you so much for your insightful comments, which are very much expected, convincing, and perfect answer to me, and very helpful as always. I have been offered great advices from you years back and still as always, you are the greatest in the world!! Really appreciated and very happy in R world we have you!!

Best wishes and have a greart weekend!! Ming

ADD REPLY • link 3.7 years ago Ming ▴ 380

0

Entering edit mode

If it was the "perfect answer" then the idea of this question-and-answer-forum is that you will accept or upvote the answer you received so that readers can then see that the question has an accepted answer. If you don't accept the answer then you are giving the message that you are unsatisfied with the answers so far and are still waiting for a new answer that better resolves your question. You have received a lot of help on this forum over the years but it seems that you have never upvoted or accepted any answers that people have provided for you.

ADD REPLY • link 3.1 years ago Gordon Smyth 52k