Question

edgeR: How can I determine which factors significantly explain my DE genes?

0

Entering edit mode

Robert Gilmore • 0

@robert-gilmore-15892

Last seen 4.7 years ago

Hello SE. I'm relatively new to using edgeR and GLM's. I have been doing some basic research on linear regression. In many examples of using the stats::glm function, there have been ways to determine which coefficients help explain the data being modeled after.

For example Taken from Youtube...

    # First we get our data.

    mydata <- read.table("panel80.txt")

    # attach(mydata) # In case you want to work with the variable names directly

    names(mydata) # This shows us all the variable names.

    # options(scipen=20) # suppress "scientific" notation

    options(scipen=NULL) # Brings things back to normal

    # Now we want to make a binary response variable that has a vote for Carter=0 and a vote for Reagan=1.

    mysubsetdata1 <- subset(mydata, VOTE == 1 | VOTE==2) # This keeps only those respondents who actually voted for Carter or Reagan.

    mysubsetdata2<-subset(mysubsetdata1, select=c(VOTE, REAFEEL3, CARFEEL3, INC, AGE, PARTYID)) #This keeps only the variables that we are using.

    #mysubsetdata2

    # In VOTE, a vote for Reagan=1 and a vote for Carter=2. We change that in the loop below when we create the new variable binaryvote.

    binaryvote <- 0 # This initializes the binaryvote variable.

    for (i in 1:nrow(mysubsetdata2)) {

    if (mysubsetdata2$VOTE[i]==2) binaryvote[i]=0 else binaryvote[i]=1

    }

    # Here is another way to do the same thing in one line.

    binaryvote2 <- ifelse((mysubsetdata2$VOTE == 1), 1, 0)

    mysubsetdata3<-cbind(mysubsetdata2, binaryvote, binaryvote2)

    mysubsetdata3 

    # Now we do the logistic regression. 

    logitvote.model <- glm(binaryvote ~ INC + AGE + PARTYID, family=binomial, data=mysubsetdata3) 

    summary(logitvote.model)

The summary(logitvote.model) line should print something like this:

> summary(logitvote.model)

Call:
glm(formula = binaryvote ~ INC + AGE + PARTYID, family = binomial, 
    data = mysubsetdata3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5041  -0.7614   0.3473   0.6110   2.0888  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.330928   0.572542  -5.818 5.96e-09 ***
INC          0.060679   0.021454   2.828  0.00468 ** 
AGE          0.007275   0.007407   0.982  0.32598    
PARTYID      0.730539   0.069137  10.567  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 627.21  on 463  degrees of freedom
Residual deviance: 437.00  on 460  degrees of freedom
  (52 observations deleted due to missingness)
AIC: 445

Number of Fisher Scoring iterations: 5

Here the Coefficients with an asterisk(*) are factors that help explain the response variable (in this case binaryvote).

How can I find this same information for and edgeR example? I'm looking at a large group of patients grouped together by phenotype, and I'm trying to compare several subgroups. Here's what my workflow looks like:

ds <- create_dataset("data/merged_counts_table.csv", "data/annotation.csv")
dataset <- ds$edgeR
dataset$samples$group <- relevel(dataset$samples$group, ref = "E")
keep <- rowSums(cpm(dataset) > 1) >= 2
filtered_object <- dataset[keep, , keep.lib.sizes=FALSE]
norm_object <- calcNormFactors(filtered_object)

# Dispersion
experiment_design <- ~0+group + region + sex
design <- model.matrix(experiment_design, data = norm_object$samples)
> colnames(design)
[1] "groupE"    "groupA"    "groupB"    "groupC"    "groupD"    "regionOFC" "sexMale"  

colnames(design) <- stringr::str_replace(colnames(design), "group", "")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
design <- design[, -15]
design <- design[, -19]
dataset <- estimateDisp(norm_object, design = design)

fit <- glmQLFit(dataset, design, robust=TRUE)
res <- glmQLFTest(fit, contrast=c(0, -0.5, -0.5, 0.5, 0.5, 0, 0)))
is.de <- decideTestsDGE(res)

Here's my sessionInfo:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.5.0         AnnotationDbi_1.40.0       bindrcpp_0.2.2             DESeq2_1.18.1             
 [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.1         Biobase_2.38.0            
 [9] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0          
[13] BiocGenerics_0.24.0        edgeR_3.20.9               limma_3.34.9              

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.3          assertthat_0.2.0       Formula_1.2-3          statmod_1.4.30        
 [6] DO.db_2.9              rvcheck_0.0.9          latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
[11] yaml_2.1.19            pillar_1.2.2           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35       
[16] glue_1.2.0             digest_0.6.15          RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5       
[21] qvalue_2.10.0          colorspace_1.3-2       htmltools_0.3.6        Matrix_1.2-12          plyr_1.8.4            
[26] XML_3.98-1.11          pkgconfig_2.0.1        genefilter_1.60.0      zlibbioc_1.24.0        xtable_1.8-2          
[31] GO.db_3.5.0            scales_0.5.0           BiocParallel_1.12.0    annotate_1.56.2        tibble_1.4.2          
[36] htmlTable_1.11.2       ggplot2_2.2.1          nnet_7.3-12            lazyeval_0.2.1         survival_2.41-3       
[41] magrittr_1.5           memoise_1.1.0          DOSE_3.4.0             MASS_7.3-50            foreign_0.8-69        
[46] tools_3.4.3            data.table_1.11.2      stringr_1.3.1          munsell_0.4.3          locfit_1.5-9.1        
[51] cluster_2.0.6          compiler_3.4.3         rlang_0.2.0            grid_3.4.3             RCurl_1.95-4.10       
[56] rstudioapi_0.7         htmlwidgets_1.2        igraph_1.2.1           bitops_1.0-6           base64enc_0.1-3       
[61] gtable_0.2.0           DBI_1.0.0              R6_2.2.2               reshape2_1.4.3         gridExtra_2.3         
[66] dplyr_0.7.4            knitr_1.20             bit_1.1-12             bindr_0.1.1            fastmatch_1.1-0       
[71] Hmisc_4.1-1            fgsea_1.4.1            GOSemSim_2.4.1         stringi_1.2.2          Rcpp_0.12.16          
[76] geneplotter_1.56.0     rpart_4.1-13           acepack_1.4.1

edger differential gene expression model.matrix design matrix • 2.1k views

ADD COMMENT • link updated 6.9 years ago by Gordon Smyth 52k • written 6.9 years ago by Robert Gilmore • 0

score 3 · Answer 1 · 2018-05-22

You're essentially just asking how to get p-values out of edgeR. You can see p-values from glm() or lm() by using summary(fitted.model). How can you see p-values from edgeR? The answer is simply use to use:

topTable(res)

Let me make a few observations here:

summary(glm()) gives p-values for each coefficient for one response variable. edgeR however gives you all genewise p-values for one coefficient, because that is a much easier way to interpret RNA-seq data. In other words, edgeR gives you all the p-values for one coefficient for lots of response variables. To see multiple p-values for one gene, you have to make several contrasts separately.
summary(glm()) only gives you p-values for coefficients in the design matrix. edgeR however allows you to make arbitrary contrasts of the coefficients, which you have taken advantage of in your example. Using glm() it would be very difficult to compare groups C+D to A+B as in your edgeR example.
The whole concept of "which groups explain the DE results" doesn't really apply to your RNA-seq analysis. In the RNA-seq analysis, you want to know which genes are different between groups. It's not about which groups explain the results but about which differences between groups are important for each gene. Or, to put it the other way around, which genes are DE for which comparisons ... which is what edgeR gives you.
summary(glm()) uses a very simple type of statistical test called a Wald test. This is however known to perform very poorly for binary regression. edgeR uses a much better test called a likelihood ratio test.
edgeR adjusts the p-values for multiple testing across genes to give you false discovery rates (FDR). glm() of course doesn't need to do this because it's working with only one response variable.