Hello SE. I'm relatively new to using edgeR and GLM's. I have been doing some basic research on linear regression. In many examples of using the stats::glm function, there have been ways to determine which coefficients help explain the data being modeled after.
For example Taken from Youtube...
# First we get our data.
mydata <- read.table("panel80.txt")
# attach(mydata) # In case you want to work with the variable names directly
names(mydata) # This shows us all the variable names.
# options(scipen=20) # suppress "scientific" notation
options(scipen=NULL) # Brings things back to normal
# Now we want to make a binary response variable that has a vote for Carter=0 and a vote for Reagan=1.
mysubsetdata1 <- subset(mydata, VOTE == 1 | VOTE==2) # This keeps only those respondents who actually voted for Carter or Reagan.
mysubsetdata2<-subset(mysubsetdata1, select=c(VOTE, REAFEEL3, CARFEEL3, INC, AGE, PARTYID)) #This keeps only the variables that we are using.
#mysubsetdata2
# In VOTE, a vote for Reagan=1 and a vote for Carter=2. We change that in the loop below when we create the new variable binaryvote.
binaryvote <- 0 # This initializes the binaryvote variable.
for (i in 1:nrow(mysubsetdata2)) {
if (mysubsetdata2$VOTE[i]==2) binaryvote[i]=0 else binaryvote[i]=1
}
# Here is another way to do the same thing in one line.
binaryvote2 <- ifelse((mysubsetdata2$VOTE == 1), 1, 0)
mysubsetdata3<-cbind(mysubsetdata2, binaryvote, binaryvote2)
mysubsetdata3
# Now we do the logistic regression.
logitvote.model <- glm(binaryvote ~ INC + AGE + PARTYID, family=binomial, data=mysubsetdata3)
summary(logitvote.model)
The summary(logitvote.model) line should print something like this:
> summary(logitvote.model)
Call:
glm(formula = binaryvote ~ INC + AGE + PARTYID, family = binomial,
data = mysubsetdata3)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5041 -0.7614 0.3473 0.6110 2.0888
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.330928 0.572542 -5.818 5.96e-09 ***
INC 0.060679 0.021454 2.828 0.00468 **
AGE 0.007275 0.007407 0.982 0.32598
PARTYID 0.730539 0.069137 10.567 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 627.21 on 463 degrees of freedom
Residual deviance: 437.00 on 460 degrees of freedom
(52 observations deleted due to missingness)
AIC: 445
Number of Fisher Scoring iterations: 5
Here the Coefficients with an asterisk(*) are factors that help explain the response variable (in this case binaryvote).
How can I find this same information for and edgeR example? I'm looking at a large group of patients grouped together by phenotype, and I'm trying to compare several subgroups. Here's what my workflow looks like:
ds <- create_dataset("data/merged_counts_table.csv", "data/annotation.csv")
dataset <- ds$edgeR
dataset$samples$group <- relevel(dataset$samples$group, ref = "E")
keep <- rowSums(cpm(dataset) > 1) >= 2
filtered_object <- dataset[keep, , keep.lib.sizes=FALSE]
norm_object <- calcNormFactors(filtered_object)
# Dispersion
experiment_design <- ~0+group + region + sex
design <- model.matrix(experiment_design, data = norm_object$samples)
> colnames(design)
[1] "groupE" "groupA" "groupB" "groupC" "groupD" "regionOFC" "sexMale"
colnames(design) <- stringr::str_replace(colnames(design), "group", "")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
design <- design[, -15]
design <- design[, -19]
dataset <- estimateDisp(norm_object, design = design)
fit <- glmQLFit(dataset, design, robust=TRUE)
res <- glmQLFTest(fit, contrast=c(0, -0.5, -0.5, 0.5, 0.5, 0, 0)))
is.de <- decideTestsDGE(res)
Here's my sessionInfo:
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0 bindrcpp_0.2.2 DESeq2_1.18.1
[5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1 matrixStats_0.53.1 Biobase_2.38.0
[9] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
[13] BiocGenerics_0.24.0 edgeR_3.20.9 limma_3.34.9
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.3 assertthat_0.2.0 Formula_1.2-3 statmod_1.4.30
[6] DO.db_2.9 rvcheck_0.0.9 latticeExtra_0.6-28 blob_1.1.1 GenomeInfoDbData_1.0.0
[11] yaml_2.1.19 pillar_1.2.2 RSQLite_2.1.1 backports_1.1.2 lattice_0.20-35
[16] glue_1.2.0 digest_0.6.15 RColorBrewer_1.1-2 XVector_0.18.0 checkmate_1.8.5
[21] qvalue_2.10.0 colorspace_1.3-2 htmltools_0.3.6 Matrix_1.2-12 plyr_1.8.4
[26] XML_3.98-1.11 pkgconfig_2.0.1 genefilter_1.60.0 zlibbioc_1.24.0 xtable_1.8-2
[31] GO.db_3.5.0 scales_0.5.0 BiocParallel_1.12.0 annotate_1.56.2 tibble_1.4.2
[36] htmlTable_1.11.2 ggplot2_2.2.1 nnet_7.3-12 lazyeval_0.2.1 survival_2.41-3
[41] magrittr_1.5 memoise_1.1.0 DOSE_3.4.0 MASS_7.3-50 foreign_0.8-69
[46] tools_3.4.3 data.table_1.11.2 stringr_1.3.1 munsell_0.4.3 locfit_1.5-9.1
[51] cluster_2.0.6 compiler_3.4.3 rlang_0.2.0 grid_3.4.3 RCurl_1.95-4.10
[56] rstudioapi_0.7 htmlwidgets_1.2 igraph_1.2.1 bitops_1.0-6 base64enc_0.1-3
[61] gtable_0.2.0 DBI_1.0.0 R6_2.2.2 reshape2_1.4.3 gridExtra_2.3
[66] dplyr_0.7.4 knitr_1.20 bit_1.1-12 bindr_0.1.1 fastmatch_1.1-0
[71] Hmisc_4.1-1 fgsea_1.4.1 GOSemSim_2.4.1 stringi_1.2.2 Rcpp_0.12.16
[76] geneplotter_1.56.0 rpart_4.1-13 acepack_1.4.1
Ok. Thanks! I'm a biomedical engineer by design, but I've been retraining myself to do computation biology over the past few years.
What you said makes sense, and some of which crossed my mind. However, I don't like to put my data into a conceptual black box. Now I feel much more confident in moving forward.
Also, I'm about to start getting involved with a new python port of edgeR called edgePy. The r/bioinformatics community has come together to begin writing it. https://github.com/r-bioinformatics/edgePy
Thanks for letting me know about edgePy. I think you will find this a very big project. Also, I don't see how edgePy hopes to be faster than the original, since edgeR is primarily written in C++.
Well they have a plan. I believe they will be using NumPy's an SciPy's math library, which is written in C. So it should be comparable, but I'm not sure to be honest. I think some people just prefer Python over R. I personally like both, and use both.