Dear Bioconductor volunteers,
I am Irene Román, postdoctoral researcher in epidemiology, and I am writing because I have some questions associated to an analysis I am currently doing.
I am analyzing biomarkers of demyelination in mice. In a first step, I am analyzing biomarkers in microglia. I have selected 2 GEOs and analyzed them together. I attach the most important code for the analysis. I am not including quality control (QC) figures to be more concise (although I can include them if needed), QC figures were ok after the steps that are shown in the code.
My questions are the following:
In the eBayes function there is one option to include an intensity-trend for the prior variance (trend) and another to robustify estimations against outlier sample variances (robust). I am inclined to put them both=TRUE as probably the prior variance is not constant and there could be some outliers. Although I have read the function and the limma package documentation in detail I cannot get a clear idea of which strategy is best regarding these 2 options. In my analysis, I found that the number of differentially expressed genes obtained depends on which option is selected (Table). In general, the trend=TRUE option increases the number of differentially expressed genes, the robust=TRUE option decreases them, and setting both to TRUE gives you a number between both (Table).
When I examine the differentially expressed genes with decideTests I can choose between 2 methods: separate and global. I would like to use global to adjust for the multiple comparisons undertaken in the 2 separate comparisons I am doing. As expected I get a different number of differentially expressed genes depending on the method I use (Table). In addition, for 1 of the 2 comparisons I only get differentially expressed genes when I select method=global. This was a bit surprising, as I would expect that a further multiple adjustment for the number of comparisons would reduce the number of differentially expressed genes. Then, when I do a topTable I can obtain the differentially expressed genes with the separate method, which seems to be the default, but I cannot select the genes for the global method (that I can see with the vennDiagram).
I would really appreciate if you could give me some insight in the use of the trend and robust options and help in obtaining the differentially expressed genes with the global method.
Thank you so much, sincerely,
Irene.
Summary of results by strategy used
library(affy)
library(affyPLM)
library(affyQCReport)
library(scatterplot3d)
library(gplots)
library(limma)
library(GEOquery)
library(annotate)
library(mogene10sttranscriptcluster.db)
library(oligo)
library(tidyr)
library(ReactomePA)
library(a4Base)
library(casper)
library(sva)
library(SummarizedExperiment)
library(dplyr)
library(topGO)
library(ReactomePA)
library(DOSE)
library(clusterProfiler)
library(TimeSeriesExperiment)
library(calibrate)
library(ggplot2)
workingDir<-"C:/Users/iroman/Documents/Master_Omics/Project"
setwd(workingDir)
GEO84<-getGEOSuppFiles("GSE84113")
GEO66<-getGEOSuppFiles("GSE66926")
untar("GSE84113/GSE84113_RAW.tar", exdir = getwd())
untar("GSE66926/GSE66926_RAW.tar", exdir = getwd())
# reading cel files I am interested in
GEOFS84 <- read.celfiles(filenames = c("GSM2227385_25374.CEL.gz","GSM2227386_25375.CEL.gz",
"GSM2227387_25376.CEL.gz","GSM2227388_25377.CEL.gz","GSM2227389_25378.CEL.gz",
"GSM2227396_25385.CEL.gz","GSM2227397_25386.CEL.gz","GSM2227398_25387.CEL.gz",
"GSM2227399_25388.CEL.gz","GSM2227400_25389.CEL.gz"))
GEOFS66 <- read.celfiles(filenames = c("GSM1634433_23164.CEL.gz", "GSM1634434_23165.CEL.gz",
"GSM1634436_23167.CEL.gz", "GSM1634437_23168.CEL.gz"))
# renaming samples
sampleNames(GEOFS84)<-gsub("GSM2227","",sampleNames(GEOFS84))
sampleNames(GEOFS84)<-gsub(".CEL.gz","",sampleNames(GEOFS84))
sampleNames(GEOFS66)<-gsub("GSM16344","",sampleNames(GEOFS66))
sampleNames(GEOFS66)<-gsub(".CEL.gz","",sampleNames(GEOFS66))
# normalization
GEOFS84.rma<-rma(GEOFS84,target="core")
GEOFS66.rma<-rma(GEOFS66,target="core")
### Sample combination and quantile normalization
GEOS_microglia<-combineTwoExpressionSet(GEOFS84.rma,GEOFS66.rma)
GEOS_microglia_quant = quantileNorm(GEOS_microglia)
### Removing experiment effect
pheno = pData(GEOS_microglia_quant)
pheno$batch<-c("1","1","1","1","1","1","1","1","1","1","2","2","2","2")
edata = exprs(GEOS_microglia_quant)
batch = pheno$batch
colnames(edata)<-paste(rownames(pheno),as.character(batch),sep="_")
modcombat = model.matrix(~1, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat,par.prior=TRUE,prior.plots=FALSE)
### Non-specific filtering
## Filter based on intensity
# Calculation of row-wise medians
microg.medians <- rowMedians(combat_edata)
hist_medians <- hist(microg.medians)
quantile(microg.medians,c(0.3))
# Filter transcripts with lower intensity than 4 in at least 3 arrays (smallest experimental group)
med_threshold <- 4.14
samples_cutoff <- 3
idx_med_threshold <- apply(combat_edata,1,
function(x){
sum(x > med_threshold) >= samples_cutoff})
table(idx_med_threshold)
combat.filt <- subset(combat_edata,idx_med_threshold)
## Annotation of transcript clusters
combat.filt.eset <- ExpressionSet(combat.filt)
# getting annotation from the transcript clusters
annot.microg <- AnnotationDbi::select(mogene10sttranscriptcluster.db,
keys=(featureNames(combat.filt.eset)),
columns = c("SYMBOL","GENENAME","ENSEMBL"),
keytype = "PROBEID")
annot.microg$SYMBOL <- as.factor(annot.microg$SYMBOL)
# filter out the probes that do not map to a gene
annot.microg.filt <- subset(annot.microg, !is.na(SYMBOL))
## Removing probes mapping to multiple gene symbols
annot.grouped <- group_by(annot.microg.filt, PROBEID)
annot.summ <- dplyr::summarize(annot.grouped, no_of_matches = n_distinct(SYMBOL))
# filter for probes with multiple matches
annot.microg.multfilt <- filter(annot.summ, no_of_matches >1)
# filter out probes with multiple matches from assay data
ids_to_exclude <- (featureNames(combat.filt.eset) %in% annot.microg.multfilt$PROBEID)
data.final <- subset(combat.filt.eset, !ids_to_exclude)
# filter out probes with multiple matches from feature data
fData(data.final)$PROBEID <- rownames(fData(data.final))
fData(data.final) <- left_join(fData(data.final),annot.microg.filt)
### DEGs
pData(data.final)$cond<-factor(c("WTcont","WTcont","WTcupri","WTcupri","WTcupri",
"Tremcont","Tremcont","Tremcupri","Tremcupri","Tremcupri","WTcont","WTcupri","Tremcont","Tremcupri"))
cond.mic<-as.factor(pData(data.final)$cond)
design.mic<-model.matrix(~0+cond.mic)
rownames(design.mic)<-sampleNames(data.final)
colnames(design.mic)<-c("Tremcontrol","Tremcupri","WTcontrol","WTcupri")
fit.mic<-lmFit(exprs(data.final),design.mic)
contrast.matrix.mic<-makeContrasts(WTcontrol-WTcupri,Tremcontrol-Tremcupri,levels=design.mic)
fit2.mic<-contrasts.fit(fit.mic,contrast.matrix.mic)
fite.mic<-eBayes(fit2.mic, trend=TRUE)
resultsG <- decideTests(fite.mic,method="global",adjust.method="BH")
resultsS <- decideTests(fite.mic,method="separate",adjust.method="BH")
vennDiagram(resultsG)
vennDiagram(resultsS)
top.table_BH_WT<-topTable(fite.mic,coef=1,n=Inf,adjust.method="BH")
results.p0.05_BH_WT<-top.table_BH_WT[top.table_BH_WT$adj.P.Val<0.05,]
top.table_BH_Trem<-topTable(fite.mic,coef=2,n=Inf,adjust.method="BH")
results.p0.05_BH_Trem<-top.table_BH_Trem[top.table_BH_Trem$adj.P.Val<0.05,]
# sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] pd.mogene.1.0.st.v1_3.14.1 DBI_1.0.0
[3] RSQLite_2.1.1 ggplot2_3.1.1
[5] calibrate_1.7.5 TimeSeriesExperiment_1.0.4
[7] clusterProfiler_3.10.1 DOSE_3.8.2
[9] topGO_2.34.0 SparseM_1.77
[11] graph_1.60.0 dplyr_0.8.0.1
[13] sva_3.30.1 mgcv_1.8-27
[15] nlme_3.1-137 casper_2.16.1
[17] a4Base_1.30.0 a4Core_1.30.0
[19] a4Preproc_1.30.0 glmnet_2.0-16
[21] foreach_1.4.4 Matrix_1.2-15
[23] multtest_2.38.0 genefilter_1.64.0
[25] mpm_1.0-22 KernSmooth_2.23-15
[27] MASS_7.3-51.1 annaffy_1.54.0
[29] KEGG.db_3.2.3 GO.db_3.7.0
[31] ReactomePA_1.26.0 tidyr_0.8.3
[33] oligo_1.46.0 Biostrings_2.50.2
[35] XVector_0.22.0 oligoClasses_1.44.0
[37] mogene10sttranscriptcluster.db_8.7.0 org.Mm.eg.db_3.7.0
[39] annotate_1.60.0 XML_3.98-1.19
[41] AnnotationDbi_1.44.0 GEOquery_2.50.5
[43] limma_3.38.3 gplots_3.0.1.1
[45] scatterplot3d_0.3-41 affyQCReport_1.60.0
[47] lattice_0.20-38 affyPLM_1.58.0
[49] preprocessCore_1.44.0 gcrma_2.54.0
[51] affy_1.60.0 SummarizedExperiment_1.12.0
[53] DelayedArray_0.8.0 BiocParallel_1.16.6
[55] matrixStats_0.54.0 Biobase_2.42.0
[57] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[59] IRanges_2.16.0 S4Vectors_0.20.1
[61] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] proto_1.0.0 tidyselect_0.2.5 htmlwidgets_1.3 munsell_0.5.0
[5] codetools_0.2-16 chron_2.3-53 withr_2.1.2 colorspace_1.4-1
[9] GOSemSim_2.8.0 knitr_1.22 rstudioapi_0.10 simpleaffy_2.58.0
[13] urltools_1.7.3 GenomeInfoDbData_1.2.0 polyclip_1.10-0 bit64_0.9-7
[17] farver_2.0.1 coda_0.19-3 xfun_0.6 affxparser_1.54.0
[21] R6_2.4.0 graphlayouts_0.5.0 VGAM_1.1-2 locfit_1.5-9.1
[25] bitops_1.0-6 fgsea_1.8.0 gridGraphics_0.4-1 assertthat_0.2.1
[29] scales_1.0.0 nnet_7.3-12 ggraph_2.0.0 enrichplot_1.2.0
[33] gtable_0.3.0 tidygraph_1.1.2 rlang_0.4.2 splines_3.5.3
[37] rtracklayer_1.42.2 lazyeval_0.2.2 acepack_1.4.1 europepmc_0.3
[41] checkmate_1.9.1 BiocManager_1.30.4 yaml_2.2.0 reshape2_1.4.3
[45] GenomicFeatures_1.34.3 backports_1.1.3 qvalue_2.14.1 Hmisc_4.2-0
[49] tools_3.5.3 ggplotify_0.0.4 affyio_1.52.0 ff_2.2-14
[53] RColorBrewer_1.1-2 proxy_0.4-23 dynamicTreeCut_1.63-1 ggridges_0.5.1
[57] gsubfn_0.7 Rcpp_1.0.1 plyr_1.8.4 base64enc_0.1-3
[61] progress_1.2.0 zlibbioc_1.28.0 purrr_0.3.2 RCurl_1.95-4.12
[65] prettyunits_1.0.2 rpart_4.1-13 sqldf_0.4-11 viridis_0.5.1
[69] cowplot_0.9.4 ggrepel_0.8.1 cluster_2.0.7-1 magrittr_1.5
[73] data.table_1.12.8 DO.db_2.9 triebeard_0.3.0 reactome.db_1.66.0
[77] hms_0.4.2 xtable_1.8-3 gaga_2.28.1 gridExtra_2.3
[81] compiler_3.5.3 biomaRt_2.38.0 tibble_2.1.1 crayon_1.3.4
[85] htmltools_0.4.0 Formula_1.2-3 geneplotter_1.60.0 tweenr_1.0.1
[89] rappdirs_0.3.1 readr_1.3.1 permute_0.9-5 gdata_2.18.0
[93] igraph_1.2.4.1 pkgconfig_2.0.2 rvcheck_0.1.7 GenomicAlignments_1.18.1
[97] foreign_0.8-71 xml2_1.2.0 EBarrays_2.46.0 stringr_1.4.0
[101] digest_0.6.18 vegan_2.5-4 fastmatch_1.1-0 htmlTable_1.13.1
[105] edgeR_3.24.3 Rsamtools_1.34.1 gtools_3.8.1 graphite_1.28.2
[109] jsonlite_1.6 viridisLite_0.3.0 pillar_1.3.1 httr_1.4.0
[113] survival_2.43-3 glue_1.3.1 UpSetR_1.4.0 iterators_1.0.10
[117] bit_1.1-14 ggforce_0.3.1 stringi_1.4.3 blob_1.1.1
[121] DESeq2_1.22.2 latticeExtra_0.6-28 caTools_1.17.1.2 memoise_1.1.0
The Table is in https://ibb.co/M2MYDXj Irene.
Do you have a question? I've read through your post, but there are no questions, only comments.
Dear Gordon,
I have 2 questions:
in which situations are the trend and robust options of the eBayes function recommended?
how can I obtain the differentially expressed genes identified with the global method of the decideTests?
Thank you so much, Irene.