Dear Bioconductor,
I am currently performing an RNAseq analysis on rat samples using edgeR. Data was aligned using STAR against the rat genome (Rnor v6.0) and the genes were counted using featureCounts against the Ensembl Rnor gtf file.
The experimental design comprises two factors: DSS and stimulation.
> phenosheet
DSS Stimulation
1 Water Stimulation
2 Water Sham
3 DSS Stimulation
4 Water Stimulation
5 Water Sham
6 DSS Sham
7 DSS Stimulation
8 DSS Sham
9 Water Stimulation
10 Water Sham
11 DSS Stimulation
12 DSS Sham
Given the design, I adopted an approach similar to example "3.3 Experiments with all combinations of multiple factors" from the edgeR manual.
#Prepare DGEList object
> dgList <- DGEList(counts = gcounts_proc,
genes = rownames(gcounts_proc),
samples = phenosheet)
#Keep genes that have over 1 CPM for at least 3 samples
> dgList <- dgList[rowSums(cpm(dgList)>1) >= 2, , keep.lib.sizes=FALSE]
#Normalize using TMM
> dgList <- calcNormFactors(dgList)
#Design
> design_mat <- model.matrix(~ 0 + paste0(dgList$samples$DSS, "_", dgList$samples$Stimulation))
> colnames(design_mat) <- gsub(".+)(.+)$", "\\1", colnames(design_mat))
> design_mat
DSS_Sham DSS_Stimulation Water_Sham Water_Stimulation
1 0 0 0 1
2 0 0 1 0
3 0 1 0 0
4 0 0 0 1
5 0 0 1 0
6 1 0 0 0
7 0 1 0 0
8 1 0 0 0
9 0 0 0 1
10 0 0 1 0
11 0 1 0 0
12 1 0 0 0
#Dispersion estimation
> dgList <- estimateDisp(y = dgList, design = design_mat, robust=TRUE)
> plotBCV(dgList)
> fit <- glmQLFit(y = dgList, design = design_mat, robust = TRUE)
> plotQLDisp(fit)
#Comparisons
> contrast_mat <- makeContrasts(
DSS_StimvSham = DSS_Stimulation - DSS_Sham,
Water_StimvSham = Water_Stimulation - Water_Sham,
Sham_DSSvWater = DSS_Sham - Water_Sham,
Stim_DSSvWater = Water_Stimulation - Water_Sham,
DSSvWater = (DSS_Sham+DSS_Stimulation)/2-(Water_Sham+Water_Stimulation)/2,
StimvSham = (DSS_Stimulation+Water_Stimulation)/2-(DSS_Sham+Water_Sham)/2,
StimvShamvDSSvWater = (DSS_Stimulation-DSS_Sham)-(Water_Stimulation-Water_Sham),
levels = design_mat)
> results_DSS_StimvSham <- glmQLFTest(glmfit = fit, contrast = contrast_mat[, "DSS_StimvSham"])
> topTags_DSS_StimvSham <- topTags(object = results_DSS_StimvSham)
> head(topTags_DSS_StimvSham$table)
logFC logCPM F PValue FDR
1.1020749 4.977046 29.12748 0.0001212882 0.9999431
0.4504407 7.806974 26.97715 0.0001689446 0.9999431
0.9105491 3.790667 24.58781 0.0002559413 0.9999431
-5.1133035 2.364305 27.19418 0.0002968992 0.9999431
2.0377876 1.294093 21.54299 0.0004531409 0.9999431
-1.1755983 3.359916 20.39320 0.0005701034 0.9999431
Surprisingly all adjusted p-values are all 0.9999431. When I generate a jitterplot of log(CPM) of the most DE gene I get the impression that the top gene is differentially expressed when comparing Stimulation vs Sham in the DSS group.
I have played around with some other parameters (i.e. using the tagwise dispersion instead of the default trended dispersion, or simply using the glmLRT function), which yielded adjusted p-values that were lower, which made me think that I was not perhaps not using the right dispersion. However, investigating the BCV plot did not suggest to me that there was anything strange going on where I see a trend where genes with low CPMs tend to have a higher BCV.
Something I did notice was that my QL plot was less squeezed than what was observed in Section 4.2.7, but I am not sure if that is important.
So my question is: Why does glmQLFit using the trended dispersion not work here, and is the usage of tagwise dispersion warranted in this case?
>sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NDlib_0.0.0.9000 pheatmap_1.0.8 matrixStats_0.52.2 ggplot2_2.2.1 Cairo_1.5-9 edgeR_3.20.6 limma_3.34.5
[8] biomaRt_2.34.1
loaded via a namespace (and not attached):
[1] ProtGenerics_1.10.0 bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[5] progress_1.1.2 httr_1.3.1 GenomeInfoDb_1.14.0 tools_3.4.3
[9] backports_1.1.2 R6_2.2.2 rpart_4.1-11 Hmisc_4.1-1
[13] DBI_0.7 lazyeval_0.2.1 BiocGenerics_0.24.0 Gviz_1.22.2
[17] colorspace_1.3-2 nnet_7.3-12 gridExtra_2.3 prettyunits_1.0.2
[21] RMySQL_0.10.13 curl_3.1 bit_1.1-12 compiler_3.4.3
[25] Biobase_2.38.0 htmlTable_1.11.1 DelayedArray_0.4.1 labeling_0.3
[29] rtracklayer_1.38.2 scales_0.5.0 checkmate_1.8.5 stringr_1.2.0
[33] digest_0.6.13 Rsamtools_1.30.0 foreign_0.8-69 XVector_0.18.0
[37] base64enc_0.1-3 dichromat_2.0-0 htmltools_0.3.6 ensembldb_2.2.0
[41] BSgenome_1.46.0 htmlwidgets_0.9 rlang_0.1.6 rstudioapi_0.7
[45] RSQLite_2.0 BiocInstaller_1.28.0 shiny_1.0.5 BiocParallel_1.12.0
[49] acepack_1.4.1 VariantAnnotation_1.24.5 RCurl_1.95-4.10 magrittr_1.5
[53] GenomeInfoDbData_1.0.0 Formula_1.2-2 Matrix_1.2-11 Rcpp_0.12.14
[57] munsell_0.4.3 S4Vectors_0.16.0 stringi_1.1.6 yaml_2.1.16
[61] SummarizedExperiment_1.8.1 zlibbioc_1.24.0 plyr_1.8.4 AnnotationHub_2.10.1
[65] grid_3.4.3 blob_1.1.0 ggrepel_0.7.0 parallel_3.4.3
[69] lattice_0.20-35 Biostrings_2.46.0 splines_3.4.3 GenomicFeatures_1.30.0
[73] locfit_1.5-9.1 knitr_1.18 pillar_1.0.1 GenomicRanges_1.30.1
[77] stats4_3.4.3 XML_3.98-1.9 biovizBase_1.26.0 latticeExtra_0.6-28
[81] data.table_1.10.4-3 httpuv_1.3.5 gtable_0.2.0 assertthat_0.2.0
[85] mime_0.5 xtable_1.8-2 AnnotationFilter_1.2.0 survival_2.41-3
[89] tibble_1.4.1 GenomicAlignments_1.14.1 AnnotationDbi_1.40.0 memoise_1.1.0
[93] IRanges_2.12.0 cluster_2.0.6 statmod_1.4.30 interactiveDisplayBase_1.16.0
I see, I guess my understanding of the BH method is currently lacking, I'll read more into that. Thank you for the answer
In addition to the reason given by Aaron, there's another reason why tagwise dispersion should not be input to glmQLFit. glmQLFit does empirical Bayes moderation of the QL dispersions. It is critical that the genewise QL dispersions represent genuine gene-to-gene heterogeneity and are not already moderated. Applying empirical Bayes moderation to dispersion values that have already been empirical Bayes moderated by estimateDisp in a gene-specific manner (which is what the tagwise dispersions are) would be mathematical nonsense. You can't do empirical Bayes twice.
I see, thank you for the answer. So even if the NB dispersion is known, you should not use tagwise dispersion for glmQLFit. In which situation would you use tagwise dispersion? Or is it simply a diagnostic measure.
Tagwise dispersions are used in
glmLRT
, where EB shrinkage stabilises the dispersion estimate for each gene. These would otherwise be very unstable, due to the fact that there's not a lot of information per gene. If I recall correctly, stability improves the power of the test and reduces loss of type I error control, both of which are good things.From the perspective of the QL framework, though, the tagwise dispersions are simply diagnostic in nature. You can use them to identify particularly variable genes that might be problematic (e.g., X-linked genes in studies where you don't have sex labels, immunoglobulin variable chains when dealing with B cell clones, and so on).