I have a question regarding underlying theory for significant probes and genes identified by a.) probe level differential methylation analysis using limma and b.) region level differential methylation analysis with regions assigned by DMRcate.
DMP analysis resulted in 805 probes and 95 genes that meet FDR significance.
DMR analysis resulted in 803 probes and 163 genes that meet FDR significance.
My questions are as follows
- Why is there only around 12 percent overlap between significant CpGs identified at probe and region level
- (Likely will be answered by Q1) Why are there only 31 significant genes overlapping at probe and region level
Approach
For reference, my approach closely follows what is outlined by Maksimovic in (https://doi.org/10.12688/f1000research.8839.3).
For following analysis, I test 814,549 EPICv2 array probes that passed my ENmix QC pipeline. I have 124 samples (93 cases, phen.1 and 31 controls, phen.0).
Inputs for differential methylation analysis include
- EPICv2 Annotation
annoEPICv2 <- getAnnotation(IlluminaHumanMethylationEPICv2anno.20a1.hg38)
- M Value matrix
mval <- logit2(beta)
mval <- as.matrix(mval)
mval <- rmSNPandCH(mval, dist = 2, mafcut = 0.05, and = TRUE, rmcrosshyb = TRUE, rmXY = TRUE)
- Phenotypes restricted to samples used in analysis with samples ordered to what is in M Matrix
keep <- intersect(colnames(mval), phen$id)
mval <- mval[,colnames(mval) %in% keep]
phen <- phen[phen$id %in% keep,]
phen <- data.frame(phen)
rownames(phen) <- phen$id
phen <- phen[match(colnames(mval), rownames(phen)), ]
- Design and Contrast matrix for both Probe and Region level analyses
sex_cat <- as.factor(phen$Sex)
phen_cat <- as.factor(phen$Pheno)
design <-model.matrix(~0+phen_cat+sex_cat+Age+CD8T+CD4T+NK+Bcell+Mono, data=phen)
colnames(design) <- c('phen.0', 'phen.1', 'sex_cat', 'Age', 'CD8T', 'CD4T', 'NK', 'Bcell', 'Mono')
cont.matrix <- makeContrasts(phen.effect=phen.1-phen.0, levels=design)
Probe level Differential Methylation Analysis (Result is 805 probes and 95 genes FDR significant)
fit <- lmFit(mval, design)
fit2 <- contrasts.fit(fit,cont.matrix)
fit2 <- eBayes(fit2)
summary(decideTests(fit2))
annoEPICv2Sub <- annoEPICv2[match(rownames(mval), annoEPICv2$Name),
c(1:4,12:19,24:ncol(annoEPICv2))]
annoEPICv2Sub <- as.data.frame(annoEPICv2Sub)
DMPs <- limma::topTable(fit2, num=Inf, coef=1, adjust='fdr', genelist=annoEPICv2Sub, confint=TRUE)
DMPs$sig_flag <- ifelse(DMPs$adj.P.Val <=0.05,'pass', 'fail')
table(DMPs$sig_flag)
Region level Differential Methylation Analysis (Result is 803 probes, 156 regions, and 163 genes with FDR significance).
Annotation of the methylation dataset (mval) using EPICv2 data type and differential analysis design
myAnnotation <- cpg.annotate(
datatype = 'array', # Type of data (array format)
object = mval, # Methylation values matrix
what = 'M', # Methylation data (mean methylation)
arraytype = 'EPICv2', # Specifies that the dataset is from the EPICv2 array
epicv2Filter = 'mean', # Filtering method for EPICv2 probes
epicv2Remap = TRUE, # Reassign cross hybridizing probes to correct locations
analysis.type = 'differential', # Type of analysis: differential methylation
design = design, # Design matrix from differential methylation analysis (DMPs)
contrasts = TRUE, # Use contrasts for analysis
cont.matrix = cont.matrix, # Contrast matrix
coef = 'phen.effect', # Coefficient for the phenotype effect
fdr = 0.05, # Set false discovery rate (FDR) threshold to 0.05
annotation = 'ilmn20a1.hg38' # Genome annotation (hg38)
)
EPICv2 specified. Loading manifest... snapshotDate() 2024 10 28 loading from cache Your contrast returned 803 individually significant probes. We recommend the default setting of pcutoff in dmrcate().
Identify DMRs using the parameters defined above
DMRs <- dmrcate(
myAnnotation, # Annotated data from the previous step
lambda = 1000, # Smoothing window size of 1000 base pairs
C = 2, # Scaling factor for kernel smoothing
pcutoff = 'fdr', # Significance cutoff: false discovery rate (FDR)
consec = FALSE, # Dont restrict to consecutive probes (include any probes within the window)
conseclambda = 10, # Lambda for consecutive probes, irrelevant when consec=FALSE
betacutoff = NULL, # No specific methylation beta cutoff (default behavior)
min.cpgs = 1 # Minimum number of CpGs per region to define a DMR (can also be set to 2)
)
### Extract ranges (genomic locations) of significant DMRs from the results of dmrcate
results.ranges <- DMRcate::extractRanges(DMRs, genome = 'hg38')
Get CpG probes for Differentially Methylated Regions. Resource extract CpGs from DMRcate results
### Load the necessary package and data for IlluminaHumanMethylationEPICv2anno.20a1.hg38 annotations
locs <- IlluminaHumanMethylationEPICv2anno.20a1.hg38::Locations
### Create GRanges object from the chr and pos columns in locs.
locs.ranges <- GRanges(locs$chr, IRanges(locs$pos, locs$pos))
names(locs.ranges) <- rownames(locs)
### Use sapply to apply a function over the range of results.ranges to get overlapping CpG sites for each range
allDMRcpgs <- sapply(1:length(results.ranges), function (x) {
### Get the names of the CpGs that overlap with the current range in results.ranges
overlap_names <- names(subsetByOverlaps(locs.ranges, results.ranges[x]))
### Return the CpG names for the current range
return(overlap_names)
})
### Flatten the list of CpGs into a single string per DMR
allDMRcpgs <- sapply(allDMRcpgs, function(cpgs) paste(cpgs, collapse = ','))
### DMR results as a table
DMRdata <- data.frame(
Overlapping_genes = results.ranges@elementMetadata@listData[['overlapping.genes']],
Coordinate = DMRs@coord, # Coordinates of the DMRs
Num_CPGs = results.ranges@elementMetadata@listData[['no.cpgs']],
Min_Smoothed_FDR = results.ranges@elementMetadata@listData[['min_smoothed_fdr']],
Stouffer = results.ranges@elementMetadata@listData[['Stouffer']],
HMFDR = results.ranges@elementMetadata@listData[['HMFDR']],
Fisher = results.ranges@elementMetadata@listData[['Fisher']],
Max_Diff = results.ranges@elementMetadata@listData[['maxdiff']],
Mean_Diff = results.ranges@elementMetadata@listData[['meandiff']],
Overlapping_CPGs = allDMRcpgs)
Session Info
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Red Hat Enterprise Linux 9.1 (Plow)
Matrix products: default
BLAS: /cm/shared/R/4.4.0/lib64/R/lib/libRblas.so
LAPACK: /cm/shared/R/4.4.0/lib64/R/lib/libRlapack.so LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] parallel grid stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DMRcatedata_2.24.0 ENmix_1.42.0
[3] doParallel_1.0.17 methylGSA_1.24.0
[5] jsonlite_1.9.0 missMethyl_1.40.0
[7] IlluminaHumanMethylationEPICv2anno.20a1.hg38_1.0.0 IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
[9] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 qqman_0.1.9
[11] minfi_1.52.1 bumphunter_1.48.0
[13] locfit_1.5-9.11 iterators_1.0.14
[15] foreach_1.5.2 Biostrings_2.74.1
[17] XVector_0.46.0 SummarizedExperiment_1.36.0
[19] MatrixGenerics_1.18.1 matrixStats_1.5.0
[21] RColorBrewer_1.1-3 Gviz_1.50.0
[23] patchwork_1.3.0 ggrepel_0.9.6
[25] data.table_1.17.0 plyr_1.8.9
[27] DMRcate_3.2.1 ggthemes_5.1.0
[29] GEOquery_2.74.0 Biobase_2.66.0
[31] limma_3.62.2 lubridate_1.9.4
[33] forcats_1.0.0 stringr_1.5.1
[35] dplyr_1.1.4 purrr_1.0.4
[37] readr_2.1.5 tidyr_1.3.1
[39] tibble_3.2.1 ggplot2_3.5.1
[41] tidyverse_2.0.0 sesame_1.24.0
[43] sesameData_1.24.0 ExperimentHub_2.14.0
[45] AnnotationHub_3.14.0 BiocFileCache_2.14.0
[47] dbplyr_2.5.0 GenomicRanges_1.58.0
[49] GenomeInfoDb_1.42.3 IRanges_2.40.1
[51] S4Vectors_0.44.0 BiocGenerics_0.52.0