Question

Confusing DSS outputs with smoothing (DMLtest)

0

Entering edit mode

Weisheng • 0

@177e01d3

Last seen 22 months ago

United States

I've run DMLtest to test differential methylation between two samples. Since there are no replicates I used smoothing=TRUE. Some of the results don't make sense to me. Below is an example.

Below is a region that was called as a DMR from callDMR. There are 16 CpGs in this region (the 16 bars in the first track). Only the CpG at 30111445 has coverage in both samples (as shown in the top two tracks). And this CpG is fully unmethylated in sample 1 (all 7 reads are unmethylated) and fully methylated in sample 2 (all 5 reads are methylated). enter image description here

Below is the output from DMLtest. All 16 CpGs have identical mu1 and mu2 values. The CpG at 30111445 (the only CpG that is covered in both samples) has a p-value = 2.252324e-03, and all the other CpGs got very low p-values, even though they don't have read coverage in the 2nd sample at all. So by using the default p-value cutoff, the true CpG that is differentially methylated is discarded, while all the remaining 15 CpGs that do not have coverage in one sample are kept as DMLs. enter image description here

I'm really confused by these results:

Why does DSS do test on CpGs that have no coverage at all in one sample and how does it make up the values from nothing?
The only CpG that is covered in both samples in this region has 0% methylation in sample 1 and 100% methylated in sample2. Why does DSS change this to 0.03564888 versus 0.7295291?

I guess both actions are a result of smoothing, but I think this twists the real data too much. Is there a way to avoid this behavior? Should I first filter out the CpGs that have coverage only from one sample and then run DSS? But I've read that coverage filtering is not recommended because it's taken care of in DSS.

Thanks in advance for any explanation & advice.

Code:

library(DSS)
bs_data <- list()
bs_data[['hap1']] <- read.table('NA12878_EM_LAB01_rep1_1_CpG.bedGraph', skip=1)
bs_data[['hap2']] <- read.table('NA12878_EM_LAB01_rep1_2_CpG.bedGraph', skip=1)

for (i in names(bs_data)) {
    dd <- bs_data[[i]]
    dd$chr <- dd[,1]
    dd$pos <- dd[,2] + 1
    dd$N <- dd[,5] + dd[,6]
    dd$X <- dd[,5]
    bs_data[[i]] <- subset(dd, select=c(chr, pos, N, X))}

BSobj = makeBSseqData(list(bs_data[[1]], bs_data[[2]]), c("hap1", "hap2") )
dmlTest.sm = DMLtest(BSobj, group1=c("hap1"), group2=c("hap2"), smoothing=TRUE)
dmls = callDML(dmlTest.sm, p.threshold=0.001)
dmrs = callDMR(dmlTest.sm, p.threshold=0.001, minlen=100, minCG=15, dis.merge=100, pct.sig=0.5, delta=0)

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblasp-r0.3.21.so

locale:
[1] C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DSS_2.46.0                  bsseq_1.34.0               
 [3] SummarizedExperiment_1.28.0 MatrixGenerics_1.10.0      
 [5] matrixStats_0.62.0          GenomicRanges_1.50.0       
 [7] GenomeInfoDb_1.34.1         IRanges_2.32.0             
 [9] S4Vectors_0.36.0            BiocParallel_1.32.0        
[11] Biobase_2.58.0              BiocGenerics_0.44.0

DSS • 904 views

ADD COMMENT • link 2.0 years ago Weisheng • 0