Question

Varying results from DSS when specifying different BSobj

0

Entering edit mode

varunorama • 0

@varunorama-20073

Last seen 5.4 years ago

Hello!

I am using the DSS package to identify differentially methylated loci (DMLs) and regions (DMRs) between treatment and control samples of a time-series data set. However, I am getting varying results based on how I design my BSobj dataset. If anyone has any insight as to why I am getting varying results, I would greatly appreciate it!

The dataset I am using is a time-series dataset composed of 9-time points (6, 12, 24, 48, 72, 96, 120, 144, 168); with one treatment and control per timepoint. There is only one replicate per sample. The contrasts that I am running are between the treatment and the control of each time point independently, and I am trying to identify differentially methylated loci (DMLs) and regions (DMRs)

The first way I designed my BSobj was:

library(DSS)

BSobj.allhr = makeBSseqData( list(C_6hr_cpg[,c(1:4)], T_6hr_cpg[,c(1:4)],
                                  C_12hr_cpg[,c(1:4)], T_12hr_cpg[,c(1:4)],
                                  C_24hr_cpg[,c(1:4)], T_24hr_cpg[,c(1:4)],
                                  C_48hr_cpg[,c(1:4)], T_48hr_cpg[,c(1:4)],
                                  C_72hr_cpg[,c(1:4)], T_72hr_cpg[,c(1:4)],
                                  C_96hr_cpg[,c(1:4)], T_96hr_cpg[,c(1:4)],
                                  C_120hr_cpg[,c(1:4)], T_120hr_cpg[,c(1:4)],
                                  C_144hr_cpg[,c(1:4)], T_144hr_cpg[,c(1:4)],
                                  C_168hr_cpg[,c(1:4)], T_168hr_cpg[,c(1:4)]),
                           c("C6", "H6","C12", "H12","C24","H24","C48", "H48",
                             "C72", "H72","C96", "H96","C120", "H120","C144", "H144","C168", "H168") )

where I incorporated all of my timepoints into a single object called BSobj.allhr. From here I am comparing each timepoint treatment to it's control. As an example, for my 6hr samples, I would run:

dmltest.6hr.sm500 = DMLtest(BSobj.allhr,
                            group1 = c("H6"),group2 = c("C6"),
                            smoothing = TRUE, smoothing.span = 500)
dmltest.6hr.sm500.dml = callDML(dmltest.6hr.sm500, delta = 0.1, p.threshold=0.01)
dmltest.6hr.sm500.dmrs = callDMR(dmltest.6hr.sm500, p.threshold=0.05, delta = 0.1, dis.merge = 250, minCG = 3, pct.sig = 0.5, minlen = 50)

The resulting DML and DMR file produced approx. ~800 DML's and 185 DMRs.

However, if I create a reduced BSobj for only the timepoint comparison I am running (e.g. 6hr) and run the DML and DMR analysis as such:

BSobj.6hr = makeBSseqData( list(C_6hr_cpg[,c(1:4)], T_6hr_cpg[,c(1:4)]),
                             c("C6", "H6") )
dmltest.6hr.sm500.dml.reduced = callDML(dmltest.6hr.sm500.reduced, delta = 0.1, p.threshold=0.01)
dmltest.6hr.sm500.dmrs.reduced = callDMR(dmltest.6hr.sm500.reduced, p.threshold=0.05, delta = 0.1, dis.merge = 250, minCG = 3, pct.sig = 0.5, minlen = 50)

I identify ~1360 DML's and ~380 DMR's, a significant increase from the previous analysis.

Several questions arise: By incorporating all of the timepoints, is that affecting the smoothing approach somehow? Of the two approaches, would you recommend the second? Any advice would be greatly appreciated.

Thank you.

DSS DSS-single bsseq • 1.0k views

ADD COMMENT • link updated 5.4 years ago by James W. MacDonald 68k • written 5.4 years ago by varunorama • 0

score 0 · Answer 1 · 2019-10-23

When you fit a linear model and then compute contrasts, the denominator of your contrast is based on the within-group variability, averaged across all your groups. In general, given the fact that sample variance estimates are not very efficient, it's better to include more data rather than less. In other words, in your first example, the denominator for your contrast includes data from all the other groups, even though they don't directly play a part in the computation of the numerator of that contrast!

That said, if you have some groups that are far more variable than others, then they will affect the overall power to detect differences, because they will (possibly incorrectly?) increase the overall variance estimate. In a situation like yours, there may not be a biological reason for increased variability of measurements at any time point, in which case it is probably best to include everything. But you are the analyst, and it's your call to make (and defend).