Edit: Technically, what quantile() prints by default are the (0, 25, 50, 75, 100) percentiles of the data, so calling them "the quantiles" below is not quite correct.
Quantiles can be computed using the quantile
function. Using the data from the vignette:
> library(bsseqData)
> data(BS.cancer.ex.tstat)
> ps <- getStats(BS.cancer.ex.tstat)[,"tstat.corrected"]
> quantile(ps)
0% 25% 50% 75% 100%
-10.9073281 -0.5864714 0.1185087 0.8609352 14.3986699
But I doubt that you want to use any of those percentiles as a cutoff. More likely you want to look at the extreme percentiles in both tails, e.g. (0.5, 1, 5) and (95, 99, 99.5), instead:
> quantile(ps, c(0,0.005, 0.01, 0.05, 0.95,0.99, 0.995,1))
0% 0.5% 1% 5% 95% 99% 99.5%
-10.907328 -4.115913 -3.447037 -1.998064 2.622552 5.918016 8.127983
100%
14.398670
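If you do go the percentile route, turning a pair of percentiles into symmetric cutoffs is just a couple of lines; a sketch, using the 2.5/97.5 pair mentioned further down in the thread purely as an illustration:

# Symmetric quantile-based cutoffs; 2.5/97.5 percentiles chosen for illustration
cut.lo <- quantile(ps, 0.025)
cut.hi <- quantile(ps, 0.975)
mean(ps < cut.lo | ps > cut.hi)  # by construction, roughly 5% of loci pass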
And you could exclude all but about 2% of the loci under consideration by using the cutoffs from the vignette:
> sum(ps < -4.6 | ps > 4.6)/length(ps)
[1] 0.01981683
You could use the p-values for those t-statistics, but it wouldn't be materially different from using the t-statistics directly.
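If you did want p-values, a sketch of what that would look like (df = 8 is just an assumption here, mirroring the volcano plot below; it is not derived from the data):

# One-sided p-values on |t|, as in the volcano plot below; multiply by 2 for
# two-sided p-values. df = 8 is an assumption, not derived from the data.
pvals <- pt(abs(ps), df = 8, lower.tail = FALSE)
mean(pvals < 1e-3)  # fraction of loci passing an (arbitrary) p-value cutoff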
The idea here is pretty simple: you just want to get a relatively small number of CpG sites that look to be differentially methylated, and then see if any of them cluster together in one or more regions. How you select the CpG sites will necessarily be pretty ad hoc; you just want to make sure you don't have too many (you don't want too many false positives creeping in) or too few (you do want some results, no?). In the end you need some criterion to say whether a CpG is in or out, and using the t-statistic or the p-value will give you pretty much the same thing.
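In bsseq, the "pick sites, then group them into regions" step is what dmrFinder() does; this is essentially what the vignette runs with the 4.6 cutoff:

# Group neighbouring CpGs that exceed the cutoff into candidate DMRs
dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))
# Keep regions with at least 3 CpGs and a sizeable methylation difference,
# as in the vignette
dmrs <- subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)
nrow(dmrs)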
To see what I mean, you can make a 'volcano plot' of the data:
> plot(ps, -log10(pt(abs(ps), 8, lower.tail = FALSE)))
> abline(v = c(-4.6, 4.6))
The cutoff in the vignette selects all the CpG sites outside of those two vertical lines. Using a p-value cutoff would be the same as putting a horizontal line in there somewhere, at say
> abline(h = 3.1)
and taking all the CpG sites above that line. Because the p-value is just a monotone transform of the t-statistic, that picks out essentially the same CpG sites.
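You can check this directly by tabulating the two selections against each other (the 3.1 is the same horizontal line as above, so the agreement is only approximate):

sel.t <- abs(ps) > 4.6
sel.p <- -log10(pt(abs(ps), 8, lower.tail = FALSE)) > 3.1
table(sel.t, sel.p)  # off-diagonal cells count the loci the two cutoffs disagree on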
The 4.6 cutoff was picked based on a quantile of the cancer dataset (using all autosomes). I have since used the 4.6 cutoff routinely for other systems. The problem I have with a quantile cutoff is that you're implicitly assuming that a certain percentage of the genome is differentially methylated, which is often an open question. In my experience 4.6 gives you good-looking DMRs, and decreasing this cutoff substantially just gives a lot of bad-looking regions. In systems with few differences I don't find many DMRs with a 4.6 cutoff, but that is because there is not much difference between the two groups, not because the cutoff is too stringent.
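For what it's worth, dmrFinder() lets you express either choice; a sketch of the contrast (the 1%/99% quantiles are purely illustrative):

# Quantile cutoff: implicitly declares ~2% of loci differentially methylated
dmrs.quant <- dmrFinder(BS.cancer.ex.tstat, qcutoff = c(0.01, 0.99))
# Fixed cutoff: extremeness is judged on an absolute scale, independent of the data
dmrs.fixed <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))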
The best way to do this is to pair the choice of cutoff with a permutation analysis where you permute the sample labels. For most WGBS sample sizes this sounds a bit insane, but we have done it successfully. For example, in our paper on EBV in Genome Research we use a permutation approach despite the fact that we only have 3 samples in each group. But in that case the signal is so strong that, for almost all the DMRs, we do not find any comparable DMR in any permutation, which is even better.
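A very rough sketch of what a label permutation could look like on the example data (this is not the exact procedure from the paper; the smoothed object BS.cancer.ex.fit comes from bsseqData, and estimate.var = "same" is an assumption, since the vignette's "group2" choice does not carry over cleanly to permuted labels):

library(bsseq)
library(bsseqData)
data(BS.cancer.ex.fit)

all.samples <- sampleNames(BS.cancer.ex.fit)
perm.group1 <- combn(all.samples, 3, simplify = FALSE)  # every possible 3-sample 'group1'

n.dmrs <- sapply(perm.group1, function(g1) {
    g2 <- setdiff(all.samples, g1)
    tt <- BSmooth.tstat(BS.cancer.ex.fit, group1 = g1, group2 = g2,
                        estimate.var = "same", local.correct = TRUE)
    d <- dmrFinder(tt, cutoff = c(-4.6, 4.6))
    if (is.null(d)) 0L else nrow(subset(d, n >= 3 & abs(meanDiff) >= 0.1))
})
# Compare the number of DMRs for the observed split against the permuted splits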
Thanks for clarifying, Kasper - very helpful. I will check what results I get with both the 2.5/97.5 percentiles and your empirical 4.6 cutoff.
Probably a naive question (as you might have already realised, I don't have a stats background), but wouldn't you solve this problem at least in part by selecting DMRs based on their p-values adjusted for multiple testing? That way you would not make the assumption that a certain percentage of the genome is necessarily differentially methylated, and you could use a p-value threshold that, albeit still arbitrary, is easier to explain/justify in most contexts.
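For concreteness, something along these lines is what I have in mind (df = 8 just mirrors the volcano plot above and is a placeholder; see my question about the degrees of freedom below):

# Two-sided per-CpG p-values, adjusted with Benjamini-Hochberg
pvals <- 2 * pt(abs(ps), df = 8, lower.tail = FALSE)
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)  # number of CpGs passing an adjusted p-value threshold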
Back to Peter's answer below, if I were to compute p-values based on the t-statistics with pt(), how would I calculate the degrees of freedom?

I ended up using the 4.6 cutoff as it's considerably more stringent than the 2.5/97.5 percentiles.