I have been running an analysis of RRBS data using edgeR, following the guide in the edgeR manual. The metadata looks like this:
File | Sample | Condition |
S1_Pre.cov.gz | S1 | Pre |
S1_Post.cov.gz | S1 | Post |
... | ... | ... |
S9_Pre.cov.gz | S9 | Pre |
S9_Post.cov.gz | S9 | Post |
Everything has been done as in the manual, with the exception of the design matrix and contrast which is:
designSL <- model.matrix(~0+Condition + Sample, data=targets) design <- modelMatrixMeth(designSL) contr <- makeContrasts(Condition = ConditionPost - ConditionPre, levels=design)
When I plot a histogram of the raw P-values it looks like so:
Two things strike me as odd, the high number of sites with a P-value close to 1, and the overrepresentation of sites with a P-value around 0.2.
The sites with a P-value close to 1 are (almost) 100% or 0% methylated in all samples (they have 0 counts of either methylated or unmethylated C's in almost all samples). I don't know what characterizes the sites with a P-value around 0.2.
I have tried removing sites with a constant methylation level, but that does not solve the issue (most site have at least one observation in at least one sample)
My questions are:
Is it reasonable/safe to remove all sites that have close to 0 or 100% methylation in all samples? What would be a reasonable heuristic?
What can cause a bump around P = 0.2? Can this be remedied somehow?
I just realised that filtering out all sites where the average number of methylated or unmethylated reads across all samples is less than one seems to solve both problems. Now the distribution of P-values looks uniform with a slight increase in sites with a low P-value. If anyone has an explanation, suggestions or guidelines you are welcome to post them, but I think this solves the problem.
Have your followed the edgeR User's Guide regarding filtering, as in Section 4.7.3? Or else the "Filter to remove low counts" section of https://f1000research.com/articles/6-2055/ ?
I used the filtering as in Section 4.7.3 when I wrote this question, I have now also tested the filtering used in f1000 paper and also the filterByExpr function (with the design matrix) of edgeR. The edgeR guide leaves me with 150.000 sites, f1000 with 260.000 and filterByExpr with 592.000 sites. In comparison filtering out sites with almost 100% or 0% methylation removes 10.2 million sites (the histogram was on a subset of sites).
Is there any reason not to use filterByExpr?