Odd P-value distribution in edgeR on RRBS data
@ingerslev-19177

I have been running an analysis of RRBS data using edgeR, following the guide in the edgeR manual. The metadata looks like this:

File            Sample  Condition
S1_Pre.cov.gz   S1      Pre
S1_Post.cov.gz  S1      Post
...             ...     ...
S9_Pre.cov.gz   S9      Pre
S9_Post.cov.gz  S9      Post

Everything has been done as in the manual, with the exception of the design matrix and contrast, which are:

designSL <- model.matrix(~0+Condition + Sample, data=targets)  # sample-level design
design <- modelMatrixMeth(designSL)  # convert to a design matrix for the methylated/unmethylated counts

contr <- makeContrasts(Condition = ConditionPost - ConditionPre, levels=design)  # Post vs Pre
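
For reference, the surrounding steps follow the guide. Roughly, as a sketch (object and sample names are illustrative; readBismark2DGE is the edgeR function for Bismark coverage files):

library(edgeR)

# Read the Bismark coverage files into a DGEList with one methylated (Me)
# and one unmethylated (Un) column per sample.
yall <- readBismark2DGE(targets$File,
                        sample.names=paste(targets$Sample, targets$Condition, sep="_"))

# ... coverage filtering as in Section 4.7.3 of the User's Guide goes here,
# producing a filtered DGEList called y ...

# Dispersion estimation and testing with the design and contrast above.
y <- estimateDisp(y, design, trend.method="none")
fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrast=contr)
topTags(lrt)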

I plotted a histogram of the raw P-values (not reproduced here).

Two things strike me as odd: the high number of sites with a P-value close to 1, and the overrepresentation of sites with a P-value around 0.2.

The sites with a P-value close to 1 are (almost) 100% or 0% methylated in all samples (they have 0 counts of either methylated or unmethylated Cs in almost all samples). I don't know what characterizes the sites with a P-value around 0.2.

I have tried removing sites with a constant methylation level, but that does not solve the issue (most sites have at least one observation in at least one sample).

My questions are:

Is it reasonable/safe to remove all sites that have close to 0 or 100% methylation in all samples? What would be a reasonable heuristic?

What can cause a bump around P = 0.2? Can this be remedied somehow?

edgeR rrbs
ADD COMMENT

I just realised that filtering out all sites where the average number of methylated or unmethylated reads across all samples is less than one seems to solve both problems. Now the distribution of P-values looks uniform with a slight increase in sites with a low P-value. If anyone has an explanation, suggestions or guidelines you are welcome to post them, but I think this solves the problem.
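
Something along these lines (a sketch, assuming the counts are in a DGEList y with the alternating methylated/unmethylated column layout from the edgeR User's Guide):

# Split the counts into methylated and unmethylated matrices.
Methylation <- gl(2, 1, ncol(y), labels=c("Me","Un"))
Me <- y$counts[, Methylation=="Me"]
Un <- y$counts[, Methylation=="Un"]

# Keep sites with an average of at least one methylated and one
# unmethylated read across samples.
keep <- rowMeans(Me) >= 1 & rowMeans(Un) >= 1
y <- y[keep, , keep.lib.sizes=FALSE]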

ADD REPLY

Have you followed the edgeR User's Guide regarding filtering, as in Section 4.7.3? Or else the "Filter to remove low counts" section of https://f1000research.com/articles/6-2055/ ?

ADD REPLY

I used the filtering from Section 4.7.3 when I wrote this question; I have now also tested the filtering used in the F1000 paper and the filterByExpr function (with the design matrix) of edgeR. The edgeR guide leaves me with 150,000 sites, the F1000 filtering with 260,000 and filterByExpr with 592,000 sites. In comparison, filtering out sites with almost 100% or 0% methylation removes 10.2 million sites (the histogram was on a subset of sites).

Is there any reason not to use filterByExpr?

ADD REPLY
@gordon-smyth

How to filter methylation data

You ask an interesting question and I had a think about it overnight.

Filtering for methylation data is different from RNA-seq. The main requirement is that each CpG site should have decent read coverage, otherwise it is impossible to tell whether the site is methylated or not. In the case study in the edgeR User's Guide (Section 4.7) we required that a CpG site should have at least 10 reads (methylated or unmethylated) in every sample. In version 2 of our F1000 article https://f1000research.com/articles/6-2055/ we relaxed this slightly to at least 8 reads in every sample. You have already applied this filtering, so your analysis should be fine.
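
In code, that coverage filter looks something like this (a sketch using the alternating Me/Un column layout produced by readBismark2DGE; object names are illustrative):

# Total coverage per CpG per sample = methylated + unmethylated counts.
Methylation <- gl(2, 1, ncol(yall), labels=c("Me","Un"))
Me <- yall$counts[, Methylation=="Me"]
Un <- yall$counts[, Methylation=="Un"]
Coverage <- Me + Un

# Keep CpGs with at least 8 reads in every sample (the User's Guide uses 10).
keep <- rowSums(Coverage >= 8) == ncol(Coverage)
y <- yall[keep, , keep.lib.sizes=FALSE]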

The edgeR filterByExpr() function is intended for RNA-seq and is not appropriate for methylation data. For methylation data we require that a CpG has minimal coverage for all samples, whereas for RNA-seq data a gene needs to be expressed only in some samples.

Now I come to your main question. What if a CpG site has plenty of reads but they are always 100% methylated or 0% methylated? Such a CpG is uninteresting from a differential methylation (DM) point of view. It can never be DM. It will have logFC=0 and PValue=1 for all comparisons. Can you filter it?

After some thought, I think that, yes, you can filter CpG sites with 100% or 0% methylation across all samples. It is true that this filtering is dependent on the methylated vs unmethylated factor in the design matrix. Such "peeking" would not be allowed for RNA-seq. However, for methylation data this factor is a structural variable rather than an experimental factor that you will test for. Filtering these sites should not bias the results for the experimental factors, all of which are essentially orthogonal to the methylation factor.

CpG sites with 100% or 0% methylation will be fitted perfectly by edgeR, with fitted values equal to observed counts, and the deviances will be zero. These sites will not contribute to the dispersion estimation so removing them will not bias the empirical Bayes estimation.
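
In code, such a filter could look like this (a sketch, with Me and Un as in the snippet above):

# Keep CpGs with at least one methylated and at least one unmethylated
# read somewhere across the samples.
keep <- rowSums(Me) > 0 & rowSums(Un) > 0
y <- y[keep, , keep.lib.sizes=FALSE]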

However, removing the 100% or 0% methylated sites may make little difference to your final results. There is no objective reason why you need to have a uniform p-value histogram. The only gains will be (i) a decrease in running time and (ii) a slight increase in statistical power for detecting DM at the remaining sites.

ADD COMMENT

This answers my question, thank you. Can I ask two more questions? Firstly, why is filterByExpr not appropriate if it is used on the Coverage object from the F1000 paper? The function seems to do almost the same thing. Secondly, how is the model affected by removing sites that are 0% or 100% methylated across almost all samples? My tissue is very homogeneous, and 94% of all sites have an average of less than 1 count across all samples in either the methylated or the unmethylated state, whereas only 50% have exactly 0 counts. This amounts to testing either 5.5 million sites or a few hundred thousand. The decrease in running time is substantial, and to me it seems the power increase should be big as well - but am I somehow biasing the results by excluding these sites?

ADD REPLY

No, filterByExpr isn't doing the same thing. For methylation data, we need CpGs to have minimal coverage for every sample, otherwise the methylation level isn't even measurable. For RNA-seq data we only need a gene to be expressed in a few samples to be worth keeping.

The CpG numbers you mention seem enormous. Half of all CpGs have no methylation at all, yet you still have 5.5 million CpG sites left after filtering for both coverage >= 10 and methylation not all zero? You must have started with a lot of CpG sites. Is this really RRBS data?

I suppose you could filter CpGs for which the total number of methylated counts was very small, say less than 5. I have not tried this myself and I can't tell you confidently it will be fine if you filter more than that.
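
For concreteness, something like this (a sketch; Me is the matrix of methylated counts as in the snippets above):

# Keep CpGs whose total methylated count across all samples is at least 5
# (and perhaps symmetrically for the unmethylated counts). Untried.
keep <- rowSums(Me) >= 5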

ADD REPLY

I will play it safe and stick to your initial suggestion and only filter out sites with 100% or 0% methylation. The data is RRBS, but it is sequenced quite deeply, maybe more than necessary. 

Your explanation of filterByExpr made it make sense to me - thanks.

ADD REPLY

Of course, if you follow our promoter region approach, then it all becomes much simpler.

ADD REPLY
Aaron Lun @alun

If you have a site with 100% or 0% methylation across all samples, then getting a p-value of 1 is inevitable; after all, there is literally no difference in methylation status between the conditions. You can also get p-values of 1 if all samples in one condition have no reads for both the methylated and unmethylated states. The log-ratio of methylated:unmethylated is undefined and thus there is no evidence to reject the null hypothesis. So I don't think there's any mathematical error here - that's just the way it is.

One could argue that you should have done a better job of filtering out such sites, but I don't see an easy way of identifying them in a manner that is independent from the model fit. Your proposed solution (removing all-methylated or all-unmethylated sites) indirectly "peeks" at the difference between conditions; filtering of sites on the basis of this information subverts the multiple testing correction and will probably result in more false positives. All-methylated or all-unmethylated sites occur with non-zero probability under the null hypothesis, so removing them will effectively increase the type I error rate across all remaining tests. 

In short, I wouldn't worry about it. You can think of the spike at 1 as a discrete approximation to a uniform distribution (hah!). If we were able to get precise continuous values to replace the zeroes, we'd probably have a nice uniform distribution, but because our observations are rounded off to the nearest integer, we're stuck with p-values of 1 for these sites. I wouldn't worry too much about the spike at 0.2 either. Unless you know that the null hypothesis is true for all sites, you can't expect the p-value distribution to be perfectly uniform.

Edit: Given your update, I should mention that the average count is a good approximation to (what is probably) an independent filter statistic when the library sizes are equal. Getting rid of low-abundance sites is not guaranteed to solve the spike-at-1 problem (e.g., what about strongly methylated/unmethylated sites at high coverage?) but it will reduce the number of sites with many zeroes, so that's a start. The disappearance of the bump is harder to explain but I'm not convinced that was really a problem in the first place. 
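
A sketch of such an abundance filter (illustrative cutoff; Me and Un are the methylated and unmethylated count matrices as in the edgeR workflow):

# Average total coverage per CpG across samples, a filter statistic that
# does not look at the difference between conditions.
abundance <- rowMeans(Me + Un)
keep <- abundance >= 8  # illustrative cutoff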

ADD COMMENT
@peter-langfelder-4469

I'll go against Aaron's advice a bit. You can filter by variance (that's not peeking at the differences between conditions), or, maybe better in this case, by median absolute deviation (MAD, R function mad). If you have a variable (site) that is constant except in a few (less than 25%) of the samples, mad will be zero. To the best of my knowledge, filtering out such sites will not bias the significance statistics.
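
A sketch of what that might look like here, assuming Me and Un are the methylated and unmethylated count matrices (illustrative; see the replies below for caveats about this kind of filtering):

# Per-site methylation proportion for each sample (NaN where coverage is 0).
beta <- Me / (Me + Un)

# Keep sites whose methylation level actually varies: mad() is zero when
# the site is constant in all but a few samples.
keep <- apply(beta, 1, mad, na.rm=TRUE) > 0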

ADD COMMENT

In my opinion, variance filtering is always incorrect in conjunction with edgeR, limma or DESeq2. It is true that variance filtering is independent of the explanatory variable, but it biases the variance distribution and hence stuffs up the empirical Bayes dispersion estimation that those packages do, so it does interfere with the statistical testing procedures.

ADD REPLY

A simple example to back up Gordon's point:

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(80000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.1) # Controlled below threshold.
## [1] 0.0956

# Filtering for low-deviance sites, analogous to low variance.
keep <- fit$deviance < median(fit$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue) # Not uniform!
mean(res2$table$PValue <= 0.1) # Above the threshold.
## [1] 0.1892

You'll notice that I used the deviance instead of the variance, as the former is the analogous metric for GLMs. Computing the variance of the counts doesn't make sense; the linear model would be identity-link rather than log-link, for starters. Looking for NB dispersions near zero isn't helpful either, as either the dispersions are squeezed away from zero (if prior.df!=0) or many useful features get discarded due to imprecise estimation of the dispersions with low replication (if prior.df=0). 

In the OP's specific case, to get rid of the problematic sites in the most specific and comprehensive manner, I would fit the null model to each site and discard all sites that had deviances of zero under the null. However, I would rather not do this in an actual analysis. I suspect this is equivalent to data dredging, though loss of type I error control is difficult to demonstrate; in scenarios where these problematic sites are generated, the p-value distributions are usually quite screwed up in other respects.

ADD REPLY

I am not going to disagree with Gordon since he knows infinitely more than I do about the internal workings of the empirical Bayes moderation, but I have a comment on Aaron's code.

When filtering by deviance, one needs to calculate the deviance from the fit to the design ~1 (which is analogous to calculating the variance of the original variable), not ~group (that's biased, since you are looking at the response, and would be analogous to looking at the variance of the residuals). I modified Aaron's code to filter by deviance obtained from a null model (intercept only), and there is no obvious p-value inflation.

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)
design.null = model.matrix(~1, data = data.frame(x = group))

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(80000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

d.null = estimateDisp(d, design.null)
fit.null <- glmQLFit(d.null, design.null, robust=TRUE)

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.1) # Controlled below threshold.
## [1] 0.0956

# Filtering for low-deviance sites, analogous to low variance.
keep <- fit.null$deviance < median(fit$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue)
mean(res2$table$PValue <= 0.1)

## [1] 0.07942708

ADD REPLY

Perhaps the problems become clearer if you drop the type I error threshold:

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(400000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)
# Increased the number of simulated sites for stable type I error rate estimates.

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.01) # Controlled at threshold.
## [1] 0.00988

# Following Peter's suggestion; slightly corrected median() call.
design.null <- design[,1,drop=FALSE]
d.null <- estimateDisp(d, design.null)
fit.null <- glmQLFit(d.null, design.null, robust=TRUE)
keep <- fit.null$deviance < median(fit.null$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue) # Not uniform; see the far left!
mean(res2$table$PValue <= 0.01) # ~8-fold below the threshold.
## [1] 0.00122

So the opposite problem now occurs: you get a depletion of low p-values. Not surprising, as you're removing features where the deviance under the null model is high - that is, those features that are likely to be differential and thus of interest in the first place! Moral of the story: any filtering on the GLM deviance is dangerous.

ADD REPLY
