Filtering low coverage RRBS counts results in few reads. Any alternatives?
2
0
Entering edit mode
@altintasali-11054
Last seen 6 days ago
Denmark

I have been following the paper "Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR" and I have some problems with the low count filtering process.

Experimental design

Here is the experimental design I have:

meta <- data.frame(library = as.factor(1:18),
                   patient = as.factor(c(rep(1:3, each = 4), rep(4:6, each = 2))),
                   disease = c(rep("sick", each = 12), rep("healthy", each = 6)),
                   tissue = c(rep(c("liver","muscle"), times = 6), rep(c("liver","muscle"), times = 3)), 
                   treatment = c(rep(c("placebo","drug"), each = 6), rep("placebo", each = 6))
)
meta$group <- paste(meta$disease, meta$tissue, meta$treatment, sep = ".")

I know that it is kind of a weird setup as the healthy patients didn't receive a "drug" as a treatment. That's why I intended to use the approach with voomLmFit I asked earlier (post here).

PS: I tried to keep the experimental setup as simple as possible for easy communication. I have more more than 10 patients with the total number of 44 libraries.

Problem: Filtering

When I do the recommended filtering for the CpG sites, I end up with so few reads.

## The following code is similar to the one on the RRBS edgeR paper
## (...)
Coverage <- yall$counts[, Methylation=="Me"] + yall$counts[, Methylation=="Un"]
min.coverage <- 8
nObs <- ncol(Coverage)
keep <- rowSums(Coverage >= min.coverage) == nObs
table(keep)
#   FALSE    TRUE 
# 8023559    1042  

## (...)

I have only 1042 reads, which have at least 8 counts across all the samples, out of ~8 million.

Question

I would like to get your feedback about how to filter the low counts in such a dataset. As the edgeR paper on RRBS differential analysis mentions, the filtering process can be somewhat relaxed, but I am not sure how. Therefore, I tried some of the following approaches:

Approach #1: filterByExpr()

I am aware of the fact that Gordon Smyth did not recommend the usage of filterByExpr on RRBS data (post here) as the function was originally designed for RNAseq filtering. I gave it a try anyways as I was running out of options and here are the results:

keep <- filterByExpr(Coverage, group = meta$group)
table(keep)
#   FALSE    TRUE 
# 5179460 2845141

Approach #2: Keep at least "n" samples that has enough coverage in every group

Let's assume that we would like to keep CpGs that has 8 coverage in at least 2 samples per group.

min.coverage <- 8                      # minimum coverage for a CpG site
pass <- (Coverage >= min.coverage) * 1 # convert logical to numeric
pass_per_group <- t(rowsum(t(pass), group = meta$group))
min.n <- 2                             # minimum number of samples with enough coverage
keep <- rowSums(pass_per_group >= min.n) == length(unique(meta$group))
table(keep)
# keep
#   FALSE    TRUE 
# 7998435   26166

Unfortunately, it is still not high enough coverage. Besides, I am not sure how such filtering might affect the differential analysis.

Question: I believe that the approaches above are not ideal. May I ask for your help if you have a better way of filtering the low RRBS counts? Or should I live with 1024 CpG sites after recommended filtering.

Thanks in advance for your help.

limma edgeR • 1.8k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 56 minutes ago
WEHI, Melbourne, Australia

Your data seems to have very low coverage, and there is no way for a filtering step to solve that. Your approach #2 seems reasonable. You could lower the min.coverage further without any negative consequences but you probably won't end up with many significant differential CpGs. There's no way to change that -- it's an inevitable consequence of low coverage.

You could consider consolidating CpG counts by genomic region to get larger counts.

ADD COMMENT
1
Entering edit mode
gamma.jian ▴ 10
@franceschinigianmarco-16472
Last seen 3 months ago
Switzerland

Have you considered smoothing approaches? Maybe BSmooth or similar approaches could be helpful in your setting ( see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-10-r83 ). The main idea is to exploit the strong local correlation of CpGs to retain more information, without using strong filters on the coverage. Overlapping CpGs with high coverage from multiple experiments often results in that culprit. I hope this could be useful!

ADD COMMENT

Login before adding your answer.

Traffic: 624 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6