I have been following the paper "Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR" and I have some problems with the low count filtering process.
Experimental design
Here is the experimental design I have:
meta <- data.frame(library = as.factor(1:18),
patient = as.factor(c(rep(1:3, each = 4), rep(4:6, each = 2))),
disease = c(rep("sick", each = 12), rep("healthy", each = 6)),
tissue = c(rep(c("liver","muscle"), times = 6), rep(c("liver","muscle"), times = 3)),
treatment = c(rep(c("placebo","drug"), each = 6), rep("placebo", each = 6))
)
meta$group <- paste(meta$disease, meta$tissue, meta$treatment, sep = ".")
I know that it is kind of a weird setup as the healthy
patients didn't receive a "drug" as a treatment
. That's why I intended to use the approach with voomLmFit
I asked earlier (post here).
PS: I tried to keep the experimental setup as simple as possible for easy communication. I have more more than 10 patients with the total number of 44 libraries.
Problem: Filtering
When I do the recommended filtering for the CpG sites, I end up with so few reads.
## The following code is similar to the one on the RRBS edgeR paper
## (...)
Coverage <- yall$counts[, Methylation=="Me"] + yall$counts[, Methylation=="Un"]
min.coverage <- 8
nObs <- ncol(Coverage)
keep <- rowSums(Coverage >= min.coverage) == nObs
table(keep)
# FALSE TRUE
# 8023559 1042
## (...)
I have only 1042 reads, which have at least 8 counts across all the samples, out of ~8 million.
Question
I would like to get your feedback about how to filter the low counts in such a dataset. As the edgeR paper on RRBS differential analysis mentions, the filtering process can be somewhat relaxed, but I am not sure how. Therefore, I tried some of the following approaches:
Approach #1: filterByExpr()
I am aware of the fact that Gordon Smyth did not recommend the usage of filterByExpr
on RRBS data (post here) as the function was originally designed for RNAseq filtering. I gave it a try anyways as I was running out of options and here are the results:
keep <- filterByExpr(Coverage, group = meta$group)
table(keep)
# FALSE TRUE
# 5179460 2845141
Approach #2: Keep at least "n" samples that has enough coverage in every group
Let's assume that we would like to keep CpGs that has 8 coverage in at least 2 samples per group.
min.coverage <- 8 # minimum coverage for a CpG site
pass <- (Coverage >= min.coverage) * 1 # convert logical to numeric
pass_per_group <- t(rowsum(t(pass), group = meta$group))
min.n <- 2 # minimum number of samples with enough coverage
keep <- rowSums(pass_per_group >= min.n) == length(unique(meta$group))
table(keep)
# keep
# FALSE TRUE
# 7998435 26166
Unfortunately, it is still not high enough coverage. Besides, I am not sure how such filtering might affect the differential analysis.
Question: I believe that the approaches above are not ideal. May I ask for your help if you have a better way of filtering the low RRBS counts? Or should I live with 1024 CpG sites after recommended filtering.
Thanks in advance for your help.