Question

scalability of edgeR on RRBS methylation data

0

Entering edit mode

osa.zuta ▴ 40

@osazuta-8625

Last seen 7.2 years ago

United States

Hi all,

I have a large RRBS dataset that I'm trying to analyze using

https://f1000research.com/articles/6-2055/v1

Following this paper, after filtering the methylated and unmetylated count matrix using the filter

keep <- rowSums(counts_total >= 10) == 100

my DGE object is about 6e5 rows x 100 columns. If I then further filter to promoters I can carry out the differential methylation analysis in predefined gene promoters outlined in the paper without issue, however I cannot estimate dispersion and carry out DMR analysis without doing so because estimateDisp takes too long to finish (it just doesn't, for me). For clarity , the command is

dge_methyl <- estimateDisp(dge_methyl, design=fullmethyldesign, trend="none", robust=TRUE)

The design matrix is created using the function modelMatrixMeth

Is there a better way to optimize this and subsequent steps using

glmFit,

glmLRT ?

thanks,

zo

edgeR methyation RRBS • 1.9k views

ADD COMMENT • link updated 7.2 years ago by Gordon Smyth 52k • written 7.2 years ago by osa.zuta ▴ 40

0

Entering edit mode

How many different conditions do you have? How many columns does the design matrix have?

ADD REPLY • link 7.2 years ago Gordon Smyth 52k

score 0 · Answer 1 · 2018-01-24

I don't work on RRBS data, but I can (sort of) address the edgeR parts of your post. modelMatrixMeth generates an additive design matrix, meaning that glmFit in edgeR must use the mglmLevenberg algorithm rather than the faster mglmOneWay algorithm. This is exacerbated by the number of samples in your dataset, which results in a proportional number of coefficients in the design matrix. The Levenberg algorithm performs a Cholesky decomposition, which has cubic time complexity with respect to the number of coefficients. So you can really see how everything slows right down - the number of rows doesn't help, either.

There are several options for speeding up the analysis. The first is to parallelize the GLM fits, assuming you have a cluster over which the calculations could be distributed. This doesn't really solve the cubic scaling but it would probably increase the range of datasets for which calculations are still feasible. The second approach is to improve the efficiency of the Cholesky decomposition, e.g., using Matrix:::dsCMatrix_Cholesky for sparse matrices. This is... possible, but not straightforward. Both of these solutions are developer-side, so there's not much that can be done on your end.

In the meantime, I suggest implementing a "poor man's parallelization", e.g., break up the analysis by chromosome and submit each of these as a separate job. There should still be enough regions on each chromosome for empirical Bayes to be effective. I have used a similar strategy for Hi-C data analyses with edgeR, when the genome-by-genome interaction space was too large to be loaded.

score 0 · Answer 2 · 2018-01-24

Wow, that is a lot of RRBS samples! I haven't seen any published methylation datasets that have both replication and that many samples.

I suggest that you try a faster variation of the edgeR pipeline, with

dge_methyl <- estimateGLMCommonDisp(dge_methyl, design=fullmethyldesign, subset=2000)

On my laptop computer, this takes about a minute and a half for a dataset of the same size as yours.

You could then proceed with glmFit and glmLRT, as in the F1000R workflow, although this will now be somewhat sensitive to outliers. Better probably would be use the quasi pipeline instead, with:

fit <- glmQLFit(dge_methyl, fullmethyldesign, robust=TRUE)
res <- glmQLFTest(fit)

etc. The glmQLFit step might take 10-20 minutes on a laptop.