scalability of edgeR on RRBS methylation data
2
0
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 7.0 years ago
United States

Hi all, 

I have a large RRBS dataset that I'm trying to analyze using 

https://f1000research.com/articles/6-2055/v1

Following this paper, after filtering the methylated and unmetylated count matrix using the filter 

keep <- rowSums(counts_total >= 10) == 100

my DGE object is about 6e5 rows x 100 columns. If I then further filter to promoters I can carry out the differential methylation analysis in predefined gene promoters outlined in the paper without issue, however I cannot estimate dispersion and carry out DMR analysis without doing so because estimateDisp takes too long to finish (it just doesn't, for me). For clarity , the command is 

dge_methyl <- estimateDisp(dge_methyl, design=fullmethyldesign, trend="none", robust=TRUE)

The design matrix is created using the function modelMatrixMeth

Is there a better way to optimize this and subsequent steps using 

glmFit, 
glmLRT ?

 

thanks,

zo

edgeR methyation RRBS • 1.7k views
ADD COMMENT
0
Entering edit mode

How many different conditions do you have? How many columns does the design matrix have?

ADD REPLY
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 2 hours ago
The city by the bay

I don't work on RRBS data, but I can (sort of) address the edgeR parts of your post. modelMatrixMeth generates an additive design matrix, meaning that glmFit in edgeR must use the mglmLevenberg algorithm rather than the faster mglmOneWay algorithm. This is exacerbated by the number of samples in your dataset, which results in a proportional number of coefficients in the design matrix. The Levenberg algorithm performs a Cholesky decomposition, which has cubic time complexity with respect to the number of coefficients. So you can really see how everything slows right down - the number of rows doesn't help, either.

There are several options for speeding up the analysis. The first is to parallelize the GLM fits, assuming you have a cluster over which the calculations could be distributed. This doesn't really solve the cubic scaling but it would probably increase the range of datasets for which calculations are still feasible. The second approach is to improve the efficiency of the Cholesky decomposition, e.g., using Matrix:::dsCMatrix_Cholesky for sparse matrices. This is... possible, but not straightforward. Both of these solutions are developer-side, so there's not much that can be done on your end.

In the meantime, I suggest implementing a "poor man's parallelization", e.g., break up the analysis by chromosome and submit each of these as a separate job. There should still be enough regions on each chromosome for empirical Bayes to be effective. I have used a similar strategy for Hi-C data analyses with edgeR, when the genome-by-genome interaction space was too large to be loaded.

ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

Wow, that is a lot of RRBS samples! I haven't seen any published methylation datasets that have both replication and that many samples.

I suggest that you try a faster variation of the edgeR pipeline, with

dge_methyl <- estimateGLMCommonDisp(dge_methyl, design=fullmethyldesign, subset=2000)

On my laptop computer, this takes about a minute and a half for a dataset of the same size as yours.

You could then proceed with glmFit and glmLRT, as in the F1000R workflow, although this will now be somewhat sensitive to outliers. Better probably would be use the quasi pipeline instead, with:

fit <- glmQLFit(dge_methyl, fullmethyldesign, robust=TRUE)
res <- glmQLFTest(fit)

etc. The glmQLFit step might take 10-20 minutes on a laptop.

ADD COMMENT

Login before adding your answer.

Traffic: 706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6