Main Question:
I have been working on an analysis of RRBS data, using various utility function to generate a data matrix that consists of averages of proportions of methylated reads at a list of CpG islands for each of the n = 59 subjects (the dimension of the matrix is roughly 15,000 x 59). The `limma::lmFit` method (and related functions) are then used, along with an appropriate design matrix, to assess any evidence of differential methylation at these CpG islands. I am concerned that the bounded nature of the outcome variable (linear regression is of course unsuitable for bounded outcomes) is preventing any potential signal in the data from being picked up. Does Limma accommodate such analyses, or are there transformations of the data that may help?
Background:
In working on an analysis of RRBS data, I have performed preprocessing on BAM files for n = 59 subjects using Bismark. I have then imported the coverage files produced by Bismark into R using the `bsseq::read.bismark` function to generate an object of class BSseq. I then extract the coverage and count of methylated reads from the BSseq object, generating a data structure of dimension ~10e6 x 59. After dropping CpG loci with insufficient coverage, a single data.frame is generated that contains the proportion of methylated reads at each CpG locus -- the new matrix is roughly 3,000,000 x 59. A table of CpG islands is then extract from the UCSC genome browser and the methylation proportion values for CpG sites falling into the same defined island region are averaged. This summarized data structure is then analyzed with `limma::lmFit` and various associated functions (`eBayes`, `topTable`, etc.), with an appropriate design matrix. The result is a table of CpG islands assessed for any evidence of differential methylation. It is clear that linear modeling is not a suitable method for assessing bounded outcomes (the averaged proportions are of course in [0, 1]). Are there suggestions on how to properly transform the data to deal with this problem? I have seen general approaches to bounded regression but nothing specific to Limma and related bioinformatical tools.
Any help/advice would be much appreciated.
I can't answer your question fully as I don't have experience with methyl-seq, but for methylation arrays my understanding is that one typically runs limma on the M-values, which are the result of a logistic transformation of the methylation proportions. This handles the issue that proportions are bounded between 0 and 1 and decidedly non-normal as a result, but it doesn't handle the discreteness that exists in methyl-seq but not methylation array data. I don't know how you would handle that.