Hi everyone,
I have a question on differential DNA methylation probes and genes. Hope somebody could give some guidacne. I am working on an EPIC DNA methylation dataset with more than 200 samples (half control samples, and half disease samples), with other confounding information included, like various cell contents (cell type1 to cell type3) of the samples, and other factors like donator's age, gender, BMI and so on. Then I did differential methylation analysis on two levels, one was on EPIC methylation probe level (use their beta values), and the other was on gene level (use the average beta value of probes in gene promotor region to represent gene methylation level). For both probes and genes, I used limma to find differential ones with an regression model like,
probe/gene beta value ~ Disease_status + Cell_type1_content + Cell_type2_content + Cell_type3_content + Disease_status:Cell_type1_content + Diease_status:Cell_type2_content + Diease_status:Cell_type3_content + age + gender + BMI + ... (other confounding factors)
Then, I checked the adjusted p-value of the interaction terms like, Diesease_status:Cell_type1_content. It inidcated whether the correlation between probe/gene and Cell_type1_content, was significantly different between control group and disease group. However, what I found wired was that, on gene level, more than 3000 such differential genes could be found for the interaction term Disease_status:Cell_type1_content, while on probe level, only less than 500 probes could be found, which was much fewer than the genes. Since gene methylation value was summaried from probe value, as mentioned above, it was unreasonable that its number was much larger than that of probe. I don't know whether it is the problem of my limma regression design, or gene beta value summarization. Hope somebody could give some suggestions. Thank you so much!
Thanks for answering a limma question. I agree with most of your post, but I don't see any reason to think that taking averages would be anti-conservative or would be at all equivalent to cheating. I agree that averages will be less noisy, but that is a good thing. IMO an analysis of averages for gene promoters should control the error rate at least as well as published methods for aggregating adjacent regions implemented in packages such as DMRcate or minfi. I agree that M-values would probably be better than beta-values, but I have little experience with methylation arrays.
In general, it is always valid to compute summary statistics (such as an averages) and to conduct a statistical analysis of the averages. The only requirement is that the values to be averaged should be chosen purely according to gene annotation and not according to the data or the size of the beta or M-values. Ideally, the averages should be taken over similar numbers of individual observations, but statistical analyses are usually fairly robust to variations in n.
The reason why DMRcate, minfi etc don't take genewise averages is that they want to conduct analyses that are annotation agnostic, and they want to detect methylation changes anywhere in the genome, not just within promoters. It is a question of what scientific question one wants to answer rather than a question of statistical validity. The approach they take is far more delicate statistically because they are in fact choosing regions to be aggregated on the basis of the observed M-values and therefore they have to adjust for multiple testing at the observation level as well at the region level. Aaron Lun and I explored similar issues in the csaw package for ChIP-seq. These issues don't arise for gene promoter analyses or for any analysis where the genomic intervals are preset.
I am a big fan of gene-orientated analyses that focus on the main genewise signal rather than trying to detect every possible change anywhere in the genome in a single analysis, for example:
Thank you for your reply.