Newbie methylation and stats question


Gustavo Fernández Bayón ▴ 440

@gustavo-fernandez-bayon-5300

Last seen 9.3 years ago

Spain

Hi everybody.

As a newcomer to bioinformatics, I often find it difficult to see how biological knowledge mixes with statistics. I come from the machine learning field and usually have problems with the naming conventions (well, among several other things, I must admit). Besides, I am not an expert in statistics, having used only the bare minimum needed to validate my work.

Let me try to be more precise. One of the topics I am working on most right now is the analysis of methylation array data. As you surely know, the final processed (and normalized) beta values are presented in a p x n matrix, where there are p different probes and n different samples or individuals from which we have obtained the beta values. I am not currently working with the raw data.

Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group of nB probes belonging to B, and an empty intersection. Say that we want to find a way to show that there is a statistically significant difference between the methylation values of the two regions.

As far as I have seen in the literature, statistical tests are always done by comparing the values of the same probe between case and control groups of individuals or samples, for example when trying to find differentially methylated probes.

However, if I think of directly comparing all the beta values from region A (nA * n values) against those in region B (nB * n values) with, say, a t-test, I suspect that something is not being done the way it should. My knowledge of biology and statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong with this approach. Am I right?

What I have done in similar experiments is to find differentially methylated probes and then test the proportion of differentially methylated probes relative to the total number of probes, so that I could assign a p-value showing that the region of reference had a significant influence.

Several questions here: What would be a coherent approach to the regions A and B problem stated above? Is there any problem with methylation data that I am not aware of that makes only the within-probe analysis valid? Are there any bibliographic references that could help me see the subtleties involved?

As you can see, the concepts are quite tangled in my mind, so any help would be much appreciated.

Regards,
Gustavo

probe ASSIGN probe ASSIGN • 2.6k views

ADD COMMENT • link updated 12.8 years ago by Tim Triche ★ 4.2k • written 12.8 years ago by Gustavo Fernández Bayón ▴ 440


Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 4.6 years ago

United States

Look up Andrew Jaffe and Rafa Irizarry's paper on "bump hunting" for regional differences. Or run a smoother over it (caveat: I just wrote smoothing "the way I want it" yesterday, after being provoked by a collaborator, so you might have to use lumi).

The function "dmrFinder" in the "charm" package is specifically meant for this sort of thing.

Also, if you're doing linear tests, be careful with normalization, mask your SNPs and chrX probes, and maybe use M-values (logit(beta)) for the task. The latter is more important for epidemiological datasets than for something like cancer, where every single interesting result from M-value testing has been reproduced using untransformed beta values when I ran comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs. Illumina hm450 differences for IDH1/2 mutants, the complete absence of any differences for TET2 mutants regardless of platform, etc.).

Mark Robinson just chimed in, I see. Probably a good idea to read his reply carefully as well.

On Tue, Jun 19, 2012 at 3:57 AM, Gustavo Fernández Bayón wrote:

> Say that we want to find a way to show there is a statistically significant difference between the methylation values of both regions.

ADD COMMENT • link 12.8 years ago Tim Triche ★ 4.2k
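The M-value suggestion in the reply above can be made concrete with a small sketch. It assumes the commonly used base-2 definition, M = log2(beta / (1 - beta)), which is the logit of beta up to a constant factor. The function names and the clipping constant `eps` are illustrative choices, not taken from any package mentioned in this thread.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value, log2(beta / (1 - beta))."""
    # Clip away from 0 and 1 so the logarithm stays finite.
    # eps is an arbitrary illustrative choice, not a recommended default.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to the beta scale."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Made-up beta values, for illustration only.
betas = [0.02, 0.10, 0.50, 0.90, 0.98]
m_values = [beta_to_m(b) for b in betas]
```

The transform stretches the scale near 0 and 1, so equal steps in beta are not equal steps in M. That is one reason linear tests can behave differently on the two scales.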


As far as I know, there is no standard normalization for methylation data. I prefer keeping the raw values and adjusting only for technical variation. If anyone has a better solution, please let me know.

Back to the question: I agree with Mark. It's unusual to compare different regions. These regions may have different background methylation status, which makes them hard to compare directly.

Jack

ADD REPLY • link 12.8 years ago yao chen ▴ 210
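The region A versus region B question can be sketched in a way that keeps the sample as the unit of replication. Pooling all nA * n and nB * n values into one t-test treats probes from the same sample as if they were independent observations. One hedged alternative is to summarize each region per sample and then test the paired per-sample differences. This is only a minimal illustration on made-up numbers, not the bump hunting or dmrFinder approach recommended above. As the reply above points out, a difference in level between two regions may simply reflect their different background methylation.

```python
import math
import statistics


def region_means(beta, probe_idx):
    """Mean beta value per sample over the probes (rows) of one region."""
    n_samples = len(beta[0])
    return [statistics.fmean(beta[i][j] for i in probe_idx)
            for j in range(n_samples)]


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for per-sample differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1


# Toy p x n matrix: 5 probes (rows) x 4 samples (columns); values are made up.
beta = [
    [0.80, 0.85, 0.78, 0.82],  # region A
    [0.75, 0.83, 0.80, 0.79],  # region A
    [0.20, 0.30, 0.22, 0.25],  # region B
    [0.25, 0.28, 0.18, 0.27],  # region B
    [0.22, 0.35, 0.20, 0.24],  # region B
]
region_a, region_b = [0, 1], [2, 3, 4]  # hypothetical probe-to-region assignment

mean_a = region_means(beta, region_a)
mean_b = region_means(beta, region_b)
t_stat, df = paired_t(mean_a, mean_b)
```

The statistic would be compared against a t distribution with `df` degrees of freedom, so the effective sample size is n (the number of samples) rather than the number of probe-by-sample values.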


On Tue, Jun 19, 2012 at 7:46 AM, Yao Chen wrote:

> As far as I know, there is no standard normalization for methylation data.

As far as I know, there is no standard for microarray or RNAseq normalization either! But that doesn't mean an investigator should ignore the issue of technical (as opposed to biological) fixed or varying effects in their data, especially if it could materially impact the outcome of a study. lumi offers quantile normalization, minfi and methylumi will do dye bias normalization, etc.

For example, GenomeStudio appears to choose a reference array for dye bias adjustment within each batch of 450k samples, and to correct using the normalization controls so that the chips in the run have Cy3:Cy5 bias equivalent to the reference. This is less than optimal if you then want to compare with another, separate batch. Personally I feel that it's better to start from IDATs.

Another possibility is pernicious batch effects. Something like ComBat usually seems to work very well for those, although as noted it's always up to the investigator to ensure that they are reporting on biologically (vs. technically) interesting differences.

See for example http://www.biomedcentral.com/1755-8794/4/84

> I prefer keeping the raw values and adjusting only for technical variation. If anyone has a better solution, please let me know.

See above. If the usual MDS plots indicate a supervised effect, one should fix it, preferably on the logit scale with ComBat, SVA, or something else appropriate to the task (i.e. if you're doing unsupervised analyses, a different method might be optimal).

thanks,

--t

ADD REPLY • link 12.8 years ago Tim Triche ★ 4.2k


Hi Tim.

I didn't mean that we shouldn't normalize methylation data because there is no standard method. What I wanted to say is that most of the existing normalization methods are derived from expression microarrays and don't fit methylation data. Most of these methods, such as quantile normalization, assume that most genes are not differentially expressed. However, in DNA methylation data, global hypomethylation is observed in many diseases such as cancer. An improper normalization method would erase the real biological difference.

Jack

ADD REPLY • link 12.8 years ago yao chen ▴ 210
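The concern in the reply above, that quantile normalization can erase a real global shift, can be illustrated with a toy sketch. It applies a plain textbook quantile normalization directly to made-up beta values for two samples. This is an assumption made for illustration only and is not how lumi, minfi, or methylumi are implemented. Ties are broken by position, which is a simplification.

```python
import statistics


def quantile_normalize(columns):
    """Force every sample (one list of values per sample) onto one distribution."""
    n_probes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [statistics.fmean(s[r] for s in sorted_cols)
                 for r in range(n_probes)]
    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest value within this sample.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Made-up beta values for the same five probes in two samples.
control = [0.90, 0.80, 0.70, 0.60, 0.20]
tumour = [0.60, 0.55, 0.40, 0.35, 0.10]  # globally hypomethylated, by construction

norm_control, norm_tumour = quantile_normalize([control, tumour])

shift_before = statistics.fmean(control) - statistics.fmean(tumour)
shift_after = statistics.fmean(norm_control) - statistics.fmean(norm_tumour)
```

In this toy case the two samples have different mean methylation before normalization and identical distributions afterwards, so the global difference that was built into the data is gone. Whether that matters in practice depends on which quantity is normalized and on the dataset.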


Oh, I don't disagree that improper normalization is a bad idea. However, quantile normalization on the overall raw intensities (for example), assuming there are no gross differences in copy number, seems to work OK in many cases. I have seen people quantile normalizing on the summary statistics, which strikes me as perverse, but it's their data and their papers, not mine.

I do tend to believe that methods which take into account the peculiarities of the platform are preferable to those that don't, and the former do exist; the trouble is that few systematic comparisons have been conducted, mostly on small or unusual datasets.

As you point out, failing to take into account the differences between expression data (sparse transcripts, mostly absent) and genomic DNA (whether genotyping or "epigenotyping" arrays) can be expected to lead to poor results.

I'm not a fan of blindly applying anything, hence the suggestion to plot the data first and ask questions thereafter :-)

Cheers,

--t

ADD REPLY • link 12.8 years ago Tim Triche ★ 4.2k


Well, to sum up, I wanted to thank you all for your kind and constructive answers. I am now working through the references you provided. There is a lot to learn in this field and I am still at the beginning. If I run into further problems, be sure I'll be back on the list to ask.

Regards,
Gus

On Tuesday, June 19, 2012 at 20:19, Tim Triche, Jr. wrote:

> Oh, I don't disagree that improper normalization is a bad idea. However, quantile normalization on the overall raw intensities (for example), assuming there are not gross differences in copy number, seems to work OK in many cases. I have seen people quantile normalizing on the summary statistics, which strikes me as perverse, but it's their data and their papers, not mine.
>
> I do tend to believe that methods which take into account the peculiarities of the platform are preferable to those that don't, but the former do exist; the trouble is that few systematic comparisons have been conducted, mostly on small or unusual datasets.
>
> As you point out, failing to take into account the differences between expression data (sparse transcripts, mostly absent) and genomic DNA (whether genotyping or "epigenotyping" arrays) can be expected to lead to poor results. I'm not a fan of blindly applying anything, hence the suggestion to plot the data first and ask questions thereafter :-)
>
> Cheers,
>
> --t
>
> On Tue, Jun 19, 2012 at 11:12 AM, Yao Chen wrote:
>
> > Hi Tim.
> >
> > I didn't mean that we shouldn't normalize methylation data because there is no standard method. What I wanted to say is that most of the existing normalization methods are derived from microarray analysis and don't fit methylation data. Most of these methods, such as quantile normalization, assume that most genes are not differentially expressed. However, in DNA methylation data, global hypomethylation is observed in many diseases such as cancer. An improper normalization method would erase the real biological difference.
> >
> > Jack

ADD REPLY • link 12.8 years ago Gustavo Fernández Bayón ▴ 440


Hi Yao.

First of all, thank you for your answer.

On Tuesday, June 19, 2012 at 16:46, Yao Chen wrote:

> As far as I know, there is no standard normalization for methylation data.

As I said to Tim in a previous post, I really thought I did not have to deal with normalization issues. Now it seems I have to start worrying about it.

> For me, I prefer keeping the raw values and just adjusting for technical variation. Does anyone have a better solution? Please let me know.

I thought that, for a beginner like me, it was better not to deal with the normalization stages, and to just start working with the beta values.

> Back to the question, I agree with Mark. It's unusual to compare different regions. These regions may have different background methylation status and are hard to compare directly.

Thanks to the three of you, I think I am starting to see things clearly. The fact is that I just didn't know how to put it into words. We should not compare the methylation status of different regions because their magnitudes and behaviors are not comparable. Am I getting close?

> Jack

Regards,
Gustavo

ADD REPLY • link 12.8 years ago Gustavo Fernández Bayón ▴ 440


On Tue, Jun 19, 2012 at 8:22 AM, Gustavo Fernández Bayón wrote:

> As I have said to Tim in a previous post, I really thought I did not have to deal with normalization issues. Now it seems I have to start worrying about it.

Maybe yes, maybe no. Judgment matters when making such a decision. In some cases normalization seems to be a wash (see for example http://genomebiology.com/2011/12/1/R10) whereas in others it appears to make a material difference (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000952).

My purpose in raising the point is just that investigators ought to be aware of the possibility for spurious differences, so that they are caught early on rather than later. Sorry if it sounded more alarmist than that.

--t

ADD REPLY • link 12.8 years ago Tim Triche ★ 4.2k


Hi Tim.

Thank you for your answer. I'll try to "defend" myself the best I can below. ;)

On Tuesday, June 19, 2012 at 16:20, Tim Triche, Jr. wrote:

> Look up Andrew Jaffe and Rafa Irizarry's paper on "bump hunting" for regional differences.

I think both Mark and you agree on the paper. That is surely a good reason for me to read it thoroughly.

> Or run a smooth over it (caveat: I just wrote smoothing "the way I want it" yesterday, after being provoked by a collaborator, so you might have to use lumi).

I am not sure I understand what you are trying to tell me here. ;) Sorry. I know lumi, although I thought it only covered the stages up to normalization of the data.

> The function "dmrFinder" in the "charm" package is specifically meant for this sort of thing.

I had looked at the charm vignette in the past few days, but thought it was designed for a technology different from ours. For me, it is sometimes difficult to understand the goals or targets of different packages. I am installing it with biocLite while I write this, so I'll take a look at dmrFinder and tell you.

> Also, if you're doing linear tests, be careful with normalization,

I thought (too naively, I guess) that, when given the beta values, everything was already normalized, i.e., that I was safe unless I worked with raw data.

> mask your SNPs and chrX probes,

So I am already doing something right :) At least, the chrX part. How could I mask the SNPs?

> and maybe use M-values (logit(beta)) for the task.

Yes, that's a point I have been reading a lot about lately. As far as I have understood, M-values have better statistical properties for spotting DMRs, don't they?

> The latter is more important for epidemiological datasets than something like cancer, where every single interesting result from M-value testing has been reproduced using untransformed beta values when I ran comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs. Illumina hm450 differences for IDH1/2 mutants, the complete absence of any differences for TET2 mutants regardless of platform, etc.)

Well, I have to admit that I do not completely understand what you have written above. ;) Don't worry, it's not your problem. I'm sure it's mine. I am sometimes quite overwhelmed by the huge amount of information in this field.

> Mark Robinson just chimed in, I see. Probably a good idea to read his reply carefully as well.

I have. Both your answer and his have been very helpful, constructive, and kind. Thank you very much.

Regards,
Gustavo

ADD REPLY • link 12.8 years ago Gustavo Fernández Bayón ▴ 440


On Tue, Jun 19, 2012 at 8:16 AM, Gustavo Fernández Bayón wrote:

> Hi Tim.
>
> Thank you for your answer. I'll try to "defend" myself the best I can below. ;)

Never a need to defend oneself for asking interesting questions; if anything I'd say it's the other way round.

> I am not sure if I understand what you are trying to tell me here. ;) Sorry. I know lumi, although I thought it covered only the necessary stages until normalization of data.

A few days ago, Pan Du checked in some code to add smoothing to lumi, presumably for segmentation. I am of the opinion that there are multiple scales of differences going on in most DNA methylation surveys (e.g. chromatin-level vs. transcription-factor-level), but certainly it is likely that large-scale regional differences will be better detected if the "noise" is dampened by smoothing. This kind of gets back to the "bump hunting" paper and various other elaborations on mixtures and HMMs.

> > The function "dmrFinder" in the "charm" package is specifically meant for this sort of thing.
>
> I had looked at the charm Vignette in the past few days, but thought it was designed for technology different from ours. For me, sometimes it is difficult to just "understand" the goals or targets of different packages. I am currently biocLite'ing it while I am writing this, so I'll take a look to dmrFinder and tell you.

It is, but that doesn't mean it can't be adapted.

> > Also, if you're doing linear tests, be careful with normalization,
>
> I thought (too naively, I guess) that, when given the beta values, everything was normalized. I.e., that I was safe unless I worked with raw data.

It is always a good idea to investigate one's data for spurious differences that associate with technical variables like batch.

> > mask your SNPs and chrX probes,
>
> I am currently doing something well :) At least, the chrX part. How could I mask the SNP's?

I recently uploaded a FeatureDb package with Infinium features (27k and 450k) and another with UCSC's snp135common track, which makes finding common SNPs that intersect interrogated loci rather trivial. Or, you can use the SNPlocs packages if you don't care about the minor allele frequency (see e.g. http://www.pnas.org/content/early/2012/06/05/1120658109).

> > and maybe use M-values (logit(beta)) for the task.
>
> Yes, that's a point I was reading a lot lately. As far as I think I have understood, M-values have better statistical properties for spotting DMR's, haven't they?

Hard to say, it would require context. They are certainly more homoskedastic than beta values, however.

> > The latter is more important for epidemiological datasets than something like cancer, where every single interesting result from M-value testing has been reproduced using untransformed beta values when I ran comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs. Illumina hm450 differences for IDH1/2 mutants, the complete absence of any differences for TET2 mutants regardless of platform, etc.)
>
> Well. I have to assume that I do not understand completely what you have written above. ;) Don't worry, it's not your problem. I'm sure it's mine. I am sometimes quite overwhelmed by the huge amount of information in this field.

IDH1/2 mutations are known to induce massive changes in genome-wide methylation, most likely as a side effect of inhibiting histone demethylases that would "release" repressive marks in the process of differentiation. Another salient effect of the common IDH mutations is that they inhibit the TET family of Fe2+ dioxygenases, which can convert 5-methylcytosine to 5-hydroxymethylcytosine.

When the connections between these observations came to light, there was a bit of a kerfuffle about whether various platforms could detect common and distinct effects of the above recurrent mutations, and at least in a comparison as of last December:

1) all of the array platforms generated the same results for IDH1/2, modulo some (extremely) marginally significant loci
2) all of the data publicly available as of that point in time showed no significant difference in methylation for TET2 mut vs. wt.

The above observations held true regardless of the scale (beta, mvalue, probit(beta)) on which the tests were conducted. If the effects are strong enough, the scale on which they are being measured sometimes has little impact on the results. For more subtle changes, m-values or similar may be more appropriate, but then normalization also becomes more important.

Best of luck, and apologies for any excess information,

--t
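
For readers who want to see the scale change concretely, here is a minimal sketch of the beta-to-M-value transform discussed above. It assumes the common convention M = log2(beta / (1 - beta)); the function names and the clamping constant `eps` are illustrative choices and are not taken from lumi, minfi, or any other package.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value (log2 logit)."""
    # Clamp away from 0 and 1 so the logit stays finite.
    # eps is an arbitrary, hypothetical choice for this sketch.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Invert the transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Made-up beta values spanning the range, to show how the ends get stretched.
example_betas = [0.02, 0.10, 0.50, 0.90, 0.98]
example_ms = [beta_to_m(b) for b in example_betas]

for b, m in zip(example_betas, example_ms):
    print(f"beta={b:.2f}  M={m:+.3f}")
```

The transform stretches values near 0 and 1, which is one way to see why M-values tend to be more homoskedastic than beta values, while large differences remain visible on either scale.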

ADD REPLY • link 12.8 years ago Tim Triche ★ 4.2k


Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.5 years ago

Hi Gustavo,

I've inserted a few "reactions" below.

On 19.06.2012, at 12:57, Gustavo Fernández Bayón wrote:

> Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group (of nB probes) that belongs to B, and the intersection is empty. Say that we want to find a way to show there is a statistically significant difference between the methylation values of both regions.
>
> As far as I have seen in the literature, comparisons (statistical tests) are always done comparing the same probe values between case and control groups of individuals or samples. For example, when we are trying to find differentiated probes.

You can do differential analyses at the probe level or a regional level. An example of the latter (perhaps less popular or less established or less known) is:

http://ije.oxfordjournals.org/content/41/1/200.abstract

> However, if I think of directly comparing all the beta values from region A (nA * n values) against the ones in region B (nB * n values) with a, say, t test, I get the suspicion that something is not being done the way it should. My knowledge of Biology and Statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong in this approximation. Am I right?

First of all, I feel this is an unusual comparison to make. Presumably, region A and region B are different regions of the genome - what does it mean if methylation levels in region A and B are different? Maybe you could expand on the biological question here?

Second, if this is the comparison you really want to make, what role do your n samples play here? Do you have cases and controls? It may be sensible to fit a model to allow you to decompose effects of case/control from those of interest (A/B). But again, this needs to be geared to your biological question, which I don't yet understand.

Best,
Mark

----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://tiny.cc/mrobin

ADD COMMENT • link 12.8 years ago Mark Robinson ▴ 880


Hi Mark.

First of all, thank you for your kind answer. I am answering you below (or at least trying to). ;)

On Tuesday, June 19, 2012 at 16:17, Mark Robinson wrote:

> […]
> You can do differential analyses at the probe level or a regional level. An example of the latter (perhaps less popular or less established or less known) is:
> http://ije.oxfordjournals.org/content/41/1/200.abstract

I have just given it a very quick read, and it seems very interesting. I am going to read it more carefully and see if it can help me better understand where I stand. If I have got the idea right, the authors do some kind of regression or model fitting of the methylation values against the (maybe relative) position of the probes, in order to detect contiguous regions where differential methylation exists. Am I right?

> […]
> First of all, I feel this is an unusual comparison to make. Presumably, region A and region B are different regions of the genome - what does it mean if methylation levels in region A and B are different? Maybe you could expand on the biological question here?

Yes, of course. A colleague wants to show that a given region is differentially methylated between two sets of individuals. She has 6 case and 5 control individuals, along with their methylation beta values for a small set of probes (around 27, subdivided among 4 regions). Visually, she is able to see that there is a difference in methylation between the control and case groups and, what is more, that the differentiation occurs 99% of the time in a given region.

She asked me for a statistical test, so she could have a p-value showing not only that the two groups are differentially methylated, but also that the difference occurs in exactly one region. Kind of a "how can I show that this region is different and the others aren't?"

> Second, if this is the comparison you really want to make, what role do your n samples play here? Do you have cases and controls? It may be sensible to fit a model to allow you to decompose effects of case/control from those of interest (A/B). But again, this needs to be geared to your biological question, which I don't yet understand.

I don't know if the explanation above helps. Feel free to ask me anything you need. The biggest problem, I know, is that sometimes I do not know how to put all of this into words. Well, I hope that will improve with time (I have only been in bioinformatics for two months).

> Best,
> Mark

Regards,
Gustavo
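
One way to make the design described above concrete (6 cases, 5 controls, a handful of probes per region) is to summarize each region per sample and then compare cases with controls using an exact permutation test, so that samples rather than individual probe values are the unit of analysis. This is only a sketch under stated assumptions and not a method proposed by anyone in this thread: the probe IDs and beta values are made up, and `region_means` and `exact_permutation_p` are hypothetical helper names.

```python
from itertools import combinations
from statistics import mean


def region_means(beta, probes):
    """Average the probes of one region within each sample.

    beta: dict mapping probe ID -> list of beta values, one per sample.
    Returns one summary value per sample.
    """
    n_samples = len(next(iter(beta.values())))
    return [mean(beta[p][j] for p in probes) for j in range(n_samples)]


def exact_permutation_p(values, case_idx):
    """Two-sided exact permutation p-value for a difference in group means."""
    n = len(values)

    def abs_diff(idx):
        chosen = set(idx)
        cases = [values[j] for j in chosen]
        controls = [values[j] for j in range(n) if j not in chosen]
        return abs(mean(cases) - mean(controls))

    observed = abs_diff(case_idx)
    # Enumerate every way of labelling len(case_idx) of the n samples as cases.
    splits = list(combinations(range(n), len(case_idx)))
    # Small tolerance guards against floating-point ties.
    hits = sum(1 for s in splits if abs_diff(s) >= observed - 1e-12)
    return hits / len(splits)


# Made-up data: samples 0-5 are cases, samples 6-10 are controls.
beta = {
    "probeA1": [0.80, 0.82, 0.78, 0.85, 0.79, 0.81, 0.30, 0.28, 0.33, 0.31, 0.29],
    "probeA2": [0.75, 0.77, 0.74, 0.80, 0.76, 0.78, 0.35, 0.30, 0.32, 0.34, 0.31],
    "probeB1": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.50, 0.51, 0.49, 0.52, 0.48],
}
case_idx = [0, 1, 2, 3, 4, 5]

p_a = exact_permutation_p(region_means(beta, ["probeA1", "probeA2"]), case_idx)
p_b = exact_permutation_p(region_means(beta, ["probeB1"]), case_idx)
print(f"region A: p = {p_a:.4f}")
print(f"region B: p = {p_b:.4f}")
```

With 6 cases and 5 controls there are only 462 possible splits, so the smallest achievable p-value is 1/462, and testing several regions would still call for a multiple-testing correction. Showing that one region changes while the others do not is a different question (a group-by-region interaction), which this sketch does not address.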

ADD REPLY • link 12.8 years ago Gustavo Fernández Bayón ▴ 440
