MEDIPS: how does MEDIPS define a methylated region / cluster?

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 10.6 years ago

Hi, I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam --- because I filtered out everything with MAPQ20 or less) and I've put it through MEDIPS. At the moment, I am not making a comparison between samples. Therefore, the result file shows the methylation profile for only one sample. In this results file, I noticed that the data is organized as such: chr start stop CF Ca2_MAPQ20.bam.counts Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean MSets1.prob.mean 1) I tried reading the Down et al. (2008) paper that explains the concept of coupling factor and I think it is, simply put, a measure of local CpG density. I am not sure if my understanding of 'coupling factor' is correct? 2) This lead to question how or what MEDIPS defines as a "region or cluster"? That is to say, on Chromosome 1, I have reads aligning from position "1002501" to "1003300" but then there is a region of 1400 bp (from "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, again, from "1004701" to "1005400", there reads aligning to this region. So my question is does MEDIPS consider this to be a case of 2 methylated regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS try to consolidate these 2 methylated regions into one methylated region/cluster (that is, from "1002501-1005400") since they are fairly close to one another. 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, it may be as many rpkm=3. This lead me to wonder how the relative methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins have a value of 1 rpkm each, but one has an rms value of "1660.409928", another an rms of "7122.10254" and the third is only "679.1814523". How is that possible that 3 bins that have the same rpkm can have such varying rms values? 4)Can the rms be added up for a region so as to represent the cumulative methylation level for that region. I am asking because I do not know if the rms value is a log value or what? So, in my above question (3), if the 3 bins are adjacent to one another and I decide to cluster them together and consider them as one methylated region, then would the rms value for this new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 = 9461.68. 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is some sort of a probability score but I am not sure of what. Thanks in advance for any assistance in answering my questions. Best regards, Allen -- output of sessionInfo(): error -- Sent via the guest posting facility at bioconductor.org.

Sequencing MEDIPS • 2.9k views

ADD COMMENT • link updated 10.6 years ago by mengru1223 • 0 • written 10.6 years ago by Guest User ★ 13k

0

Entering edit mode

Lukas Chavez ▴ 570

@lukas-chavez-5781

Last seen 7.2 years ago

USA/La Jolla/UCSD

Dear Allen, please see my comments below. Moreover, please consider reading section 7.7 of the MEDIPS manual. Best, Lukas On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org=""> wrote: > Hi, > I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam > --- because I filtered out everything with MAPQ20 or less) and I've put it > through MEDIPS. At the moment, I am not making a comparison between > samples. Therefore, the result file shows the methylation profile for only > one sample. In this results file, I noticed that the data is organized as > such: > chr start stop CF Ca2_MAPQ20.bam.counts > Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob > MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean > MSets1.prob.mean > > 1) I tried reading the Down et al. (2008) paper that explains the concept > of coupling factor and I think it is, simply put, a measure of local CpG > density. I am not sure if my understanding of 'coupling factor' is correct? > That is correct. > 2) This lead to question how or what MEDIPS defines as a "region or > cluster"? That is to say, on Chromosome 1, I have reads aligning from > position "1002501" to "1003300" but then there is a region of 1400 bp (from > "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, > again, from "1004701" to "1005400", there reads aligning to this region. So > my question is does MEDIPS consider this to be a case of 2 methylated > regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS > try to consolidate these 2 methylated regions into one methylated > region/cluster (that is, from "1002501-1005400") since they are fairly > close to one another. > Each window is considered as one region; CpG density, counts, rpkm, rms, and differential coverage will be calculated for each window separately. When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions. > 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each > 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, > it may be as many rpkm=3. This lead me to wonder how the relative > methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins > have a value of 1 rpkm each, but one has an rms value of "1660.409928", > another an rms of "7122.10254" and the third is only "679.1814523". How is > that possible that 3 bins that have the same rpkm can have such varying rms > values? > This is due to the different CpG densities in these windows. > 4)Can the rms be added up for a region so as to represent the cumulative > methylation level for that region. I am asking because I do not know if the > rms value is a log value or what? So, in my above question (3), if the 3 > bins are adjacent to one another and I decide to cluster them together and > consider them as one methylated region, then would the rms value for this > new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 > = 9461.68. > Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest. > 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is > some sort of a probability score but I am not sure of what. > As mentioned in the manual, please disregard the prob values which will be removed in a future version. > > Thanks in advance for any assistance in answering my questions. > > Best regards, > Allen > > -- output of sessionInfo(): > > error > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.6 years ago Lukas Chavez ▴ 570

0

Entering edit mode

Dear Lukas, Again, I am so grateful for your quick reply. The only manual for MEDIPS that I know of is this one: http://www.bioc onductor.org/packages/release/bioc/manuals/MEDIPS/man/MEDIPS.pdf and I have looked and just can't find the section 7.7 that you mentioned. The reason I asked about how MEDIPS defines a cluster is because I wanted to try and reduce the number of methylated regions that I look at and then use these "consolidated" regions to do a principal component analysis (PCA). I have 12 cell types and I wanted to see if I could either do a PCA or Unsupervised Hierarchical Clustering to separate these 12 cell types into subgroups, based on their methylation profile. You said: "When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions." But at this point, I don't have any groupings. Normally, one would have a control and test group and so you can do what you suggested. But I have only, for example, a test group and I want to know if this test group can be further subdivided into a few subgroups based on their methylation profile. I would think that it would be computationally intensive if I used original 100bp windows generated by MEDIPS and entered all the rms values for each window of each cell type into the PCA. You said: "Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest." I didn't want my PCA analysis to be biased by a predefined "region of interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just input the raw methylation scores (rms) to see what I could get. So that is why I asked if the rms can be added up (in question 4 of my last email)...also I was curious if it could be done. Once again., thanks so very much for taking time from your busy schedule to answer my questions. Best regards, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com] Sent: Saturday, September 06, 2014 11:09 AM To: Allen [guest] Cc: bioconductor at r-project.org; Chong Kim San Allen; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, please see my comments below. Moreover, please consider reading section 7.7 of the MEDIPS manual. Best, Lukas On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org<mailto:guest="" at="" bioconductor.org="">> wrote: Hi, I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam --- because I filtered out everything with MAPQ20 or less) and I've put it through MEDIPS. At the moment, I am not making a comparison between samples. Therefore, the result file shows the methylation profile for only one sample. In this results file, I noticed that the data is organized as such: chr start stop CF Ca2_MAPQ20.bam.counts Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean MSets1.prob.mean 1) I tried reading the Down et al. (2008) paper that explains the concept of coupling factor and I think it is, simply put, a measure of local CpG density. I am not sure if my understanding of 'coupling factor' is correct? That is correct. 2) This lead to question how or what MEDIPS defines as a "region or cluster"? That is to say, on Chromosome 1, I have reads aligning from position "1002501" to "1003300" but then there is a region of 1400 bp (from "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, again, from "1004701" to "1005400", there reads aligning to this region. So my question is does MEDIPS consider this to be a case of 2 methylated regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS try to consolidate these 2 methylated regions into one methylated region/cluster (that is, from "1002501-1005400") since they are fairly close to one another. Each window is considered as one region; CpG density, counts, rpkm, rms, and differential coverage will be calculated for each window separately. When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions. 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, it may be as many rpkm=3. This lead me to wonder how the relative methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins have a value of 1 rpkm each, but one has an rms value of "1660.409928", another an rms of "7122.10254" and the third is only "679.1814523". How is that possible that 3 bins that have the same rpkm can have such varying rms values? This is due to the different CpG densities in these windows. 4)Can the rms be added up for a region so as to represent the cumulative methylation level for that region. I am asking because I do not know if the rms value is a log value or what? So, in my above question (3), if the 3 bins are adjacent to one another and I decide to cluster them together and consider them as one methylated region, then would the rms value for this new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 = 9461.68. Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest. 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is some sort of a probability score but I am not sure of what. As mentioned in the manual, please disregard the prob values which will be removed in a future version. Thanks in advance for any assistance in answering my questions. Best regards, Allen -- output of sessionInfo(): error -- Sent via the guest posting facility at bioconductor.org<http: bioconductor.org="">. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you.

ADD REPLY • link 10.6 years ago Chong Kim San Allen ▴ 70

0

Entering edit mode

Dear Allen, I was referring to section 7.7 of the MEDIPS Tutorial (or vignette) available at: http://www.bioconductor.org/packages/release/bioc/vignettes/MEDIPS/ins t/doc/MEDIPS.pdf I recommend to apply the a PCA to the counts at the genomic windows returned by the MEDIPS.meth() function where you probably want to remove windows with zero or low counts first. Here, the assumption is that CpG density correction of MeDIP signals might not be necessary as you are comparing different samples with each other (please see section 7.7 of the MEDIPS tutorial). In case you want to apply a PCA on CpG density methylation estimates anyway, you can either use the rms values or calculate methylation estimates using BayMeth, a recent tool that appears to be superior for such methylation estimates. All the best, Lukas On Sun, Sep 7, 2014 at 11:10 AM, Chong Kim San Allen <patcksa at="" nus.edu.sg=""> wrote: > Dear Lukas, > Again, I am so grateful for your quick reply. > > The only manual for MEDIPS that I know of is this one: > http://www.bioconductor.org/packages/release/bioc/manuals/MEDIPS/man /MEDIPS.pdf > and I have looked and just can't find the section 7.7 that you mentioned. > > The reason I asked about how MEDIPS defines a cluster is because I wanted > to try and reduce the number of methylated regions that I look at and then > use these "consolidated" regions to do a principal component analysis > (PCA). I have 12 cell types and I wanted to see if I could either do a PCA > or Unsupervised Hierarchical Clustering to separate these 12 cell types > into subgroups, based on their methylation profile. > > You said: "When windows with significant differential enrichment between > groups have been identified, the MEDIPS.mergeFrames() function can be used > to merge adjacent significant windows into extended regions." > But at this point, I don't have any groupings. Normally, one would have a > control and test group and so you can do what you suggested. But I have > only, for example, a test group and I want to know if this test group can > be further subdivided into a few subgroups based on their methylation > profile. I would think that it would be computationally intensive if I used > original 100bp windows generated by MEDIPS and entered all the rms values > for each window of each cell type into the PCA. > > You said: "Instead of merging any values across windows, I recommend to > make use of the MEDIPS.createROIse() function which allows for analyzing > specific regions of interest." > I didn't want my PCA analysis to be biased by a predefined "region of > interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just > input the raw methylation scores (rms) to see what I could get. So that is > why I asked if the rms can be added up (in question 4 of my last > email)...also I was curious if it could be done. > > Once again., thanks so very much for taking time from your busy schedule > to answer my questions. > > Best regards, > Allen > > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings at googlemail.com] > Sent: Saturday, September 06, 2014 11:09 AM > To: Allen [guest] > Cc: bioconductor at r-project.org; Chong Kim San Allen; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > please see my comments below. Moreover, please consider reading section > 7.7 of the MEDIPS manual. > > Best, > Lukas > > > On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org=""> <mailto:guest at="" bioconductor.org="">> wrote: > Hi, > I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam > --- because I filtered out everything with MAPQ20 or less) and I've put it > through MEDIPS. At the moment, I am not making a comparison between > samples. Therefore, the result file shows the methylation profile for only > one sample. In this results file, I noticed that the data is organized as > such: > chr start stop CF Ca2_MAPQ20.bam.counts > Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob > MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean > MSets1.prob.mean > > 1) I tried reading the Down et al. (2008) paper that explains the concept > of coupling factor and I think it is, simply put, a measure of local CpG > density. I am not sure if my understanding of 'coupling factor' is correct? > > That is correct. > > 2) This lead to question how or what MEDIPS defines as a "region or > cluster"? That is to say, on Chromosome 1, I have reads aligning from > position "1002501" to "1003300" but then there is a region of 1400 bp (from > "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, > again, from "1004701" to "1005400", there reads aligning to this region. So > my question is does MEDIPS consider this to be a case of 2 methylated > regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS > try to consolidate these 2 methylated regions into one methylated > region/cluster (that is, from "1002501-1005400") since they are fairly > close to one another. > > Each window is considered as one region; CpG density, counts, rpkm, rms, > and differential coverage will be calculated for each window separately. > When windows with significant differential enrichment between groups have > been identified, the MEDIPS.mergeFrames() function can be used to merge > adjacent significant windows into extended regions. > > 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each > 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, > it may be as many rpkm=3. This lead me to wonder how the relative > methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins > have a value of 1 rpkm each, but one has an rms value of "1660.409928", > another an rms of "7122.10254" and the third is only "679.1814523". How is > that possible that 3 bins that have the same rpkm can have such varying rms > values? > > This is due to the different CpG densities in these windows. > > 4)Can the rms be added up for a region so as to represent the cumulative > methylation level for that region. I am asking because I do not know if the > rms value is a log value or what? So, in my above question (3), if the 3 > bins are adjacent to one another and I decide to cluster them together and > consider them as one methylated region, then would the rms value for this > new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 > = 9461.68. > > Instead of merging any values across windows, I recommend to make use of > the MEDIPS.createROIse() function which allows for analyzing specific > regions of interest. > > 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is > some sort of a probability score but I am not sure of what. > > As mentioned in the manual, please disregard the prob values which will be > removed in a future version. > > Thanks in advance for any assistance in answering my questions. > > Best regards, > Allen > > -- output of sessionInfo(): > > error > > -- > Sent via the guest posting facility at bioconductor.org< > http://bioconductor.org>. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Lukas Chavez ▴ 570

0

Entering edit mode

Dear Lukas, Firstly, thank you so very much for your kind replies. You have been so patient and I am very, very grateful for your help. I have been away and so I'm trying now to gather my thoughts on this again. 1) I did read section 7.7 of the MEDIPS manual. My understanding (correct me if I am wrong) is that MEDIPS identifies differentially methylated regions (DMRs) by a sample-against-sample method rather than a group-against-group method. So, if I have 5 controls (A,B,C,D,E) and 6 (1,2,3,4,5,6) test samples, then EACH individual control sample is compared against EACH test sample. So, A vs. 1, A vs. 2, A vs. 3, A vs. 4, A vs. 5, A vs. 6, B vs. 1, B vs. 2, ......etc. instead of comparing ALL of the control samples against ALL of the test samples. Is my understanding correct? You are therefore saying that because of this all-against-all method that MEDIPS uses, correcting for CpG density is actually not necessary. But that's for identifying differentially methylated regions in MEDIPS. I don't quite understand how this relates to my Principal Component Analysis since I believe that I need to export my data out from MEDIPS and maybe use a different R package to do the PCA. So I am guessing that I need to export the methylation profile files for each sample and then input them into the PCA analysis software. 2) I have decided to try and do a PCA/hierarchical clustering using only the methylation levels of a specific genomic feature (eg. all gene promoters) to make things simple. I have decided to take your suggestion. I believe I can use the MEDIPS.createROIset() to calculate the methylation levels at all gene promoters. I can then use the methylation profile for the ROI (in this case, promoter) from each sample to do a a hierarchical clustering or PCA. 3) You said: " In case you want to apply a PCA on CpG density methylation estimates anyway, you can either use the rms values ......."...however, I still don't know if RMS is a linear or logarithmic value. If I have 4 100-bp windows where the RMS is 3, 5, 8, 9, can I say that the RMS for the this 400-bp region is 25? If RMS is a log value then obviously I cannot just add them up. 4) Also, I was looking at the methylation profile file and I noticed that some RMS values are given as "Inf" or "NA". I am confused as to how to use or interpret this. I also don't understand why it is "NA"...why not just put a 0 (zero) instead of an "NA"? 5) Thanks for the suggestion to use the MEDIPS.meth() function to remove windows with low or zero counts. That's really helpful. I hadn't even considered that. Have a great weekend. Warmest regards, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com] Sent: Tuesday, September 09, 2014 1:09 AM To: Chong Kim San Allen Cc: Allen [guest]; bioconductor at r-project.org; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, I was referring to section 7.7 of the MEDIPS Tutorial (or vignette) available at: http://www.bioconductor.org/packages/release/bioc/vignettes/MEDIPS/ins t/doc/MEDIPS.pdf I recommend to apply the a PCA to the counts at the genomic windows returned by the MEDIPS.meth() function where you probably want to remove windows with zero or low counts first. Here, the assumption is that CpG density correction of MeDIP signals might not be necessary as you are comparing different samples with each other (please see section 7.7 of the MEDIPS tutorial). In case you want to apply a PCA on CpG density methylation estimates anyway, you can either use the rms values or calculate methylation estimates using BayMeth, a recent tool that appears to be superior for such methylation estimates. All the best, Lukas On Sun, Sep 7, 2014 at 11:10 AM, Chong Kim San Allen <patcksa at="" nus.edu.sg<mailto:patcksa="" at="" nus.edu.sg="">> wrote: Dear Lukas, Again, I am so grateful for your quick reply. The only manual for MEDIPS that I know of is this one: http://www.bioc onductor.org/packages/release/bioc/manuals/MEDIPS/man/MEDIPS.pdf and I have looked and just can't find the section 7.7 that you mentioned. The reason I asked about how MEDIPS defines a cluster is because I wanted to try and reduce the number of methylated regions that I look at and then use these "consolidated" regions to do a principal component analysis (PCA). I have 12 cell types and I wanted to see if I could either do a PCA or Unsupervised Hierarchical Clustering to separate these 12 cell types into subgroups, based on their methylation profile. You said: "When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions." But at this point, I don't have any groupings. Normally, one would have a control and test group and so you can do what you suggested. But I have only, for example, a test group and I want to know if this test group can be further subdivided into a few subgroups based on their methylation profile. I would think that it would be computationally intensive if I used original 100bp windows generated by MEDIPS and entered all the rms values for each window of each cell type into the PCA. You said: "Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest." I didn't want my PCA analysis to be biased by a predefined "region of interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just input the raw methylation scores (rms) to see what I could get. So that is why I asked if the rms can be added up (in question 4 of my last email)...also I was curious if it could be done. Once again., thanks so very much for taking time from your busy schedule to answer my questions. Best regards, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com<mailto:lukas. chavez.mailings@googlemail.com="">] Sent: Saturday, September 06, 2014 11:09 AM To: Allen [guest] Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">; Chong Kim San Allen; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, please see my comments below. Moreover, please consider reading section 7.7 of the MEDIPS manual. Best, Lukas On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org<mailto:guest="" at="" bioconductor.org=""><mailto:guest at="" bioconductor.org<mailto:guest="" at="" bioconductor.org="">>> wrote: Hi, I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam --- because I filtered out everything with MAPQ20 or less) and I've put it through MEDIPS. At the moment, I am not making a comparison between samples. Therefore, the result file shows the methylation profile for only one sample. In this results file, I noticed that the data is organized as such: chr start stop CF Ca2_MAPQ20.bam.counts Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean MSets1.prob.mean 1) I tried reading the Down et al. (2008) paper that explains the concept of coupling factor and I think it is, simply put, a measure of local CpG density. I am not sure if my understanding of 'coupling factor' is correct? That is correct. 2) This lead to question how or what MEDIPS defines as a "region or cluster"? That is to say, on Chromosome 1, I have reads aligning from position "1002501" to "1003300" but then there is a region of 1400 bp (from "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, again, from "1004701" to "1005400", there reads aligning to this region. So my question is does MEDIPS consider this to be a case of 2 methylated regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS try to consolidate these 2 methylated regions into one methylated region/cluster (that is, from "1002501-1005400") since they are fairly close to one another. Each window is considered as one region; CpG density, counts, rpkm, rms, and differential coverage will be calculated for each window separately. When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions. 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, it may be as many rpkm=3. This lead me to wonder how the relative methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins have a value of 1 rpkm each, but one has an rms value of "1660.409928", another an rms of "7122.10254" and the third is only "679.1814523". How is that possible that 3 bins that have the same rpkm can have such varying rms values? This is due to the different CpG densities in these windows. 4)Can the rms be added up for a region so as to represent the cumulative methylation level for that region. I am asking because I do not know if the rms value is a log value or what? So, in my above question (3), if the 3 bins are adjacent to one another and I decide to cluster them together and consider them as one methylated region, then would the rms value for this new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 = 9461.68. Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest. 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is some sort of a probability score but I am not sure of what. As mentioned in the manual, please disregard the prob values which will be removed in a future version. Thanks in advance for any assistance in answering my questions. Best regards, Allen -- output of sessionInfo(): error -- Sent via the guest posting facility at bioconductor.org<http: bioconductor.org=""><http: bioconductor.org="">. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""><mailto:bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-project.org="">> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you. Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you.

ADD REPLY • link 10.6 years ago Chong Kim San Allen ▴ 70

0

Entering edit mode

One more thing which I have just come to realize. The rms for some windows are given as 'NA'. What does that mean? Why not just give a value of 0 (zero)? While on the other hand, the rms for some windows are given as 'Inf', which I am assuming is short of "Infinity"? I noticed this tends to be windows with a low coupling factor. So, I am wondering (1) is there an upper limit for the rms value, above which it is reported as 'Inf'? and (2) do I enter the mathematical value of infinity into the table and then perform a PCA? ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com] Sent: Saturday, September 06, 2014 11:09 AM To: Allen [guest] Cc: bioconductor at r-project.org; Chong Kim San Allen; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, please see my comments below. Moreover, please consider reading section 7.7 of the MEDIPS manual. Best, Lukas On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org<mailto:guest="" at="" bioconductor.org="">> wrote: Hi, I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam --- because I filtered out everything with MAPQ20 or less) and I've put it through MEDIPS. At the moment, I am not making a comparison between samples. Therefore, the result file shows the methylation profile for only one sample. In this results file, I noticed that the data is organized as such: chr start stop CF Ca2_MAPQ20.bam.counts Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean MSets1.prob.mean 1) I tried reading the Down et al. (2008) paper that explains the concept of coupling factor and I think it is, simply put, a measure of local CpG density. I am not sure if my understanding of 'coupling factor' is correct? That is correct. 2) This lead to question how or what MEDIPS defines as a "region or cluster"? That is to say, on Chromosome 1, I have reads aligning from position "1002501" to "1003300" but then there is a region of 1400 bp (from "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, again, from "1004701" to "1005400", there reads aligning to this region. So my question is does MEDIPS consider this to be a case of 2 methylated regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS try to consolidate these 2 methylated regions into one methylated region/cluster (that is, from "1002501-1005400") since they are fairly close to one another. Each window is considered as one region; CpG density, counts, rpkm, rms, and differential coverage will be calculated for each window separately. When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions. 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, it may be as many rpkm=3. This lead me to wonder how the relative methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins have a value of 1 rpkm each, but one has an rms value of "1660.409928", another an rms of "7122.10254" and the third is only "679.1814523". How is that possible that 3 bins that have the same rpkm can have such varying rms values? This is due to the different CpG densities in these windows. 4)Can the rms be added up for a region so as to represent the cumulative methylation level for that region. I am asking because I do not know if the rms value is a log value or what? So, in my above question (3), if the 3 bins are adjacent to one another and I decide to cluster them together and consider them as one methylated region, then would the rms value for this new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 = 9461.68. Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest. 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is some sort of a probability score but I am not sure of what. As mentioned in the manual, please disregard the prob values which will be removed in a future version. Thanks in advance for any assistance in answering my questions. Best regards, Allen -- output of sessionInfo(): error -- Sent via the guest posting facility at bioconductor.org<http: bioconductor.org="">. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you.

ADD REPLY • link 10.6 years ago Chong Kim San Allen ▴ 70

0

Entering edit mode

Lukas Chavez ▴ 570

@lukas-chavez-5781

Last seen 7.2 years ago

USA/La Jolla/UCSD

Dear Allen, please see my comments below. Best, Lukas On Fri, Sep 12, 2014 at 3:00 AM, Chong Kim San Allen <patcksa at="" nus.edu.sg=""> wrote: > Dear Lukas, > Firstly, thank you so very much for your kind replies. You have been so > patient and I am very, very grateful for your help. > > I have been away and so I'm trying now to gather my thoughts on this again. > > 1) I did read section 7.7 of the MEDIPS manual. My understanding (correct > me if I am wrong) is that MEDIPS identifies differentially methylated > regions (DMRs) by a sample-against-sample method rather than a > group-against-group method. So, if I have 5 controls (A,B,C,D,E) and 6 > (1,2,3,4,5,6) test samples, then EACH individual control sample is compared > against EACH test sample. So, A vs. 1, A vs. 2, A vs. 3, A vs. 4, A vs. 5, > A vs. 6, B vs. 1, B vs. 2, ......etc. instead of comparing ALL of the > control samples against ALL of the test samples. Is my understanding > correct? > No that is not correct. MEDIPS identifies differentially enriched regions by comparing two groups of samples. > > You are therefore saying that because of this all-against-all method that > MEDIPS uses, correcting for CpG density is actually not necessary. But > that's for identifying differentially methylated regions in MEDIPS. I don't > quite understand how this relates to my Principal Component Analysis since > I believe that I need to export my data out from MEDIPS and maybe use a > different R package to do the PCA. So I am guessing that I need to export > the methylation profile files for each sample and then input them into the > PCA analysis software. > The motivation of CpG normalization is to quantify MeDIP signals at genomic regions with varying CpG density across a genome. The differential enrichment analysis compares IP-seq derived signals at genomic windows across samples. Therefore, the signal for a specific window will not be affected by different CpG densities across samples. The same assumption applies when you are comparing different samples with each other employing PCA. Due to a lack of experience I cannot recommend to use CpG normalized methylation estimates or for example variance stabilized counts as done by DESeq for RNA-seq data. > 2) I have decided to try and do a PCA/hierarchical clustering using only > the methylation levels of a specific genomic feature (eg. all gene > promoters) to make things simple. I have decided to take your suggestion. I > believe I can use the MEDIPS.createROIset() to calculate the methylation > levels at all gene promoters. I can then use the methylation profile for > the ROI (in this case, promoter) from each sample to do a a hierarchical > clustering or PCA. > > 3) You said: " In case you want to apply a PCA on CpG density methylation > estimates anyway, you can either use the rms values ......."...however, I > still don't know if RMS is a linear or logarithmic value. If I have 4 > 100-bp windows where the RMS is 3, 5, 8, 9, can I say that the RMS for the > this 400-bp region is 25? If RMS is a log value then obviously I cannot > just add them up. > The rms values are not log transformed. However, the signals in neighbouring windows are not independent. Extended reads spanning more than one window will be counted several times. Therefore, adding up counts or rms values is not valid. You can use MEDIPS.createROIset() instead. > > 4) Also, I was looking at the methylation profile file and I noticed that > some RMS values are given as "Inf" or "NA". I am confused as to how to use > or interpret this. I also don't understand why it is "NA"...why not just > put a 0 (zero) instead of an "NA"? > What are the CF and count values for windows where rms values are NA or Inf? Can you send examples? > > 5) Thanks for the suggestion to use the MEDIPS.meth() function to remove > windows with low or zero counts. That's really helpful. I hadn't even > considered that. > > Have a great weekend. > > Warmest regards, > Allen > > > > > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings at googlemail.com] > Sent: Tuesday, September 09, 2014 1:09 AM > To: Chong Kim San Allen > Cc: Allen [guest]; bioconductor at r-project.org; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > I was referring to section 7.7 of the MEDIPS Tutorial (or vignette) > available at: > > http://www.bioconductor.org/packages/release/bioc/vignettes/MEDIPS/i nst/doc/MEDIPS.pdf > > I recommend to apply the a PCA to the counts at the genomic windows > returned by the MEDIPS.meth() function where you probably want to remove > windows with zero or low counts first. Here, the assumption is that CpG > density correction of MeDIP signals might not be necessary as you are > comparing different samples with each other (please see section 7.7 of the > MEDIPS tutorial). In case you want to apply a PCA on CpG density > methylation estimates anyway, you can either use the rms values or > calculate methylation estimates using BayMeth, a recent tool that appears > to be superior for such methylation estimates. > > All the best, > Lukas > > On Sun, Sep 7, 2014 at 11:10 AM, Chong Kim San Allen <patcksa at="" nus.edu.sg=""> <mailto:patcksa at="" nus.edu.sg="">> wrote: > Dear Lukas, > Again, I am so grateful for your quick reply. > > The only manual for MEDIPS that I know of is this one: > http://www.bioconductor.org/packages/release/bioc/manuals/MEDIPS/man /MEDIPS.pdf > and I have looked and just can't find the section 7.7 that you mentioned. > > The reason I asked about how MEDIPS defines a cluster is because I wanted > to try and reduce the number of methylated regions that I look at and then > use these "consolidated" regions to do a principal component analysis > (PCA). I have 12 cell types and I wanted to see if I could either do a PCA > or Unsupervised Hierarchical Clustering to separate these 12 cell types > into subgroups, based on their methylation profile. > > You said: "When windows with significant differential enrichment between > groups have been identified, the MEDIPS.mergeFrames() function can be used > to merge adjacent significant windows into extended regions." > But at this point, I don't have any groupings. Normally, one would have a > control and test group and so you can do what you suggested. But I have > only, for example, a test group and I want to know if this test group can > be further subdivided into a few subgroups based on their methylation > profile. I would think that it would be computationally intensive if I used > original 100bp windows generated by MEDIPS and entered all the rms values > for each window of each cell type into the PCA. > > You said: "Instead of merging any values across windows, I recommend to > make use of the MEDIPS.createROIse() function which allows for analyzing > specific regions of interest." > I didn't want my PCA analysis to be biased by a predefined "region of > interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just > input the raw methylation scores (rms) to see what I could get. So that is > why I asked if the rms can be added up (in question 4 of my last > email)...also I was curious if it could be done. > > Once again., thanks so very much for taking time from your busy schedule > to answer my questions. > > Best regards, > Allen > > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings at googlemail.com<mailto:> lukas.chavez.mailings at googlemail.com>] > Sent: Saturday, September 06, 2014 11:09 AM > To: Allen [guest] > Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">; Chong > Kim San Allen; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > please see my comments below. Moreover, please consider reading section > 7.7 of the MEDIPS manual. > > Best, > Lukas > > > On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest at="" bioconductor.org=""> <mailto:guest at="" bioconductor.org=""><mailto:guest at="" bioconductor.org<mailto:=""> guest at bioconductor.org>>> wrote: > Hi, > I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam > --- because I filtered out everything with MAPQ20 or less) and I've put it > through MEDIPS. At the moment, I am not making a comparison between > samples. Therefore, the result file shows the methylation profile for only > one sample. In this results file, I noticed that the data is organized as > such: > chr start stop CF Ca2_MAPQ20.bam.counts > Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob > MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean > MSets1.prob.mean > > 1) I tried reading the Down et al. (2008) paper that explains the concept > of coupling factor and I think it is, simply put, a measure of local CpG > density. I am not sure if my understanding of 'coupling factor' is correct? > > That is correct. > > 2) This lead to question how or what MEDIPS defines as a "region or > cluster"? That is to say, on Chromosome 1, I have reads aligning from > position "1002501" to "1003300" but then there is a region of 1400 bp (from > "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, > again, from "1004701" to "1005400", there reads aligning to this region. So > my question is does MEDIPS consider this to be a case of 2 methylated > regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS > try to consolidate these 2 methylated regions into one methylated > region/cluster (that is, from "1002501-1005400") since they are fairly > close to one another. > > Each window is considered as one region; CpG density, counts, rpkm, rms, > and differential coverage will be calculated for each window separately. > When windows with significant differential enrichment between groups have > been identified, the MEDIPS.mergeFrames() function can be used to merge > adjacent significant windows into extended regions. > > 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each > 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, > it may be as many rpkm=3. This lead me to wonder how the relative > methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins > have a value of 1 rpkm each, but one has an rms value of "1660.409928", > another an rms of "7122.10254" and the third is only "679.1814523". How is > that possible that 3 bins that have the same rpkm can have such varying rms > values? > > This is due to the different CpG densities in these windows. > > 4)Can the rms be added up for a region so as to represent the cumulative > methylation level for that region. I am asking because I do not know if the > rms value is a log value or what? So, in my above question (3), if the 3 > bins are adjacent to one another and I decide to cluster them together and > consider them as one methylated region, then would the rms value for this > new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 > = 9461.68. > > Instead of merging any values across windows, I recommend to make use of > the MEDIPS.createROIse() function which allows for analyzing specific > regions of interest. > > 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is > some sort of a probability score but I am not sure of what. > > As mentioned in the manual, please disregard the prob values which will be > removed in a future version. > > Thanks in advance for any assistance in answering my questions. > > Best regards, > Allen > > -- output of sessionInfo(): > > error > > -- > Sent via the guest posting facility at bioconductor.org< > http://bioconductor.org><http: bioconductor.org="">. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""><mailto:> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > [[alternative HTML version deleted]]

ADD COMMENT • link 10.6 years ago Lukas Chavez ▴ 570

0

Entering edit mode

Lukas Chavez ▴ 570

@lukas-chavez-5781

Last seen 7.2 years ago

USA/La Jolla/UCSD

Dear Allen, you example shows that the rms value will be NA when the count is zero. Surprisingly, there was one example where the rms value is NA and the count value was non-zero. I will have to investigate why this occurs. As outlined in section 7.7. transformation of of the rms values has been disabled since MEDIPS. >= 1.10.0. Lukas On Mon, Sep 15, 2014 at 3:20 AM, Chong Kim San Allen <patcksa@nus.edu.sg> wrote: > Dear Lukas, > I went back and tried to find those with Inf but I can't. I don't remember > where I put the files that contained those values. > > I did however find those that contained the NA values and I have attached > the file containing some of these. > > I was thinking whether I should use the "transf" parameter so as all > values are set to between 0 and 1000 for RMS. Perhaps this would not give > the NA or Inf values > > Thanks, > Allen > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings@googlemail.com] > Sent: Saturday, September 13, 2014 1:12 AM > To: Chong Kim San Allen > Cc: Allen [guest]; bioconductor@r-project.org; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > please see my comments below. > > Best, > Lukas > > On Fri, Sep 12, 2014 at 3:00 AM, Chong Kim San Allen <patcksa@nus.edu.sg> <mailto:patcksa@nus.edu.sg>> wrote: > Dear Lukas, > Firstly, thank you so very much for your kind replies. You have been so > patient and I am very, very grateful for your help. > > I have been away and so I'm trying now to gather my thoughts on this again. > > 1) I did read section 7.7 of the MEDIPS manual. My understanding (correct > me if I am wrong) is that MEDIPS identifies differentially methylated > regions (DMRs) by a sample-against-sample method rather than a > group-against-group method. So, if I have 5 controls (A,B,C,D,E) and 6 > (1,2,3,4,5,6) test samples, then EACH individual control sample is compared > against EACH test sample. So, A vs. 1, A vs. 2, A vs. 3, A vs. 4, A vs. 5, > A vs. 6, B vs. 1, B vs. 2, ......etc. instead of comparing ALL of the > control samples against ALL of the test samples. Is my understanding > correct? > > No that is not correct. MEDIPS identifies differentially enriched regions > by comparing two groups of samples. > > You are therefore saying that because of this all-against-all method that > MEDIPS uses, correcting for CpG density is actually not necessary. But > that's for identifying differentially methylated regions in MEDIPS. I don't > quite understand how this relates to my Principal Component Analysis since > I believe that I need to export my data out from MEDIPS and maybe use a > different R package to do the PCA. So I am guessing that I need to export > the methylation profile files for each sample and then input them into the > PCA analysis software. > > The motivation of CpG normalization is to quantify MeDIP signals at > genomic regions with varying CpG density across a genome. The differential > enrichment analysis compares IP-seq derived signals at genomic windows > across samples. Therefore, the signal for a specific window will not be > affected by different CpG densities across samples. The same assumption > applies when you are comparing different samples with each other employing > PCA. Due to a lack of experience I cannot recommend to use CpG normalized > methylation estimates or for example variance stabilized counts as done by > DESeq for RNA-seq data. > > > 2) I have decided to try and do a PCA/hierarchical clustering using only > the methylation levels of a specific genomic feature (eg. all gene > promoters) to make things simple. I have decided to take your suggestion. I > believe I can use the MEDIPS.createROIset() to calculate the methylation > levels at all gene promoters. I can then use the methylation profile for > the ROI (in this case, promoter) from each sample to do a a hierarchical > clustering or PCA. > > 3) You said: " In case you want to apply a PCA on CpG density methylation > estimates anyway, you can either use the rms values ......."...however, I > still don't know if RMS is a linear or logarithmic value. If I have 4 > 100-bp windows where the RMS is 3, 5, 8, 9, can I say that the RMS for the > this 400-bp region is 25? If RMS is a log value then obviously I cannot > just add them up. > The rms values are not log transformed. However, the signals in > neighbouring windows are not independent. Extended reads spanning more than > one window will be counted several times. Therefore, adding up counts or > rms values is not valid. You can use MEDIPS.createROIset() instead. > > 4) Also, I was looking at the methylation profile file and I noticed that > some RMS values are given as "Inf" or "NA". I am confused as to how to use > or interpret this. I also don't understand why it is "NA"...why not just > put a 0 (zero) instead of an "NA"? > What are the CF and count values for windows where rms values are NA or > Inf? Can you send examples? > > 5) Thanks for the suggestion to use the MEDIPS.meth() function to remove > windows with low or zero counts. That's really helpful. I hadn't even > considered that. > > Have a great weekend. > > Warmest regards, > Allen > > > > > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings@googlemail.com<mailto:> lukas.chavez.mailings@googlemail.com>] > Sent: Tuesday, September 09, 2014 1:09 AM > To: Chong Kim San Allen > Cc: Allen [guest]; bioconductor@r-project.org<mailto:> bioconductor@r-project.org>; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > I was referring to section 7.7 of the MEDIPS Tutorial (or vignette) > available at: > > http://www.bioconductor.org/packages/release/bioc/vignettes/MEDIPS/i nst/doc/MEDIPS.pdf > > I recommend to apply the a PCA to the counts at the genomic windows > returned by the MEDIPS.meth() function where you probably want to remove > windows with zero or low counts first. Here, the assumption is that CpG > density correction of MeDIP signals might not be necessary as you are > comparing different samples with each other (please see section 7.7 of the > MEDIPS tutorial). In case you want to apply a PCA on CpG density > methylation estimates anyway, you can either use the rms values or > calculate methylation estimates using BayMeth, a recent tool that appears > to be superior for such methylation estimates. > > All the best, > Lukas > > On Sun, Sep 7, 2014 at 11:10 AM, Chong Kim San Allen <patcksa@nus.edu.sg> <mailto:patcksa@nus.edu.sg><mailto:patcksa@nus.edu.sg<mailto:> patcksa@nus.edu.sg>>> wrote: > Dear Lukas, > Again, I am so grateful for your quick reply. > > The only manual for MEDIPS that I know of is this one: > http://www.bioconductor.org/packages/release/bioc/manuals/MEDIPS/man /MEDIPS.pdf > and I have looked and just can't find the section 7.7 that you mentioned. > > The reason I asked about how MEDIPS defines a cluster is because I wanted > to try and reduce the number of methylated regions that I look at and then > use these "consolidated" regions to do a principal component analysis > (PCA). I have 12 cell types and I wanted to see if I could either do a PCA > or Unsupervised Hierarchical Clustering to separate these 12 cell types > into subgroups, based on their methylation profile. > > You said: "When windows with significant differential enrichment between > groups have been identified, the MEDIPS.mergeFrames() function can be used > to merge adjacent significant windows into extended regions." > But at this point, I don't have any groupings. Normally, one would have a > control and test group and so you can do what you suggested. But I have > only, for example, a test group and I want to know if this test group can > be further subdivided into a few subgroups based on their methylation > profile. I would think that it would be computationally intensive if I used > original 100bp windows generated by MEDIPS and entered all the rms values > for each window of each cell type into the PCA. > > You said: "Instead of merging any values across windows, I recommend to > make use of the MEDIPS.createROIse() function which allows for analyzing > specific regions of interest." > I didn't want my PCA analysis to be biased by a predefined "region of > interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just > input the raw methylation scores (rms) to see what I could get. So that is > why I asked if the rms can be added up (in question 4 of my last > email)...also I was curious if it could be done. > > Once again., thanks so very much for taking time from your busy schedule > to answer my questions. > > Best regards, > Allen > > ________________________________________ > From: Lukas Chavez [lukas.chavez.mailings@googlemail.com<mailto:> lukas.chavez.mailings@googlemail.com><mailto:> lukas.chavez.mailings@googlemail.com<mailto:> lukas.chavez.mailings@googlemail.com>>] > Sent: Saturday, September 06, 2014 11:09 AM > To: Allen [guest] > Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:> bioconductor@r-project.org<mailto:bioconductor@r-project.org>>; Chong Kim > San Allen; MEDIPS Maintainer > Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / > cluster? > > Dear Allen, > > please see my comments below. Moreover, please consider reading section > 7.7 of the MEDIPS manual. > > Best, > Lukas > > > On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest@bioconductor.org> <mailto:guest@bioconductor.org><mailto:guest@bioconductor.org<mailto:> guest@bioconductor.org>><mailto:guest@bioconductor.org<mailto:> guest@bioconductor.org><mailto:guest@bioconductor.org<mailto:> guest@bioconductor.org>>>> wrote: > Hi, > I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam > --- because I filtered out everything with MAPQ20 or less) and I've put it > through MEDIPS. At the moment, I am not making a comparison between > samples. Therefore, the result file shows the methylation profile for only > one sample. In this results file, I noticed that the data is organized as > such: > chr start stop CF Ca2_MAPQ20.bam.counts > Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob > MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean > MSets1.prob.mean > > 1) I tried reading the Down et al. (2008) paper that explains the concept > of coupling factor and I think it is, simply put, a measure of local CpG > density. I am not sure if my understanding of 'coupling factor' is correct? > > That is correct. > > 2) This lead to question how or what MEDIPS defines as a "region or > cluster"? That is to say, on Chromosome 1, I have reads aligning from > position "1002501" to "1003300" but then there is a region of 1400 bp (from > "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, > again, from "1004701" to "1005400", there reads aligning to this region. So > my question is does MEDIPS consider this to be a case of 2 methylated > regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS > try to consolidate these 2 methylated regions into one methylated > region/cluster (that is, from "1002501-1005400") since they are fairly > close to one another. > > Each window is considered as one region; CpG density, counts, rpkm, rms, > and differential coverage will be calculated for each window separately. > When windows with significant differential enrichment between groups have > been identified, the MEDIPS.mergeFrames() function can be used to merge > adjacent significant windows into extended regions. > > 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each > 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, > it may be as many rpkm=3. This lead me to wonder how the relative > methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins > have a value of 1 rpkm each, but one has an rms value of "1660.409928", > another an rms of "7122.10254" and the third is only "679.1814523". How is > that possible that 3 bins that have the same rpkm can have such varying rms > values? > > This is due to the different CpG densities in these windows. > > 4)Can the rms be added up for a region so as to represent the cumulative > methylation level for that region. I am asking because I do not know if the > rms value is a log value or what? So, in my above question (3), if the 3 > bins are adjacent to one another and I decide to cluster them together and > consider them as one methylated region, then would the rms value for this > new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 > = 9461.68. > > Instead of merging any values across windows, I recommend to make use of > the MEDIPS.createROIse() function which allows for analyzing specific > regions of interest. > > 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is > some sort of a probability score but I am not sure of what. > > As mentioned in the manual, please disregard the prob values which will be > removed in a future version. > > Thanks in advance for any assistance in answering my questions. > > Best regards, > Allen > > -- output of sessionInfo(): > > error > > -- > Sent via the guest posting facility at bioconductor.org< > http://bioconductor.org><http: bioconductor.org=""><http: bioconducto="" r.org=""> >. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:> Bioconductor@r-project.org<mailto:bioconductor@r-project.org>><mailto:> Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:> Bioconductor@r-project.org<mailto:bioconductor@r-project.org>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > > > Important: This email is confidential and may be privileged. If you are > not the intended recipient, please delete it and notify us immediately; you > should not copy or use it for any purpose, nor disclose its contents to any > other person. Thank you. > [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.6 years ago Lukas Chavez ▴ 570

0

Entering edit mode

Chong Kim San Allen ▴ 70

@chong-kim-san-allen-6697

Last seen 10.6 years ago

Dear Lukas, I went back and tried to find those with Inf but I can't. I don't remember where I put the files that contained those values. I did however find those that contained the NA values and I have attached the file containing some of these. I was thinking whether I should use the "transf" parameter so as all values are set to between 0 and 1000 for RMS. Perhaps this would not give the NA or Inf values Thanks, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com] Sent: Saturday, September 13, 2014 1:12 AM To: Chong Kim San Allen Cc: Allen [guest]; bioconductor@r-project.org; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, please see my comments below. Best, Lukas On Fri, Sep 12, 2014 at 3:00 AM, Chong Kim San Allen <patcksa@nus.edu.sg<mailto:patcksa@nus.edu.sg>> wrote: Dear Lukas, Firstly, thank you so very much for your kind replies. You have been so patient and I am very, very grateful for your help. I have been away and so I'm trying now to gather my thoughts on this again. 1) I did read section 7.7 of the MEDIPS manual. My understanding (correct me if I am wrong) is that MEDIPS identifies differentially methylated regions (DMRs) by a sample-against-sample method rather than a group-against-group method. So, if I have 5 controls (A,B,C,D,E) and 6 (1,2,3,4,5,6) test samples, then EACH individual control sample is compared against EACH test sample. So, A vs. 1, A vs. 2, A vs. 3, A vs. 4, A vs. 5, A vs. 6, B vs. 1, B vs. 2, ......etc. instead of comparing ALL of the control samples against ALL of the test samples. Is my understanding correct? No that is not correct. MEDIPS identifies differentially enriched regions by comparing two groups of samples. You are therefore saying that because of this all-against-all method that MEDIPS uses, correcting for CpG density is actually not necessary. But that's for identifying differentially methylated regions in MEDIPS. I don't quite understand how this relates to my Principal Component Analysis since I believe that I need to export my data out from MEDIPS and maybe use a different R package to do the PCA. So I am guessing that I need to export the methylation profile files for each sample and then input them into the PCA analysis software. The motivation of CpG normalization is to quantify MeDIP signals at genomic regions with varying CpG density across a genome. The differential enrichment analysis compares IP-seq derived signals at genomic windows across samples. Therefore, the signal for a specific window will not be affected by different CpG densities across samples. The same assumption applies when you are comparing different samples with each other employing PCA. Due to a lack of experience I cannot recommend to use CpG normalized methylation estimates or for example variance stabilized counts as done by DESeq for RNA-seq data. 2) I have decided to try and do a PCA/hierarchical clustering using only the methylation levels of a specific genomic feature (eg. all gene promoters) to make things simple. I have decided to take your suggestion. I believe I can use the MEDIPS.createROIset() to calculate the methylation levels at all gene promoters. I can then use the methylation profile for the ROI (in this case, promoter) from each sample to do a a hierarchical clustering or PCA. 3) You said: " In case you want to apply a PCA on CpG density methylation estimates anyway, you can either use the rms values ......."...however, I still don't know if RMS is a linear or logarithmic value. If I have 4 100-bp windows where the RMS is 3, 5, 8, 9, can I say that the RMS for the this 400-bp region is 25? If RMS is a log value then obviously I cannot just add them up. The rms values are not log transformed. However, the signals in neighbouring windows are not independent. Extended reads spanning more than one window will be counted several times. Therefore, adding up counts or rms values is not valid. You can use MEDIPS.createROIset() instead. 4) Also, I was looking at the methylation profile file and I noticed that some RMS values are given as "Inf" or "NA". I am confused as to how to use or interpret this. I also don't understand why it is "NA"...why not just put a 0 (zero) instead of an "NA"? What are the CF and count values for windows where rms values are NA or Inf? Can you send examples? 5) Thanks for the suggestion to use the MEDIPS.meth() function to remove windows with low or zero counts. That's really helpful. I hadn't even considered that. Have a great weekend. Warmest regards, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com<mailto:lukas. chavez.mailings@googlemail.com="">] Sent: Tuesday, September 09, 2014 1:09 AM To: Chong Kim San Allen Cc: Allen [guest]; bioconductor@r-project.org<mailto:bioconductor@r-project.org>; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, I was referring to section 7.7 of the MEDIPS Tutorial (or vignette) available at: http://www.bioconductor.org/packages/release/bioc/vignettes/MEDIPS/ins t/doc/MEDIPS.pdf I recommend to apply the a PCA to the counts at the genomic windows returned by the MEDIPS.meth() function where you probably want to remove windows with zero or low counts first. Here, the assumption is that CpG density correction of MeDIP signals might not be necessary as you are comparing different samples with each other (please see section 7.7 of the MEDIPS tutorial). In case you want to apply a PCA on CpG density methylation estimates anyway, you can either use the rms values or calculate methylation estimates using BayMeth, a recent tool that appears to be superior for such methylation estimates. All the best, Lukas On Sun, Sep 7, 2014 at 11:10 AM, Chong Kim San Allen <patcksa@nus.edu. sg<mailto:patcksa@nus.edu.sg=""><mailto:patcksa@nus.edu.sg<mailto:patcksa @nus.edu.sg="">>> wrote: Dear Lukas, Again, I am so grateful for your quick reply. The only manual for MEDIPS that I know of is this one: http://www.bioc onductor.org/packages/release/bioc/manuals/MEDIPS/man/MEDIPS.pdf and I have looked and just can't find the section 7.7 that you mentioned. The reason I asked about how MEDIPS defines a cluster is because I wanted to try and reduce the number of methylated regions that I look at and then use these "consolidated" regions to do a principal component analysis (PCA). I have 12 cell types and I wanted to see if I could either do a PCA or Unsupervised Hierarchical Clustering to separate these 12 cell types into subgroups, based on their methylation profile. You said: "When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions." But at this point, I don't have any groupings. Normally, one would have a control and test group and so you can do what you suggested. But I have only, for example, a test group and I want to know if this test group can be further subdivided into a few subgroups based on their methylation profile. I would think that it would be computationally intensive if I used original 100bp windows generated by MEDIPS and entered all the rms values for each window of each cell type into the PCA. You said: "Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest." I didn't want my PCA analysis to be biased by a predefined "region of interest", be it whether the ROI is a promoter, CpG, etc. I wanted to just input the raw methylation scores (rms) to see what I could get. So that is why I asked if the rms can be added up (in question 4 of my last email)...also I was curious if it could be done. Once again., thanks so very much for taking time from your busy schedule to answer my questions. Best regards, Allen ________________________________________ From: Lukas Chavez [lukas.chavez.mailings@googlemail.com<mailto:lukas. chavez.mailings@googlemail.com=""><mailto:lukas.chavez.mailings@googlemai l.com<mailto:lukas.chavez.mailings@googlemail.com="">>] Sent: Saturday, September 06, 2014 11:09 AM To: Allen [guest] Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org><mail to:bioconductor@r-project.org<mailto:bioconductor@r-project.org="">>; Chong Kim San Allen; MEDIPS Maintainer Subject: Re: [BioC] MEDIPS: how does MEDIPS define a methylated region / cluster? Dear Allen, please see my comments below. Moreover, please consider reading section 7.7 of the MEDIPS manual. Best, Lukas On Fri, Sep 5, 2014 at 1:14 AM, Allen [guest] <guest@bioconductor.org< mailto:guest@bioconductor.org=""><mailto:guest@bioconductor.org<mailto:gu est@bioconductor.org="">><mailto:guest@bioconductor.org<mailto:guest@bioc onductor.org=""><mailto:guest@bioconductor.org<mailto:guest@bioconductor. org="">>>> wrote: Hi, I have aligned my sequencing data for one sample (filename: Ca2_MAPQ20.bam --- because I filtered out everything with MAPQ20 or less) and I've put it through MEDIPS. At the moment, I am not making a comparison between samples. Therefore, the result file shows the methylation profile for only one sample. In this results file, I noticed that the data is organized as such: chr start stop CF Ca2_MAPQ20.bam.counts Ca2_MAPQ20.bam.rpkm Ca2_MAPQ20.bam.rms Ca2_MAPQ20.bam.prob MSets1.counts.mean MSets1.rpkm.mean MSets1.rms.mean MSets1.prob.mean 1) I tried reading the Down et al. (2008) paper that explains the concept of coupling factor and I think it is, simply put, a measure of local CpG density. I am not sure if my understanding of 'coupling factor' is correct? That is correct. 2) This lead to question how or what MEDIPS defines as a "region or cluster"? That is to say, on Chromosome 1, I have reads aligning from position "1002501" to "1003300" but then there is a region of 1400 bp (from "1003301" to "1004700") where there are no reads aligned (rpkm = 0). Then, again, from "1004701" to "1005400", there reads aligning to this region. So my question is does MEDIPS consider this to be a case of 2 methylated regions, that is, "1002501-1003300" and "1004701-1005400" or does MEDIPS try to consolidate these 2 methylated regions into one methylated region/cluster (that is, from "1002501-1005400") since they are fairly close to one another. Each window is considered as one region; CpG density, counts, rpkm, rms, and differential coverage will be calculated for each window separately. When windows with significant differential enrichment between groups have been identified, the MEDIPS.mergeFrames() function can be used to merge adjacent significant windows into extended regions. 3) I used "uniq=TRUE" to get rid of stacked/clonal reads and so for each 100 bp bin, I am usually getting rpkm=1 within each bin but occasionally, it may be as many rpkm=3. This lead me to wonder how the relative methylation score ("Ca2_MAPQ20.bam.rms") is calculated? For example, 3 bins have a value of 1 rpkm each, but one has an rms value of "1660.409928", another an rms of "7122.10254" and the third is only "679.1814523". How is that possible that 3 bins that have the same rpkm can have such varying rms values? This is due to the different CpG densities in these windows. 4)Can the rms be added up for a region so as to represent the cumulative methylation level for that region. I am asking because I do not know if the rms value is a log value or what? So, in my above question (3), if the 3 bins are adjacent to one another and I decide to cluster them together and consider them as one methylated region, then would the rms value for this new 300bp bin that I have created be 1660.409928 + 7122.10254 + 679.1814523 = 9461.68. Instead of merging any values across windows, I recommend to make use of the MEDIPS.createROIse() function which allows for analyzing specific regions of interest. 5)Finally, what does "Ca2_MAPQ20.bam.prob" represent? I figured that is some sort of a probability score but I am not sure of what. As mentioned in the manual, please disregard the prob values which will be removed in a future version. Thanks in advance for any assistance in answering my questions. Best regards, Allen -- output of sessionInfo(): error -- Sent via the guest posting facility at bioconductor.org<http: biocond="" uctor.org=""><http: bioconductor.org=""><http: bioconductor.org="">. _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org="">><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org=""><mailto:bi oconductor@r-project.org<mailto:bioconductor@r-project.org="">>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you. Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you. Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you.

ADD COMMENT • link 10.6 years ago Chong Kim San Allen ▴ 70

0

Entering edit mode

mengru1223 • 0

@mengru1223-6807

Last seen 10.6 years ago

Taiwan

Dear Lukas,
I have encountered the same problem as Allen as well. That is, some regions' rms are "NA". Here are some cases as follows.
case1. CF:45, Set1count:40, Set2count:38
case2. CF:46, Set1count:58, Set2count:43
case3. CF:48, Set1count:3, Set2count:6
case4. CF:50, Set1count:4, Set2count:0
case5. CF:48, Set1count:0, Set2count:1

I am afriad that I have done somthing wrong. Do you have any idea why this happened?
Thanks in advance for answering my questions.

Best regards,
Mirrian

ADD COMMENT • link 10.6 years ago mengru1223 • 0

Login before adding your answer.