segmentation aCGH data


jhs1jjm@leeds.ac.uk ▴ 230

@jhs1jjmleedsacuk-2338

Last seen 10.2 years ago

Hi list,

I've been looking at three 44k and two 244k Agilent CGH arrays. To date I've used limma to read in the processed signals (no background correction or normalization, as this has already been done), then the DNAcopy package for segmentation, as well as the snapCGH package to employ other segmentation methods rather than use each segmentation package individually.

Firstly, using the DNAcopy segmentation I can see a significant pattern across my three 44k arrays, which disappears when I perform the step to remove unnecessary change points due to trends in the data. As these are in the same locations across the three arrays, is it likely that this is biologically significant rather than a trend? Obviously nobody can give a definitive answer to this, but I wondered if anyone had seen similar results in a different scenario.

Additionally, I'm wondering what segmentation methods people tend to employ. The heterogeneous nature of my data means that I need to identify single-probe as well as larger-region aberrations, and I'd read that the CBS algorithm is not particularly suited to doing this.

Apologies if this is a bit vague.

Thanks for any input,
John

Normalization CGH probe limma DNAcopy snapCGH • 2.3k views

ADD COMMENT • link 17.1 years ago jhs1jjm@leeds.ac.uk ▴ 230


Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

jhs1jjm at leeds.ac.uk wrote:

> Firstly using the DNAcopy segmentation I can see a significant pattern
> across my 3*44k arrays which disappears when I perform the step to
> remove unnecessary change points due to trends in the data. As these
> are in the same locations across the 3 arrays then is it likely that
> this is biologically significant rather than being a trend? Obviously
> others do not have a definitive answer for this but I wondered if
> anyone had seen similar results in a different scenario.

What you are describing could be technical in nature, or it could be copy number variants. You will probably need to review those regions for known copy number variants and also look at the quality control metrics for those probes. Unfortunately, segmentation is not "the final answer" to CGH analysis: there has to be some curation (either manual or automated) to find the regions of greatest interest and remove the regions that are likely not associated with the disease state.

> Additionally I'm wondering what segmentation methods people have tended
> to employ. The heterogeneous nature of my data means that I need to
> identify single probe as well as larger region aberrations and I'd
> read that the CBS algorithm is not particularly suited to doing this?
> Apologies if this is a bit vague.

Single probes are problematic and require validation using another technology or array platform, in my opinion.

ADD COMMENT • link 17.1 years ago Sean Davis 21k


In reply to Sean Davis's message of Wed 10 Oct 2007 15:36:21 BST:

Hi Sean,

As the arrays are two-colour, I'm looking at relative amounts. Wouldn't that mean I wouldn't see copy number variants? Would they not be in both my samples?

I was also pondering the advantages of using R and Bioconductor versus, say, Agilent's z-score, for the purposes of my discussion. Is the simple answer that they allow a more flexible approach to these matters?

Also, if possible, could you expand a bit on the single probes argument?

Thanks for the input,
John

ADD REPLY • link 17.1 years ago jhs1jjm@leeds.ac.uk ▴ 230


jhs1jjm at leeds.ac.uk wrote:

> I was also pondering the advantages of using R and
> bioconductor, vs say Agilent's z score, for the purposes of my
> discussion. Is the simple answer simply a flexible approach to these
> matters? Also if possible could you expand a bit in regards to the
> single probes argument.

If using Agilent CGHAnalytics, you will probably want to use ADM-1, not z-score. For the 44k arrays, a threshold of around 6 is probably appropriate. For the 244k arrays, something closer to 10 or 11 is more appropriate.

ADM-1 is exquisitely sensitive to single probes that are extreme values. These may represent real signal, or may be noise. There is no way to tell without validation, in my opinion. However, if there are two or more probes behaving similarly, then you can be more assured of real biology.

The real biology could be directly disease-related or not. The changes that are not disease-related are copy number variants (although there is now plenty of evidence that copy number variants can be disease-associated as well). When using high-resolution oligo arrays, you will need to become familiar with copy number polymorphism and the databases for annotating it. CGHAnalytics contains a built-in catalog of those.

As for R/Bioc versus commercial packages, that will be dictated by the questions you want to ask. We find that we routinely need and want to ask questions that are not easily answered by commercial packages. That said, a good visualization tool for CGH is HIGHLY useful, and there are now several available.

Sean
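The "two or more probes behaving similarly" rule of thumb can be sketched in a few lines. This is only an illustration in Python, not ADM-1 and not anything from CGHAnalytics: the function name, the log-ratio cutoff of 0.5 and the minimum run length are all hypothetical choices.

```python
def aberrant_runs(log_ratios, threshold=0.5, min_probes=2):
    """Return (start, end, sign) for runs of adjacent probes beyond +/-threshold.

    Probes must be ordered by genomic position within one chromosome.
    A lone extreme probe is ignored; it would need separate validation.
    """
    def sign(value):
        if value > threshold:
            return 1
        if value < -threshold:
            return -1
        return 0

    runs, start, current = [], 0, 0
    # The trailing 0.0 is a sentinel that closes a run ending at the last probe.
    for i, value in enumerate(list(log_ratios) + [0.0]):
        s = sign(value)
        if s != current:
            if current != 0 and i - start >= min_probes:
                runs.append((start, i - 1, current))
            start, current = i, s
    return runs


probes = [0.1, 0.9, 0.0, 0.8, 0.7, -0.1, -0.9, -0.8, -0.7]
runs = aberrant_runs(probes)
print(runs)
```

Here the isolated value at index 1 is not reported, while the two-probe gain and the three-probe loss are. Whether an isolated probe is noise or a real single-probe event is exactly what this kind of filter cannot decide.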

ADD REPLY • link 17.1 years ago Sean Davis 21k


jhs1jjm at leeds.ac.uk wrote:

> Hi Sean,
>
> As its 2 colour so I'm looking at relative amounts wouldn't that mean I
> wouldn't see copy number variants, would they not be in both my
> samples?

I forgot to answer this question directly. If the reference genome and the test genome contain the same number of copies of a CNV region, you will not see it, as you suggest. However, if your reference and test samples contain different numbers of copies, then this will potentially be evident in your data.

Sean
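A small worked example of this point, as a sketch in Python. It assumes an idealized two-colour experiment where the signal is proportional to copy number, with no noise or dye bias; the function name is made up for illustration.

```python
import math


def log2_ratio(test_copies, ref_copies):
    """Expected log2(test/reference) for a region under ideal conditions."""
    return math.log2(test_copies / ref_copies)


same = log2_ratio(3, 3)  # both genomes carry three copies of the CNV region
gain = log2_ratio(3, 2)  # test genome has one copy more than the reference
loss = log2_ratio(1, 2)  # test genome has one copy fewer than the reference
print(same, round(gain, 3), round(loss, 3))
```

When test and reference share the variant the expected log ratio is zero, so the region is invisible; any difference in copy number between the two samples gives a non-zero ratio, whether or not it is related to disease.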

ADD REPLY • link 17.1 years ago Sean Davis 21k


Ramon Diaz ★ 1.1k

@ramon-diaz-159

Last seen 10.2 years ago

Dear John,

On Wednesday 10 October 2007 15:52, jhs1jjm at leeds.ac.uk wrote:

> Firstly using the DNAcopy segmentation I can see a significant pattern
> across my 3*44k arrays which disappears when I perform the step to
> remove unnecessary change points due to trends in the data.

How exactly are you removing "unnecessary change points due to trends in the data"?

> Additionally I'm wondering what segmentation methods people have tended
> to employ. The heterogeneous nature of my data means that I need to
> identify single probe as well as larger region aberrations and I'd
> read that the CBS algorithm is not particularly suited to doing this?

If you run the "smooth.CNA" function (in the DNAcopy package), as is recommended in the documentation for DNAcopy (IIRC), then single-probe aberrations are not detectable (you are smoothing them away).

Single-probe aberrations might be detected with the HMM model in snapCGH or our HMM model in RJaCGH, available from CRAN (http://cran.r-project.org/src/contrib/Descriptions/RJaCGH.html). Details of the method are available in the paper: http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pcbi.0030122

Best,

R.

--
Ramón Díaz-Uriarte
Statistical Computing Team
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462 (http://ligarto.org/rdiaz/0xE89B3462.asc)
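To see why an outlier-smoothing step removes single-probe aberrations, here is a toy version in Python. It is a simplified stand-in and not the rule implemented by smooth.CNA: the function name, the window of two probes on each side and the 4-SD cutoff are assumptions made for the example.

```python
from statistics import median


def smooth_outliers(values, sd, window=2, k=4.0):
    """Pull isolated outliers back to the median of their neighbours.

    A probe is treated as an outlier only if it is more than k*sd away
    from every neighbour inside the window.
    """
    smoothed = list(values)
    for i, value in enumerate(values):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbours = values[lo:i] + values[i + 1:hi]
        if neighbours and all(abs(value - n) > k * sd for n in neighbours):
            smoothed[i] = median(neighbours)
    return smoothed


single = [0.02, -0.05, 0.01, 1.40, 0.03, -0.02, 0.00]  # one extreme probe
double = [0.00, 0.00, 1.40, 1.40, 0.00, 0.00]          # two adjacent probes
smoothed_single = smooth_outliers(single, sd=0.05)
smoothed_double = smooth_outliers(double, sd=0.05)
print(smoothed_single[3], smoothed_double[2:4])
```

The lone probe at 1.40 is replaced by the median of its neighbours before segmentation ever sees it, whereas two adjacent elevated probes support each other and survive.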

ADD COMMENT • link 17.1 years ago Ramon Diaz ★ 1.1k


In reply to Ramon Diaz-Uriarte's message of Wed 10 Oct 2007 15:22:22 BST:

Hi Ramon,

Ah, of course, I'd forgotten I'd performed that step. I'm still getting some segments with high means corresponding to single genes, but I guess this will be because they are represented by more than one probe.

The DNAcopy documentation has a step to remove local trends in the data. I'm undoing splits that are not at least 3 SDs apart, as set out in the document.

To summarize, then: I might use DNAcopy to identify regions, but in order to look at single-probe aberrations I'd want to use one of the other methods, i.e. an HMM.

Thanks,
John
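The "undo splits that are not at least 3 SDs apart" step can be pictured with a short sketch. This is Python written for illustration, not DNAcopy's code: the function name and the merge order (closest pair of neighbouring segments first) are assumptions, and `sd` stands for whatever noise estimate is used for the array.

```python
def undo_small_splits(segments, sd, k=3.0):
    """Merge adjacent segments whose means differ by less than k*sd.

    segments: list of (n_probes, mean) in genomic order.
    Merged segments get the probe-weighted mean of their parts.
    """
    segs = [tuple(s) for s in segments]
    while len(segs) > 1:
        diffs = [abs(segs[i + 1][1] - segs[i][1]) for i in range(len(segs) - 1)]
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= k * sd:
            break  # every remaining change point is at least k SDs high
        (n1, m1), (n2, m2) = segs[i], segs[i + 1]
        segs[i:i + 2] = [(n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))]
    return segs


segments = [(120, 0.00), (15, 0.12), (200, 0.01), (8, 0.90), (90, 0.02)]
kept = undo_small_splits(segments, sd=0.1)
print(kept)
```

In this made-up example the 15-probe segment with mean 0.12 is absorbed into its neighbours, while the 8-probe segment at 0.90 is kept. A small but reproducible shift seen on several arrays would be removed by the same rule, which is one reason a pattern can vanish after this step without that telling you whether it was biological.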

ADD REPLY • link 17.1 years ago jhs1jjm@leeds.ac.uk ▴ 230


On Wednesday 10 October 2007 17:04, jhs1jjm at leeds.ac.uk wrote:

> Ah, of course, I'd forgotten I'd performed that step. I'm still getting
> some segments with high means corresponding to single genes but this'll
> be because they are represented by more than 1 probe I guess. The
> DNAcopy document has a step in it to remove local trends in the data.
> I'm undoing splits that are not at least 3 SDs apart as set out in the
> document.

Ah, OK. I thought you were referring to other trends (I've heard people mention waves, relations to GC content, etc.; the latter, I think, is commonly done in Affy).

> To summarize then, I might use DNAcopy to identify regions but in order
> to look at single probe aberrations I'd want to use one of the other
> methods i.e. HMM

We often analyze data with four or five different methods (our own HMM in RJaCGH, Olshen's CBS, the HMM as in Marioni et al., Picard et al.'s CGHseg, and Hsu et al.'s wavelet-based smoothing) because different approaches are sensitive to different features of the data (or can be misled by different features of the data). (Of course, we do think our approach is the best overall performer, but this way we can keep learning about the relative strengths of different methods and/or detect bugs in the code.)

Detecting single-point aberrations might be trickier than, say, detecting a long alteration that involves tens of probes. But then, inability to detect single-gene alterations can be very relevant in some studies (e.g., IIRC, Aguirre et al., in PNAS 2004, in their study of pancreatic adenocarcinoma, have some discussion of not detecting the loss of the tumor suppressor SMAD4).

As for the need for validation: if you have a gene covered by a bunch of probes and only a single probe is being called aberrant, then I'd be more concerned; but you might be averaging over probes, or using platforms where some genes only have a single probe, etc. In general, many or most of the current aCGH studies are really exploratory studies (i.e., they are in the "copy number differences discovery" stage, not the "copy number association studies" stage) with results that need to be validated further (other aCGH platforms, other molecular techniques); there are several papers in the July 2007 issue of Nature Genetics (volume 39) that go into these issues.

Best,

R.

ADD REPLY • link 17.1 years ago Ramon Diaz ★ 1.1k


jhs1jjm@leeds.ac.uk ▴ 230

@jhs1jjmleedsacuk-2338

Last seen 10.2 years ago

In reply to Sean Davis's message of Wed 10 Oct 2007 17:15:52 BST:

Sean,

Thanks, that helps a lot. I've purposely stayed away from using the Agilent software, firstly because I'm not on campus (which is where it is installed) and secondly because I wanted to do the analysis using R and Bioconductor and any other open source software I can get my hands on. I was also wondering whether it's the case that a lot of the packages and the algorithms they use appear in Bioconductor/R first, and that it may take time to implement them on a commercial platform, i.e. with a nice GUI?

I wonder also if you could help on another matter. At the moment I'm exporting the DNAcopy segment output as a CSV file, then opening it in OpenOffice Calc and matching the map position against the Agilent text file to find the corresponding genes. This is fine for the 44k arrays, but I'm unable to see all the rows of the 244k text file in Calc, so I cannot match the map positions to genes.

Regards,
John
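One way around the spreadsheet row limit is to do the lookup in a script instead. The sketch below is Python rather than R, and it makes several assumptions: the segment CSV is taken to have the column names DNAcopy uses in its segment table (chrom, loc.start, loc.end, seg.mean), and the probe annotation is a hypothetical tab-separated file with chrom, pos and gene columns, which is a simplification of the real Agilent text file. The function name is also made up.

```python
import csv
import io


def genes_in_segments(segments_csv, probes_tsv):
    """Yield (segment row, sorted gene names) for probes inside each segment."""
    # Index probes by chromosome so each segment only scans its own chromosome.
    probes = {}
    for row in csv.DictReader(probes_tsv, delimiter="\t"):
        probes.setdefault(row["chrom"], []).append((int(row["pos"]), row["gene"]))
    for seg in csv.DictReader(segments_csv):
        start, end = int(seg["loc.start"]), int(seg["loc.end"])
        genes = {gene for pos, gene in probes.get(seg["chrom"], [])
                 if start <= pos <= end and gene}  # skip unannotated probes
        yield seg, sorted(genes)


# Tiny in-memory stand-ins for the two files; real use would pass open(...) handles.
segments_text = "\n".join([
    "ID,chrom,loc.start,loc.end,num.mark,seg.mean",
    "a1,1,1000,5000,3,0.85",
    "a1,2,100,900,2,-0.02",
])
probes_text = "\n".join([
    "chrom\tpos\tgene",
    "1\t1200\tGENE_A",
    "1\t4800\tGENE_B",
    "1\t9000\tGENE_C",
    "2\t500\t",
    "2\t700\tGENE_D",
])
hits = [(seg["chrom"], seg["seg.mean"], genes)
        for seg, genes in genes_in_segments(io.StringIO(segments_text),
                                            io.StringIO(probes_text))]
print(hits)
```

Because the files are read row by row, the 244k annotation file is no harder to handle than the 44k one. The same join could equally be done in R with `merge` or a simple loop over the segment table.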

ADD COMMENT • link 17.1 years ago jhs1jjm@leeds.ac.uk ▴ 230
