linkage distances


Daniel Brewer ★ 1.9k


Hi,

I have been producing some dendrograms using hclust with a variety of linkage distance measures. Does anyone know of a good resource that explains why one would use one linkage distance rather than another?

I don't really like dealing with dendrograms, but we want to produce groupings based on these to do differential analysis on, and I would like to be able to justify it.

Thanks

Dan

--
Daniel Brewer, Ph.D.
Institute of Cancer Research
Email: daniel.brewer at icr.ac.uk

The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP.


Written 17.9 years ago by Daniel Brewer • updated 17.9 years ago by David Ruau


William Shannon ▴ 280


An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070613/629ab5af/attachment.pl

Written 17.9 years ago by William Shannon


Dear Daniel,

The only reference I know of that addresses this topic to some extent is this book:
The Elements of Statistical Learning
by T. Hastie, R. Tibshirani, J. H. Friedman

With regard to William's suggestion: I don't have anything available that would calculate the consensus between different dendrograms. As a start to compute these comparisons, I would loop over the height component in the hclust objects with the cutree function. This way one can obtain all possible clusters defined by each dendrogram and then perform all-against-all consensus comparisons between different dendrograms using one of the intersect functions (e.g. %in%).

# For example:
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
hr <- hclust(dist(y, method = "euclidean"))
# One column of cluster labels for each merge height in the tree
sapply(hr$height, function(x) cutree(hr, h=x))

Thomas

On Wed 06/13/07 06:25, William Shannon wrote:
> I tend to use a 'consensus' approach when doing cluster analysis. If by linkage distance you mean genetic linkage (I assume you do), you could try the various linkage distances and see if the dendrogram is stable. This also works if you are dealing with non-genetic distance measures.
>
> If you do this and the dendrograms are essentially stable you are done. More formal methods of consensus trees (dendrograms) can be found doing a search on work by Fred McMorris (look in discrete math and evolutionary biology) and the numerical taxonomy software PAUP I believe has consensus methods in it.
>
> Maybe Tom Girke has consensus tools in R/Bioconductor.
>
> Bill Shannon
> Washington Univ. School of Medicine
>
> PS -- I am running for President elect of the Classification Society of North America and encourage anyone doing cluster/classification work to look at this society for their research and publications (Journal of Classification and http://www.classification-society.org/csna/csna.html)

--
Dr. Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
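The stability check discussed above can be sketched in base R. This is only an illustrative sketch, not the method used by anyone in this thread: the choice of linkage methods, the cut at k = 3 groups, and the helper rand_index (a plain pair-counting Rand index) are all assumptions made for the example.

```r
# Hypothetical example data, same shape as in the reply above.
set.seed(1)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))
d <- dist(y, method = "euclidean")

# One tree per linkage method (the set of methods is an arbitrary choice).
methods <- c("single", "complete", "average", "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut every tree into the same number of groups (k = 3 is arbitrary).
k <- 3
groups <- sapply(trees, cutree, k = k)  # items x methods matrix of labels

# Rand index: fraction of item pairs on which two partitions agree,
# i.e. both put the pair together or both put it apart.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)
  mean(same_a[pairs] == same_b[pairs])
}

# All-against-all agreement between linkage methods.
agreement <- sapply(methods, function(i)
  sapply(methods, function(j) rand_index(groups[, i], groups[, j])))
print(round(agreement, 2))
```

Values close to 1 for every pair of methods would suggest the grouping does not depend much on the linkage choice at that cut; low values mean the choice matters and needs to be justified.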

Written 17.9 years ago by Thomas Girke


Dear Daniel,

Package ape in CRAN contains a function one can use for calculating a consensus of several dendrograms (consensus). The consensus function can produce either a strict or a majority rule consensus. A strict consensus contains only the groups that are present in all the trees, whereas a majority rule consensus contains only the groups that are present in the majority of the trees. I've usually used the majority rule consensus, and it's the standard method used with bootstrapping analyses.

Jarno

--
Jarno Tuimala, PhD, bioinformatics, CSC, P.O.Box 405, FI-02101 Espoo, Finland
tel.: +358 9 457 2226, fax: +358 9 457 2302, e-mail: jarno.tuimala at csc.fi
CSC is the Finnish IT Center for Science, http://www.csc.fi/molbio
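A minimal sketch of the ape approach described above, assuming the CRAN package ape is installed. The example data and the three linkage methods are made up for illustration; as.phylo converts each hclust tree to ape's phylo class, and the p argument of consensus selects strict (p = 1) or majority rule (p = 0.5) consensus.

```r
# Assumes the CRAN package 'ape' is installed: install.packages("ape")
library(ape)

# Hypothetical example data; row names become the tip labels.
set.seed(1)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))
d <- dist(y, method = "euclidean")

# Build one dendrogram per linkage method and convert each to 'phylo'.
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) as.phylo(hclust(d, method = m)))

# Strict consensus keeps only groups found in every tree;
# majority rule keeps groups found in the majority of the trees.
strict   <- consensus(trees, p = 1)
majority <- consensus(trees, p = 0.5)

print(strict)
print(majority)
```

A poorly resolved consensus (few internal nodes) indicates that the linkage methods disagree on most groups; calling plot(majority) draws the groups that survive.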

Written 17.9 years ago by Jarno Tuimala


Thanks to all of you for your responses, that is really helpful.

Dan

Written 17.9 years ago by Daniel Brewer


David Ruau ▴ 190


For a good source of information on linkage methods you should have a look at this book:
"Finding Groups in Data. An introduction to cluster analysis" by L. Kaufman and P. J. Rousseeuw, published by Wiley.

This is a really easy book to read. For understanding linkage methods look at chapter 5, page 199. An explanation is also given on page 225, and there is a quick overview on page 47. All the methods described in this book are implemented in the package 'cluster'.

In the end, for the linkage method, I always use the same one: UPGMA, also called the average method. In the book they mention that you have to choose the linkage method according to the type of cluster shape you are looking for. I never found the answer to the cluster shape question when your matrix has more than 3 dimensions... :)

What I play with is the distance/similarity measure. When speaking about distance you should distinguish between metric (euclidean...), parametric and non-parametric measures. Parametric correlation measures can, due to their sensitivity to outliers, give non-homogeneous cluster solutions. In this case non-parametric correlations, such as Spearman rank correlation or Kendall's tau rank correlation, are preferred. The distance used by Eisen in his 1998 paper is the cosine correlation distance, also called uncentered Pearson, and it gives good results.

David

---
David Ruau
Institute for Biomedical Engineering -Cell Biology-
Universitatsklinikum Aachen, RWTH
Pauwelsstrasse 30
52074 Aachen
GERMANY
GPG: 4210CA11
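The distance measures mentioned above can be tried side by side in base R. This is a sketch under stated assumptions: the data are random, rows are treated as the items to cluster, each correlation r is turned into a dissimilarity as 1 - r (one common convention, not the only one), and the cut at k = 3 is arbitrary.

```r
# Hypothetical example data: 10 items (rows) measured in 5 conditions.
set.seed(1)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))

# cor() works on columns, so transpose to correlate rows with each other.
d_pearson  <- as.dist(1 - cor(t(y), method = "pearson"))
d_spearman <- as.dist(1 - cor(t(y), method = "spearman"))

# Uncentered Pearson (cosine): like Pearson, but row means are not
# subtracted. Scale each row to unit length, then take dot products.
unit <- y / sqrt(rowSums(y^2))
d_cosine <- as.dist(1 - tcrossprod(unit))

# Average linkage (UPGMA) on each dissimilarity.
dists <- list(pearson = d_pearson, spearman = d_spearman, cosine = d_cosine)
hc <- lapply(dists, hclust, method = "average")

# Group labels at k = 3, one column per distance measure.
groups_by_distance <- sapply(hc, cutree, k = 3)
print(groups_by_distance)
```

Comparing the columns of groups_by_distance shows how much the grouping changes with the distance measure while the linkage method is held fixed.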

Written 17.9 years ago by David Ruau
