Hi,
I have been producing some dendrograms using hclust with a variety of
linkage distance measures. Does anyone know of a good resource that
explains why one would use one linkage distance rather than another?
I don't really like dealing with dendrograms, but we want to produce
groupings based on these to do differential analysis on, and I would
like to be able to justify it.
Thanks
Dan
--
**************************************************************
Daniel Brewer, Ph.D.
Institute of Cancer Research
Email: daniel.brewer at icr.ac.uk
**************************************************************
The Institute of Cancer Research: Royal Cancer Hospital, a charitable
Company Limited by Guarantee, Registered in England under Company No.
534147 with its Registered Office at 123 Old Brompton Road, London SW7
3RP.
Dear Daniel,
The only reference I know of that addresses this topic to some extent
is this book:
The Elements of Statistical Learning
by T. Hastie, R. Tibshirani, J. H. Friedman
With regard to William's suggestion: I don't have anything available
that would calculate the consensus between different dendrograms. As a
starting point for these comparisons, I would loop over the height
component of the hclust objects with the cutree function. This way one
can obtain all possible clusters defined by each dendrogram and then
perform all-against-all consensus comparisons between different
dendrograms using one of the intersect functions (e.g. %in%).
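# For example:
# Toy data: 10 genes (rows) by 5 samples (columns)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste("g", 1:10, sep = ""), paste("t", 1:5, sep = "")))
# Hierarchical clustering on Euclidean distances (hclust defaults to complete linkage)
hr <- hclust(dist(y, method = "euclidean"))
# Cut the tree at every merge height: one column of cluster labels per height
sapply(hr$height, function(x) cutree(hr, h = x))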
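The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy matrix, the seed, the three linkage methods and the choice of k = 3 groups are all arbitrary, and the variable names are made up for the example. It cuts each tree into k groups, cross-tabulates two of the groupings, and uses %in% to see which members of one cluster stay together under another linkage method.

```r
set.seed(42)  # arbitrary seed, only to make the toy data reproducible
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))
d <- dist(y, method = "euclidean")

# Cluster the same distances with several linkage methods (illustrative selection)
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut every tree into k groups: one column of cluster labels per linkage method
k <- 3
groups <- sapply(trees, cutree, k = k)

# Cross-tabulate two groupings; a single non-zero cell in every row and
# column would mean the two methods agree up to relabelling
agreement <- table(complete = groups[, "complete"], average = groups[, "average"])
print(agreement)

# Members of cluster 1 (complete linkage) that share a cluster with the
# first of them under average linkage
c1 <- rownames(groups)[groups[, "complete"] == 1]
same_in_avg <- rownames(groups)[groups[, "average"] == groups[c1[1], "average"]]
shared <- c1[c1 %in% same_in_avg]
print(shared)
```

Cutting at a fixed k rather than at every height keeps the comparison small; looping over hr$height as in the example above would give the full set of clusters.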
Thomas
On Wed 06/13/07 06:25, William Shannon wrote:
> I tend to use a 'consensus' approach when doing cluster analysis. If by
> linkage distance you mean genetic linkage (I assume you do), you could
> try the various linkage distances and see if the dendrogram is stable.
> This also works if you are dealing with non-genetic distance measures.
>
> If you do this and the dendrograms are essentially stable you are done.
> More formal methods of consensus trees (dendrograms) can be found doing
> a search on work by Fred McMorris (look in discrete math and
> evolutionary biology) and the numerical taxonomy software PAUP I believe
> has consensus methods in it.
>
> Maybe Tom Girke has consensus tools in R/Bioconductor.
>
> Bill Shannon
> Washington Univ. School of Medicine
>
> PS -- I am running for President elect of the Classification Society of
> North America and encourage anyone doing cluster/classification work to
> look at this society for their research and publications (Journal of
> Classification and http://www.classification-society.org/csna/csna.html)
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Dr. Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521
E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
Dear Daniel,
The package ape on CRAN contains a function (consensus) that one can
use for calculating a consensus of several dendrograms. The consensus
function can produce either a strict or a majority-rule consensus. A
strict consensus contains only the groups that are present in all the
trees, whereas a majority-rule consensus contains only the groups that
are present in the majority of the trees. I've usually used
majority-rule consensus, and it's the standard method used with
bootstrapping analyses.
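A minimal sketch of how this might look, assuming the ape package is installed (the example is skipped otherwise). The toy data, the seed and the selection of linkage methods are assumptions for illustration. Each hclust tree is converted with as.phylo, and consensus is called with p = 1 for a strict consensus and p = 0.5 for a majority-rule consensus.

```r
# Assumption: the CRAN package 'ape' is installed; skip the example otherwise
if (requireNamespace("ape", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed for reproducible toy data
  y <- matrix(rnorm(50), 10, 5,
              dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))
  d <- dist(y, method = "euclidean")

  # One dendrogram per linkage method, converted from hclust to phylo
  methods <- c("single", "complete", "average")
  phylo_trees <- lapply(methods, function(m) {
    ape::as.phylo(hclust(d, method = m))
  })

  # p = 1: keep only groups found in all trees (strict consensus)
  strict <- ape::consensus(phylo_trees, p = 1)
  # p = 0.5: keep groups found in the majority of trees
  majority <- ape::consensus(phylo_trees, p = 0.5)

  print(strict)
  print(majority)
}
```

Both results are phylo objects, so they can be drawn with plot() to see which groupings survive across linkage methods.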
Jarno
-----------------------------------------------------------------------------
Jarno Tuimala, PhD, bioinformatics, CSC, P.O.Box 405, FI-02101 Espoo, Finland
tel.: +358 9 457 2226, fax: +358 9 457 2302, e-mail: jarno.tuimala at csc.fi
CSC is the Finnish IT Center for Science, http://www.csc.fi/molbio
Thanks to all of you for your responses; they are really helpful.
Dan
For a good source of information on linkage methods, have a look at
this book:
"Finding Groups in Data: An Introduction to Cluster Analysis"
by L. Kaufman and P. J. Rousseeuw,
published by Wiley.
It is a really easy book to read.
To understand linkage methods, look at chapter 5, page 199. An
explanation is also given on page 225.
For a quick overview, see page 47. All the methods described in this
book are implemented in the package 'cluster'.
In the end, for the linkage method, I always use the same one: UPGMA,
also called the average method.
In the book they mention that you have to choose the linkage method
according to the type of cluster shape you are looking for.
I never found out how to determine the cluster shape when your matrix
has more than 3 dimensions... :)
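As a rough illustration of the 'cluster' package mentioned above, the sketch below runs agnes (agglomerative nesting) with a few linkage methods and prints the agglomerative coefficient reported for each fit. The toy data, the seed and the selection of methods are assumptions made for the example, not a recommendation.

```r
library(cluster)  # recommended package shipped with standard R installations

set.seed(42)  # arbitrary seed for reproducible toy data
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))

# Fit agglomerative clusterings with different linkage methods;
# "average" is the UPGMA method
methods <- c("single", "complete", "average")
fits <- lapply(methods, function(m) {
  agnes(y, metric = "euclidean", method = m)
})
names(fits) <- methods

# Agglomerative coefficient of each fit (values closer to 1 suggest
# a more pronounced clustering structure)
ac <- sapply(fits, function(f) f$ac)
print(round(ac, 3))

# Convert the UPGMA fit to an hclust object so cutree() can be reused
hc_avg <- as.hclust(fits[["average"]])
grp_avg <- cutree(hc_avg, k = 3)
print(grp_avg)
```

The coefficient depends on the number of observations, so it is best compared only between fits on the same data.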
What I play with is the distance/similarity measure.
When speaking about distance you should distinguish between metric
(Euclidean...), parametric and non-parametric measures.
Parametric correlation measures can, due to their sensitivity to
outliers, give non-homogeneous cluster solutions. In this case
non-parametric correlations, such as the Spearman rank correlation or
Kendall's tau rank correlation, are preferred.
The distance used by Eisen in his 1998 paper is the cosine correlation
distance, also called uncentered Pearson, and it gives good results.
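The distance measures mentioned above can be tried in base R along these lines. This is a sketch under assumptions: the toy matrix, the seed, the use of 1 - correlation as the distance, and k = 3 are illustrative choices, and the uncentered Pearson (cosine) similarity is computed by hand as the dot product of two rows divided by the product of their norms.

```r
set.seed(42)  # arbitrary seed for reproducible toy data
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))

# Correlation-based distances between rows (genes): 1 - correlation
d_pearson  <- as.dist(1 - cor(t(y), method = "pearson"))
d_spearman <- as.dist(1 - cor(t(y), method = "spearman"))
d_kendall  <- as.dist(1 - cor(t(y), method = "kendall"))

# Uncentered Pearson (cosine) similarity: rows are not mean-centered
row_norm <- sqrt(rowSums(y^2))
cosine_sim <- (y %*% t(y)) / (row_norm %o% row_norm)
d_cosine <- as.dist(1 - cosine_sim)

# Average linkage (UPGMA) on each distance, then cut into k groups
dists <- list(pearson = d_pearson, spearman = d_spearman,
              kendall = d_kendall, cosine = d_cosine)
trees <- lapply(dists, hclust, method = "average")
groups <- sapply(trees, cutree, k = 3)
print(groups)
```

Comparing the columns of groups shows how much the grouping changes with the distance measure while the linkage method is held fixed.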
David
---
David Ruau
Institute for Biomedical Engineering
-Cell Biology-
Universitatsklinikum Aachen, RWTH
Pauwelsstrasse 30
52074 Aachen
GERMANY
GPG: 4210CA11
On Jun 13, 2007, at 3:13 PM, Daniel Brewer wrote:
> Hi,
>
> I have been producing some dendrograms using hclust with a variety of
> linkage distance measures. Does anyone know of a good resource that
> explains why one would use one linkage distance rather than another?
>
> I don't really like dealing with dendrograms, but we want to produce
> groupings based on these to do differential analysis on, and I would
> like to be able to justify it.
>
> Thanks
>
> Dan