Re: [S] Error in clustering procedure

cstrato ★ 3.9k

Sorry, but I cannot resist:

Any comments of the microarray community on the usefulness of hierarchical clustering of 7000 genes?

Best regards
Christian
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
C.h.r.i.s.t.i.a.n. .S.t.r.a.t.o.w.a
V.i.e.n.n.a. .A.u.s.t.r.i.a
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

Prof Brian Ripley wrote:

> A distance matrix on 7000 objects alone takes up 187Mb. I don't know how
> your machine is set up re swap space, but you should use your task manager
> to monitor memory usage. Almost certainly you are running out of memory.
>
> However, I have never seen an agglomerative clustering of 7000 objects
> make sense scientifically (not that that stops the bioinformatics people).
> I think you need either to work in smaller subsets or to combine objects
> into clusters before starting.
>
> On Tue, 7 Sep 2004, Joao Baptista de O. e Souza Filho wrote:
>
>> I am working with SPLUS 2000 using Windows 2000 SP4, 512 MBytes RAM,
>> 3 GBytes of free space in HD.
>>
>> When I try to do an agglomerative clustering upon a matrix of
>> dimensions 7000 x 5, the program, after some time spent in
>> calculations, returns the following error message:
>>
>> ====================================================================
>> Error in disv == -1: Unable to obtain requested dynamic memory (this
>> request is for 200194252 bytes, 0 bytes already in use)
>> ====================================================================
>>
>> First, I have used the command: "options(object.size=300e6)", since the
>> program presented the message:
>>
>> ====================================================================
>> Error in double(1 + (n * (n - 1))/2): Cannot allocate 200194208 bytes:
>> options("object.size") is 100000000: see options help file
>> ====================================================================
>>
>> Does someone know how should I proceed?
>>
>> Thanks in advance
>>
>> Joao Baptista Filho
>>
>> --------------------------------------------------------------------
>> This message was distributed by s-news@lists.biostat.wustl.edu. To
>> ...(s-news.. clipped)...
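
The memory figures in the quoted messages can be checked with a little arithmetic. The sketch below is in Python rather than S-PLUS, and the helper name `dist_vector_bytes` is made up for illustration. It assumes the pairwise distances are stored as one 8-byte double per pair, which is what the `double(1 + (n * (n - 1))/2)` call in the error message suggests.

```python
def dist_vector_bytes(n, bytes_per_value=8):
    """Bytes needed to hold the n*(n-1)/2 pairwise distances as doubles."""
    return n * (n - 1) // 2 * bytes_per_value


# Size of the distance vector for exactly 7000 objects.
size_7000 = dist_vector_bytes(7000)
print(size_7000, "bytes, about", round(size_7000 / 2**20, 1), "MiB")

# The error message reports 200194208 bytes for double(1 + n*(n-1)/2).
# Assuming 8 bytes per element, recover the n that would match it.
requested_bytes = 200194208
elements = requested_bytes // 8
n_implied = next(k for k in range(2, 20000) if 1 + k * (k - 1) // 2 == elements)
print("implied number of rows:", n_implied)
```

Under these assumptions the vector for 7000 objects comes to roughly 187 MiB, in line with Prof. Ripley's figure, and the allocation in the error message would correspond to slightly more than 7000 rows. Either way a single object of that size is a large share of 512 MBytes of RAM.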

ADD COMMENT • link updated 20.6 years ago by Stephen Henderson ★ 1.0k • written 20.7 years ago by cstrato ★ 3.9k

James W. MacDonald 68k

cstrato wrote:

> Sorry, but I cannot resist:
>
> Any comments of the microarray community on the usefulness of
> hierarchical clustering of 7000 genes?

I think this would be almost completely useless. First off, clustering is not an inferential technique, so its use has been completely oversold IMO to the biological community. Secondly, clustering is usually done to produce a 'heat map' to put in a paper or flash on the screen during a presentation. How on earth would this be of any use? You couldn't even read any of the gene names!

Of course you could use the heatmap to impress friends and colleagues with the fact that you rate a computer powerful enough to *do* a heatmap with a 7000 x 5 matrix ;-D

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109

ADD COMMENT • link 20.7 years ago James W. MacDonald 68k

Dear all

First of all, I want to apologize to Prof. Ripley, since I forgot to ask him first for permission to publish his comment.

Personally, I agree that this would be useless, as Prof. Ripley already told me some years ago. However, almost everybody still seems to do it and publish the corresponding results. Companies such as Spotfire are proud that you can do hierarchical clustering with more than 20,000 genes. I have never seen a publication where it was done differently.

I think that the bioconductor list would be the best forum to discuss this issue and to provide solutions (besides the obvious suggestion to filter non-varying genes).

Best regards
Christian
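
The "obvious suggestion" of filtering non-varying genes could look roughly like the following. This is a minimal Python sketch, not anything proposed in the thread: the function name `filter_by_variance`, the toy expression values, and the choice to keep a fixed number of genes are all assumptions made for illustration.

```python
import statistics


def filter_by_variance(expr, keep):
    """Return the names of the `keep` genes with the largest variance.

    `expr` maps a gene name to its list of expression values, one per array.
    """
    ranked = sorted(expr, key=lambda gene: statistics.pvariance(expr[gene]),
                    reverse=True)
    return ranked[:keep]


# Toy data: four genes measured on five arrays (values are invented).
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.0],   # almost flat
    "geneB": [0.2, 2.5, 0.1, 2.7, 0.3],   # varies strongly
    "geneC": [5.0, 5.0, 5.0, 5.0, 5.0],   # constant
    "geneD": [1.0, 2.0, 3.0, 4.0, 5.0],   # steady trend
}

top = filter_by_variance(expr, keep=2)
print(top)
```

How many genes to keep is a judgment call. The point is only that the set handed to a clustering routine, and hence the size of the distance matrix, can be cut down before any distances are computed.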

ADD REPLY • link 20.7 years ago cstrato ★ 3.9k

On Tuesday 07 September 2004 21:17, cstrato wrote:

> However, almost everybody still seems to do it and publish the
> corresponding results. Companies such as Spotfire are proud that
> you can do hierarchical clustering with more than 20,000 genes.
> I have never seen a publication where it was done differently.

Part of this could be the result of imitative behavior, beliefs that "unless I put a neat heatmap I won't get it past reviewers", etc, but not evidence that it is the best way to go. If several companies are making an issue out of it in their brochures, maybe it is because customers ask for clustering. As for "publish the corresponding results" I am not sure what the "results" are, since after clustering 7000 genes you can almost always make up a story after the fact; but I would not call that a result.

I think clustering (and biclustering) do have a place, but I guess they should be used as a tool to answer some question (e.g., I think I understand what question a t-test is helping to answer; I am not sure about most clustering procedures), or as a guidance for something, not as some kind of magic tool to "let the data speak for themselves" ( = a) get the microarray data; b) run a clustering procedure; c) come up with a question that your cluster "answered".)

R.

> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462 (http://ligarto.org/rdiaz/0xE89B3462.asc)

ADD REPLY • link 20.7 years ago Ramon Diaz ★ 1.1k

Please note my comment was not about the usefulness of clustering or even of hierarchical clustering, but about the sub-optimality of *agglomerative* clustering on large sets.

If you think you need clustering with thousands of objects there are in my experience always better ways to achieve the real objective than agglomerative clustering. Typically people are looking for a few large clusters or outliers or many small clusters within already known larger groupings. In the case of a heatmap, clustering is being used to produce a 1D MDS (a seriation) for which better methods are known.

BDR

--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford
1 South Parks Road
Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self)
Tel: +44 1865 272866 (PA)
Fax: +44 1865 272595
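
To make the remark about a 1D MDS concrete: for Euclidean distances, the one-dimensional classical MDS solution is given by the scores on the first principal component, so rows can be ordered for a heatmap without ever forming the full distance matrix. The Python sketch below illustrates that idea only. It is not necessarily one of the "better methods" Prof. Ripley had in mind, and the function names, the power-iteration approach, and the toy data are assumptions.

```python
def first_pc_scores(rows, iterations=200):
    """Scores of each row on the first principal component.

    Only a p x p covariance matrix is formed (p = number of columns),
    so memory use does not grow with the square of the number of rows.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration for the leading eigenvector. The all-ones start is an
    # arbitrary choice and would fail if it were orthogonal to that vector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


def seriate(rows):
    """Row indices ordered by first-principal-component score."""
    scores = first_pc_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])


# Toy data: four rows whose overall level differs (values are invented).
rows = [
    [3.0, 3.1, 2.9],
    [1.0, 0.9, 1.1],
    [2.0, 2.0, 2.1],
    [0.0, 0.1, 0.0],
]
order = seriate(rows)
print(order)
```

The sign of a principal component is arbitrary, so the ordering is only defined up to reversal. For a 7000 x 5 matrix this needs a 5 x 5 covariance matrix instead of roughly 24.5 million pairwise distances.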

ADD REPLY • link 20.7 years ago ripley@stats.ox.ac.uk ▴ 50

Thank you all very much for your replies.

Three years ago, an identical question about a memory error with clustering already encouraged me to start a similar discussion, see:
https://www.stat.math.ethz.ch/pipermail/r-help/2001-November/015524.html
https://www.stat.math.ethz.ch/pipermail/r-help/2001-December/015557.html

For some reason I have the feeling that nothing has changed since then, and personally I am still uncomfortable doing clustering. For me, many of the questions that I brought up are still not solved.

Best regards
Christian

ADD REPLY • link 20.7 years ago cstrato ★ 3.9k

David K Pritchard ▴ 70

Christian,

I think it is overstating the matter to say it is useless to hierarchically cluster 7000 genes. In most studies where one is comparing only two or a few different conditions there is generally not a lot of structure in the data and clustering is not useful. However, I have been involved with rare experiments where there is a lot of structure in the data and clustering the whole dataset (10 or 20K genes) is useful to see that structure. I am presently analyzing an experiment where overexpression of a gene is compared to overexpression of a number of mutant forms of the gene. In this study hierarchically clustering the data (20K genes) revealed structure in the data that would have been hard to see otherwise.

Clearly there is no good way to look at all of this data at one time. However, programs like MEV from TIGR do a good job of presenting a useful interface for browsing that much data. I also believe that MEV will hierarchically cluster ~20K genes and is freely available from the TIGR website.

David Pritchard

ADD COMMENT • link 20.7 years ago David K Pritchard ▴ 70

michael watson IAH-C ★ 3.4k

I guess I'm coming to this late, but I'm pretty sure that all biologists use cluster analysis for is finding out which genes are behaving similarly to one another in a large data set. Then if, for example, all genes from a certain pathway are showing a similar expression pattern, we have a hypothesis which can be tested further.

If cluster analysis has indeed been "over-sold", please suggest a better algorithm for summarising groups of genes that are behaving similarly across a group of experiments or time-points :-)

M
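
One simple way to make "behaving similarly" concrete without building a dendrogram is to correlate each gene's profile with that of a gene of interest. The Python sketch below is only an illustration of that idea. The function names, the 0.9 threshold, and the toy profiles are assumptions, and with as few as five arrays high correlations can easily arise by chance.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def similar_genes(expr, target, threshold=0.9):
    """Names of genes whose profile correlates with `target` at or above `threshold`."""
    return sorted(gene for gene in expr
                  if gene != target
                  and pearson(expr[gene], expr[target]) >= threshold)


# Toy profiles over five arrays or time-points (values are invented).
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],   # same shape as g1, different scale
    "g3": [5, 4, 3, 2, 1],    # opposite trend
    "g4": [1, 3, 2, 5, 4],    # loosely similar
}
print(similar_genes(expr, "g1"))
```

This produces a candidate list rather than a partition of all genes, so any pathway-level pattern it suggests would still need to be tested as a hypothesis.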

ADD COMMENT • link 20.6 years ago michael watson IAH-C ★ 3.4k

On Monday 13 September 2004 10:36, michael watson (IAH-C) wrote:

> I guess I'm coming to this late, but I'm pretty sure that all biologists
> use cluster analysis for is finding out which genes are behaving similarly
> to one another in a large data set. Then if, for example, all genes from a
> certain pathway are showing a similar expression pattern, we have a
> hypothesis which can be tested further.
>
> If cluster analysis has indeed been "over-sold", please suggest a better
> algorithm for summarising groups of genes that are behaving similarly
> across a group of experiments or time-points :-)

Oh, but that is one problem I was referring to: say you use UPGMA; then, you will get a dendrogram; then, you can make up any story. That was one of my concerns. Clustering gives you clusters, but most papers I've seen that "use" clustering do not seem to be overly concerned about how meaningful and repeatable those clusters are.

Related to the above, and to clustering being over-sold, is the fact that very rarely does one find discussion in those papers about how the type of clustering algorithm affects the results, and how different clustering algorithms, different metrics, etc, can relate to the prior beliefs about the shape of clusters (or how different clustering algorithms are better at detecting different patterns).

And finally, it is not always clear that the distinction between exploratory and confirmatory analysis is being made. We can read sentences such as "the clustering results show that there are two groups"... Well, in what sense and how do the results from some agglomerative clustering algorithm show that there are two groups (and not twenty)?

But, again, I do think clustering has a role for certain types of questions. I just think it is not the magic bullet to "let the data speak for themselves", and similar marketing hype.

Best,

R.
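
The point that the choice of algorithm affects the result can be seen even on a handful of points. The Python sketch below uses a deliberately naive agglomerative routine, with a made-up name `agglomerate` and invented data, to cut the same six points into two groups with single linkage and with complete linkage. It is meant only as an illustration and is far too slow for thousands of objects.

```python
def agglomerate(points, k, linkage):
    """Naive agglomerative clustering of `points` down to `k` clusters.

    `linkage` is applied to all pairwise distances between two clusters:
    pass `min` for single linkage or `max` for complete linkage.
    Returns the clusters as sorted lists of point indices.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Six points on a line with slowly growing gaps (values are invented).
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]

single = agglomerate(points, k=2, linkage=min)
complete = agglomerate(points, k=2, linkage=max)
print("single linkage:  ", single)
print("complete linkage:", complete)
```

On these points single linkage chains the first five together and leaves the last one alone, while complete linkage splits them four and two. Neither run says anything about whether two is the right number of groups.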

ADD REPLY • link 20.6 years ago Ramon Diaz ★ 1.1k

Another issue which I do not understand is: Why does everyone use the same hierarchical clustering method and none of the many new clustering methods that exist? To mention a few examples in each clustering category:

Partitioning methods: CLARA or CLARANS
Hierarchical methods: BIRCH or CURE
Density-based methods: DBSCAN, OPTICS or DENCLUE
Grid-based methods: STING, WaveCluster or CLIQUE
Model-based methods: COBWEB or CLASSIT

It would be great to be able to try these novel methods and to know which method would be especially suitable for which purpose.

Best regards
Christian

Ramon Diaz-Uriarte wrote:
> On Monday 13 September 2004 10:36, michael watson (IAH-C) wrote:
>
>>I guess I'm coming to this late, but I'm pretty sure all biologists use
>>cluster analysis for is for finding out which genes are behaving similarly
>>to one another in a large data set. Then if, for example, all genes from a
>
> Oh, but that is one problem I was referring to: say you use UPGMA; then, you
> will get a dendrogram; then, you can make up any story. That was one of my
> concerns. Clustering gives you clusters, but most papers I've seen that "use"
> clustering do not seem to be overly concerned about how meaningful and
> repeatable those clusters are.
>
> Related to the above, and to clustering being over-sold, is the fact that very
> rarely does one find discussion in those papers about how the type of
> clustering algorithm affects the results, and how different clustering
> algorithms/different metrics, etc, can relate to the prior beliefs about the
> shape of clusters (or how different clustering algorithms are better to
> detect different patterns).
>
> And finally, it is not always clear that the difference between exploratory
> and confirmatory is being made. We can read sentences such as "the clustering
> results show that there are two groups"... Well, in what sense and how do the
> results from some agglomerative clustering algorithm show that there are two
> groups (and not twenty)?
>
> But, again, I do think clustering has a role for certain types of questions. I
> just think it is not the magic bullet to "let the data speak for themselves",
> and similar marketing hype.
>
> Best,
>
> R.
>
>>certain pathway are showing a similar expression pattern, we have a
>>hypothesis which can be tested further.
>>
>>If cluster analysis has indeed been "over-sold", please suggest a better
>>algorithm for summarising groups of genes that are behaving similarly
>>across a group of experiments or time-points :-)
>>
>>M
>>
>>-----Original Message-----
>>From: Ramon Diaz-Uriarte [mailto:rdiaz@cnio.es]
>>Sent: 08 September 2004 09:33
>>To: bioconductor@stat.math.ethz.ch
>>Cc: Prof Brian Ripley; cstrato; James W. MacDonald
>>Subject: Re: [BioC] Re: [S] Error in clustering procedure
>>
>>On Tuesday 07 September 2004 21:17, cstrato wrote:
>>
>>>Dear all
>>>
>>>First of all, I want to apologize to Prof. Ripley, since I forgot to
>>>ask him first for permission to publish his comment.
>>>
>>>Personally, I agree that this would be useless, as Prof. Ripley has
>>>already told me some years ago. However, almost everybody still seems
>>>to do it and publish the corresponding results. Companies such as
>>>Spotfire are proud that you can do hierarchical clustering with more
>>>than 20,000 genes. I have never seen a publication where it was done
>>>differently.
>>
>>Part of this could be the result of imitative behavior, beliefs that
>>"unless I put a neat heatmap I won't get it past reviewers", etc, but not
>>evidence that it is the best way to go. If several companies are making an
>>issue out of it in their brochures, maybe it is because customers ask for
>>clustering. As for "publish the corresponding results" I am not sure what
>>the "results" are, since after clustering 7000 genes you can almost always
>>make up a story after the fact; but I would not call that a result.
>>
>>I think clustering (and biclustering) do have a place, but I guess they
>>should be used as a tool to answer some question (e.g., I think I
>>understand what question a t-test is helping to answer; I am not sure about
>>most clustering procedures), or as a guidance for something, not as some
>>kind of magic tool to "let the data speak for themselves" ( = a) get the
>>microarray data; b) run a clustering procedure; c) come up with a question
>>that your cluster "answered".)
>>
>>R.
>>
>>>I think that the bioconductor list would be the best forum to discuss
>>>this issue, and provide solutions (besides the obvious suggestion to
>>>filter non-varying genes).
>>>
>>>Best regards
>>>Christian
>>>
>>>James W. MacDonald wrote:
>>>
>>>>cstrato wrote:
>>>>
>>>>>Sorry, but I cannot resist:
>>>>>
>>>>>Any comments of the microarray community on the usefulness of
>>>>>hierarchical clustering of 7000 genes?
>>>>
>>>>I think this would be almost completely useless. First off,
>>>>clustering is not an inferential technique, so its use has been
>>>>completely oversold IMO to the biological community. Secondly,
>>>>clustering is usually done to produce a 'heat map' to put in a paper
>>>>or flash on the screen during a presentation. How on earth would
>>>>this be of any use? You couldn't even read any of the gene names!
>>>>
>>>>Of course you could use the heatmap to impress friends and
>>>>colleagues with the fact that you rate a computer powerful enough to
>>>>*do* a heatmap with a 7000 x 5 matrix ;-D
>>>>
>>>>Jim

ADD REPLY • link 20.6 years ago cstrato ★ 3.9k


On Mon, 13 Sep 2004, michael watson (IAH-C) wrote:

> I guess I'm coming to this late,

You are, yet have overlooked important points in later parts of the thread.

> but I'm pretty sure all biologists use
> cluster analysis for is for finding out which genes are behaving
> similarly to one another in a large data set.

Really? Have you never seen a heatmap with clustering on the margins? There clustering is being used for seriation.

> Then if, for example, all
> genes from a certain pathway are showing a similar expression pattern,
> we have a hypothesis which can be tested further.
>
> If cluster analysis has indeed been "over-sold", please suggest a better
> algorithm for summarising groups of genes that are behaving similarly
> across a group of experiments or time-points :-)

My point was about methods/algorithms for cluster analysis, as I have already replied in this thread. But MDS-like methods (note, not algorithms) are better for your stated purpose.

--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595

ADD REPLY • link 20.6 years ago ripley@stats.ox.ac.uk ▴ 50


michael watson IAH-C ★ 3.4k

@michael-watson-iah-c-378

Last seen 10.7 years ago

Great, that's what I was looking for!

Personally, I use cluster analysis sparingly and as a very "exploratory" tool. I think, though I may be wrong, that most biologists realise its limitations. I also think that it is not "completely useless", and perhaps if people do think a method is useless, they should suggest an alternative, which you have.

Thank you!

M

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
Sent: 13 September 2004 10:03
To: michael watson (IAH-C)
Cc: Ramon Diaz-Uriarte; bioconductor@stat.math.ethz.ch; cstrato; James W. MacDonald
Subject: RE: [BioC] Re: [S] Error in clustering procedure

ADD COMMENT • link 20.6 years ago michael watson IAH-C ★ 3.4k


michael watson IAH-C ★ 3.4k

@michael-watson-iah-c-378

Last seen 10.7 years ago

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]

>But MDS-like methods (note, not algorithms) are better for your stated
>purpose.

Hi

Just thinking out loud here, which can be a painful process...

So MDS/PCA is an exercise in dimension reduction. Therefore, if we reduce the dimensionality of the dataset to few(er) dimensions which explain most of the variability, then order the data set by those dimensions, then that will place together genes (in the list) which are behaving similarly - is that what you are suggesting?

ADD COMMENT • link 20.6 years ago michael watson IAH-C ★ 3.4k
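A minimal sketch of the ordering idea raised in the post above, offered as one possible reading of the question rather than as the method Prof. Ripley had in mind. It assumes genes are rows and arrays are columns, uses plain Python instead of S-PLUS/R, approximates only the first principal component (by power iteration), and sorts genes by their score on it. For Euclidean distances, classical MDS gives the same configuration as PCA up to sign, which is why PCA stands in for "MDS-like" here. The toy data and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def center_columns(rows):
    """Subtract each column mean so every column averages to zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def first_pc_scores(rows, iters=100, seed=0):
    """Score each row on the first principal axis (power iteration on X^T X)."""
    x = center_columns(rows)
    p = len(x[0])
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(p)]  # arbitrary starting direction
    for _ in range(iters):
        scores = [dot(r, v) for r in x]                      # X v
        w = [dot(scores, [r[j] for r in x]) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # all rows identical: no direction to find
            break
        v = [c / norm for c in w]
    return [dot(r, v) for r in x]


def order_by_first_pc(names, rows):
    """Return gene names sorted by first-PC score (the sign is arbitrary)."""
    scores = first_pc_scores(rows)
    return [name for _, name in sorted(zip(scores, names))]


# Hypothetical toy data: two rising genes, two falling genes, one flat gene.
names = ["up1", "down1", "flat", "up2", "down2"]
rows = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [3.0, 3.0, 3.0, 3.0, 3.0],
    [1.1, 2.1, 2.9, 4.2, 5.1],
    [5.2, 3.9, 3.1, 2.0, 0.9],
]
print(order_by_first_pc(names, rows))
```

On this toy input the two rising genes end up next to each other, the two falling genes end up next to each other, and the flat gene sits between the pairs; which pair comes first depends on the arbitrary sign of the component. A single dimension only captures the dominant pattern, so real data would usually need more than one component to separate groups.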

"Dimension reduction" brings up another important issue: I have had discussions with quite a few scientists who believe that dimension reduction is not allowed, since you are losing worthwhile information.

With respect to gene expression, I believe that it makes sense to first filter out non-varying genes to reduce the number of dimensions.

But these people are using hierarchical clustering to cluster chemical compound libraries in "chemical space", and there are no compounds to eliminate.

So another question is: which method would be best for clustering about one million compounds in chemical space, in order to reduce the number of compounds used in screening by selecting only representative members of a certain cluster?

Best regards
Christian

michael watson (IAH-C) wrote:
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
>
>>But MDS-like methods (note, not algorithms) are better for your stated
>>purpose.
>
> Hi
>
> Just thinking out-loud here, which can be a painful process...
>
> So MDS/PCA is an exercise in dimension reduction. Therefore, if we
> reduce the dimensionality of the dataset to few(er) dimensions which
> explain most of the variability, then order the data set by those
> dimensions, then that will place together genes (in the list) which are
> behaving similarly - is that what you are suggesting?

ADD REPLY • link 20.6 years ago cstrato ★ 3.9k
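A small sketch of the two points in the post above: filtering out non-varying genes before clustering, and why the same escape route is not available for a library of about one million compounds. It assumes the pairwise distances are stored as 8-byte doubles in a condensed (lower-triangle) matrix, which is consistent with the 187 Mb figure quoted earlier in the thread for 7000 objects. The variance threshold, the toy data, and the function names are hypothetical, and the code is plain Python rather than S-PLUS/R.

```python
from statistics import pvariance


def filter_by_variance(expr, min_var):
    """Keep only genes whose variance across arrays is at least min_var."""
    return {gene: values for gene, values in expr.items()
            if pvariance(values) >= min_var}


def condensed_dist_bytes(n, bytes_per_value=8):
    """Bytes needed for the n*(n-1)/2 pairwise distances among n objects."""
    return n * (n - 1) // 2 * bytes_per_value


# Hypothetical toy expression data: gene -> values on five arrays.
expr = {
    "g1": [1.0, 1.0, 1.0, 1.0, 1.0],   # constant
    "g2": [1.0, 5.0, 1.0, 5.0, 1.0],   # clearly varying
    "g3": [2.0, 2.1, 2.0, 1.9, 2.0],   # nearly constant
}
kept = filter_by_variance(expr, min_var=0.5)  # threshold is an assumption
print(sorted(kept))                                    # ['g2']

# Storage grows quadratically with the number of objects clustered.
print(condensed_dist_bytes(7000) / 2**20)              # about 187 MiB
print(condensed_dist_bytes(1_000_000) / 1e12)          # about 4 TB
```

Filtering helps for genes because it lowers the number of objects before the quadratic cost applies. With a million compounds and nothing to drop, any method that needs the full distance matrix would require roughly 4 TB for the distances alone, which suggests looking at methods that avoid materialising all pairwise distances.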


Liaw, Andy ▴ 360

@liaw-andy-125

Last seen 10.7 years ago

> From: cstrato
>
> "Dimension reduction" brings up another important issue:
> I had discussions with quite a few scientists who believe
> that dimension reduction is not allowed, since you are
> losing worthwhile information.

Eh? By this logic, we shouldn't believe any conclusions drawn in any paper that does not contain the rawest of raw data? Part of data analysis is summarizing data into the bare essentials (have you heard of `sufficient statistics'? If not, it might be worth your while) and extracting useful information from data that contain noise. People who make statements like that probably believe there's no such thing as noise in their data. May God have mercy on them.

> With respect to gene expression I believe that it makes
> sense to filter first non-variant genes to reduce the
> number of dimensions.
>
> But..., these people are using hierarchical clustering
> to cluster chemical compound libraries in "chemical space",
> and there are no compounds to eliminate.

Who are `these people' now? Seems like you're changing the subject to one that's probably off-topic for BioC.

> So, another question is, which method would be best to
> cluster about one million compounds in chemical space in
> order to be able reduce the number of compounds used in
> screening by selecting only representative members of a
> certain cluster.

There's quite a bit of work done on this subject in the computational chemistry literature. The context is really quite different from gene expression. Molecules are clustered based on their chemical structures (which are known), and those data are not measured (usually), but computed, so there are no measurement errors. The goal is also quite different. I have not heard of anyone trying to find `representative genes' (but I'm not familiar with bioinformatics; maybe someone _would_ be interested in that?).

Andy

ADD COMMENT • link 20.6 years ago Liaw, Andy ▴ 360


Liaw, Andy wrote:
>>From: cstrato
>>
>>"Dimension reduction" brings up another important issue:
>>I had discussions with quite a few scientists who believe
>>that dimension reduction is not allowed, since you are
>>losing worthwhile information.
>
> Eh? By this logic, we shouldn't believe any conclusions drawn in any paper
> that does not contain the rawest of raw data? Part of data analysis is
> summarizing data into the bare essentials (have you heard of `sufficient
> statistics'? If not, might worth your while) and extracting useful
> information from data that contain noise. People who make statements like
> that probably believe there's no such thing as noise in their data. May God
> have mercy on them.

I have mentioned this only to show that it is still sometimes hard to argue; mentioning "sufficient statistics" could be helpful.

>>With respect to gene expression I believe that it makes
>>sense to filter first non-variant genes to reduce the
>>number of dimensions.
>>
>>But..., these people are using hierarchical clustering
>>to cluster chemical compound libraries in "chemical space",
>>and there are no compounds to eliminate.
>
> Who are `these people' now? Seems like you're changing the subject to one
> that's probably off-topic for BioC.

I would not consider this off-topic but a natural extension:

"expression profiling -> compound profiling -> compound activity profiling -> compound structure profiling"

All these steps share the same problem: what is the best clustering algorithm to use (if there is any)? Furthermore, it is my belief that in the future these steps will be analyzed together, resulting in a much deeper understanding.

P.S.: Looking at the BioC packages, BioC is already expanding to include proteomics analysis. It would be a natural step for BioC to expand further to cover chemoinformatics.

>>So, another question is, which method would be best to
>>cluster about one million compounds in chemical space in
>>order to be able reduce the number of compounds used in
>>screening by selecting only representative members of a
>>certain cluster.
>
> There's quite a bit of work done on this subject in the computational
> chemistry literature. The context is really quite different from gene
> expression. Molecules are clustered based on their chemical structures
> (which are known), and those data are not measured (usually), but computed,
> so there's no measurement errors. The goal is also quite different. I have
> not heard of anyone trying to find `representative genes' (but I'm not
> familiar with bioinformatics--- maybe someone _would_ be interested in
> that?).
>
> Andy

Christian

ADD REPLY • link 20.6 years ago cstrato ★ 3.9k


Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 8.0 years ago

Perhaps because they don't add anything beyond the simple and broadly understood method?

-----Original Message-----
From: cstrato
To: Ramon Diaz-Uriarte
Cc: Prof Brian Ripley; James W. MacDonald; bioconductor@stat.math.ethz.ch
Sent: 13/09/04 20:26
Subject: Re: [BioC] Re: [S] Error in clustering procedure

Another issue which I do not understand is: Why does everyone use the same hierarchical clustering method and none of the many new clustering methods that exist? To mention a few examples in each clustering category:

Partitioning methods: CLARA or CLARANS
Hierarchical methods: BIRCH or CURE
Density-based methods: DBSCAN, OPTICS or DENCLUE
Grid-based methods: STING, WaveCluster or CLIQUE
Model-based methods: COBWEB or CLASSIT

It would be great to be able to try these novel methods and to know which method would be especially suitable for which purpose.

Best regards
Christian

ADD COMMENT • link 20.6 years ago Stephen Henderson ★ 1.0k
