HOPACH clustering of genes

Nathan Harmston ▴ 100

@nathan-harmston-2904

Last seen 10.6 years ago

Hi,

I'm currently trying to run some clustering on some expression arrays and I was wondering about the best way of doing it. I have 81 samples on hgu133plus2 (about 55,000 probesets), which I have filtered down to approximately 10,000 (removing X, Y, low-variability and control probes), and I wanted to try hierarchical clustering on these, both by arrays and by genes. I was planning on using hopach as this seems an easy and obvious choice.

How long would such a lot of comparisons take? I make it something like (81 * 10000)^2 comparisons, and I have a machine with 24 GB of memory. Has anybody ever done something like this before, and how long did it actually take? Given it might take a while, are there any suggestions for how I might decrease the running time? I am already creating the distance matrix prior to clustering.

Why is it better to use cosangle for gene clustering and euclidean distance for arrays? Is there a good reason for this, and why would you use one distance over another?

Many thanks in advance,
Nathan
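For readers comparing the two distances asked about here, the following is a minimal sketch in plain Python, not hopach itself. It assumes "cosangle" means one minus the cosine of the angle between two profiles, and the two gene profiles are hypothetical values made up for illustration. Under that assumption, two genes with the same pattern but different amplitude are essentially at distance zero by cosine angle, while their Euclidean distance grows with the amplitude difference.

```python
import math

def cosangle_distance(x, y):
    """One minus the cosine of the angle between x and y (assumed definition)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical centered expression profiles for two genes across five arrays.
gene_a = [-2.0, -1.0, 0.0, 1.0, 2.0]
gene_b = [3.0 * v for v in gene_a]  # same pattern, three times the amplitude

d_cos = cosangle_distance(gene_a, gene_b)
d_euc = euclidean_distance(gene_a, gene_b)

print("cosangle distance: ", d_cos)
print("euclidean distance:", d_euc)
```

Whether scale-invariance is what you want depends on the question: it may suit genes, where the shape of the profile across samples is usually of interest, and may be less appropriate for arrays, where overall differences in magnitude can be meaningful.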

Clustering hopach • 1.3k views

updated 16.8 years ago by Shannon, William ▴ 20 • written 16.8 years ago by Nathan Harmston ▴ 100

Shannon, William ▴ 20

@shannon-william-2930

Last seen 10.6 years ago

You may want to look at k-means clustering instead of hierarchical clustering if you are interested in genes with correlated expression patterns across the samples. Imposing a hierarchical structure/model on 10,000 genes is probably incorrect: genes A and B may be correlated but independent in terms of function, evolutionary history, pathway, etc.

In terms of how long it takes, you would have to calculate a distance matrix with 10000*(9999)/2 = 49,995,000 elements. My best suggestion is to start the distance calculation and see if it finishes in a reasonable amount of time.

Bill Shannon, PhD
Associate Professor of Biostatistics in Medicine
Washington University in St Louis
President-elect, Classification Society

________________________________________
From: bioconductor-bounces@stat.math.ethz.ch [bioconductor-bounces@stat.math.ethz.ch] On Behalf Of Nathan Harmston [iwanttobeabadger@googlemail.com]
Sent: Monday, July 21, 2008 8:55 AM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] HOPACH clustering of genes
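The pair count quoted in the answer can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the sample and probe counts come from the question, while the 8 bytes per stored distance is an assumption about how the matrix is held in memory, not something stated in the thread.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

BYTES_PER_DOUBLE = 8  # assumption: each distance is stored as an 8-byte float

n_genes = 10_000  # probes left after filtering, from the question
n_arrays = 81     # samples, from the question

gene_pairs = pair_count(n_genes)
array_pairs = pair_count(n_arrays)

# Memory for the lower triangle only, and for a full square matrix.
triangle_gb = gene_pairs * BYTES_PER_DOUBLE / 1e9
square_gb = n_genes * n_genes * BYTES_PER_DOUBLE / 1e9

print(f"gene pairs:  {gene_pairs:,}")
print(f"array pairs: {array_pairs:,}")
print(f"lower triangle: {triangle_gb:.2f} GB, full matrix: {square_gb:.2f} GB")
```

Under that assumption the gene-by-gene distance matrix itself is well under 1 GB, so on a 24 GB machine the open question is more likely running time than storage of the distances.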

written 16.8 years ago by Shannon, William ▴ 20
