Hierarchical clustering and shrinking centroids...


Tan, MinHan ▴ 180

@tan-minhan-431


Dear list members,

I have been unable to resolve this conceptual problem. I performed hierarchical clustering on a filtered sample (cv=0.04, at least 2 samples > level of log 9) of 80 tumor samples, and obtained several groups. Some of these clusters were definitely more stable than others. Subsequently, based on visual inspection and my knowledge of the case outcomes, I arbitrarily classified one large cluster as 'good prognosis' and the other clusters as 'bad prognosis'.

Using this classification, I did a supervised analysis with PAMR to obtain a gene list. The misclassification rate during cross-validation for my good prognosis cases is fairly low and stable (<0.05) throughout the shrinking gene list, but the misclassification rate for my poor prognosis cases is relatively higher, and also fairly stable (approx. 0.2). I examined the classification of my cases, and some 'poor prognosis' cases seemed to be persistently recognized as 'good prognosis' cases. Evidently, there is some problem with the classification arising from the choice of algorithm. I have tried k-nearest neighbour, and the same problem occurs.

Looking at the HC tree again, some of these good/bad prognosis genes are clustered together, suggesting other genes ... I wonder how I may explain this. I suppose the clustering of these cases is determined by genes other than those differentiating between these two major groups.

Naturally, validation on an independent set is ideal, but my question is more about this problem of cross-validation. I would appreciate any advice, or pointers to any references for this!

Thanks.

Min-Han Tan
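For readers following along, the per-class misclassification rates quoted above can be tabulated from held-out predictions roughly as follows. This is only a sketch in plain Python, not PAMR's own output: the function name `per_class_error` and the label vectors are made up for illustration, and the predictions are assumed to have been pooled over the cross-validation folds already.

```python
from collections import defaultdict


def per_class_error(true_labels, predicted_labels):
    """Return {class: fraction of that class's samples that were misclassified}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if pred != truth:
            errors[truth] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Made-up example: 20 'good' and 10 'poor' cases with held-out predictions.
truth = ["good"] * 20 + ["poor"] * 10
pred = ["good"] * 19 + ["poor"] * 1 + ["poor"] * 8 + ["good"] * 2

rates = per_class_error(truth, pred)
for label, rate in sorted(rates.items()):
    print(f"{label}: {rate:.2f}")
```

Reporting the error separately per class, rather than as one overall rate, is what makes the asymmetry between the two groups visible, especially when the classes differ in size.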

Classification Clustering pamr • 1.4k views

updated 20.9 years ago by Stephen Henderson ★ 1.0k • written 20.9 years ago by Tan, MinHan ▴ 180


Tom R. Fahland ▴ 60

@tom-r-fahland-616


Tan,

I have been doing a lot of classification using PAMR, as well as LDA and SVMs. The overused phrase "the data is what it is" is valid here. I look at highly correlated samples that misclassify, and they are usually the same with different classification algorithms. Sometimes I don't get really good stability with different gene lists either. HC clustering uses simple correlation metrics, so starting from this can be problematic.

I know I really didn't answer anything, but thought sharing my experience might help.

Tom

-----Original Message-----
From: Tan, MinHan [mailto:MinHan.Tan@vai.org]
Sent: Monday, May 24, 2004 18:57
To: bioconductor@stat.math.ethz.ch
Subject: [BioC] Hierarchical clustering and shrinking centroids...

_______________________________________________
Bioconductor mailing list Bioconductor@stat.math.ethz.ch https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
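The check described in this reply, looking for samples that misclassify regardless of the algorithm, could be sketched along these lines. It is plain Python with invented data: the function name `consistently_misclassified`, the method names used as dictionary keys, and the label vectors are all hypothetical, and the predictions are assumed to come from held-out cross-validation folds.

```python
def consistently_misclassified(true_labels, predictions_by_method):
    """Return sorted indices of samples that every method gets wrong."""
    wrong_sets = []
    for preds in predictions_by_method.values():
        wrong = {i for i, (t, p) in enumerate(zip(true_labels, preds)) if t != p}
        wrong_sets.append(wrong)
    if not wrong_sets:
        return []
    return sorted(set.intersection(*wrong_sets))


# Made-up labels for five samples and three classifiers.
truth = ["good", "good", "poor", "poor", "poor"]
predictions = {
    "pam": ["good", "good", "good", "poor", "poor"],
    "lda": ["good", "poor", "good", "poor", "poor"],
    "svm": ["good", "good", "good", "poor", "good"],
}

stubborn = consistently_misclassified(truth, predictions)
print(stubborn)  # indices of samples misclassified by all three methods
```

Samples that show up in this list for every method are more likely to reflect a problem with the labels or the data than with any one algorithm.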

20.9 years ago Tom R. Fahland ▴ 60


Stephen Henderson ★ 1.0k

@stephen-henderson-71


Yes, I'm not sure why you started with the clustering either (though it suggests that you are on the right track). You should classify the samples based on their actual outcome, not on whether they are in the imperfect good or bad cluster, and then try PAMR. Forgive me if I've misunderstood you.

There is a useful guide to using classification on array data (using e1071 svm) under the short courses page on Bioconductor, the Heidelberg Course Sept 2002. I found this helpful when getting started in R. The guide to the ipred package is also excellent.

Stephen

PS: An error of 0.2 is reasonable, I think, for a tumour prognosis. No?

-----Original Message-----
From: Tom R. Fahland
To: Tan, MinHan; bioconductor
Sent: 5/26/04 12:52 AM
Subject: RE: [BioC] Hierarchical clustering and shrinking centroids...
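One quick way to see how "imperfect" cluster-derived labels are is to cross-tabulate them against the recorded outcomes before training anything. The sketch below is plain Python with made-up labels; the function name `cross_tabulate` and both label vectors are assumptions for illustration, not data from this thread.

```python
from collections import Counter


def cross_tabulate(cluster_labels, outcome_labels):
    """Count samples for each (cluster label, actual outcome) pair."""
    return Counter(zip(cluster_labels, outcome_labels))


# Made-up example: six samples, labelled once by cluster and once by outcome.
cluster = ["good", "good", "good", "bad", "bad", "bad"]
outcome = ["good", "good", "bad", "bad", "bad", "good"]

table = cross_tabulate(cluster, outcome)
for (c, o), n in sorted(table.items()):
    print(f"cluster={c:<4} outcome={o:<4} n={n}")

# Fraction of samples where the cluster label matches the actual outcome.
agreement = sum(n for (c, o), n in table.items() if c == o) / len(cluster)
print(f"agreement: {agreement:.2f}")
```

If the agreement is well below 1, a classifier trained on the cluster labels is learning to reproduce the clustering rather than to predict outcome, which is the concern raised above.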

20.9 years ago Stephen Henderson ★ 1.0k
