Question about clustering and cluster validation


January Weiner ▴ 370

@january-weiner-3999

Last seen 10.2 years ago

Dear all,

in short, I would like to decide whether a certain data set contains sub-groups (clusters), or is uniform.

There are roughly 500 features and 50 samples. I am looking for clusters of samples.

There is a clear division in a small number of features (3-4) indicating the existence of subgroups, and a much less clear situation in many other features. Pvclust, which I use preferentially (mostly because it gives me a p-value surrogate), indicates two main clusters with AU p-values of 99 and 98, and BP p-values of 0 and 1, respectively.

Clustering with other methods gives contradictory results. I have tried MClust and several "regular" methods. In short, I am not really sure.

On a PCA plot using all features, two clusters can be seen, but they are not clearly separated. If I assign the samples to the clusters identified by pvclust and apply randomForest, I can distinguish between the classes fairly well, but that seems like something one should rather not do.

There is an additional complication: for some particular features, there is a pre-defined clustering (male vs female). However, the clusters I am considering are not related to the difference between sexes.

Is there a statistical test available that would compare the null hypothesis "there are no sub-clusters" with the alternative hypothesis "there are two clusters", or "there are no sub-clusters" with "there are these two particular clusters"?

I was thinking along the following lines: perform X random divisions of the samples. For each division, perform a t-test for each feature and record its significance. Then see whether the proposed division is significantly better than the random divisions, the statistic being "number of significantly different features" or something similar.

Best regards,

January

--
Dr. January Weiner 3
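The random-division idea in the last paragraph of the question can be sketched roughly as follows. This is only an illustration under stated assumptions, not a method endorsed anywhere in this thread: it is written in plain Python rather than R, the function names (`welch_t`, `count_separating_features`, `permutation_p_value`) are made up for the example, and a fixed cutoff on the absolute Welch t statistic (`t_cutoff=2.0`) stands in for "significant at roughly the 5% level" instead of a proper per-feature p-value.

```python
import random
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = sqrt(var_a / len(a) + var_b / len(b))
    return 0.0 if se == 0 else (mean_a - mean_b) / se


def count_separating_features(data, labels, t_cutoff=2.0):
    """Count features whose |t| between the two groups exceeds t_cutoff.

    data: list of samples, each a list of feature values.
    labels: list of 0/1 group labels, one per sample.
    t_cutoff is an assumed stand-in for a per-feature significance test.
    """
    count = 0
    for j in range(len(data[0])):
        g0 = [row[j] for row, lab in zip(data, labels) if lab == 0]
        g1 = [row[j] for row, lab in zip(data, labels) if lab == 1]
        if abs(welch_t(g0, g1)) > t_cutoff:
            count += 1
    return count


def permutation_p_value(data, labels, n_perm=200, t_cutoff=2.0, seed=0):
    """Compare the proposed division with random divisions of the same sizes."""
    rng = random.Random(seed)
    observed = count_separating_features(data, labels, t_cutoff)
    shuffled = list(labels)  # copy, so the caller's labels are not modified
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random division with the same group sizes
        if count_separating_features(data, shuffled, t_cutoff) >= observed:
            hits += 1
    # Add-one correction: the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic stand-in data: 50 samples, 30 features, 4 of them shifted.
    rng = random.Random(42)
    labels = [0] * 25 + [1] * 25
    data = [[rng.gauss(3.0 * lab if j < 4 else 0.0, 1.0) for j in range(30)]
            for lab in labels]
    observed, p = permutation_p_value(data, labels)
    print("features separating the proposed groups:", observed)
    print("permutation p-value estimate:", p)
```

One caveat applies to any scheme of this kind: if the proposed division was itself obtained by clustering the same data, it was chosen to separate the samples, so it will tend to beat random divisions even when no real subgroups exist. The comparison is therefore optimistic unless the division comes from an independent source.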


ADD COMMENT • link updated 14.0 years ago by Robert Chapman ▴ 20 • written 14.0 years ago by January Weiner ▴ 370


Robert Chapman ▴ 20

@robert-chapman-4356

Last seen 10.2 years ago

Have you tried the freeware program called WEKA?

Bob

________________________________________
From: bioconductor-bounces@stat.math.ethz.ch [bioconductor-bounces@stat.math.ethz.ch] On Behalf Of January Weiner [january.weiner@mpiib-berlin.mpg.de]
Sent: Friday, November 19, 2010 7:58 AM
To: BioC
Subject: [BioC] Question about clustering and cluster validation

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.0 years ago Robert Chapman ▴ 20


Thanks for the suggestion. This looks really interesting, but I would rather stick to R/Bioconductor, as Weka seems to be a whole new environment which, at least partially, implements the same algorithms that can be found in R.

Cheers,
j.

On Fri, Nov 19, 2010 at 2:08 PM, Robert Chapman <chapmanr at dnr.sc.gov> wrote:
> Have you tried the freeware program called WEKA?
> Bob

--
Dr. January Weiner 3
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin, Germany
Web: www.mpiib-berlin.mpg.de
Tel: +49-30-28460514

ADD REPLY • link 14.0 years ago January Weiner ▴ 370


IIRC, BayesClust (or some variation on the naming theme) implements assorted tests for "one homogeneous population" versus "more than one subpopulation" (aka cluster).

Personally I have been using the PMA package to do semi-supervised feature selection. Since some of my measurements are dichotomous, some are proportions, and some are quasi-normal (or at least gamma), I have also been using the new Modalclust package (along with the 'seriation' package for ordering within groups) to try to identify relevant dimensions along which the clustering is informative.

The nice thing about something like PMA/CCA is that you can "lift out" canonical vectors that are obviously technical or otherwise irrelevant to the question at hand, then proceed hunting for interesting relationships. By varying the shrinkage intensity and/or choosing outcomes of interest, technical variations of non-interest, etc., one can make a lot of progress fast using CCA.

On Fri, Nov 19, 2010 at 6:30 AM, January Weiner <january.weiner@mpiib-berlin.mpg.de> wrote:
> Thanks for the suggestion. This looks really interesting, but I would
> rather stick to R/Bioconductor, as Weka seems to be a whole new
> environment which, at least partially, implements the same algorithms
> that can be found in R.
>
> Cheers,
> j.

--
If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.
John von Neumann <http://www-groups.dcs.st-and.ac.uk/%7ehistory/biographies/von_neumann.html>

ADD REPLY • link 14.0 years ago Tim Triche ★ 4.2k
