supervised & unsupervised analysis of samples of microarray data


wenhuo hu ▴ 200


Hi all,

I have recently been analyzing microarray data. The samples fall into several groups that represent different disease subtypes. Here is what I did: I identified the significant genes, extracted the expression levels of those genes, and ran a cluster analysis using the gplots package in Bioconductor/R.

My problem is that the cluster analysis did not group the samples well according to disease subtype. I assumed this was a question of supervised versus unsupervised clustering. From what I have read online, that does not seem quite right, because supervised analysis appears to describe how to classify new samples based on previous data. Then there is the concept of semi-supervised analysis, and at this point I am confused.

Are analysis methods such as PAM, SOM, and k-means supervised or semi-supervised clustering? Could anyone take the time to clarify supervised, semi-supervised, and unsupervised analysis for me, and recommend any Bioconductor packages that might help me group the samples according to disease subtype?

I like programming and have a biology/medicine background, with relatively limited bioinformatics experience. Any interpretation is welcome.

Thanks!

Wenhuo Hu
Park lab
Memorial Sloan Kettering Cancer Center
Zuckerman Research Building
408 East 69th Street
Room ZRC-527
New York, NY 10065
Phone 646-888-3220
huw@mskcc.org

Cancer • 4.2k views

updated 13.0 years ago by Steve Lianoglou ★ 13k • written 13.0 years ago by wenhuo hu ▴ 200


Tim Triche ★ 4.2k


Supervised: all samples are labeled (e.g. by disease subtype).
Semi-supervised: only some of the samples are labeled, or the labels are expressed in terms of probabilities.
Unsupervised: there are no labels, and the method of choice is supposed to discover them.

From the above I hope it will be clear that partitioning around medoids, self-organizing maps, and k-means are all unsupervised. ANOVA, SVMs, multinomial logistic regression, and linear discriminant analysis are examples of supervised methods.

For a useful paper describing differences between the two, and a model-based semi-supervised approach, you might like this JSS paper: http://www.jstatsoft.org/v47/i03

A far more in-depth treatment can be freely perused at the website for Elements of Statistical Learning (2nd edition): http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Hope this helps.

--t

--
*A model is a lie that helps you see the truth.*
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
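To make the distinction above concrete, here is a minimal sketch in R. It is an illustration only, not anything posted in this thread: the data are simulated, and the names `expr`, `subtype`, and `new_sample` are made up for the example. k-means is given the expression values alone, while linear discriminant analysis (via `lda()` from the MASS package) is trained on the labels and then asked to classify a new sample.

```r
# Simulated data only: 30 samples x 2 "genes", three made-up subtypes.
library(MASS)  # provides lda(); a recommended package shipped with R

set.seed(1)
subtype <- factor(rep(c("A", "B", "C"), each = 10))
centers <- rbind(A = c(0, 0), B = c(6, 0), C = c(0, 6))
expr <- centers[as.character(subtype), ] +
  matrix(rnorm(60, sd = 0.5), ncol = 2)
dimnames(expr) <- list(paste0("sample", 1:30), c("gene1", "gene2"))

# Unsupervised: k-means sees only the expression values, never the labels.
km <- kmeans(expr, centers = 3, nstart = 20)
# Compare the discovered clusters with the known labels after the fact.
unsup_table <- table(cluster = km$cluster, subtype = subtype)

# Supervised: LDA is fitted using the labels, then predicts a label
# for a sample it has not seen before.
fit <- lda(expr, grouping = subtype)
new_sample <- matrix(c(5.8, 0.3), nrow = 1,
                     dimnames = list("new", colnames(expr)))
predicted <- predict(fit, new_sample)$class

print(unsup_table)
print(predicted)
```

In this toy setting the groups are far apart, so the unsupervised clusters happen to match the labels. With real expression data there is no guarantee that the strongest structure in the data corresponds to disease subtype, which is one reason a clustering can disagree with the known labels.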

13.0 years ago • Tim Triche ★ 4.2k


Steve Lianoglou ★ 13k


Hi,

Before we get into the weeds over supervised vs. unsupervised learning, I'm curious: how is your data clustering? Does the clustering reflect a technical artifact (batch effect) rather than the biological "effect" you are trying to see? See, for example:

Tackling the widespread and critical impact of batch effects in high-throughput data
http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html

Are you trying to do this differential expression/clustering thing as a QC thing? A "gene signature" thing?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
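One way to answer the "how is your data clustering?" question is to cross-tabulate the cluster assignments against both the biological label and the processing batch. The sketch below is only an illustration on simulated data: `subtype` and `batch` are hypothetical sample annotations, and the matrix is built with a gene-specific batch shift and no subtype signal at all. It uses `dist()` and `hclust()` with their default settings, which are also the defaults that `heatmap.2` in gplots uses for its dendrograms.

```r
# Simulated data only: 'subtype' and 'batch' are hypothetical annotations.
set.seed(2)
n_genes <- 200
subtype <- factor(rep(c("A", "B"), times = 6))   # biological label
batch   <- factor(rep(c("b1", "b2"), each = 6))  # processing batch

# genes x samples matrix of pure noise ...
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("g", 1:n_genes), paste0("s", 1:12)))
# ... plus a gene-specific shift applied to every sample in batch b2
batch_shift <- rnorm(n_genes, sd = 2)
expr[, batch == "b2"] <- expr[, batch == "b2"] + batch_shift

# Cluster the samples (columns): Euclidean distance, complete linkage
hc <- hclust(dist(t(expr)))
cl <- cutree(hc, k = 2)

# Which annotation do the two clusters follow?
by_batch   <- table(cluster = cl, batch = batch)
by_subtype <- table(cluster = cl, subtype = subtype)
print(by_batch)
print(by_subtype)
```

If the clusters line up with batch and are evenly mixed with respect to subtype, as they are by construction here, the dendrogram is mostly showing a technical effect. For a figure, `heatmap.2` accepts a `ColSideColors` argument that can color the sample columns by either annotation.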

13.0 years ago • Steve Lianoglou ★ 13k


The above is a really good point. That said, I've seen plenty of non-batch-affected data that lines up by molecular mechanism rather than by disease subtype. If you know which batch each sample belongs to, you can use ComBat (in the 'sva' package) to clean up a lot.
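A minimal sketch of the ComBat suggestion, assuming a genes x samples expression matrix and a known batch for every sample. Everything here is simulated: `expr`, `batch`, and `subtype` are hypothetical names, and passing the subtype design matrix as `mod` is one common choice for preserving the biological variable of interest, not something prescribed in this thread. The call is guarded so the script still runs when sva is not installed.

```r
# Simulated data only; 'batch' and 'subtype' are hypothetical annotations.
set.seed(3)
n_genes <- 100
batch   <- factor(rep(c("b1", "b2"), each = 6))
subtype <- factor(rep(c("A", "B"), times = 6))

expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("g", 1:n_genes), paste0("s", 1:12)))
# Gene-specific shift for every sample processed in batch b2
expr[, batch == "b2"] <- expr[, batch == "b2"] + rnorm(n_genes, sd = 2)

# Design matrix for the variable to preserve while adjusting for batch
mod <- model.matrix(~ subtype)

if (requireNamespace("sva", quietly = TRUE)) {
  # ComBat expects genes in rows and samples in columns
  adjusted <- sva::ComBat(dat = expr, batch = batch, mod = mod)
} else {
  adjusted <- NULL
  message("The sva package (Bioconductor) is needed to run ComBat.")
}
```

Note that this only works when batch and subtype are not completely confounded. If every sample of one subtype was processed in a single batch, no adjustment can separate the two effects.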

13.0 years ago • Tim Triche ★ 4.2k


Thank you very much, Steve and Tim.

My samples really are from patients with defined subtypes. The tricky part is that we have very limited cell numbers: the cells are FACS-sorted, so the samples are preamplified before hybridization. I think it is likely that there are batch effects, which could arise from sample sorting, RNA preparation, and preamplification. But we do believe these specially sorted cells can at least provide a gene list whose top genes we can follow up experimentally, for example with an shRNA screen.

As both Steve and Tim pointed out, batch effects are a concern. For me, though, the gene list is not really affected by this cluster analysis; I just want to show a better cluster analysis figure. Please do not question this for now. I think there are gaps in understanding between bioinformatic analysis, basic experimental studies, and clinical/basic medical studies when it comes to large-scale data analysis.

Thanks for the suggested paper, though. I'll look into ANOVA, SVMs, multinomial logistic regression, and linear discriminant analysis, and leave the batch effect question for now.

Thanks again to Tim, from Southern California (I hope that is right, I just searched on Google), and to Steve, my next-door neighbor.

Wenhuo

13.0 years ago • wenhuo hu ▴ 200
