significance of "wrong" clustering of differential genes

Benjamin Otto ▴ 830

@benjamin-otto-1519

Last seen 10.4 years ago

Hi,

Please imagine the following situation: for two sample sets (set1, set2), the most differentially expressed genes are identified with limma, using "holm" as the p-value correction. Afterwards a heatmap is drawn for these genes. The procedure looks like this:

    > f <- factor(as.character(pheno[,marker]))
    > design <- model.matrix(~f)
    > fit <- eBayes(lmFit(eSet,design))
    > tab <- topTable(fit, coef=2, number=nrow(eSet), adjust.method="holm")
    > selected <- tab$adj.P.Val < 0.01 & abs(tab$M) >= 1
    > ## print a heatmap for eSet[selected,]

What can lead to a misclassification in the clustering, say one sample of set1 being clustered together with set2? After all, according to the workflow I have explicitly been searching for the genes that should discriminate between the two sets! However, the expression values displayed in the heatmap suggest that this sample IS more similar to the "wrong" set than to the true one (have a look at the jpg).

Is it possible that this sample is always treated as an outlier in the significance calculations?

And if so: is it sensible to take such a misclassification as a kind of significance?

Regards,

Benjamin

--
Benjamin Otto
Universitaetsklinikum Eppendorf Hamburg
Institut fuer Klinische Chemie
Martinistrasse 52
20246 Hamburg

Clustering limma • 1.5k views

ADD COMMENT • link updated 18.3 years ago by Naomi Altman ★ 6.0k • written 18.3 years ago by Benjamin Otto ▴ 830
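A side note on the workflow as posted: `topTable` returns its rows sorted by evidence rather than in the row order of `eSet`, so a logical vector computed on `tab` may not line up with the rows of `eSet` if it is used for positional indexing. The base-R sketch below illustrates the pitfall with a toy matrix and a hand-built data frame standing in for a `topTable` result. All object names and values are invented for illustration, and the fold-change column is called `logFC` as in current limma releases (the `M` in the post is the older name).

```r
# Toy expression matrix: 4 genes (rows) x 2 samples (columns); values are invented.
expr <- matrix(c(1, 5, 2, 9,
                 1, 1, 2, 2), nrow = 4,
               dimnames = list(c("g1", "g2", "g3", "g4"), c("s1", "s2")))

# Stand-in for a topTable() result: one row per gene, sorted by evidence,
# so its row order differs from the row order of expr.
tab <- data.frame(adj.P.Val = c(0.001, 0.004, 0.20, 0.90),
                  logFC     = c(7, 4, 0, 0),
                  row.names = c("g4", "g2", "g3", "g1"))

selected <- tab$adj.P.Val < 0.01 & abs(tab$logFC) >= 1

# Positional indexing silently picks the first two rows of expr ...
by_position <- rownames(expr[selected, , drop = FALSE])
# ... whereas indexing by gene identifier picks the genes that passed the filter.
by_name <- rownames(expr[rownames(tab)[selected], , drop = FALSE])

print(by_position)
print(by_name)
```

If the heatmap really was drawn from `eSet[selected,]`, it may be worth re-drawing it with identifier-based indexing. As far as I know, `topTable(..., sort.by = "none")` is another way to keep the table in the original row order.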

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.8 years ago

United States

The heatmap did not come through (to me). However, clustering is highly dependent on the choice of distance measure.

--Naomi

At 09:57 AM 11/13/2006, Benjamin Otto wrote:
> [original question snipped]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Naomi S. Altman                814-865-3791 (voice)
Associate Professor
Dept. of Statistics            814-863-7114 (fax)
Penn State University          814-865-1348 (Statistics)
University Park, PA 16802-2111

ADD COMMENT • link 18.3 years ago Naomi Altman ★ 6.0k
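To make the point about distance measures concrete, here is a minimal base-R sketch with invented numbers (the sample names `A1`, `A2`, `X`, `B1`, `B2` are hypothetical). Sample `X` has the same rising profile as `A1` and `A2` but a much smaller amplitude, so Euclidean distance and a correlation-based distance place it in different clusters.

```r
# Toy data: 4 genes (rows) x 5 samples (columns); values are invented.
# A1, A2 and X share a rising profile, B1 and B2 a falling one,
# but X has a much smaller amplitude than A1 and A2.
expr <- cbind(A1 = c(0.0, 2.0, 4.0, 6.0),
              A2 = c(0.5, 2.4, 4.6, 6.5),
              X  = c(2.4, 2.5, 2.7, 2.7),
              B1 = c(4.0, 3.0, 2.0, 1.0),
              B2 = c(4.2, 3.1, 2.2, 1.2))

# dist() works on rows, so transpose to cluster samples.
d_euclid <- dist(t(expr))
# cor() works on columns; 1 - r is a common correlation-based dissimilarity.
d_cor <- as.dist(1 - cor(expr))

# Average-linkage hierarchical clustering, cut into two groups.
groups_euclid <- cutree(hclust(d_euclid, method = "average"), k = 2)
groups_cor    <- cutree(hclust(d_cor,    method = "average"), k = 2)

print(rbind(euclidean = groups_euclid, correlation = groups_cor))
```

With these numbers, `X` joins `B1` and `B2` under Euclidean distance and joins `A1` and `A2` under the correlation distance. Which behaviour is preferable depends on whether absolute expression level or profile shape matters more for the question at hand.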

In addition to Naomi's comments, remember that a desired property of a statistic is that it be "robust" to outliers (ignoring them when appropriate). I think it is probably fine to have some proportion of the samples "misclassified" by your clustering.

However, when this happens, it is a good idea to make sure that a sample mislabeling or some such thing has not occurred. I have discovered an adult sample in what were supposed to be pediatric samples, a mouse cell line among what were supposed to be all canine, and other oddities like that by looking back at the data. Most of the time, though, these samples simply represent biological or technical variation that we cannot fully explain.

Sean

On Monday 13 November 2006 16:02, Naomi Altman wrote:
> The heatmap did not come through (to me). However, clustering is
> highly dependent on the choice of distance measure.
>
> --Naomi
>
> [earlier messages snipped]

ADD REPLY • link 18.3 years ago Sean Davis 21k
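One simple way to screen for possibly mislabeled samples, sketched here in base R under stated assumptions: compare each sample with the mean profile (centroid) of each group, leaving the sample out of its own group's centroid, and flag samples that sit closer to the other group. The function name `nearest_group`, the toy matrix, and the labels are all hypothetical, and Euclidean distance is just one possible choice. Each group needs at least two samples for the leave-one-out centroid to exist.

```r
# expr: genes x samples matrix; group: sample labels. Both are hypothetical inputs.
# Returns, for each sample, the label of the nearest group centroid, where the
# sample itself is left out of the centroid of its own group.
nearest_group <- function(expr, group) {
  group <- factor(group)
  pred <- vapply(seq_len(ncol(expr)), function(j) {
    d <- vapply(levels(group), function(g) {
      members <- setdiff(which(group == g), j)
      centroid <- rowMeans(expr[, members, drop = FALSE])
      sqrt(sum((expr[, j] - centroid)^2))  # Euclidean distance to the centroid
    }, numeric(1))
    names(d)[which.min(d)]
  }, character(1))
  names(pred) <- colnames(expr)
  pred
}

# Toy data with invented values: X is labeled set1 but sits near the set2 samples.
expr <- cbind(A1 = c(0.0, 2.0, 4.0, 6.0),
              A2 = c(0.5, 2.4, 4.6, 6.5),
              X  = c(2.4, 2.5, 2.7, 2.7),
              B1 = c(4.0, 3.0, 2.0, 1.0),
              B2 = c(4.2, 3.1, 2.2, 1.2))
group <- c("set1", "set1", "set1", "set2", "set2")

pred <- nearest_group(expr, group)
suspects <- names(pred)[pred != group]
print(suspects)
```

A flagged sample is only a candidate for a closer look at the sample annotation and lab records. As noted above, most such cases turn out to be ordinary biological or technical variation.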

Hi Naomi,

sorry, the image (37kb) together with the rest of the mail probably exceeded the 40kb limit. Here it is again with higher compression.

Concerning the distance measure I would agree with you. However (and that is why I initially wanted to provide the cluster plot), judging by the expression values I DO agree with the clustering result! And that is what I would not normally expect when clustering genes that were specifically selected as significant...

Benjamin

-----Original Message-----
From: Naomi Altman [mailto:naomi at stat.psu.edu]
Sent: 13 November 2006 22:03
To: Benjamin Otto; 'BioClist'
Subject: Re: [BioC] significance of "wrong" clustering of differential genes

The heatmap did not come through (to me). However, clustering is highly dependent on the choice of distance measure.

--Naomi

ADD REPLY • link 18.3 years ago Benjamin Otto ▴ 830
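A small worked example may help show why significant genes do not guarantee that every sample looks like its own group. The numbers below are invented, and the test is a plain Welch t-test from base R rather than limma's moderated t-statistic, so this is only an illustration of the principle: a gene can remain clearly significant between the sets even though one sample carries the expression level of the other set.

```r
# Invented log-expression values for one gene, 8 samples per set.
# The last set1 sample sits at the set2 level.
set1 <- c(8.0, 8.1, 7.9, 8.2, 7.8, 8.1, 7.9, 5.0)
set2 <- c(5.0, 5.1, 4.9, 5.2, 4.8, 5.1, 4.9, 5.0)

# Welch two-sample t-test (base R, not limma's moderated t).
p_value <- t.test(set1, set2)$p.value

# Where does the odd sample sit relative to the two group means?
odd <- set1[8]
dist_to_own   <- abs(odd - mean(set1[-8]))  # own group mean without the odd sample
dist_to_other <- abs(odd - mean(set2))

cat("p-value:", signif(p_value, 2), "\n")
cat("distance to own group mean:", dist_to_own,
    " distance to other group mean:", dist_to_other, "\n")
```

If the same sample behaves like this across many of the selected genes, it will cluster with the "wrong" set even though every one of those genes passed the significance filter.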
