Heatmaps of K-clusters don't Match Expression

0

Entering edit mode

Edwin Groot ▴ 230

@edwin-groot-3606

Last seen 10.2 years ago

Dear List, I am exploring several methods of clustering gene expression microarray data, and I have some problems with the k-means method. Is scaling necessary for my data, and if so, what type is better? My expression data is ca. 5000 genes in rows and 5 cell types in columns. I want to visualize which groups of genes are up or down in one cell type relative to other cell types. The data ranges from 2.5e+00 to 1.9e+05, and has a median of 2.8e+02. The strategy in this clustering is to increase k, until no new expression relationships among the 5 cell types are found. I followed Thomas Girke's fine introduction to Bioconductor: http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.htm l For performance reasons, clara() works best but following Thomas's one-liner gave me a green field (all other commands ommitted): > clarax <- clara(y, 4) Scaling the data gave the expected red-green colours, but exporting the clustering information, showed no relationship between expression and colour. When one cell type was red, and the other green for a given cluster, the expression of the member genes were up, down or unchanged relative to the other cell type. I would have expected the great majority of expressions to be up relative to the other. > library(cluster) #Scale my data > myscale <- t(scale(t(meanexp))) #Seven K-clusters gave the best result > kclusters7 <- clara(myscale,7,stand=FALSE) #Plot the heatmap. The data is transposed so that samples are in columns. The data is also sorted by cluster number. > image(c(1:ncol(myscale)), c(1:nrow(myscale)), t(myscale[names(sort(kclusters7$clustering)),]), col=my.colorFct(), xaxt="n", yaxt="n", ylab="clusters", xlab="samples") The problem is I am too much of a statistics weakling to determine what is the appropriate scaling method. If t(scale(t(meanexp))) is scaling each gene independently of all the others, then that is probably the source of my problem. The expressions differ widely among cell types (that is how I selected the 5000 genes in the first place). I also see in the tutorial the scaling step written as: > scale(t(y)) Why are there sometimes one transposition, sometimes two? What's wrong with no transposition? > scale(y) Some insights would be much appreciated. Regards, Edwin p.s. > R.version _ platform i486-pc-linux-gnu arch i486 os linux-gnu system i486, linux-gnu status major 2 minor 7.1 year 2008 month 06 day 23 svn rev 45970 language R version.string R version 2.7.1 (2008-06-23) --- Dr. Edwin Groot, postdoctoral associate AG Laux Institut fuer Biologie III Schaenzlestr. 1 79104 Freiburg, Deutschland +49 761-2032945

Clustering Clustering • 1.2k views

ADD COMMENT • link updated 15.0 years ago by Steve Lianoglou ★ 13k • written 15.0 years ago by Edwin Groot ▴ 230

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 21 months ago

United States

Hi Edwin, I'm not familiar with Thomas Girke's tutorial, so I'm just going to jump to your "direct" questions: On Dec 14, 2009, at 9:33 AM, Edwin Groot wrote: [snip] > The problem is I am too much of a statistics weakling to determine what > is the appropriate scaling method. If t(scale(t(meanexp))) is scaling > each gene independently of all the others, then that is probably the > source of my problem. Take a look at the documentation in ?scale It works over the *columns* of the matrix: each column is treated independently of the other. I'm not sure what you mean when you ask if each gene is scaled independently of all the others, but I guess you can answer that now? If you don't pass any other arguments to the functions, you are calculating a z-score so that in each column, the element is replaced with the number of std. deviations it is away from the mean of that column: http://en.wikipedia.org/wiki/Standard_score Regarding the "appropriateness" of the scaling: this transformation uses means and std. deviations, so there's an implicit assumption that your data is normally distributed. If you're using log-transformed gene expression values in your matrix, I think common wisdom is that this usually isn't an evil assumption to make, but you can see for your self by plotting the density distributions of the (log- transformed) columns in your data. > The expressions differ widely among cell types How do you mean? The absolute value of the expression of a given gene differs widely across each sample? Or after you t(scale(t()) your data, each gene still differs widely across samples, or what? > (that is how I selected the 5000 genes in the first place). Are you saying you're just keeping the 5000 genes with the highest variance across your samples? > I also see > in the tutorial the scaling step written as: >> scale(t(y)) > Why are there sometimes one transposition, sometimes two? What's wrong > with no transposition? Presumably the double-transposition is done because you want to scale the rows of a matrix, but `scale` only works on the columns, so you t() your matrix once before you pass it into `scale`, then you t() the result of scale so your data comes back the same way you sent it in (rows are rows, cols are cols). >> scale(y) > > Some insights would be much appreciated. Hope I provided some. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 15.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Edwin, Steve comments should have clarified your question regarding the scaling and transposition steps. Some more general remarks on this topic: Whether scaling is necessary depends, among other factors, on the data type and the distance measure used for clustering. For clustering expression profiles, scaling is often used in combination with 'scale sensitive' distance measures to focus the analysis on the profiles of gene expressions rather than differences in their overall expression strengths. For instance, when two genes follow exactly the same expression profile, but one gene is highly and the other one weakly expressed, then their unscaled data sets will have a relatively large Euclidean distances while it is close to zero for the scaled values. Usually, the latter is the preferred behavior for expression data. In case of correlation-based distances (e.g. Spearman), however, scaling will not make any difference. For very similar reasons, scaling is often used to plot 'meaningful' heatmaps. Without scaling, only the strongly expressed genes would show a rich color pattern in a heatmap, while the weaker ones will disappear in 'darkness'. To your analysis: The way you are running Clara, Euclidean distances are used as default. So scaling makes sense here. Since your k value seems to be rather low for a dataset of 5000 objects, I would also try much larger k values. In case you want to try correlation-based distance methods, you want to give PAM a try. The algorithms of PAM and Clara are very similar, but PAM allows you to pass on your own distance matrix. In addition, clara's time efficiency will diminish with increasing k values. For a first exploration of your data, you may want to perform hierarchical clustering first. In a second step you could compare your optimized k-means (pam) clustering with the hierarchical clustering by highlighting the first one in the color bar utility of the heatmap/heatmap.2 function. I hope this helps. Thomas On Mon, Dec 14, 2009 at 01:04:10PM -0500, Steve Lianoglou wrote: > Hi Edwin, > > I'm not familiar with Thomas Girke's tutorial, so I'm just going to jump to your "direct" questions: > > On Dec 14, 2009, at 9:33 AM, Edwin Groot wrote: > [snip] > > The problem is I am too much of a statistics weakling to determine what > > is the appropriate scaling method. If t(scale(t(meanexp))) is scaling > > each gene independently of all the others, then that is probably the > > source of my problem. > > Take a look at the documentation in ?scale > > It works over the *columns* of the matrix: each column is treated independently of the other. I'm not sure what you mean when you ask if each gene is scaled independently of all the others, but I guess you can answer that now? > > If you don't pass any other arguments to the functions, you are calculating a z-score so that in each column, the element is replaced with the number of std. deviations it is away from the mean of that column: > > http://en.wikipedia.org/wiki/Standard_score > > Regarding the "appropriateness" of the scaling: this transformation uses means and std. deviations, so there's an implicit assumption that your data is normally distributed. If you're using log-transformed gene expression values in your matrix, I think common wisdom is that this usually isn't an evil assumption to make, but you can see for your self by plotting the density distributions of the (log- transformed) columns in your data. > > > The expressions differ widely among cell types > > How do you mean? The absolute value of the expression of a given gene differs widely across each sample? Or after you t(scale(t()) your data, each gene still differs widely across samples, or what? > > > (that is how I selected the 5000 genes in the first place). > > Are you saying you're just keeping the 5000 genes with the highest variance across your samples? > > > I also see > > in the tutorial the scaling step written as: > >> scale(t(y)) > > Why are there sometimes one transposition, sometimes two? What's wrong > > with no transposition? > > > Presumably the double-transposition is done because you want to scale the rows of a matrix, but `scale` only works on the columns, so you t() your matrix once before you pass it into `scale`, then you t() the result of scale so your data comes back the same way you sent it in (rows are rows, cols are cols). > > > >> scale(y) > > > > Some insights would be much appreciated. > > Hope I provided some. > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 15.0 years ago Thomas Girke ★ 1.7k

Login before adding your answer.