Edwin Groot
Dear List,
I am exploring several methods of clustering gene expression microarray
data, and I have some problems with the k-means method. Is scaling
necessary for my data, and if so, what type is better?
My expression data is ca. 5000 genes in rows and 5 cell types in
columns. I want to visualize which groups of genes are up or down in
one cell type relative to other cell types. The data ranges from
2.5e+00 to 1.9e+05, and has a median of 2.8e+02. The strategy in this
clustering is to increase k, until no new expression relationships
among the 5 cell types are found.
I followed Thomas Girke's fine introduction to Bioconductor:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html
For performance reasons, clara() works best, but following Thomas's
one-liner gave me an all-green heatmap (all other commands omitted):
> clarax <- clara(y, 4)
Scaling the data gave the expected red-green colours, but when I exported
the clustering information, it showed no relationship between expression
and colour. When one cell type was red and the other green for a given
cluster, the expression of the member genes was up, down or unchanged
relative to the other cell type. I would have expected the great
majority of expressions to be up relative to the other.
> library(cluster)
#Scale my data
> myscale <- t(scale(t(meanexp)))
#Seven K-clusters gave the best result
> kclusters7 <- clara(myscale,7,stand=FALSE)
#Plot the heatmap. The data is transposed so that samples are in
#columns. The data is also sorted by cluster number.
> image(c(1:ncol(myscale)), c(1:nrow(myscale)),
t(myscale[names(sort(kclusters7$clustering)),]), col=my.colorFct(),
xaxt="n", yaxt="n", ylab="clusters", xlab="samples")
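The comparison between cluster membership and expression described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the simulated meanexp matrix, its size (60 genes) and k = 4 are made up so that the example runs on its own, and they do not reflect the real data.

```r
library(cluster)

set.seed(1)
# Made-up stand-in for meanexp: 60 genes in rows, 5 cell types in columns.
meanexp <- matrix(rlnorm(60 * 5, meanlog = 6, sdlog = 1.5), nrow = 60,
                  dimnames = list(paste("gene", 1:60, sep = ""),
                                  paste("cell", 1:5, sep = "")))

# Per-gene scaling and clustering, as in the commands above (k is arbitrary here).
myscale   <- t(scale(t(meanexp)))
kclusters <- clara(myscale, 4, stand = FALSE)
cl        <- kclusters$clustering

# Mean profile of each cluster across cell types:
# once on the scaled values (what the heatmap shows),
# once on the raw values (what the exported expression shows).
scaled_profile <- apply(myscale, 2, function(v) tapply(v, cl, mean))
raw_profile    <- apply(meanexp, 2, function(v) tapply(v, cl, mean))

round(scaled_profile, 2)
round(raw_profile, 0)
```

Each row of the two tables is one cluster, so the heatmap colours and the underlying expression levels can be read side by side.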
The problem is I am too much of a statistics weakling to determine what
is the appropriate scaling method. If t(scale(t(meanexp))) is scaling
each gene independently of all the others, then that is probably the
source of my problem. The expressions differ widely among cell types
(that is how I selected the 5000 genes in the first place). I also see
in the tutorial the scaling step written as:
> scale(t(y))
Why is there sometimes one transposition, sometimes two? What's wrong
with no transposition?
> scale(y)
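To make the three variants concrete, here is a minimal sketch on a toy matrix. The matrix x and its values are invented for illustration. The one fact relied on is that scale() centres and scales each column of its argument.

```r
# Toy data (made up): 3 genes in rows, 4 cell types in columns.
x <- matrix(c(  10,   20,   30,   40,
               100,  400,  200,  300,
              5000, 1000, 3000, 2000),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste("gene", 1:3, sep = ""),
                            paste("cell", 1:4, sep = "")))

# No transposition: each cell type (column) is standardised across genes.
by_sample <- scale(x)

# Two transpositions: each gene (row) is standardised across cell types,
# and the result keeps genes in rows.
by_gene <- t(scale(t(x)))

# One transposition: the same per-gene standardisation,
# but the result has cell types in rows and genes in columns.
samples_in_rows <- scale(t(x))

rowMeans(by_gene)      # per-gene means, all (numerically) zero
colMeans(by_sample)    # per-cell-type means, all (numerically) zero
all.equal(t(samples_in_rows), by_gene, check.attributes = FALSE)
```

So the number of transpositions decides whether genes or samples are standardised, and which of the two ends up in the rows of the result.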
Some insights would be much appreciated.
Regards,
Edwin
p.s. > R.version
_
platform i486-pc-linux-gnu
arch i486
os linux-gnu
system i486, linux-gnu
status
major 2
minor 7.1
year 2008
month 06
day 23
svn rev 45970
language R
version.string R version 2.7.1 (2008-06-23)
---
Dr. Edwin Groot, postdoctoral associate
AG Laux
Institut fuer Biologie III
Schaenzlestr. 1
79104 Freiburg, Deutschland
+49 761-2032945