Hi, Guys
I am always confused about which type of normalized count should be used as input for k-means clustering, other clustering methods, or network analysis, and about how to deal with replicates. After all, I need one value per treatment (or time point) when I run k-means to find genes with similar expression patterns. I have tried CPM, TPM, FPKM, and normalized counts from DESeq2, then taken the mean across the replicates of each sample and applied Z-scaling (mean = 0, sd = 1).
Recently, I read the DESeq2 2014 paper and some Q&A posts about DESeq2 normalization. It recommends using the vst or rlog transformation to get homoskedastic data:
In addition, the rlog transformation, which implements shrinkage of fold changes on a per-sample basis, facilitates visualization of differences, for example in heat maps, and enables the application of a wide range of techniques that require homoskedastic input data, including machine-learning or ordination techniques such as principal component analysis and clustering.
I also found the rnaseqGene workflow. It uses mat - rowMeans(mat) to center each gene at mean = 0:
# vsd and topVarGenes are defined earlier in the workflow
mat <- assay(vsd)[ topVarGenes, ]
mat <- mat - rowMeans(mat)  # center each gene (row) at 0
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)
But I am still confused about how to deal with the replicates. Is it reasonable to average the vst values across replicates (sum them and divide by the number of replicates) and use the mean vst as the input for k-means clustering, or for other clustering or network methods?
Guandong Shang
Thanks for your reply
For me, I just want to do k-means clustering to find the different gene clusters across conditions, just like this: https://www.biostars.org/p/343055/ And the input data for kmeans might look like
And then I do kmeans
But the vst (or some other) count matrix would look like
So I am not sure whether I should average the vst values of the replicates, or whether DESeq2 has some other way to account for the replicates within a sample, similar to its shrinkage of dispersions or LFCs.
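The averaging option described above can be sketched roughly as follows. This is only an illustration in base R, not a DESeq2 recommendation: the simulated matrix `mat` stands in for `assay(vsd)`, and the names `condition`, `cond_mean`, `z`, and `km`, as well as the choice of 4 clusters, are assumptions made for the example.

```r
# Hypothetical stand-in for assay(vsd): 60 genes x 6 samples
# (3 conditions, 2 replicates each). Real input would be the vst matrix.
set.seed(1)
n_genes <- 60
condition <- factor(rep(c("A", "B", "C"), each = 2))
mat <- matrix(rnorm(n_genes * length(condition), mean = 8, sd = 1),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0(condition, "_rep", rep(1:2, times = 3))))

# One column per condition: mean of the transformed values over its replicates
cond_mean <- sapply(levels(condition), function(cn) {
  rowMeans(mat[, condition == cn, drop = FALSE])
})

# Z-scale each gene (row) to mean 0 and sd 1 across conditions
z <- t(scale(t(cond_mean)))

# k-means on the gene profiles; 4 clusters is an arbitrary choice here
km <- kmeans(z, centers = 4, nstart = 10)
table(km$cluster)
```

Note that averaging discards the within-condition variability, so a gene with noisy replicates is treated the same as one with tight replicates.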
I don't have any particular recommendation; I tend to include all biological replicates.
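Including all biological replicates could look something like the sketch below, where every replicate stays as its own column in the k-means input. It is a rough illustration in base R only: the simulated `mat_all` stands in for `assay(vsd)`, and `condition`, `centered`, `km_all`, `centers_by_cond`, and the choice of 3 clusters are hypothetical.

```r
# Hypothetical stand-in for assay(vsd): 40 genes x 6 samples
# (3 conditions, 2 replicates each). Real input would be the vst matrix.
set.seed(2)
condition <- factor(rep(c("A", "B", "C"), each = 2))
mat_all <- matrix(rnorm(40 * length(condition), mean = 8, sd = 1),
                  nrow = 40,
                  dimnames = list(paste0("gene", 1:40),
                                  paste0(condition, "_rep", rep(1:2, times = 3))))

# Center each gene at 0 across all samples, as in mat - rowMeans(mat)
centered <- mat_all - rowMeans(mat_all)

# Cluster genes using every replicate as a separate dimension
km_all <- kmeans(centered, centers = 3, nstart = 10)

# For interpretation only: average each cluster center over the replicates
# of a condition, giving a clusters x conditions summary
centers_by_cond <- sapply(levels(condition), function(cn) {
  rowMeans(km_all$centers[, condition == cn, drop = FALSE])
})
centers_by_cond
```

With this layout, genes whose replicates disagree end up further from their cluster center, so replicate variability influences the clustering instead of being averaged away.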
Thanks for your reply :)