Question

How to handle replicate counts when running k-means or other machine-learning methods


Guandong Shang ▴ 40

@shangguandong1996-21805

Last seen 2.5 years ago

China

Hi all, I am often unsure which type of normalized count should go into k-means clustering, other clustering methods, or network analysis, and how to handle the replicates. After all, k-means needs a single value per treatment (or time point) when I cluster genes to find similar expression patterns. I have tried CPM, TPM, FPKM, and normalized counts from DESeq2, then taken the mean across replicates within each sample and applied Z-scaling (mean = 0, sd = 1). Recently I read the 2014 DESeq2 paper and some Q&A posts about DESeq2 normalization. The paper recommends the vst or rlog transformation to obtain homoskedastic data:

In addition, the rlog transformation, which implements shrinkage of fold changes on a per-sample basis, facilitates visualization of differences, for example in heat maps, and enables the application of a wide range of techniques that require homoskedastic input data, including machine-learning or ordination techniques such as principal component analysis and clustering.

I also found the rnaseqGene workflow, which subtracts rowMeans(mat) from mat so that each gene has mean 0:

mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)

But I am still unsure how to handle the replicates. Is it reasonable to sum the vst values over the replicates, divide by the number of replicates, and use the resulting mean vst value as input to k-means or to other clustering or network methods?

Guandong Shang

deseq2 • 1.8k views

ADD COMMENT • link updated 5.0 years ago by Michael Love 43k • written 5.0 years ago by Guandong Shang ▴ 40

score 1 · Answer 1 · 2020-04-22


Michael Love 43k

@mikelove

Last seen 4 hours ago

United States

I'm not sure exactly what you want to do downstream, but VST data is quite flexible and deals with the sequencing depth scaling and variance stabilization together.

ADD COMMENT • link 5.0 years ago Michael Love 43k


Thanks for your reply.

I just want to run k-means to find groups of genes with different expression patterns across conditions (see https://www.biostars.org/p/343055/). The input to k-means might look like this:

      Time1 Time2 Time3 Time4 Time5
Gene1 1 2 3 4 2
Gene2 1 1 1 2 3

Then I run k-means:

km <- kmeans(data,centers = 7,iter.max = 50)

> km$centers
       Time1     Time2      Time3     Time4      Time5
1 -0.9238159  0.3348299  1.2321682  0.83295709 -0.250745982
2 -0.7506946 -0.3936903 -0.6068091 -0.47479372 -0.264158223
3 -1.9095537  0.4267504  0.3821658  0.24165976  0.004372463
4 -0.9256701  1.6558903  0.4004714  0.10654355 -0.411843097
5  1.7844635 -0.1537015 -0.1693351 -0.17492346 -0.377649364
6 -1.3075766 -0.6379933 -0.6096608 -0.09668287  0.607616676
7 -1.3182925 -0.6905387 -0.1432101  0.70321781  1.320898588

But the vst (or other transformed) count matrix looks like this:

      Time1_Rep1 Time1_Rep2 Time2_Rep3 Time2_Rep1 Time2_Rep2……
Gene1 1 1 2 1 2 1 2
Gene2 1 1 1 2 3 1 2

So I am not sure whether I should average the vst values across replicates, or whether DESeq2 has some other way to account for the replicates within a sample, similar to its shrinkage of dispersion or LFC estimates.
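For illustration, the "average the replicates, then Z-scale, then k-means" idea described above could be sketched as follows. This is only a sketch under stated assumptions: the matrix `mat` is simulated here as a stand-in for `assay(vsd)`, the `condition` factor stands in for a column of `colData(vsd)`, and the choice of 2 centers is arbitrary. It is not a DESeq2-recommended procedure.

```r
# Simulated stand-in for assay(vsd): 6 genes x 6 samples
# (3 time points, 2 replicates each). Replace with real VST values.
set.seed(1)
mat <- matrix(rnorm(36, mean = 8), nrow = 6,
              dimnames = list(paste0("Gene", 1:6),
                              paste0("Time", rep(1:3, each = 2),
                                     "_Rep", rep(1:2, times = 3))))

# Condition label for each column (assumption: with a real object this
# would come from colData(vsd)).
condition <- factor(rep(c("Time1", "Time2", "Time3"), each = 2))

# Average the replicate columns within each condition: genes x conditions.
cond_means <- sapply(levels(condition), function(lv) {
  rowMeans(mat[, condition == lv, drop = FALSE])
})

# Z-scale each gene (row) to mean 0, sd 1. Genes with zero variance
# across conditions would give NaN and should be filtered out first.
z <- t(scale(t(cond_means)))

# k-means on the scaled condition means; nstart reruns from several
# random starts to reduce dependence on initialization.
km <- kmeans(z, centers = 2, nstart = 10)
km$centers
```

Note that averaging discards the replicate-to-replicate variability, so a gene with noisy replicates is treated the same as one with tight replicates.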

ADD REPLY • link 5.0 years ago Guandong Shang ▴ 40


I don't have any particular recommendation. I tend to include all biological replicates.
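One possible reading of "include all biological replicates" is to cluster genes on the full matrix, with one column per replicate, after centering each gene. The sketch below assumes that reading. The matrix is again simulated as a stand-in for `assay(vsd)`, and the number of centers is arbitrary.

```r
# Simulated stand-in for assay(vsd): 6 genes x 6 samples
# (3 time points, 2 replicates each). Replace with real VST values.
set.seed(1)
mat <- matrix(rnorm(36, mean = 8), nrow = 6,
              dimnames = list(paste0("Gene", 1:6),
                              paste0("Time", rep(1:3, each = 2),
                                     "_Rep", rep(1:2, times = 3))))

# Center each gene across all samples, as in the rnaseqGene heatmap code.
centered <- mat - rowMeans(mat)

# Every replicate is its own dimension, so no averaging step is needed.
km_all <- kmeans(centered, centers = 2, nstart = 10)

# List the genes assigned to each cluster.
split(names(km_all$cluster), km_all$cluster)
```

With this layout, replicates of the same condition contribute separately to the distance between genes, so consistent patterns across replicates carry more weight than a pattern seen in a single sample.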