Hi everyone,
I was hoping to get an answer on an issue I have been struggling with for a while.
I have raw count data from an RNA-seq experiment and want to develop a model for separating two groups of samples. I used DESeq2 to select my top genes, then trained and tested the model on the dataset after applying the variance stabilizing transformation.
To check that my model is robust, I also applied rlog and other normalization methods (TPM, RPKM) to the raw count matrix, keeping the same set of selected genes.
My problem: the model's performance (accuracy) changes depending on the normalization method, even between rlog and vsd. Of note, just by looking at the normalized matrices, I can see substantial differences in the normalized values between methods. For example, for one of the selected genes, the normalized value for one sample is 4.328 using vsd and 0.02 using RPKM. I am not sure I fully understand where this large difference comes from.
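To make the scale question concrete, here is a minimal sketch in Python of how the same raw count can end up as very different numbers. It is not the DESeq2 implementation: `log2_pseudocount` is only a crude stand-in for log-like transforms such as vsd or rlog, and the count, gene length, and library size are hypothetical values chosen for illustration, not taken from my data.

```python
import math


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads (linear scale)."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_pseudocount(count, size_factor=1.0):
    """Crude stand-in for a log-like transform (NOT the DESeq2 vst/rlog algorithm)."""
    return math.log2(count / size_factor + 1.0)


# Hypothetical gene in one sample (illustrative numbers only).
count = 20                       # raw reads assigned to the gene
gene_length_bp = 2000            # assumed gene length in base pairs
total_mapped_reads = 50_000_000  # assumed library size

linear_value = rpkm(count, gene_length_bp, total_mapped_reads)
log_value = log2_pseudocount(count)

print(f"RPKM (linear, length- and depth-scaled): {linear_value:.3f}")
print(f"log2(count + 1) (log-like scale):        {log_value:.3f}")
```

RPKM divides by gene length and library size and stays on a linear scale, while log-like transforms compress counts onto a roughly log2 scale, so the two are not expected to be numerically comparable. If that is what is happening here, a model fitted on one scale would presumably need to be refitted on each normalization rather than reused as-is.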
Has anyone encountered a similar situation? Any help would be appreciated.
Thanks!
Dear Michael, in a survival analysis with Kaplan-Meier curves that stratifies patients by high/low expression of a gene, which unit of measurement would, in your opinion, be the most appropriate to use? Thank you! Ale