Hi,
I have tried to develop a machine learning model based on a set of differentially expressed genes obtained using DESeq2. As recommended in the vignette, I applied the variance stabilising transformation (vst()) for this purpose. However, I applied it to the entire dataset at once (the same dds object that was used for running DESeq2), before splitting the data into training and testing sets for model development. I suspect this is the wrong way round: ideally I should apply vst() only to the training data and then transform the test data using the transformation values learned from the training data. Even if, for simplicity, I apply it to the whole dataset at the start, I would still need to retain those values in order to transform an unseen validation dataset, which might contain only a single sample.

Please let me know whether there is a way to do this, or whether there is a better strategy in your opinion. Should I make a dds object from the training data only, apply vst(), somehow store those values, and then use them to apply vst() to a dds object created from the test data alone? And how would I access those stored transformation values?
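To make my question concrete, this is roughly what I have in mind (just a sketch, not something I know to be correct; counts_train, coldata_train, counts_test and coldata_test are placeholders for my own objects, and I am only guessing that dispersionFunction() holds the values that need to be carried over):

```r
library(DESeq2)

## Learn the transformation on the training samples only
dds_train <- DESeqDataSetFromMatrix(countData = counts_train,
                                    colData   = coldata_train,
                                    design    = ~ 1)
dds_train <- estimateSizeFactors(dds_train)
dds_train <- estimateDispersions(dds_train)
vsd_train <- varianceStabilizingTransformation(dds_train, blind = FALSE)

## Re-use the dispersion trend learned on the training data for the test samples
dds_test <- DESeqDataSetFromMatrix(countData = counts_test,
                                   colData   = coldata_test,
                                   design    = ~ 1)
dds_test <- estimateSizeFactors(dds_test)
dispersionFunction(dds_test) <- dispersionFunction(dds_train)
vsd_test <- varianceStabilizingTransformation(dds_test, blind = FALSE)
```

Is something along these lines the intended way, or is there a cleaner mechanism for storing and re-applying the transformation?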
Many thanks!
It’s an unusual idea to pre-select genes ranked by a prior statistical test. If you read the caret manual, they suggest letting the underlying algorithm select features automatically, e.g. glmnet. There is also an overfitting issue if you haven’t cross-validated the feature selection done with DESeq2. Maybe you know this already, I don’t know. I wouldn’t use DESeq2 for a machine learning classification problem, aside from the VST normalisation.
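Something along these lines, purely as a sketch (x_train and y_train are placeholders for the transformed expression matrix and the class labels; tuning is left at caret's defaults):

```r
library(caret)

## Let glmnet's penalty do the feature selection inside cross-validation
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(x = x_train, y = y_train,
              method    = "glmnet",
              trControl = ctrl)

## Genes with non-zero coefficients at the selected lambda are the "selected" features
coef(fit$finalModel, s = fit$bestTune$lambda)
```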
VST is not a statistical test and can be run without knowing the condition groups.
Compare it to log transformation or centering and scaling.
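For example, a minimal sketch (counts and coldata here stand in for your own count matrix and sample table):

```r
library(DESeq2)

## design ~ 1: no condition information is used anywhere
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ 1)
vsd <- vst(dds, blind = TRUE)   # blind to any sample grouping
mat <- assay(vsd)               # transformed matrix, ready for clustering / ML
```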
Indeed, I use the vst function all the time before running caret or clustering etc.; it is very useful. You misunderstood my point.
To me, there isn’t much sense in using DESeq2 to select features for a classification model, since many algorithms will select features for you according to their own method and will generally do a better job of finding those that classify best. This is mentioned in the caret manual, where Max discusses univariate filtering using t-tests etc.
It also sounds like the OP may be overfitting, i.e. selecting genes with DESeq2 run on the whole dataset and only splitting into training and testing afterwards, instead of running it on the training data only. A lot of people don’t realise that feature selection should be cross-validated.
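To illustrate the order of operations, a rough sketch (all object names are placeholders):

```r
library(caret)

## Split first, then do any gene selection on the training samples only
set.seed(1)
idx     <- createDataPartition(y_all, p = 0.8, list = FALSE)
x_train <- x_all[idx, , drop = FALSE];  y_train <- y_all[idx]
x_test  <- x_all[-idx, , drop = FALSE]; y_test  <- y_all[-idx]

## Any DESeq2-based ranking/filtering would now be run on the training samples only,
## and the resulting gene list applied unchanged to the test set.
## If the filtering should be repeated within each resample, caret's sbf()/rfe()
## wrappers perform the selection inside the folds rather than once up front.
```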
I use DESeq2 for DEG analysis; it’s very good, so thanks for that, Michael Love.