Variance stabilising transformation for machine learning
Manik Garg (@manik-garg-20656), Cambridge, UK

Hi,

I have tried to develop a machine learning model based on a set of differentially expressed genes obtained with DESeq2. The variance stabilising transformation (vst()) was used for this purpose, as recommended in the vignette. However, I applied it to the entire dataset at once (it was passed as a single dds object to vst(); note that this was the same dds object used for running DESeq2), prior to dividing the data into training and test sets for model development.

I feel I have applied it in the wrong way, because ideally I should apply vst() only to the training data and then transform the test data using the transformation parameters learned on the training data. Even if, for simplicity, I apply it to the entire dataset at the beginning, I would still need to retain these parameters for transforming an unknown validation dataset, which might contain only one sample. Please let me know if there is a way to do so, or if there is a better strategy in your opinion. Should I make a dds object using the training data only, apply vst(), somehow store these values, and then use the stored values to apply vst() to a dds object created from the test data only? And how do I find these stored transformation values?

Many thanks!

deseq2 vst rna-seq machine-learning

It's an unusual idea to pre-select genes ranked according to a prior statistical test. If you read the caret manual, they suggest allowing the underlying algorithm to select features automatically, e.g. glmnet. There is also the overfitting issue if you haven't cross-validated the feature selection with DESeq2. Maybe you know this already, I don't know. I wouldn't use DESeq2 for a machine learning classification problem, aside from the VST normalisation.


VST is not a statistical test and can be run without knowing the condition groups.

Compare it to log transformation or centering and scaling.
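
As a hedged illustration of this point (countMatrix is a hypothetical matrix of raw counts with samples in columns, not an object from this thread), the transformation can be run with no condition information at all:

library("DESeq2")
# Intercept-only design: no condition labels are used anywhere
dds <- DESeqDataSetFromMatrix(countData = countMatrix,
                              colData = data.frame(sample = colnames(countMatrix)),
                              design = ~ 1)
# blind = TRUE ignores the design when estimating the dispersion trend
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
head(assay(vsd))  # transformed values, roughly on the log2 scale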


Indeed, I use the vst function all the time before running caret or clustering etc.; it is very useful. You misunderstood my point.

There isn't much sense, to me, in using DESeq2 to select features for a classification model, seeing as many algorithms will select features for you according to their own method and will generally do a better job of finding the ones that classify best. This is mentioned in the caret manual when Max talks about univariate filtering using t-tests etc.

It also sounds like the OP may be overfitting, i.e. selecting genes by running DESeq2 on the whole dataset and splitting into training and testing afterwards, instead of running it on the training data only. A lot of people don't realise that feature selection should be cross-validated (see the sketch below for the splitting order).

I use DESeq2 for DEG analysis, it's very good, so thanks for that Michael Love.
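
To make the splitting order concrete, here is a minimal sketch (countMatrix and sampleInfo are hypothetical objects: raw counts with samples in columns, and a sample table with a condition column) that partitions the samples before any DESeq2-based feature selection:

library("caret")
library("DESeq2")

set.seed(1)
# Split samples first, stratified by condition
trainIdx <- createDataPartition(sampleInfo$condition, p = 0.8, list = FALSE)
trainData <- countMatrix[, trainIdx]

# Feature selection (differential expression) on the training samples only
ddsTrain <- DESeqDataSetFromMatrix(countData = trainData,
                                   colData = sampleInfo[trainIdx, , drop = FALSE],
                                   design = ~ condition)
ddsTrain <- DESeq(ddsTrain)
res <- results(ddsTrain)
selectedGenes <- rownames(res)[which(res$padj < 0.05)]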

Michael Love (@mikelove)

Good question. The VST is based on the variance-mean relationship, so you can learn it on one dataset and then apply the transformation to a new dataset using the frozen parameters. There is some information on how to do this in ?varianceStabilizingTransformation, although it looks like the example has been removed. The advice should still work (quoted below):

The frozen VST is accomplished by saving the dispersion function accessible with dispersionFunction, assigning this to the DESeqDataSet with the new samples, and running varianceStabilizingTransformation with 'blind' set to FALSE (see example below). Then the dispersion function from the previous dataset will be used to transform the new sample(s).
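
A minimal sketch of that frozen-VST recipe, assuming ddsTrain and ddsNew are DESeqDataSet objects for the training data and the new sample(s) (these names are placeholders, not from this thread):

library("DESeq2")

# Learn the dispersion (variance-mean) trend on the training data
ddsTrain <- estimateSizeFactors(ddsTrain)
ddsTrain <- estimateDispersions(ddsTrain)

# Freeze that trend and apply it to the new sample(s)
ddsNew <- estimateSizeFactors(ddsNew)
dispersionFunction(ddsNew) <- dispersionFunction(ddsTrain)
vsdNew <- varianceStabilizingTransformation(ddsNew, blind = FALSE)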


Many thanks for your speedy reply! I will get back to you if I have any further questions. Cheers!


Hi Michael, do you mean something like this? Please note that for the test data I will not have colData for "condition", as this is the category I am trying to predict. Do you suggest creating a dummy column for it? With blind = FALSE, varianceStabilizingTransformation() may take this dummy "condition" column into account, and a random assignment of categories may not be helpful here. What do you think? I also see that varianceStabilizingTransformation() can take a count matrix directly as input, which avoids the need for proper colData when creating the DESeqDataSet, but colData seems necessary for dispersionFunction(). Being able to save these values when calling varianceStabilizingTransformation() itself would have been ideal. Many thanks!

library("DESeq2")
#For training dataset
ddsTrain <- DESeqDataSetFromMatrix(countData = trainData, colData = colDataTrain, design = ~condition)
ddsTrain <- DESeq(ddsTrain)
vstNormalizedExpressionDataForTrain <- varianceStabilizingTransformation(ddsTrain, blind = FALSE)

#For testing dataset
ddsTest <- DESeqDataSetFromMatrix(countData = testData, colData = colDataTest, design = ~condition)
ddsTest <- DESeq(ddsTest)
dispersionFunction(ddsTest) <- dispersionFunction(ddsTrain)
vstNormalizedExpressionDataForTest <- varianceStabilizingTransformation(ddsTest, blind = FALSE)

Yes, that looks correct.

You can specify a design of ~1 for the dds so that no sample information is used for dispersion estimation. Still specify blind=FALSE when you run the VST. It will just use an intercept model, but it won't re-estimate the dispersion function.
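
For example, one way to set up the test side with an intercept-only design, reusing testData and ddsTrain from the code above (the sample column in colData is just a placeholder):

# No "condition" column is needed when the design is ~ 1
ddsTest <- DESeqDataSetFromMatrix(countData = testData,
                                  colData = data.frame(sample = colnames(testData)),
                                  design = ~ 1)
ddsTest <- estimateSizeFactors(ddsTest)
# Reuse the dispersion trend learned on the training data
dispersionFunction(ddsTest) <- dispersionFunction(ddsTrain)
vstNormalizedExpressionDataForTest <- varianceStabilizingTransformation(ddsTest, blind = FALSE)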


Many thanks Michael! I see that the ?varianceStabilizingTransformation help page has also been updated with the example. Cheers!

There is some information on how to do this in ?varianceStabilizingTransformation. Although it looks like the example has been removed.
