Using voom-transformed counts for unsupervised analyses (PCA, Random Forest, Elastic Net): Best practices?
2
0
Entering edit mode
Ahdee ▴ 50
@ahdee-8938
Last seen 8 weeks ago
United States

I'm using voom transformation (limma package) for RNA-seq analysis and considering using the voom-transformed counts (vs CPM; log2(CPM) for downstream analyses like PCA, Random Forest, and Elastic Net. The homoscedastic property of voom transformation seems advantageous for these methods however I'm not sure what if this is advisable? Moreover, if so, then I'm wondering about best practices - specifically, should the voom transformation be performed without a design matrix for these unsupervised analyses to avoid potential bias?

thanks in advance!

R version 4.1.2 (2021-11-01)
limma_3.50.3
limma limma-voom • 573 views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

voom() does not transform the data to be homoscedastic. Quite the opposite, it computes precision weights that reflect the mean-variance trend, i.e., unequal variances. In other words, it accounts for unequal variances rather than removes them.

I do not recommend the use of voom() output for any downstream tool that does not use precision weights. To export expression values from limma or edgeR for input to PCA, Random Forest, and Elastic Net, I simply use cpm(counts,log=TRUE), as recommended in the edgeR User's Guide. Indeed, if you type plotMDS(y, gene.selection="common"), where y is a DGEList, you will automatically get a PCA plot of the log2CPM values from cpm().

ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 8 hours ago
United States

The 'voom transformed counts' are just logCPM with a prior count of 0.5. The 'magic' of limma-voom is the weights that are computed and then used in a weighted linear regression to account for the heteroscedasticity. In other words, the observational weights are used in a linear regression (with the logCPM values as the outcome) in order to remove heteroscedasticity of the model residuals. But it appears you want to use the gene expression data as predictors, not outcomes, in which case I don't think the weights are going to be helpful (model weights apply to the outcome, not the predictors).

An alternative I have used in the past when doing WGCNA, where you want the information provided by each gene to be somewhat equivalent, was to use the cqn package to generate GC-bias and length adjusted RPKM values, which will then hypothetically provide 'purer' gene expression values.

Login before adding your answer.

Traffic: 550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6