Question

Using voom-transformed counts for unsupervised analyses (PCA, Random Forest, Elastic Net): Best practices?

0

Ahdee ▴ 60

@ahdee-8938

Last seen 5 months ago

United States

I'm using the voom transformation (limma package) for RNA-seq analysis and am considering using the voom-transformed counts, rather than CPM or log2(CPM), for downstream analyses such as PCA, Random Forest, and Elastic Net. The homoscedastic property of the voom transformation seems advantageous for these methods, but I'm not sure whether this is advisable. If it is, what are the best practices? Specifically, should the voom transformation be performed without a design matrix for these unsupervised analyses, to avoid potential bias?

Thanks in advance!

R version 4.1.2 (2021-11-01)
limma_3.50.3

limma limma-voom • 768 views

Updated 5 months ago by Gordon Smyth 52k • written 5 months ago by Ahdee ▴ 60

2

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 20 hours ago

United States

The 'voom transformed counts' are just logCPM with a prior count of 0.5. The 'magic' of limma-voom is the weights that are computed and then used in a weighted linear regression to account for the heteroscedasticity. In other words, the observational weights are used in a linear regression (with the logCPM values as the outcome) in order to remove heteroscedasticity of the model residuals. But it appears you want to use the gene expression data as predictors, not outcomes, in which case I don't think the weights are going to be helpful (model weights apply to the outcome, not the predictors).
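To make that distinction concrete, here is a minimal sketch on simulated counts (the data, object names, and group labels are made up for illustration). It assumes voom() is called on a plain count matrix with its default settings, in which case the returned E values should match log2 CPM with an offset of 0.5 on the counts, and should not depend on the design matrix. Only the weights change with the design.

```r
library(limma)

set.seed(1)
# Simulated counts: 500 genes x 6 samples (illustrative data only)
counts <- matrix(rnbinom(500 * 6, mu = 50, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

# voom with a group design, and with no design (intercept only)
v_group <- voom(counts, design = model.matrix(~ group))
v_null  <- voom(counts)

# Manual log2 CPM with an offset of 0.5 on the counts and 1 on the library size
lib_size <- colSums(counts)
manual <- t(log2(t(counts + 0.5) / (lib_size + 1) * 1e6))

# The expression values are the same regardless of the design ...
all.equal(v_group$E, manual, check.attributes = FALSE)
identical(v_group$E, v_null$E)

# ... whereas the precision weights are what the design influences
isTRUE(all.equal(v_group$weights, v_null$weights))
dim(v_group$weights)  # one weight per gene per sample
```

So dropping the design matrix would not change the expression values passed to PCA or a classifier. It would only change weights that those tools ignore anyway.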

An alternative I have used in the past for WGCNA, where you want each gene to contribute roughly equivalent information, is to use the cqn package to generate GC-bias- and length-adjusted RPKM values, which should in principle give 'purer' gene expression values.

score 2 · Accepted Answer · 2024-11-18

voom() does not transform the data to be homoscedastic. Quite the opposite: it computes precision weights that reflect the mean-variance trend, i.e., unequal variances. In other words, it accounts for unequal variances rather than removing them.

I do not recommend the use of voom() output for any downstream tool that does not use precision weights. To export expression values from limma or edgeR for input to PCA, Random Forest, and Elastic Net, I simply use cpm(counts,log=TRUE), as recommended in the edgeR User's Guide. Indeed, if you type plotMDS(y, gene.selection="common"), where y is a DGEList, you will automatically get a PCA plot of the log2CPM values from cpm().
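As a rough sketch of that workflow (the simulated counts and object names are illustrative assumptions, and the calcNormFactors() step is an addition that the answer above does not mention), the log2 CPM matrix can be exported and handed to a generic tool such as prcomp():

```r
library(edgeR)

set.seed(1)
# Simulated counts: 500 genes x 6 samples (illustrative data only)
counts <- matrix(rnbinom(500 * 6, mu = 50, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)        # TMM scaling factors (assumed extra step)

# log2 CPM values for export to downstream tools
logcpm <- cpm(y, log = TRUE)

# PCA with samples as rows, genes as columns
pca <- prcomp(t(logcpm))
head(pca$x[, 1:2])

# The same samples-by-genes matrix could serve as the predictor matrix
# for Random Forest or Elastic Net
x <- t(logcpm)

# PCA-style plot directly from the DGEList, as described above
plotMDS(y, gene.selection = "common")
```

Note that cpm() with log=TRUE applies its own default prior count, so these values will not be numerically identical to the E matrix returned by voom().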