Question

Top variable features used by 'runPCA' in scater

jws • 0

According to the scater vignette (https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-dataviz.html#generating-pca-plots), runPCA by default performs PCA on the log-counts using the 500 features with the most variable expression across all cells.

I am wondering how the most variable expression is determined, and how the names of features (genes) can be extracted. Thanks!

scater pca features • 1.6k views

updated 6.3 years ago by Aaron Lun ★ 28k • written 6.3 years ago by jws • 0

score 2 · Accepted Answer · 2018-12-12

It's pretty literal. The top 500 genes with the largest variance of the log-counts are used - and that's it. You can get them by doing:

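library(scater)  # also loads SingleCellExperiment, which provides logcounts()
vars <- DelayedMatrixStats::rowVars(logcounts(sce))
top <- head(order(vars, decreasing=TRUE), 500)  # row indices of the top 500 genes
rownames(sce)[top]  # the corresponding gene names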

There's no consideration of the mean-variance trend or of technical components of variance or anything like that. If you want something more sophisticated, check out trendVar and decomposeVar (or possibly technicalCV2 and improvedCV2) in scran.
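To make the selection concrete, here is a minimal base-R sketch on simulated data. The matrix `logc` is a made-up stand-in for `logcounts(sce)`, and `prcomp` is used only to illustrate running PCA on the selected genes; this is not the code inside runPCA, which may differ in details such as scaling.

```r
# Toy stand-in for logcounts(sce): 2000 genes x 50 cells (simulated, not real data)
set.seed(1)
logc <- matrix(rnorm(2000 * 50), nrow = 2000,
               dimnames = list(paste0("gene", 1:2000), paste0("cell", 1:50)))

# Per-gene variance of the log-expression values
vars <- apply(logc, 1, var)

# Names of the 500 genes with the largest variance
top_genes <- names(sort(vars, decreasing = TRUE))[1:500]

# PCA on the selected genes only; prcomp expects cells in rows
pca <- prcomp(t(logc[top_genes, ]))
head(top_genes)
```

With a real SingleCellExperiment you would swap `logc` for `as.matrix(logcounts(sce))`, memory permitting.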