Question

Why select top highly expressed genes based on normalized counts, but then plot VST-transformed values in DESeq2 pheatmap?

0

Entering edit mode

Quang NN ▴ 10

@873e0bdb

Last seen 11 weeks ago

USA

Hi all,

In the tutorial (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-quality-assessment-by-sample-clustering-and-visualization), the selected the top genes based on normalized counts, but then plot VST values (APPROACH 1).

### APPROACH 1
# Select top genes by rowMeans of normalized counts
select <- order(rowMeans(counts(dds, normalized=TRUE), 
                         na.rm=TRUE), 
                decreasing=TRUE)[1:100]

# Transform data and use 'select' indices for plotting
vsd <- vst(dds, blind=TRUE)
pheatmap(t(assay(vsd)[select, ]), 
         cluster_rows=TRUE, 
         cluster_cols=TRUE,
         ...)

If I want to plot the top highly expressed genes in a group of biological replicates (e.g., within the same cell type and same condition just across different donors), should I instead select top genes based on VST-transformed values, then plot VST (APPROACH 2)?

Thank you for your input!

### APPROACH 2
# 1) Transform data first
vsd <- vst(dds, blind=TRUE)

# 2) Select top genes by rowMeans of VST
gene_means_vst <- rowMeans(assay(vsd))
select <- order(gene_means_vst, decreasing=TRUE)[1:100]

# 3) Plot using the same VST data
pheatmap(t(assay(vsd)[select, ]), 
         cluster_rows=TRUE, 
         cluster_cols=TRUE,
         ...)

RNASeq DESeq2 • 1.0k views

ADD COMMENT • link 12 weeks ago • updated 11 weeks ago Quang NN ▴ 10

score 0 · Answer 1 · 2025-01-27

0

Entering edit mode

ATpoint ★ 4.8k

@atpoint-13662

Last seen 3 days ago

Germany

The choice of variable genes is typically part of a QC process. You select genes that in an unbiased fashion (unbiased because variance calculation does not know about sample-to-group relationship) separate samples and then feed these into clustering or PCA to identify outliers or batch effects.

You can plot "highly-expressed" genes, but I am not sure what this will give you. Absolute expression level (to me) does not mean much in RNA-seq since it is influenced by a lot of technical factors such as mappability, GC and amplification bias of the gene and the overall composition of the library.

ADD COMMENT • link 11 weeks ago ATpoint ★ 4.8k

0

Entering edit mode

Thank you ATpoint for your responses.

I am a bit confused because the difference between the two approaches is the selection of the genes, because in the plotting, it still uses assay(vsd) to subset the selected genes.

Are you implying that approach 1 select <- order(rowMeans(counts(dds, normalized=TRUE), na.rm=TRUE), decreasing=TRUE)[1:100] would select the highly expressed genes, whereas approach 2 select <- order(rowMeans(assay(vsd)), decreasing=TRUE)[1:100] would select the variable genes?

Accordingly, if I want to plot the highly expressed genes selected by approach 1, I wonder why it was then suggested to use pheatmap(assay(vsd)[select, ]) in the tutorial, i.e., why cannot we use directly the normalized count matrix from counts(dds, normalized=TRUE), na.rm=TRUE)[select, ]

Thank you again.

ADD REPLY • link 11 weeks ago Quang NN ▴ 10

0

Entering edit mode

I think I explained what the differences between approach 1 and 2 is in my answer. I assume that the tutorial uses vst because it is variance-stabilized and on log2 scale. The normalized counts are neither, so not optimal for plotting and downstream analysis.