Why select top highly expressed genes based on normalized counts, but then plot VST-transformed values in DESeq2 pheatmap?
1
0
Entering edit mode
Quang NN ▴ 10
@873e0bdb
Last seen 7 hours ago
USA

Hi all,

In the tutorial (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-quality-assessment-by-sample-clustering-and-visualization), the selected the top genes based on normalized counts, but then plot VST values (APPROACH 1).

### APPROACH 1
# Select top genes by rowMeans of normalized counts
select <- order(rowMeans(counts(dds, normalized=TRUE), 
                         na.rm=TRUE), 
                decreasing=TRUE)[1:100]

# Transform data and use 'select' indices for plotting
vsd <- vst(dds, blind=TRUE)
pheatmap(t(assay(vsd)[select, ]), 
         cluster_rows=TRUE, 
         cluster_cols=TRUE,
         ...)

If I want to plot the top highly expressed genes in a group of biological replicates (e.g., within the same cell type and same condition just across different donors), should I instead select top genes based on VST-transformed values, then plot VST (APPROACH 2)?

Thank you for your input!

### APPROACH 2
# 1) Transform data first
vsd <- vst(dds, blind=TRUE)

# 2) Select top genes by rowMeans of VST
gene_means_vst <- rowMeans(assay(vsd))
select <- order(gene_means_vst, decreasing=TRUE)[1:100]

# 3) Plot using the same VST data
pheatmap(t(assay(vsd)[select, ]), 
         cluster_rows=TRUE, 
         cluster_cols=TRUE,
         ...)
RNASeq DESeq2 • 391 views
ADD COMMENT
0
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 10 hours ago
Germany

The choice of variable genes is typically part of a QC process. You select genes that in an unbiased fashion (unbiased because variance calculation does not know about sample-to-group relationship) separate samples and then feed these into clustering or PCA to identify outliers or batch effects.

You can plot "highly-expressed" genes, but I am not sure what this will give you. Absolute expression level (to me) does not mean much in RNA-seq since it is influenced by a lot of technical factors such as mappability, GC and amplification bias of the gene and the overall composition of the library.

ADD COMMENT
0
Entering edit mode

Thank you ATpoint for your responses.

I am a bit confused because the difference between the two approaches is the selection of the genes, because in the plotting, it still uses assay(vsd) to subset the selected genes.

Are you implying that approach 1 select <- order(rowMeans(counts(dds, normalized=TRUE), na.rm=TRUE), decreasing=TRUE)[1:100] would select the highly expressed genes, whereas approach 2 select <- order(rowMeans(assay(vsd)), decreasing=TRUE)[1:100] would select the variable genes?

Accordingly, if I want to plot the highly expressed genes selected by approach 1, I wonder why it was then suggested to use pheatmap(assay(vsd)[select, ]) in the tutorial, i.e., why cannot we use directly the normalized count matrix from counts(dds, normalized=TRUE), na.rm=TRUE)[select, ]

Thank you again.

ADD REPLY
0
Entering edit mode

I think I explained what the differences between approach 1 and 2 is in my answer. I assume that the tutorial uses vst because it is variance-stabilized and on log2 scale. The normalized counts are neither, so not optimal for plotting and downstream analysis.

ADD REPLY
0
Entering edit mode

I see. Thank you ATpoint!

ADD REPLY

Login before adding your answer.

Traffic: 779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6