Hi all,
In the tutorial (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-quality-assessment-by-sample-clustering-and-visualization), the selected the top genes based on normalized counts, but then plot VST values (APPROACH 1).
### APPROACH 1
# Select top genes by rowMeans of normalized counts
select <- order(rowMeans(counts(dds, normalized=TRUE),
na.rm=TRUE),
decreasing=TRUE)[1:100]
# Transform data and use 'select' indices for plotting
vsd <- vst(dds, blind=TRUE)
pheatmap(t(assay(vsd)[select, ]),
cluster_rows=TRUE,
cluster_cols=TRUE,
...)
If I want to plot the top highly expressed genes in a group of biological replicates (e.g., within the same cell type and same condition just across different donors), should I instead select top genes based on VST-transformed values, then plot VST (APPROACH 2)?
Thank you for your input!
### APPROACH 2
# 1) Transform data first
vsd <- vst(dds, blind=TRUE)
# 2) Select top genes by rowMeans of VST
gene_means_vst <- rowMeans(assay(vsd))
select <- order(gene_means_vst, decreasing=TRUE)[1:100]
# 3) Plot using the same VST data
pheatmap(t(assay(vsd)[select, ]),
cluster_rows=TRUE,
cluster_cols=TRUE,
...)
Thank you ATpoint for your responses.
I am a bit confused because the difference between the two approaches is the selection of the genes, because in the plotting, it still uses
assay(vsd)
to subset the selected genes.Are you implying that approach 1
select <- order(rowMeans(counts(dds, normalized=TRUE), na.rm=TRUE), decreasing=TRUE)[1:100]
would select the highly expressed genes, whereas approach 2select <- order(rowMeans(assay(vsd)), decreasing=TRUE)[1:100]
would select the variable genes?Accordingly, if I want to plot the highly expressed genes selected by approach 1, I wonder why it was then suggested to use
pheatmap(assay(vsd)[select, ])
in the tutorial, i.e., why cannot we use directly the normalized count matrix fromcounts(dds, normalized=TRUE), na.rm=TRUE)[select, ]
Thank you again.
I think I explained what the differences between approach 1 and 2 is in my answer. I assume that the tutorial uses vst because it is variance-stabilized and on log2 scale. The normalized counts are neither, so not optimal for plotting and downstream analysis.
I see. Thank you ATpoint!