Question

Using the standard deviation of DESeq rlog transformed counts to identify genes for hierarchical clustering

0

Entering edit mode

Josiah • 0

@3b6f944a

Last seen 2.1 years ago

United States

Hi, I am trying to perform quality assessment for bulk rna-seq data that I analyzed via DESeq2. I would like to try clustering samples on a heatmap showing genes that were variably expressed. I saw on the deseq vignette that it is possible to generate a heatmap of 20 genes with the highest mean expression using the following code:

library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)

I would like to adapt this code to make a heatmap of the 200 genes that are most variably expressed (have the greatest SD after rlog transformation).

library("pheatmap")
row.sd <- order(apply(assay(rld), 1, sd), decreasing = TRUE) [1:200]
inf <- as.data.frame(colData(dds)[,c("Condition")])

pheatmap(assay(rld)[row.sd,], cluster_rows = TRUE, show_rownames = FALSE,
         cluster_cols = TRUE, annotation_col=inf)

sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6

This code runs fine and generates a heatmap, but the control and experimental conditions do not form separate clusters. My question is whether or not this is a valid way to attempt hierarchical clustering for rna-seq samples? Since I am using the transformed dds to calculate standard deviation, I think this code should be selecting for genes that have variable expression, while minimizing the bias for highly expressed genes that would occur if used the untransformed data set.

It also seems possible that I'm selecting for genes that are outliers regardless of sample condition and that this could lead to unreliable clustering.

Thanks!

DESeq2 • 867 views

ADD COMMENT • link updated 2.1 years ago by ATpoint ★ 4.6k • written 2.1 years ago by Josiah • 0

score 1 · Answer 1 · 2023-01-04

hclust is typically done to group genes into clusters having similar expression patterns but this only makes sense for genes that are DE. Genes selected by variance/standard deviation can either be DE (different between groups) or just very noisy (different between all samples, randomly), and therefore not reliable to drive any group separation. It is unusual to do hclust as a QC metric. I would stick to PCA as done in the vignette, use plotPCA. Somewhat related Gordon Smyth How to identify top most variable genes from log2 normalized data?