Hi, I am trying to perform quality assessment for bulk rna-seq data that I analyzed via DESeq2. I would like to try clustering samples on a heatmap showing genes that were variably expressed. I saw on the deseq vignette that it is possible to generate a heatmap of 20 genes with the highest mean expression using the following code:
library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
I would like to adapt this code to make a heatmap of the 200 genes that are most variably expressed (have the greatest SD after rlog transformation).
library("pheatmap")
row.sd <- order(apply(assay(rld), 1, sd), decreasing = TRUE) [1:200]
inf <- as.data.frame(colData(dds)[,c("Condition")])
pheatmap(assay(rld)[row.sd,], cluster_rows = TRUE, show_rownames = FALSE,
cluster_cols = TRUE, annotation_col=inf)
sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6
This code runs fine and generates a heatmap, but the control and experimental conditions do not form separate clusters. My question is whether or not this is a valid way to attempt hierarchical clustering for rna-seq samples? Since I am using the transformed dds to calculate standard deviation, I think this code should be selecting for genes that have variable expression, while minimizing the bias for highly expressed genes that would occur if used the untransformed data set.
It also seems possible that I'm selecting for genes that are outliers regardless of sample condition and that this could lead to unreliable clustering.
Thanks!