Question

DESeq2 Heatmap

0

Entering edit mode

pm_25 • 0

@4b4c4e07

Last seen 17 hours ago

United States

Hello all,

I'm new to RNASeq and bioinformatics. I'm trying to create a map of the top 500 differentially expressed genes across treatments, where I have one vehicle/control condition compared to six treatments. First, I calculated the mean expression for replicates to collapse them into a single column per condition. Next, I scaled the data and generated the heatmap.

However, I've been asked why the first column (vehicle/control) isn't entirely zero on the log fold change (logFC) expression scale. I was told that this might indicate an issue in my code, as all treatments should be normalized relative to the vehicle/control condition. Where the first column should then look all white (or at 0).

I seem to be lacking understanding in this area. Any input would be appreciated,

dds <- DESeqDataSetFromTximport(txi, colData = sampleTable,~Condition-1)

dds <- DESeq(dds) 

keep <- rowSums(counts(dds)) >= 1
dds <- dds[keep,]
rld <- rlog(dds, blind = FALSE)

data.frame(assay(rld))

rld.df <- data.frame(assay(rld))

rld.collapsed  <- t(apply(rld.df, 1, function(row) c(mean(row[1:3]), mean(row[4:6]), mean(row[7:9]),mean(row[10:12]),mean(row[13:15]),mean(row[16:18]),mean(row[19:21]),mean(row[22:24]) )))

colnames(rld.df)


colnames(rld.collapsed)  <- c("Veh_1d", "Treatment_1_1d", "Treatment_2_1d", "Treatment_3_1d",
                         "Treatment_4_1d", "Treatment_5_1d", "Treatment_6_1d", "Treatment_7_1d")


top_500_sig_genes <- rownames(res_ord_padj)[1:500]
data.frame(top_500_sig_genes)


selected_conditions <- c("Veh_1d", "Treatment_1_1d", "Treatment_2_1d", "Treatment_3_1d",
                         "Treatment_4_1d", "Treatment_5_1d", "Treatment_6_1d")

cellLine <- list(c())

rld_heatmap <- rld_collapsed_obj[, colData(rld_collapsed_obj)$Condition %in% selected_conditions]

mat <- assay(rld_heatmap)[top_500_sig_genes, ]


mat_scaled <- t(apply(mat, 1, scale))


colnames(mat_scaled) <- colnames(assay(rld_heatmap))


cols <- colorRamp2(c(-3, 0, 3), c("blue", "white", "red"))


ha <- HeatmapAnnotation(
  Treatment = as.factor(colData(rld_heatmap)$Condition),
  col = list(Treatment = c(
    "Veh_1d" = "#440154ff",
    "Treatment_1_1d" = "#440154ff",
    "Treatment_2_1d" = "#440154ff",
    "Treatment_3_1d" = "#440154ff",
    "Treatment_4_1d" = "#21908dff",
    "Treatment_5_1d" = "#21908dff",
    "Treatment_6_1d" = "#21908dff"
  )),
  show_legend = TRUE
)


ht <- Heatmap(
  mat_scaled,
  name = "Expression",
  col = cols,
  top_annotation = ha,
  show_row_names = FALSE,
  show_column_names = TRUE,
  column_names_rot = 45,
  cluster_rows = TRUE,
  cluster_columns = FALSE,
)


ppi=500
png(paste0(figures, "heatmap_DEG_top500_", ".png"), width=7.5*ppi, height=6*ppi, res=ppi)
draw(ht, row_title = "Genes", column_title = "Day 1 Vehicle vs. Treatments")
dev.off()

enter image description here

heatmaps • 529 views

ADD COMMENT • link updated 11 weeks ago by swbarnes2 ★ 1.4k • written 11 weeks ago by pm_25 • 0

score 0 · Answer 1 · 2025-01-14

Two things. First, scale is already vectorized, so you don't have to use apply. You can instead just do t(scale(t(mat))) . Second, the default for scale is to convert data to z-scores, so the individual values represent the distance from the average gene expression in units of standard deviation. If you wanted a heatmap that represented deviations from some baseline, you should compute logCPM values and then subtract out the baseline (you could use sweep, which is what scale uses under the hood). I have generated heatmaps with a baseline like that in the past, but they usually look sort of dumb with one entirely white column. I find z-scores like you already have to be more informative.

score 0 · Answer 2 · 2025-01-15

However, I've been asked why the first column (vehicle/control) isn't entirely zero on the log fold change (logFC) expression scale.

You are not graphing Log fold change as compared to Wt; you are looking at expression (a little normalized with rlog) as compared to the mean across all samples. If you were looking at log fold changes, it wouldn't make sense to have the wt there, and you wouldn't have to average all the replicates, you'd already have just one value for each sample group for each gene.