Question

DESeq2 is it OK to scale transformed data in order to plot heatmaps?

0

Entering edit mode

jeba95 • 0

@4fbbf72a

Last seen 7 weeks ago

France

Hello everyone, I have a question that I could not find any CLEAR answer on that, even though I think I have an "intuition" (because I am relatively new in bulk RNA-seq analysis) that just need to be confirmed, hence my post here.

I have run the DESeq2 pipeline on my data according to the official guidelines (link_1, and link_2). Now, I want to plot a heatmap of my genes of interest, but how to do that PROPERLY?

Indeed, usually in heatmaps, the Z-score is plotted, which should be (x - row_average)/row_standard_deviation, right? However, in the guidelines, the code is:

topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)
mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)

Here, after data transformation (vst method), the transformed values are only subtracted to the row_average, and not "scaled" by dividing by the row_standard deviation. From what I have understood in forums related to DESeq2, it's because the vst (or rlog) transformations are affecting the variance of the data (since they have been designed for that purpose actually), therefore it is not correct to divide by the row_standard_deviation anymore. Is that right ? I am asking that because of two things:

_ First, in some forums (example here), sometimes people claims that you should "scale" the transformed data (vsd or rlog) in the pheatmap function by using the scale="row" option with mat=assay(vsd)[ topVarGenes, ] (not the matrix with - rowMeans(mat)). Indeed scale="row" performs the (x - row_average)/row_standard_deviation formula, but it is not correct to do after vsd transformation, right ? Instead, it should be more correct to apply that formula on the "normalised counts" that can be retrieved by counts(dds,normalized=TRUE), is that right ? But in that case, why having calculated the vsd data in the first place then? This could be wrong to do too as it will over-amplify the genes with low expression..., right?

_ Second, when I plot heatmaps according to the guidelines, I usually have a heatmap that does not look with strong differences between my 3 conditions. However, when I scale either the vsd_transformed_data or the normalized_counts, I have heatmaps that show way more differences between conditions. Additionally, the dendrogram does not look the same if I scale or not. Plus, sometimes, for the same gene, I have different color contrasts if I scale or not, so I don't know what graph is to trust...

Therefore, I am afraid to use the wrong way of plotting my data, which will ultimately lead to misinterpretation of my data, and I want to do it the proper way! Could you please explain clearly what is correct or incorrect to do for heatmaps after data transformation in DESeq2:

A) vst transformation and row_average subtraction, and that's all ? B) vst tranformation (without row_average subtraction) and scaling in pheatmap with the scale="row" option ? C) vst transformation, row_average subtraction, and scaling in pheatmap with the scale="row" option ? D) normalized counts and scaling in pheatmap with the scale="row" option ?

I think A is correct (and maybe D) ?), but they do not generate the same graphs as I said earlier (see the images attached bellow). Actually, every-time, when I scale (whether using B), C), or D), I see bigger differences between my conditions, but is that because I do the wrong thing ?

Thanks in advance for your response, I really like this package as well as your responses on previous forums that have been very useful to me, thank you for all this work, Best regards,

Method A) Method B) Method C) Method D)

sessionInfo( )
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Monterey 12.7.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] plyr_1.8.9                  htmlwidgets_1.6.4          
 [3] plotly_4.10.4               genefilter_1.86.0          
 [5] ashr_2.2-63                 glmpca_0.2.0               
 [7] PoiClaClu_1.0.2.1           pheatmap_1.0.12            
 [9] dplyr_1.1.4                 vsn_3.72.0                 
[11] RColorBrewer_1.1-3          BiocManager_1.30.23        
[13] patchwork_1.2.0             ggplot2_3.5.1              
[15] DESeq2_1.44.0               SummarizedExperiment_1.34.0
[17] Biobase_2.64.0              MatrixGenerics_1.16.0      
[19] matrixStats_1.3.0           GenomicRanges_1.56.1       
[21] GenomeInfoDb_1.40.1         IRanges_2.38.0             
[23] S4Vectors_0.42.0            BiocGenerics_0.50.0        

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               rlang_1.1.4             magrittr_2.0.3         
 [4] compiler_4.4.0          RSQLite_2.3.7           png_0.1-8              
 [7] vctrs_0.6.5             pkgconfig_2.0.3         crayon_1.5.3           
[10] fastmap_1.2.0           XVector_0.44.0          labeling_0.4.3         
[13] utf8_1.2.4              UCSC.utils_1.0.0        preprocessCore_1.66.0  
[16] purrr_1.0.2             bit_4.0.5               zlibbioc_1.50.0        
[19] cachem_1.1.0            jsonlite_1.8.8          blob_1.2.4             
[22] DelayedArray_0.30.1     BiocParallel_1.38.0     irlba_2.3.5.1          
[25] parallel_4.4.0          R6_2.5.1                SQUAREM_2021.1         
[28] limma_3.60.3            Rcpp_1.0.12             Matrix_1.7-0           
[31] splines_4.4.0           tidyselect_1.2.1        abind_1.4-5            
[34] codetools_0.2-20        affy_1.82.0             lattice_0.22-6         
[37] tibble_3.2.1            withr_3.0.0             KEGGREST_1.44.1        
[40] survival_3.7-0          Biostrings_2.72.1       pillar_1.9.0           
[43] affyio_1.74.0           generics_0.1.3          invgamma_1.1           
[46] truncnorm_1.0-9         munsell_0.5.1           scales_1.3.0           
[49] xtable_1.8-4            glue_1.7.0              lazyeval_0.2.2         
[52] tools_4.4.0             hexbin_1.28.3           data.table_1.15.4      
[55] annotate_1.82.0         locfit_1.5-9.9          XML_3.99-0.16.1        
[58] grid_4.4.0              tidyr_1.3.1             AnnotationDbi_1.66.0   
[61] colorspace_2.1-0        GenomeInfoDbData_1.2.12 cli_3.6.3              
[64] fansi_1.0.6             mixsqp_0.3-54           S4Arrays_1.4.1         
[67] viridisLite_0.4.2       gtable_0.3.5            digest_0.6.36          
[70] SparseArray_1.4.8       farver_2.1.2            memoise_2.0.1          
[73] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
[76] statmod_1.5.0           bit64_4.0.5             MASS_7.3-61

DESeq2 heatmaps • 1.2k views

ADD COMMENT • link updated 8 months ago by ATpoint ★ 4.8k • written 8 months ago by jeba95 • 0

score 2 · Accepted Answer · 2024-08-12

The RNA-seq workflow you link is indeed a bit cumbersome when it comes to the heatmap section. The primary reason is that you usually would not cluster highly-variable but significantly different genes from the DESeq2 analysis itself. So basically the output of results() or lfcShrink after filtering for padj (and maybe logFC). In any case, most people do centering and scaling, which means mean subtraction and division by the standard deviation. This is called the Z-score. In R, if you have a numeric matrix x then it is simple t(scale(t(x))) to do the Z-transformation. I would stick with this. This by the way is the same pheatmap does with you set row scaling to TRUE.

As a general remark: In bioinformatics there will never be the one guide that answers all questions for all situations. There will always be many options available, each with strengths and shortcomings. Hence, for starters try what everyone does, or what a tutorial or vignette suggests, and then deviate from it if you have good data-driven reason to do so. Otherwise, if always trying everything, you end up in the so-called "garden of forked paths".