I'm comparing normalized gene expression using a naive log2 transformation of raw counts and DESeq::rlog method. It seems to me that rlog tends to apply considerable shrinkage in this dataset resulting in a noticeable difference between the two methods even for highly expressed genes. I.e., in the figure below I would expect the dots (samples) to align more closely along the diagonal.
This is real data from Nanostring which I have anonymized and put on dropbox.
I'm inclined to think that in this case rlog is not very appropriate perhaps due to the nature of Nanostring experiments (few genes many of them expected to be different between conditions) compared to RNAseq. Also, after testing for differential expression analysis using lfcShrink, the estimates of fold-change are quite a bit higher relative to the rlog estimates. That is, estimates of log2 fold-change are better recapitulated by the naive log2 normalization than by the rlog (I understand that rlog and similar should not be used for testing DE but still...)
Any comment much appreciated... Is it appropriate to use rlog here? Is this difference expected?
library(data.table)
library(ggplot2)
library(DESeq2)
mat2 <- read.csv('https://www.dropbox.com/s/xdrdgc06392vp3g/nano.tsv?dl=1', sep= '\t')
acc <- mat2$Accession
mat2$Accession <- NULL
mat2 <- as.matrix(mat2)
rownames(mat2) <- acc
design <- data.frame(sample_id= colnames(mat2), tissue= sub('.*_', '', colnames(mat2)))
deseq <- DESeqDataSetFromMatrix(mat2, colData= design, design= ~ tissue)
deseq <- DESeq(deseq)
# RLOG NORMALIZATION
rlog_gex <- rlog(deseq)
rlog_gex <- data.table(assay(rlog_gex), Accession= rownames(rlog_gex))
rlog_gex <- melt(data= rlog_gex, id.vars= 'Accession', variable.name= 'nanostring_id', value.name= 'rlog_gex')
# NAIVE log2 NORMALIZATION
log_gex <- normTransform(deseq)
log_gex <- data.table(assay(log_gex), Accession= rownames(log_gex))
log_gex <- melt(data= log_gex, id.vars= 'Accession', variable.name= 'nanostring_id', value.name= 'log2_gex')
dat <- merge(rlog_gex, log_gex, by= c('Accession', 'nanostring_id'))
gg <- ggplot(data= dat, aes(x= log2_gex, y= rlog_gex)) +
geom_abline(intercept= 0, slope= 1, colour= 'blue') +
geom_point(size= 0.2) +
facet_wrap(~Accession, scale= 'free') +
theme_light()
ggsave('tmp.png', width= 13, height= 13)
sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/20200306_matt_nanostr_bmpb/lib/libopenblasp-r0.3.12.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.26.0 SummarizedExperiment_1.16.0 DelayedArray_0.12.0 BiocParallel_1.20.0 matrixStats_0.58.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.0
[10] S4Vectors_0.24.0 BiocGenerics_0.32.0 ggplot2_3.3.2 data.table_1.12.2
loaded via a namespace (and not attached):
[1] bit64_4.0.5 splines_3.6.3 Formula_1.2-4 latticeExtra_0.6-29 blob_1.2.1 GenomeInfoDbData_1.2.2 pillar_1.6.0 RSQLite_2.2.5 backports_1.2.1 lattice_0.20-41 glue_1.4.2
[12] digest_0.6.27 RColorBrewer_1.1-2 XVector_0.26.0 checkmate_2.0.0 colorspace_2.0-0 htmltools_0.5.1.1 Matrix_1.3-2 XML_3.99-0.3 pkgconfig_2.0.3 genefilter_1.68.0 zlibbioc_1.32.0
[23] xtable_1.8-4 scales_1.1.1 jpeg_0.1-8.1 htmlTable_2.2.1 tibble_3.1.0 annotate_1.64.0 ellipsis_0.3.1 cachem_1.0.4 withr_2.4.1 nnet_7.3-16 survival_3.2-11
[34] magrittr_2.0.1 crayon_1.4.1 memoise_2.0.0 fansi_0.4.2 foreign_0.8-76 tools_3.6.3 lifecycle_1.0.0 stringr_1.4.0 locfit_1.5-9.4 munsell_0.5.0 cluster_2.1.0
[45] AnnotationDbi_1.48.0 compiler_3.6.3 rlang_0.4.10 grid_3.6.3 RCurl_1.98-1.3 rstudioapi_0.13 htmlwidgets_1.5.3 bitops_1.0-6 base64enc_0.1-3 gtable_0.3.0 DBI_1.1.1
[56] R6_2.5.0 gridExtra_2.3 knitr_1.33 fastmap_1.1.0 bit_4.0.4 utf8_1.2.1 Hmisc_4.4-1 stringi_1.4.3 Rcpp_1.0.6 geneplotter_1.64.0 vctrs_0.3.7
[67] rpart_4.1-15 png_0.1-7 xfun_0.23
Mike will give his two cents for sure but based on Which rlog should I use for DESeq2 analysis and other threads and tweets I recall
rlog
is discouraged these days (as is thenormal
method inlfcShrink
) as it tends to overshrink the effect size. For shrinkage the recommendation is either ashr or apeglm, and for data transformation it is thevst
afaik. I also recall threads that DESeq2+Nanostring is fine as long as you make sure that you confidently normalize based on a set of non-changing/housekeeping genes.