rlog transformation producing outliers with very high log-transformed counts
@frederik-ziebell-14676
Heidelberg, Germany

I have a data set in which the rlog transformation produces, in one condition, a few genes with very high log-transformed counts. Here is a minimal example (the real data set has more genes but shows the same problem):

library("DESeq2")
library("tidyverse")

# load data
load(url("https://github.com/frederikziebell/rlog-example/raw/master/data.RData"))

# # re-run DESeq and rlog; takes about 15min on a laptop
# rowData(dds) <- NULL
# colData(dds) <- colData(dds)[,c("condition","replicate")]
# dds <- DESeq(dds)
# rld <- rlog(dds)

# rlog vs. log2(.+1)
assay(rld) %>%
  data.frame %>%
  rownames_to_column(var="gene_id") %>%
  gather("sample","rlog",-gene_id) %>%
  left_join(
    counts(dds, normalized=TRUE) %>%
    data.frame %>%
    rownames_to_column(var="gene_id") %>%
    gather("sample","count",-gene_id), by=c("gene_id","sample")
  ) %>%
  ggplot(aes(log2(count+1),rlog)) +
  xlim(-5,30) +
  ylim(-5,100) +
  coord_fixed(ratio=35/105)+
  geom_hex(bins=80) +
  scale_fill_gradient(name = "count", trans = "log", breaks = 10^(0:5)) +
  geom_abline(slope=1, intercept=0, color="red")

Identifying those genes, it appears that some (but not all) of them have a very high MAP dispersion estimate and are therefore flagged as dispersion outliers.

# find genes with high rlog value
trouble_genes <- assay(rld) %>%
  data.frame %>%
  rownames_to_column(var="gene_id") %>%
  gather("sample","value",-gene_id) %>%
  filter(value>40) %>%
  pull(gene_id) %>%
  unique

# info on genes with high rlog value
rowData(rld)[which(rownames(rld) %in% trouble_genes),c(1:8,316:323)]
DataFrame with 7 rows and 16 columns
   baseMean   baseVar   allZero   dispersion  dispIter dispOutlier      dispMAP  Intercept  betaConv  betaIter  deviance    maxCooks   replace dispGeneEst    dispFit    rlogIntercept
  <numeric> <numeric> <logical>    <numeric> <integer>   <logical>    <numeric>  <numeric> <logical> <numeric> <numeric>   <numeric> <logical>   <numeric>  <numeric>         <matrix>
1  33.48204  81957.56     FALSE   0.98408592        10       FALSE   0.98408592  1.4531139      TRUE         9  456.8104 0.007709898     FALSE    15.51009 0.09718006  5.0658900019208
2  33.27621  80449.15     FALSE   0.05568701         4       FALSE   0.05568701  0.7721303      TRUE         7  452.0163 0.008590950     FALSE    13.36176 0.09737754 5.05696691913032
3  15.17365  17086.37     FALSE 228.00000000        12        TRUE 228.00000000 -0.1547366      TRUE         7  275.3917 0.005929372     FALSE    82.04500 0.13570113 3.92289498948781
4  20.64698  31552.81     FALSE 228.00000000        11        TRUE  39.37817492 -0.1036617      TRUE         8  475.7777 0.006615980      TRUE    41.18565 0.11702636 4.36741465803137
5  20.65373  31552.54     FALSE 228.00000000        11        TRUE  35.66040070 -0.1034237      TRUE         8  487.4940 0.006618849      TRUE    39.87769 0.11700946  4.3678728069692
6  17.43267  22559.42     FALSE 228.00000000         7        TRUE 228.00000000 -0.1543131      TRUE         6  309.6846 0.002378541     FALSE    71.02055 0.12657227 4.12325093031352
7  33.27158  80449.98     FALSE   0.05742513         4       FALSE   0.05742513  0.7724590      TRUE         7  449.8221 0.008426312     FALSE    13.52475 0.09738201 5.05676727448283

Here is a representative example for one of those genes:

plotCounts(dds, trouble_genes[1], intgroup="condition", returnData=TRUE) %>%
  ggplot(aes(condition, log2(count+1))) +
  geom_point() +
  theme(axis.text.x=element_text(angle = 90, hjust = 0))

Is there a way to get more meaningful rlog-transformed values for those gene/sample pairs? I also tried setting blind=FALSE in the rlog() call, but with no success.
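For completeness, the blind=FALSE variant I tried is essentially the following (re-using the dds from above):

# use the design (~condition) rather than an intercept-only model
# when estimating dispersions for the transformation
rld_design <- rlog(dds, blind=FALSE)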

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2               hexbin_1.27.2              forcats_0.2.0              stringr_1.2.0              dplyr_0.7.4               
 [6] purrr_0.2.4                readr_1.1.1                tidyr_0.7.2                tibble_1.4.2               ggplot2_2.2.1             
[11] tidyverse_1.2.1            DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.0        
[16] Biobase_2.38.0             GenomicRanges_1.30.1       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0          
[21] BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           bitops_1.0-6           lubridate_1.7.1        bit64_0.9-7            RColorBrewer_1.1-2     httr_1.3.1            
 [7] tools_3.4.2            backports_1.1.2        R6_2.2.2               rpart_4.1-12           Hmisc_4.1-1            DBI_0.7               
[13] lazyeval_0.2.1         colorspace_1.3-2       nnet_7.3-12            tidyselect_0.2.3       gridExtra_2.3          mnormt_1.5-5          
[19] bit_1.1-12             compiler_3.4.2         cli_1.0.0              rvest_0.3.2            htmlTable_1.11.2       xml2_1.2.0            
[25] labeling_0.3           scales_0.5.0           checkmate_1.8.5        psych_1.7.8            genefilter_1.60.0      digest_0.6.14         
[31] foreign_0.8-69         XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6        htmlwidgets_1.0       
[37] rlang_0.1.6            readxl_1.0.0           rstudioapi_0.7         RSQLite_2.0            bindr_0.1              jsonlite_1.5          
[43] BiocParallel_1.12.0    acepack_1.4.1          RCurl_1.95-4.10        magrittr_1.5           GenomeInfoDbData_1.0.0 Formula_1.2-2         
[49] Matrix_1.2-12          Rcpp_0.12.15           munsell_0.4.3          stringi_1.1.6          yaml_2.1.16            zlibbioc_1.24.0       
[55] plyr_1.8.4             grid_3.4.2             blob_1.1.0             crayon_1.3.4           lattice_0.20-35        haven_1.1.1           
[61] splines_3.4.2          annotate_1.56.1        hms_0.4.1              locfit_1.5-9.1         knitr_1.18             pillar_1.1.0          
[67] geneplotter_1.56.0     reshape2_1.4.3         XML_3.98-1.9           glue_1.2.0             latticeExtra_0.6-28    data.table_1.10.4-3   
[73] modelr_0.1.1           cellranger_1.1.0       gtable_0.2.0           assertthat_0.2.0       xtable_1.8-2           broom_0.4.3           
[79] survival_2.41-3        AnnotationDbi_1.40.0   memoise_1.1.0          cluster_2.0.6
deseq2 rlog

Hi Mike,

Thank you for your explanations. The reason I like to use rlog() over vst() is that it achieves almost identical within-sample distributions of the transformed counts across all samples, which may be connected to the size factor issue you mentioned. Do you see potential future improvements of rlog() using the different shrinkage methods currently implemented in lfcShrink()?
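To illustrate what I mean by the within-sample distributions, here is a quick sketch comparing the two transformations on the example data (assuming the dds and rld objects from above; if the small example object has too few genes for vst()'s default nsub, varianceStabilizingTransformation() can be used instead):

# variance stabilizing transformation for comparison
vsd <- vst(dds, blind=TRUE)

# per-sample distributions of rlog- and vst-transformed values
bind_rows(
  assay(rld) %>% data.frame %>% gather("sample","value") %>% mutate(transform="rlog"),
  assay(vsd) %>% data.frame %>% gather("sample","value") %>% mutate(transform="vst")
) %>%
  ggplot(aes(sample, value)) +
  geom_boxplot() +
  facet_wrap(~transform, ncol=1) +
  theme(axis.text.x=element_text(angle=90, hjust=1))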


That's a really good question. In theory the new estimators would be better than the normal-prior-based estimators, but I'm not directly working on that implementation right now and have too many other projects on my plate. So within DESeq2 I think vst() is best here. Another option would be to try zinbwave factorization, which can account for groups with zeros and high counts.
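A minimal sketch of that switch, re-using the dds from the question (plotPCA() is just one example of downstream use):

# variance stabilizing transformation as a drop-in replacement for rlog()
vsd <- vst(dds)   # blind=TRUE by default; blind=FALSE would use the design

# the transformed matrix is used the same way as assay(rld),
# e.g. for PCA or sample clustering
plotPCA(vsd, intgroup="condition")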


Two more points:

  • Data quality: If the size factors (or indeed the library sizes) vary strongly across what should be comparable assays, is this perhaps a quality issue, meaning that some of the (presumably very low-coverage) samples should be ignored rather than making a pretense of analysing them? (A quick way to check this is sketched after this list.)
  • Propagation of uncertainty: A related concern is that rlog (as well as VST) transformed values are point estimates that carry no direct information on uncertainty. This is not so bad if the size factors are similar, since then at least the same values on the transformed scale have similar uncertainties; but this useful property breaks down if the size factors are very different across samples.
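For the first point, a minimal sketch of how one might inspect the spread of library sizes and size factors (assuming the dds object from the question; the ratio one would act on is a matter of judgment, not a DESeq2 default):

# tabulate library sizes and size factors per sample
data.frame(
  sample     = colnames(dds),
  libSize    = colSums(counts(dds)),
  sizeFactor = sizeFactors(dds)
) %>%
  arrange(sizeFactor)

# ratio of largest to smallest size factor; a very large ratio may point to
# low-coverage samples that are better removed than analysed
max(sizeFactors(dds)) / min(sizeFactors(dds))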

 

@mikelove
United States

Hi Frederik,

I have noticed that rlog() does not perform well when there are many samples (both in speed and in quality), especially when the data are far from following a negative binomial (NB) distribution, which typically means large count outliers. rlog() does not have any procedures to deal with outliers, like those DESeq() uses. This looks like a case where the transformation breaks down badly because the data are far from NB.

In the vignette and on this forum I recommend using vst() with large datasets, both for speed and because you are more likely to encounter a problem with rlog(). I've even added a warning now when rlog() is called on large datasets, suggesting to switch to the VST, which will avoid these issues.

rlog() is still useful and provided by the DESeq2 package because, on small to medium datasets (I'm thinking ~30 samples or fewer), it can potentially deal better than vst() with matrices that have a large range of size factors, and it is interesting in terms of methods extensions and development.
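To make that concrete, here is an illustrative sketch of such a choice; the sample-number and size-factor-ratio cutoffs below are arbitrary illustrations, not DESeq2 defaults:

# illustrative heuristic only, not an official DESeq2 rule
n_samples <- ncol(dds)
sf_ratio  <- max(sizeFactors(dds)) / min(sizeFactors(dds))

if (n_samples > 30) {
  trans <- vst(dds)        # fast and robust for many samples
} else if (sf_ratio > 4) { # arbitrary cutoff for a "large range of size factors"
  trans <- rlog(dds)       # rlog may handle a wide size factor range better
} else {
  trans <- vst(dds)
}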

