Question

rlog transformation for downstream analyses in DESeq2: Importance of experiment design

0

Entering edit mode

masumistadler • 0

@masumistadler-23285

Last seen 4.8 years ago

Hello,

I am trying to transform my 16S rRNA count table with rlog for downstream analyses. At the moment I am not particularly interested in continuing with analysing the log-fold change.

I have a study design that covers several years, seasons and different sample types (soil, lake, stream, rivers etc). And we sequenced the samples for 16S rRNA DNA and RNA. Together, I have over 250 samples and around 30000 ASVs (= same as genes in case of RNA-Seq).

As DNA and RNA extraction methods and the meaning of read counts vary, I don't think it would be appropriate to apply the transformation to the two nucleic acid types together. So I've split the data set into two by DNA and RNA and applied the experimental design to both of them:

~ sample.type + year + season

While I don't have any error message with the DNA subset when creating a DESeqDataSet, the RNA subset gives the following error:

Error in DESeqDataSet(se, design = design, ignoreRank) :  
design contains one or more variables with all samples having the same value,
remove these variables from the design

I removed all ASVs without any count in the subset datasets, however, I suppose the same values (probably 0s) occur when the experimental design factors are combined.

I've read that one can use ~ 1 as an experimental design in this post, however, I've seen in the DESeq2 vignette that the experimental design is important for the transformation and that's why blind = FALSE should be set for downstream analyses.

I'm not sure how to proceed right now. It is possible that many ASV have very few RNA counts, but that's the nature of RNA.

Additionally, I have many 0s. Not only for some RNA but also a few within the DNA subset. So I added type = iterate in estimateSizeFactor() after reading this post.

I just wanted to make sure that my workflow for transforming the counts is correct as I am struggling to find a good workflow for cases when one is only interested in transforming the data and not applying the log-fold change.

# make DESeqDataSet
dds.dna <- phyloseq_to_deseq2(dna, ~ sample.type + year + season)
dds.rna <- phyloseq_to_deseq2(rna, ~  sample.type + year + season)

# estimate size factors, set type = "iterate" for 0s
dds.dna <- estimateSizeFactors(dds.dna, type = "iterate")
dds.rna <- estimateSizeFactors(dds.rna, type = "iterate")

# estimate dispersions
dds.dna <- estimateDispersions(dds.dna, fitType = "mean")
dds.rna <- estimateDispersions(dds.rna, fitType = "mean")

# apply transformation
r.dna <- rlog(dds.dna, blind=FALSE)
r.rna <- rlog(dds.rna, blind=FALSE)

# or probably better (rlog may take too long)
v.dna <- vst(dds.dna, blind=FALSE)
v.rna <- vst(dds.rna, blind=FALSE)

So far my computer is computing the estimateSizeFactors. It's been several hours, maybe I should figure out a way to parallelise things.

Thank you in advance. Any comments/advice are/is highly appreciated.

Masumi

> sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] plyr_1.8.6                  data.table_1.12.8          
 [3] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [5] DelayedArray_0.12.2         BiocParallel_1.20.1        
 [7] matrixStats_0.56.0          Biobase_2.46.0             
 [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
[11] IRanges_2.20.2              S4Vectors_0.24.3           
[13] BiocGenerics_0.32.0         forcats_0.5.0              
[15] stringr_1.4.0               dplyr_0.8.5                
[17] purrr_0.3.3                 readr_1.3.1                
[19] tidyr_1.0.2                 tibble_3.0.0               
[21] ggplot2_3.3.0               tidyverse_1.3.0            
[23] phyloseq_1.30.0            

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1       ellipsis_0.3.0         htmlTable_1.13.3      
  [4] XVector_0.26.0         base64enc_0.1-3        fs_1.4.1              
  [7] rstudioapi_0.11        bit64_0.9-7            AnnotationDbi_1.48.0  
 [10] fansi_0.4.1            lubridate_1.7.8        xml2_1.3.1            
 [13] codetools_0.2-16       splines_3.6.3          geneplotter_1.64.0    
 [16] knitr_1.28             ade4_1.7-15            Formula_1.2-3         
 [19] jsonlite_1.6.1         broom_0.5.5            annotate_1.64.0       
 [22] cluster_2.1.0          dbplyr_1.4.2           png_0.1-7             
 [25] BiocManager_1.30.10    compiler_3.6.3         httr_1.4.1            
 [28] backports_1.1.6        assertthat_0.2.1       Matrix_1.2-18         
 [31] cli_2.0.2              acepack_1.4.1          htmltools_0.4.0       
 [34] tools_3.6.3            igraph_1.2.5           gtable_0.3.0          
 [37] glue_1.4.0             GenomeInfoDbData_1.2.2 reshape2_1.4.4        
 [40] Rcpp_1.0.4.6           cellranger_1.1.0       vctrs_0.2.4           
 [43] Biostrings_2.54.0      multtest_2.42.0        ape_5.3               
 [46] nlme_3.1-144           iterators_1.0.12       xfun_0.12             
 [49] rvest_0.3.5            lifecycle_0.2.0        XML_3.99-0.3          
 [52] zlibbioc_1.32.0        MASS_7.3-51.5          scales_1.1.0          
 [55] hms_0.5.3              biomformat_1.14.0      rhdf5_2.30.1          
 [58] RColorBrewer_1.1-2     yaml_2.2.1             memoise_1.1.0         
 [61] gridExtra_2.3          rpart_4.1-15           RSQLite_2.2.0         
 [64] latticeExtra_0.6-29    stringi_1.4.6          genefilter_1.68.0     
 [67] foreach_1.5.0          checkmate_2.0.0        permute_0.9-5         
 [70] rlang_0.4.5            pkgconfig_2.0.3        bitops_1.0-6          
 [73] lattice_0.20-40        Rhdf5lib_1.8.0         htmlwidgets_1.5.1     
 [76] bit_1.1-15.2           tidyselect_1.0.0       magrittr_1.5          
 [79] R6_2.4.1               generics_0.0.2         Hmisc_4.4-0           
 [82] DBI_1.1.0              pillar_1.4.3           haven_2.2.0           
 [85] foreign_0.8-76         withr_2.1.2            mgcv_1.8-31           
 [88] survival_3.1-11        RCurl_1.98-1.1         nnet_7.3-13           
 [91] modelr_0.1.6           crayon_1.3.4           jpeg_0.1-8.1          
 [94] locfit_1.5-9.4         grid_3.6.3             readxl_1.3.1          
 [97] blob_1.2.1             vegan_2.5-6            reprex_0.3.0          
[100] digest_0.6.25          xtable_1.8-4           munsell_0.5.0

deseq2 • 398 views

ADD COMMENT • link updated 4.8 years ago by Michael Love 43k • written 4.8 years ago by masumistadler • 0

score 2 · Accepted Answer · 2020-04-10

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

You're misunderstanding the error message. It is saying that in the design e.g. in ~sample.type + year + season, one or more of these variables, that is, sample.type, year, or season has the same value across all samples. So it is collinear with the intercept. Be careful about what variables you put in the design. Take a look at the sample information table (e.g. colData) and see which is the problematic variable.

I don't recommend rlog for large datasets, you should use vst instead.

Also I don't particularly recommend DESeq2 for microbiome / metagenomic datasets. It's not designed for these datasets and may not perform as well as dedicated packages.

ADD COMMENT • link 4.8 years ago Michael Love 43k

0

Entering edit mode

Thank you, Michael, for the fast answer.

Oh, I see. I will look into the sample information and see what the problem is. And thank you for the advice, too. I had looked into metagenomeSeqbut I couldn't find anything similar to your rlogor vst transformations. Now that you mentioned it, I looked a bit further and found a good discussion on their transformation approach (post on github, for other people's reference). I'll definitely look into this, thank you!

ADD REPLY • link 4.8 years ago masumistadler • 0