Hello,
I am trying to transform my 16S rRNA count table with rlog for downstream analyses. At the moment I am not particularly interested in continuing with analysing the log-fold change.
I have a study design that covers several years, seasons and different sample types (soil, lake, stream, rivers etc). And we sequenced the samples for 16S rRNA DNA and RNA. Together, I have over 250 samples and around 30000 ASVs (= same as genes in case of RNA-Seq).
As DNA and RNA extraction methods and the meaning of read counts vary, I don't think it would be appropriate to apply the transformation to the two nucleic acid types together. So I've split the data set into two by DNA and RNA and applied the experimental design to both of them:
~ sample.type + year + season
While I don't have any error message with the DNA subset when creating a DESeqDataSet
, the RNA subset gives the following error:
Error in DESeqDataSet(se, design = design, ignoreRank) :
design contains one or more variables with all samples having the same value,
remove these variables from the design
I removed all ASVs without any count in the subset datasets, however, I suppose the same values (probably 0s) occur when the experimental design factors are combined.
I've read that one can use ~ 1
as an experimental design in this post, however, I've seen in the DESeq2 vignette that the experimental design is important for the transformation and that's why blind = FALSE
should be set for downstream analyses.
I'm not sure how to proceed right now. It is possible that many ASV have very few RNA counts, but that's the nature of RNA.
Additionally, I have many 0s. Not only for some RNA but also a few within the DNA subset. So I added type = iterate
in estimateSizeFactor()
after reading this post.
I just wanted to make sure that my workflow for transforming the counts is correct as I am struggling to find a good workflow for cases when one is only interested in transforming the data and not applying the log-fold change.
# make DESeqDataSet
dds.dna <- phyloseq_to_deseq2(dna, ~ sample.type + year + season)
dds.rna <- phyloseq_to_deseq2(rna, ~ sample.type + year + season)
# estimate size factors, set type = "iterate" for 0s
dds.dna <- estimateSizeFactors(dds.dna, type = "iterate")
dds.rna <- estimateSizeFactors(dds.rna, type = "iterate")
# estimate dispersions
dds.dna <- estimateDispersions(dds.dna, fitType = "mean")
dds.rna <- estimateDispersions(dds.rna, fitType = "mean")
# apply transformation
r.dna <- rlog(dds.dna, blind=FALSE)
r.rna <- rlog(dds.rna, blind=FALSE)
# or probably better (rlog may take too long)
v.dna <- vst(dds.dna, blind=FALSE)
v.rna <- vst(dds.rna, blind=FALSE)
So far my computer is computing the estimateSizeFactors
. It's been several hours, maybe I should figure out a way to parallelise things.
Thank you in advance. Any comments/advice are/is highly appreciated.
Masumi
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] plyr_1.8.6 data.table_1.12.8
[3] DESeq2_1.26.0 SummarizedExperiment_1.16.1
[5] DelayedArray_0.12.2 BiocParallel_1.20.1
[7] matrixStats_0.56.0 Biobase_2.46.0
[9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[11] IRanges_2.20.2 S4Vectors_0.24.3
[13] BiocGenerics_0.32.0 forcats_0.5.0
[15] stringr_1.4.0 dplyr_0.8.5
[17] purrr_0.3.3 readr_1.3.1
[19] tidyr_1.0.2 tibble_3.0.0
[21] ggplot2_3.3.0 tidyverse_1.3.0
[23] phyloseq_1.30.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 ellipsis_0.3.0 htmlTable_1.13.3
[4] XVector_0.26.0 base64enc_0.1-3 fs_1.4.1
[7] rstudioapi_0.11 bit64_0.9-7 AnnotationDbi_1.48.0
[10] fansi_0.4.1 lubridate_1.7.8 xml2_1.3.1
[13] codetools_0.2-16 splines_3.6.3 geneplotter_1.64.0
[16] knitr_1.28 ade4_1.7-15 Formula_1.2-3
[19] jsonlite_1.6.1 broom_0.5.5 annotate_1.64.0
[22] cluster_2.1.0 dbplyr_1.4.2 png_0.1-7
[25] BiocManager_1.30.10 compiler_3.6.3 httr_1.4.1
[28] backports_1.1.6 assertthat_0.2.1 Matrix_1.2-18
[31] cli_2.0.2 acepack_1.4.1 htmltools_0.4.0
[34] tools_3.6.3 igraph_1.2.5 gtable_0.3.0
[37] glue_1.4.0 GenomeInfoDbData_1.2.2 reshape2_1.4.4
[40] Rcpp_1.0.4.6 cellranger_1.1.0 vctrs_0.2.4
[43] Biostrings_2.54.0 multtest_2.42.0 ape_5.3
[46] nlme_3.1-144 iterators_1.0.12 xfun_0.12
[49] rvest_0.3.5 lifecycle_0.2.0 XML_3.99-0.3
[52] zlibbioc_1.32.0 MASS_7.3-51.5 scales_1.1.0
[55] hms_0.5.3 biomformat_1.14.0 rhdf5_2.30.1
[58] RColorBrewer_1.1-2 yaml_2.2.1 memoise_1.1.0
[61] gridExtra_2.3 rpart_4.1-15 RSQLite_2.2.0
[64] latticeExtra_0.6-29 stringi_1.4.6 genefilter_1.68.0
[67] foreach_1.5.0 checkmate_2.0.0 permute_0.9-5
[70] rlang_0.4.5 pkgconfig_2.0.3 bitops_1.0-6
[73] lattice_0.20-40 Rhdf5lib_1.8.0 htmlwidgets_1.5.1
[76] bit_1.1-15.2 tidyselect_1.0.0 magrittr_1.5
[79] R6_2.4.1 generics_0.0.2 Hmisc_4.4-0
[82] DBI_1.1.0 pillar_1.4.3 haven_2.2.0
[85] foreign_0.8-76 withr_2.1.2 mgcv_1.8-31
[88] survival_3.1-11 RCurl_1.98-1.1 nnet_7.3-13
[91] modelr_0.1.6 crayon_1.3.4 jpeg_0.1-8.1
[94] locfit_1.5-9.4 grid_3.6.3 readxl_1.3.1
[97] blob_1.2.1 vegan_2.5-6 reprex_0.3.0
[100] digest_0.6.25 xtable_1.8-4 munsell_0.5.0
Thank you, Michael, for the fast answer.
Oh, I see. I will look into the sample information and see what the problem is. And thank you for the advice, too. I had looked into
metagenomeSeq
but I couldn't find anything similar to yourrlog
orvst
transformations. Now that you mentioned it, I looked a bit further and found a good discussion on their transformation approach (post on github, for other people's reference). I'll definitely look into this, thank you!