Hello All,
I have 90 samples in total for my dataset with two genotypes, A (wild-type) and B (single gene mutant). Each genotype is broken down into 5 treatments groups, which represent the time since the treatment was applied. Each treatment has 3 biological replicates and this experiment setup was repeated 3 times. So there are 3 biological replicates from each experiment (E1,E2,E3), giving a total of 9 biological replicates per treatment. So for the genotype A, the breakdown is:
Treatment | Biological Replicates |
---|---|
0hr/Control | 9 samples (3 from E1, 3 from E2, 3 from E3) |
12hrs | 9 samples (3 from E1, 3 from E2, 3 from E3) |
24hrs | 9 samples (3 from E1, 3 from E2, 3 from E3) |
36hrs | 9 samples (3 from E1, 3 from E2, 3 from E3) |
48hrs | 9 samples (3 from E1, 3 from E2, 3 from E3) |
(The same setup exists for genotype B)
My initial thought was to run the analyses separately for each genotype, but I saw in other posts that it would probably be better to combine the genotypes into the same dataset. I did so using the following code:
The design is meant to test for the effect of treatment time (12hrs, 24hrs, 36hrs, 48hrs) while controlling for the effect of the different experiments (E1, E2, E3) and genotype.
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design =~ experiment + genotype + treatment)
colData(ddsHTSeq)$treatment <- factor(colData(ddsHTSeq)$treatment, levels = c("Control", "12hrs", "24hrs", "36hrs", "48hrs"))
colData(ddsHTSeq)$experiment <- factor(colData(ddsHTSeq)$experiment, levels = c("E1", "E2", "E3"))
colData(ddsHTSeq)$genotype <- factor(colData(ddsHTSeq)$genotype, levels = c("A", "B))
dds <- DESeq(ddsHTSeq)
I am looking to find the genes that change across the treatment relative to the 0h/Control. I want to find genes that have a large (+/-) log fold-change at either 12hrs or 24hrs and then come back to 0 in 36hrs and 48hrs. My thought was to perform a pair-wise comparison between the treatments and the Control. I would then take the significant genes for 12hrs and 24hrs and plot the log fold-changes for the comparisons. I also used the lfcShrink
function with the apeglm setting. (See attached plot) To find the genes of interest, I would then perform some clustering.
I think my approach is appropriate, but I wanted some confirmation of this.
Also, because I am including genotype in the design formula, the log fold-changes for the genes in the plot are the average between the two genotypes or only the log fold-changes for the wild-type genotype?
Thanks in advance!
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tcltk parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Mfuzz_2.44.0 DynDoc_1.62.0 widgetTools_1.62.0 e1071_1.7-2 pheatmap_1.0.12
[6] scales_1.0.0 tidyr_0.8.3 RColorBrewer_1.1-2 gplots_3.0.1.1 ggplot2_3.2.1
[11] dplyr_0.8.3 tibble_2.1.3 readr_1.3.1 DESeq2_1.24.0 SummarizedExperiment_1.14.1
[16] DelayedArray_0.10.0 BiocParallel_1.18.1 matrixStats_0.55.0 Biobase_2.44.0 GenomicRanges_1.36.1
[21] GenomeInfoDb_1.20.0 IRanges_2.18.2 S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 tools_3.6.1 backports_1.1.4 R6_2.4.0 rpart_4.1-15
[7] KernSmooth_2.23-15 Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12
[13] withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3 bit_1.1-14 compiler_3.6.1 htmlTable_1.13.1
[19] labeling_0.3 caTools_1.17.1.2 checkmate_1.9.4 genefilter_1.66.0 stringr_1.4.0 digest_0.6.20
[25] foreign_0.8-72 XVector_0.24.0 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 htmlwidgets_1.3
[31] rlang_0.4.0 rstudioapi_0.10 RSQLite_2.1.2 gtools_3.8.1 acepack_1.4.1 RCurl_1.95-4.12
[37] magrittr_1.5 GenomeInfoDbData_1.2.1 Formula_1.2-3 Matrix_1.2-17 Rcpp_1.0.2 munsell_0.5.0
[43] stringi_1.4.3 zlibbioc_1.30.0 grid_3.6.1 blob_1.2.0 gdata_2.18.0 crayon_1.3.4
[49] lattice_0.20-38 splines_3.6.1 annotate_1.62.0 hms_0.5.1 locfit_1.5-9.1 zeallot_0.1.0
[55] knitr_1.24 pillar_1.4.2 tkWidgets_1.62.0 geneplotter_1.62.0 XML_3.98-1.20 glue_1.3.1
[61] latticeExtra_0.6-28 data.table_1.12.2 vctrs_0.2.0 gtable_0.3.0 purrr_0.3.2 assertthat_0.2.1
[67] xfun_0.9 xtable_1.8-4 class_7.3-15 survival_2.44-1.1 AnnotationDbi_1.46.1 memoise_1.1.0
[73] cluster_2.1.0
Great, thank you for your response Michael.
For the second part, regarding the log fold-changes, if I wanted only the log fold-changes for the wild-type genotype I would add an interaction term to the design to make it:
Then the wild-type log fold-changes would be represented by the main condition effect, right?
When I tried this though, I got much fewer significant genes for the wild-type and no significant genes for the mutant genotype compared to the design without the interaction, as shown below.
Design =~ experiment + genotype + condition
Design: =~ experiment + genotype * condition
These may be novice questions but could you please provide some insight into:
1) One reason you can see less DEG thresholding on adjusted p-value is that you've added parameters to estimate while keeping your sample size the same.
2) The initial one is looking at the condition effect across both genotypes (assuming that it is identical) and controlling for any shift that is due to genotype.