Hello, I am a graduate student who would like to know if I am designing my differential expression experiment correctly using DESeq2. I performed RNA-seq on tissues originating from 14 total animals at 4 time points of exposure to a drug. 3 animals were exposed for 0 hours (no exposure; control), 3 animals were exposed for 0.5 hours, 4 animals were exposed for 1 hour, and 4 animals were exposed for 2 hours. From each animal, I extracted multiple tissue samples - from the controls I got 8 tissue samples from each animal, for the drug-exposed animals I was only (due to experimental difficulties) able to get some of the same tissues (the best was 7 of 8 tissues samples from animal #6, the worst was 3 of 8 from animals #8 and #9). Additionally, during extraction, some of the tissues were slightly damaged; due to this, they were excluded from drug treatment (e.g. Ac5_LPeG has 0hr in the treatment column below, while the remaining tissues from Ac5 were subjected to 2hr of drug treatment). My sample information column data is as follows:
colData
Animal Tissue Treatment
Ac1c_AG Ac1c AG 0hr
Ac1c_CG Ac1c CG 0hr
Ac1c_LPeG Ac1c LPeG 0hr
Ac1c_RPeG Ac1c RPeG 0hr
Ac1c_LPlG Ac1c LPlG 0hr
Ac1c_RPlG Ac1c RPlG 0hr
Ac1c_LBG Ac1c LBG 0hr
Ac1c_RBG Ac1c RBG 0hr
Ac2c_AG Ac2c AG 0hr
Ac2c_CG Ac2c CG 0hr
Ac2c_LPeG Ac2c LPeG 0hr
Ac2c_RPeG Ac2c RPeG 0hr
Ac2c_LPlG Ac2c LPlG 0hr
Ac2c_RPlG Ac2c RPlG 0hr
Ac2c_LBG Ac2c LBG 0hr
Ac2c_RBG Ac2c RBG 0hr
Ac3c_AG Ac3c AG 0hr
Ac3c_CG Ac3c CG 0hr
Ac3c_LPeG Ac3c LPeG 0hr
Ac3c_RPeG Ac3c RPeG 0hr
Ac3c_LPlG Ac3c LPlG 0hr
Ac3c_RPlG Ac3c RPlG 0hr
Ac3c_LBG Ac3c LBG 0hr
Ac3c_RBG Ac3c RBG 0hr
Ac1_AG Ac1 AG 0.5hr
Ac1_CG Ac1 CG 0.5hr
Ac1_LPeG Ac1 LPeG 0.5hr
Ac1_LPlG Ac1 LPlG 0.5hr
Ac1_BG Ac1 BG 0.5hr
Ac2_AG Ac2 AG 0.5hr
Ac2_CG Ac2 CG 0.5hr
Ac2_LPeG Ac2 LPeG 0hr
Ac2_RPeG Ac2 RPeG 0.5hr
Ac2_RPlG Ac2 RPlG 0.5hr
Ac2_BG Ac2 BG 0.5hr
Ac3_AG Ac3 AG 0.5hr
Ac3_CG Ac3 CG 0.5hr
Ac3_LPeG Ac3 LPeG 0.5hr
Ac3_LPlG Ac3 LPlG 0.5hr
Ac3_BG Ac3 BG 0.5hr
Ac4_AG Ac4 AG 2hr
Ac4_CG Ac4 CG 2hr
Ac4_LPeG Ac4 LPeG 2hr
Ac4_BG Ac4 BG 2hr
Ac5_AG Ac5 AG 2hr
Ac5_CG Ac5 CG 2hr
Ac5_LPeG Ac5 LPeG 0hr
Ac5_RPeG Ac5 RPeG 2hr
Ac5_RPlG Ac5 RPlG 2hr
Ac5_BG Ac5 BG 2hr
Ac6_AG Ac6 AG 2hr
Ac6_CG Ac6 CG 2hr
Ac6_LPeG Ac6 LPeG 2hr
Ac6_RPeG Ac6 RPeG 2hr
Ac6_LPlG Ac6 LPlG 2hr
Ac6_RPlG Ac6 RPlG 2hr
Ac6_BG Ac6 BG 2hr
Ac7_AG Ac7 AG 1hr
Ac7_CG Ac7 CG 1hr
Ac7_LPeG Ac7 LPeG 0hr
Ac7_RPeG Ac7 RPeG 1hr
Ac7_RPlG Ac7 RPlG 1hr
Ac7_BG Ac7 BG 1hr
Ac8_AG Ac8 AG 1hr
Ac8_CG Ac8 CG 1hr
Ac8_LPeG Ac8 LPeG 1hr
Ac9_LPeG Ac9 LPeG 1hr
Ac9_RPlG Ac9 RPlG 1hr
Ac9_BG Ac9 BG 1hr
Ac10_AG Ac10 AG 2hr
Ac10_CG Ac10 CG 2hr
Ac10_LPeG Ac10 LPeG 2hr
Ac10_LPlG Ac10 LPlG 2hr
Ac10_BG Ac10 BG 2hr
Ac11_AG Ac11 AG 1hr
Ac11_CG Ac11 CG 1hr
Ac11_RPeG Ac11 RPeG 1hr
Ac11_RPlG Ac11 RPlG 1hr
Ac11_BG Ac11 BG 1hr
I want to make several multifactor comparisons.
- First, I would like to compare a time course of the drug effects (treatment) on each tissue type (ideally for each tissue, e.g AG 0hr vs AG 2hr - however I realize some tissues/timepoints don't have at least 3 replicates and I am prepared to disregard those). Examining the column data, one can see the 3 control animals have tissues LBG and RBG, while the drug-exposed animals only have BG (due to experimental difficulties). The LBG and RBG are half-portions of the whole BG tissue in the animals. To compare these tissues, I have averaged the counts for each row (gene) from the LBG and RBG to get a "pseudo-BG" count for each control animal - is this correct/reasonable to do, and if not, is there an accepted way to deal with this issue?
- I would also like to compare differences between the tissues (e.g. 0hr AG vs 0hr CG). I am unsure how to design the formula to control for the treatment variation (and inter-animal variation?) in order to increase the sensitivity for finding differences due to the tissue type.
- Finally, I would like to compare differences between individual animals using all the tissues extracted from that animal (particularly those in the same treatment group - eg. Ac1c vs Ac2c or Ac4 vs Ac6 - here I wonder if I should exclude "damaged" samples like Ac5_LPeG from the overall Ac5 "profile" ?)
So far, I have combined the tissue type and treatment time into one factor (group) as follows:
dds$group <- factor(paste0(dds$Tissue, dds$Treatment))
and used this as my experimental design:
design(dds) <- ~ group
to give the following colData:
colData(dds) DataFrame with 76 rows and 5 columns Animal Tissue Treatment group sizeFactor <factor> <factor> <factor> <factor> <numeric> Ac1c_AG Ac1c AG 0hr AG0hr 1.004482 Ac1c_CG Ac1c CG 0hr CG0hr 1.629999 Ac1c_LPeG Ac1c LPeG 0hr LPeG0hr 1.412265 Ac1c_RPeG Ac1c RPeG 0hr RPeG0hr 2.084133 Ac1c_LPlG Ac1c LPlG 0hr LPlG0hr 2.176424 ... ... ... ... ... ... Ac11_AG Ac11 AG 1hr AG1hr 1.5952341 Ac11_CG Ac11 CG 1hr CG1hr 0.4082796 Ac11_RPeG Ac11 RPeG 1hr RPeG1hr 2.3312163 Ac11_RPlG Ac11 RPlG 1hr RPlG1hr 0.6341157 Ac11_BG Ac11 BG 1hr BG1hr 1.6152299
I have been using this design to make contrasts such as:
AGres2v0hr <- results(dds, contrast=c("group","AG2hr","AG0hr"))
to answer my first comparison question about the time course of the treatment on each tissue, as well as:
AGvCGres0hr <- results(dds, contrast=c("group","AG0hr","CG0hr"))
to compare differences between the tissues.
However, I am not confident this design is always best for controlling variation and making the 3 types of comparisons I listed previously. Should I be using different designs to make the various comparisons I want? I am concerned about not controlling for inter-animal variation because when I perform an exploratory PCA after using the variance stabilizing transformation:
var <- varianceStabilizingTransformation(dds, blind = F) pca <- plotPCA(var, intgroup = c("Treatment", "Tissue")
I get a similar overall clustering pattern with each tissue - i.e. Ac1c and Ac3c are always closely together to the right on the x-axis (PC1) while Ac4 is always to the far left on the x-axis for each tissue group. My collaborators are concerned this demonstrates that inter-animal variation has a greater effect on clustering than the effects of my drug treatment. My last question is: can we discern that is true from this PCA plot and is it possibly due to my experimental design?
My apologies for the extremely long post. I have highlighted key questions from the post to assist in locating them. If there is any additional information or clarification needed, please just let me know and I can provide it. I greatly appreciate any insight or help on these topics. Thank you very much for taking the time to read this. Finally, below is my R session information:
Best,
Caleb
sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] ggplot2_2.2.1 DESeq2_1.16.1 SummarizedExperiment_1.6.3 DelayedArray_0.2.7 matrixStats_0.52.2 [6] Biobase_2.36.2 GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2 S4Vectors_0.14.3 [11] BiocGenerics_0.22.0 edgeR_3.18.1 limma_3.32.5 loaded via a namespace (and not attached): [1] genefilter_1.58.1 locfit_1.5-9.1 splines_3.4.1 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 [7] base64enc_0.1-3 blob_1.1.0 survival_2.41-3 XML_3.98-1.9 rlang_0.1.1 DBI_0.7 [13] foreign_0.8-69 BiocParallel_1.10.1 bit64_0.9-7 RColorBrewer_1.1-2 GenomeInfoDbData_0.99.0 plyr_1.8.4 [19] stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3 gtable_0.2.0 htmlwidgets_0.9 memoise_1.1.0 [25] labeling_0.3 latticeExtra_0.6-28 knitr_1.16 geneplotter_1.54.0 AnnotationDbi_1.38.2 htmlTable_1.9 [31] Rcpp_0.12.12 acepack_1.4.1 xtable_1.8-2 scales_0.4.1 backports_1.1.0 checkmate_1.8.3 [37] Hmisc_4.0-3 annotate_1.54.0 XVector_0.16.0 bit_1.1-12 gridExtra_2.2.1 digest_0.6.12 [43] stringi_1.1.5 grid_3.4.1 tools_3.4.1 bitops_1.0-6 magrittr_1.5 RSQLite_2.0 [49] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3 Formula_1.2-2 cluster_2.0.6 Matrix_1.2-10 [55] data.table_1.10.4 rpart_4.1-11 nnet_7.3-12 compiler_3.4.1