Design formula in DESeq2
1
0
Entering edit mode
@998340ab
Last seen 2.4 years ago
United Kingdom

Hello,

I am using DESeq2 for analysis of RNAseq data. I would like to ask you about the design in the DESEq2 formula. I have tissue from animals treated with a chemical and my animal model is a colorectal cancer model. My variables are gender (male or female), treatment (treated or untreated), genotype (knockout or wild-type), tissue type (tumour or healthy). In my initial analysis I saw that gender plays a role if I compare condition across animals, so I decided to look specifically on males or females across treatment. I read that I need to include all samples in the analysis for calculation of gene dispersion unless some of the samples are highly dispersed by looking a PCA plot. My PCA plot showed that tissue type was a main factor so I decided to make two datasets one for ileum samples and one for colon. In my initial design I had

dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch+ Gender +Treatment + Genotype + Treatment:Genotype

However as in my analysis I found that gender had a role, I decided to look on males and females separately. So for example wild type untreated males vs knockout untreated males or wild type untreated males vs knockout treated males. I am wondering how I shall make the design as it gets really complicated. What I did it is that I created a new column where I used the paste command combining information from different columns as you will see in the code below where I have created the group variable. My question is: is it a right approach? Shall I make a design like the one on the above command specifying separately gender, treatment and genotype or is it fine that I created a new variable with all the groups I am going to need?

coldata<-read.csv('metadata.csv',header=TRUE,row.names=1,stringsAsFactors = FALSE)#a table with information about the samples
head(coldata)
str(coldata)
coldata$Treatment<-factor(coldata$Treatment)
coldata$Treatment %<>% relevel("water") #we need to have as reference always the control or untreated condition
coldata$Genotype<-factor(coldata$Genotype)
coldata$Genotype %<>% relevel("wt")
coldata$Gender<-factor(coldata$Gender)
coldata$Tissue<-factor(coldata$Tissue)
coldata$Tumour<-factor(coldata$Tumour)
coldata$Parental<-factor(coldata$Parental)
coldata$Batch<-factor(coldata$Batch)

all(colnames(countdata) %in% rownames(coldata)) #very important to check manually that the columns of the count matrix correspond to the rows of the sample information table.
all(rownames(coldata) == colnames(countdata))#to check they are in the same order
countdata <- countdata[, rownames(coldata)]
all(rownames(coldata) == colnames(countdata))
#subset for colon as colon had the biggest variation samples were split based on the area of the gut they were collected
coldata_colon<- coldata%>% filter(Tissue=="Colon")
countdata_colon<-countdata[,rownames(coldata_colon)]
coldata_colon$groupt<- factor(paste0(coldata_colon$Genotype,coldata_colon$Treatment,coldata_colon$Gender))

#format the data so that we can apply DEseq analysis
dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch +group)


sessionInfo( )

R version 4.1.0 (2021-05-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LC_COLLATE=English_United Kingdom.1252 [2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages: [1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base

other attached packages: [1] org.Mm.eg.db_3.13.0 AnnotationDbi_1.54.1
[3] sva_3.40.0 BiocParallel_1.26.2
[5] genefilter_1.74.0 mgcv_1.8-35
[7] nlme_3.1-153 glmpca_0.2.0
[9] PoiClaClu_1.0.2.1 ggbeeswarm_0.6.0
[11] RColorBrewer_1.1-2 vsn_3.60.0
[13] pcaExplorer_2.18.0 UpSetR_1.4.0
[15] pheatmap_1.0.12 magrittr_2.0.1
[17] DEFormats_1.20.0 reshape2_1.4.4
[19] ggpubr_0.4.0 DESeq2_1.32.0
[21] SummarizedExperiment_1.22.0 Biobase_2.52.0
[23] MatrixGenerics_1.4.3 matrixStats_0.61.0
[25] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
[27] IRanges_2.26.0 S4Vectors_0.30.0
[29] BiocGenerics_0.38.0 forcats_0.5.1
[31] stringr_1.4.0 dplyr_1.0.7
[33] purrr_0.3.4 readr_2.0.2
[35] tidyr_1.1.4 tibble_3.1.5
[37] ggplot2_3.3.5 tidyverse_1.3.1
[39] edgeR_3.34.1 limma_3.48.3

loaded via a namespace (and not attached): [1] utf8_1.2.2 shinydashboard_0.7.2 tidyselect_1.1.1
[4] heatmaply_1.3.0 RSQLite_2.2.8 htmlwidgets_1.5.4
[7] TSP_1.1-11 munsell_0.5.0 preprocessCore_1.54.0 [10] codetools_0.2-18 DT_0.19 withr_2.4.2
[13] colorspace_2.0-2 Category_2.58.0 filelock_1.0.2
[16] knitr_1.36 rstudioapi_0.13 ggsignif_0.6.3
[19] NMF_0.23.0 labeling_0.4.2 GenomeInfoDbData_1.2.6 [22] topGO_2.44.0 farver_2.1.0 bit64_4.0.5
[25] vctrs_0.3.8 generics_0.1.0 xfun_0.26
[28] BiocFileCache_2.0.0 R6_2.5.1 doParallel_1.0.16
[31] seriation_1.3.1 locfit_1.5-9.4 bitops_1.0-7
[34] cachem_1.0.6 shinyAce_0.4.1 DelayedArray_0.18.0
[37] assertthat_0.2.1 promises_1.2.0.1 scales_1.1.1
[40] beeswarm_0.4.0 gtable_0.3.0 affy_1.70.0
[43] rlang_0.4.11 splines_4.1.0 rstatix_0.7.0
[46] lazyeval_0.2.2 hexbin_1.28.2 shinyBS_0.61
[49] broom_0.7.9 checkmate_2.0.0 BiocManager_1.30.16
[52] abind_1.4-5 modelr_0.1.8 threejs_0.3.3
[55] crosstalk_1.1.1 backports_1.2.1 httpuv_1.6.3
[58] RBGL_1.68.0 tools_4.1.0 gridBase_0.4-7
[61] affyio_1.62.0 ellipsis_0.3.2 Rcpp_1.0.7
[64] plyr_1.8.6 base64enc_0.1-3 progress_1.2.2
[67] zlibbioc_1.38.0 RCurl_1.98-1.5 prettyunits_1.1.1
[70] viridis_0.6.2 haven_2.4.3 ggrepel_0.9.1
[73] cluster_2.1.2 fs_1.5.0 data.table_1.14.2
[76] openxlsx_4.2.4 SparseM_1.81 reprex_2.0.1
[79] hms_1.1.1 mime_0.12 evaluate_0.14
[82] xtable_1.8-4 XML_3.99-0.8 rio_0.5.27
[85] readxl_1.3.1 gridExtra_2.3 compiler_4.1.0
[88] biomaRt_2.48.3 crayon_1.4.1 htmltools_0.5.2
[91] GOstats_2.58.0 later_1.3.0 tzdb_0.1.2
[94] geneplotter_1.70.0 lubridate_1.8.0 DBI_1.1.1
[97] dbplyr_2.1.1 MASS_7.3-54 rappdirs_0.3.3
[100] Matrix_1.3-3 car_3.0-11 cli_3.0.1
[103] igraph_1.2.7 pkgconfig_2.0.3 registry_0.5-1
[106] foreign_0.8-81 plotly_4.10.0 xml2_1.3.2
[109] foreach_1.5.1 annotate_1.70.0 vipor_0.4.5
[112] rngtools_1.5.2 pkgmaker_0.32.2 webshot_0.5.2
[115] XVector_0.32.0 AnnotationForge_1.34.0 rvest_1.0.2
[118] digest_0.6.28 graph_1.70.0 Biostrings_2.60.2
[121] rmarkdown_2.11 cellranger_1.1.0 dendextend_1.15.1
[124] GSEABase_1.54.0 curl_4.3.2 shiny_1.7.1
[127] lifecycle_1.0.1 jsonlite_1.7.2 carData_3.0-4
[130] viridisLite_0.4.0 fansi_0.5.0 pillar_1.6.4
[133] lattice_0.20-44 KEGGREST_1.32.0 fastmap_1.1.0
[136] httr_1.4.2 survival_3.2-11 GO.db_3.13.0
[139] glue_1.4.2 zip_2.2.0 png_0.1-7
[142] iterators_1.0.13 Rgraphviz_2.36.0 bit_4.0.4
[145] stringi_1.7.5 blob_1.2.2 memoise_2.0.0

groups dispersion DESeq2 desing • 1.5k views
ADD COMMENT
1
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 1 day ago
San Diego

if sex is, like batch, just a 'nuisance' actor that you aren't really going to interrogate, but just want to include as a co-factor to be accounted for, I think piling it into the design like that is what you should do. (Assuming that you have a large enough number of samples to make it feasible to account for all your variables.)

You can make one dds object, and subset that like this:

dds_colon <- dds[ ,dds$Tissue == "colon"]

That's probably simpler than what you are doing. However, you might have to do a droplevels command to remove colData elements not present in your subset.

ADD COMMENT
0
Entering edit mode

Thank you! Much appreciated all the help.

ADD REPLY

Login before adding your answer.

Traffic: 761 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6