Question

Multi-group design when performing DE analysis

0

Entering edit mode

n.naharfancy • 0

@nnaharfancy-20022

Last seen 3.9 years ago

Imperial College London

Hi All,

I am comparing bulkRNA-seq data for differential expression between three groups. When I keep all the samples together and run a contrast between two groups I get less numbers of DEGs compared to when I just take two groups of samples at a time and perform the DE analysis? Between them the common genes were less than 50%. Is this difference expected? What is the best way to perform this kind of analysis? Thank you very much.

>SampleData.selected$Disease_Status
 [1] HI_MS HI_MS HI_MS LI_MS LI_MS LI_MS CTRL  HI_MS HI_MS HI_MS HI_MS LI_MS LI_MS LI_MS CTRL  CTRL  CTRL  CTRL 
Levels: CTRL LI_MS HI_MS
#############taking all three groups at a time

dds <- DESeqDataSetFromMatrix(countData = Count.selected,
                              colData = SampleData.selected, 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)

dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) > 0) >= 5
dds <- dds[idx,]

dds <- DESeq(dds)
res <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
res <- res[order(res$padj), ]

summary(res)

out of 37443 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 48, 0.13%
LFC < 0 (down)     : 12, 0.032%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

################## taking just two groups at a time
keep <-  which(SampleData$Disease_Status %in% c("CTRL", "HI_MS"))
dds <- DESeqDataSetFromMatrix(countData = Count.selected[, keep],
                              colData = SampleData.selected[keep, ], 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)

dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) > 0) >= 5
dds <- dds[idx,]

dds <- DESeq(dds)
res <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
res <- res[order(res$padj), ]

summary(res)

out of 34811 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 62, 0.18%
LFC < 0 (down)     : 25, 0.072%
outliers [1]       : 0, 0%
low counts [2]     : 9449, 27%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> intersect(rownames(res.two.groups)[1:87], rownames(res.all.groups.together)[1:60])
 [1] "IGHG1"     "IGKC"      "A2M"       "FCGBP"     "CAPS"      "LINC00551" "COL1A2"    "RASSF9"    "NES"       "OLFML2A"   "COL5A1"    "CPAMD8"   
[13] "ESAM"      "SNX31"     "CLIC6"     "COL4A6"    "C10orf105" "LEF1"      "HPSE2"     "RPS2P5"    "NPIPB2"   


> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.8.12                AnnotationHub_2.16.1        BiocFileCache_1.8.0         dbplyr_1.4.2                clusterProfiler_3.12.0     
 [6] pathfindR_1.3.0             pathview_1.24.0             org.Hs.eg.db_3.8.2          AnnotationDbi_1.46.0        knitr_1.22                 
[11] ComplexHeatmap_2.0.0        ggpubr_0.2                  magrittr_1.5                ggrepel_0.8.1               reshape2_1.4.3             
[16] dplyr_0.8.0.1               ggplot2_3.1.1               pheatmap_1.0.12             DESeq2_1.24.0               EDASeq_2.18.0              
[21] ShortRead_1.42.0            GenomicAlignments_1.20.0    SummarizedExperiment_1.14.0 DelayedArray_0.10.0         matrixStats_0.54.0         
[26] Rsamtools_2.0.0             GenomicRanges_1.36.0        GenomeInfoDb_1.20.0         Biostrings_2.52.0           XVector_0.24.0             
[31] IRanges_2.18.0              S4Vectors_0.22.0            BiocParallel_1.18.0         Biobase_2.44.0              BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
  [1] R.utils_2.8.0                 tidyselect_0.2.5              RSQLite_2.1.1                 htmlwidgets_1.3              
  [5] DESeq_1.36.0                  munsell_0.5.0                 codetools_0.2-16              withr_2.1.2                  
  [9] colorspace_1.4-1              GOSemSim_2.10.0               rstudioapi_0.10               DOSE_3.10.2                  
 [13] KEGGgraph_1.44.0              urltools_1.7.3                GenomeInfoDbData_1.2.1        mnormt_1.5-5                 
 [17] hwriter_1.3.2                 polyclip_1.10-0               MCMCpack_1.4-4                bit64_0.9-7                  
 [21] farver_1.1.0                  coda_0.19-3                   xfun_0.6                      R6_2.4.0                     
 [25] doParallel_1.0.14             clue_0.3-57                   graphlayouts_0.5.0            locfit_1.5-9.1               
 [29] bitops_1.0-6                  fgsea_1.10.1                  gridGraphics_0.4-1            assertthat_0.2.1             
 [33] promises_1.0.1                scales_1.0.0                  ggraph_2.0.0                  nnet_7.3-12                  
 [37] enrichplot_1.4.0              gtable_0.3.0                  mcmc_0.9-6                    tidygraph_1.1.2              
 [41] rlang_0.4.0                   MatrixModels_0.4-1            genefilter_1.66.0             GlobalOptions_0.1.1          
 [45] splines_3.6.1                 rtracklayer_1.44.0            lazyeval_0.2.2                acepack_1.4.1                
 [49] europepmc_0.3                 checkmate_1.9.3               BiocManager_1.30.4            yaml_2.2.0                   
 [53] GenomicFeatures_1.36.0        backports_1.1.4               httpuv_1.5.1                  qvalue_2.16.0                
 [57] Hmisc_4.2-0                   tools_3.6.1                   ggplotify_0.0.4               RColorBrewer_1.1-2           
 [61] ggridges_0.5.1                Rcpp_1.0.1                    plyr_1.8.4                    base64enc_0.1-3              
 [65] progress_1.2.1                zlibbioc_1.30.0               purrr_0.3.2                   RCurl_1.95-4.12              
 [69] prettyunits_1.0.2             rpart_4.1-15                  GetoptLong_0.1.7              viridis_0.5.1                
 [73] cowplot_0.9.4                 cluster_2.1.0                 data.table_1.12.2             DO.db_2.9                    
 [77] SparseM_1.77                  circlize_0.4.8                triebeard_0.3.0               aroma.light_3.14.0           
 [81] mime_0.6                      hms_0.4.2                     evaluate_0.13                 xtable_1.8-4                 
 [85] XML_3.98-1.19                 gridExtra_2.3                 shape_1.4.4                   compiler_3.6.1               
 [89] biomaRt_2.40.0                tibble_2.1.1                  crayon_1.3.4                  R.oo_1.22.0                  
 [93] htmltools_0.3.6               later_0.8.0                   Formula_1.2-3                 tidyr_0.8.3                  
 [97] geneplotter_1.62.0            DBI_1.0.0                     tweenr_1.0.1                  MASS_7.3-51.4                
[101] MuSiC_0.1.1                   rappdirs_0.3.1                Matrix_1.2-17                 R.methodsS3_1.7.1            
[105] igraph_1.2.4.1                pkgconfig_2.0.2               rvcheck_0.1.5                 foreign_0.8-72               
[109] xml2_1.2.0                    foreach_1.4.4                 annotate_1.62.0               stringr_1.4.0                
[113] digest_0.6.18                 graph_1.62.0                  rmarkdown_1.14                fastmatch_1.1-0              
[117] htmlTable_1.13.1              curl_3.3                      shiny_1.3.2                   quantreg_5.38                
[121] rjson_0.2.20                  nlme_3.1-141                  jsonlite_1.6                  viridisLite_0.3.0            
[125] pillar_1.4.0                  lattice_0.20-38               KEGGREST_1.24.0               httr_1.4.0                   
[129] survival_2.44-1.1             GO.db_3.8.2                   interactiveDisplayBase_1.22.0 glue_1.3.1                   
[133] UpSetR_1.4.0                  png_0.1-7                     iterators_1.0.10              bit_1.1-14                   
[137] Rgraphviz_2.28.0              ggforce_0.3.1                 stringi_1.4.3                 nnls_1.4                     
[141] blob_1.1.1                    latticeExtra_0.6-28           memoise_1.1.0

deseq2 • 468 views

ADD COMMENT • link updated 4.9 years ago by Kevin Blighe ★ 4.0k • written 4.9 years ago by n.naharfancy • 0

score 0 · Answer 1 · 2019-11-01

In your code, you do not define the following objects (although we can infer what they are):

res.two.groups
res.all.groups.together

By removing samples from your dataset, the size factors that are calculated will differ (please review the DESeq2 published manuscript). Your rowSums calculation may also differ. Based on this, you will obtain slightly different results when you perform a differential expression analysis.

If you have no other reason to exclude samples in the way that you have done, then you should keep all samples in your cohort and normalise them together.

Reasons for exclusion could be, for example, a case where the transcriptional profiles are so vastly different such that normalising them together introduces bias.

Kevin