Hi,
I’m having some difficulty with what would seem to be a pretty basic set-up. I have a panel of, say, 20 cell lines, c1, c2, …,c20, with two replicates for each cell line, so 40 samples total. The cell lines can be divided into groups based on different phenotypic properties and I’d like to do comparisons between groups. As an example, based on one phenotype the cell lines segregate into 4 groups, with c1-c5 in g1, c6-10 in g2, c11-15 in g3, c16-20 in g4. And I’d like to compare, say, g1 with g2, which would be 10 samples (2 replicates each of 5 cell lines) in each group; then compare g1 with g3, etc.
What is the best way to model this in DESeq2?
The approach I’ve been taking is to use all samples in the model and capture the replicate relationships using cell line labels,
cold <- data.frame(cells = c(“c1”, “c1”, “c2”, “c2”, …”c20”, “c20”)) dds <- DESeqDataSetFromMatrix(counts, cold, formula(~ cells)) DEOut <- DESeq(dds)
...and then use contrasts to group the cell lines for a given comparison. Since resultsNames(DEOut)
gives elements like
“Intercept” “cells_c2_vs_c1” “cells_c3_vs_c1” …
to compare g2 with g3, I’d do
results(DEOut, contrast = list(c(“cells_c6_vs_c1”, “cells_c7_vs_c1”,...“cells_c10_vs_c1”), c(“cells_c11_vs_c1”, “cells_c12_vs_c1”,...“cells_c15_vs_c1”)))
Is there something not recommended here? The reasons I’m suspicious:
- A particular comparison like the above turns up 7000+ regulated genes at an adjusted p-value threshold of 0.01, with extreme p-values and fold changes: returned adjusted p-values actually reach 0, and absolute log2FoldChanges go as high as 60. If I instead model by using group labels (g1, g2, g3, g4) for the samples (and therefore ignore the cell line replicate relationships), then comparing g2 with g3 results in fewer than 500 regulated genes, and absolute log2FoldChanges going as high as 25.
- Spot-checking the regulated genes seems to indicate a dependence on the expression of the reference cell line c1 that I don’t want. That is, the genes that are up-regulated (log2FoldChange > 0) not only roughly have g2 average expression higher than g3 average expression (which is what I want), but even more apparent is that the average expression of the g2 cell lines is higher than the individual cell line expression for c1. Likewise, the trend for down-regulated genes (log2FoldChange < 0) is that the average expression of the g2 cell lines is lower than for c1.
Should I perhaps be using a model without an Intercept term?
Thanks in advance for the help!
Bob
sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DESeq2_1.16.1 SummarizedExperiment_1.6.3 [3] DelayedArray_0.2.7 matrixStats_0.52.2 [5] Biobase_2.36.2 GenomicRanges_1.28.4 [7] GenomeInfoDb_1.12.2 IRanges_2.10.2 [9] S4Vectors_0.14.3 BiocGenerics_0.22.0
I'd second the recommendations here from Gavin.
To compare directly across groups and control for nested structure isn't possible with fixed effects models, and hence I often point users to limma's duplicateCorrelation() function which is designed for this.
If you were making within-cell-line comparisons (which it sounds like you're not), then it would be possible to account for the cell line differences and compare those within-cell-line effects across groups (and this is the section of the vignette Gavin mentions).
Very helpful, Gavin and Michael; thank you!