Hello! I want to make a differential expression analysis of some RNAseq data using DESeq2.
Essentially, I have 3 types of tissue: normal, primary tumor and metastasis. These samples were sequenced in 4 groups and I found an evident batch effect (clusters by group processed, indicated in the variable "lib_prep") and also clusters by patient.
This is my data: (sample id, patient, library preparation and sample type)
Patient | ID | sample_type | lib_prep | nested_patient |
1 | P1N1 | normal | A | A |
1 | P1P1 | primary | A | A |
1 | P1M1 | metastasis | A | A |
1 | P1M2 | metastasis | A | A |
1 | P1M3 | metastasis | A | A |
1 | P1M4 | metastasis | A | A |
2 | P2N1 | normal | A | B |
2 | P2P1 | primary | A | B |
2 | P2M1 | metastasis | A | B |
2 | P2M2 | metastasis | A | B |
3 | P3N1 | normal | A | C |
3 | P3P1 | primary | A | C |
3 | P3M1 | metastasis | A | C |
3 | P3M2 | metastasis | A | C |
4 | P4N1 | normal | B | A |
4 | P4P1 | primary | B | A |
4 | P4M1 | metastasis | B | A |
4 | P4M2 | metastasis | B | A |
4 | P4M3 | metastasis | B | A |
5 | P5N1 | normal | B | B |
5 | P5P1 | primary | B | B |
6 | P6N1 | normal | C | A |
6 | P6P1 | primary | C | A |
6 | P6M1 | metastasis | C | A |
6 | P6M2 | metastasis | C | A |
7 | P7N1 | normal | C | B |
7 | P7P1 | primary | C | B |
7 | P7M1 | metastasis | C | B |
8 | P8N1 | normal | C | C |
8 | P8N2 | normal | C | C |
8 | P8P1 | primary | C | C |
8 | P8P2 | primary | C | C |
8 | P8M1 | metastasis | C | C |
9 | P9N1 | normal | C | D |
9 | P9P1 | primary | C | D |
9 | P9M1 | metastasis | C | D |
9 | P9M2 | metastasis | C | D |
10 | P10N1 | normal | D | A |
10 | P10M1 | metastasis | D | A |
11 | P11N1 | normal | D | B |
11 | P11M1 | metastasis | D | B |
11 | P11M2 | metastasis | D | B |
12 | P12N1 | normal | D | C |
12 | P12M1 | metastasis | D | C |
12 | P12M2 | metastasis | D | C |
I'm interested in the genes differentially expressed between the metastasis and primary tumor samples, using as base level the normal samples. As patient is collinear with lib_prep, I created a new variable "nested_patient".
So, I made the DESeq object:
ds.deseq <- DESeqDataSetFromMatrix( countData = countstable.num, colData = sampledata, design = ~ sample_type) colData(ds.deseq)$sample_type <- relevel(colData(ds.deseq)$sample_type, "normal")
I also provided the corresponding model matrix:
mm <- model.matrix(~ lib_prep + lib_prep:nested_patient + sample_type, colData(ds.deseq)) # To remove the columns full of zeros: mm <- mm[ , colSums(mm) > 0]
And I ran the program as follows:
ds.deseq <- DESeq(ds.deseq, full=mm,
modelMatrixType="standard", betaPrior=FALSE,
parallel = TRUE, BPPARAM=BatchJobsParam(workers = 64))
As result, I obtained:
> resultsNames(ds.deseq)
[1] "Intercept" "lib_prepB" "lib_prepC" "lib_prepD"
[5] "sample_typemetastasis" "sample_typeprimary" "lib_prepA.nested_patientB" "lib_prepB.nested_patientB"
[9] "lib_prepC.nested_patientB" "lib_prepD.nested_patientB" "lib_prepA.nested_patientC" "lib_prepC.nested_patientC"
[13] "lib_prepC.nested_patientD" "lib_prepC.nested_patientE"
But I don't know which of these is the comparison that I want. How to select the contrast appropriately to have only the genes differentially expressed in metastasis vs primary tumor, but different than the normal samples, and controlling by library preparation and patient. Could you help me, please?
David
Samples summary:
ID | Library | Normal | Primary | Metastasis |
1 | A | 1 | 1 | 4 |
2 | A | 1 | 1 | 2 |
3 | A | 1 | 1 | 2 |
4 | B | 1 | 1 | 3 |
5 | B | 1 | 1 | - |
6 | C | 1 | 1 | 2 |
7 | C | 1 | 1 | 1 |
8 | C | 2 | 2 | 1 |
9 | C | 1 | 1 | 2 |
10 | D | 1 | - | 1 |
11 | D | 1 | - | 2 |
12 | D | 1 | - | 2 |
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] checkmate_1.8.3 pheatmap_1.0.8 tsne_0.1-3 ggrepel_0.6.5
[5] ggplot2_2.2.1 gplots_3.0.1 RColorBrewer_1.1-2 BiocParallel_1.8.2
[9] BatchJobs_1.6 BBmisc_1.11 matrixStats_0.52.2 biomaRt_2.30.0
[13] knitr_1.17 DESeq2_1.14.1 SummarizedExperiment_1.4.0 Biobase_2.34.0
[17] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 IRanges_2.8.2 S4Vectors_0.12.2
[21] BiocGenerics_0.20.0
Having the normal samples in the deseq object, even if you're not using them explicitly in the contrast, adds a few extra degrees of freedom better to estimate the variance of the residuals. So it then comes down to how much you believe the within-normals variability is consistent with the within-tumour variability. My experience is that they are commensurate generally, sometimes they are bit less variable (if the Tolstoy maxim holds, that all normal samples are the same, but all tumour samples are tumorous in their own way), and so I generally keep them (look at a few positive controls' behaviour to get a feel) but occasionally if there's varying contamination of normals with tumour I'll then definitely subset the deseq object to remove them from the analysis entirely.
Thanks a lot!
David