DESeq2 design formula - how to account for donor and /or sample?
1
0
Entering edit mode
C. Janssen • 0
@c-janssen-15831
Last seen 6.6 years ago

Hi everyone,

I would like to use DESeq2 to analyse RNAseq data from 2 different cell populations from 8 donors (positive vs negative T-cells).

Please note, I’m a beginner, so my apologies for any mistakes!

I searched through different posts and the DESeq2 vignette - and I think I'm on the right track, but I’d like to ask for some advice on my project, in particular the design formula.

 

My coldata includes a unique code for each sample, the donor number, the cell population and patient info:

 head.DataTable(samplesheet, n=16)
   Sample Donor Population RNAseqBatch     Study        Number   Gender  Birthdate
1      1A    D1        CD8           1     03/0174      166779      f    07-06-1989
2      2A    D2        CD8           1     03/0620      168739      f    14-11-1991
3      3A    D3        CD8           1     03/0293      167373      f    15-sep-87
4      4A    D4        CD8           1     01/0839      168505      f    18-okt-89
5      5A    D5        CD8           1     01/0202      166296      f    08-dec-88
6      6A    D6        CD8           1     03/0105      166322      f    18-okt-89
7      7A    D7        CD8           1     01/0894      168770      m    21-jan-90
8      8A    D8        CD8           1     01/0781      168366      f    08-okt-89
9      1B    D1        CD8pos        1     03/0174      166779      f    07-06-1989
10     2B    D2        CD8pos        1     03/0620      168739      f    14-11-1991
11     3B    D3        CD8pos        1     03/0293      167373      f    15-sep-87
12     4B    D4        CD8pos        1     01/0839      168505      f    18-okt-89
13     5B    D5        CD8pos        1     01/0202      166296      f    08-dec-88
14     6B    D6        CD8pos        1     03/0105      166322      f    18-okt-89
15     7B    D7        CD8pos        1     01/0894      168770      m    21-jan-90
16     8B    D8        CD8pos        1     01/0781      168366      f    08-okt-89

 

We want to detect the differentially expressed genes in CD8pos vs CD8.

So I made my DESeqDataSet:

counts <- read.delim("FG1703_All_mapped_reads2.txt", header = TRUE, row.names = 1)
samplesheet <- read.delim("S1samplesheetanalysis3.txt", header = TRUE, row.names = NULL)


ddsPD  <- DESeqDataSetFromMatrix(countData = counts, colData = samplesheet,
                              design = ~ Population + Donor )

My reasoning here is that I can now calculate the effect of cell population, accounting for the fact that each donor has sample pairing (2 or 3 cell populations per donor). But is this design correct this way? We get quite a lot of genes this way:

#Results 

ddsPD <- DESeq(ddsPD)

resPD <- results(ddsPD)

>head(resPD)

>resultsNames(ddsPD)

 [1] "Intercept"             "Population_CD8pos2_vs_CD8" "Population_CD8pos_vs_CD8"

[4] "Donor_D2_vs_D1"        "Donor_D3_vs_D1"        "Donor_D4_vs_D1"      

[7] "Donor_D5_vs_D1"        "Donor_D6_vs_D1"        "Donor_D7_vs_D1"      

[10] "Donor_D8_vs_D1"


# Then contrasting the cell populations

resPD2 <- results(ddsPD, contrast=c("Population", "CD8pos", "CD8"), alpha = 0.05, lfcThreshold=1)

table(resPD2$padj < 0.05)

>summary(resPD2)

out of 33246 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 353, 1.1%

LFC < 0 (down)   : 1, 0.003%

#outliers [1]     : 0, 0%

#low counts [2]   : 14181, 43%

#(mean count < 7)


>resPD2

log2 fold change (MLE): Population CD8pos vs CD8

wald test p-value: Population CD8pos vs CD8

 

Is my design correct using:  formula(~ Population + Donor)? Do I need to include ‘Sample’ in the formula as well? Or is this not necessary? I’m still in doubt.

If yes, when I add ‘Sample’ to the formula, I get the ‘Model matrix not full rank’ error. Should I then use the ‘Model matrix not full rank’ method?

 

Thank you for any advice you can provide!

Kind regards,

C. Janssen

 

 

> sessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

 

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets

[8] methods   base    

 

other attached packages:

 [1] ggplot2_2.2.1               dplyr_0.7.4               

 [3] DESeq2_1.20.0               SummarizedExperiment_1.10.1

 [5] DelayedArray_0.6.0          BiocParallel_1.14.1       

 [7] matrixStats_0.53.1          Biobase_2.40.0            

 [9] GenomicRanges_1.32.2        GenomeInfoDb_1.16.0        

[11] IRanges_2.14.9              S4Vectors_0.18.1          

[13] BiocGenerics_0.26.0         BiocInstaller_1.30.0 

deseq2 multiple factor design rna-seq • 2.0k views
ADD COMMENT
1
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 4 hours ago
San Diego

Do not add "sample" to the design.  You can't add anything to the design if it's different in every single sample.

ADD COMMENT
0
Entering edit mode

Ah I didn't know that! Thanks for your response.

ADD REPLY
0
Entering edit mode

For this I'd use a design of ~donor + population (it's typical in R/Bioconductor to put the variable of interest at the end of the design formula, although the models are mathematically equivalent). This means, each donor has a baseline level, and then there is a coefficient (an LFC) which will be estimated for the difference between populations, controlling for the donor baselines.

ADD REPLY
0
Entering edit mode

Thanks Michael! I'm glad to have confirmation of the design. 

ADD REPLY

Login before adding your answer.

Traffic: 556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6