Hi everyone,
A coworker used DESeq2 for a project we worked on together, but based on the results I'm not sure the contrasts performed were doing what we thought they were. Could y'all take a look and help me figure out what the contrasts are doing, and how I should change them so that they do what we intended them to?
We have RNA-seq data that has two different species, each with three tissues (pollen, style, and leaf), each with subcategories (ungerminated and germinated pollen, unpollinated and pollinated style, and several leaf stages). Given the nesting structure (tissue status in tissue, tissues in species) we made the following model matrix:
id species tissue1 status status_nested
1 Slyc_polgerm1 lyc pollen germ 1
2 Slyc_polgerm2 lyc pollen germ 1
3 Slyc_polgerm3 lyc pollen germ 1
4 Slyc_polungerm1 lyc pollen ungerm 2
5 Slyc_polungerm2 lyc pollen ungerm 2
6 Slyc_polungerm3 lyc pollen ungerm 2
7 Slyc_stygerm1 lyc style germ 1
8 Slyc_stygerm2 lyc style germ 1
9 Slyc_stygerm3 lyc style germ 1
10 Slyc_styungerm1 lyc style ungerm 2
11 Slyc_styungerm2 lyc style ungerm 2
12 Slyc_styungerm3 lyc style ungerm 2
13 Spen_polgerm1 pen pollen germ 1
14 Spen_polgerm2 pen pollen germ 1
15 Spen_polgerm3 pen pollen germ 1
16 Spen_polungerm2 pen pollen ungerm 2
17 Spen_polungerm3 pen pollen ungerm 2
18 Spen_stygerm1 pen style germ 1
19 Spen_stygerm2 pen style germ 1
20 Spen_stygerm3 pen style germ 1
21 Spen_styungerm1 pen style ungerm 2
22 Spen_styungerm2 pen style ungerm 2
23 Spen_styungerm3 pen style ungerm 2
24 Spen_polungerm1 pen pollen ungerm 2
25 Slyc_leafP3r1 lyc leaf P3 1
26 Slyc_leafP3r2 lyc leaf P3 1
27 Slyc_leafP3r3 lyc leaf P3 1
28 Slyc_leafP4d1 lyc leaf P4 2
29 Slyc_leafP4d2 lyc leaf P4 2
30 Slyc_leafP4d3 lyc leaf P4 2
31 Slyc_leafP4p1 lyc leaf P4 2
32 Slyc_leafP4p2 lyc leaf P4 2
33 Slyc_leafP4p3 lyc leaf P4 2
34 Slyc_leafP5d1 lyc leaf P5 3
35 Slyc_leafP5d2 lyc leaf P5 3
36 Slyc_leafP5d3 lyc leaf P5 3
37 Slyc_leafP5p1 lyc leaf P5 3
38 Slyc_leafP5p2 lyc leaf P5 3
39 Slyc_leafP5p3 lyc leaf P5 3
40 Slyc_leafP6d1 lyc leaf P6 4
41 Slyc_leafP6d2 lyc leaf P6 4
42 Slyc_leafP6d3 lyc leaf P6 4
43 Slyc_leafP6p1 lyc leaf P6 4
44 Slyc_leafP6p2 lyc leaf P6 4
45 Slyc_leafP6p3 lyc leaf P6 4
46 Spen_leafP3r1 pen leaf P3 1
47 Spen_leafP3r2 pen leaf P3 1
48 Spen_leafP3r3 pen leaf P3 1
49 Spen_leafP4d1 pen leaf P4 2
50 Spen_leafP4d2 pen leaf P4 2
51 Spen_leafP4d3 pen leaf P4 2
52 Spen_leafP4p1 pen leaf P4 2
53 Spen_leafP4p2 pen leaf P4 2
54 Spen_leafP4p3 pen leaf P4 2
55 Spen_leafP5d1 pen leaf P5 3
56 Spen_leafP5d2 pen leaf P5 3
57 Spen_leafP5d3 pen leaf P5 3
58 Spen_leafP5p1 pen leaf P5 3
59 Spen_leafP5p2 pen leaf P5 3
60 Spen_leafP5p3 pen leaf P5 3
61 Spen_leafP6d1 pen leaf P6 4
62 Spen_leafP6d2 pen leaf P6 4
63 Spen_leafP6d3 pen leaf P6 4
64 Spen_leafP6p1 pen leaf P6 4
65 Spen_leafP6p2 pen leaf P6 4
66 Spen_leafP6p3 pen leaf P6 4
We defined this as 'meta' in the R file. A few notes on it. ungerm and germ for style is unpollinated and pollinated style tissue. I worry this may be an issue because its called the same for ungerminated and germinated pollen tissue. Also, one of the ungerminated pollen tissue samples from the 'pen' species is separated from the other, but this matches with the column order in the read counts dataframe.
Here's how we defined the associated model matrix:
model_matrix <- model.matrix(~species+tissue1+tissue1:status_nested+tissue1:species, meta)
We also modified the ensuing model matrix to remove combinations of factors that had no observations:
all.zero <- apply(model_matrix, 2, function(x) all(x==0))
idx <- which(all.zero)
model_matrix <- model_matrix[,-idx]
Then we defined the DESeq2 object and ran DESeq2:
df <- DESeqDataSetFromMatrix(countData = readCounts, tidy=T, colData = meta, design =~1)
dds <- DESeq(df, full = model_matrix)
Then we ran several contrasts to determine genes that were significantly different between or across certain tissues. I think this is where issues may have cropped up, but I'm having difficulty understanding what contrasts they actually ran and how to change them so that they are comparing what we intended to compare.
Here are the resultsNames() from our dds object:
[1] "Intercept" "speciespen" "tissue1pollen"
[4] "tissue1style" "tissue1leaf.status_nested2" "tissue1pollen.status_nested2"
[7] "tissue1style.status_nested2" "tissue1leaf.status_nested3" "tissue1leaf.status_nested4"
[10] "speciespen.tissue1pollen" "speciespen.tissue1style"
The first contrasts were trying to compare pollen expression to leaf expression, and style expression to leaf expression irrespective of species:
pollen_v_leaf <- results(dds, contrast = c(0,0,1,0,0,0.33,0,0,0,1,0))
style_v_leaf <- results(dds, contrast = c(0,0,0,1,0,0,0.33,0,0,0,1))
The next contrast was trying to compare expression between species across all tissues:
lyc_vs_pen <- results(dds, contrast = c(0,1,0,0,0,0,0,0,0,0.33,1))
Then we ran two contrasts to try and compare pollen expression between species and style expression between species:
pollen_species <- results(dds, contrast = c(0,0,0,0,0,0,0,0,0,1,0))
style_species <- results(dds, contrast = c(0,0,0,0,0,0,0,0,0,0,1))
The pollen species contrast was the first I noticed had issues, based on the log fold changes for certain genes being in opposite directions, despite changing expression a large amount in the same direction between species. Then I noticed that none of the log fold change results really made sense. I changed the pollen contrast to this to see if it would fix it:
pollen_speciesTest <- results(dds, contrast = c(0,0,1,0,0,0,0,0,0,1,0))
I changed it to this because I thought that adding the 1 in the third spot would possibly compare the pollen expression from one species ("tissue1pollen" from resultsNames for the third position) with the expression in the other species ("speciespen.tissue1pollen"). This resulted in the log fold change values being closer to ones I calculated for these species, but they still weren't exact, and many genes, especially those with considerable expression in none pollen tissues, were far off what the log fold change differences should have been for pollen.
Thank you for your time! If you have any idea about where things might be going wrong I would appreciate the assistance. If there's any other information you think would help in understanding the issue I will happily add it to this post.