Hi,
I am trying to figure out what is the correct model formula and analysis approach is with DESeq2 for my multifactor design in a context of microbiome sequencing. What I have is 42 samples coming from rhizospheric soils from 7 different plant species (that are considered as invasive or native) and that are infected or not by a plant parasite.
Here is the initial colDataset:
infection host_species host_origin
s2-1 2 6 2
s2-2 2 6 2
s2-3 2 6 2
s2-4 1 6 2
s2-5 1 6 2
s2-6 1 6 2
s3-1 2 7 2
[...]
s7-5 1 2 1
s7-6 1 2 1
s8-1 2 3 1
s8-2 2 3 1
s8-3 2 3 1
s8-4 1 3 1
s8-5 1 3 1
s8-6 1 3 1
What I would like is to observe the main effect of the treatment and wether the response to infection is modulated by the host_origin of the plant species. I have to consider host_species since I know it acts as a batch factor. Then, I would be interested in the LRT results with the following design:
design <- ~ host_species + host_origin*infection
However, host_species is nested in the host_origin, so I created species.nested column, and since my design is unbalanced I created a model matrix where I removed the column with all zero, as suggested in the vignette.
Here is the modified colDataset:
spec_nested infection host_species host_origin
s2-1 1 2 6 2
s2-2 1 2 6 2
s2-3 1 2 6 2
s2-4 1 1 6 2
s2-5 1 1 6 2
s2-6 1 1 6 2
s3-1 2 2 7 2
[...]
s6-1 4 2 5 2
s7-5 3 1 2 1
s7-6 3 1 2 1
s8-1 4 2 3 1
s8-2 4 2 3 1
s8-3 4 2 3 1
s8-4 4 1 3 1
s8-5 4 1 3 1
s8-6 4 1 3 1
Here line code for the full model matrix:
fm <- model.matrix(~ spec.nested:host_origin + infection*host_origin ,colData(ps_deseq))
idx <- which(colSums(fm == 0) == nrow(fm))
fm <- fm[,-idx]
The full model matrix:
> unname(fm)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 0 1 0 0 0 0 0 0
[2,] 1 1 0 1 0 0 0 0 0 0
[3,] 1 1 0 1 0 0 0 0 0 0
[4,] 1 0 0 1 0 0 0 0 0 0
[5,] 1 0 0 1 0 0 0 0 0 0
[6,] 1 0 0 1 0 0 0 0 0 0
[7,] 1 1 0 0 1 0 0 0 0 0
[8,] 1 1 0 0 1 0 0 0 0 0
[9,] 1 1 0 0 1 0 0 0 0 0
[...]
[32,] 1 1 0 0 0 0 0 0 1 1
[33,] 1 1 0 0 0 0 0 0 1 1
[34,] 1 0 0 0 0 0 0 0 1 0
[35,] 1 0 0 0 0 0 0 0 1 0
[36,] 1 0 0 0 0 0 0 0 1 0
[37,] 1 1 0 0 0 0 1 0 0 1
[38,] 1 1 0 0 0 0 1 0 0 1
[39,] 1 1 0 0 0 0 1 0 0 1
[40,] 1 0 0 0 0 0 1 0 0 0
[41,] 1 0 0 0 0 0 1 0 0 0
[42,] 1 0 0 0 0 0 1 0 0 0
rm1 <- model.matrix(~ spec_origin:host_origin + host_origin:infection ,colData(ps_deseq))
idx <- which(colSums(rm1 == 0) == nrow(rm1))
rm1 <- rm1[,-idx]
The reduced model matrix (I removed the pure effect of infection)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 0 1 0 0 0 0 0 1 0
[2,] 1 0 1 0 0 0 0 0 1 0
[3,] 1 0 1 0 0 0 0 0 1 0
[4,] 1 0 1 0 0 0 0 0 0 0
[5,] 1 0 1 0 0 0 0 0 0 0
[6,] 1 0 1 0 0 0 0 0 0 0
[7,] 1 0 0 1 0 0 0 0 1 0
[8,] 1 0 0 1 0 0 0 0 1 0
[9,] 1 0 0 1 0 0 0 0 1 0
[...]
[32,] 1 0 0 0 0 0 0 1 0 1
[33,] 1 0 0 0 0 0 0 1 0 1
[34,] 1 0 0 0 0 0 0 1 0 0
[35,] 1 0 0 0 0 0 0 1 0 0
[36,] 1 0 0 0 0 0 0 1 0 0
[37,] 1 0 0 0 0 1 0 0 0 1
[38,] 1 0 0 0 0 1 0 0 0 1
[39,] 1 0 0 0 0 1 0 0 0 1
[40,] 1 0 0 0 0 1 0 0 0 0
[41,] 1 0 0 0 0 1 0 0 0 0
[42,] 1 0 0 0 0 1 0 0 0 0
I understand I made something wrong in the designs of the full model or of the reduced model matrix since the dim of the two matrix is similar but I definitly can't determine what is my mistake.
Then I cannot run the LRT test I would like:
ps_deseq = DESeq(ps_deseq, test="LRT", full=mm, reduced=rm1)
I have two questions:
1) What would be the right syntax for my full and reduced model matrix?
2) My sample are microbiome samples, reason why every gene of my raw data contains at least one zero and I have to use the geometric mean as defined by :
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))}
Since I am not using the design as defined in my DESeqDataSet, how can I manage estimateSizeFactors according to my full model.matrix?
I cannot not find a solution for this in the vignette or in other discussions in this forum. I'd greatly appreciate if some help can be provided. Thanks a lot.
Caroline
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.1
[4] BiocParallel_1.14.1 matrixStats_0.53.1 Biobase_2.40.0
[7] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0 IRanges_2.14.10
[10] S4Vectors_0.18.3 BiocGenerics_0.26.0 biomformat_1.8.0
[13] ape_5.1 scales_1.0.0 phyloseq_1.24.0
[16] RColorBrewer_1.1-2 wesanderson_0.3.6 manipulate_1.0.1
[19] stringr_1.3.1 tidyr_0.8.1 ggplot2_3.0.0
[22] fossil_0.3.7 shapefiles_0.7 foreign_0.8-70
For a reproductible example, please find the dataset and the script here:
https://wetransfer.com/downloads/db45667b6d030720caa485fdfd9b459f20180927090701/983649e13f9e65d3ef3cadbcefc3874720180927090701/1be251