I have biological states called sample1, sample2 and negative control (NC) in duplicate. They are all from the same sequencing run.
I wrote R code using DESeq2 which generates all pairwise comparisons, i.e. sample1 vs NC, sample2 vs NC, sample1 vs sample2.
I have all raw reads in a file expr_table.csv
, i.e. a file with seven columns.
I seems to me that when DESeq2 deals with one pairwise comparison, then the raw reads in the expression table of the state which is not considered somehow influences the result. In other words, it makes a difference if I compare sample1 to NC using read counts from a table containing only sample1 and NC, or doing the same comparison using read counts from a table which contains columns for sample1, sample2 and NC.
Which approach is to be preferred? For each test, should I make a new subset data frame and corresponding contrasts description which contains only the samples being compared?
Or is it recommended to put everything (all biological states) into one big data frame with all raw read counts?
P.S.: in actuality I have more more states than sample1, sample2, NC. There are sometime up to 12 or more states.
For reference, my contrasts definition for the above example is:
condition
sample1_rep1 sample1
sample1_rep2 sample1
sample2_rep1 sample2
sample2_rep2 sample2
negctrl_rep1 NC
negctrl_rep2 NC
And my R code is:
library("DESeq2")
library("ggplot2")
library("corrplot")
mycountdata <- read.delim("expr_table.csv", header = TRUE, sep = "\t")
mycoldata <- read.delim("contrast.txt", header = TRUE, sep = "\t")
# colnames(mycountdata) <- NULL
dds <- DESeqDataSetFromMatrix(countData = mycountdata, colData = mycoldata, design =~ condition)
dds <- DESeq(dds)
png(file="tmp/png/dispersion.png")
plotDispEsts(dds)
dev.off()
# Define the main test function
de_check <- function (this_dds, this_contrast, outfile) {
res <- results(this_dds, contrast=this_contrast)
write.table(as.data.frame(res),file=paste("tmp/tsv/", outfile, ".tsv", sep = ""), quote=FALSE, sep="\t")
print(sum(res$padj < 0.1, na.rm=TRUE))
print(sum(res$pvalue < 0.05, na.rm=TRUE))
png(file=paste("tmp/png/", outfile, ".png", sep = ""))
ylim_min <- min(as.data.frame(res)$log2FoldChange)
ylim_max <- max(as.data.frame(res)$log2FoldChange)
DESeq2::plotMA(res, ylim = c(ylim_min, ylim_max))
dev.off()
}
mycombo <- combn(c("sample1", "sample2", "NC"),2)
myfunction <- function (x) {
de_check(dds, c("condition",x[1],x[2]), paste(x[1],x[2],sep="_"))
plotDispEsts(dds)
}
apply(mycombo, 2, myfunction)
Thank you Michael! Indeed it is there at https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups