Hello All,
I'm something of a neophyte to DESeq2 and want to be sure I'm setting up my analysis appropriately given my sample groups.
First, some background: I have a gene expression dataset that captures five distinct age stages (1 through 5, 3 replicates apiece) as well as 2 clear behavioral states from the earliest and latest age stages (i.e., State A and B from stage 1, 4 replicates apiece; State A and B from stage 5, 5 replicates apiece). I can thus address questions regarding aging overall as well as the effects of age on behavioral state.
Thus far, I've been splitting the full sample set into separate GLM runs to approach each question independently (i.e., Age: stages 1-5 together in run 1; Behavior: behavioral states A and B from stages 1 and 5 together in run 2). I am wondering whether this is acceptable, or whether it would be more statistically appropriate to combine all samples, from both age and behavior, in a single GLM, using interaction terms to assign compound conditions to each sample and extracting results from this larger setup.
What follows is an example setup for my 'single-question' GLM addressing age:
library(DESeq2)
# Read the count matrix (genes in rows, samples in columns)
total_counts <- read.table(file = "TimeCourseIndividualsOnly.txt", header = TRUE, row.names = 1)
# One row per sample, in the same order as the columns of total_counts
expt_design <- data.frame(rows = colnames(total_counts),
condition = c("Time1", "Time1", "Time1", "Time2", "Time2", "Time2", "Time3", "Time3", "Time3", "Time4", "Time4", "Time4", "Time5", "Time5", "Time5"))
dds <- DESeqDataSetFromMatrix(countData = total_counts, colData = expt_design, design = ~ condition)
# DESeq() already runs estimateSizeFactors(), estimateDispersions() and nbinomWaldTest(),
# so those steps do not need to be called again afterwards
dds <- DESeq(dds)
colData(dds)
res <- results(dds)
# Example result: Time1 vs Time2
Time1vsTime2 <- results(dds, contrast = c("condition", "Time1", "Time2"))
Time1vsTime2 <- as.data.frame(Time1vsTime2)
write.csv(Time1vsTime2, "T01_Time1vsTime2_DESeq2test_09202018.csv", row.names = TRUE)
Thanks very kindly in advance!
Hey Michael,
Thank you very much for your attention and questions! I hope the following will address your questions and better clarify my sample set.
The libraries for all 33 samples were prepared in one go (we sent away for the work). The fifteen samples collected for age were gathered during a steady state at timepoints 1, 2, 3, 4, and 5, so they are neither A nor B. The eighteen behavior-associated samples reflect the gene expression of individuals performing either behavior A or B at either timepoint 1 (N = 4 for each behavior) or timepoint 5 (N = 5 for each behavior). As such, individuals that are A or B are also either 1 or 5.
Given this, I believe expression data for individuals collected at timepoints 1 and 5 could be used to address the question of gradual aging, but could also be used to compare against our two conspicuous behavioral states.
Again, hoping this helps, and do let me know if I can provide further info. Thank you again for your time!
I’m not sure yet how the extra 1 and 5 samples help because they are A and B while the other samples are some other category (neither A nor B) and so it’s not so easy to lump them in. Presumably there is some difference between A, B and neither A nor B?
Ah, I see. To be more explicit, we're looking at nestmates of a social insect. We have samples that capture brain gene expression of young females who are just resting in the nest (timepoint 1) as well as young foragers (Young A) and young nest guards (Young B). We've also collected old females (timepoint 5), as well as old foragers (Old A) and old nest guards (Old B). I'm interested in exploring the ways in which age may affect gene expression underlying each behavioral state (i.e., foraging and guarding), and figured part of that process would involve a comparison to age-matched individuals that were not engaged in either task.
I think it's easiest to analyze the datasets separately here, as you don't want to assume that the difference between resting, foraging and guarding is the same at time 1 and 5, and then you have three missing time points for the resting. It will make the analysis more straightforward, and you have plenty of degrees of freedom to estimate the dispersion (sometimes it is recommended to put all samples together to aid in estimation of dispersion, but here you have many samples).
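For the separate behavior run, one possible setup is sketched below. It is only an illustration under assumptions: the counts are simulated placeholders (not real data) so the code runs on its own, the column order (4 Young A, 4 Young B, 5 Old A, 5 Old B) is assumed, and the names behavior_counts, behavior_design and dds_behavior are made up for the example. The interaction term asks whether the B vs A difference changes with age.

```r
library(DESeq2)

# Simulated placeholder counts so the sketch runs on its own (not real data);
# replace with the count matrix for the 18 behavior samples.
set.seed(1)
behavior_counts <- counts(makeExampleDESeqDataSet(n = 200, m = 18))

# Assumed column order: 4 Young A, 4 Young B, 5 Old A, 5 Old B
behavior_design <- data.frame(
  row.names = colnames(behavior_counts),
  age = factor(rep(c("Young", "Old"), times = c(8, 10)), levels = c("Young", "Old")),
  behavior = factor(c(rep(c("A", "B"), each = 4), rep(c("A", "B"), each = 5)))
)

dds_behavior <- DESeqDataSetFromMatrix(countData = behavior_counts,
                                       colData = behavior_design,
                                       design = ~ age + behavior + age:behavior)
dds_behavior <- DESeq(dds_behavior)

# Check the coefficient names before extracting results
resultsNames(dds_behavior)

# Interaction term: is the B vs A difference different in Old than in Young?
res_interaction <- results(dds_behavior, name = "ageOld.behaviorB")
```

With Young and A as the reference levels, the interaction coefficient should appear as "ageOld.behaviorB" in resultsNames(); if the factor levels are ordered differently, the name will differ, so it is worth checking that output first.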
Thank you very much, Michael!