Hello,
I have 8 samples. 4 control and 4 treated. I can obtain a within sample dispersion estimate if I make a dds object using only the control samples.
condition <- factor(c(rep("Control1", 2), rep("Control2", 2)))
And obtain dispersion estimate from this WITHIN group comparison.
I would like to use the dispersion estimates obtained from this WITHIN group comparison and apply it to the BETWEEN group comparison
condition <- factor(c(rep("Control", 4), rep("Treatment", 4)))
in order to perform differential expression analysis on the BETWEEN group comparison using the WITHIN group dispersion estimate.
Can someone tell me how to accomplish this.
TIA.
hi,
Note that you can add a comment to an existing answer by clicking Add Comment. You've added a new answer, which is responding to the top post, which was your original question.
So you haven't posted your code, but typically, when you include a design formula that specifies which samples are in which group then the dispersion estimates are based off of the within group variance.
You should post all your code and your sessionInfo() so I and others can determine if there is an error.
Hi,
Here is my code:
b_frame<-read.csv("input_data.txt", header=T, row.names=1)
#b_frame<-na.omit(b_frame)
condition <- factor(c(rep("A", 3), rep("B", 3)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)
##make DESeqDataSetFromMatrix
dds<-DESeqDataSetFromMatrix(countData=b_frame, colData=coldata, design=~condition)
dds<-estimateSizeFactors(dds)
dds <- DESeq(dds)
res <- results(dds)
table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
write.csv(resdata, file="A1_Jenning_DESeq2.csv")
##AND sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
l
Previously you said you had 8 samples, 4 control and 4 treated, and within control there was some grouping of 2 vs 2. Here you have 3 control and 3 treated.
The point is, it matters which dispersion is estimated. If there are differences among the control and treated you can account for that with a batch term in the design. But I'm still not sure exactly how many samples you have and what is the difference between control1 and control2, so I can't give concrete advice.
Let us say I have 8 samples. The first 4 are control and the second 4 are treatment.
b_frame<-read.csv("input_data.txt", header=T, row.names=1)
#b_frame<-na.omit(b_frame)
condition <- factor(c(rep("control", 4), rep("treatment", 4)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)
##make DESeqDataSetFromMatrix
dds1<-DESeqDataSetFromMatrix(countData=b_frame, colData=coldata, design=~condition)
dds1<-estimateSizeFactors(dds1)
dds1 <- DESeq(dds1)
"I am not sure if I am accounting for the differences with a batch term here, I am assuming the condition term did this automatically?"
From the dds1 object I get the estimated dispersions.
I was curious to see what the estimated dispersions were from this file so I split the control samples in half so control1 is the first 2 samples of the original control and control2 is the second 2 samples of the original control and performed the same analysis
b_frame<-read.csv("input_data.txt", header=T, row.names=1)
c_frame<-b_frame[,1:4]
condition <- factor(c(rep("control1", 2), rep("control2", 2)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)
##make DESeqDataSetFromMatrix
dds1<-DESeqDataSetFromMatrix(countData=c_frame, colData=coldata, design=~condition)
dds2<-estimateSizeFactors(dds2)
dds2 <- DESeq(dds2)
When I compare the estimatedDispersions from dds2 to dds1 the estimatedDispersions are much smaller in dds2.
Thanks
I'm sorry, I'm having a hard time following this question with the hypotheticals. The dispersion estimation depends on which samples you provide. If you provide the software with control and treated samples, it estimates using the within group variance of those samples. If you only provide it the control samples, and create an artificial division among those, it will use the within-(artificial)-group variance of those samples. If the control samples have less variability than the treatment samples, then the second estimate will be lower (although this is not the proper way to estimate control sample variability, because you've made an artificial division among these samples). Still, any lower estimate you derive from the control samples alone shouldn't be used to compare control and treated, because, as I've said previously, you want the model to use the variability of all the samples in the comparison, not a subset.
WIll the following setup estimate within group variances for the control and treatment?
library(DESeq2)
c_frame<-read.csv("data.csv", header=T, sep=",", row.names=1)
condition <- factor(c(rep("control", 4), rep("treatment", 4)))
coldata <- data.frame(row.names=colnames(c_frame), condition)
dds<-DESeqDataSetFromMatrix(countData=c_frame, colData=coldata, design=~condition)
dds<-estimateSizeFactors(dds)
dds <- DESeq(dds)
res <- results(dds)
Also, if I have a dataset the has a greater fold change dynamic range than RNAseq but also has a larger variance, is there a specific estimateDispersions parameter to use? Is there an appropriate test to use with the results functions?
Thanks for your patience and help.
"WIll the following setup estimate within group variances for the control and treatment?"
yes.
"Also, if I have a dataset the has a greater fold change dynamic range than RNAseq but also has a larger variance, is there a specific estimateDispersions parameter to use? Is there an appropriate test to use with the results functions?"
You don't have to do anything special. The DESeq() functions will handle estimation regardless of the range of fold changes and within-group variance.