Hi,
First of all, I'll excuse for the long question I'm about to post.
I am working on RNA-seq data from mice with 5 gene knock out on one allele (KN), contra wildtype (WT) mice. The knock out has been very well studied and it is absolutely certain they have a 50% decrees in expression of these 5 genes, proved in microarray studies and qPCR. Therefore, I start the valitation of my DEG by investigating the logFC of only these 5 genes which should be around a logFC -1, before I continue to look at all other genes.
Here is my cuestion. I get very different LogFC depending on DESeq version, and I'm therefore in doubt which one I should use, also when taking into acount that I would like the LogFC of these 5 genes to be close to -1.
The code I run is here:
countTable <- as.matrix(read.table("15qCtx_Week18_RW_S5.csv", header=TRUE, sep = ",", row.names = 1))
data = read.table("/home/projects9/pr_99009/people/dalby/data/metafile_MD2_S5_row.txt", header=TRUE, sep="\t", as.is=TRUE, row.names=1)
keep = grep("E_Ctx_Week18_15q_WT|E_Ctx_Week18_15q_KN", rownames(data))
data = data[keep,]
head(countTable)
X143E_Ctx_Week18_15q_R X144E_Ctx_Week18_15q_R
Klf13 1116 1048
Otud7a 290 289
Mtmr10 80 101
Fan1 42 56
Chrna7 81 78
E130012A19Rik 1547 1471
...
> rownames(data) = paste("X",rownames(data), sep='')
> data
Condition Study
X133E_Ctx_Week18_15q_WT WT S5
X134E_Ctx_Week18_15q_WT WT S5
X135E_Ctx_Week18_15q_WT WT S5
X136E_Ctx_Week18_15q_WT WT S5
X137E_Ctx_Week18_15q_WT WT S5
X138E_Ctx_Week18_15q_WT WT S5
X139E_Ctx_Week18_15q_WT WT S5
X140E_Ctx_Week18_15q_WT WT S5
X141E_Ctx_Week18_15q_WT WT S5
X142E_Ctx_Week18_15q_WT WT S5
X143E_Ctx_Week18_15q_KN KN S5
X144E_Ctx_Week18_15q_KN KN S5
X145E_Ctx_Week18_15q_KN KN S5
X146E_Ctx_Week18_15q_KN KN S5
X147E_Ctx_Week18_15q_KN KN S5
X148E_Ctx_Week18_15q_KN KN S5
X149E_Ctx_Week18_15q_KN KN S5
X150E_Ctx_Week18_15q_KN KN S5
X151E_Ctx_Week18_15q_KN KN S5
X152E_Ctx_Week18_15q_KN KN S5
grps <- unique(data$Condition)
countData <- countTable
colData <- data[,]
dds <- DESeqDataSetFromMatrix(countData = countTable, colData = colData, design = ~ Condition)
colData(dds)$Condition <- factor(colData(dds)$Condition,levels=c(paste(grps[2]), paste(grps[1])))
dds <- DESeq(dds)
res <- results(dds)
df = as.data.frame(res@listData)
rownames(df)=rownames(res)
#the 5 knock out genes
mygenes=c("Klf13","Fan1","Mtmr10","Trpm1","Otud7a","Chrna7")
res.mygenes <- df[match(mygenes, rownames(df)),]
#Results from DESeq2.1.2.10
res.mygenes
baseMean log2FoldChange lfcSE stat pvalue
Klf13 1504.620099 -0.9643976 0.03709111 -26.000772 4.853623e-149
Fan1 70.370590 -0.9792759 0.09649980 -10.147958 3.383700e-24
Mtmr10 147.037349 -0.9954604 0.08112723 -12.270362 1.306677e-34
Trpm1 1.688376 -0.2920688 0.15518364 -1.882085 5.982449e-02
Otud7a 410.085585 -0.9448236 0.04486432 -21.059576 1.868414e-98
Chrna7 97.909469 -0.7648929 0.08076019 -9.471163 2.767450e-21
padj
Klf13 8.518108e-145
Fan1 1.484599e-20
Mtmr10 7.644059e-31
Trpm1 9.998854e-01
Otud7a 1.639533e-94
Chrna7 9.713749e-18
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.600.0 Rcpp_0.11.3
[4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1
[7] S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0 DBI_0.3.1
[4] RColorBrewer_1.1-2 RSQLite_1.0.0 XML_3.98-1.1
[7] XVector_0.6.0 annotate_1.44.0 genefilter_1.48.1
[10] grid_3.1.2 lattice_0.20-29 locfit_1.5-9.1
[13] splines_3.1.2 survival_2.37-7 xtable_1.7-4
When i do the exact same thing in DESeq2_1.6.3 i get these results for "mygenes"
res.mygenes
baseMean log2FoldChange lfcSE stat pvalue
Klf13 1504.620099 -0.834331 0.0324531356346944 -25.7087863665747 9.32262159169848e-146
Fan1 70.370589 -0.460786 0.0459702788278632 -10.0235537123434 1.2010507673883e-23
Mtmr10 147.037349 -0.564155 0.0461138730430435 -12.2339469291176 2.04742960211648e-34
Trpm1 2264.363771 0.027985 0.0338212899033766 0.827432430445383 0.407992005469013
Otud7a 410.085585 -0.766018 0.0370385043881411 -20.6816646445871 5.06621655507093e-95
Chrna7 97.90947 -0.431161 0.046145258218213 -9.34355679610976 9.31511337873981e-21
padj
Klf13 1.63770493501367e-141
Fan1 5.27471470767755e-2
Mtmr10 1.19890652734601e-30
Trpm1 0.999800543063416
Otud7a 4.44991131114655e-91
Chrna7 3.27277193448644e-17
As far as I can read, there has been a change in the scrinkage feature by betaPrior settings for the DESeq command between version 1.4 and 1.5. Therefore I tried to run the commands in DESeq2_1.6.3 with betaPrior=FALSE. However, I still did not get the same results as in version 1.2.10 and also, the LogFC where much higher now, then expected.
"dds <- DESeq(dds, betaPrior = FALSE)"
#Results version DESeq2_1.6.3 betaPrior=FALSE
res.mygenes
baseMean log2FoldChange lfcSE stat pvalue
Klf13 1504.62010 -0.9746567 0.03796482 -25.672628 2.363611e-145
Otud7a 410.08558 -0.9596691 0.04666032 -20.567134 5.406790e-94
Mtmr10 147.03735 -1.0508193 0.08629489 -12.177075 4.117856e-34
Fan1 70.37059 -1.0605465 0.10667968 -9.941410 2.749071e-23
Chrna7 97.90947 -0.8070744 0.08689973 -9.287422 1.580709e-20
My question is therefore, can someone help me explain the reason why I see these changes?
I really hope you will take the time answering this in order for me to choose the most suitable method. I cant find any really good explanation about the difference in version and advice for my data.
Thank you,
Furthermore, I do not understand the cut-off for the adjusted pvalue "padj". In some of my data
Ex. I experience that a pvalue of 7.60152130102369e-06 becomes NA for padj, even though the result dataset contains 26.741 "hit" genes, for instance. Doesn't is use a Benjamini and Hochberg udjestment for multiple testing?
Thank you, best Maria