Question

Difference between DESeq2 versions and explanation of betaPrior = False and padj

0

Entering edit mode

madal • 0

@madal-7238

Last seen 10.3 years ago

Denmark

Hi,
First of all, I'll excuse for the long question I'm about to post.
I am working on RNA-seq data from mice with 5 gene knock out on one allele (KN), contra wildtype (WT) mice. The knock out has been very well studied and it is absolutely certain they have a 50% decrees in expression of these 5 genes, proved in microarray studies and qPCR. Therefore, I start the valitation of my DEG by investigating the logFC of only these 5 genes which should be around a logFC -1, before I continue to look at all other genes.

Here is my cuestion. I get very different LogFC depending on DESeq version, and I'm therefore in doubt which one I should use, also when taking into acount that I would like the LogFC of these 5 genes to be close to -1.

The code I run is here:

countTable <- as.matrix(read.table("15qCtx_Week18_RW_S5.csv", header=TRUE, sep = ",", row.names = 1))

data = read.table("/home/projects9/pr_99009/people/dalby/data/metafile_MD2_S5_row.txt", header=TRUE, sep="\t", as.is=TRUE, row.names=1)
keep = grep("E_Ctx_Week18_15q_WT|E_Ctx_Week18_15q_KN", rownames(data))
data = data[keep,]

head(countTable)
              X143E_Ctx_Week18_15q_R X144E_Ctx_Week18_15q_R
Klf13                           1116                   1048
Otud7a                           290                    289
Mtmr10                            80                    101
Fan1                              42                     56
Chrna7                            81                     78
E130012A19Rik                   1547                   1471
...

> rownames(data) = paste("X",rownames(data), sep='')
> data
                        Condition Study
X133E_Ctx_Week18_15q_WT        WT    S5
X134E_Ctx_Week18_15q_WT        WT    S5
X135E_Ctx_Week18_15q_WT        WT    S5
X136E_Ctx_Week18_15q_WT        WT    S5
X137E_Ctx_Week18_15q_WT        WT    S5
X138E_Ctx_Week18_15q_WT        WT    S5
X139E_Ctx_Week18_15q_WT        WT    S5
X140E_Ctx_Week18_15q_WT        WT    S5
X141E_Ctx_Week18_15q_WT        WT    S5
X142E_Ctx_Week18_15q_WT        WT    S5
X143E_Ctx_Week18_15q_KN        KN    S5
X144E_Ctx_Week18_15q_KN        KN    S5
X145E_Ctx_Week18_15q_KN        KN    S5
X146E_Ctx_Week18_15q_KN        KN    S5
X147E_Ctx_Week18_15q_KN        KN    S5
X148E_Ctx_Week18_15q_KN        KN    S5
X149E_Ctx_Week18_15q_KN        KN    S5
X150E_Ctx_Week18_15q_KN        KN    S5
X151E_Ctx_Week18_15q_KN        KN    S5
X152E_Ctx_Week18_15q_KN        KN    S5

grps <- unique(data$Condition)
countData <- countTable
colData <- data[,]
dds <- DESeqDataSetFromMatrix(countData = countTable, colData = colData, design = ~ Condition)
colData(dds)$Condition <- factor(colData(dds)$Condition,levels=c(paste(grps[2]), paste(grps[1])))
dds <- DESeq(dds)
res <- results(dds)

df = as.data.frame(res@listData)
rownames(df)=rownames(res)

#the 5 knock out genes
mygenes=c("Klf13","Fan1","Mtmr10","Trpm1","Otud7a","Chrna7")
res.mygenes <- df[match(mygenes, rownames(df)),]

#Results from DESeq2.1.2.10

res.mygenes
          baseMean log2FoldChange      lfcSE       stat        pvalue
Klf13 1504.620099     -0.9643976 0.03709111 -26.000772 4.853623e-149
Fan1     70.370590     -0.9792759 0.09649980 -10.147958 3.383700e-24
Mtmr10 147.037349     -0.9954604 0.08112723 -12.270362 1.306677e-34
Trpm1     1.688376     -0.2920688 0.15518364 -1.882085 5.982449e-02
Otud7a 410.085585     -0.9448236 0.04486432 -21.059576 1.868414e-98
Chrna7   97.909469     -0.7648929 0.08076019 -9.471163 2.767450e-21
                padj
Klf13 8.518108e-145
Fan1    1.484599e-20
Mtmr10 7.644059e-31
Trpm1   9.998854e-01
Otud7a 1.639533e-94
Chrna7 9.713749e-18

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] DESeq2_1.2.10           RcppArmadillo_0.4.600.0 Rcpp_0.11.3
[4] GenomicRanges_1.18.4    GenomeInfoDb_1.2.4      IRanges_2.0.1
[7] S4Vectors_0.4.0         BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0       DBI_0.3.1
[4] RColorBrewer_1.1-2   RSQLite_1.0.0        XML_3.98-1.1
[7] XVector_0.6.0        annotate_1.44.0      genefilter_1.48.1
[10] grid_3.1.2           lattice_0.20-29      locfit_1.5-9.1
[13] splines_3.1.2        survival_2.37-7      xtable_1.7-4

When i do the exact same thing in DESeq2_1.6.3 i get these results for "mygenes"

res.mygenes
         baseMean   log2FoldChange      lfcSE       stat        pvalue
Klf13 1504.620099 -0.834331 0.0324531356346944 -25.7087863665747 9.32262159169848e-146
Fan1     70.370589 -0.460786 0.0459702788278632 -10.0235537123434 1.2010507673883e-23
Mtmr10 147.037349 -0.564155 0.0461138730430435 -12.2339469291176 2.04742960211648e-34
Trpm1     2264.363771 0.027985 0.0338212899033766 0.827432430445383 0.407992005469013
Otud7a 410.085585 -0.766018 0.0370385043881411 -20.6816646445871 5.06621655507093e-95
Chrna7   97.90947 -0.431161 0.046145258218213 -9.34355679610976 9.31511337873981e-21
                padj
Klf13 1.63770493501367e-141
Fan1    5.27471470767755e-2
Mtmr10 1.19890652734601e-30
Trpm1   0.999800543063416
Otud7a 4.44991131114655e-91
Chrna7 3.27277193448644e-17

As far as I can read, there has been a change in the scrinkage feature by betaPrior settings for the DESeq command between version 1.4 and 1.5. Therefore I tried to run the commands in DESeq2_1.6.3 with betaPrior=FALSE. However, I still did not get the same results as in version 1.2.10 and also, the LogFC where much higher now, then expected.

"dds <- DESeq(dds, betaPrior = FALSE)"

#Results version DESeq2_1.6.3 betaPrior=FALSE

res.mygenes
        baseMean log2FoldChange      lfcSE         stat        pvalue
Klf13 1504.62010     -0.9746567 0.03796482   -25.672628 2.363611e-145
Otud7a 410.08558     -0.9596691 0.04666032   -20.567134 5.406790e-94
Mtmr10 147.03735     -1.0508193 0.08629489   -12.177075 4.117856e-34
Fan1     70.37059     -1.0605465 0.10667968    -9.941410 2.749071e-23
Chrna7   97.90947     -0.8070744 0.08689973    -9.287422 1.580709e-20

My question is therefore, can someone help me explain the reason why I see these changes?
I really hope you will take the time answering this in order for me to choose the most suitable method. I cant find any really good explanation about the difference in version and advice for my data.
Thank you,

Furthermore, I do not understand the cut-off for the adjusted pvalue "padj". In some of my data
Ex. I experience that a pvalue of 7.60152130102369e-06 becomes NA for padj, even though the result dataset contains 26.741 "hit" genes, for instance. Doesn't is use a Benjamini and Hochberg udjestment for multiple testing?

Thank you, best Maria

DESeq2 LogFC Adjusted pvalue • 4.0k views

ADD COMMENT • link updated 10.3 years ago by Michael Love 43k • written 10.3 years ago by madal • 0

score 2 · Answer 1 · 2015-01-12

"My question is therefore, can someone help me explain the reason why I see these changes?"

There would not be changes to LFC if you compared between version 1.2 without prior and version 1.6 without prior. But there are obviously changes when you use a prior or when you do not.

You might want to not use a prior if you want to recover LFCs from genes with very low counts, if you are willing to accept that these unshrunken estimates of LFC can have high variance. Specifically, if you want to recover an LFC of -1 regardless of whether the genes have low counts and the LFCs have higher variance, it looks like using betaPrior=FALSE would be the right option.

"I cant find any really good explanation about the difference in version and advice for my data."

It sounds like you found the NEWS file, which explains all changes which would affect user's results. There we explain that we made the prior calculation more robust in v1.5 by incorporating weights. Previously, if there were many genes with low counts, these genes would overly influence the prior calculation, which ended up producing very little shrinkage (this is why your LFC estimates from version 1.2 look very similar to version 1.6 without prior). In version 1.6, the genes are weighted by the expected variance of LFC, such that the low count genes with noisy LFC estimates contribute very little to the prior calculation.

"Furthermore, I do not understand the cut-off for the adjusted pvalue padj. In some of my data Ex. I experience that a pvalue of 7.60152130102369e-06 becomes NA for padj"

This filtering is explained in the vignette Section 3.8 "Independent filtering of results", so check there first. A filter on the mean of counts is chosen that maximizes the number of adjusted p-value less than a user-defined FDR threshold (default 0.1). Those genes which don't pass the mean filter get an NA. So if a p-value gets an NA, it means that it had a low count, and by filtering out this gene, more genes with larger counts could pass the FDR threshold. However, you can choose to turn off the independent filtering, which will give you less genes in total with small FDR, but could pass through some genes with small mean count.