I am working with 18 RNAseq datasets of 6 conditions in triplicates. I am importing the data from RSEM using TXimport and proceeding with deseq2 - My question is that looking at the results for a preliminary comparison I see that I have high Log2fc (~20) and significant Padjust, however when I go to the counts from the original RSEM file I see the counts for these specific gene (OTC-60486) as CTRL 22.06 , 0 , 0 and INF 0, 1190.1, 0. I would expect to have a high P adjust. However this is the result of the comparison
Gene id baseMean log2FoldChange lfcSE stat pvalue padj
Otc-60486 566.31317 25.6076609 2.63792993 9.70748334 2.80E-22 4.07E-19
Obviously, the zeros are having a big effect since if I replace the zeros for 1 this log2FC is reduced to 1,5 and the Pvalue is no significant. Am I missing something? Please see R season below:
> dir<- "/Users/loliveira/RSEM/OTC"
> samples <- read.table(file.path(dir,"OTC.txt"), header=TRUE)
> samples
sample treatment rep run
1 SG4 Fed_Inf_4yrs a OTC-SG004.genes.results
2 SG5 Fed_Inf_4yrs a OTC-Sg005.genes.results
3 SG6 Fed_Inf_4yrs a OTC-Sg006.genes.results
4 SG10 CTRL b OTC-Sg010.genes.results
5 SG11 CTRL b OTC-Sg011.genes.results
6 SG12 CTRL b OTC-Sg012.genes.results
7 SG16 Fed_Inf_1wk c OTC-Sg016.genes.results
8 SG17 Fed_Inf_1wk c OTC-Sg017.genes.results
9 SG18 Fed_Inf_1wk c OTC-Sg018.genes.results
10 SG22 Fed_Uni_1wk d OTC-Sg022.genes.results
11 SG23 Fed_Uni_1wk d OTC-Sg023.genes.results
12 SG24 Fed_Uni_1wk d OTC-Sg024.genes.results
13 SG28 Fed_Inf_2bm_1wk e OTC-Sg028.genes.results
14 SG29 Fed_Inf_2bm_1wk e OTC-Sg029.genes.results
15 SG30 Fed_Inf_2bm_1wk e OTC-Sg030.genes.results
16 SG34 Fed_Inf_1month f OTC-Sg034.genes.results
17 SG35 Fed_Inf_1month f OTC-Sg035.genes.results
18 SG36 Fed_Inf_1month f OTC-Sg036.genes.results
>
> files <- file.path(dir,samples$run)
> names(files) <- samples$sample
>
> files
SG4 SG5 SG6
"/Users/loliveira/RSEM/OTC/OTC-SG004.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg005.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg006.genes.results"
SG10 SG11 SG12
"/Users/loliveira/RSEM/OTC/OTC-Sg010.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg011.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg012.genes.results"
SG16 SG17 SG18
"/Users/loliveira/RSEM/OTC/OTC-Sg016.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg017.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg018.genes.results"
SG22 SG23 SG24
"/Users/loliveira/RSEM/OTC/OTC-Sg022.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg023.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg024.genes.results"
SG28 SG29 SG30
"/Users/loliveira/RSEM/OTC/OTC-Sg028.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg029.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg030.genes.results"
SG34 SG35 SG36
"/Users/loliveira/RSEM/OTC/OTC-Sg034.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg035.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg036.genes.results"
> library(tximport)
> library(readr)
> txi <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> names(txi)
[1] "abundance" "counts" "length" "countsFromAbundance"
>
> head(txi$counts)
SG4 SG5 SG6 SG10 SG11 SG12 SG16 SG17 SG18 SG22 SG23 SG24 SG28 SG29 SG30 SG34 SG35 SG36
Otc-10005 119 140 70 217 33 76.00 153 197 76 47 35 48 60 68 67 81.00 63 158
Otc-10006 12 6 7 68 4 5.00 7 8 4 13 2 1 3 6 4 7.00 5 16
Otc-10009 2 10 16 25 4 9.01 5 5 2 11 10 10 7 10 7 7.01 10 10
Otc-10027 145 179 164 427 58 79.00 69 72 109 62 63 61 88 101 79 83.00 90 81
Otc-10042 16 30 26 71 15 16.00 19 19 24 27 20 12 11 11 16 16.00 20 26
Otc-10050 17 16 11 10 5 9.00 6 9 8 6 2 5 4 4 2 4.00 12 2
>
> library(DESeq2)
> dds <- DESeqDataSetFromTximport(txi,colData = samples, design = ~treatment)
using counts and average transcript lengths from tximport
> dds
class: DESeqDataSet
dim: 10984 18
metadata(1): version
assays(2): counts avgTxLength
rownames(10984): Otc-10005 Otc-10006 ... OtcSigP-9981 OtcSigP-9986
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run
>
> keep <- rowSums(counts(dds)) >= 10
> dds <- dds[keep,]
> dds
class: DESeqDataSet
dim: 8376 18
metadata(1): version
assays(2): counts avgTxLength
rownames(8376): Otc-10005 Otc-10006 ... OtcSigP-9936 OtcSigP-9981
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run
> dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds)
> res
log2 fold change (MLE): treatment Fed Uni 1wk vs CTRL
Wald test p-value: treatment Fed Uni 1wk vs CTRL
DataFrame with 8376 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Otc-10005 90.829530 -0.05694356 0.3738994 -0.1522965 0.8789531 0.9862238
Otc-10006 7.495166 -0.56269995 0.7472859 -0.7529915 0.4514550 0.8895800
Otc-10009 8.529125 0.96111647 0.5616601 1.7112065 0.0870430 0.5468635
Otc-10027 97.147211 -0.16809618 0.2412074 -0.6968949 0.4858686 0.9022474
Otc-10042 20.431021 0.49162397 0.3751679 1.3104104 0.1900570 0.7129306
... ... ... ... ... ... ...
OtcSigP-9829 1.034385 -2.69934118 1.7906619 -1.50745441 0.13169422 NA
OtcSigP-9877 14.310269 -1.16920823 0.6066144 -1.92743229 0.05392578 0.4518551
OtcSigP-9904 59.446744 0.43915855 0.2894595 1.51716731 0.12922445 0.6367444
OtcSigP-9936 9.917122 0.01944894 0.6563387 0.02963248 0.97636016 0.9961168
OtcSigP-9981 12.489498 0.94278120 0.5968313 1.57964433 0.11418834 0.6097460
>
> summary(res)
out of 8376 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 167, 2%
LFC < 0 (down) : 90, 1.1%
outliers [1] : 0, 0%
low counts [2] : 975, 12%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
>
> dds$group <- factor(paste0(dds$rep, dds$treatment))
> design(dds) <- ~ group
> resultsNames(dds)
[1] "Intercept" "treatment_Fed_Inf_1month_vs_CTRL" "treatment_Fed_Inf_1wk_vs_CTRL" "treatment_Fed_Inf_2bm_1wk_vs_CTRL"
[5] "treatment_Fed_Inf_4yrs_vs_CTRL" "treatment_Fed_Uni_1wk_vs_CTRL"
> dds <- DESeq(dds)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> resultsNames(dds)
[1] "Intercept" "group_bCTRL_vs_aFed_Inf_4yrs" "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs" "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs"
>
> library("vsn")
Warning message:
package ‘vsn’ was built under R version 3.4.2
> rld <- normTransform(dds)
>
> meanSdPlot(assay(rld))
Warning message:
package ‘hexbin’ was built under R version 3.4.3
>
> plotPCA(rld, intgroup=c("treatment"))
> result = results(dds, contrast=c("group", "cFed_Inf_1wk", "bCTRL"))
> write.table(result, file='Fed_Inf_1wkvsCTRL.txt',sep='\t',quote=FALSE)
Thanks in advance for the feedback
Dear Michael,
Thanks for your reply, I need to do multiple comparing the conditions, but when I try the coef= on the lfcShrink it restrict me to the resultsNames(dds) as below, Since I need more iterations I am considering going for the betaPrior=TRUE.
What would you indicate in this case? How can I work all iterations of the results with lfcShrink using the coef variable? if this was answered before, please point me to the forum post - I couldn't find anything relevant.
Very best and thanks again for the support
resultsNames(dds)
[1] "Intercept" "group_bCTRL_vs_aFed_Inf_4yrs" "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs" "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs"