Large fold changes after deseq2 when one condition has two zeros and small number the other condition has two zeros and a large number
1
0
Entering edit mode
loliveira • 0
@loliveira-17518
Last seen 6.2 years ago

I am working with 18 RNAseq datasets of 6 conditions in triplicates. I am importing the data from RSEM using TXimport and proceeding with  deseq2 - My question is that looking at the results for a preliminary comparison I see that I have high Log2fc (~20) and significant Padjust, however when I go to the counts from the original RSEM file I see the counts for these specific gene (OTC-60486)  as  CTRL 22.06 , 0 , 0 and INF 0, 1190.1, 0. I would expect to have a high P adjust. However this is the result of the comparison 

Gene id      baseMean     log2FoldChange lfcSE            stat              pvalue                 padj

Otc-60486  566.31317   25.6076609      2.63792993     9.70748334     2.80E-22         4.07E-19

Obviously, the zeros are having a big effect  since if I replace the zeros for 1 this log2FC is reduced to 1,5 and the Pvalue is no significant. Am I missing something? Please see R season below:

> dir<- "/Users/loliveira/RSEM/OTC"

>  samples <- read.table(file.path(dir,"OTC.txt"), header=TRUE)
>  samples
   sample       treatment rep                     run
1     SG4    Fed_Inf_4yrs   a OTC-SG004.genes.results
2     SG5    Fed_Inf_4yrs   a OTC-Sg005.genes.results
3     SG6    Fed_Inf_4yrs   a OTC-Sg006.genes.results
4    SG10            CTRL   b OTC-Sg010.genes.results
5    SG11            CTRL   b OTC-Sg011.genes.results
6    SG12            CTRL   b OTC-Sg012.genes.results
7    SG16     Fed_Inf_1wk   c OTC-Sg016.genes.results
8    SG17     Fed_Inf_1wk   c OTC-Sg017.genes.results
9    SG18     Fed_Inf_1wk   c OTC-Sg018.genes.results
10   SG22     Fed_Uni_1wk   d OTC-Sg022.genes.results
11   SG23     Fed_Uni_1wk   d OTC-Sg023.genes.results
12   SG24     Fed_Uni_1wk   d OTC-Sg024.genes.results
13   SG28 Fed_Inf_2bm_1wk   e OTC-Sg028.genes.results
14   SG29 Fed_Inf_2bm_1wk   e OTC-Sg029.genes.results
15   SG30 Fed_Inf_2bm_1wk   e OTC-Sg030.genes.results
16   SG34  Fed_Inf_1month   f OTC-Sg034.genes.results
17   SG35  Fed_Inf_1month   f OTC-Sg035.genes.results
18   SG36  Fed_Inf_1month   f OTC-Sg036.genes.results

>  files <- file.path(dir,samples$run)
>  names(files) <- samples$sample
>  
>  files
                                                SG4                                                 SG5                                                 SG6 
"/Users/loliveira/RSEM/OTC/OTC-SG004.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg005.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg006.genes.results" 
                                               SG10                                                SG11                                                SG12 
"/Users/loliveira/RSEM/OTC/OTC-Sg010.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg011.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg012.genes.results" 
                                               SG16                                                SG17                                                SG18 
"/Users/loliveira/RSEM/OTC/OTC-Sg016.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg017.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg018.genes.results" 
                                               SG22                                                SG23                                                SG24 
"/Users/loliveira/RSEM/OTC/OTC-Sg022.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg023.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg024.genes.results" 
                                               SG28                                                SG29                                                SG30 
"/Users/loliveira/RSEM/OTC/OTC-Sg028.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg029.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg030.genes.results" 
                                               SG34                                                SG35                                                SG36 
"/Users/loliveira/RSEM/OTC/OTC-Sg034.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg035.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg036.genes.results" 
>  library(tximport)
>  library(readr)
>   txi <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
>  names(txi)
[1] "abundance"           "counts"              "length"              "countsFromAbundance"

>  head(txi$counts)
          SG4 SG5 SG6 SG10 SG11  SG12 SG16 SG17 SG18 SG22 SG23 SG24 SG28 SG29 SG30  SG34 SG35 SG36
Otc-10005 119 140  70  217   33 76.00  153  197   76   47   35   48   60   68   67 81.00   63  158
Otc-10006  12   6   7   68    4  5.00    7    8    4   13    2    1    3    6    4  7.00    5   16
Otc-10009   2  10  16   25    4  9.01    5    5    2   11   10   10    7   10    7  7.01   10   10
Otc-10027 145 179 164  427   58 79.00   69   72  109   62   63   61   88  101   79 83.00   90   81
Otc-10042  16  30  26   71   15 16.00   19   19   24   27   20   12   11   11   16 16.00   20   26
Otc-10050  17  16  11   10    5  9.00    6    9    8    6    2    5    4    4    2  4.00   12    2
>         
>  library(DESeq2)

>  dds <- DESeqDataSetFromTximport(txi,colData = samples, design = ~treatment)
using counts and average transcript lengths from tximport
>  dds
class: DESeqDataSet 
dim: 10984 18 
metadata(1): version
assays(2): counts avgTxLength
rownames(10984): Otc-10005 Otc-10006 ... OtcSigP-9981 OtcSigP-9986
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run

> keep <- rowSums(counts(dds)) >= 10
> dds <- dds[keep,]
> dds
class: DESeqDataSet 
dim: 8376 18 
metadata(1): version
assays(2): counts avgTxLength
rownames(8376): Otc-10005 Otc-10006 ... OtcSigP-9936 OtcSigP-9981
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run
>  dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>  res <- results(dds)
>  res
log2 fold change (MLE): treatment Fed Uni 1wk vs CTRL 
Wald test p-value: treatment Fed Uni 1wk vs CTRL 
DataFrame with 8376 rows and 6 columns
              baseMean log2FoldChange     lfcSE        stat     pvalue      padj
             <numeric>      <numeric> <numeric>   <numeric>  <numeric> <numeric>
Otc-10005    90.829530    -0.05694356 0.3738994  -0.1522965  0.8789531 0.9862238
Otc-10006     7.495166    -0.56269995 0.7472859  -0.7529915  0.4514550 0.8895800
Otc-10009     8.529125     0.96111647 0.5616601   1.7112065  0.0870430 0.5468635
Otc-10027    97.147211    -0.16809618 0.2412074  -0.6968949  0.4858686 0.9022474
Otc-10042    20.431021     0.49162397 0.3751679   1.3104104  0.1900570 0.7129306
...                ...            ...       ...         ...        ...       ...
OtcSigP-9829  1.034385    -2.69934118 1.7906619 -1.50745441 0.13169422        NA
OtcSigP-9877 14.310269    -1.16920823 0.6066144 -1.92743229 0.05392578 0.4518551
OtcSigP-9904 59.446744     0.43915855 0.2894595  1.51716731 0.12922445 0.6367444
OtcSigP-9936  9.917122     0.01944894 0.6563387  0.02963248 0.97636016 0.9961168
OtcSigP-9981 12.489498     0.94278120 0.5968313  1.57964433 0.11418834 0.6097460

>  summary(res)

out of 8376 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 167, 2% 
LFC < 0 (down)   : 90, 1.1% 
outliers [1]     : 0, 0% 
low counts [2]   : 975, 12% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results


>  dds$group <- factor(paste0(dds$rep, dds$treatment))
>  design(dds) <- ~ group
>  resultsNames(dds)
[1] "Intercept"                         "treatment_Fed_Inf_1month_vs_CTRL"  "treatment_Fed_Inf_1wk_vs_CTRL"     "treatment_Fed_Inf_2bm_1wk_vs_CTRL"
[5] "treatment_Fed_Inf_4yrs_vs_CTRL"    "treatment_Fed_Uni_1wk_vs_CTRL"    
>  dds <- DESeq(dds)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>  resultsNames(dds)
[1] "Intercept"                               "group_bCTRL_vs_aFed_Inf_4yrs"            "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"    
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs"     "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs" 

>   library("vsn")
Warning message:
package ‘vsn’ was built under R version 3.4.2 
>  rld <- normTransform(dds)

> meanSdPlot(assay(rld))
Warning message:
package ‘hexbin’ was built under R version 3.4.3 

>  plotPCA(rld, intgroup=c("treatment"))
>  result = results(dds, contrast=c("group", "cFed_Inf_1wk", "bCTRL"))
>  write.table(result, file='Fed_Inf_1wkvsCTRL.txt',sep='\t',quote=FALSE)

Thanks in advance for the feedback 

 

rnaseq rsem deseq2 • 2.7k views
ADD COMMENT
0
Entering edit mode

 

Dear Michael, 

Thanks for your reply, I need to do multiple comparing the  conditions, but when I try the coef= on the lfcShrink it restrict me to the resultsNames(dds) as below, Since I need more iterations I am considering going for the betaPrior=TRUE.

What would you indicate in this case? How can I work all iterations of the results with lfcShrink using the coef variable? if this was answered before, please point me to the forum post - I couldn't find anything relevant. 

Very best and thanks again for the support 

resultsNames(dds)

[1] "Intercept"                               "group_bCTRL_vs_aFed_Inf_4yrs"            "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"    
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs"     "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs" 

 

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Why aren’t you using lfcShrink to produce posterior effect sizes? This is the recommended workflow, see the vignette or the workflow. You might also want to try lfcThreshold within lfcShrink to specify a minimum effect size.

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks again for the support, I did add the lfcShrink to my datasets and relevel for each combination of variables. The results made much more sense now(Thank you). However, I still have several genes where  (ex. below from-Ctrl vs Infected 1wk) several genes are zeros on the CTRL and I have only one replicate as a high positive value. In this case, The LOG2FC reduced from ~25 to ~10, but I still think it is an "artifact". I keep asking myself why are the P adjust values still below <0.05 for this comparisons? Is there a way to recalculate if making it more stringent? I did the IHW but I still get a significant value... I am a bit lost here ...

Also, I couldn't run the lfcThreshold with the apeglm, but I don't think the discrepancy is coming from the FC, but from the P adjust values calculation.

# lfcThreshold error

resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm",lfcThreshold= .1)
using 'apeglm' for LFC shrinkage
Error in apeglm::apeglm(Y = Y, x = design, log.lik = apeglm::logLikNB,  : 
  unused argument (lfcThreshold = 0.1)

#Examples with original and shrink FC. 

Counts from RSEM 3 replicates

otc-60486  0,0,0 and 0, 1190.1, 0

Original results with IHW

                       baseMean    log2FoldChange    lfcSE        stat        pvalue        padj    `    weight

Otc-60486    566.31317    24.7938956        2.63770541    9.39979708    5.47E-21    2.70E-18    1.69356631

> dds$group <- relevel(dds$group, ref = "bCTRL")
> dds <- DESeq(dds)

>resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm")

write.table(resLFC, file='group_cFed_Inf_1wk_vs_bCTRL.txt',sep='\t',quote=FALSE)

# Results with lfcShrink

baseMean    log2FoldChange    lfcSE              pvalue        padj    `   

566.31317     10.9832851      4.10034068     1.40E-19     2.02E-16

ADD REPLY
0
Entering edit mode

I'd guess these have small dispersion estimates because of the count distributions of the other conditions. So there is a very large count and information from other genes saying the the count is likely not due to high dispersion of the gene, and then you have a very large difference in means (0 vs ~400) which is not likely equal to 0 given the estimates from the data.

lfcShrink() doesn't replace the p-value column, unless you specify lfcThreshold or svalue=TRUE. This is available in the latest release (v1.20), but not previously.

Another option would be to filter per comparison to remove the genes where there are not, e.g. 3 samples with a count of 5 or higher:

idx <- dds$condition %in% c("A","B") # levels of two groups of interest:
keep <- rowSums(counts(dds)[,idx] >= 5) >= 3
res <- res[keep,]
ADD REPLY
0
Entering edit mode

Hi Mike, 

Thanks for lingering on, 

I updated my DESeq2 to the latest version, and did exactly the same run as before, (as below) The first difference I noted was that now I see the citation to the manuscript (Nice paper!), the second is that the results for the lfcShrink changed dramatically. Now that problematic genes all have an irrelevant log2FoldChange  

XX

  baseMean log2FoldChange lfcSE pvalue padj
Otc-60486 566.31317 0.04097335 0.27390458 1.40E-19 2.02E-16

Also now I can run the lfcThreshold, That replaces the P values for the s values. I am unsure how to define the lfcThreshold value that make sense to my dataset. I will read a bit more to see if I find more information. 

​Thanks for the support 

> resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    bioRxiv. https://doi.org/10.1101/303255
> resLFC
log2 fold change (MAP): group cFed Inf 1wk vs bCTRL 
Wald test p-value: group cFed Inf 1wk vs bCTRL 
DataFrame with 8376 rows and 5 columns
                     baseMean       log2FoldChange             lfcSE               pvalue                padj
                    <numeric>            <numeric>         <numeric>            <numeric>           <numeric>
Otc-10005    90.8295300215003     1.21921880536886 0.384682618069075 8.44358364780871e-05 0.00379609448166066
Otc-10006    7.49516608368584  -0.0750104915728319 0.265213683522076    0.413564208091076                  NA
Otc-10009    8.52912466546036   -0.108342087836708 0.273789519999909    0.298256656398482     0.7309913282316
Otc-10027    97.1472108432723 -0.00711785369618373  0.17665262570643    0.952230663441284   0.991549155246969
Otc-10042    20.4310212153511    0.110932202547201 0.236218265690278     0.41208242297299    0.79934729777592
...                       ...                  ...               ...                  ...                 ...
OtcSigP-9829 1.03438505866019   -0.034002243697861 0.268577778110672    0.441778223544063                  NA
OtcSigP-9877 14.3102685292295   -0.277025457033635 0.409757551028378   0.0568212737905355   0.358965529484449
OtcSigP-9904 59.4467435346105    0.279170607879664  0.25741738666105   0.0884109150084793   0.448447832903963
OtcSigP-9936 9.91712247256658   -0.119224918545589 0.283109314021552    0.235066258566969    0.66878442450563
OtcSigP-9981 12.4894977143121    0.719314822405624 0.780726333153449   0.0127635007879737   0.153875054192443
> write.table(resLFC, file='group_cFed_Inf_1wk_vs_bCTRL.txt',sep='\t',quote=FALSE)

 

ADD REPLY

Login before adding your answer.

Traffic: 991 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6