DeSeq2 - adjusted p-value changes substantially after including non-coding RNA [SOLVED]
3
0
Entering edit mode
@david-eccles-gringer-7488
Last seen 4.8 years ago
New Zealand

We've got a gene that was previously identified as differentially expressed, which has dropped off the list after a re-mapping of the sequences to a different transcriptome (incorporating ncRNA as well as cDNA), and excluding a filter that was dropping out reads mapping to multiple transcripts. Here are the previous counts and statistics, comparing WT vs ρ0 cell lines:

ρ0  ρ0  ρ0  SC  SC  SC  SC SCL SCL SCL SCL SCL  WT  WT  WT  WT
  0  0   0   0   0   0   3   0   0   0   0   0  27  68  31  63

 L2FC FCSE      pval      padj
 7.62 1.34  1.91E-09  2.98E-07

After the re-analysis, the counts changed slightly, but the DE statistics for WT vs ρ0 changed a lot, to the degree that this gene has now been kicked out:

ρ0  ρ0  ρ0  SC  SC  SC  SC SC SCL SCL SCL SCL SCL  WT  WT  WT  WT
 0   0   0   0   0   0   0  4   0   0   0   0   0  29  69  31  64

 L2FC FCSE      pval      padj
 0.11 0.69     0.071     0.395

Any ideas why this has happened?

Just in case it is useful, the number of tested genes changed from about 15,000 to about 70,000. Here's the dispersion plot: dispersion plot for ~75k genes

Run script here.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bullseye/sid

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_NZ.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_NZ.UTF-8           LC_COLLATE=en_NZ.UTF-8       
 [5] LC_MONETARY=en_NZ.UTF-8       LC_MESSAGES=en_NZ.UTF-8      
 [7] LC_PAPER=en_NZ.UTF-8          LC_NAME=en_NZ.UTF-8          
 [9] LC_ADDRESS=en_NZ.UTF-8        LC_TELEPHONE=en_NZ.UTF-8     
[11] LC_MEASUREMENT=en_NZ.UTF-8    LC_IDENTIFICATION=en_NZ.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] xlsx_0.6.1                  DESeq2_1.26.0              
 [3] SummarizedExperiment_1.16.1 DelayedArray_0.12.1        
 [5] BiocParallel_1.20.1         matrixStats_0.55.0         
 [7] Biobase_2.46.0              GenomicRanges_1.38.0       
 [9] GenomeInfoDb_1.22.0         IRanges_2.20.1             
[11] S4Vectors_0.24.1            BiocGenerics_0.32.0        
[13] magrittr_1.5                forcats_0.4.0              
[15] stringr_1.4.0               dplyr_0.8.3                
[17] purrr_0.3.3                 readr_1.3.1                
[19] tidyr_1.0.0                 tibble_2.1.3               
[21] ggplot2_3.2.1               tidyverse_1.3.0            

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1       htmlTable_1.13.3       XVector_0.26.0        
 [4] base64enc_0.1-3        fs_1.3.1               rstudioapi_0.10       
 [7] bit64_0.9-7            AnnotationDbi_1.48.0   fansi_0.4.0           
[10] lubridate_1.7.4        xml2_1.2.2             codetools_0.2-16      
[13] splines_3.6.1          pscl_1.5.2             doParallel_1.0.15     
[16] geneplotter_1.64.0     knitr_1.26             zeallot_0.1.0         
[19] Formula_1.2-3          jsonlite_1.6           rJava_0.9-11          
[22] broom_0.5.3            annotate_1.64.0        cluster_2.1.0         
[25] ashr_2.2-39            dbplyr_1.4.2           png_0.1-7             
[28] compiler_3.6.1         httr_1.4.1             backports_1.1.5       
[31] assertthat_0.2.1       Matrix_1.2-18          lazyeval_0.2.2        
[34] cli_2.0.0              acepack_1.4.1          htmltools_0.4.0       
[37] tools_3.6.1            gtable_0.3.0           glue_1.3.1            
[40] GenomeInfoDbData_1.2.2 Rcpp_1.0.3             cellranger_1.1.0      
[43] vctrs_0.2.1            nlme_3.1-142           iterators_1.0.12      
[46] xfun_0.11              xlsxjars_0.6.1         rvest_0.3.5           
[49] lifecycle_0.1.0        XML_3.98-1.20          MASS_7.3-51.4         
[52] zlibbioc_1.32.0        scales_1.1.0           hms_0.5.2             
[55] RColorBrewer_1.1-2     memoise_1.1.0          gridExtra_2.3         
[58] rpart_4.1-15           latticeExtra_0.6-29    stringi_1.4.3         
[61] RSQLite_2.1.5          SQUAREM_2017.10-1      genefilter_1.68.0     
[64] foreach_1.4.7          checkmate_1.9.4        truncnorm_1.0-8       
[67] rlang_0.4.2            pkgconfig_2.0.3        bitops_1.0-6          
[70] lattice_0.20-38        htmlwidgets_1.5.1      bit_1.1-14            
[73] tidyselect_0.2.5       R6_2.4.1               generics_0.0.2        
[76] Hmisc_4.3-0            DBI_1.1.0              pillar_1.4.3          
[79] haven_2.2.0            foreign_0.8-72         withr_2.1.2           
[82] mixsqp_0.2-2           survival_3.1-7         RCurl_1.95-4.12       
[85] nnet_7.3-12            modelr_0.1.5           crayon_1.3.4          
[88] utf8_1.1.4             jpeg_0.1-8.1           locfit_1.5-9.1        
[91] grid_3.6.1             readxl_1.3.1           data.table_1.12.8     
[94] blob_1.2.0             reprex_0.3.0           digest_0.6.23         
[97] xtable_1.8-4           munsell_0.5.0

PlotCounts

These looked as I expected... here is the one without library as a factor:

plotCounts - no Library statistical factor

and with library as a factor; it's identical (ignoring horizontal jitter):

plotCounts - with Library statistical factor

Data Set Size

The non-coding RNA brought in a lot of transcripts that mapped to the reverse strand of non-coding genes. I experimented with what happens when I excluded these reverse-mapped transcripts (reducing the dataset down to about 30,000 genes, and there was no substantial change to the statistics (adjusted p value changed from 0.39 to 0.37):

 L2FC FCSE      pval      padj
 0.17 0.86     0.073     0.366

Factors in the Statistical Model

I've noticed that if I exclude the sequencing library as a factor in the statistical model, then the differential expression comes back:

 L2FC FCSE      pval      padj
 7.77 1.25  6.75E-11  7.19E-08

The library compositions were as follows:

1: WT, ρ0
2: WT, ρ0, SCL
3: WT, WT, ρ0, SCL, SCL, SCL
4: SC, SC, SC [+ 2 other samples from a separate experiment]
5: SC, SC, SCL [+ 1 other sample from a separate experiment]

So... maybe I shouldn't use the library as a factor. Mike love suggests that I should drop library as a factor given the experimental design, so I'll do that.

deseq2 • 1.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 3 days ago
United States

Check plotCounts to make sure it looks as you expect.

ADD COMMENT
1
Entering edit mode

I can't tell exactly from the R script what the design was because it's created programmatically, but my guess is that the additional factor(s) are not helping and this seems to be confirmed by your saying that when you drop these the significance is recovered. If you add factors that don't help explain variation in the counts, you can end up limiting your degrees of freedom without advantage. If you're getting down to only a few residual degrees of freedom, there's not much information for estimating dispersion. If you have 4 groups, but then you add other coefficients beyond the four for each group, you're entering the territory of few degrees of freedom.

ADD REPLY
0
Entering edit mode

Okay, this makes some sense, except that the two lines I'm testing are WT and ρ0, and those lines seem (to me) to be fairly well mixed in the first three libraries.

Are there other tests I can do to confirm that this is happening? If this gene comes up as significant with library 3 vs library 4 (or any other library vs library comparison), would that stop it from being significant for WT vs ρ0?

ADD REPLY
1
Entering edit mode

While WT and ρ0 are mixed across 1-3, you're also asking DESeq2 to estimate other coefficients, including SC vs WT, SCL vs WT, 4 vs 1, and 5 vs 1. There will be linear dependencies (or nearly so) among this set of coefficients, and so generally it's recommended in linear modeling to avoid adding partially confounded or nearly colinear coefficients.

You can detect colinearity by forming the design matrix and computing cor. And you'd be concerned about covariates with strong positive or negative correlations.

ADD REPLY
2
Entering edit mode

Sorry just to add to this - cor will just tell you part of the story because it won't show near linear dependence of combinations of columns of the design matrix.

Here SC vs WT is nearly equal to 4 vs 1 + 5 vs 1.

If you wanted to detect this programmatically without looking at individual design matrices (although I recommend looking at the design matrices to eyeball colinearity), you would want to look at the eigenvalues of X' X, and see if any eigenvalues are near 0 (meaning that you are approaching insufficient rank). Plugging your design matrix into R you have one of the last eigenvalues close to 0 which points out the near linear dependence I describe above.

ADD REPLY
0
Entering edit mode

Thanks for the idea. plotCounts looks fine.

ADD REPLY
0
Entering edit mode

What does it mean “include library as a factor”?

ADD REPLY
0
Entering edit mode

design = ~ Line + Library
vs
design = ~ Line

The SC cell line was only run together with SCL (i.e. it was never run together with WT), and perhaps that has caused a problem because SC and SCL have fairly similar gene counts overall. Here is the library composition:

1: WT, ρ0
2: WT, ρ0, SCL
3: WT, WT, ρ0, SCL, SCL, SCL
4: SC, SC, SC [+ 2 other samples from a separate experiment]
5: SC, SC, SCL [+ 1 other sample from a separate experiment]
ADD REPLY
1
Entering edit mode

Ah so not only was library reducing degrees of freedom but the 4 additional coefficients from adding library have correlation with your comparison of interest. The colinearity of those coefficients is the culprit. I would recommend to drop library.

ADD REPLY
0
Entering edit mode

I am not sure I follow completely; does "Library" mean only library preparation, or also different sequencing runs, i.e. chips? If so, how long was the time between the different runs?

In any case, isn't Michaels proposition the same as saying "disregard your batch correction"? How is that acceptable?

Since you are checking only WT against ρ0 at the moment, why not try and use "libraries" 1 through 3, with the correction? At least as a sanity check...

Sorry if I am getting something completely wrong here.

ADD REPLY
0
Entering edit mode
@sebastianlobentanzer-11790
Last seen 3.5 years ago
Germany

I think they way DESeq2 handles zero counts might provide an answer. As Michael stated in response to this question: Your LFC estimate "depends on the dispersion for this gene, how large the counts are for B [the group that shows expression, i.e. ρ0], and the distribution of log fold changes for other genes which had finite MLE LFCs."

So your sequencing depth and the LFC distribution across your other genes both have an impact on estimation of this value. Since you substantially altered your test set (almost 5 times as many genes), I think the most logical explanation is there.

ADD COMMENT
0
Entering edit mode

I don't think that's it (or at least, it's not a complete explanation), given that the removal of the sequencing library as a factor has such a big effect on fold change.

edit: I reduced the input data set size, and it didn't make a substantial difference to the shrunk L2FC or adjusted p-value.

ADD REPLY

Login before adding your answer.

Traffic: 870 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6