We've got a gene that was previously identified as differentially expressed, which has dropped off the list after a re-mapping of the sequences to a different transcriptome (incorporating ncRNA as well as cDNA), and excluding a filter that was dropping out reads mapping to multiple transcripts. Here are the previous counts and statistics, comparing WT vs ρ0 cell lines:
ρ0 ρ0 ρ0 SC SC SC SC SCL SCL SCL SCL SCL WT WT WT WT
0 0 0 0 0 0 3 0 0 0 0 0 27 68 31 63
L2FC FCSE pval padj
7.62 1.34 1.91E-09 2.98E-07
After the re-analysis, the counts changed slightly, but the DE statistics for WT vs ρ0 changed a lot, to the degree that this gene has now been kicked out:
ρ0 ρ0 ρ0 SC SC SC SC SC SCL SCL SCL SCL SCL WT WT WT WT
0 0 0 0 0 0 0 4 0 0 0 0 0 29 69 31 64
L2FC FCSE pval padj
0.11 0.69 0.071 0.395
Any ideas why this has happened?
Just in case it is useful, the number of tested genes changed from about 15,000 to about 70,000. Here's the dispersion plot:
Run script here.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bullseye/sid
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_NZ.UTF-8 LC_COLLATE=en_NZ.UTF-8
[5] LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8
[7] LC_PAPER=en_NZ.UTF-8 LC_NAME=en_NZ.UTF-8
[9] LC_ADDRESS=en_NZ.UTF-8 LC_TELEPHONE=en_NZ.UTF-8
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=en_NZ.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] xlsx_0.6.1 DESeq2_1.26.0
[3] SummarizedExperiment_1.16.1 DelayedArray_0.12.1
[5] BiocParallel_1.20.1 matrixStats_0.55.0
[7] Biobase_2.46.0 GenomicRanges_1.38.0
[9] GenomeInfoDb_1.22.0 IRanges_2.20.1
[11] S4Vectors_0.24.1 BiocGenerics_0.32.0
[13] magrittr_1.5 forcats_0.4.0
[15] stringr_1.4.0 dplyr_0.8.3
[17] purrr_0.3.3 readr_1.3.1
[19] tidyr_1.0.0 tibble_2.1.3
[21] ggplot2_3.2.1 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 htmlTable_1.13.3 XVector_0.26.0
[4] base64enc_0.1-3 fs_1.3.1 rstudioapi_0.10
[7] bit64_0.9-7 AnnotationDbi_1.48.0 fansi_0.4.0
[10] lubridate_1.7.4 xml2_1.2.2 codetools_0.2-16
[13] splines_3.6.1 pscl_1.5.2 doParallel_1.0.15
[16] geneplotter_1.64.0 knitr_1.26 zeallot_0.1.0
[19] Formula_1.2-3 jsonlite_1.6 rJava_0.9-11
[22] broom_0.5.3 annotate_1.64.0 cluster_2.1.0
[25] ashr_2.2-39 dbplyr_1.4.2 png_0.1-7
[28] compiler_3.6.1 httr_1.4.1 backports_1.1.5
[31] assertthat_0.2.1 Matrix_1.2-18 lazyeval_0.2.2
[34] cli_2.0.0 acepack_1.4.1 htmltools_0.4.0
[37] tools_3.6.1 gtable_0.3.0 glue_1.3.1
[40] GenomeInfoDbData_1.2.2 Rcpp_1.0.3 cellranger_1.1.0
[43] vctrs_0.2.1 nlme_3.1-142 iterators_1.0.12
[46] xfun_0.11 xlsxjars_0.6.1 rvest_0.3.5
[49] lifecycle_0.1.0 XML_3.98-1.20 MASS_7.3-51.4
[52] zlibbioc_1.32.0 scales_1.1.0 hms_0.5.2
[55] RColorBrewer_1.1-2 memoise_1.1.0 gridExtra_2.3
[58] rpart_4.1-15 latticeExtra_0.6-29 stringi_1.4.3
[61] RSQLite_2.1.5 SQUAREM_2017.10-1 genefilter_1.68.0
[64] foreach_1.4.7 checkmate_1.9.4 truncnorm_1.0-8
[67] rlang_0.4.2 pkgconfig_2.0.3 bitops_1.0-6
[70] lattice_0.20-38 htmlwidgets_1.5.1 bit_1.1-14
[73] tidyselect_0.2.5 R6_2.4.1 generics_0.0.2
[76] Hmisc_4.3-0 DBI_1.1.0 pillar_1.4.3
[79] haven_2.2.0 foreign_0.8-72 withr_2.1.2
[82] mixsqp_0.2-2 survival_3.1-7 RCurl_1.95-4.12
[85] nnet_7.3-12 modelr_0.1.5 crayon_1.3.4
[88] utf8_1.1.4 jpeg_0.1-8.1 locfit_1.5-9.1
[91] grid_3.6.1 readxl_1.3.1 data.table_1.12.8
[94] blob_1.2.0 reprex_0.3.0 digest_0.6.23
[97] xtable_1.8-4 munsell_0.5.0
PlotCounts
These looked as I expected... here is the one without library as a factor:
and with library as a factor; it's identical (ignoring horizontal jitter):
Data Set Size
The non-coding RNA brought in a lot of transcripts that mapped to the reverse strand of non-coding genes. I experimented with what happens when I excluded these reverse-mapped transcripts (reducing the dataset down to about 30,000 genes, and there was no substantial change to the statistics (adjusted p value changed from 0.39 to 0.37):
L2FC FCSE pval padj
0.17 0.86 0.073 0.366
Factors in the Statistical Model
I've noticed that if I exclude the sequencing library as a factor in the statistical model, then the differential expression comes back:
L2FC FCSE pval padj
7.77 1.25 6.75E-11 7.19E-08
The library compositions were as follows:
1: WT, ρ0
2: WT, ρ0, SCL
3: WT, WT, ρ0, SCL, SCL, SCL
4: SC, SC, SC [+ 2 other samples from a separate experiment]
5: SC, SC, SCL [+ 1 other sample from a separate experiment]
So... maybe I shouldn't use the library as a factor. Mike love suggests that I should drop library as a factor given the experimental design, so I'll do that.
I can't tell exactly from the R script what the design was because it's created programmatically, but my guess is that the additional factor(s) are not helping and this seems to be confirmed by your saying that when you drop these the significance is recovered. If you add factors that don't help explain variation in the counts, you can end up limiting your degrees of freedom without advantage. If you're getting down to only a few residual degrees of freedom, there's not much information for estimating dispersion. If you have 4 groups, but then you add other coefficients beyond the four for each group, you're entering the territory of few degrees of freedom.
Okay, this makes some sense, except that the two lines I'm testing are WT and ρ0, and those lines seem (to me) to be fairly well mixed in the first three libraries.
Are there other tests I can do to confirm that this is happening? If this gene comes up as significant with library 3 vs library 4 (or any other library vs library comparison), would that stop it from being significant for WT vs ρ0?
While WT and ρ0 are mixed across 1-3, you're also asking DESeq2 to estimate other coefficients, including SC vs WT, SCL vs WT, 4 vs 1, and 5 vs 1. There will be linear dependencies (or nearly so) among this set of coefficients, and so generally it's recommended in linear modeling to avoid adding partially confounded or nearly colinear coefficients.
You can detect colinearity by forming the design matrix and computing
cor
. And you'd be concerned about covariates with strong positive or negative correlations.Sorry just to add to this -
cor
will just tell you part of the story because it won't show near linear dependence of combinations of columns of the design matrix.Here SC vs WT is nearly equal to 4 vs 1 + 5 vs 1.
If you wanted to detect this programmatically without looking at individual design matrices (although I recommend looking at the design matrices to eyeball colinearity), you would want to look at the eigenvalues of X' X, and see if any eigenvalues are near 0 (meaning that you are approaching insufficient rank). Plugging your design matrix into R you have one of the last eigenvalues close to 0 which points out the near linear dependence I describe above.
Thanks for the idea. plotCounts looks fine.
What does it mean “include library as a factor”?
design = ~ Line + Library
vs
design = ~ Line
The SC cell line was only run together with SCL (i.e. it was never run together with WT), and perhaps that has caused a problem because SC and SCL have fairly similar gene counts overall. Here is the library composition:
Ah so not only was library reducing degrees of freedom but the 4 additional coefficients from adding
library
have correlation with your comparison of interest. The colinearity of those coefficients is the culprit. I would recommend to droplibrary
.I am not sure I follow completely; does "Library" mean only library preparation, or also different sequencing runs, i.e. chips? If so, how long was the time between the different runs?
In any case, isn't Michaels proposition the same as saying "disregard your batch correction"? How is that acceptable?
Since you are checking only WT against ρ0 at the moment, why not try and use "libraries" 1 through 3, with the correction? At least as a sanity check...
Sorry if I am getting something completely wrong here.