Hi
I am comparing two different types of sequencing method for RNAseq. I am using the same set of samples (3 control, 3 treated) to prepare library using illumina mRNA kit and lexogen 3 prime library kit. The idea is to compare both library preparation method with the aim of using the three prime one for future experiments. we were hoping lexogen will generate roughly similar numbers of DGE, but contrary to our expectation three prime generated only 60% of the DGE identified in the illumina library. I have identified the genes that were not called significant by DEseq2 from lexogen dataset. How can I find why the genes are not called DEs? Any suggesion is much appreciated. The basic code is as follows.
Many thanks Nurun
dds <- DESeqDataSetFromMatrix(countData = Standard,
colData = SampleData.standard,
design = ~Treatment+Subject)
dds <- estimateSizeFactors(dds)
vsd <- vst(dds)
plotPCA(vsd, intgroup = "Treatment")
dds <- DESeq(dds)
res <- results(dds, lfcThreshold = 2, alpha = 0.05, contrast=c("Treatment","LPS","Control"))
summary(res)
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.0.1
[4] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
[7] tibble_2.1.1 tidyverse_1.2.1 ggforce_0.2.2
[10] VennDiagram_1.6.20 futile.logger_1.4.3 org.Hs.eg.db_3.8.2
[13] AnnotationDbi_1.46.0 DESeq2_1.24.0 SummarizedExperiment_1.14.0
[16] DelayedArray_0.10.0 BiocParallel_1.18.0 matrixStats_0.54.0
[19] Biobase_2.44.0 GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
[22] IRanges_2.18.0 S4Vectors_0.22.0 BiocGenerics_0.30.0
[25] reshape2_1.4.3 ggplot2_3.1.1
loaded via a namespace (and not attached):
[1] nlme_3.1-140 bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7
[5] httr_1.4.0 RColorBrewer_1.1-2 tools_3.6.1 backports_1.1.4
[9] R6_2.4.0 rpart_4.1-15 Hmisc_4.2-0 DBI_1.0.0
[13] lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12 withr_2.1.2
[17] tidyselect_0.2.5 gridExtra_2.3 bit_1.1-14 compiler_3.6.1
[21] cli_1.1.0 rvest_0.3.4 formatR_1.6 htmlTable_1.13.1
[25] xml2_1.2.0 scales_1.0.0 checkmate_1.9.3 genefilter_1.66.0
[29] digest_0.6.18 foreign_0.8-71 XVector_0.24.0 base64enc_0.1-3
[33] pkgconfig_2.0.2 htmltools_0.3.6 readxl_1.3.1 htmlwidgets_1.3
[37] rlang_0.3.4 rstudioapi_0.10 RSQLite_2.1.1 farver_1.1.0
[41] generics_0.0.2 jsonlite_1.6 acepack_1.4.1 RCurl_1.95-4.12
[45] magrittr_1.5 GenomeInfoDbData_1.2.1 Formula_1.2-3 Matrix_1.2-17
[49] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3 MASS_7.3-51.4
[53] zlibbioc_1.30.0 plyr_1.8.4 blob_1.1.1 crayon_1.3.4
[57] lattice_0.20-38 haven_2.1.0 splines_3.6.1 annotate_1.62.0
[61] hms_0.4.2 locfit_1.5-9.1 knitr_1.22 pillar_1.4.0
[65] geneplotter_1.62.0 futile.options_1.0.1 XML_3.98-1.19 glue_1.3.1
[69] latticeExtra_0.6-28 modelr_0.1.4 lambda.r_1.2.3 data.table_1.12.2
[73] tweenr_1.0.1 cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0
[77] assertthat_0.2.1 xfun_0.6 xtable_1.8-4 broom_0.5.2
[81] survival_2.44-1.1 memoise_1.1.0 cluster_2.1.0
Hi Steve,
Thank you for taking the time to respond with such detailed answer.
You are right in assuming that most of the lower count genes are not getting called significant. 1. I did analyse them completely separately to avoid wrong estimation of dispersion. I will look into
plotDispEsts
. 2. Did not try MA plot, will do that soon. 3. PCA for both datasets looks identical, clearly separating control and treated samples. 4 Will do that.The idea of this experiment was to check how much the quantseq can replicate from illumina samples, to decide whether it would be worth sequencing using quantseq which will be much cheaper. Also, this is our second trial, in the first trial we did ended up with deeper sequencing (~40 million per library from both illumina and quantseq but it yielded even poorer result.). As quantseq suggest 5-10 million reads from quantseq library , this time we have ~20 million, which seems to yield better result.
I tried playing with the parameters to check whether that increases the number of DGEs. Which was not completely successful.
Thanks Nurun
Hi Steve,
I was also wondering if I can do something computationally to increase the DGEs number so that I have at least 85-90% overlapping? Any ideas?
Also, some of the genes not called DGEs were of moderate expression. If you want to see any particular plot or data, please let me know and I can attach them here.
I think you're focussed on the wrong question. There won't be some computational trick you can do to increase the overlap, and even if there was (this time), is that something you really want to do everywhere going forward?
If I were you, I'd focus on understanding what is driving the differences you see between the downstream results -- or even if these differences are something you really care about.
There are many many things to try to start to understand that, and I've outlined some of those. Once you feel like you have a handle on why things are different, you would be in a better place to propose solutions.
For instance, if everything is essentially the same between the two kits except the quantseq kit has lower power to detect DGE based on the sequencing depth, you might advise your lab colleagues to provide more RNA (more input) and sequence deeper so you can get more counts ... for instance.
Or, with all that money you are saving using the quantseq kits vs the other Ilumina kits, you might ask to include a few more replicates per group to increase your power.
While I'm on my soapbox, some of the "solutions" I propose above should really be answered by a more complex experimental design than the 3 vs 3 you have. You can imagine a much larger (still two-group) experiment you can design that has more repliclates across a range of RNA input amounts.
This would allow you to analyze the data in such a way that can more definitively inform colleagues about what range of input RNA and replicates are required in the quantseq world that will match match the sensitivity/specificity of some "standard" illumina experiment you folks are already running.
Know what I mean?
Yes, makes perfect sense, thank you. :)
Hi Steve
Thank you very much for your thought and input. I just remember, previously we have done the same experiment by sequencing the quantseq library at higher depth (~40 million) same as illumina standard but we did not see an improvement in DGEs. We only saw an increased percentage of read duplication (55-60%). I am going to check all the other parameters you mentioned and also the RNA quantity as a starting material for library preparation.
No problem ... also, I don't think you mentioned it yet, but you should find out if you're using the UMI mojo with the quantseq kids during the library prep, and the are subsequently taking advantage of an NGS pipeline that leverages those to properly remove PCR duplicate reads before gene quantitation. If you're not using them, you should be pushing hard on your lab-based colleagues to do so.