Question

Comparing two different sequencing method by DEseq2

0

Entering edit mode

n.naharfancy • 0

@nnaharfancy-20022

Last seen 4.4 years ago

Imperial College London

Hi

I am comparing two different types of sequencing method for RNAseq. I am using the same set of samples (3 control, 3 treated) to prepare library using illumina mRNA kit and lexogen 3 prime library kit. The idea is to compare both library preparation method with the aim of using the three prime one for future experiments. we were hoping lexogen will generate roughly similar numbers of DGE, but contrary to our expectation three prime generated only 60% of the DGE identified in the illumina library. I have identified the genes that were not called significant by DEseq2 from lexogen dataset. How can I find why the genes are not called DEs? Any suggesion is much appreciated. The basic code is as follows.

Many thanks Nurun

dds <- DESeqDataSetFromMatrix(countData = Standard,
                              colData = SampleData.standard,
                              design = ~Treatment+Subject)

dds <- estimateSizeFactors(dds)

vsd <- vst(dds)

plotPCA(vsd, intgroup = "Treatment")

dds <- DESeq(dds)

res <- results(dds, lfcThreshold = 2, alpha = 0.05, contrast=c("Treatment","LPS","Control"))

summary(res)

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0               stringr_1.4.0               dplyr_0.8.0.1              
 [4] purrr_0.3.2                 readr_1.3.1                 tidyr_0.8.3                
 [7] tibble_2.1.1                tidyverse_1.2.1             ggforce_0.2.2              
[10] VennDiagram_1.6.20          futile.logger_1.4.3         org.Hs.eg.db_3.8.2         
[13] AnnotationDbi_1.46.0        DESeq2_1.24.0               SummarizedExperiment_1.14.0
[16] DelayedArray_0.10.0         BiocParallel_1.18.0         matrixStats_0.54.0         
[19] Biobase_2.44.0              GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
[22] IRanges_2.18.0              S4Vectors_0.22.0            BiocGenerics_0.30.0        
[25] reshape2_1.4.3              ggplot2_3.1.1              

loaded via a namespace (and not attached):
 [1] nlme_3.1-140           bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7           
 [5] httr_1.4.0             RColorBrewer_1.1-2     tools_3.6.1            backports_1.1.4       
 [9] R6_2.4.0               rpart_4.1-15           Hmisc_4.2-0            DBI_1.0.0             
[13] lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12            withr_2.1.2           
[17] tidyselect_0.2.5       gridExtra_2.3          bit_1.1-14             compiler_3.6.1        
[21] cli_1.1.0              rvest_0.3.4            formatR_1.6            htmlTable_1.13.1      
[25] xml2_1.2.0             scales_1.0.0           checkmate_1.9.3        genefilter_1.66.0     
[29] digest_0.6.18          foreign_0.8-71         XVector_0.24.0         base64enc_0.1-3       
[33] pkgconfig_2.0.2        htmltools_0.3.6        readxl_1.3.1           htmlwidgets_1.3       
[37] rlang_0.3.4            rstudioapi_0.10        RSQLite_2.1.1          farver_1.1.0          
[41] generics_0.0.2         jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.12       
[45] magrittr_1.5           GenomeInfoDbData_1.2.1 Formula_1.2-3          Matrix_1.2-17         
[49] Rcpp_1.0.1             munsell_0.5.0          stringi_1.4.3          MASS_7.3-51.4         
[53] zlibbioc_1.30.0        plyr_1.8.4             blob_1.1.1             crayon_1.3.4          
[57] lattice_0.20-38        haven_2.1.0            splines_3.6.1          annotate_1.62.0       
[61] hms_0.4.2              locfit_1.5-9.1         knitr_1.22             pillar_1.4.0          
[65] geneplotter_1.62.0     futile.options_1.0.1   XML_3.98-1.19          glue_1.3.1            
[69] latticeExtra_0.6-28    modelr_0.1.4           lambda.r_1.2.3         data.table_1.12.2     
[73] tweenr_1.0.1           cellranger_1.1.0       gtable_0.3.0           polyclip_1.10-0       
[77] assertthat_0.2.1       xfun_0.6               xtable_1.8-4           broom_0.5.2           
[81] survival_2.44-1.1      memoise_1.1.0          cluster_2.1.0

deseq2 • 2.3k views

ADD COMMENT • link updated 5.8 years ago by swbarnes2 ★ 1.4k • written 5.8 years ago by n.naharfancy • 0

score 0 · Answer 1 · 2019-07-16

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 9 days ago

United States

Assuming both protocols were performed well (skilled person at the bench, enough RNA input, comparable RIN scores, etc), my guess would be that the lower counts / gene you are getting in the quantseq prep should probably explain the majority of the differences ... features with more counts are more easily called as DGE.

If I were poking around to these data to better sort this out, here are some things I might try:

I think I'd rather analyze these data in independently from each other, ie. create two separate DESeqDataSets, one for each sequencing protocol used, and run the analysis / DGE pipeline separately for each. Perhaps the dispersion estimate is better fit separately for the protocols. You can then also see differences between plotDispEsts on the quantseq vs the other Illumina protocol you are using.

This should also allow you to light up the genes on the dispersion plot that are not identified in as DGE in each protocol ... do the tend to be lowly expressed, have high dispersion, both?
Compare the MA plots for the replicates within each protocol ... anything funky there?
You only have 3 vs 3, so perhaps not that informative, but you should verify that the replicates at least group together reasonably enough via some clustering approach or PCA.
Plot the logFC's calculated from each library prep vs each other. Are the results largely concordant (ie. on the 45° line?) and it's just the statistical power that's lower in the quantseq data? If so, perhaps sequencing the quantseq library deeper (with, perhaps, requring more input RNA to begin with).

Also: are you using the UMI kits with your quantseq prep? Given the tagging nature of the protocol, you definitely should consider using them to remove PCR duplicates ... which also should make you think about how much PCR you're doing with the illumina prep. It's likely less of a problem there, but still ... these are all good things to know.

Also: I just noticed you are testing against a threshold, ie. results(dds, lfcThreshold = 2, ...). Start your initial analysis by dropping the lfcThreshold parameter (leaving it set to 0).

ADD COMMENT • link 5.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve,

Thank you for taking the time to respond with such detailed answer.

You are right in assuming that most of the lower count genes are not getting called significant. 1. I did analyse them completely separately to avoid wrong estimation of dispersion. I will look into plotDispEsts. 2. Did not try MA plot, will do that soon. 3. PCA for both datasets looks identical, clearly separating control and treated samples. 4 Will do that.

The idea of this experiment was to check how much the quantseq can replicate from illumina samples, to decide whether it would be worth sequencing using quantseq which will be much cheaper. Also, this is our second trial, in the first trial we did ended up with deeper sequencing (~40 million per library from both illumina and quantseq but it yielded even poorer result.). As quantseq suggest 5-10 million reads from quantseq library , this time we have ~20 million, which seems to yield better result.

I tried playing with the parameters to check whether that increases the number of DGEs. Which was not completely successful.

Thanks Nurun

ADD REPLY • link 5.8 years ago n.naharfancy • 0

0

Entering edit mode

Hi Steve,

I was also wondering if I can do something computationally to increase the DGEs number so that I have at least 85-90% overlapping? Any ideas?

Also, some of the genes not called DGEs were of moderate expression. If you want to see any particular plot or data, please let me know and I can attach them here.

ADD REPLY • link 5.8 years ago n.naharfancy • 0

1

Entering edit mode

I think you're focussed on the wrong question. There won't be some computational trick you can do to increase the overlap, and even if there was (this time), is that something you really want to do everywhere going forward?

If I were you, I'd focus on understanding what is driving the differences you see between the downstream results -- or even if these differences are something you really care about.

There are many many things to try to start to understand that, and I've outlined some of those. Once you feel like you have a handle on why things are different, you would be in a better place to propose solutions.

For instance, if everything is essentially the same between the two kits except the quantseq kit has lower power to detect DGE based on the sequencing depth, you might advise your lab colleagues to provide more RNA (more input) and sequence deeper so you can get more counts ... for instance.

Or, with all that money you are saving using the quantseq kits vs the other Ilumina kits, you might ask to include a few more replicates per group to increase your power.

While I'm on my soapbox, some of the "solutions" I propose above should really be answered by a more complex experimental design than the 3 vs 3 you have. You can imagine a much larger (still two-group) experiment you can design that has more repliclates across a range of RNA input amounts.

This would allow you to analyze the data in such a way that can more definitively inform colleagues about what range of input RNA and replicates are required in the quantseq world that will match match the sensitivity/specificity of some "standard" illumina experiment you folks are already running.

Know what I mean?

ADD REPLY • link 5.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Yes, makes perfect sense, thank you. :)

ADD REPLY • link 5.8 years ago n.naharfancy • 0

0

Entering edit mode

Hi Steve

Thank you very much for your thought and input. I just remember, previously we have done the same experiment by sequencing the quantseq library at higher depth (~40 million) same as illumina standard but we did not see an improvement in DGEs. We only saw an increased percentage of read duplication (55-60%). I am going to check all the other parameters you mentioned and also the RNA quantity as a starting material for library preparation.

ADD REPLY • link 5.8 years ago n.naharfancy • 0

0

Entering edit mode

No problem ... also, I don't think you mentioned it yet, but you should find out if you're using the UMI mojo with the quantseq kids during the library prep, and the are subsequently taking advantage of an NGS pipeline that leverages those to properly remove PCR duplicate reads before gene quantitation. If you're not using them, you should be pushing hard on your lab-based colleagues to do so.

ADD REPLY • link 5.8 years ago Steve Lianoglou ★ 13k

score 0 · Answer 2 · 2019-07-16

0

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 2 days ago

San Diego

What will people downstream do with the list of DE genes? If you do pathway analysis or look at gene enrichment, and you get the same pathways and gene groups returned with both preps, the it might be okay that you lost some genes with lower expression.

ADD COMMENT • link 5.8 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Hi

Thanks for commenting. The purpose for this test was actually to see if three prime sequencing will be suitable for future experiment. So, this time the treatment was LPS, which gives over 1000 DGEs, so with three prime I got about 60% DGEs overlapping with illumina standard mRNA, but we are not sure, if we will have 60% when we have our actual experiment and if those 60% will be a large number for pathway analysis. I was wondering, if I can do something computationally to get at least 85-90% DGEs overlapping by both method? Any suggestion?

ADD REPLY • link 5.8 years ago n.naharfancy • 0