Question

False positives due to GC content correction - DESeq2

Guest User ★ 13k

@guest-user-4897

Last seen 10.6 years ago

Hi Mike,

I have been trying to use DESeq2 for a differential analysis of ChIP-seq data using 8 T/N pairs. There is a lot of heterogeneity in the samples due to clinical differences (tumor stage, etc.), total mapped reads (some samples are much better than others), and batch effects (the samples were processed at different times and not by the same person). I wanted to correct at least some of the biases, starting with GC content, so I used offsets from EDASeq as an input to DESeq2 and introduced the batch variable in the model.

What I don't understand is that when I correct for GC bias in the samples, the final results tend to have a lot of false positives. I have attached the dispersion plots for both runs. I can't seem to figure out why.
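The step of passing EDASeq offsets to DESeq2 can be sketched in base R. This is a minimal illustration, not the poster's script: the matrices `counts` and `offsets` and all their values are made up, and it assumes the offsets are on the natural-log scale and defined as log(normalized) minus log(raw), so that the matching count-scale normalization factors are `exp(-offsets)`. Centering each row to a geometric mean of 1 follows the recommendation in the DESeq2 documentation for `normalizationFactors`.

```r
# Toy example: 3 regions x 4 samples. All values are hypothetical.
counts <- matrix(c( 10,  12, 40,  44,
                   200, 180, 90, 110,
                     5,   7,  6,   4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("region", 1:3), paste0("s", 1:4)))

# Hypothetical log-scale offsets, assumed to mean log(normalized) - log(raw).
offsets <- matrix(c( 0.10, -0.05,  0.20, -0.25,
                    -0.30,  0.15,  0.05,  0.10,
                     0.00,  0.40, -0.20, -0.20),
                  nrow = 3, byrow = TRUE, dimnames = dimnames(counts))

# Offsets -> multiplicative factors on the count scale.
norm_factors <- exp(-offsets)

# Center each row so its geometric mean is 1; this keeps the mean of the
# normalized counts close to the mean of the raw counts for each region.
norm_factors <- norm_factors / exp(rowMeans(log(norm_factors)))

# Normalized counts are the raw counts divided by the factors.
norm_counts <- counts / norm_factors
round(norm_counts, 1)
```

With DESeq2 loaded, a matrix like `norm_factors` would be assigned with `normalizationFactors(dds) <- norm_factors` before estimating dispersions.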

EDASeq DESeq2 • 2.1k views

ADD COMMENT • link updated 10.7 years ago by Michael Love 43k • written 10.7 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2014-08-08

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Hi Aditi,

Please include all the code you used for EDASeq and DESeq2, and the output of sessionInfo().

How do you know there are false positives? Are these genes which you know are not differentially expressed?

Your dispersion plots didn't come through. You can email those attachments to my email address, and we will continue the discussion on the Bioc list.

Mike

ADD COMMENT • link 10.7 years ago Michael Love 43k

Hi Mike,

Sorry, it seems my message got cut off midway. What I was saying was that I don't understand how I can estimate what the source of these false positives could be. Yes, these are regions that I know are not differentially expressed.

I've attached the code for the analysis as well as the dispersion plots.

Session info:

R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] EDASeq_1.10.0           aroma.light_2.0.0       matrixStats_0.10.0
 [4] ShortRead_1.22.0        GenomicAlignments_1.0.3 BSgenome_1.32.0
 [7] Rsamtools_1.16.1        Biostrings_2.32.1       XVector_0.4.0
[10] BiocParallel_0.6.1      Biobase_2.24.0          DESeq2_1.4.5
[13] RcppArmadillo_0.4.320.0 Rcpp_0.11.2             GenomicRanges_1.16.3
[16] GenomeInfoDb_1.0.2      IRanges_1.22.10         BiocGenerics_0.10.0
[19] BiocInstaller_1.14.2

loaded via a namespace (and not attached):
 [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3
 [4] BBmisc_1.7           bitops_1.0-6         brew_1.0-6
 [7] checkmate_1.2        codetools_0.2-8      DBI_0.2-7
[10] DESeq_1.16.0         digest_0.6.4         fail_1.2
[13] foreach_1.4.2        genefilter_1.46.1    geneplotter_1.42.0
[16] grid_3.1.0           hwriter_1.3          iterators_1.0.7
[19] lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1
[22] RColorBrewer_1.0-5   R.methodsS3_1.6.1    R.oo_1.18.0
[25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.0
[28] stats4_3.1.0         stringr_0.6.2        survival_2.37-7
[31] tools_3.1.0          XML_3.98-1.1         xtable_1.7-3
[34] zlibbioc_1.10.0

Attachment: EDAseq+DESeq_Script.txt <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140809/04d467b0/attachment.txt>

ADD REPLY • link 10.7 years ago QAMRA Aditi GIS ▴ 120

Hi Aditi,

Your code looks correct to me. Also, the normalization factors are correctly taking sequencing depth into account, which is what I wanted to check by looking at scatterplots of normalized counts for pairs of samples.

I took a look at the results, and I also see, as you say, the additional genes after using GC correction:

    > res <- results(dds)
    > res2 <- results(dds2_nongc)
    > table(gc.correct=res$padj < .1, no.correct=res2$padj < .1)
              no.correct
    gc.correct FALSE  TRUE
         FALSE 20810   143
         TRUE    368   472

Ideally, we can have additional genes showing up as significant if we have reduced technical noise by modeling the normalization factors using technical covariates like GC content. But you suspect these new genes. Can you explain how you know that these are false positives? And is it just the genes added after GC correction that are enriched with false positives?

Mike
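The cross-tabulation in this reply can be reproduced on any pair of adjusted p-value vectors. The following is a small base-R sketch on made-up values, not the thread's data: `padj_gc` and `padj_nogc` are hypothetical stand-ins for `res$padj` and `res2$padj`. It also extracts the features that are significant only after GC correction, which is the set in question. DESeq2 can report `NA` adjusted p-values; here they are treated as not significant, which is a choice made for this sketch.

```r
# Hypothetical adjusted p-values for six features under the two analyses.
padj_gc   <- c(a = 0.01, b = 0.20, c = 0.04, d = NA,   e = 0.03, f = 0.50)
padj_nogc <- c(a = 0.02, b = 0.05, c = 0.30, d = 0.60, e = NA,   f = 0.70)

alpha <- 0.1

# Significance calls; NA is counted as "not significant" in this sketch.
sig_gc   <- !is.na(padj_gc)   & padj_gc   < alpha
sig_nogc <- !is.na(padj_nogc) & padj_nogc < alpha

# Rows: with GC correction. Columns: without GC correction.
tab <- table(gc.correct = sig_gc, no.correct = sig_nogc)
print(tab)

# Features called only when GC correction is applied.
gc_only <- names(padj_gc)[sig_gc & !sig_nogc]
gc_only
```

In the table from the thread, the corresponding cell (gc.correct TRUE, no.correct FALSE) holds the 368 features that appear only after GC correction.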

ADD REPLY • link 10.7 years ago Michael Love 43k

Hi Michael,

Yes, the regions that are added after GC correction are mostly regions with very low read counts. Some correspond to genes/regions that I know beforehand are not different, while others mark regions where the bedgraph tracks show no difference in read count.

Aditi
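One way to check the observation that the added regions are mostly low-count is to compare mean counts between the regions called only after GC correction and the rest. The sketch below uses base R on made-up numbers: `mean_count`, `sig_gc`, and `sig_nogc` are hypothetical stand-ins for a per-region mean normalized count (for example the `baseMean` column of a DESeq2 results table) and the two sets of significance calls, and the cutoff of 10 is arbitrary.

```r
# Hypothetical per-region summary; none of these values come from the thread.
regions <- data.frame(
  mean_count = c(3, 5, 250, 4, 800, 120, 2, 60),
  sig_gc     = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  sig_nogc   = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Label each region by which analysis called it significant.
regions$group <- with(regions,
  ifelse(sig_gc & !sig_nogc, "gc_only",
  ifelse(sig_gc & sig_nogc,  "both",
  ifelse(sig_nogc,           "nogc_only", "neither"))))

# Median mean count per group.
median_by_group <- tapply(regions$mean_count, regions$group, median)
median_by_group

# Fraction of GC-only calls that fall below an arbitrary low-count cutoff.
low_cutoff <- 10
frac_low_gc_only <- mean(regions$mean_count[regions$group == "gc_only"] < low_cutoff)
frac_low_gc_only
```

If most `gc_only` regions fall below the cutoff, that is consistent with the pattern described in this reply; it does not by itself show that those calls are false positives.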

ADD REPLY • link 10.7 years ago QAMRA Aditi GIS ▴ 120