Question

DiffBind and paired end data

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 3 months ago

Cambridge, UK

Hi Frederico- Looks like you've got it just about right you want to use the bUseSummarizedOverlaps option. For paired end, you don't need to set the insert length (changed to fragmentSize moving forward) as with paired end data the size of each fragment is known. I will add the option to set the fragments parameter (in DBA$config$fragments), it should appear in the development version 1.9.9. Do let me know if you have any problems wit this, as this is a scenario we'd really like to see work and can help debugging if there are issues. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Feb 2014 14:51:49 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: DiffBind and paired end data Dear Dr. Stark, I had the pleasure to attend ("back then in 2013") to the presentation you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where you introduced DiffBind and its usage examples. Now I moved from IMB - Mainz and I am working in the Biostatistics department of the University of Medical Center in Mainz as a research associate, where I also have the chance of doing my PhD. I contact you regarding indeed DiffBind, and the possibility of doing differential binding analysis for ChIP-seq paired end data. As one of our collaborators recently produced such datasets, and they would like to investigate the aspect of changes in the binding for TFs and histone modifications, I thought the DiffBind framework would be a very solid solution for analysis. The doubt I have, is whether DiffBind can use the information of the paired end data, and how. Ideally I guess the best way would be this: properly paired reads will be counted just once for each of the ends, and singletons(/not properly paired) will also count one. I was following the discussions around https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and it seems that by incorporating the summarizeOverlaps functions it would be (almost-)possible, but I would like to double check it with you. bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to false. Is there anything else I should take into account (e.g. the insert length in this case would be meaningful?)? Is there also a parameter to use the "fragment" parameter set to true in the summarizeOverlaps function? Thank you very much in advance for the attention, and thank you again for the nice package on top of it! Best regards, Federico -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> [[alternative HTML version deleted]]

DiffBind DiffBind • 2.4k views

ADD COMMENT • link 11.2 years ago • updated 11.1 years ago Rory Stark ★ 5.2k

score 0 · Answer 1 · 2014-03-20

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 3 months ago

Cambridge, UK

Hi Frederico- Let me know when you are able to use a recent build of DiffBind to see fi it works on your paord-end data. At some point I'll probably need access to at least a subset of the experiment (eg using Dropbox) so I can debug any further problems you may be having. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Tue, 4 Mar 2014 15:21:44 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: Re: DiffBind and paired end data Dear Rory, Great to hear back from you. And even better to read that the feature is going to be implemented soon. So far I am trying it then with bUseSummarizeOverlaps = TRUE and with DBA$config$singleEnd <- FALSE before calling the dba.count. I got this warning message during the count operations: Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' # (8 times, I assume it is due to the 8 samples in total - triplicates + respective input, 2 conditions) The counting part still takes quite a very long time to perform. When it was done, I sadly got this error message (sorry for some parts in german): Fehler: Error processing one or more read files. Check warnings(). Zusätzlich: Warnmeldungen: 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) : scheduled cores 2 encountered errors in user code, all values of the jobs will be affected 2: 3: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 4: 5: 6: 7: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 8: The sessionInfo for this is as follows: sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 KernSmooth_2.23-10 [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2 [13] zlibbioc_1.8.0 Will you also plan to implement the recent featureCounts, too? This could possibly speed up massively the operations, I guess. If I'd ask you from your hands-on experience, what is a "good" number of DB sites? I assume this is not a one size fits all situation, but just as an overall idea it could help to have some numbers. Thank you again for replying, and I will get back to you as soon as I have a better idea of what possibly went wrong. Best regards, Federico On 04.03.14 12:36, Rory Stark wrote: Hi Frederico- Looks like you've got it just about right you want to use the bUseSummarizedOverlaps option. For paired end, you don't need to set the insert length (changed to fragmentSize moving forward) as with paired end data the size of each fragment is known. I will add the option to set the fragments parameter (in DBA$config$fragments), it should appear in the development version 1.9.9. Do let me know if you have any problems wit this, as this is a scenario we'd really like to see work and can help debugging if there are issues. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Feb 2014 14:51:49 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: DiffBind and paired end data Dear Dr. Stark, I had the pleasure to attend ("back then in 2013") to the presentation you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where you introduced DiffBind and its usage examples. Now I moved from IMB - Mainz and I am working in the Biostatistics department of the University of Medical Center in Mainz as a research associate, where I also have the chance of doing my PhD. I contact you regarding indeed DiffBind, and the possibility of doing differential binding analysis for ChIP-seq paired end data. As one of our collaborators recently produced such datasets, and they would like to investigate the aspect of changes in the binding for TFs and histone modifications, I thought the DiffBind framework would be a very solid solution for analysis. The doubt I have, is whether DiffBind can use the information of the paired end data, and how. Ideally I guess the best way would be this: properly paired reads will be counted just once for each of the ends, and singletons(/not properly paired) will also count one. I was following the discussions around https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and it seems that by incorporating the summarizeOverlaps functions it would be (almost-)possible, but I would like to double check it with you. bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to false. Is there anything else I should take into account (e.g. the insert length in this case would be meaningful?)? Is there also a parameter to use the "fragment" parameter set to true in the summarizeOverlaps function? Thank you very much in advance for the attention, and thank you again for the nice package on top of it! Best regards, Federico -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> [[alternative HTML version deleted]]

ADD COMMENT • link 11.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory, I sorted the problem out, it apparently was a lack of memory in the old server. I could not think that summarizeOverlaps was so demanding on these very deeply sequenced samples. Still, I have one additional question for you. I am trying to understand what DiffBind is doing for the normalization with the control sample. Could you elaborate the concept a little more? Is the raw count subtracted/rounded and subtracted or something else? I am asking because I see that a lot of regions get quite low values in the final report (shown with the bCounts=TRUE option) even if the peak caller identified peaks over there. Thanks in advance for the explanation! And once again, thank you for the nice tool you provided. Federico On 20.03.14 16:08, Rory Stark wrote: > Hi Frederico- > > Let me know when you are able to use a recent build of DiffBind to see > fi it works on your paord-end data. > > At some point I'll probably need access to at least a subset of the > experiment (eg using Dropbox) so I can debug any further problems you > may be having. > > Cheers- > Rory > > From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">> > Date: Tue, 4 Mar 2014 15:21:44 +0100 > To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>> > Subject: Re: DiffBind and paired end data > > Dear Rory, > > Great to hear back from you. And even better to read that the feature > is going to be implemented soon. > > So far I am trying it then with bUseSummarizeOverlaps = TRUE and with > DBA$config$singleEnd <- FALSE before calling the dba.count. > I got this warning message during the count operations: > > Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' > > # (8 times, I assume it is due to the 8 samples in total - > triplicates + respective input, 2 conditions) > > > The counting part still takes quite a very long time to perform. When > it was done, I sadly got this error message (sorry for some parts in > german): > > Fehler: Error processing one or more read files. Check warnings(). > Zusätzlich: Warnmeldungen: > 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, > mc.allow.recursive = TRUE) : > scheduled cores 2 encountered errors in user code, all values of > the jobs will be affected > 2: > 3: Fehler bei der Auswertung des Argumentes 'x' bei der > Methodenauswahl > 4: > 5: > 6: > 7: Fehler bei der Auswertung des Argumentes 'x' bei der > Methodenauswahl > 8: > > > The sessionInfo for this is as follows: > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods > [8] base > > other attached packages: > [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 > [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 > [7] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 > [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 > KernSmooth_2.23-10 > [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 > tools_3.0.2 > [13] zlibbioc_1.8.0 > > > > Will you also plan to implement the recent featureCounts, too? This > could possibly speed up massively the operations, I guess. > > If I'd ask you from your hands-on experience, what is a "good" number > of DB sites? I assume this is not a one size fits all situation, but > just as an overall idea it could help to have some numbers. > > Thank you again for replying, and I will get back to you as soon as I > have a better idea of what possibly went wrong. > > Best regards, > > Federico > > > > On 04.03.14 12:36, Rory Stark wrote: >> Hi Frederico- >> >> Looks like you've got it just about right you want to use the >> bUseSummarizedOverlaps option. For paired end, you don't need to set >> the insert length (changed to fragmentSize moving forward) as with >> paired end data the size of each fragment is known. >> >> I will add the option to set the fragments parameter (in >> DBA$config$fragments), it should appear in the development version 1.9.9. >> >> Do let me know if you have any problems wit this, as this is a >> scenario we'd really like to see work and can help debugging if there >> are issues. >> >> Cheers- >> Rory >> >> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>> >> Date: Mon, 24 Feb 2014 14:51:49 +0100 >> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>> >> Subject: DiffBind and paired end data >> >> Dear Dr. Stark, >> >> I had the pleasure to attend ("back then in 2013") to the >> presentation you gave at the EBI Advanced Course for RNA-seq and >> ChIP-seq, where you introduced DiffBind and its usage examples. >> >> Now I moved from IMB - Mainz and I am working in the Biostatistics >> department of the University of Medical Center in Mainz as a research >> associate, where I also have the chance of doing my PhD. >> >> I contact you regarding indeed DiffBind, and the possibility of doing >> differential binding analysis for ChIP-seq paired end data. As one of >> our collaborators recently produced such datasets, and they would >> like to investigate the aspect of changes in the binding for TFs and >> histone modifications, I thought the DiffBind framework would be a >> very solid solution for analysis. >> >> The doubt I have, is whether DiffBind can use the information of the >> paired end data, and how. Ideally I guess the best way would be this: >> properly paired reads will be counted just once for each of the ends, >> and singletons(/not properly paired) will also count one. >> >> I was following the discussions around >> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, >> and it seems that by incorporating the summarizeOverlaps functions it >> would be (almost-)possible, but I would like to double check it with you. >> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to >> false. Is there anything else I should take into account (e.g. the >> insert length in this case would be meaningful?)? Is there also a >> parameter to use the "fragment" parameter set to true in the >> summarizeOverlaps function? >> >> Thank you very much in advance for the attention, and thank you again >> for the nice package on top of it! >> >> Best regards, >> >> Federico >> >> -- >> *Federico Marini, M.Sc.* >> Medizinische Biometrie >> _____________________________________________ >> UNIVERSITÄTSMEDIZIN >> der Johannes Gutenberg-Universität Mainz >> Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) >> Abteilung Medizinische Biometrie >> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >> Zahlbacher Str. 69, 55131 Mainz >> www.imbei.uni-mainz.de >> Telefon +49 (0) 6131 17-7029 >> Telefax +49 (0) 6131 17-472433 >> E-Mail: federico.marini@unimedizin-mainz.de >> marinif@uni-mainz.de > > -- > *Federico Marini, M.Sc.* > Medizinische Biometrie > _____________________________________________ > UNIVERSITÄTSMEDIZIN > der Johannes Gutenberg-Universität Mainz > Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) > Abteilung Medizinische Biometrie > Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher > Str. 69, 55131 Mainz > www.imbei.uni-mainz.de > Telefon +49 (0) 6131 17-7029 > Telefax +49 (0) 6131 17-472433 > E-Mail: federico.marini@unimedizin-mainz.de > marinif@uni-mainz.de -- *Federico Marini, M.Sc.* Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Federico Marini ▴ 180

0

Entering edit mode

Hi Frederico- Regarding normalizing to controls, the default behavior in DiffBind is to first get the number of reads in the ChIP, and the number of reads in the control, then scale the control reads if the control library was sequenced to higher depth. Then it subtracts the control reads from the ChIP reads (the _MINUS scoring options). Finally the read counts for each sample are normalised with the other samples (this depends which analysis method you use; the default is edgeR's TMM method, based on the total library size).This is not an overly- principled step, it just dampens some peaks where there are a high concentration of reads int he control. The assumption is that you used the controls for the peak calling peak callers like MACS model the background more elaborately. Read scores can't be negative. If you want to see what is really going on, I suggest retrieving the report as a SummarizedExperiment and look through the appropriate assays to get the raw counts. For example, using the example set: > data(tamoxifen_analysis) > sumexp = dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) > assays(sumexp) List of length 5 names(5): scores RPKM Reads cRPKM cReads You can see the actual (non-normalized) read counts for the ChIP (min 1): > chipReads = assay(sumexp,3) And for the Control (min 1): > controlReads = assay(sumexp,5) And the subtracted values that are actually given to edgeR (or DESeq/DESeq2): > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) You can compare these counts with what you see in the browser for the same region for ChIP and control. If there is a big mismatch let me know and we can figure out why summarizeOverlaps isn't doing what we expect. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Mar 2014 15:47:56 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Subject: Re: DiffBind and paired end data Hi Rory, I sorted the problem out, it apparently was a lack of memory in the old server. I could not think that summarizeOverlaps was so demanding on these very deeply sequenced samples. Still, I have one additional question for you. I am trying to understand what DiffBind is doing for the normalization with the control sample. Could you elaborate the concept a little more? Is the raw count subtracted/rounded and subtracted or something else? I am asking because I see that a lot of regions get quite low values in the final report (shown with the bCounts=TRUE option) even if the peak caller identified peaks over there. Thanks in advance for the explanation! And once again, thank you for the nice tool you provided. Federico On 20.03.14 16:08, Rory Stark wrote: Hi Frederico- Let me know when you are able to use a recent build of DiffBind to see fi it works on your paord-end data. At some point I'll probably need access to at least a subset of the experiment (eg using Dropbox) so I can debug any further problems you may be having. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Tue, 4 Mar 2014 15:21:44 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: Re: DiffBind and paired end data Dear Rory, Great to hear back from you. And even better to read that the feature is going to be implemented soon. So far I am trying it then with bUseSummarizeOverlaps = TRUE and with DBA$config$singleEnd <- FALSE before calling the dba.count. I got this warning message during the count operations: Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' # (8 times, I assume it is due to the 8 samples in total - triplicates + respective input, 2 conditions) The counting part still takes quite a very long time to perform. When it was done, I sadly got this error message (sorry for some parts in german): Fehler: Error processing one or more read files. Check warnings(). Zusätzlich: Warnmeldungen: 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) : scheduled cores 2 encountered errors in user code, all values of the jobs will be affected 2: 3: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 4: 5: 6: 7: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 8: The sessionInfo for this is as follows: sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 KernSmooth_2.23-10 [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2 [13] zlibbioc_1.8.0 Will you also plan to implement the recent featureCounts, too? This could possibly speed up massively the operations, I guess. If I'd ask you from your hands-on experience, what is a "good" number of DB sites? I assume this is not a one size fits all situation, but just as an overall idea it could help to have some numbers. Thank you again for replying, and I will get back to you as soon as I have a better idea of what possibly went wrong. Best regards, Federico On 04.03.14 12:36, Rory Stark wrote: Hi Frederico- Looks like you've got it just about right you want to use the bUseSummarizedOverlaps option. For paired end, you don't need to set the insert length (changed to fragmentSize moving forward) as with paired end data the size of each fragment is known. I will add the option to set the fragments parameter (in DBA$config$fragments), it should appear in the development version 1.9.9. Do let me know if you have any problems wit this, as this is a scenario we'd really like to see work and can help debugging if there are issues. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Feb 2014 14:51:49 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: DiffBind and paired end data Dear Dr. Stark, I had the pleasure to attend ("back then in 2013") to the presentation you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where you introduced DiffBind and its usage examples. Now I moved from IMB - Mainz and I am working in the Biostatistics department of the University of Medical Center in Mainz as a research associate, where I also have the chance of doing my PhD. I contact you regarding indeed DiffBind, and the possibility of doing differential binding analysis for ChIP-seq paired end data. As one of our collaborators recently produced such datasets, and they would like to investigate the aspect of changes in the binding for TFs and histone modifications, I thought the DiffBind framework would be a very solid solution for analysis. The doubt I have, is whether DiffBind can use the information of the paired end data, and how. Ideally I guess the best way would be this: properly paired reads will be counted just once for each of the ends, and singletons(/not properly paired) will also count one. I was following the discussions around https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and it seems that by incorporating the summarizeOverlaps functions it would be (almost-)possible, but I would like to double check it with you. bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to false. Is there anything else I should take into account (e.g. the insert length in this case would be meaningful?)? Is there also a parameter to use the "fragment" parameter set to true in the summarizeOverlaps function? Thank you very much in advance for the attention, and thank you again for the nice package on top of it! Best regards, Federico -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory, Thanks for the detailed reply. On 27.03.14 20:14, Rory Stark wrote: > Hi Frederico- > > Regarding normalizing to controls, the default behavior in DiffBind is > to first get the number of reads in the ChIP, and the number of reads > in the control, then scale the control reads if the control library > was sequenced to higher depth. Then it subtracts the control reads > from the ChIP reads (the _MINUS scoring options). Finally the read > counts for each sample are normalised with the other samples (this > depends which analysis method you use; the default is edgeR's TMM > method, based on the total library size).This is not an > overly-principled step, it just dampens some peaks where there are a > high concentration of reads int he control. The assumption is that you > used the controls for the peak calling peak callers like MACS model > the background more elaborately. Read scores can't be negative. I guess that the rescaling, if performed, is then somehow rounded up, so that the amounts of reads are still discrete numbers - to feed for edgeR/DESeq. If it is not the case, please correct me. I was asking about details because I was trying to "do what DiffBind does" with single steps performed outside of the pipeline, i.e. counting in the enriched regions with featureCounts, importing the count table back in a semi-standard DESeq analysis (and there I did not know what to do for the input to be as close as possible to your procedure). One point I could not figure out of the vignette and also of the reference manual is the sequence of exact steps to pass the count matrix back to diffbind (to exploit the plot aids and so on). Could you provide me a pseudocode example for it? I got lost in the reference with the dba.peakset explanation. You can assume I have one vector of counts per each sample. > > If you want to see what is really going on, I suggest retrieving the > report as a SummarizedExperiment and look through the appropriate > assays to get the raw counts. For example, using the example set: > > > data(tamoxifen_analysis) > > sumexp = dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) > > assays(sumexp) > List of length 5 > names(5): scores RPKM Reads cRPKM cReads > > You can see the actual (non-normalized) read counts for the ChIP (min 1): > > chipReads = assay(sumexp,3) > > And for the Control (min 1): > > controlReads = assay(sumexp,5) > > And the subtracted values that are actually given to edgeR (or > DESeq/DESeq2): > > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) I tried this on your tamoxifen sample dataset, as well as the ones we have. A few additional information, the data we had was extremely deep sequenced (drosophila, sometimes more than 20M reads per replicate) - so I tried to downsample the data, and I did this scaling each dataset to 20M reads. So for this reason I think that the scaling for the input should not have that much of an impact. > > You can compare these counts with what you see in the browser for the > same region for ChIP and control. If there is a big mismatch let me > know and we can figure out why summarizeOverlaps isn't doing what we > expect. > It seems that the numbers are somehow correct - I did a few spot checks. The thing is, the concentration values look kind of strange to me. Indeed, for regions where I detected peaks in all samples (I used MACS2 with mostly default parameters) I see that the final concentration value is quite close to 0 or even negative. The readCountsForAnalysis object in my case has indeed quite some negative values, quite more than what you have in the same object derived from the tamoxifen dataset. Can this be to the fact that my merging adjacent peaks coming from different datasets the input amount gets "overestimated"? If so, do you have any suggestion to take in this case? Thanks again for the advices. Cheers, Federico > Cheers- > Rory > > > > From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">> > Date: Mon, 24 Mar 2014 15:47:56 +0100 > To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>> > Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>" > <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">> > Subject: Re: DiffBind and paired end data > > Hi Rory, > > I sorted the problem out, it apparently was a lack of memory in the > old server. I could not think that summarizeOverlaps was so demanding > on these very deeply sequenced samples. > > Still, I have one additional question for you. > I am trying to understand what DiffBind is doing for the normalization > with the control sample. Could you elaborate the concept a little > more? Is the raw count subtracted/rounded and subtracted or something > else? > I am asking because I see that a lot of regions get quite low values > in the final report (shown with the bCounts=TRUE option) even if the > peak caller identified peaks over there. > > Thanks in advance for the explanation! And once again, thank you for > the nice tool you provided. > > Federico > > > On 20.03.14 16:08, Rory Stark wrote: >> Hi Frederico- >> >> Let me know when you are able to use a recent build of DiffBind to >> see fi it works on your paord-end data. >> >> At some point I'll probably need access to at least a subset of the >> experiment (eg using Dropbox) so I can debug any further problems you >> may be having. >> >> Cheers- >> Rory >> >> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>> >> Date: Tue, 4 Mar 2014 15:21:44 +0100 >> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>> >> Subject: Re: DiffBind and paired end data >> >> Dear Rory, >> >> Great to hear back from you. And even better to read that the feature >> is going to be implemented soon. >> >> So far I am trying it then with bUseSummarizeOverlaps = TRUE and with >> DBA$config$singleEnd <- FALSE before calling the dba.count. >> I got this warning message during the count operations: >> >> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' >> >> # (8 times, I assume it is due to the 8 samples in total - >> triplicates + respective input, 2 conditions) >> >> >> The counting part still takes quite a very long time to perform. When >> it was done, I sadly got this error message (sorry for some parts in >> german): >> >> Fehler: Error processing one or more read files. Check warnings(). >> Zusätzlich: Warnmeldungen: >> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, >> mc.allow.recursive = TRUE) : >> scheduled cores 2 encountered errors in user code, all values >> of the jobs will be affected >> 2: >> 3: Fehler bei der Auswertung des Argumentes 'x' bei der >> Methodenauswahl >> 4: >> 5: >> 6: >> 7: Fehler bei der Auswertung des Argumentes 'x' bei der >> Methodenauswahl >> 8: >> >> >> The sessionInfo for this is as follows: >> >> sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 >> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 >> [7] BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 >> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 >> KernSmooth_2.23-10 >> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 >> tools_3.0.2 >> [13] zlibbioc_1.8.0 >> >> >> >> Will you also plan to implement the recent featureCounts, too? This >> could possibly speed up massively the operations, I guess. >> >> If I'd ask you from your hands-on experience, what is a "good" number >> of DB sites? I assume this is not a one size fits all situation, but >> just as an overall idea it could help to have some numbers. >> >> Thank you again for replying, and I will get back to you as soon as I >> have a better idea of what possibly went wrong. >> >> Best regards, >> >> Federico >> >> >> >> On 04.03.14 12:36, Rory Stark wrote: >>> Hi Frederico- >>> >>> Looks like you've got it just about right you want to use the >>> bUseSummarizedOverlaps option. For paired end, you don't need to set >>> the insert length (changed to fragmentSize moving forward) as with >>> paired end data the size of each fragment is known. >>> >>> I will add the option to set the fragments parameter (in >>> DBA$config$fragments), it should appear in the development version >>> 1.9.9. >>> >>> Do let me know if you have any problems wit this, as this is a >>> scenario we'd really like to see work and can help debugging if >>> there are issues. >>> >>> Cheers- >>> Rory >>> >>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>> >>> Date: Mon, 24 Feb 2014 14:51:49 +0100 >>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>> Subject: DiffBind and paired end data >>> >>> Dear Dr. Stark, >>> >>> I had the pleasure to attend ("back then in 2013") to the >>> presentation you gave at the EBI Advanced Course for RNA-seq and >>> ChIP-seq, where you introduced DiffBind and its usage examples. >>> >>> Now I moved from IMB - Mainz and I am working in the Biostatistics >>> department of the University of Medical Center in Mainz as a >>> research associate, where I also have the chance of doing my PhD. >>> >>> I contact you regarding indeed DiffBind, and the possibility of >>> doing differential binding analysis for ChIP-seq paired end data. As >>> one of our collaborators recently produced such datasets, and they >>> would like to investigate the aspect of changes in the binding for >>> TFs and histone modifications, I thought the DiffBind framework >>> would be a very solid solution for analysis. >>> >>> The doubt I have, is whether DiffBind can use the information of the >>> paired end data, and how. Ideally I guess the best way would be >>> this: properly paired reads will be counted just once for each of >>> the ends, and singletons(/not properly paired) will also count one. >>> >>> I was following the discussions around >>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, >>> and it seems that by incorporating the summarizeOverlaps functions >>> it would be (almost-)possible, but I would like to double check it >>> with you. >>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to >>> false. Is there anything else I should take into account (e.g. the >>> insert length in this case would be meaningful?)? Is there also a >>> parameter to use the "fragment" parameter set to true in the >>> summarizeOverlaps function? >>> >>> Thank you very much in advance for the attention, and thank you >>> again for the nice package on top of it! >>> >>> Best regards, >>> >>> Federico >>> >>> -- >>> *Federico Marini, M.Sc.* >>> Medizinische Biometrie >>> _____________________________________________ >>> UNIVERSITÄTSMEDIZIN >>> der Johannes Gutenberg-Universität Mainz >>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>> (IMBEI) >>> Abteilung Medizinische Biometrie >>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>> Zahlbacher Str. 69, 55131 Mainz >>> www.imbei.uni-mainz.de >>> Telefon +49 (0) 6131 17-7029 >>> Telefax +49 (0) 6131 17-472433 >>> E-Mail: federico.marini@unimedizin-mainz.de >>> marinif@uni-mainz.de >> >> -- >> *Federico Marini, M.Sc.* >> Medizinische Biometrie >> _____________________________________________ >> UNIVERSITÄTSMEDIZIN >> der Johannes Gutenberg-Universität Mainz >> Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) >> Abteilung Medizinische Biometrie >> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >> Zahlbacher Str. 69, 55131 Mainz >> www.imbei.uni-mainz.de >> Telefon +49 (0) 6131 17-7029 >> Telefax +49 (0) 6131 17-472433 >> E-Mail: federico.marini@unimedizin-mainz.de >> marinif@uni-mainz.de > > -- > *Federico Marini, M.Sc.* > Medical Biostatistics > _____________________________________________ > University Medical Center of the Johannes Gutenberg University Mainz > Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) > Medical Biostatistics Department > Postal address: 55101 Mainz > Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz > www.imbei.uni-mainz.de > Phone +49 (0) 6131 17-7029 > Fax +49 (0) 6131 17-472433 > E-Mail: marinif@uni-mainz.de -- *Federico Marini, M.Sc.* Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Federico Marini ▴ 180

0

Entering edit mode

Hi Rory, An update on the read count values. For your tamoxifen datasets it happens the following: head(controlReads) BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+ T47D.1+ T47D.2+ ZR75.1+ ZR75.2+ 1433 3 3 8 8 4 4 4 3 3 7 7 7 43 43 8 8 24 25 25 6 6 20 20 1606 13 13 2 2 19 19 19 8 8 21 21 156 33 33 4 4 6 6 6 8 8 8 8 1238 27 27 5 5 29 30 30 11 11 29 29 1446 3 3 2 2 3 3 3 1 1 4 4 Correctly each input has the same amount of reads. In our case we had one input for each condition and for each batch. Samples 1 and 2 for each condition were sequenced in batch 1, samples numbered as 3 were done in batch 2. head(dba_controlReads) wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3 4442 168 226 152 239 241 100 1464 696 934 471 592 596 289 2956 4293 5762 3538 4677 4711 2178 1292 1035 1390 704 1110 1118 482 5741 229 307 172 230 232 112 557 150 201 87 165 166 71 The numbers differ in every column, and this is somehow probably pointing to some strange behaviour of summarizeOverlaps. Would you have the same guess if you look at this lines? Cheers, Federico On 28.03.14 13:29, Federico Marini wrote: > Hi Rory, > > Thanks for the detailed reply. > > On 27.03.14 20:14, Rory Stark wrote: >> Hi Frederico- >> >> Regarding normalizing to controls, the default behavior in DiffBind >> is to first get the number of reads in the ChIP, and the number of >> reads in the control, then scale the control reads if the control >> library was sequenced to higher depth. Then it subtracts the control >> reads from the ChIP reads (the _MINUS scoring options). Finally the >> read counts for each sample are normalised with the other samples >> (this depends which analysis method you use; the default is edgeR's >> TMM method, based on the total library size).This is not an >> overly-principled step, it just dampens some peaks where there are a >> high concentration of reads int he control. The assumption is >> that you used the controls for the peak calling peak callers like >> MACS model the background more elaborately. Read scores can't be >> negative. > I guess that the rescaling, if performed, is then somehow rounded up, > so that the amounts of reads are still discrete numbers - to feed for > edgeR/DESeq. If it is not the case, please correct me. I was asking > about details because I was trying to "do what DiffBind does" with > single steps performed outside of the pipeline, i.e. counting in the > enriched regions with featureCounts, importing the count table back in > a semi-standard DESeq analysis (and there I did not know what to do > for the input to be as close as possible to your procedure). > > One point I could not figure out of the vignette and also of the > reference manual is the sequence of exact steps to pass the count > matrix back to diffbind (to exploit the plot aids and so on). Could > you provide me a pseudocode example for it? I got lost in the > reference with the dba.peakset explanation. You can assume I have one > vector of counts per each sample. >> >> If you want to see what is really going on, I suggest retrieving the >> report as a SummarizedExperiment and look through the appropriate >> assays to get the raw counts. For example, using the example set: >> >> > data(tamoxifen_analysis) >> > sumexp = >> dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) >> > assays(sumexp) >> List of length 5 >> names(5): scores RPKM Reads cRPKM cReads >> >> You can see the actual (non-normalized) read counts for the ChIP (min 1): >> > chipReads = assay(sumexp,3) >> >> And for the Control (min 1): >> > controlReads = assay(sumexp,5) >> >> And the subtracted values that are actually given to edgeR (or >> DESeq/DESeq2): >> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) > I tried this on your tamoxifen sample dataset, as well as the ones we > have. A few additional information, the data we had was extremely deep > sequenced (drosophila, sometimes more than 20M reads per replicate) - > so I tried to downsample the data, and I did this scaling each dataset > to 20M reads. So for this reason I think that the scaling for the > input should not have that much of an impact. >> >> You can compare these counts with what you see in the browser for the >> same region for ChIP and control. If there is a big mismatch let me >> know and we can figure out why summarizeOverlaps isn't doing what we >> expect. >> > It seems that the numbers are somehow correct - I did a few spot > checks. The thing is, the concentration values look kind of strange to > me. Indeed, for regions where I detected peaks in all samples (I used > MACS2 with mostly default parameters) I see that the final > concentration value is quite close to 0 or even negative. The > readCountsForAnalysis object in my case has indeed quite some negative > values, quite more than what you have in the same object derived from > the tamoxifen dataset. > Can this be to the fact that my merging adjacent peaks coming from > different datasets the input amount gets "overestimated"? If so, do > you have any suggestion to take in this case? > > Thanks again for the advices. > > Cheers, > Federico >> Cheers- >> Rory >> >> >> >> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>> >> Date: Mon, 24 Mar 2014 15:47:56 +0100 >> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>> >> Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>" >> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">> >> Subject: Re: DiffBind and paired end data >> >> Hi Rory, >> >> I sorted the problem out, it apparently was a lack of memory in the >> old server. I could not think that summarizeOverlaps was so demanding >> on these very deeply sequenced samples. >> >> Still, I have one additional question for you. >> I am trying to understand what DiffBind is doing for the >> normalization with the control sample. Could you elaborate the >> concept a little more? Is the raw count subtracted/rounded and >> subtracted or something else? >> I am asking because I see that a lot of regions get quite low values >> in the final report (shown with the bCounts=TRUE option) even if the >> peak caller identified peaks over there. >> >> Thanks in advance for the explanation! And once again, thank you for >> the nice tool you provided. >> >> Federico >> >> >> On 20.03.14 16:08, Rory Stark wrote: >>> Hi Frederico- >>> >>> Let me know when you are able to use a recent build of DiffBind to >>> see fi it works on your paord-end data. >>> >>> At some point I'll probably need access to at least a subset of the >>> experiment (eg using Dropbox) so I can debug any further problems >>> you may be having. >>> >>> Cheers- >>> Rory >>> >>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>> >>> Date: Tue, 4 Mar 2014 15:21:44 +0100 >>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>> Subject: Re: DiffBind and paired end data >>> >>> Dear Rory, >>> >>> Great to hear back from you. And even better to read that the >>> feature is going to be implemented soon. >>> >>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and >>> with DBA$config$singleEnd <- FALSE before calling the dba.count. >>> I got this warning message during the count operations: >>> >>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' >>> >>> # (8 times, I assume it is due to the 8 samples in total - >>> triplicates + respective input, 2 conditions) >>> >>> >>> The counting part still takes quite a very long time to perform. >>> When it was done, I sadly got this error message (sorry for some >>> parts in german): >>> >>> Fehler: Error processing one or more read files. Check warnings(). >>> Zusätzlich: Warnmeldungen: >>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, >>> mc.allow.recursive = TRUE) : >>> scheduled cores 2 encountered errors in user code, all values >>> of the jobs will be affected >>> 2: >>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der >>> Methodenauswahl >>> 4: >>> 5: >>> 6: >>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der >>> Methodenauswahl >>> 8: >>> >>> >>> The sessionInfo for this is as follows: >>> >>> sessionInfo() >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets >>> methods >>> [8] base >>> >>> other attached packages: >>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 >>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 >>> [7] BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 >>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 KernSmooth_2.23-10 >>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2 >>> [13] zlibbioc_1.8.0 >>> >>> >>> >>> Will you also plan to implement the recent featureCounts, too? This >>> could possibly speed up massively the operations, I guess. >>> >>> If I'd ask you from your hands-on experience, what is a "good" >>> number of DB sites? I assume this is not a one size fits all >>> situation, but just as an overall idea it could help to have some >>> numbers. >>> >>> Thank you again for replying, and I will get back to you as soon as >>> I have a better idea of what possibly went wrong. >>> >>> Best regards, >>> >>> Federico >>> >>> >>> >>> On 04.03.14 12:36, Rory Stark wrote: >>>> Hi Frederico- >>>> >>>> Looks like you've got it just about right you want to use the >>>> bUseSummarizedOverlaps option. For paired end, you don't need to >>>> set the insert length (changed to fragmentSize moving forward) as >>>> with paired end data the size of each fragment is known. >>>> >>>> I will add the option to set the fragments parameter (in >>>> DBA$config$fragments), it should appear in the development version >>>> 1.9.9. >>>> >>>> Do let me know if you have any problems wit this, as this is a >>>> scenario we'd really like to see work and can help debugging if >>>> there are issues. >>>> >>>> Cheers- >>>> Rory >>>> >>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>> >>>> Date: Mon, 24 Feb 2014 14:51:49 +0100 >>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>> Subject: DiffBind and paired end data >>>> >>>> Dear Dr. Stark, >>>> >>>> I had the pleasure to attend ("back then in 2013") to the >>>> presentation you gave at the EBI Advanced Course for RNA-seq and >>>> ChIP-seq, where you introduced DiffBind and its usage examples. >>>> >>>> Now I moved from IMB - Mainz and I am working in the Biostatistics >>>> department of the University of Medical Center in Mainz as a >>>> research associate, where I also have the chance of doing my PhD. >>>> >>>> I contact you regarding indeed DiffBind, and the possibility of >>>> doing differential binding analysis for ChIP-seq paired end data. >>>> As one of our collaborators recently produced such datasets, and >>>> they would like to investigate the aspect of changes in the binding >>>> for TFs and histone modifications, I thought the DiffBind framework >>>> would be a very solid solution for analysis. >>>> >>>> The doubt I have, is whether DiffBind can use the information of >>>> the paired end data, and how. Ideally I guess the best way would be >>>> this: properly paired reads will be counted just once for each of >>>> the ends, and singletons(/not properly paired) will also count one. >>>> >>>> I was following the discussions around >>>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, >>>> and it seems that by incorporating the summarizeOverlaps functions >>>> it would be (almost-)possible, but I would like to double check it >>>> with you. >>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to >>>> false. Is there anything else I should take into account (e.g. the >>>> insert length in this case would be meaningful?)? Is there also a >>>> parameter to use the "fragment" parameter set to true in the >>>> summarizeOverlaps function? >>>> >>>> Thank you very much in advance for the attention, and thank you >>>> again for the nice package on top of it! >>>> >>>> Best regards, >>>> >>>> Federico >>>> >>>> -- >>>> *Federico Marini, M.Sc.* >>>> Medizinische Biometrie >>>> _____________________________________________ >>>> UNIVERSITÄTSMEDIZIN >>>> der Johannes Gutenberg-Universität Mainz >>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>>> (IMBEI) >>>> Abteilung Medizinische Biometrie >>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>>> Zahlbacher Str. 69, 55131 Mainz >>>> www.imbei.uni-mainz.de >>>> Telefon +49 (0) 6131 17-7029 >>>> Telefax +49 (0) 6131 17-472433 >>>> E-Mail: federico.marini@unimedizin-mainz.de >>>> marinif@uni-mainz.de >>> >>> -- >>> *Federico Marini, M.Sc.* >>> Medizinische Biometrie >>> _____________________________________________ >>> UNIVERSITÄTSMEDIZIN >>> der Johannes Gutenberg-Universität Mainz >>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>> (IMBEI) >>> Abteilung Medizinische Biometrie >>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>> Zahlbacher Str. 69, 55131 Mainz >>> www.imbei.uni-mainz.de >>> Telefon +49 (0) 6131 17-7029 >>> Telefax +49 (0) 6131 17-472433 >>> E-Mail: federico.marini@unimedizin-mainz.de >>> marinif@uni-mainz.de >> >> -- >> *Federico Marini, M.Sc.* >> Medical Biostatistics >> _____________________________________________ >> University Medical Center of the Johannes Gutenberg University Mainz >> Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) >> Medical Biostatistics Department >> Postal address: 55101 Mainz >> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >> www.imbei.uni-mainz.de >> Phone +49 (0) 6131 17-7029 >> Fax +49 (0) 6131 17-472433 >> E-Mail: marinif@uni-mainz.de > > -- > *Federico Marini, M.Sc.* > Medical Biostatistics > _____________________________________________ > University Medical Center of the Johannes Gutenberg University Mainz > Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) > Medical Biostatistics Department > Postal address: 55101 Mainz > Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz > www.imbei.uni-mainz.de > Phone +49 (0) 6131 17-7029 > Fax +49 (0) 6131 17-472433 > E-Mail: marinif@uni-mainz.de -- *Federico Marini, M.Sc.* Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Federico Marini ▴ 180

0

Entering edit mode

Frederico- So the controls for wt_r1 and wt_r2 are the exact same, and likewise for KD_r1 and KD_r2? If the number of reads in the control is close to those in the ChIP, this could't account for the large differences in control reads you are seeing. Two things we can try to see of the control scaling is doing something unexpected: One is to turn of control scaling in the call to dba.count (bScaleControl=FALSE) -- then you should get identical counts for the shared controls. The other other thing I can think of to try is to make sure that we are getting the correct library size from genomicAlignment using countBam. The easiest way to check this is to examine the DBA object after the call to dba.count: > lib.sizes = DBA$class[ DiffBind:::PV_READS,] Will give you the total number of reads in each library. Check these numbers to see if they correspond to how many reads you think there should be, or if there are any big disparities. You are correct that after scaling the counts are rounded to given the DE package whole count numbers. In the report, the concentrations are reported as log2 values, which is why they can be very small or even negative. Finally, to retrieve the DESeq CountDataSet object: > DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata You'll need to run dba.analyze with bReduceObject=FALSE to get back a "complete" result set for use in DESeq. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Fri, 28 Mar 2014 14:28:15 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Subject: Re: DiffBind and paired end data Hi Rory, An update on the read count values. For your tamoxifen datasets it happens the following: head(controlReads) BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+ T47D.1+ T47D.2+ ZR75.1+ ZR75.2+ 1433 3 3 8 8 4 4 4 3 3 7 7 7 43 43 8 8 24 25 25 6 6 20 20 1606 13 13 2 2 19 19 19 8 8 21 21 156 33 33 4 4 6 6 6 8 8 8 8 1238 27 27 5 5 29 30 30 11 11 29 29 1446 3 3 2 2 3 3 3 1 1 4 4 Correctly each input has the same amount of reads. In our case we had one input for each condition and for each batch. Samples 1 and 2 for each condition were sequenced in batch 1, samples numbered as 3 were done in batch 2. head(dba_controlReads) wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3 4442 168 226 152 239 241 100 1464 696 934 471 592 596 289 2956 4293 5762 3538 4677 4711 2178 1292 1035 1390 704 1110 1118 482 5741 229 307 172 230 232 112 557 150 201 87 165 166 71 The numbers differ in every column, and this is somehow probably pointing to some strange behaviour of summarizeOverlaps. Would you have the same guess if you look at this lines? Cheers, Federico On 28.03.14 13:29, Federico Marini wrote: Hi Rory, Thanks for the detailed reply. On 27.03.14 20:14, Rory Stark wrote: Hi Frederico- Regarding normalizing to controls, the default behavior in DiffBind is to first get the number of reads in the ChIP, and the number of reads in the control, then scale the control reads if the control library was sequenced to higher depth. Then it subtracts the control reads from the ChIP reads (the _MINUS scoring options). Finally the read counts for each sample are normalised with the other samples (this depends which analysis method you use; the default is edgeR's TMM method, based on the total library size).This is not an overly- principled step, it just dampens some peaks where there are a high concentration of reads int he control. The assumption is that you used the controls for the peak calling peak callers like MACS model the background more elaborately. Read scores can't be negative. I guess that the rescaling, if performed, is then somehow rounded up, so that the amounts of reads are still discrete numbers - to feed for edgeR/DESeq. If it is not the case, please correct me. I was asking about details because I was trying to "do what DiffBind does" with single steps performed outside of the pipeline, i.e. counting in the enriched regions with featureCounts, importing the count table back in a semi-standard DESeq analysis (and there I did not know what to do for the input to be as close as possible to your procedure). One point I could not figure out of the vignette and also of the reference manual is the sequence of exact steps to pass the count matrix back to diffbind (to exploit the plot aids and so on). Could you provide me a pseudocode example for it? I got lost in the reference with the dba.peakset explanation. You can assume I have one vector of counts per each sample. If you want to see what is really going on, I suggest retrieving the report as a SummarizedExperiment and look through the appropriate assays to get the raw counts. For example, using the example set: > data(tamoxifen_analysis) > sumexp = dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) > assays(sumexp) List of length 5 names(5): scores RPKM Reads cRPKM cReads You can see the actual (non-normalized) read counts for the ChIP (min 1): > chipReads = assay(sumexp,3) And for the Control (min 1): > controlReads = assay(sumexp,5) And the subtracted values that are actually given to edgeR (or DESeq/DESeq2): > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) I tried this on your tamoxifen sample dataset, as well as the ones we have. A few additional information, the data we had was extremely deep sequenced (drosophila, sometimes more than 20M reads per replicate) - so I tried to downsample the data, and I did this scaling each dataset to 20M reads. So for this reason I think that the scaling for the input should not have that much of an impact. You can compare these counts with what you see in the browser for the same region for ChIP and control. If there is a big mismatch let me know and we can figure out why summarizeOverlaps isn't doing what we expect. It seems that the numbers are somehow correct - I did a few spot checks. The thing is, the concentration values look kind of strange to me. Indeed, for regions where I detected peaks in all samples (I used MACS2 with mostly default parameters) I see that the final concentration value is quite close to 0 or even negative. The readCountsForAnalysis object in my case has indeed quite some negative values, quite more than what you have in the same object derived from the tamoxifen dataset. Can this be to the fact that my merging adjacent peaks coming from different datasets the input amount gets "overestimated"? If so, do you have any suggestion to take in this case? Thanks again for the advices. Cheers, Federico Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Mar 2014 15:47:56 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Subject: Re: DiffBind and paired end data Hi Rory, I sorted the problem out, it apparently was a lack of memory in the old server. I could not think that summarizeOverlaps was so demanding on these very deeply sequenced samples. Still, I have one additional question for you. I am trying to understand what DiffBind is doing for the normalization with the control sample. Could you elaborate the concept a little more? Is the raw count subtracted/rounded and subtracted or something else? I am asking because I see that a lot of regions get quite low values in the final report (shown with the bCounts=TRUE option) even if the peak caller identified peaks over there. Thanks in advance for the explanation! And once again, thank you for the nice tool you provided. Federico On 20.03.14 16:08, Rory Stark wrote: Hi Frederico- Let me know when you are able to use a recent build of DiffBind to see fi it works on your paord-end data. At some point I'll probably need access to at least a subset of the experiment (eg using Dropbox) so I can debug any further problems you may be having. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Tue, 4 Mar 2014 15:21:44 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: Re: DiffBind and paired end data Dear Rory, Great to hear back from you. And even better to read that the feature is going to be implemented soon. So far I am trying it then with bUseSummarizeOverlaps = TRUE and with DBA$config$singleEnd <- FALSE before calling the dba.count. I got this warning message during the count operations: Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' # (8 times, I assume it is due to the 8 samples in total - triplicates + respective input, 2 conditions) The counting part still takes quite a very long time to perform. When it was done, I sadly got this error message (sorry for some parts in german): Fehler: Error processing one or more read files. Check warnings(). Zusätzlich: Warnmeldungen: 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) : scheduled cores 2 encountered errors in user code, all values of the jobs will be affected 2: 3: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 4: 5: 6: 7: Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl 8: The sessionInfo for this is as follows: sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 KernSmooth_2.23-10 [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2 [13] zlibbioc_1.8.0 Will you also plan to implement the recent featureCounts, too? This could possibly speed up massively the operations, I guess. If I'd ask you from your hands-on experience, what is a "good" number of DB sites? I assume this is not a one size fits all situation, but just as an overall idea it could help to have some numbers. Thank you again for replying, and I will get back to you as soon as I have a better idea of what possibly went wrong. Best regards, Federico On 04.03.14 12:36, Rory Stark wrote: Hi Frederico- Looks like you've got it just about right you want to use the bUseSummarizedOverlaps option. For paired end, you don't need to set the insert length (changed to fragmentSize moving forward) as with paired end data the size of each fragment is known. I will add the option to set the fragments parameter (in DBA$config$fragments), it should appear in the development version 1.9.9. Do let me know if you have any problems wit this, as this is a scenario we'd really like to see work and can help debugging if there are issues. Cheers- Rory From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">> Date: Mon, 24 Feb 2014 14:51:49 +0100 To: Rory Stark <rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>> Subject: DiffBind and paired end data Dear Dr. Stark, I had the pleasure to attend ("back then in 2013") to the presentation you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where you introduced DiffBind and its usage examples. Now I moved from IMB - Mainz and I am working in the Biostatistics department of the University of Medical Center in Mainz as a research associate, where I also have the chance of doing my PhD. I contact you regarding indeed DiffBind, and the possibility of doing differential binding analysis for ChIP-seq paired end data. As one of our collaborators recently produced such datasets, and they would like to investigate the aspect of changes in the binding for TFs and histone modifications, I thought the DiffBind framework would be a very solid solution for analysis. The doubt I have, is whether DiffBind can use the information of the paired end data, and how. Ideally I guess the best way would be this: properly paired reads will be counted just once for each of the ends, and singletons(/not properly paired) will also count one. I was following the discussions around https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and it seems that by incorporating the summarizeOverlaps functions it would be (almost-)possible, but I would like to double check it with you. bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to false. Is there anything else I should take into account (e.g. the insert length in this case would be meaningful?)? Is there also a parameter to use the "fragment" parameter set to true in the summarizeOverlaps function? Thank you very much in advance for the attention, and thank you again for the nice package on top of it! Best regards, Federico -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medizinische Biometrie _____________________________________________ UNIVERSITÄTSMEDIZIN der Johannes Gutenberg-Universität Mainz Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) Abteilung Medizinische Biometrie Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Telefon +49 (0) 6131 17-7029 Telefax +49 (0) 6131 17-472433 E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de=""> marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> -- Federico Marini, M.Sc. Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de=""> Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de> [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory, On 28.03.14 18:10, Rory Stark wrote: > Frederico- > > So the controls for wt_r1 and wt_r2 are the exact same, and likewise > for KD_r1 and KD_r2? If the number of reads in the control is close to > those in the ChIP, this could't account for the large differences in > control reads you are seeing. > The controls are indeed the very same. And in advance I already downsampled all samples actually down to about 20 mln uniquely mapping reads. So as you say, it seems kind of strange. > Two things we can try to see of the control scaling is doing something > unexpected: > > One is to turn of control scaling in the call to dba.count > (bScaleControl=FALSE) -- then you should get identical counts for the > shared controls. > > The other other thing I can think of to try is to make sure that we > are getting the correct library size from genomicAlignment using > countBam. The easiest way to check this is to examine the DBA object > after the call to dba.count: > > > lib.sizes = DBA$class[ DiffBind:::PV_READS,] > > Will give you the total number of reads in each library. Check these > numbers to see if they correspond to how many reads you think there > should be, or if there are any big disparities. > I am currently re-runnign the counting step, which is the most intensive on RAM to carry out. But the lib sizes seem to be recognized correctly - the numbers all range from 20 to 22 mln reads as they should (I think the size accounts for multi-mapping reads too). The samples for one batch of one factor had lower amount of mapped reads, which is mirrored by higher values in the size of the library. I will update you when the counting is over. > You are correct that after scaling the counts are rounded to given the > DE package whole count numbers. In the report, the concentrations are > reported as log2 values, which is why they can be very small or even > negative. > > Finally, to retrieve the DESeq CountDataSet object: > > > DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata > > You'll need to run dba.analyze with bReduceObject=FALSE to get back a > "complete" result set for use in DESeq. Thanks, great. So far then the concentrations are fine, but the concerns are still on for the reads. Would it make sense for you to use as said in the previous messages a tool such as featureCounts as a check? My question was pointing at how to re-import the count matrix into DiffBind, and as I said use the nice heatmaps/MA plots you already provide. I'll keep you posted! Best, Federico > > Cheers- > Rory > > > From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">> > Date: Fri, 28 Mar 2014 14:28:15 +0100 > To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>> > Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>" > <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">> > Subject: Re: DiffBind and paired end data > > Hi Rory, > > An update on the read count values. For your tamoxifen datasets it > happens the following: > > head(controlReads) > BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+ T47D.1+ > T47D.2+ ZR75.1+ ZR75.2+ > 1433 3 3 8 8 4 4 4 3 > 3 7 7 > 7 43 43 8 8 24 25 25 > 6 6 20 20 > 1606 13 13 2 2 19 19 19 > 8 8 21 21 > 156 33 33 4 4 6 6 6 8 > 8 8 8 > 1238 27 27 5 5 29 30 30 11 > 11 29 29 > 1446 3 3 2 2 3 3 3 1 > 1 4 4 > > Correctly each input has the same amount of reads. > > In our case we had one input for each condition and for each batch. > Samples 1 and 2 for each condition were sequenced in batch 1, samples > numbered as 3 were done in batch 2. > > head(dba_controlReads) > wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3 > 4442 168 226 152 239 > 241 100 > 1464 696 934 471 592 > 596 289 > 2956 4293 5762 3538 4677 > 4711 2178 > 1292 1035 1390 704 1110 > 1118 482 > 5741 229 307 172 230 > 232 112 > 557 150 201 87 165 > 166 71 > > The numbers differ in every column, and this is somehow probably > pointing to some strange behaviour of summarizeOverlaps. Would you > have the same guess if you look at this lines? > > Cheers, > Federico > > On 28.03.14 13:29, Federico Marini wrote: >> Hi Rory, >> >> Thanks for the detailed reply. >> >> On 27.03.14 20:14, Rory Stark wrote: >>> Hi Frederico- >>> >>> Regarding normalizing to controls, the default behavior in DiffBind >>> is to first get the number of reads in the ChIP, and the number of >>> reads in the control, then scale the control reads if the control >>> library was sequenced to higher depth. Then it subtracts the control >>> reads from the ChIP reads (the _MINUS scoring options). Finally the >>> read counts for each sample are normalised with the other samples >>> (this depends which analysis method you use; the default is >>> edgeR's TMM method, based on the total library size).This is not an >>> overly-principled step, it just dampens some peaks where there are a >>> high concentration of reads int he control. The assumption is >>> that you used the controls for the peak calling peak callers like >>> MACS model the background more elaborately. Read scores can't be >>> negative. >> I guess that the rescaling, if performed, is then somehow rounded up, >> so that the amounts of reads are still discrete numbers - to feed for >> edgeR/DESeq. If it is not the case, please correct me. I was asking >> about details because I was trying to "do what DiffBind does" with >> single steps performed outside of the pipeline, i.e. counting in the >> enriched regions with featureCounts, importing the count table back >> in a semi-standard DESeq analysis (and there I did not know what to >> do for the input to be as close as possible to your procedure). >> >> One point I could not figure out of the vignette and also of the >> reference manual is the sequence of exact steps to pass the count >> matrix back to diffbind (to exploit the plot aids and so on). Could >> you provide me a pseudocode example for it? I got lost in the >> reference with the dba.peakset explanation. You can assume I have one >> vector of counts per each sample. >>> >>> If you want to see what is really going on, I suggest retrieving the >>> report as a SummarizedExperiment and look through the appropriate >>> assays to get the raw counts. For example, using the example set: >>> >>> > data(tamoxifen_analysis) >>> > sumexp = dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) >>> > assays(sumexp) >>> List of length 5 >>> names(5): scores RPKM Reads cRPKM cReads >>> >>> You can see the actual (non-normalized) read counts for the ChIP >>> (min 1): >>> > chipReads = assay(sumexp,3) >>> >>> And for the Control (min 1): >>> > controlReads = assay(sumexp,5) >>> >>> And the subtracted values that are actually given to edgeR (or >>> DESeq/DESeq2): >>> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) >> I tried this on your tamoxifen sample dataset, as well as the ones we >> have. A few additional information, the data we had was extremely >> deep sequenced (drosophila, sometimes more than 20M reads per >> replicate) - so I tried to downsample the data, and I did this >> scaling each dataset to 20M reads. So for this reason I think that >> the scaling for the input should not have that much of an impact. >>> >>> You can compare these counts with what you see in the browser for >>> the same region for ChIP and control. If there is a big mismatch let >>> me know and we can figure out why summarizeOverlaps isn't doing what >>> we expect. >>> >> It seems that the numbers are somehow correct - I did a few spot >> checks. The thing is, the concentration values look kind of strange >> to me. Indeed, for regions where I detected peaks in all samples (I >> used MACS2 with mostly default parameters) I see that the final >> concentration value is quite close to 0 or even negative. The >> readCountsForAnalysis object in my case has indeed quite some >> negative values, quite more than what you have in the same object >> derived from the tamoxifen dataset. >> Can this be to the fact that my merging adjacent peaks coming from >> different datasets the input amount gets "overestimated"? If so, do >> you have any suggestion to take in this case? >> >> Thanks again for the advices. >> >> Cheers, >> Federico >>> Cheers- >>> Rory >>> >>> >>> >>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>> >>> Date: Mon, 24 Mar 2014 15:47:56 +0100 >>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>> Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>" >>> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">> >>> Subject: Re: DiffBind and paired end data >>> >>> Hi Rory, >>> >>> I sorted the problem out, it apparently was a lack of memory in the >>> old server. I could not think that summarizeOverlaps was so >>> demanding on these very deeply sequenced samples. >>> >>> Still, I have one additional question for you. >>> I am trying to understand what DiffBind is doing for the >>> normalization with the control sample. Could you elaborate the >>> concept a little more? Is the raw count subtracted/rounded and >>> subtracted or something else? >>> I am asking because I see that a lot of regions get quite low values >>> in the final report (shown with the bCounts=TRUE option) even if the >>> peak caller identified peaks over there. >>> >>> Thanks in advance for the explanation! And once again, thank you for >>> the nice tool you provided. >>> >>> Federico >>> >>> >>> On 20.03.14 16:08, Rory Stark wrote: >>>> Hi Frederico- >>>> >>>> Let me know when you are able to use a recent build of DiffBind to >>>> see fi it works on your paord-end data. >>>> >>>> At some point I'll probably need access to at least a subset of the >>>> experiment (eg using Dropbox) so I can debug any further problems >>>> you may be having. >>>> >>>> Cheers- >>>> Rory >>>> >>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>> >>>> Date: Tue, 4 Mar 2014 15:21:44 +0100 >>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>> Subject: Re: DiffBind and paired end data >>>> >>>> Dear Rory, >>>> >>>> Great to hear back from you. And even better to read that the >>>> feature is going to be implemented soon. >>>> >>>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and >>>> with DBA$config$singleEnd <- FALSE before calling the dba.count. >>>> I got this warning message during the count operations: >>>> >>>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' >>>> >>>> # (8 times, I assume it is due to the 8 samples in total - >>>> triplicates + respective input, 2 conditions) >>>> >>>> >>>> The counting part still takes quite a very long time to perform. >>>> When it was done, I sadly got this error message (sorry for some >>>> parts in german): >>>> >>>> Fehler: Error processing one or more read files. Check warnings(). >>>> Zusätzlich: Warnmeldungen: >>>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, >>>> mc.allow.recursive = TRUE) : >>>> scheduled cores 2 encountered errors in user code, all values >>>> of the jobs will be affected >>>> 2: >>>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der >>>> Methodenauswahl >>>> 4: >>>> 5: >>>> 6: >>>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der >>>> Methodenauswahl >>>> 8: >>>> >>>> >>>> The sessionInfo for this is as follows: >>>> >>>> sessionInfo() >>>> R version 3.0.2 (2013-09-25) >>>> Platform: x86_64-pc-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets >>>> methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 >>>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 >>>> [7] BiocGenerics_0.8.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 >>>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 >>>> KernSmooth_2.23-10 >>>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2 >>>> [13] zlibbioc_1.8.0 >>>> >>>> >>>> >>>> Will you also plan to implement the recent featureCounts, too? This >>>> could possibly speed up massively the operations, I guess. >>>> >>>> If I'd ask you from your hands-on experience, what is a "good" >>>> number of DB sites? I assume this is not a one size fits all >>>> situation, but just as an overall idea it could help to have some >>>> numbers. >>>> >>>> Thank you again for replying, and I will get back to you as soon as >>>> I have a better idea of what possibly went wrong. >>>> >>>> Best regards, >>>> >>>> Federico >>>> >>>> >>>> >>>> On 04.03.14 12:36, Rory Stark wrote: >>>>> Hi Frederico- >>>>> >>>>> Looks like you've got it just about right you want to use the >>>>> bUseSummarizedOverlaps option. For paired end, you don't need to >>>>> set the insert length (changed to fragmentSize moving forward) as >>>>> with paired end data the size of each fragment is known. >>>>> >>>>> I will add the option to set the fragments parameter (in >>>>> DBA$config$fragments), it should appear in the development version >>>>> 1.9.9. >>>>> >>>>> Do let me know if you have any problems wit this, as this is a >>>>> scenario we'd really like to see work and can help debugging if >>>>> there are issues. >>>>> >>>>> Cheers- >>>>> Rory >>>>> >>>>> From: Federico Marini <marinif@uni-mainz.de>>>>> <mailto:marinif@uni-mainz.de>> >>>>> Date: Mon, 24 Feb 2014 14:51:49 +0100 >>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>>> Subject: DiffBind and paired end data >>>>> >>>>> Dear Dr. Stark, >>>>> >>>>> I had the pleasure to attend ("back then in 2013") to the >>>>> presentation you gave at the EBI Advanced Course for RNA-seq and >>>>> ChIP-seq, where you introduced DiffBind and its usage examples. >>>>> >>>>> Now I moved from IMB - Mainz and I am working in the Biostatistics >>>>> department of the University of Medical Center in Mainz as a >>>>> research associate, where I also have the chance of doing my PhD. >>>>> >>>>> I contact you regarding indeed DiffBind, and the possibility of >>>>> doing differential binding analysis for ChIP-seq paired end data. >>>>> As one of our collaborators recently produced such datasets, and >>>>> they would like to investigate the aspect of changes in the >>>>> binding for TFs and histone modifications, I thought the DiffBind >>>>> framework would be a very solid solution for analysis. >>>>> >>>>> The doubt I have, is whether DiffBind can use the information of >>>>> the paired end data, and how. Ideally I guess the best way would >>>>> be this: properly paired reads will be counted just once for each >>>>> of the ends, and singletons(/not properly paired) will also count one. >>>>> >>>>> I was following the discussions around >>>>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, >>>>> and it seems that by incorporating the summarizeOverlaps functions >>>>> it would be (almost-)possible, but I would like to double check it >>>>> with you. >>>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd >>>>> to false. Is there anything else I should take into account (e.g. >>>>> the insert length in this case would be meaningful?)? Is there >>>>> also a parameter to use the "fragment" parameter set to true in >>>>> the summarizeOverlaps function? >>>>> >>>>> Thank you very much in advance for the attention, and thank you >>>>> again for the nice package on top of it! >>>>> >>>>> Best regards, >>>>> >>>>> Federico >>>>> >>>>> -- >>>>> *Federico Marini, M.Sc.* >>>>> Medizinische Biometrie >>>>> _____________________________________________ >>>>> UNIVERSITÄTSMEDIZIN >>>>> der Johannes Gutenberg-Universität Mainz >>>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>>>> (IMBEI) >>>>> Abteilung Medizinische Biometrie >>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>>>> Zahlbacher Str. 69, 55131 Mainz >>>>> www.imbei.uni-mainz.de >>>>> Telefon +49 (0) 6131 17-7029 >>>>> Telefax +49 (0) 6131 17-472433 >>>>> E-Mail: federico.marini@unimedizin-mainz.de >>>>> marinif@uni-mainz.de >>>> >>>> -- >>>> *Federico Marini, M.Sc.* >>>> Medizinische Biometrie >>>> _____________________________________________ >>>> UNIVERSITÄTSMEDIZIN >>>> der Johannes Gutenberg-Universität Mainz >>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>>> (IMBEI) >>>> Abteilung Medizinische Biometrie >>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>>> Zahlbacher Str. 69, 55131 Mainz >>>> www.imbei.uni-mainz.de >>>> Telefon +49 (0) 6131 17-7029 >>>> Telefax +49 (0) 6131 17-472433 >>>> E-Mail: federico.marini@unimedizin-mainz.de >>>> marinif@uni-mainz.de >>> >>> -- >>> *Federico Marini, M.Sc.* >>> Medical Biostatistics >>> _____________________________________________ >>> University Medical Center of the Johannes Gutenberg University Mainz >>> Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) >>> Medical Biostatistics Department >>> Postal address: 55101 Mainz >>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >>> www.imbei.uni-mainz.de >>> Phone +49 (0) 6131 17-7029 >>> Fax +49 (0) 6131 17-472433 >>> E-Mail: marinif@uni-mainz.de >> >> -- >> *Federico Marini, M.Sc.* >> Medical Biostatistics >> _____________________________________________ >> University Medical Center of the Johannes Gutenberg University Mainz >> Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) >> Medical Biostatistics Department >> Postal address: 55101 Mainz >> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >> www.imbei.uni-mainz.de >> Phone +49 (0) 6131 17-7029 >> Fax +49 (0) 6131 17-472433 >> E-Mail: marinif@uni-mainz.de > > -- > *Federico Marini, M.Sc.* > Medical Biostatistics > _____________________________________________ > University Medical Center of the Johannes Gutenberg University Mainz > Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) > Medical Biostatistics Department > Postal address: 55101 Mainz > Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz > www.imbei.uni-mainz.de > Phone +49 (0) 6131 17-7029 > Fax +49 (0) 6131 17-472433 > E-Mail: marinif@uni-mainz.de -- *Federico Marini, M.Sc.* Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Federico Marini ▴ 180

0

Entering edit mode

Dear Rory, just a brief update: The counting with bScaleControl is returning pretty much the same results as without it - which can be correct, as the sizes are somehow comparable. As I wrote you in the former email, I do have the samples from batch 2 for RNA PolII that differ in terms of % of mapped reads. But nevertheless I get a very comparable amount of counts (checked afterwards with the chipReads and controlReads variables). This means it should not be due to the scaling. I am doing the same checks for the other ChIPped material and it seems the strange pattern is not there for them. The one thing that is still consistently there is the relative high number of negative read values in the readCountsForAnalysis object, created with the subtraction of the scaled/unscaled counts. I hope this helps somehow in finding out what can be wrong. From my side I will give the manual approach a try soon, so that it can be found whether summarizeOverlaps is behaving not as expected. Cheers, Federico On 31.03.14 10:45, Federico Marini wrote: > Hi Rory, > > On 28.03.14 18:10, Rory Stark wrote: >> Frederico- >> >> So the controls for wt_r1 and wt_r2 are the exact same, and likewise >> for KD_r1 and KD_r2? If the number of reads in the control is close >> to those in the ChIP, this could't account for the large differences >> in control reads you are seeing. >> > The controls are indeed the very same. And in advance I already > downsampled all samples actually down to about 20 mln uniquely mapping > reads. So as you say, it seems kind of strange. >> Two things we can try to see of the control scaling is doing >> something unexpected: >> >> One is to turn of control scaling in the call to dba.count >> (bScaleControl=FALSE) -- then you should get identical counts for the >> shared controls. >> >> The other other thing I can think of to try is to make sure that we >> are getting the correct library size from genomicAlignment using >> countBam. The easiest way to check this is to examine the DBA object >> after the call to dba.count: >> >> > lib.sizes = DBA$class[ DiffBind:::PV_READS,] >> >> Will give you the total number of reads in each library. Check these >> numbers to see if they correspond to how many reads you think there >> should be, or if there are any big disparities. >> > I am currently re-runnign the counting step, which is the most > intensive on RAM to carry out. But the lib sizes seem to be recognized > correctly - the numbers all range from 20 to 22 mln reads as they > should (I think the size accounts for multi-mapping reads too). The > samples for one batch of one factor had lower amount of mapped reads, > which is mirrored by higher values in the size of the library. I will > update you when the counting is over. >> You are correct that after scaling the counts are rounded to given >> the DE package whole count numbers. In the report, the concentrations >> are reported as log2 values, which is why they can be very small or >> even negative. >> >> Finally, to retrieve the DESeq CountDataSet object: >> >> > DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata >> >> You'll need to run dba.analyze with bReduceObject=FALSE to get back a >> "complete" result set for use in DESeq. > Thanks, great. So far then the concentrations are fine, but the > concerns are still on for the reads. > Would it make sense for you to use as said in the previous messages a > tool such as featureCounts as a check? My question was pointing at how > to re-import the count matrix into DiffBind, and as I said use the > nice heatmaps/MA plots you already provide. > > I'll keep you posted! > > Best, > > Federico >> >> Cheers- >> Rory >> >> >> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>> >> Date: Fri, 28 Mar 2014 14:28:15 +0100 >> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>> >> Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>" >> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">> >> Subject: Re: DiffBind and paired end data >> >> Hi Rory, >> >> An update on the read count values. For your tamoxifen datasets it >> happens the following: >> >> head(controlReads) >> BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+ >> T47D.1+ T47D.2+ ZR75.1+ ZR75.2+ >> 1433 3 3 8 8 4 4 4 >> 3 3 7 7 >> 7 43 43 8 8 24 25 25 >> 6 6 20 20 >> 1606 13 13 2 2 19 19 19 >> 8 8 21 21 >> 156 33 33 4 4 6 6 6 >> 8 8 8 8 >> 1238 27 27 5 5 29 30 30 >> 11 11 29 29 >> 1446 3 3 2 2 3 3 3 >> 1 1 4 4 >> >> Correctly each input has the same amount of reads. >> >> In our case we had one input for each condition and for each batch. >> Samples 1 and 2 for each condition were sequenced in batch 1, samples >> numbered as 3 were done in batch 2. >> >> head(dba_controlReads) >> wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3 >> 4442 168 226 152 239 >> 241 100 >> 1464 696 934 471 592 >> 596 289 >> 2956 4293 5762 3538 4677 >> 4711 2178 >> 1292 1035 1390 704 1110 >> 1118 482 >> 5741 229 307 172 230 >> 232 112 >> 557 150 201 87 165 >> 166 71 >> >> The numbers differ in every column, and this is somehow probably >> pointing to some strange behaviour of summarizeOverlaps. Would you >> have the same guess if you look at this lines? >> >> Cheers, >> Federico >> >> On 28.03.14 13:29, Federico Marini wrote: >>> Hi Rory, >>> >>> Thanks for the detailed reply. >>> >>> On 27.03.14 20:14, Rory Stark wrote: >>>> Hi Frederico- >>>> >>>> Regarding normalizing to controls, the default behavior in DiffBind >>>> is to first get the number of reads in the ChIP, and the number of >>>> reads in the control, then scale the control reads if the control >>>> library was sequenced to higher depth. Then it subtracts the >>>> control reads from the ChIP reads (the _MINUS scoring options). >>>> Finally the read counts for each sample are normalised with the >>>> other samples (this depends which analysis method you use; the >>>> default is edgeR's TMM method, based on the total library >>>> size).This is not an overly-principled step, it just dampens some >>>> peaks where there are a high concentration of reads int he control. >>>> The assumption is that you used the controls for the peak calling >>>> peak callers like MACS model the background more elaborately. Read >>>> scores can't be negative. >>> I guess that the rescaling, if performed, is then somehow rounded >>> up, so that the amounts of reads are still discrete numbers - to >>> feed for edgeR/DESeq. If it is not the case, please correct me. I >>> was asking about details because I was trying to "do what DiffBind >>> does" with single steps performed outside of the pipeline, i.e. >>> counting in the enriched regions with featureCounts, importing the >>> count table back in a semi-standard DESeq analysis (and there I did >>> not know what to do for the input to be as close as possible to your >>> procedure). >>> >>> One point I could not figure out of the vignette and also of the >>> reference manual is the sequence of exact steps to pass the count >>> matrix back to diffbind (to exploit the plot aids and so on). Could >>> you provide me a pseudocode example for it? I got lost in the >>> reference with the dba.peakset explanation. You can assume I have >>> one vector of counts per each sample. >>>> >>>> If you want to see what is really going on, I suggest retrieving >>>> the report as a SummarizedExperiment and look through the >>>> appropriate assays to get the raw counts. For example, using the >>>> example set: >>>> >>>> > data(tamoxifen_analysis) >>>> > sumexp = dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT) >>>> > assays(sumexp) >>>> List of length 5 >>>> names(5): scores RPKM Reads cRPKM cReads >>>> >>>> You can see the actual (non-normalized) read counts for the ChIP >>>> (min 1): >>>> > chipReads = assay(sumexp,3) >>>> >>>> And for the Control (min 1): >>>> > controlReads = assay(sumexp,5) >>>> >>>> And the subtracted values that are actually given to edgeR (or >>>> DESeq/DESeq2): >>>> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5) >>> I tried this on your tamoxifen sample dataset, as well as the ones >>> we have. A few additional information, the data we had was extremely >>> deep sequenced (drosophila, sometimes more than 20M reads per >>> replicate) - so I tried to downsample the data, and I did this >>> scaling each dataset to 20M reads. So for this reason I think that >>> the scaling for the input should not have that much of an impact. >>>> >>>> You can compare these counts with what you see in the browser for >>>> the same region for ChIP and control. If there is a big mismatch >>>> let me know and we can figure out why summarizeOverlaps isn't doing >>>> what we expect. >>>> >>> It seems that the numbers are somehow correct - I did a few spot >>> checks. The thing is, the concentration values look kind of strange >>> to me. Indeed, for regions where I detected peaks in all samples (I >>> used MACS2 with mostly default parameters) I see that the final >>> concentration value is quite close to 0 or even negative. The >>> readCountsForAnalysis object in my case has indeed quite some >>> negative values, quite more than what you have in the same object >>> derived from the tamoxifen dataset. >>> Can this be to the fact that my merging adjacent peaks coming from >>> different datasets the input amount gets "overestimated"? If so, do >>> you have any suggestion to take in this case? >>> >>> Thanks again for the advices. >>> >>> Cheers, >>> Federico >>>> Cheers- >>>> Rory >>>> >>>> >>>> >>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>> >>>> Date: Mon, 24 Mar 2014 15:47:56 +0100 >>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>> Cc: "bioconductor@r-project.org >>>> <mailto:bioconductor@r-project.org>" <bioconductor@r-project.org>>>> <mailto:bioconductor@r-project.org>> >>>> Subject: Re: DiffBind and paired end data >>>> >>>> Hi Rory, >>>> >>>> I sorted the problem out, it apparently was a lack of memory in the >>>> old server. I could not think that summarizeOverlaps was so >>>> demanding on these very deeply sequenced samples. >>>> >>>> Still, I have one additional question for you. >>>> I am trying to understand what DiffBind is doing for the >>>> normalization with the control sample. Could you elaborate the >>>> concept a little more? Is the raw count subtracted/rounded and >>>> subtracted or something else? >>>> I am asking because I see that a lot of regions get quite low >>>> values in the final report (shown with the bCounts=TRUE option) >>>> even if the peak caller identified peaks over there. >>>> >>>> Thanks in advance for the explanation! And once again, thank you >>>> for the nice tool you provided. >>>> >>>> Federico >>>> >>>> >>>> On 20.03.14 16:08, Rory Stark wrote: >>>>> Hi Frederico- >>>>> >>>>> Let me know when you are able to use a recent build of DiffBind to >>>>> see fi it works on your paord-end data. >>>>> >>>>> At some point I'll probably need access to at least a subset of >>>>> the experiment (eg using Dropbox) so I can debug any further >>>>> problems you may be having. >>>>> >>>>> Cheers- >>>>> Rory >>>>> >>>>> From: Federico Marini <marinif@uni-mainz.de>>>>> <mailto:marinif@uni-mainz.de>> >>>>> Date: Tue, 4 Mar 2014 15:21:44 +0100 >>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>>> Subject: Re: DiffBind and paired end data >>>>> >>>>> Dear Rory, >>>>> >>>>> Great to hear back from you. And even better to read that the >>>>> feature is going to be implemented soon. >>>>> >>>>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and >>>>> with DBA$config$singleEnd <- FALSE before calling the dba.count. >>>>> I got this warning message during the count operations: >>>>> >>>>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA' >>>>> >>>>> # (8 times, I assume it is due to the 8 samples in total - >>>>> triplicates + respective input, 2 conditions) >>>>> >>>>> >>>>> The counting part still takes quite a very long time to perform. >>>>> When it was done, I sadly got this error message (sorry for some >>>>> parts in german): >>>>> >>>>> Fehler: Error processing one or more read files. Check warnings(). >>>>> Zusätzlich: Warnmeldungen: >>>>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, >>>>> mc.allow.recursive = TRUE) : >>>>> scheduled cores 2 encountered errors in user code, all >>>>> values of the jobs will be affected >>>>> 2: >>>>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der >>>>> Methodenauswahl >>>>> 4: >>>>> 5: >>>>> 6: >>>>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der >>>>> Methodenauswahl >>>>> 8: >>>>> >>>>> >>>>> The sessionInfo for this is as follows: >>>>> >>>>> sessionInfo() >>>>> R version 3.0.2 (2013-09-25) >>>>> Platform: x86_64-pc-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>> [8] base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4 >>>>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 >>>>> [7] BiocGenerics_0.8.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2 >>>>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0 KernSmooth_2.23-10 >>>>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 >>>>> tools_3.0.2 >>>>> [13] zlibbioc_1.8.0 >>>>> >>>>> >>>>> >>>>> Will you also plan to implement the recent featureCounts, too? >>>>> This could possibly speed up massively the operations, I guess. >>>>> >>>>> If I'd ask you from your hands-on experience, what is a "good" >>>>> number of DB sites? I assume this is not a one size fits all >>>>> situation, but just as an overall idea it could help to have some >>>>> numbers. >>>>> >>>>> Thank you again for replying, and I will get back to you as soon >>>>> as I have a better idea of what possibly went wrong. >>>>> >>>>> Best regards, >>>>> >>>>> Federico >>>>> >>>>> >>>>> >>>>> On 04.03.14 12:36, Rory Stark wrote: >>>>>> Hi Frederico- >>>>>> >>>>>> Looks like you've got it just about right you want to use the >>>>>> bUseSummarizedOverlaps option. For paired end, you don't need to >>>>>> set the insert length (changed to fragmentSize moving forward) as >>>>>> with paired end data the size of each fragment is known. >>>>>> >>>>>> I will add the option to set the fragments parameter (in >>>>>> DBA$config$fragments), it should appear in the development >>>>>> version 1.9.9. >>>>>> >>>>>> Do let me know if you have any problems wit this, as this is a >>>>>> scenario we'd really like to see work and can help debugging if >>>>>> there are issues. >>>>>> >>>>>> Cheers- >>>>>> Rory >>>>>> >>>>>> From: Federico Marini <marinif@uni-mainz.de>>>>>> <mailto:marinif@uni-mainz.de>> >>>>>> Date: Mon, 24 Feb 2014 14:51:49 +0100 >>>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>>> <mailto:rory.stark@cruk.cam.ac.uk>> >>>>>> Subject: DiffBind and paired end data >>>>>> >>>>>> Dear Dr. Stark, >>>>>> >>>>>> I had the pleasure to attend ("back then in 2013") to the >>>>>> presentation you gave at the EBI Advanced Course for RNA-seq and >>>>>> ChIP-seq, where you introduced DiffBind and its usage examples. >>>>>> >>>>>> Now I moved from IMB - Mainz and I am working in the >>>>>> Biostatistics department of the University of Medical Center in >>>>>> Mainz as a research associate, where I also have the chance of >>>>>> doing my PhD. >>>>>> >>>>>> I contact you regarding indeed DiffBind, and the possibility of >>>>>> doing differential binding analysis for ChIP-seq paired end data. >>>>>> As one of our collaborators recently produced such datasets, and >>>>>> they would like to investigate the aspect of changes in the >>>>>> binding for TFs and histone modifications, I thought the DiffBind >>>>>> framework would be a very solid solution for analysis. >>>>>> >>>>>> The doubt I have, is whether DiffBind can use the information of >>>>>> the paired end data, and how. Ideally I guess the best way would >>>>>> be this: properly paired reads will be counted just once for each >>>>>> of the ends, and singletons(/not properly paired) will also count >>>>>> one. >>>>>> >>>>>> I was following the discussions around >>>>>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and >>>>>> it seems that by incorporating the summarizeOverlaps functions it >>>>>> would be (almost-)possible, but I would like to double check it >>>>>> with you. >>>>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd >>>>>> to false. Is there anything else I should take into account (e.g. >>>>>> the insert length in this case would be meaningful?)? Is there >>>>>> also a parameter to use the "fragment" parameter set to true in >>>>>> the summarizeOverlaps function? >>>>>> >>>>>> Thank you very much in advance for the attention, and thank you >>>>>> again for the nice package on top of it! >>>>>> >>>>>> Best regards, >>>>>> >>>>>> Federico >>>>>> >>>>>> -- >>>>>> *Federico Marini, M.Sc.* >>>>>> Medizinische Biometrie >>>>>> _____________________________________________ >>>>>> UNIVERSITÄTSMEDIZIN >>>>>> der Johannes Gutenberg-Universität Mainz >>>>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>>>>> (IMBEI) >>>>>> Abteilung Medizinische Biometrie >>>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>>>>> Zahlbacher Str. 69, 55131 Mainz >>>>>> www.imbei.uni-mainz.de >>>>>> Telefon +49 (0) 6131 17-7029 >>>>>> Telefax +49 (0) 6131 17-472433 >>>>>> E-Mail: federico.marini@unimedizin-mainz.de >>>>>> marinif@uni-mainz.de >>>>> >>>>> -- >>>>> *Federico Marini, M.Sc.* >>>>> Medizinische Biometrie >>>>> _____________________________________________ >>>>> UNIVERSITÄTSMEDIZIN >>>>> der Johannes Gutenberg-Universität Mainz >>>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik >>>>> (IMBEI) >>>>> Abteilung Medizinische Biometrie >>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere >>>>> Zahlbacher Str. 69, 55131 Mainz >>>>> www.imbei.uni-mainz.de >>>>> Telefon +49 (0) 6131 17-7029 >>>>> Telefax +49 (0) 6131 17-472433 >>>>> E-Mail: federico.marini@unimedizin-mainz.de >>>>> marinif@uni-mainz.de >>>> >>>> -- >>>> *Federico Marini, M.Sc.* >>>> Medical Biostatistics >>>> _____________________________________________ >>>> University Medical Center of the Johannes Gutenberg University Mainz >>>> Institute of Medical Biostatistics, Epidemiology and Informatics >>>> (IMBEI) >>>> Medical Biostatistics Department >>>> Postal address: 55101 Mainz >>>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >>>> www.imbei.uni-mainz.de >>>> Phone +49 (0) 6131 17-7029 >>>> Fax +49 (0) 6131 17-472433 >>>> E-Mail: marinif@uni-mainz.de >>> >>> -- >>> *Federico Marini, M.Sc.* >>> Medical Biostatistics >>> _____________________________________________ >>> University Medical Center of the Johannes Gutenberg University Mainz >>> Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) >>> Medical Biostatistics Department >>> Postal address: 55101 Mainz >>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >>> www.imbei.uni-mainz.de >>> Phone +49 (0) 6131 17-7029 >>> Fax +49 (0) 6131 17-472433 >>> E-Mail: marinif@uni-mainz.de >> >> -- >> *Federico Marini, M.Sc.* >> Medical Biostatistics >> _____________________________________________ >> University Medical Center of the Johannes Gutenberg University Mainz >> Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) >> Medical Biostatistics Department >> Postal address: 55101 Mainz >> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz >> www.imbei.uni-mainz.de >> Phone +49 (0) 6131 17-7029 >> Fax +49 (0) 6131 17-472433 >> E-Mail: marinif@uni-mainz.de > > -- > *Federico Marini, M.Sc.* > Medical Biostatistics > _____________________________________________ > University Medical Center of the Johannes Gutenberg University Mainz > Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) > Medical Biostatistics Department > Postal address: 55101 Mainz > Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz > www.imbei.uni-mainz.de > Phone +49 (0) 6131 17-7029 > Fax +49 (0) 6131 17-472433 > E-Mail: marinif@uni-mainz.de -- *Federico Marini, M.Sc.* Medical Biostatistics _____________________________________________ University Medical Center of the Johannes Gutenberg University Mainz Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) Medical Biostatistics Department Postal address: 55101 Mainz Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz www.imbei.uni-mainz.de Phone +49 (0) 6131 17-7029 Fax +49 (0) 6131 17-472433 E-Mail: marinif@uni-mainz.de [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Federico Marini ▴ 180