Hi Frederico-
Looks like you've got it just about right you want to use the
bUseSummarizedOverlaps option. For paired end, you don't need to set
the insert length (changed to fragmentSize moving forward) as with
paired end data the size of each fragment is known.
I will add the option to set the fragments parameter (in
DBA$config$fragments), it should appear in the development version
1.9.9.
Do let me know if you have any problems wit this, as this is a
scenario we'd really like to see work and can help debugging if there
are issues.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Feb 2014 14:51:49 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: DiffBind and paired end data
Dear Dr. Stark,
I had the pleasure to attend ("back then in 2013") to the presentation
you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where
you introduced DiffBind and its usage examples.
Now I moved from IMB - Mainz and I am working in the Biostatistics
department of the University of Medical Center in Mainz as a research
associate, where I also have the chance of doing my PhD.
I contact you regarding indeed DiffBind, and the possibility of doing
differential binding analysis for ChIP-seq paired end data. As one of
our collaborators recently produced such datasets, and they would like
to investigate the aspect of changes in the binding for TFs and
histone modifications, I thought the DiffBind framework would be a
very solid solution for analysis.
The doubt I have, is whether DiffBind can use the information of the
paired end data, and how. Ideally I guess the best way would be this:
properly paired reads will be counted just once for each of the ends,
and singletons(/not properly paired) will also count one.
I was following the discussions around
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and
it seems that by incorporating the summarizeOverlaps functions it
would be (almost-)possible, but I would like to double check it with
you.
bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to
false. Is there anything else I should take into account (e.g. the
insert length in this case would be meaningful?)? Is there also a
parameter to use the "fragment" parameter set to true in the
summarizeOverlaps function?
Thank you very much in advance for the attention, and thank you again
for the nice package on top of it!
Best regards,
Federico
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
[[alternative HTML version deleted]]
Hi Frederico-
Let me know when you are able to use a recent build of DiffBind to see
fi it works on your paord-end data.
At some point I'll probably need access to at least a subset of the
experiment (eg using Dropbox) so I can debug any further problems you
may be having.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Tue, 4 Mar 2014 15:21:44 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: Re: DiffBind and paired end data
Dear Rory,
Great to hear back from you. And even better to read that the feature
is going to be implemented soon.
So far I am trying it then with bUseSummarizeOverlaps = TRUE and with
DBA$config$singleEnd <- FALSE before calling the dba.count.
I got this warning message during the count operations:
Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
# (8 times, I assume it is due to the 8 samples in total - triplicates
+ respective input, 2 conditions)
The counting part still takes quite a very long time to perform. When
it was done, I sadly got this error message (sorry for some parts in
german):
Fehler: Error processing one or more read files. Check warnings().
Zusätzlich: Warnmeldungen:
1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
mc.allow.recursive = TRUE) :
scheduled cores 2 encountered errors in user code, all values of the
jobs will be affected
2:
3: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
4:
5:
6:
7: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
8:
The sessionInfo for this is as follows:
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] amap_0.8-11 bitops_1.0-6 caTools_1.16
edgeR_3.4.2
[5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
KernSmooth_2.23-10
[9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
tools_3.0.2
[13] zlibbioc_1.8.0
Will you also plan to implement the recent featureCounts, too? This
could possibly speed up massively the operations, I guess.
If I'd ask you from your hands-on experience, what is a "good" number
of DB sites? I assume this is not a one size fits all situation, but
just as an overall idea it could help to have some numbers.
Thank you again for replying, and I will get back to you as soon as I
have a better idea of what possibly went wrong.
Best regards,
Federico
On 04.03.14 12:36, Rory Stark wrote:
Hi Frederico-
Looks like you've got it just about right you want to use the
bUseSummarizedOverlaps option. For paired end, you don't need to set
the insert length (changed to fragmentSize moving forward) as with
paired end data the size of each fragment is known.
I will add the option to set the fragments parameter (in
DBA$config$fragments), it should appear in the development version
1.9.9.
Do let me know if you have any problems wit this, as this is a
scenario we'd really like to see work and can help debugging if there
are issues.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Feb 2014 14:51:49 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: DiffBind and paired end data
Dear Dr. Stark,
I had the pleasure to attend ("back then in 2013") to the presentation
you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where
you introduced DiffBind and its usage examples.
Now I moved from IMB - Mainz and I am working in the Biostatistics
department of the University of Medical Center in Mainz as a research
associate, where I also have the chance of doing my PhD.
I contact you regarding indeed DiffBind, and the possibility of doing
differential binding analysis for ChIP-seq paired end data. As one of
our collaborators recently produced such datasets, and they would like
to investigate the aspect of changes in the binding for TFs and
histone modifications, I thought the DiffBind framework would be a
very solid solution for analysis.
The doubt I have, is whether DiffBind can use the information of the
paired end data, and how. Ideally I guess the best way would be this:
properly paired reads will be counted just once for each of the ends,
and singletons(/not properly paired) will also count one.
I was following the discussions around
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and
it seems that by incorporating the summarizeOverlaps functions it
would be (almost-)possible, but I would like to double check it with
you.
bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to
false. Is there anything else I should take into account (e.g. the
insert length in this case would be meaningful?)? Is there also a
parameter to use the "fragment" parameter set to true in the
summarizeOverlaps function?
Thank you very much in advance for the attention, and thank you again
for the nice package on top of it!
Best regards,
Federico
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
[[alternative HTML version deleted]]
Hi Rory,
I sorted the problem out, it apparently was a lack of memory in the
old
server. I could not think that summarizeOverlaps was so demanding on
these very deeply sequenced samples.
Still, I have one additional question for you.
I am trying to understand what DiffBind is doing for the normalization
with the control sample. Could you elaborate the concept a little
more?
Is the raw count subtracted/rounded and subtracted or something else?
I am asking because I see that a lot of regions get quite low values
in
the final report (shown with the bCounts=TRUE option) even if the peak
caller identified peaks over there.
Thanks in advance for the explanation! And once again, thank you for
the
nice tool you provided.
Federico
On 20.03.14 16:08, Rory Stark wrote:
> Hi Frederico-
>
> Let me know when you are able to use a recent build of DiffBind to
see
> fi it works on your paord-end data.
>
> At some point I'll probably need access to at least a subset of the
> experiment (eg using Dropbox) so I can debug any further problems
you
> may be having.
>
> Cheers-
> Rory
>
> From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">>
> Date: Tue, 4 Mar 2014 15:21:44 +0100
> To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>>
> Subject: Re: DiffBind and paired end data
>
> Dear Rory,
>
> Great to hear back from you. And even better to read that the
feature
> is going to be implemented soon.
>
> So far I am trying it then with bUseSummarizeOverlaps = TRUE and
with
> DBA$config$singleEnd <- FALSE before calling the dba.count.
> I got this warning message during the count operations:
>
> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
>
> # (8 times, I assume it is due to the 8 samples in total -
> triplicates + respective input, 2 conditions)
>
>
> The counting part still takes quite a very long time to perform.
When
> it was done, I sadly got this error message (sorry for some parts in
> german):
>
> Fehler: Error processing one or more read files. Check
warnings().
> Zusätzlich: Warnmeldungen:
> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
> mc.allow.recursive = TRUE) :
> scheduled cores 2 encountered errors in user code, all values
of
> the jobs will be affected
> 2:
> 3: Fehler bei der Auswertung des Argumentes 'x' bei der
> Methodenauswahl
> 4:
> 5:
> 6:
> 7: Fehler bei der Auswertung des Argumentes 'x' bei der
> Methodenauswahl
> 8:
>
>
> The sessionInfo for this is as follows:
>
> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
> methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
> [7] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16
edgeR_3.4.2
> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
> KernSmooth_2.23-10
> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
> tools_3.0.2
> [13] zlibbioc_1.8.0
>
>
>
> Will you also plan to implement the recent featureCounts, too? This
> could possibly speed up massively the operations, I guess.
>
> If I'd ask you from your hands-on experience, what is a "good"
number
> of DB sites? I assume this is not a one size fits all situation, but
> just as an overall idea it could help to have some numbers.
>
> Thank you again for replying, and I will get back to you as soon as
I
> have a better idea of what possibly went wrong.
>
> Best regards,
>
> Federico
>
>
>
> On 04.03.14 12:36, Rory Stark wrote:
>> Hi Frederico-
>>
>> Looks like you've got it just about right you want to use the
>> bUseSummarizedOverlaps option. For paired end, you don't need to
set
>> the insert length (changed to fragmentSize moving forward) as with
>> paired end data the size of each fragment is known.
>>
>> I will add the option to set the fragments parameter (in
>> DBA$config$fragments), it should appear in the development version
1.9.9.
>>
>> Do let me know if you have any problems wit this, as this is a
>> scenario we'd really like to see work and can help debugging if
there
>> are issues.
>>
>> Cheers-
>> Rory
>>
>> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>>
>> Date: Mon, 24 Feb 2014 14:51:49 +0100
>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>>
>> Subject: DiffBind and paired end data
>>
>> Dear Dr. Stark,
>>
>> I had the pleasure to attend ("back then in 2013") to the
>> presentation you gave at the EBI Advanced Course for RNA-seq and
>> ChIP-seq, where you introduced DiffBind and its usage examples.
>>
>> Now I moved from IMB - Mainz and I am working in the Biostatistics
>> department of the University of Medical Center in Mainz as a
research
>> associate, where I also have the chance of doing my PhD.
>>
>> I contact you regarding indeed DiffBind, and the possibility of
doing
>> differential binding analysis for ChIP-seq paired end data. As one
of
>> our collaborators recently produced such datasets, and they would
>> like to investigate the aspect of changes in the binding for TFs
and
>> histone modifications, I thought the DiffBind framework would be a
>> very solid solution for analysis.
>>
>> The doubt I have, is whether DiffBind can use the information of
the
>> paired end data, and how. Ideally I guess the best way would be
this:
>> properly paired reads will be counted just once for each of the
ends,
>> and singletons(/not properly paired) will also count one.
>>
>> I was following the discussions around
>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html,
>> and it seems that by incorporating the summarizeOverlaps functions
it
>> would be (almost-)possible, but I would like to double check it
with you.
>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to
>> false. Is there anything else I should take into account (e.g. the
>> insert length in this case would be meaningful?)? Is there also a
>> parameter to use the "fragment" parameter set to true in the
>> summarizeOverlaps function?
>>
>> Thank you very much in advance for the attention, and thank you
again
>> for the nice package on top of it!
>>
>> Best regards,
>>
>> Federico
>>
>> --
>> *Federico Marini, M.Sc.*
>> Medizinische Biometrie
>> _____________________________________________
>> UNIVERSITÄTSMEDIZIN
>> der Johannes Gutenberg-Universität Mainz
>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
>> Abteilung Medizinische Biometrie
>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>> Zahlbacher Str. 69, 55131 Mainz
>> www.imbei.uni-mainz.de
>> Telefon +49 (0) 6131 17-7029
>> Telefax +49 (0) 6131 17-472433
>> E-Mail: federico.marini@unimedizin-mainz.de
>> marinif@uni-mainz.de
>
> --
> *Federico Marini, M.Sc.*
> Medizinische Biometrie
> _____________________________________________
> UNIVERSITÄTSMEDIZIN
> der Johannes Gutenberg-Universität Mainz
> Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
> Abteilung Medizinische Biometrie
> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
Zahlbacher
> Str. 69, 55131 Mainz
> www.imbei.uni-mainz.de
> Telefon +49 (0) 6131 17-7029
> Telefax +49 (0) 6131 17-472433
> E-Mail: federico.marini@unimedizin-mainz.de
> marinif@uni-mainz.de
--
*Federico Marini, M.Sc.*
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de
[[alternative HTML version deleted]]
Hi Frederico-
Regarding normalizing to controls, the default behavior in DiffBind is
to first get the number of reads in the ChIP, and the number of reads
in the control, then scale the control reads if the control library
was sequenced to higher depth. Then it subtracts the control reads
from the ChIP reads (the _MINUS scoring options). Finally the read
counts for each sample are normalised with the other samples (this
depends which analysis method you use; the default is edgeR's TMM
method, based on the total library size).This is not an overly-
principled step, it just dampens some peaks where there are a high
concentration of reads int he control. The assumption is that you used
the controls for the peak calling peak callers like MACS model the
background more elaborately. Read scores can't be negative.
If you want to see what is really going on, I suggest retrieving the
report as a SummarizedExperiment and look through the appropriate
assays to get the raw counts. For example, using the example set:
> data(tamoxifen_analysis)
> sumexp =
dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
> assays(sumexp)
List of length 5
names(5): scores RPKM Reads cRPKM cReads
You can see the actual (non-normalized) read counts for the ChIP (min
1):
> chipReads = assay(sumexp,3)
And for the Control (min 1):
> controlReads = assay(sumexp,5)
And the subtracted values that are actually given to edgeR (or
DESeq/DESeq2):
> readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
You can compare these counts with what you see in the browser for the
same region for ChIP and control. If there is a big mismatch let me
know and we can figure out why summarizeOverlaps isn't doing what we
expect.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Mar 2014 15:47:56 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>"
<bioconductor@r-project.org<mailto:bioconductor@r-project.org>>
Subject: Re: DiffBind and paired end data
Hi Rory,
I sorted the problem out, it apparently was a lack of memory in the
old server. I could not think that summarizeOverlaps was so demanding
on these very deeply sequenced samples.
Still, I have one additional question for you.
I am trying to understand what DiffBind is doing for the normalization
with the control sample. Could you elaborate the concept a little
more? Is the raw count subtracted/rounded and subtracted or something
else?
I am asking because I see that a lot of regions get quite low values
in the final report (shown with the bCounts=TRUE option) even if the
peak caller identified peaks over there.
Thanks in advance for the explanation! And once again, thank you for
the nice tool you provided.
Federico
On 20.03.14 16:08, Rory Stark wrote:
Hi Frederico-
Let me know when you are able to use a recent build of DiffBind to see
fi it works on your paord-end data.
At some point I'll probably need access to at least a subset of the
experiment (eg using Dropbox) so I can debug any further problems you
may be having.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Tue, 4 Mar 2014 15:21:44 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: Re: DiffBind and paired end data
Dear Rory,
Great to hear back from you. And even better to read that the feature
is going to be implemented soon.
So far I am trying it then with bUseSummarizeOverlaps = TRUE and with
DBA$config$singleEnd <- FALSE before calling the dba.count.
I got this warning message during the count operations:
Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
# (8 times, I assume it is due to the 8 samples in total - triplicates
+ respective input, 2 conditions)
The counting part still takes quite a very long time to perform. When
it was done, I sadly got this error message (sorry for some parts in
german):
Fehler: Error processing one or more read files. Check warnings().
Zusätzlich: Warnmeldungen:
1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
mc.allow.recursive = TRUE) :
scheduled cores 2 encountered errors in user code, all values of the
jobs will be affected
2:
3: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
4:
5:
6:
7: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
8:
The sessionInfo for this is as follows:
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] amap_0.8-11 bitops_1.0-6 caTools_1.16
edgeR_3.4.2
[5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
KernSmooth_2.23-10
[9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
tools_3.0.2
[13] zlibbioc_1.8.0
Will you also plan to implement the recent featureCounts, too? This
could possibly speed up massively the operations, I guess.
If I'd ask you from your hands-on experience, what is a "good" number
of DB sites? I assume this is not a one size fits all situation, but
just as an overall idea it could help to have some numbers.
Thank you again for replying, and I will get back to you as soon as I
have a better idea of what possibly went wrong.
Best regards,
Federico
On 04.03.14 12:36, Rory Stark wrote:
Hi Frederico-
Looks like you've got it just about right you want to use the
bUseSummarizedOverlaps option. For paired end, you don't need to set
the insert length (changed to fragmentSize moving forward) as with
paired end data the size of each fragment is known.
I will add the option to set the fragments parameter (in
DBA$config$fragments), it should appear in the development version
1.9.9.
Do let me know if you have any problems wit this, as this is a
scenario we'd really like to see work and can help debugging if there
are issues.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Feb 2014 14:51:49 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: DiffBind and paired end data
Dear Dr. Stark,
I had the pleasure to attend ("back then in 2013") to the presentation
you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where
you introduced DiffBind and its usage examples.
Now I moved from IMB - Mainz and I am working in the Biostatistics
department of the University of Medical Center in Mainz as a research
associate, where I also have the chance of doing my PhD.
I contact you regarding indeed DiffBind, and the possibility of doing
differential binding analysis for ChIP-seq paired end data. As one of
our collaborators recently produced such datasets, and they would like
to investigate the aspect of changes in the binding for TFs and
histone modifications, I thought the DiffBind framework would be a
very solid solution for analysis.
The doubt I have, is whether DiffBind can use the information of the
paired end data, and how. Ideally I guess the best way would be this:
properly paired reads will be counted just once for each of the ends,
and singletons(/not properly paired) will also count one.
I was following the discussions around
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and
it seems that by incorporating the summarizeOverlaps functions it
would be (almost-)possible, but I would like to double check it with
you.
bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to
false. Is there anything else I should take into account (e.g. the
insert length in this case would be meaningful?)? Is there also a
parameter to use the "fragment" parameter set to true in the
summarizeOverlaps function?
Thank you very much in advance for the attention, and thank you again
for the nice package on top of it!
Best regards,
Federico
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
[[alternative HTML version deleted]]
Hi Rory,
Thanks for the detailed reply.
On 27.03.14 20:14, Rory Stark wrote:
> Hi Frederico-
>
> Regarding normalizing to controls, the default behavior in DiffBind
is
> to first get the number of reads in the ChIP, and the number of
reads
> in the control, then scale the control reads if the control library
> was sequenced to higher depth. Then it subtracts the control reads
> from the ChIP reads (the _MINUS scoring options). Finally the read
> counts for each sample are normalised with the other samples (this
> depends which analysis method you use; the default is edgeR's TMM
> method, based on the total library size).This is not an
> overly-principled step, it just dampens some peaks where there are a
> high concentration of reads int he control. The assumption is that
you
> used the controls for the peak calling peak callers like MACS
model
> the background more elaborately. Read scores can't be negative.
I guess that the rescaling, if performed, is then somehow rounded up,
so
that the amounts of reads are still discrete numbers - to feed for
edgeR/DESeq. If it is not the case, please correct me. I was asking
about details because I was trying to "do what DiffBind does" with
single steps performed outside of the pipeline, i.e. counting in the
enriched regions with featureCounts, importing the count table back in
a
semi-standard DESeq analysis (and there I did not know what to do for
the input to be as close as possible to your procedure).
One point I could not figure out of the vignette and also of the
reference manual is the sequence of exact steps to pass the count
matrix
back to diffbind (to exploit the plot aids and so on). Could you
provide
me a pseudocode example for it? I got lost in the reference with the
dba.peakset explanation. You can assume I have one vector of counts
per
each sample.
>
> If you want to see what is really going on, I suggest retrieving the
> report as a SummarizedExperiment and look through the appropriate
> assays to get the raw counts. For example, using the example set:
>
> > data(tamoxifen_analysis)
> > sumexp =
dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
> > assays(sumexp)
> List of length 5
> names(5): scores RPKM Reads cRPKM cReads
>
> You can see the actual (non-normalized) read counts for the ChIP
(min 1):
> > chipReads = assay(sumexp,3)
>
> And for the Control (min 1):
> > controlReads = assay(sumexp,5)
>
> And the subtracted values that are actually given to edgeR (or
> DESeq/DESeq2):
> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
I tried this on your tamoxifen sample dataset, as well as the ones we
have. A few additional information, the data we had was extremely deep
sequenced (drosophila, sometimes more than 20M reads per replicate) -
so
I tried to downsample the data, and I did this scaling each dataset to
20M reads. So for this reason I think that the scaling for the input
should not have that much of an impact.
>
> You can compare these counts with what you see in the browser for
the
> same region for ChIP and control. If there is a big mismatch let me
> know and we can figure out why summarizeOverlaps isn't doing what we
> expect.
>
It seems that the numbers are somehow correct - I did a few spot
checks.
The thing is, the concentration values look kind of strange to me.
Indeed, for regions where I detected peaks in all samples (I used
MACS2
with mostly default parameters) I see that the final concentration
value
is quite close to 0 or even negative. The readCountsForAnalysis object
in my case has indeed quite some negative values, quite more than what
you have in the same object derived from the tamoxifen dataset.
Can this be to the fact that my merging adjacent peaks coming from
different datasets the input amount gets "overestimated"? If so, do
you
have any suggestion to take in this case?
Thanks again for the advices.
Cheers,
Federico
> Cheers-
> Rory
>
>
>
> From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">>
> Date: Mon, 24 Mar 2014 15:47:56 +0100
> To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>>
> Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>"
> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">>
> Subject: Re: DiffBind and paired end data
>
> Hi Rory,
>
> I sorted the problem out, it apparently was a lack of memory in the
> old server. I could not think that summarizeOverlaps was so
demanding
> on these very deeply sequenced samples.
>
> Still, I have one additional question for you.
> I am trying to understand what DiffBind is doing for the
normalization
> with the control sample. Could you elaborate the concept a little
> more? Is the raw count subtracted/rounded and subtracted or
something
> else?
> I am asking because I see that a lot of regions get quite low values
> in the final report (shown with the bCounts=TRUE option) even if the
> peak caller identified peaks over there.
>
> Thanks in advance for the explanation! And once again, thank you for
> the nice tool you provided.
>
> Federico
>
>
> On 20.03.14 16:08, Rory Stark wrote:
>> Hi Frederico-
>>
>> Let me know when you are able to use a recent build of DiffBind to
>> see fi it works on your paord-end data.
>>
>> At some point I'll probably need access to at least a subset of the
>> experiment (eg using Dropbox) so I can debug any further problems
you
>> may be having.
>>
>> Cheers-
>> Rory
>>
>> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>>
>> Date: Tue, 4 Mar 2014 15:21:44 +0100
>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>>
>> Subject: Re: DiffBind and paired end data
>>
>> Dear Rory,
>>
>> Great to hear back from you. And even better to read that the
feature
>> is going to be implemented soon.
>>
>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and
with
>> DBA$config$singleEnd <- FALSE before calling the dba.count.
>> I got this warning message during the count operations:
>>
>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
>>
>> # (8 times, I assume it is due to the 8 samples in total -
>> triplicates + respective input, 2 conditions)
>>
>>
>> The counting part still takes quite a very long time to perform.
When
>> it was done, I sadly got this error message (sorry for some parts
in
>> german):
>>
>> Fehler: Error processing one or more read files. Check
warnings().
>> Zusätzlich: Warnmeldungen:
>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
>> mc.allow.recursive = TRUE) :
>> scheduled cores 2 encountered errors in user code, all values
>> of the jobs will be affected
>> 2:
>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der
>> Methodenauswahl
>> 4:
>> 5:
>> 6:
>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der
>> Methodenauswahl
>> 8:
>>
>>
>> The sessionInfo for this is as follows:
>>
>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
>> methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
>> [7] BiocGenerics_0.8.0
>>
>> loaded via a namespace (and not attached):
>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16
edgeR_3.4.2
>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
>> KernSmooth_2.23-10
>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
>> tools_3.0.2
>> [13] zlibbioc_1.8.0
>>
>>
>>
>> Will you also plan to implement the recent featureCounts, too? This
>> could possibly speed up massively the operations, I guess.
>>
>> If I'd ask you from your hands-on experience, what is a "good"
number
>> of DB sites? I assume this is not a one size fits all situation,
but
>> just as an overall idea it could help to have some numbers.
>>
>> Thank you again for replying, and I will get back to you as soon as
I
>> have a better idea of what possibly went wrong.
>>
>> Best regards,
>>
>> Federico
>>
>>
>>
>> On 04.03.14 12:36, Rory Stark wrote:
>>> Hi Frederico-
>>>
>>> Looks like you've got it just about right you want to use the
>>> bUseSummarizedOverlaps option. For paired end, you don't need to
set
>>> the insert length (changed to fragmentSize moving forward) as with
>>> paired end data the size of each fragment is known.
>>>
>>> I will add the option to set the fragments parameter (in
>>> DBA$config$fragments), it should appear in the development version
>>> 1.9.9.
>>>
>>> Do let me know if you have any problems wit this, as this is a
>>> scenario we'd really like to see work and can help debugging if
>>> there are issues.
>>>
>>> Cheers-
>>> Rory
>>>
>>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>>
>>> Date: Mon, 24 Feb 2014 14:51:49 +0100
>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>> Subject: DiffBind and paired end data
>>>
>>> Dear Dr. Stark,
>>>
>>> I had the pleasure to attend ("back then in 2013") to the
>>> presentation you gave at the EBI Advanced Course for RNA-seq and
>>> ChIP-seq, where you introduced DiffBind and its usage examples.
>>>
>>> Now I moved from IMB - Mainz and I am working in the Biostatistics
>>> department of the University of Medical Center in Mainz as a
>>> research associate, where I also have the chance of doing my PhD.
>>>
>>> I contact you regarding indeed DiffBind, and the possibility of
>>> doing differential binding analysis for ChIP-seq paired end data.
As
>>> one of our collaborators recently produced such datasets, and they
>>> would like to investigate the aspect of changes in the binding for
>>> TFs and histone modifications, I thought the DiffBind framework
>>> would be a very solid solution for analysis.
>>>
>>> The doubt I have, is whether DiffBind can use the information of
the
>>> paired end data, and how. Ideally I guess the best way would be
>>> this: properly paired reads will be counted just once for each of
>>> the ends, and singletons(/not properly paired) will also count
one.
>>>
>>> I was following the discussions around
>>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html,
>>> and it seems that by incorporating the summarizeOverlaps functions
>>> it would be (almost-)possible, but I would like to double check it
>>> with you.
>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd
to
>>> false. Is there anything else I should take into account (e.g. the
>>> insert length in this case would be meaningful?)? Is there also a
>>> parameter to use the "fragment" parameter set to true in the
>>> summarizeOverlaps function?
>>>
>>> Thank you very much in advance for the attention, and thank you
>>> again for the nice package on top of it!
>>>
>>> Best regards,
>>>
>>> Federico
>>>
>>> --
>>> *Federico Marini, M.Sc.*
>>> Medizinische Biometrie
>>> _____________________________________________
>>> UNIVERSITÄTSMEDIZIN
>>> der Johannes Gutenberg-Universität Mainz
>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
>>> (IMBEI)
>>> Abteilung Medizinische Biometrie
>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>> Zahlbacher Str. 69, 55131 Mainz
>>> www.imbei.uni-mainz.de
>>> Telefon +49 (0) 6131 17-7029
>>> Telefax +49 (0) 6131 17-472433
>>> E-Mail: federico.marini@unimedizin-mainz.de
>>> marinif@uni-mainz.de
>>
>> --
>> *Federico Marini, M.Sc.*
>> Medizinische Biometrie
>> _____________________________________________
>> UNIVERSITÄTSMEDIZIN
>> der Johannes Gutenberg-Universität Mainz
>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
>> Abteilung Medizinische Biometrie
>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>> Zahlbacher Str. 69, 55131 Mainz
>> www.imbei.uni-mainz.de
>> Telefon +49 (0) 6131 17-7029
>> Telefax +49 (0) 6131 17-472433
>> E-Mail: federico.marini@unimedizin-mainz.de
>> marinif@uni-mainz.de
>
> --
> *Federico Marini, M.Sc.*
> Medical Biostatistics
> _____________________________________________
> University Medical Center of the Johannes Gutenberg University Mainz
> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
> Medical Biostatistics Department
> Postal address: 55101 Mainz
> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
> www.imbei.uni-mainz.de
> Phone +49 (0) 6131 17-7029
> Fax +49 (0) 6131 17-472433
> E-Mail: marinif@uni-mainz.de
--
*Federico Marini, M.Sc.*
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de
[[alternative HTML version deleted]]
Hi Rory,
An update on the read count values. For your tamoxifen datasets it
happens the following:
head(controlReads)
BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+
T47D.1+
T47D.2+ ZR75.1+ ZR75.2+
1433 3 3 8 8 4 4 4 3
3 7 7
7 43 43 8 8 24 25 25 6
6 20 20
1606 13 13 2 2 19 19 19 8
8 21 21
156 33 33 4 4 6 6 6 8
8 8 8
1238 27 27 5 5 29 30 30 11
11 29 29
1446 3 3 2 2 3 3 3 1
1 4 4
Correctly each input has the same amount of reads.
In our case we had one input for each condition and for each batch.
Samples 1 and 2 for each condition were sequenced in batch 1, samples
numbered as 3 were done in batch 2.
head(dba_controlReads)
wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3
4442 168 226 152 239 241
100
1464 696 934 471 592 596
289
2956 4293 5762 3538 4677 4711
2178
1292 1035 1390 704 1110
1118 482
5741 229 307 172 230 232
112
557 150 201 87 165 166
71
The numbers differ in every column, and this is somehow probably
pointing to some strange behaviour of summarizeOverlaps. Would you
have
the same guess if you look at this lines?
Cheers,
Federico
On 28.03.14 13:29, Federico Marini wrote:
> Hi Rory,
>
> Thanks for the detailed reply.
>
> On 27.03.14 20:14, Rory Stark wrote:
>> Hi Frederico-
>>
>> Regarding normalizing to controls, the default behavior in DiffBind
>> is to first get the number of reads in the ChIP, and the number of
>> reads in the control, then scale the control reads if the control
>> library was sequenced to higher depth. Then it subtracts the
control
>> reads from the ChIP reads (the _MINUS scoring options). Finally the
>> read counts for each sample are normalised with the other samples
>> (this depends which analysis method you use; the default is
edgeR's
>> TMM method, based on the total library size).This is not an
>> overly-principled step, it just dampens some peaks where there are
a
>> high concentration of reads int he control. The assumption is
>> that you used the controls for the peak calling peak callers like
>> MACS model the background more elaborately. Read scores can't be
>> negative.
> I guess that the rescaling, if performed, is then somehow rounded
up,
> so that the amounts of reads are still discrete numbers - to feed
for
> edgeR/DESeq. If it is not the case, please correct me. I was asking
> about details because I was trying to "do what DiffBind does" with
> single steps performed outside of the pipeline, i.e. counting in the
> enriched regions with featureCounts, importing the count table back
in
> a semi-standard DESeq analysis (and there I did not know what to do
> for the input to be as close as possible to your procedure).
>
> One point I could not figure out of the vignette and also of the
> reference manual is the sequence of exact steps to pass the count
> matrix back to diffbind (to exploit the plot aids and so on). Could
> you provide me a pseudocode example for it? I got lost in the
> reference with the dba.peakset explanation. You can assume I have
one
> vector of counts per each sample.
>>
>> If you want to see what is really going on, I suggest retrieving
the
>> report as a SummarizedExperiment and look through the appropriate
>> assays to get the raw counts. For example, using the example set:
>>
>> > data(tamoxifen_analysis)
>> > sumexp =
>> dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
>> > assays(sumexp)
>> List of length 5
>> names(5): scores RPKM Reads cRPKM cReads
>>
>> You can see the actual (non-normalized) read counts for the ChIP
(min 1):
>> > chipReads = assay(sumexp,3)
>>
>> And for the Control (min 1):
>> > controlReads = assay(sumexp,5)
>>
>> And the subtracted values that are actually given to edgeR (or
>> DESeq/DESeq2):
>> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
> I tried this on your tamoxifen sample dataset, as well as the ones
we
> have. A few additional information, the data we had was extremely
deep
> sequenced (drosophila, sometimes more than 20M reads per replicate)
-
> so I tried to downsample the data, and I did this scaling each
dataset
> to 20M reads. So for this reason I think that the scaling for the
> input should not have that much of an impact.
>>
>> You can compare these counts with what you see in the browser for
the
>> same region for ChIP and control. If there is a big mismatch let me
>> know and we can figure out why summarizeOverlaps isn't doing what
we
>> expect.
>>
> It seems that the numbers are somehow correct - I did a few spot
> checks. The thing is, the concentration values look kind of strange
to
> me. Indeed, for regions where I detected peaks in all samples (I
used
> MACS2 with mostly default parameters) I see that the final
> concentration value is quite close to 0 or even negative. The
> readCountsForAnalysis object in my case has indeed quite some
negative
> values, quite more than what you have in the same object derived
from
> the tamoxifen dataset.
> Can this be to the fact that my merging adjacent peaks coming from
> different datasets the input amount gets "overestimated"? If so, do
> you have any suggestion to take in this case?
>
> Thanks again for the advices.
>
> Cheers,
> Federico
>> Cheers-
>> Rory
>>
>>
>>
>> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>>
>> Date: Mon, 24 Mar 2014 15:47:56 +0100
>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>>
>> Cc: "bioconductor@r-project.org
<mailto:bioconductor@r-project.org>"
>> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">>
>> Subject: Re: DiffBind and paired end data
>>
>> Hi Rory,
>>
>> I sorted the problem out, it apparently was a lack of memory in the
>> old server. I could not think that summarizeOverlaps was so
demanding
>> on these very deeply sequenced samples.
>>
>> Still, I have one additional question for you.
>> I am trying to understand what DiffBind is doing for the
>> normalization with the control sample. Could you elaborate the
>> concept a little more? Is the raw count subtracted/rounded and
>> subtracted or something else?
>> I am asking because I see that a lot of regions get quite low
values
>> in the final report (shown with the bCounts=TRUE option) even if
the
>> peak caller identified peaks over there.
>>
>> Thanks in advance for the explanation! And once again, thank you
for
>> the nice tool you provided.
>>
>> Federico
>>
>>
>> On 20.03.14 16:08, Rory Stark wrote:
>>> Hi Frederico-
>>>
>>> Let me know when you are able to use a recent build of DiffBind to
>>> see fi it works on your paord-end data.
>>>
>>> At some point I'll probably need access to at least a subset of
the
>>> experiment (eg using Dropbox) so I can debug any further problems
>>> you may be having.
>>>
>>> Cheers-
>>> Rory
>>>
>>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>>
>>> Date: Tue, 4 Mar 2014 15:21:44 +0100
>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>> Subject: Re: DiffBind and paired end data
>>>
>>> Dear Rory,
>>>
>>> Great to hear back from you. And even better to read that the
>>> feature is going to be implemented soon.
>>>
>>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and
>>> with DBA$config$singleEnd <- FALSE before calling the dba.count.
>>> I got this warning message during the count operations:
>>>
>>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
>>>
>>> # (8 times, I assume it is due to the 8 samples in total -
>>> triplicates + respective input, 2 conditions)
>>>
>>>
>>> The counting part still takes quite a very long time to perform.
>>> When it was done, I sadly got this error message (sorry for some
>>> parts in german):
>>>
>>> Fehler: Error processing one or more read files. Check
warnings().
>>> Zusätzlich: Warnmeldungen:
>>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
>>> mc.allow.recursive = TRUE) :
>>> scheduled cores 2 encountered errors in user code, all
values
>>> of the jobs will be affected
>>> 2:
>>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der
>>> Methodenauswahl
>>> 4:
>>> 5:
>>> 6:
>>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der
>>> Methodenauswahl
>>> 8:
>>>
>>>
>>> The sessionInfo for this is as follows:
>>>
>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets
>>> methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
>>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
>>> [7] BiocGenerics_0.8.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2
>>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
KernSmooth_2.23-10
>>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2 tools_3.0.2
>>> [13] zlibbioc_1.8.0
>>>
>>>
>>>
>>> Will you also plan to implement the recent featureCounts, too?
This
>>> could possibly speed up massively the operations, I guess.
>>>
>>> If I'd ask you from your hands-on experience, what is a "good"
>>> number of DB sites? I assume this is not a one size fits all
>>> situation, but just as an overall idea it could help to have some
>>> numbers.
>>>
>>> Thank you again for replying, and I will get back to you as soon
as
>>> I have a better idea of what possibly went wrong.
>>>
>>> Best regards,
>>>
>>> Federico
>>>
>>>
>>>
>>> On 04.03.14 12:36, Rory Stark wrote:
>>>> Hi Frederico-
>>>>
>>>> Looks like you've got it just about right you want to use the
>>>> bUseSummarizedOverlaps option. For paired end, you don't need to
>>>> set the insert length (changed to fragmentSize moving forward) as
>>>> with paired end data the size of each fragment is known.
>>>>
>>>> I will add the option to set the fragments parameter (in
>>>> DBA$config$fragments), it should appear in the development
version
>>>> 1.9.9.
>>>>
>>>> Do let me know if you have any problems wit this, as this is a
>>>> scenario we'd really like to see work and can help debugging if
>>>> there are issues.
>>>>
>>>> Cheers-
>>>> Rory
>>>>
>>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>>
>>>> Date: Mon, 24 Feb 2014 14:51:49 +0100
>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>> Subject: DiffBind and paired end data
>>>>
>>>> Dear Dr. Stark,
>>>>
>>>> I had the pleasure to attend ("back then in 2013") to the
>>>> presentation you gave at the EBI Advanced Course for RNA-seq and
>>>> ChIP-seq, where you introduced DiffBind and its usage examples.
>>>>
>>>> Now I moved from IMB - Mainz and I am working in the
Biostatistics
>>>> department of the University of Medical Center in Mainz as a
>>>> research associate, where I also have the chance of doing my PhD.
>>>>
>>>> I contact you regarding indeed DiffBind, and the possibility of
>>>> doing differential binding analysis for ChIP-seq paired end data.
>>>> As one of our collaborators recently produced such datasets, and
>>>> they would like to investigate the aspect of changes in the
binding
>>>> for TFs and histone modifications, I thought the DiffBind
framework
>>>> would be a very solid solution for analysis.
>>>>
>>>> The doubt I have, is whether DiffBind can use the information of
>>>> the paired end data, and how. Ideally I guess the best way would
be
>>>> this: properly paired reads will be counted just once for each of
>>>> the ends, and singletons(/not properly paired) will also count
one.
>>>>
>>>> I was following the discussions around
>>>>
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html,
>>>> and it seems that by incorporating the summarizeOverlaps
functions
>>>> it would be (almost-)possible, but I would like to double check
it
>>>> with you.
>>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd
to
>>>> false. Is there anything else I should take into account (e.g.
the
>>>> insert length in this case would be meaningful?)? Is there also a
>>>> parameter to use the "fragment" parameter set to true in the
>>>> summarizeOverlaps function?
>>>>
>>>> Thank you very much in advance for the attention, and thank you
>>>> again for the nice package on top of it!
>>>>
>>>> Best regards,
>>>>
>>>> Federico
>>>>
>>>> --
>>>> *Federico Marini, M.Sc.*
>>>> Medizinische Biometrie
>>>> _____________________________________________
>>>> UNIVERSITÄTSMEDIZIN
>>>> der Johannes Gutenberg-Universität Mainz
>>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
>>>> (IMBEI)
>>>> Abteilung Medizinische Biometrie
>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>>> Zahlbacher Str. 69, 55131 Mainz
>>>> www.imbei.uni-mainz.de
>>>> Telefon +49 (0) 6131 17-7029
>>>> Telefax +49 (0) 6131 17-472433
>>>> E-Mail: federico.marini@unimedizin-mainz.de
>>>> marinif@uni-mainz.de
>>>
>>> --
>>> *Federico Marini, M.Sc.*
>>> Medizinische Biometrie
>>> _____________________________________________
>>> UNIVERSITÄTSMEDIZIN
>>> der Johannes Gutenberg-Universität Mainz
>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
>>> (IMBEI)
>>> Abteilung Medizinische Biometrie
>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>> Zahlbacher Str. 69, 55131 Mainz
>>> www.imbei.uni-mainz.de
>>> Telefon +49 (0) 6131 17-7029
>>> Telefax +49 (0) 6131 17-472433
>>> E-Mail: federico.marini@unimedizin-mainz.de
>>> marinif@uni-mainz.de
>>
>> --
>> *Federico Marini, M.Sc.*
>> Medical Biostatistics
>> _____________________________________________
>> University Medical Center of the Johannes Gutenberg University
Mainz
>> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
>> Medical Biostatistics Department
>> Postal address: 55101 Mainz
>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
>> www.imbei.uni-mainz.de
>> Phone +49 (0) 6131 17-7029
>> Fax +49 (0) 6131 17-472433
>> E-Mail: marinif@uni-mainz.de
>
> --
> *Federico Marini, M.Sc.*
> Medical Biostatistics
> _____________________________________________
> University Medical Center of the Johannes Gutenberg University Mainz
> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
> Medical Biostatistics Department
> Postal address: 55101 Mainz
> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
> www.imbei.uni-mainz.de
> Phone +49 (0) 6131 17-7029
> Fax +49 (0) 6131 17-472433
> E-Mail: marinif@uni-mainz.de
--
*Federico Marini, M.Sc.*
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de
[[alternative HTML version deleted]]
Frederico-
So the controls for wt_r1 and wt_r2 are the exact same, and likewise
for KD_r1 and KD_r2? If the number of reads in the control is close to
those in the ChIP, this could't account for the large differences in
control reads you are seeing.
Two things we can try to see of the control scaling is doing something
unexpected:
One is to turn of control scaling in the call to dba.count
(bScaleControl=FALSE) -- then you should get identical counts for the
shared controls.
The other other thing I can think of to try is to make sure that we
are getting the correct library size from genomicAlignment using
countBam. The easiest way to check this is to examine the DBA object
after the call to dba.count:
> lib.sizes = DBA$class[ DiffBind:::PV_READS,]
Will give you the total number of reads in each library. Check these
numbers to see if they correspond to how many reads you think there
should be, or if there are any big disparities.
You are correct that after scaling the counts are rounded to given the
DE package whole count numbers. In the report, the concentrations are
reported as log2 values, which is why they can be very small or even
negative.
Finally, to retrieve the DESeq CountDataSet object:
> DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata
You'll need to run dba.analyze with bReduceObject=FALSE to get back a
"complete" result set for use in DESeq.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Fri, 28 Mar 2014 14:28:15 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>"
<bioconductor@r-project.org<mailto:bioconductor@r-project.org>>
Subject: Re: DiffBind and paired end data
Hi Rory,
An update on the read count values. For your tamoxifen datasets it
happens the following:
head(controlReads)
BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+ T47D.1+
T47D.2+ ZR75.1+ ZR75.2+
1433 3 3 8 8 4 4 4 3
3 7 7
7 43 43 8 8 24 25 25 6
6 20 20
1606 13 13 2 2 19 19 19 8
8 21 21
156 33 33 4 4 6 6 6 8
8 8 8
1238 27 27 5 5 29 30 30 11
11 29 29
1446 3 3 2 2 3 3 3 1
1 4 4
Correctly each input has the same amount of reads.
In our case we had one input for each condition and for each batch.
Samples 1 and 2 for each condition were sequenced in batch 1, samples
numbered as 3 were done in batch 2.
head(dba_controlReads)
wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3
4442 168 226 152 239
241 100
1464 696 934 471 592
596 289
2956 4293 5762 3538 4677
4711 2178
1292 1035 1390 704 1110
1118 482
5741 229 307 172 230
232 112
557 150 201 87 165
166 71
The numbers differ in every column, and this is somehow probably
pointing to some strange behaviour of summarizeOverlaps. Would you
have the same guess if you look at this lines?
Cheers,
Federico
On 28.03.14 13:29, Federico Marini wrote:
Hi Rory,
Thanks for the detailed reply.
On 27.03.14 20:14, Rory Stark wrote:
Hi Frederico-
Regarding normalizing to controls, the default behavior in DiffBind is
to first get the number of reads in the ChIP, and the number of reads
in the control, then scale the control reads if the control library
was sequenced to higher depth. Then it subtracts the control reads
from the ChIP reads (the _MINUS scoring options). Finally the read
counts for each sample are normalised with the other samples (this
depends which analysis method you use; the default is edgeR's TMM
method, based on the total library size).This is not an overly-
principled step, it just dampens some peaks where there are a high
concentration of reads int he control. The assumption is that you used
the controls for the peak calling peak callers like MACS model the
background more elaborately. Read scores can't be negative.
I guess that the rescaling, if performed, is then somehow rounded up,
so that the amounts of reads are still discrete numbers - to feed for
edgeR/DESeq. If it is not the case, please correct me. I was asking
about details because I was trying to "do what DiffBind does" with
single steps performed outside of the pipeline, i.e. counting in the
enriched regions with featureCounts, importing the count table back in
a semi-standard DESeq analysis (and there I did not know what to do
for the input to be as close as possible to your procedure).
One point I could not figure out of the vignette and also of the
reference manual is the sequence of exact steps to pass the count
matrix back to diffbind (to exploit the plot aids and so on). Could
you provide me a pseudocode example for it? I got lost in the
reference with the dba.peakset explanation. You can assume I have one
vector of counts per each sample.
If you want to see what is really going on, I suggest retrieving the
report as a SummarizedExperiment and look through the appropriate
assays to get the raw counts. For example, using the example set:
> data(tamoxifen_analysis)
> sumexp =
dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
> assays(sumexp)
List of length 5
names(5): scores RPKM Reads cRPKM cReads
You can see the actual (non-normalized) read counts for the ChIP (min
1):
> chipReads = assay(sumexp,3)
And for the Control (min 1):
> controlReads = assay(sumexp,5)
And the subtracted values that are actually given to edgeR (or
DESeq/DESeq2):
> readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
I tried this on your tamoxifen sample dataset, as well as the ones we
have. A few additional information, the data we had was extremely deep
sequenced (drosophila, sometimes more than 20M reads per replicate) -
so I tried to downsample the data, and I did this scaling each dataset
to 20M reads. So for this reason I think that the scaling for the
input should not have that much of an impact.
You can compare these counts with what you see in the browser for the
same region for ChIP and control. If there is a big mismatch let me
know and we can figure out why summarizeOverlaps isn't doing what we
expect.
It seems that the numbers are somehow correct - I did a few spot
checks. The thing is, the concentration values look kind of strange to
me. Indeed, for regions where I detected peaks in all samples (I used
MACS2 with mostly default parameters) I see that the final
concentration value is quite close to 0 or even negative. The
readCountsForAnalysis object in my case has indeed quite some negative
values, quite more than what you have in the same object derived from
the tamoxifen dataset.
Can this be to the fact that my merging adjacent peaks coming from
different datasets the input amount gets "overestimated"? If so, do
you have any suggestion to take in this case?
Thanks again for the advices.
Cheers,
Federico
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Mar 2014 15:47:56 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>"
<bioconductor@r-project.org<mailto:bioconductor@r-project.org>>
Subject: Re: DiffBind and paired end data
Hi Rory,
I sorted the problem out, it apparently was a lack of memory in the
old server. I could not think that summarizeOverlaps was so demanding
on these very deeply sequenced samples.
Still, I have one additional question for you.
I am trying to understand what DiffBind is doing for the normalization
with the control sample. Could you elaborate the concept a little
more? Is the raw count subtracted/rounded and subtracted or something
else?
I am asking because I see that a lot of regions get quite low values
in the final report (shown with the bCounts=TRUE option) even if the
peak caller identified peaks over there.
Thanks in advance for the explanation! And once again, thank you for
the nice tool you provided.
Federico
On 20.03.14 16:08, Rory Stark wrote:
Hi Frederico-
Let me know when you are able to use a recent build of DiffBind to see
fi it works on your paord-end data.
At some point I'll probably need access to at least a subset of the
experiment (eg using Dropbox) so I can debug any further problems you
may be having.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Tue, 4 Mar 2014 15:21:44 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: Re: DiffBind and paired end data
Dear Rory,
Great to hear back from you. And even better to read that the feature
is going to be implemented soon.
So far I am trying it then with bUseSummarizeOverlaps = TRUE and with
DBA$config$singleEnd <- FALSE before calling the dba.count.
I got this warning message during the count operations:
Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
# (8 times, I assume it is due to the 8 samples in total - triplicates
+ respective input, 2 conditions)
The counting part still takes quite a very long time to perform. When
it was done, I sadly got this error message (sorry for some parts in
german):
Fehler: Error processing one or more read files. Check warnings().
Zusätzlich: Warnmeldungen:
1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
mc.allow.recursive = TRUE) :
scheduled cores 2 encountered errors in user code, all values of the
jobs will be affected
2:
3: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
4:
5:
6:
7: Fehler bei der Auswertung des Argumentes 'x' bei der
Methodenauswahl
8:
The sessionInfo for this is as follows:
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] amap_0.8-11 bitops_1.0-6 caTools_1.16
edgeR_3.4.2
[5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
KernSmooth_2.23-10
[9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
tools_3.0.2
[13] zlibbioc_1.8.0
Will you also plan to implement the recent featureCounts, too? This
could possibly speed up massively the operations, I guess.
If I'd ask you from your hands-on experience, what is a "good" number
of DB sites? I assume this is not a one size fits all situation, but
just as an overall idea it could help to have some numbers.
Thank you again for replying, and I will get back to you as soon as I
have a better idea of what possibly went wrong.
Best regards,
Federico
On 04.03.14 12:36, Rory Stark wrote:
Hi Frederico-
Looks like you've got it just about right you want to use the
bUseSummarizedOverlaps option. For paired end, you don't need to set
the insert length (changed to fragmentSize moving forward) as with
paired end data the size of each fragment is known.
I will add the option to set the fragments parameter (in
DBA$config$fragments), it should appear in the development version
1.9.9.
Do let me know if you have any problems wit this, as this is a
scenario we'd really like to see work and can help debugging if there
are issues.
Cheers-
Rory
From: Federico Marini <marinif@uni-mainz.de<mailto:marinif@uni- mainz.de="">>
Date: Mon, 24 Feb 2014 14:51:49 +0100
To: Rory Stark
<rory.stark@cruk.cam.ac.uk<mailto:rory.stark@cruk.cam.ac.uk>>
Subject: DiffBind and paired end data
Dear Dr. Stark,
I had the pleasure to attend ("back then in 2013") to the presentation
you gave at the EBI Advanced Course for RNA-seq and ChIP-seq, where
you introduced DiffBind and its usage examples.
Now I moved from IMB - Mainz and I am working in the Biostatistics
department of the University of Medical Center in Mainz as a research
associate, where I also have the chance of doing my PhD.
I contact you regarding indeed DiffBind, and the possibility of doing
differential binding analysis for ChIP-seq paired end data. As one of
our collaborators recently produced such datasets, and they would like
to investigate the aspect of changes in the binding for TFs and
histone modifications, I thought the DiffBind framework would be a
very solid solution for analysis.
The doubt I have, is whether DiffBind can use the information of the
paired end data, and how. Ideally I guess the best way would be this:
properly paired reads will be counted just once for each of the ends,
and singletons(/not properly paired) will also count one.
I was following the discussions around
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and
it seems that by incorporating the summarizeOverlaps functions it
would be (almost-)possible, but I would like to double check it with
you.
bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd to
false. Is there anything else I should take into account (e.g. the
insert length in this case would be meaningful?)? Is there also a
parameter to use the "fragment" parameter set to true in the
summarizeOverlaps function?
Thank you very much in advance for the attention, and thank you again
for the nice package on top of it!
Best regards,
Federico
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medizinische Biometrie
_____________________________________________
UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität Mainz
Institut für Medizinische Biometrie, Epidemiologie und Informatik
(IMBEI)
Abteilung Medizinische Biometrie
Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere Zahlbacher
Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Telefon +49 (0) 6131 17-7029
Telefax +49 (0) 6131 17-472433
E-Mail: federico.marini@unimedizin-mainz.de<mailto:federico.marini @unimedizin-mainz.de="">
marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
--
Federico Marini, M.Sc.
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de<http: www.imbei.uni-mainz.de="">
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de<mailto:marinif@uni-mainz.de>
[[alternative HTML version deleted]]
Hi Rory,
On 28.03.14 18:10, Rory Stark wrote:
> Frederico-
>
> So the controls for wt_r1 and wt_r2 are the exact same, and
likewise
> for KD_r1 and KD_r2? If the number of reads in the control is close
to
> those in the ChIP, this could't account for the large differences in
> control reads you are seeing.
>
The controls are indeed the very same. And in advance I already
downsampled all samples actually down to about 20 mln uniquely mapping
reads. So as you say, it seems kind of strange.
> Two things we can try to see of the control scaling is doing
something
> unexpected:
>
> One is to turn of control scaling in the call to dba.count
> (bScaleControl=FALSE) -- then you should get identical counts for
the
> shared controls.
>
> The other other thing I can think of to try is to make sure that we
> are getting the correct library size from genomicAlignment using
> countBam. The easiest way to check this is to examine the DBA object
> after the call to dba.count:
>
> > lib.sizes = DBA$class[ DiffBind:::PV_READS,]
>
> Will give you the total number of reads in each library. Check these
> numbers to see if they correspond to how many reads you think there
> should be, or if there are any big disparities.
>
I am currently re-runnign the counting step, which is the most
intensive
on RAM to carry out. But the lib sizes seem to be recognized correctly
-
the numbers all range from 20 to 22 mln reads as they should (I think
the size accounts for multi-mapping reads too). The samples for one
batch of one factor had lower amount of mapped reads, which is
mirrored
by higher values in the size of the library. I will update you when
the
counting is over.
> You are correct that after scaling the counts are rounded to given
the
> DE package whole count numbers. In the report, the concentrations
are
> reported as log2 values, which is why they can be very small or even
> negative.
>
> Finally, to retrieve the DESeq CountDataSet object:
>
> > DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata
>
> You'll need to run dba.analyze with bReduceObject=FALSE to get back
a
> "complete" result set for use in DESeq.
Thanks, great. So far then the concentrations are fine, but the
concerns
are still on for the reads.
Would it make sense for you to use as said in the previous messages a
tool such as featureCounts as a check? My question was pointing at how
to re-import the count matrix into DiffBind, and as I said use the
nice
heatmaps/MA plots you already provide.
I'll keep you posted!
Best,
Federico
>
> Cheers-
> Rory
>
>
> From: Federico Marini <marinif@uni-mainz.de <mailto:marinif@uni-="" mainz.de="">>
> Date: Fri, 28 Mar 2014 14:28:15 +0100
> To: Rory Stark <rory.stark@cruk.cam.ac.uk> <mailto:rory.stark@cruk.cam.ac.uk>>
> Cc: "bioconductor@r-project.org <mailto:bioconductor@r-project.org>"
> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">>
> Subject: Re: DiffBind and paired end data
>
> Hi Rory,
>
> An update on the read count values. For your tamoxifen datasets it
> happens the following:
>
> head(controlReads)
> BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+
T47D.1+
> T47D.2+ ZR75.1+ ZR75.2+
> 1433 3 3 8 8 4 4 4 3
> 3 7 7
> 7 43 43 8 8 24 25 25
> 6 6 20 20
> 1606 13 13 2 2 19 19 19
> 8 8 21 21
> 156 33 33 4 4 6 6 6 8
> 8 8 8
> 1238 27 27 5 5 29 30 30 11
> 11 29 29
> 1446 3 3 2 2 3 3 3 1
> 1 4 4
>
> Correctly each input has the same amount of reads.
>
> In our case we had one input for each condition and for each batch.
> Samples 1 and 2 for each condition were sequenced in batch 1,
samples
> numbered as 3 were done in batch 2.
>
> head(dba_controlReads)
> wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3
> 4442 168 226 152 239
> 241 100
> 1464 696 934 471 592
> 596 289
> 2956 4293 5762 3538 4677
> 4711 2178
> 1292 1035 1390 704 1110
> 1118 482
> 5741 229 307 172 230
> 232 112
> 557 150 201 87 165
> 166 71
>
> The numbers differ in every column, and this is somehow probably
> pointing to some strange behaviour of summarizeOverlaps. Would you
> have the same guess if you look at this lines?
>
> Cheers,
> Federico
>
> On 28.03.14 13:29, Federico Marini wrote:
>> Hi Rory,
>>
>> Thanks for the detailed reply.
>>
>> On 27.03.14 20:14, Rory Stark wrote:
>>> Hi Frederico-
>>>
>>> Regarding normalizing to controls, the default behavior in
DiffBind
>>> is to first get the number of reads in the ChIP, and the number
of
>>> reads in the control, then scale the control reads if the control
>>> library was sequenced to higher depth. Then it subtracts the
control
>>> reads from the ChIP reads (the _MINUS scoring options). Finally
the
>>> read counts for each sample are normalised with the other samples
>>> (this depends which analysis method you use; the default is
>>> edgeR's TMM method, based on the total library size).This is not
an
>>> overly-principled step, it just dampens some peaks where there are
a
>>> high concentration of reads int he control. The assumption is
>>> that you used the controls for the peak calling peak callers
like
>>> MACS model the background more elaborately. Read scores can't be
>>> negative.
>> I guess that the rescaling, if performed, is then somehow rounded
up,
>> so that the amounts of reads are still discrete numbers - to feed
for
>> edgeR/DESeq. If it is not the case, please correct me. I was asking
>> about details because I was trying to "do what DiffBind does" with
>> single steps performed outside of the pipeline, i.e. counting in
the
>> enriched regions with featureCounts, importing the count table back
>> in a semi-standard DESeq analysis (and there I did not know what to
>> do for the input to be as close as possible to your procedure).
>>
>> One point I could not figure out of the vignette and also of the
>> reference manual is the sequence of exact steps to pass the count
>> matrix back to diffbind (to exploit the plot aids and so on). Could
>> you provide me a pseudocode example for it? I got lost in the
>> reference with the dba.peakset explanation. You can assume I have
one
>> vector of counts per each sample.
>>>
>>> If you want to see what is really going on, I suggest retrieving
the
>>> report as a SummarizedExperiment and look through the appropriate
>>> assays to get the raw counts. For example, using the example set:
>>>
>>> > data(tamoxifen_analysis)
>>> > sumexp =
dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
>>> > assays(sumexp)
>>> List of length 5
>>> names(5): scores RPKM Reads cRPKM cReads
>>>
>>> You can see the actual (non-normalized) read counts for the ChIP
>>> (min 1):
>>> > chipReads = assay(sumexp,3)
>>>
>>> And for the Control (min 1):
>>> > controlReads = assay(sumexp,5)
>>>
>>> And the subtracted values that are actually given to edgeR (or
>>> DESeq/DESeq2):
>>> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
>> I tried this on your tamoxifen sample dataset, as well as the ones
we
>> have. A few additional information, the data we had was extremely
>> deep sequenced (drosophila, sometimes more than 20M reads per
>> replicate) - so I tried to downsample the data, and I did this
>> scaling each dataset to 20M reads. So for this reason I think that
>> the scaling for the input should not have that much of an impact.
>>>
>>> You can compare these counts with what you see in the browser for
>>> the same region for ChIP and control. If there is a big mismatch
let
>>> me know and we can figure out why summarizeOverlaps isn't doing
what
>>> we expect.
>>>
>> It seems that the numbers are somehow correct - I did a few spot
>> checks. The thing is, the concentration values look kind of strange
>> to me. Indeed, for regions where I detected peaks in all samples (I
>> used MACS2 with mostly default parameters) I see that the final
>> concentration value is quite close to 0 or even negative. The
>> readCountsForAnalysis object in my case has indeed quite some
>> negative values, quite more than what you have in the same object
>> derived from the tamoxifen dataset.
>> Can this be to the fact that my merging adjacent peaks coming from
>> different datasets the input amount gets "overestimated"? If so, do
>> you have any suggestion to take in this case?
>>
>> Thanks again for the advices.
>>
>> Cheers,
>> Federico
>>> Cheers-
>>> Rory
>>>
>>>
>>>
>>> From: Federico Marini <marinif@uni-mainz.de>>> <mailto:marinif@uni-mainz.de>>
>>> Date: Mon, 24 Mar 2014 15:47:56 +0100
>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>> Cc: "bioconductor@r-project.org
<mailto:bioconductor@r-project.org>"
>>> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">>
>>> Subject: Re: DiffBind and paired end data
>>>
>>> Hi Rory,
>>>
>>> I sorted the problem out, it apparently was a lack of memory in
the
>>> old server. I could not think that summarizeOverlaps was so
>>> demanding on these very deeply sequenced samples.
>>>
>>> Still, I have one additional question for you.
>>> I am trying to understand what DiffBind is doing for the
>>> normalization with the control sample. Could you elaborate the
>>> concept a little more? Is the raw count subtracted/rounded and
>>> subtracted or something else?
>>> I am asking because I see that a lot of regions get quite low
values
>>> in the final report (shown with the bCounts=TRUE option) even if
the
>>> peak caller identified peaks over there.
>>>
>>> Thanks in advance for the explanation! And once again, thank you
for
>>> the nice tool you provided.
>>>
>>> Federico
>>>
>>>
>>> On 20.03.14 16:08, Rory Stark wrote:
>>>> Hi Frederico-
>>>>
>>>> Let me know when you are able to use a recent build of DiffBind
to
>>>> see fi it works on your paord-end data.
>>>>
>>>> At some point I'll probably need access to at least a subset of
the
>>>> experiment (eg using Dropbox) so I can debug any further problems
>>>> you may be having.
>>>>
>>>> Cheers-
>>>> Rory
>>>>
>>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>>
>>>> Date: Tue, 4 Mar 2014 15:21:44 +0100
>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>> Subject: Re: DiffBind and paired end data
>>>>
>>>> Dear Rory,
>>>>
>>>> Great to hear back from you. And even better to read that the
>>>> feature is going to be implemented soon.
>>>>
>>>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and
>>>> with DBA$config$singleEnd <- FALSE before calling the dba.count.
>>>> I got this warning message during the count operations:
>>>>
>>>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
>>>>
>>>> # (8 times, I assume it is due to the 8 samples in total -
>>>> triplicates + respective input, 2 conditions)
>>>>
>>>>
>>>> The counting part still takes quite a very long time to perform.
>>>> When it was done, I sadly got this error message (sorry for some
>>>> parts in german):
>>>>
>>>> Fehler: Error processing one or more read files. Check
warnings().
>>>> Zusätzlich: Warnmeldungen:
>>>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
>>>> mc.allow.recursive = TRUE) :
>>>> scheduled cores 2 encountered errors in user code, all
values
>>>> of the jobs will be affected
>>>> 2:
>>>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der
>>>> Methodenauswahl
>>>> 4:
>>>> 5:
>>>> 6:
>>>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der
>>>> Methodenauswahl
>>>> 8:
>>>>
>>>>
>>>> The sessionInfo for this is as follows:
>>>>
>>>> sessionInfo()
>>>> R version 3.0.2 (2013-09-25)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel stats graphics grDevices utils datasets
>>>> methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
>>>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
>>>> [7] BiocGenerics_0.8.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2
>>>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
>>>> KernSmooth_2.23-10
>>>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
tools_3.0.2
>>>> [13] zlibbioc_1.8.0
>>>>
>>>>
>>>>
>>>> Will you also plan to implement the recent featureCounts, too?
This
>>>> could possibly speed up massively the operations, I guess.
>>>>
>>>> If I'd ask you from your hands-on experience, what is a "good"
>>>> number of DB sites? I assume this is not a one size fits all
>>>> situation, but just as an overall idea it could help to have some
>>>> numbers.
>>>>
>>>> Thank you again for replying, and I will get back to you as soon
as
>>>> I have a better idea of what possibly went wrong.
>>>>
>>>> Best regards,
>>>>
>>>> Federico
>>>>
>>>>
>>>>
>>>> On 04.03.14 12:36, Rory Stark wrote:
>>>>> Hi Frederico-
>>>>>
>>>>> Looks like you've got it just about right you want to use the
>>>>> bUseSummarizedOverlaps option. For paired end, you don't need to
>>>>> set the insert length (changed to fragmentSize moving forward)
as
>>>>> with paired end data the size of each fragment is known.
>>>>>
>>>>> I will add the option to set the fragments parameter (in
>>>>> DBA$config$fragments), it should appear in the development
version
>>>>> 1.9.9.
>>>>>
>>>>> Do let me know if you have any problems wit this, as this is a
>>>>> scenario we'd really like to see work and can help debugging if
>>>>> there are issues.
>>>>>
>>>>> Cheers-
>>>>> Rory
>>>>>
>>>>> From: Federico Marini <marinif@uni-mainz.de>>>>> <mailto:marinif@uni-mainz.de>>
>>>>> Date: Mon, 24 Feb 2014 14:51:49 +0100
>>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>>> Subject: DiffBind and paired end data
>>>>>
>>>>> Dear Dr. Stark,
>>>>>
>>>>> I had the pleasure to attend ("back then in 2013") to the
>>>>> presentation you gave at the EBI Advanced Course for RNA-seq and
>>>>> ChIP-seq, where you introduced DiffBind and its usage examples.
>>>>>
>>>>> Now I moved from IMB - Mainz and I am working in the
Biostatistics
>>>>> department of the University of Medical Center in Mainz as a
>>>>> research associate, where I also have the chance of doing my
PhD.
>>>>>
>>>>> I contact you regarding indeed DiffBind, and the possibility of
>>>>> doing differential binding analysis for ChIP-seq paired end
data.
>>>>> As one of our collaborators recently produced such datasets, and
>>>>> they would like to investigate the aspect of changes in the
>>>>> binding for TFs and histone modifications, I thought the
DiffBind
>>>>> framework would be a very solid solution for analysis.
>>>>>
>>>>> The doubt I have, is whether DiffBind can use the information of
>>>>> the paired end data, and how. Ideally I guess the best way would
>>>>> be this: properly paired reads will be counted just once for
each
>>>>> of the ends, and singletons(/not properly paired) will also
count one.
>>>>>
>>>>> I was following the discussions around
>>>>>
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html,
>>>>> and it seems that by incorporating the summarizeOverlaps
functions
>>>>> it would be (almost-)possible, but I would like to double check
it
>>>>> with you.
>>>>> bUseSummarizeOverlaps would be set to TRUE, DBA$config$singleEnd
>>>>> to false. Is there anything else I should take into account
(e.g.
>>>>> the insert length in this case would be meaningful?)? Is there
>>>>> also a parameter to use the "fragment" parameter set to true in
>>>>> the summarizeOverlaps function?
>>>>>
>>>>> Thank you very much in advance for the attention, and thank you
>>>>> again for the nice package on top of it!
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Federico
>>>>>
>>>>> --
>>>>> *Federico Marini, M.Sc.*
>>>>> Medizinische Biometrie
>>>>> _____________________________________________
>>>>> UNIVERSITÄTSMEDIZIN
>>>>> der Johannes Gutenberg-Universität Mainz
>>>>> Institut für Medizinische Biometrie, Epidemiologie und
Informatik
>>>>> (IMBEI)
>>>>> Abteilung Medizinische Biometrie
>>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>>>> Zahlbacher Str. 69, 55131 Mainz
>>>>> www.imbei.uni-mainz.de
>>>>> Telefon +49 (0) 6131 17-7029
>>>>> Telefax +49 (0) 6131 17-472433
>>>>> E-Mail: federico.marini@unimedizin-mainz.de
>>>>> marinif@uni-mainz.de
>>>>
>>>> --
>>>> *Federico Marini, M.Sc.*
>>>> Medizinische Biometrie
>>>> _____________________________________________
>>>> UNIVERSITÄTSMEDIZIN
>>>> der Johannes Gutenberg-Universität Mainz
>>>> Institut für Medizinische Biometrie, Epidemiologie und Informatik
>>>> (IMBEI)
>>>> Abteilung Medizinische Biometrie
>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>>> Zahlbacher Str. 69, 55131 Mainz
>>>> www.imbei.uni-mainz.de
>>>> Telefon +49 (0) 6131 17-7029
>>>> Telefax +49 (0) 6131 17-472433
>>>> E-Mail: federico.marini@unimedizin-mainz.de
>>>> marinif@uni-mainz.de
>>>
>>> --
>>> *Federico Marini, M.Sc.*
>>> Medical Biostatistics
>>> _____________________________________________
>>> University Medical Center of the Johannes Gutenberg University
Mainz
>>> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
>>> Medical Biostatistics Department
>>> Postal address: 55101 Mainz
>>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
>>> www.imbei.uni-mainz.de
>>> Phone +49 (0) 6131 17-7029
>>> Fax +49 (0) 6131 17-472433
>>> E-Mail: marinif@uni-mainz.de
>>
>> --
>> *Federico Marini, M.Sc.*
>> Medical Biostatistics
>> _____________________________________________
>> University Medical Center of the Johannes Gutenberg University
Mainz
>> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
>> Medical Biostatistics Department
>> Postal address: 55101 Mainz
>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
>> www.imbei.uni-mainz.de
>> Phone +49 (0) 6131 17-7029
>> Fax +49 (0) 6131 17-472433
>> E-Mail: marinif@uni-mainz.de
>
> --
> *Federico Marini, M.Sc.*
> Medical Biostatistics
> _____________________________________________
> University Medical Center of the Johannes Gutenberg University Mainz
> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
> Medical Biostatistics Department
> Postal address: 55101 Mainz
> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
> www.imbei.uni-mainz.de
> Phone +49 (0) 6131 17-7029
> Fax +49 (0) 6131 17-472433
> E-Mail: marinif@uni-mainz.de
--
*Federico Marini, M.Sc.*
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de
[[alternative HTML version deleted]]
Dear Rory,
just a brief update:
The counting with bScaleControl is returning pretty much the same
results as without it - which can be correct, as the sizes are somehow
comparable.
As I wrote you in the former email, I do have the samples from batch 2
for RNA PolII that differ in terms of % of mapped reads. But
nevertheless I get a very comparable amount of counts (checked
afterwards with the chipReads and controlReads variables).
This means it should not be due to the scaling.
I am doing the same checks for the other ChIPped material and it seems
the strange pattern is not there for them. The one thing that is still
consistently there is the relative high number of negative read values
in the readCountsForAnalysis object, created with the subtraction of
the
scaled/unscaled counts.
I hope this helps somehow in finding out what can be wrong. From my
side
I will give the manual approach a try soon, so that it can be found
whether summarizeOverlaps is behaving not as expected.
Cheers,
Federico
On 31.03.14 10:45, Federico Marini wrote:
> Hi Rory,
>
> On 28.03.14 18:10, Rory Stark wrote:
>> Frederico-
>>
>> So the controls for wt_r1 and wt_r2 are the exact same, and
likewise
>> for KD_r1 and KD_r2? If the number of reads in the control is close
>> to those in the ChIP, this could't account for the large
differences
>> in control reads you are seeing.
>>
> The controls are indeed the very same. And in advance I already
> downsampled all samples actually down to about 20 mln uniquely
mapping
> reads. So as you say, it seems kind of strange.
>> Two things we can try to see of the control scaling is doing
>> something unexpected:
>>
>> One is to turn of control scaling in the call to dba.count
>> (bScaleControl=FALSE) -- then you should get identical counts for
the
>> shared controls.
>>
>> The other other thing I can think of to try is to make sure that we
>> are getting the correct library size from genomicAlignment using
>> countBam. The easiest way to check this is to examine the DBA
object
>> after the call to dba.count:
>>
>> > lib.sizes = DBA$class[ DiffBind:::PV_READS,]
>>
>> Will give you the total number of reads in each library. Check
these
>> numbers to see if they correspond to how many reads you think there
>> should be, or if there are any big disparities.
>>
> I am currently re-runnign the counting step, which is the most
> intensive on RAM to carry out. But the lib sizes seem to be
recognized
> correctly - the numbers all range from 20 to 22 mln reads as they
> should (I think the size accounts for multi-mapping reads too). The
> samples for one batch of one factor had lower amount of mapped
reads,
> which is mirrored by higher values in the size of the library. I
will
> update you when the counting is over.
>> You are correct that after scaling the counts are rounded to given
>> the DE package whole count numbers. In the report, the
concentrations
>> are reported as log2 values, which is why they can be very small or
>> even negative.
>>
>> Finally, to retrieve the DESeq CountDataSet object:
>>
>> > DESeqCountDataSetObject = DBA$contrasts[[n]]$DESeq1$DEdata
>>
>> You'll need to run dba.analyze with bReduceObject=FALSE to get back
a
>> "complete" result set for use in DESeq.
> Thanks, great. So far then the concentrations are fine, but the
> concerns are still on for the reads.
> Would it make sense for you to use as said in the previous messages
a
> tool such as featureCounts as a check? My question was pointing at
how
> to re-import the count matrix into DiffBind, and as I said use the
> nice heatmaps/MA plots you already provide.
>
> I'll keep you posted!
>
> Best,
>
> Federico
>>
>> Cheers-
>> Rory
>>
>>
>> From: Federico Marini <marinif@uni-mainz.de>> <mailto:marinif@uni-mainz.de>>
>> Date: Fri, 28 Mar 2014 14:28:15 +0100
>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>> <mailto:rory.stark@cruk.cam.ac.uk>>
>> Cc: "bioconductor@r-project.org
<mailto:bioconductor@r-project.org>"
>> <bioconductor@r-project.org <mailto:bioconductor@r-project.org="">>
>> Subject: Re: DiffBind and paired end data
>>
>> Hi Rory,
>>
>> An update on the read count values. For your tamoxifen datasets it
>> happens the following:
>>
>> head(controlReads)
>> BT474.1- BT474.2- MCF7.1- MCF7.2- MCF7.1+ MCF7.2+ MCF7.3+
>> T47D.1+ T47D.2+ ZR75.1+ ZR75.2+
>> 1433 3 3 8 8 4 4 4
>> 3 3 7 7
>> 7 43 43 8 8 24 25 25
>> 6 6 20 20
>> 1606 13 13 2 2 19 19 19
>> 8 8 21 21
>> 156 33 33 4 4 6 6 6
>> 8 8 8 8
>> 1238 27 27 5 5 29 30 30
>> 11 11 29 29
>> 1446 3 3 2 2 3 3 3
>> 1 1 4 4
>>
>> Correctly each input has the same amount of reads.
>>
>> In our case we had one input for each condition and for each batch.
>> Samples 1 and 2 for each condition were sequenced in batch 1,
samples
>> numbered as 3 were done in batch 2.
>>
>> head(dba_controlReads)
>> wt_r1 wt_r2 wt_r3 KD_r1 KD_r2 KD_r3
>> 4442 168 226 152 239
>> 241 100
>> 1464 696 934 471 592
>> 596 289
>> 2956 4293 5762 3538 4677
>> 4711 2178
>> 1292 1035 1390 704 1110
>> 1118 482
>> 5741 229 307 172 230
>> 232 112
>> 557 150 201 87 165
>> 166 71
>>
>> The numbers differ in every column, and this is somehow probably
>> pointing to some strange behaviour of summarizeOverlaps. Would you
>> have the same guess if you look at this lines?
>>
>> Cheers,
>> Federico
>>
>> On 28.03.14 13:29, Federico Marini wrote:
>>> Hi Rory,
>>>
>>> Thanks for the detailed reply.
>>>
>>> On 27.03.14 20:14, Rory Stark wrote:
>>>> Hi Frederico-
>>>>
>>>> Regarding normalizing to controls, the default behavior in
DiffBind
>>>> is to first get the number of reads in the ChIP, and the number
of
>>>> reads in the control, then scale the control reads if the control
>>>> library was sequenced to higher depth. Then it subtracts the
>>>> control reads from the ChIP reads (the _MINUS scoring options).
>>>> Finally the read counts for each sample are normalised with the
>>>> other samples (this depends which analysis method you use; the
>>>> default is edgeR's TMM method, based on the total library
>>>> size).This is not an overly-principled step, it just dampens some
>>>> peaks where there are a high concentration of reads int he
control.
>>>> The assumption is that you used the controls for the peak calling
>>>> peak callers like MACS model the background more elaborately.
Read
>>>> scores can't be negative.
>>> I guess that the rescaling, if performed, is then somehow rounded
>>> up, so that the amounts of reads are still discrete numbers - to
>>> feed for edgeR/DESeq. If it is not the case, please correct me. I
>>> was asking about details because I was trying to "do what DiffBind
>>> does" with single steps performed outside of the pipeline, i.e.
>>> counting in the enriched regions with featureCounts, importing the
>>> count table back in a semi-standard DESeq analysis (and there I
did
>>> not know what to do for the input to be as close as possible to
your
>>> procedure).
>>>
>>> One point I could not figure out of the vignette and also of the
>>> reference manual is the sequence of exact steps to pass the count
>>> matrix back to diffbind (to exploit the plot aids and so on).
Could
>>> you provide me a pseudocode example for it? I got lost in the
>>> reference with the dba.peakset explanation. You can assume I have
>>> one vector of counts per each sample.
>>>>
>>>> If you want to see what is really going on, I suggest retrieving
>>>> the report as a SummarizedExperiment and look through the
>>>> appropriate assays to get the raw counts. For example, using the
>>>> example set:
>>>>
>>>> > data(tamoxifen_analysis)
>>>> > sumexp =
dba.report(tamoxifen,1,DataType=DBA_DATA_SUMMARIZED_EXPERIMENT)
>>>> > assays(sumexp)
>>>> List of length 5
>>>> names(5): scores RPKM Reads cRPKM cReads
>>>>
>>>> You can see the actual (non-normalized) read counts for the ChIP
>>>> (min 1):
>>>> > chipReads = assay(sumexp,3)
>>>>
>>>> And for the Control (min 1):
>>>> > controlReads = assay(sumexp,5)
>>>>
>>>> And the subtracted values that are actually given to edgeR (or
>>>> DESeq/DESeq2):
>>>> > readCountsForAnalysis = assay(sumexp,3) - assay(sumexp,5)
>>> I tried this on your tamoxifen sample dataset, as well as the ones
>>> we have. A few additional information, the data we had was
extremely
>>> deep sequenced (drosophila, sometimes more than 20M reads per
>>> replicate) - so I tried to downsample the data, and I did this
>>> scaling each dataset to 20M reads. So for this reason I think that
>>> the scaling for the input should not have that much of an impact.
>>>>
>>>> You can compare these counts with what you see in the browser for
>>>> the same region for ChIP and control. If there is a big mismatch
>>>> let me know and we can figure out why summarizeOverlaps isn't
doing
>>>> what we expect.
>>>>
>>> It seems that the numbers are somehow correct - I did a few spot
>>> checks. The thing is, the concentration values look kind of
strange
>>> to me. Indeed, for regions where I detected peaks in all samples
(I
>>> used MACS2 with mostly default parameters) I see that the final
>>> concentration value is quite close to 0 or even negative. The
>>> readCountsForAnalysis object in my case has indeed quite some
>>> negative values, quite more than what you have in the same object
>>> derived from the tamoxifen dataset.
>>> Can this be to the fact that my merging adjacent peaks coming from
>>> different datasets the input amount gets "overestimated"? If so,
do
>>> you have any suggestion to take in this case?
>>>
>>> Thanks again for the advices.
>>>
>>> Cheers,
>>> Federico
>>>> Cheers-
>>>> Rory
>>>>
>>>>
>>>>
>>>> From: Federico Marini <marinif@uni-mainz.de>>>> <mailto:marinif@uni-mainz.de>>
>>>> Date: Mon, 24 Mar 2014 15:47:56 +0100
>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>> Cc: "bioconductor@r-project.org
>>>> <mailto:bioconductor@r-project.org>" <bioconductor@r-project.org>>>> <mailto:bioconductor@r-project.org>>
>>>> Subject: Re: DiffBind and paired end data
>>>>
>>>> Hi Rory,
>>>>
>>>> I sorted the problem out, it apparently was a lack of memory in
the
>>>> old server. I could not think that summarizeOverlaps was so
>>>> demanding on these very deeply sequenced samples.
>>>>
>>>> Still, I have one additional question for you.
>>>> I am trying to understand what DiffBind is doing for the
>>>> normalization with the control sample. Could you elaborate the
>>>> concept a little more? Is the raw count subtracted/rounded and
>>>> subtracted or something else?
>>>> I am asking because I see that a lot of regions get quite low
>>>> values in the final report (shown with the bCounts=TRUE option)
>>>> even if the peak caller identified peaks over there.
>>>>
>>>> Thanks in advance for the explanation! And once again, thank you
>>>> for the nice tool you provided.
>>>>
>>>> Federico
>>>>
>>>>
>>>> On 20.03.14 16:08, Rory Stark wrote:
>>>>> Hi Frederico-
>>>>>
>>>>> Let me know when you are able to use a recent build of DiffBind
to
>>>>> see fi it works on your paord-end data.
>>>>>
>>>>> At some point I'll probably need access to at least a subset of
>>>>> the experiment (eg using Dropbox) so I can debug any further
>>>>> problems you may be having.
>>>>>
>>>>> Cheers-
>>>>> Rory
>>>>>
>>>>> From: Federico Marini <marinif@uni-mainz.de>>>>> <mailto:marinif@uni-mainz.de>>
>>>>> Date: Tue, 4 Mar 2014 15:21:44 +0100
>>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>>> Subject: Re: DiffBind and paired end data
>>>>>
>>>>> Dear Rory,
>>>>>
>>>>> Great to hear back from you. And even better to read that the
>>>>> feature is going to be implemented soon.
>>>>>
>>>>> So far I am trying it then with bUseSummarizeOverlaps = TRUE and
>>>>> with DBA$config$singleEnd <- FALSE before calling the dba.count.
>>>>> I got this warning message during the count operations:
>>>>>
>>>>> Warnung in FUN(bf, param = param) : 'yieldSize' set to 'NA'
>>>>>
>>>>> # (8 times, I assume it is due to the 8 samples in total -
>>>>> triplicates + respective input, 2 conditions)
>>>>>
>>>>>
>>>>> The counting part still takes quite a very long time to perform.
>>>>> When it was done, I sadly got this error message (sorry for some
>>>>> parts in german):
>>>>>
>>>>> Fehler: Error processing one or more read files. Check
warnings().
>>>>> Zusätzlich: Warnmeldungen:
>>>>> 1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE,
>>>>> mc.allow.recursive = TRUE) :
>>>>> scheduled cores 2 encountered errors in user code, all
>>>>> values of the jobs will be affected
>>>>> 2:
>>>>> 3: Fehler bei der Auswertung des Argumentes 'x' bei der
>>>>> Methodenauswahl
>>>>> 4:
>>>>> 5:
>>>>> 6:
>>>>> 7: Fehler bei der Auswertung des Argumentes 'x' bei der
>>>>> Methodenauswahl
>>>>> 8:
>>>>>
>>>>>
>>>>> The sessionInfo for this is as follows:
>>>>>
>>>>> sessionInfo()
>>>>> R version 3.0.2 (2013-09-25)
>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.14.3 Biostrings_2.30.1 DiffBind_1.8.4
>>>>> [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6
>>>>> [7] BiocGenerics_0.8.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] amap_0.8-11 bitops_1.0-6 caTools_1.16 edgeR_3.4.2
>>>>> [5] gdata_2.13.2 gplots_2.12.1 gtools_3.3.0
KernSmooth_2.23-10
>>>>> [9] limma_3.18.13 RColorBrewer_1.0-5 stats4_3.0.2
>>>>> tools_3.0.2
>>>>> [13] zlibbioc_1.8.0
>>>>>
>>>>>
>>>>>
>>>>> Will you also plan to implement the recent featureCounts, too?
>>>>> This could possibly speed up massively the operations, I guess.
>>>>>
>>>>> If I'd ask you from your hands-on experience, what is a "good"
>>>>> number of DB sites? I assume this is not a one size fits all
>>>>> situation, but just as an overall idea it could help to have
some
>>>>> numbers.
>>>>>
>>>>> Thank you again for replying, and I will get back to you as soon
>>>>> as I have a better idea of what possibly went wrong.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Federico
>>>>>
>>>>>
>>>>>
>>>>> On 04.03.14 12:36, Rory Stark wrote:
>>>>>> Hi Frederico-
>>>>>>
>>>>>> Looks like you've got it just about right you want to use the
>>>>>> bUseSummarizedOverlaps option. For paired end, you don't need
to
>>>>>> set the insert length (changed to fragmentSize moving forward)
as
>>>>>> with paired end data the size of each fragment is known.
>>>>>>
>>>>>> I will add the option to set the fragments parameter (in
>>>>>> DBA$config$fragments), it should appear in the development
>>>>>> version 1.9.9.
>>>>>>
>>>>>> Do let me know if you have any problems wit this, as this is a
>>>>>> scenario we'd really like to see work and can help debugging if
>>>>>> there are issues.
>>>>>>
>>>>>> Cheers-
>>>>>> Rory
>>>>>>
>>>>>> From: Federico Marini <marinif@uni-mainz.de>>>>>> <mailto:marinif@uni-mainz.de>>
>>>>>> Date: Mon, 24 Feb 2014 14:51:49 +0100
>>>>>> To: Rory Stark <rory.stark@cruk.cam.ac.uk>>>>>> <mailto:rory.stark@cruk.cam.ac.uk>>
>>>>>> Subject: DiffBind and paired end data
>>>>>>
>>>>>> Dear Dr. Stark,
>>>>>>
>>>>>> I had the pleasure to attend ("back then in 2013") to the
>>>>>> presentation you gave at the EBI Advanced Course for RNA-seq
and
>>>>>> ChIP-seq, where you introduced DiffBind and its usage examples.
>>>>>>
>>>>>> Now I moved from IMB - Mainz and I am working in the
>>>>>> Biostatistics department of the University of Medical Center in
>>>>>> Mainz as a research associate, where I also have the chance of
>>>>>> doing my PhD.
>>>>>>
>>>>>> I contact you regarding indeed DiffBind, and the possibility of
>>>>>> doing differential binding analysis for ChIP-seq paired end
data.
>>>>>> As one of our collaborators recently produced such datasets,
and
>>>>>> they would like to investigate the aspect of changes in the
>>>>>> binding for TFs and histone modifications, I thought the
DiffBind
>>>>>> framework would be a very solid solution for analysis.
>>>>>>
>>>>>> The doubt I have, is whether DiffBind can use the information
of
>>>>>> the paired end data, and how. Ideally I guess the best way
would
>>>>>> be this: properly paired reads will be counted just once for
each
>>>>>> of the ends, and singletons(/not properly paired) will also
count
>>>>>> one.
>>>>>>
>>>>>> I was following the discussions around
>>>>>>
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053394.html, and
>>>>>> it seems that by incorporating the summarizeOverlaps functions
it
>>>>>> would be (almost-)possible, but I would like to double check it
>>>>>> with you.
>>>>>> bUseSummarizeOverlaps would be set to TRUE,
DBA$config$singleEnd
>>>>>> to false. Is there anything else I should take into account
(e.g.
>>>>>> the insert length in this case would be meaningful?)? Is there
>>>>>> also a parameter to use the "fragment" parameter set to true in
>>>>>> the summarizeOverlaps function?
>>>>>>
>>>>>> Thank you very much in advance for the attention, and thank you
>>>>>> again for the nice package on top of it!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Federico
>>>>>>
>>>>>> --
>>>>>> *Federico Marini, M.Sc.*
>>>>>> Medizinische Biometrie
>>>>>> _____________________________________________
>>>>>> UNIVERSITÄTSMEDIZIN
>>>>>> der Johannes Gutenberg-Universität Mainz
>>>>>> Institut für Medizinische Biometrie, Epidemiologie und
Informatik
>>>>>> (IMBEI)
>>>>>> Abteilung Medizinische Biometrie
>>>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>>>>> Zahlbacher Str. 69, 55131 Mainz
>>>>>> www.imbei.uni-mainz.de
>>>>>> Telefon +49 (0) 6131 17-7029
>>>>>> Telefax +49 (0) 6131 17-472433
>>>>>> E-Mail: federico.marini@unimedizin-mainz.de
>>>>>> marinif@uni-mainz.de
>>>>>
>>>>> --
>>>>> *Federico Marini, M.Sc.*
>>>>> Medizinische Biometrie
>>>>> _____________________________________________
>>>>> UNIVERSITÄTSMEDIZIN
>>>>> der Johannes Gutenberg-Universität Mainz
>>>>> Institut für Medizinische Biometrie, Epidemiologie und
Informatik
>>>>> (IMBEI)
>>>>> Abteilung Medizinische Biometrie
>>>>> Postanschrift: 55101 Mainz Haus- und Lieferanschrift: Obere
>>>>> Zahlbacher Str. 69, 55131 Mainz
>>>>> www.imbei.uni-mainz.de
>>>>> Telefon +49 (0) 6131 17-7029
>>>>> Telefax +49 (0) 6131 17-472433
>>>>> E-Mail: federico.marini@unimedizin-mainz.de
>>>>> marinif@uni-mainz.de
>>>>
>>>> --
>>>> *Federico Marini, M.Sc.*
>>>> Medical Biostatistics
>>>> _____________________________________________
>>>> University Medical Center of the Johannes Gutenberg University
Mainz
>>>> Institute of Medical Biostatistics, Epidemiology and Informatics
>>>> (IMBEI)
>>>> Medical Biostatistics Department
>>>> Postal address: 55101 Mainz
>>>> Office and delivery address: Obere Zahlbacher Str. 69, 55131
Mainz
>>>> www.imbei.uni-mainz.de
>>>> Phone +49 (0) 6131 17-7029
>>>> Fax +49 (0) 6131 17-472433
>>>> E-Mail: marinif@uni-mainz.de
>>>
>>> --
>>> *Federico Marini, M.Sc.*
>>> Medical Biostatistics
>>> _____________________________________________
>>> University Medical Center of the Johannes Gutenberg University
Mainz
>>> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
>>> Medical Biostatistics Department
>>> Postal address: 55101 Mainz
>>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
>>> www.imbei.uni-mainz.de
>>> Phone +49 (0) 6131 17-7029
>>> Fax +49 (0) 6131 17-472433
>>> E-Mail: marinif@uni-mainz.de
>>
>> --
>> *Federico Marini, M.Sc.*
>> Medical Biostatistics
>> _____________________________________________
>> University Medical Center of the Johannes Gutenberg University
Mainz
>> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
>> Medical Biostatistics Department
>> Postal address: 55101 Mainz
>> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
>> www.imbei.uni-mainz.de
>> Phone +49 (0) 6131 17-7029
>> Fax +49 (0) 6131 17-472433
>> E-Mail: marinif@uni-mainz.de
>
> --
> *Federico Marini, M.Sc.*
> Medical Biostatistics
> _____________________________________________
> University Medical Center of the Johannes Gutenberg University Mainz
> Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
> Medical Biostatistics Department
> Postal address: 55101 Mainz
> Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
> www.imbei.uni-mainz.de
> Phone +49 (0) 6131 17-7029
> Fax +49 (0) 6131 17-472433
> E-Mail: marinif@uni-mainz.de
--
*Federico Marini, M.Sc.*
Medical Biostatistics
_____________________________________________
University Medical Center of the Johannes Gutenberg University Mainz
Institute of Medical Biostatistics, Epidemiology and Informatics
(IMBEI)
Medical Biostatistics Department
Postal address: 55101 Mainz
Office and delivery address: Obere Zahlbacher Str. 69, 55131 Mainz
www.imbei.uni-mainz.de
Phone +49 (0) 6131 17-7029
Fax +49 (0) 6131 17-472433
E-Mail: marinif@uni-mainz.de
[[alternative HTML version deleted]]