Hi
I would like to use DEseq or DEseq2 to normalise the peak signal for
some Chip-seq data across 10 biological replicates.
I started looking at the DEseq documentation - it seems the program
requires a matrix arrangement of raw count data, where each row is a
peak and each column is a replicate.
What is the best way to obtain this? I have bam files for the reads,
obtained with BWA, and bed files (or alternatively narrowPeak files)
for
the peak intervals, obtained using MACS.
I gather it is possible to use a program called HTseq to compute these
counts, however this program seems unable to deal with bed files, only
with gff files, and I'd prefer working directly with my beds if at all
possible. Thank you.
Best regards
Giuseppe
Hi Giuseppe,
Unfortunately there is not much available to do stats on ChIPseq data.
It
is my experience that the data shows exactly the same overdispersion
problem that is see in RNAseq so using either EdgeR, DEseq or DEseq2
to
analyze ChIPseq data is the way to go. There are a couple of
challenges
along the way that make this undertaking not quite straightforward.The
only
bioconductor package that I know tries to tackle this issues is
DiffBind,
so you can give it a try.
One of the main differences is that unlike gene or exon coordinates,
peaks
in your individual replicates will not be exactly in the same place,
if you
are working with TF data this will not be too bad, but anything
nucleosome
associated will have considerable phase shift from replicate to
replicate.
So you first have to do some sort of merging of reproducible peaks
into
regions.
I do not recommend doing the peak calling with the pooled data.After
doing
several ChIP-seq experiments with replicates I have observed that a
lot of
peaks, even ones with high z-scores/low p-values, do not show up in
more
than one replicate (but maybe this is particular to my type of
experiments). Merging all the peaks leads to a high number of false
positives. So you need to integrate the peak locations into a single
file
but make sure you have a minimum number of carriers for each peak, I
usually do presence in at least 2 of the replicates.
You can make a gff file that you can feed into HTSeq in which you
define
the reproducible peak regions on your samples as if it was the gff
with the
gene models, but making this file takes a little bit of work.
We are currently preparing a package for CRAN submission to
specifically
integrate the analysis of ChIP-seq data with replicates to EdgeR and
DESeq,
addressing most of what I mentioned above and including a peak caller
for
ease of flow of the analysis.I cannot finish the submission until the
accompanying biological paper is out, so it won't be available until
next
year.
hope this was helpful
best
Lucia
On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone <
giuseppe.gallone@dpag.ox.ac.uk> wrote:
> Hi
>
> I would like to use DEseq or DEseq2 to normalise the peak signal for
some
> Chip-seq data across 10 biological replicates.
>
> I started looking at the DEseq documentation - it seems the program
> requires a matrix arrangement of raw count data, where each row is a
peak
> and each column is a replicate.
>
> What is the best way to obtain this? I have bam files for the reads,
> obtained with BWA, and bed files (or alternatively narrowPeak files)
for
> the peak intervals, obtained using MACS.
>
> I gather it is possible to use a program called HTseq to compute
these
> counts, however this program seems unable to deal with bed files,
only with
> gff files, and I'd prefer working directly with my beds if at all
possible.
> Thank you.
>
> Best regards
> Giuseppe
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.
> science.biology.informatics.conductor
>
--
Lucia Peixoto PhD
Postdoctoral Research Fellow
Laboratory of Dr. Ted Abel
Department of Biology
School of Arts and Sciences
University of Pennsylvania
"Think boldly, don't be afraid of making mistakes, don't miss small
details, keep your eyes open, and be modest in everything except your
aims."
Albert Szent-Gyorgyi
[[alternative HTML version deleted]]
Hello Lucia
this is all great info! Thank you very much for taking the time to
share
your findings. I am indeed using diffbind, and found some interesting
results, however now I'd need to access directly my downsampled mapped
reads.
With diffbind, I was basically throwing in my mapped reads and my
peak
intervals and getting a differential analysis out (after setting up a
contrast).
Now I'd like to try something slightly different and need to work with
normalised bams of my samples. The problem I have is that my 10 bam
have
wildly varying numbers of mapping reads. I would like to downsample
them
all to a minimum common before examining quantitative differences in
the
peak signals across them.
I was hoping I could do this with DEseq: feed it some bams and obtain
normalised versions of them. But I understand this is not possible?
I guess I will try to downsample my bams by myself using for example
picard and then take it from there. Are there maybe some alternatives
you'd suggest? I know MACS also allows to downsample bam. Thanks!
Giuseppe
On 11/04/13 18:13, Lucia Peixoto wrote:
> Hi Giuseppe,
> Unfortunately there is not much available to do stats on ChIPseq
data.
> It is my experience that the data shows exactly the same
overdispersion
> problem that is see in RNAseq so using either EdgeR, DEseq or DEseq2
to
> analyze ChIPseq data is the way to go. There are a couple of
challenges
> along the way that make this undertaking not quite
straightforward.The
> only bioconductor package that I know tries to tackle this issues is
> DiffBind, so you can give it a try.
>
> One of the main differences is that unlike gene or exon coordinates,
> peaks in your individual replicates will not be exactly in the same
> place, if you are working with TF data this will not be too bad, but
> anything nucleosome associated will have considerable phase shift
from
> replicate to replicate. So you first have to do some sort of merging
of
> reproducible peaks into regions.
>
> I do not recommend doing the peak calling with the pooled data.After
> doing several ChIP-seq experiments with replicates I have observed
that
> a lot of peaks, even ones with high z-scores/low p-values, do not
show
> up in more than one replicate (but maybe this is particular to my
type
> of experiments). Merging all the peaks leads to a high number of
false
> positives. So you need to integrate the peak locations into a single
> file but make sure you have a minimum number of carriers for each
peak,
> I usually do presence in at least 2 of the replicates.
> You can make a gff file that you can feed into HTSeq in which you
define
> the reproducible peak regions on your samples as if it was the gff
with
> the gene models, but making this file takes a little bit of work.
> We are currently preparing a package for CRAN submission to
specifically
> integrate the analysis of ChIP-seq data with replicates to EdgeR and
> DESeq, addressing most of what I mentioned above and including a
peak
> caller for ease of flow of the analysis.I cannot finish the
submission
> until the accompanying biological paper is out, so it won't be
available
> until next year.
>
> hope this was helpful
> best
>
> Lucia
>
>
>
> On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone
> <giuseppe.gallone at="" dpag.ox.ac.uk="" <mailto:giuseppe.gallone="" at="" dpag.ox.ac.uk="">>
> wrote:
>
> Hi
>
> I would like to use DEseq or DEseq2 to normalise the peak signal
for
> some Chip-seq data across 10 biological replicates.
>
> I started looking at the DEseq documentation - it seems the
program
> requires a matrix arrangement of raw count data, where each row
is a
> peak and each column is a replicate.
>
> What is the best way to obtain this? I have bam files for the
reads,
> obtained with BWA, and bed files (or alternatively narrowPeak
files)
> for the peak intervals, obtained using MACS.
>
> I gather it is possible to use a program called HTseq to compute
> these counts, however this program seems unable to deal with bed
> files, only with gff files, and I'd prefer working directly with
my
> beds if at all possible. Thank you.
>
> Best regards
> Giuseppe
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives:
>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>
>
>
>
> --
> Lucia Peixoto PhD
> Postdoctoral Research Fellow
> Laboratory of Dr. Ted Abel
> Department of Biology
> School of Arts and Sciences
> University of Pennsylvania
>
> "Think boldly, don't be afraid of making mistakes, don't miss small
> details, keep your eyes open, and be modest in everything except
your
> aims."
> Albert Szent-Gyorgyi
hi Giuseppe,
DiffBind calls either edgeR, DESeq or DESeq2 during the processing
pipeline.
>From reading the DiffBind 1.8.2 vignette, I see on page 37:
The final results (as a DESeqDataSet) are accessible within the DBA
object
> as
> DBA$contrasts[[n]]$DESeq2$DEdata
> and may be examined and manipulated directly for further
customization.
So then you can get DESeq2 normalized counts at this point with:
dds <- DBA$contrasts[[n]]$DESeq2$DEdata
counts(dds, normalized=TRUE)
Mike
On Tue, Nov 5, 2013 at 7:57 AM, Giuseppe Gallone <
giuseppe.gallone@dpag.ox.ac.uk> wrote:
> Hello Lucia
>
> this is all great info! Thank you very much for taking the time to
share
> your findings. I am indeed using diffbind, and found some
interesting
> results, however now I'd need to access directly my downsampled
mapped
> reads.
>
> With diffbind, I was basically throwing in my mapped reads and my
peak
> intervals and getting a differential analysis out (after setting up
a
> contrast).
>
> Now I'd like to try something slightly different and need to work
with
> normalised bams of my samples. The problem I have is that my 10 bam
have
> wildly varying numbers of mapping reads. I would like to downsample
them
> all to a minimum common before examining quantitative differences in
the
> peak signals across them.
>
> I was hoping I could do this with DEseq: feed it some bams and
obtain
> normalised versions of them. But I understand this is not possible?
>
> I guess I will try to downsample my bams by myself using for example
> picard and then take it from there. Are there maybe some
alternatives you'd
> suggest? I know MACS also allows to downsample bam. Thanks!
>
> Giuseppe
>
>
> On 11/04/13 18:13, Lucia Peixoto wrote:
>
>> Hi Giuseppe,
>> Unfortunately there is not much available to do stats on ChIPseq
data.
>> It is my experience that the data shows exactly the same
overdispersion
>> problem that is see in RNAseq so using either EdgeR, DEseq or
DEseq2 to
>> analyze ChIPseq data is the way to go. There are a couple of
challenges
>> along the way that make this undertaking not quite
straightforward.The
>> only bioconductor package that I know tries to tackle this issues
is
>> DiffBind, so you can give it a try.
>>
>> One of the main differences is that unlike gene or exon
coordinates,
>> peaks in your individual replicates will not be exactly in the same
>> place, if you are working with TF data this will not be too bad,
but
>> anything nucleosome associated will have considerable phase shift
from
>> replicate to replicate. So you first have to do some sort of
merging of
>> reproducible peaks into regions.
>>
>> I do not recommend doing the peak calling with the pooled
data.After
>> doing several ChIP-seq experiments with replicates I have observed
that
>> a lot of peaks, even ones with high z-scores/low p-values, do not
show
>> up in more than one replicate (but maybe this is particular to my
type
>> of experiments). Merging all the peaks leads to a high number of
false
>> positives. So you need to integrate the peak locations into a
single
>> file but make sure you have a minimum number of carriers for each
peak,
>> I usually do presence in at least 2 of the replicates.
>> You can make a gff file that you can feed into HTSeq in which you
define
>> the reproducible peak regions on your samples as if it was the gff
with
>> the gene models, but making this file takes a little bit of work.
>> We are currently preparing a package for CRAN submission to
specifically
>> integrate the analysis of ChIP-seq data with replicates to EdgeR
and
>> DESeq, addressing most of what I mentioned above and including a
peak
>> caller for ease of flow of the analysis.I cannot finish the
submission
>> until the accompanying biological paper is out, so it won't be
available
>> until next year.
>>
>> hope this was helpful
>> best
>>
>> Lucia
>>
>>
>>
>> On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone
>> <giuseppe.gallone@dpag.ox.ac.uk <mailto:giuseppe.gallone@dpag.ox.ac.uk="">>
>>
>> wrote:
>>
>> Hi
>>
>> I would like to use DEseq or DEseq2 to normalise the peak
signal for
>> some Chip-seq data across 10 biological replicates.
>>
>> I started looking at the DEseq documentation - it seems the
program
>> requires a matrix arrangement of raw count data, where each row
is a
>> peak and each column is a replicate.
>>
>> What is the best way to obtain this? I have bam files for the
reads,
>> obtained with BWA, and bed files (or alternatively narrowPeak
files)
>> for the peak intervals, obtained using MACS.
>>
>> I gather it is possible to use a program called HTseq to
compute
>> these counts, however this program seems unable to deal with
bed
>> files, only with gff files, and I'd prefer working directly
with my
>> beds if at all possible. Thank you.
>>
>> Best regards
>> Giuseppe
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>
>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives:
>>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor<
>> http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>
>>
>> --
>> Lucia Peixoto PhD
>> Postdoctoral Research Fellow
>> Laboratory of Dr. Ted Abel
>> Department of Biology
>> School of Arts and Sciences
>> University of Pennsylvania
>>
>> "Think boldly, don't be afraid of making mistakes, don't miss small
>> details, keep your eyes open, and be modest in everything except
your
>> aims."
>> Albert Szent-Gyorgyi
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.
> science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
Hi Mike
thanks a lot for this, exactly what I needed.
Best
Giuseppe
On 05/11/2013 13:54, Michael Love wrote:
> hi Giuseppe,
>
> DiffBind calls either edgeR, DESeq or DESeq2 during the processing
pipeline.
>
> From reading the DiffBind 1.8.2 vignette, I see on page 37:
>
> The
final results (as a DESeqDataSet) are accessible within the DBA
> object as
> DBA$contrasts[[n]]$DESeq2$DEdata
> and may be examined and manipulated directly for further
customization.
>
>
> So then you can get DESeq2 normalized counts at this point with:
>
> dds <- DBA$contrasts[[n]]$DESeq2$DEdata
> counts(dds, normalized=TRUE)
>
> Mike
Dear Lucia,
On 05/nov/2013, at 13:57, Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote:
>> I do not recommend doing the peak calling with the pooled
data.After
>> doing several ChIP-seq experiments with replicates I have observed
that
>> a lot of peaks, even ones with high z-scores/low p-values, do not
show
>> up in more than one replicate (but maybe this is particular to my
type
>> of experiments). Merging all the peaks leads to a high number of
false
>> positives. So you need to integrate the peak locations into a
single
>> file but make sure you have a minimum number of carriers for each
peak,
>> I usually do presence in at least 2 of the replicates.
This sounds reasonable and, indeed, I have a similar anedoctical
experience. Do you have some data/benchmarks supporting this claim?
Thanks,
d
/*
Davide Cittaro, PhD
Coordinator of Bioinformatics Core
Center for Translational Genomics and Bioinformatics
Ospedale San Raffaele
Via Olgettina 58
20132 Milano
Italy
Office: +39 02 26439211
Mail: cittaro.davide at hsr.it
Skype: daweonline
*/
Hi,
I always think I should use IDR from ENCODE to get some <conservative>
peaks from replicates, then move to Diffbind/DESeq.
How do you think?
PS: IDR: https://sites.google.com/site/anshulkundaje/projects/idr
Regards,
Sheng
On Wed, Nov 6, 2013 at 10:18 AM, Davide Cittaro
<cittaro.davide@hsr.it>wrote:
> Dear Lucia,
>
> On 05/nov/2013, at 13:57, Giuseppe Gallone
<giuseppe.gallone@dpag.ox.ac.uk>
> wrote:
>
> >> I do not recommend doing the peak calling with the pooled
data.After
> >> doing several ChIP-seq experiments with replicates I have
observed that
> >> a lot of peaks, even ones with high z-scores/low p-values, do not
show
> >> up in more than one replicate (but maybe this is particular to my
type
> >> of experiments). Merging all the peaks leads to a high number of
false
> >> positives. So you need to integrate the peak locations into a
single
> >> file but make sure you have a minimum number of carriers for each
peak,
> >> I usually do presence in at least 2 of the replicates.
>
> This sounds reasonable and, indeed, I have a similar anedoctical
> experience. Do you have some data/benchmarks supporting this claim?
> Thanks,
>
> d
>
>
> /*
> Davide Cittaro, PhD
>
> Coordinator of Bioinformatics Core
> Center for Translational Genomics and Bioinformatics
> Ospedale San Raffaele
> Via Olgettina 58
> 20132 Milano
> Italy
>
> Office: +39 02 26439211
> Mail: cittaro.davide@hsr.it
> Skype: daweonline
> */
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
Hello Sheng
I have had biologically interesting results when selecting overlapping
peaks between replicates using the IDR. I first call peaks with MACS2
or
SPP, using a very loose q-value threshold, and then pass these to the
IDR.
However, in my experience this works fine between pairs of aligned
read
files. If you have many samples, my experience reflects Lucia's:
plotting an overlap function and then, based on the overlap behaviour,
choosing a minimum number of carriers for each peak position (in my
case
2-3) works well to create a consensus peakset.
Regards
Giuseppe
On 11/06/13 09:40, pbczyd . wrote:
> Hi,
> I always think I should use IDR from ENCODE to get some
<conservative>
> peaks from replicates, then move to Diffbind/DESeq.
>
> How do you think?
>
> PS: IDR: https://sites.google.com/site/anshulkundaje/projects/idr
>
> Regards,
> Sheng
>
--
Giuseppe Gallone, PhD
MRC career development fellow
MRC Functional Genomics Unit - DPAG
University of Oxford, UK
Hi Guiseppe-
You can retrieve the complete matrix of read counts from DiffBind,
either
normalized or not, using dba.peakset with bRetrieve=TRUE. To can set
the
score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS,
or
any of the other possible score values. The default score is
DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
after subtracting the reads in the control, and using the full library
size (not just the reads in peaks) as a scalar.
Cheers-
Rory
On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone <
giuseppe.gallone at dpag.ox.ac.uk> wrote:
>Hi
>
>I would like to use DEseq or DEseq2 to normalise the peak signal for
some
>Chip-seq data across 10 biological replicates.
>
>I started looking at the DEseq documentation - it seems the program
>requires a matrix arrangement of raw count data, where each row is a
peak
>and each column is a replicate.
>
>What is the best way to obtain this? I have bam files for the reads,
>obtained with BWA, and bed files (or alternatively narrowPeak files)
for
>the peak intervals, obtained using MACS.
>
>I gather it is possible to use a program called HTseq to compute
these
>counts, however this program seems unable to deal with bed files,
only
>with
>gff files, and I'd prefer working directly with my beds if at all
>possible.
>Thank you.
>
>Best regards
>Giuseppe
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: http://news.gmane.org/gmane.
>science.biology.informatics.conductor
Thanks a lot Rory. Do you think it would then make sense to use the
normalised counts in the peaks to build a ratio based on the raw count
and then feed this ratio to, say, picard to get a downsampled bam from
the original bam?
Best
Giuseppe
On 11/05/13 18:18, Rory Stark wrote:
> Hi Guiseppe-
>
> You can retrieve the complete matrix of read counts from DiffBind,
either
> normalized or not, using dba.peakset with bRetrieve=TRUE. To can set
the
> score to use via dba.count with peaks=NULL and
score=DBA_SCORE_READS, or
> any of the other possible score values. The default score is
> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
> after subtracting the reads in the control, and using the full
library
> size (not just the reads in peaks) as a scalar.
>
> Cheers-
> Rory
>
Hi Rory,
Could you give some insight into why TMM is used with full library
size,
it seems to make sense for effective library size case but where full
library size is used, would it still be valid to try and minimize
changes between conditions?
Best,
-Ying
On 11/05/13 18:18, Rory Stark wrote:
> Hi Guiseppe-
>
> You can retrieve the complete matrix of read counts from DiffBind,
either
> normalized or not, using dba.peakset with bRetrieve=TRUE. To can
set the
> score to use via dba.count with peaks=NULL and
score=DBA_SCORE_READS, or
> any of the other possible score values. The default score is
> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
> after subtracting the reads in the control, and using the full
library
> size (not just the reads in peaks) as a scalar.
>
> Cheers-
> Rory
>
Hi Ying-
We actually just changed the default normalization from effective to
full
library size in the most recent release. The reason is that while
effective is frequently a better choice, it is based on the assumption
that overall binding levels in all the samples is similar. When this
assumption is incorrect, it can result in substantially incorrect
results;
using full library size when effective applies results in less
catastrophically wrong answers.
I will definitely be changing normalization to use effective sizes
when it
is the right thing to do, but I have become aware that many (most?)
DiffBind users don't change the defaults, so we determined that a more
conservative default was preferable.
I'm not sure what you're asking regarding "try and minimize changes
between conditions" in this context?
Cheers-
Rory
On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote:
>Hi Rory,
>Could you give some insight into why TMM is used with full library
size,
>it seems to make sense for effective library size case but where full
>library size is used, would it still be valid to try and minimize
>changes between conditions?
>
>Best,
>-Ying
>
>On 11/05/13 18:18, Rory Stark wrote:
> > Hi Guiseppe-
> >
> > You can retrieve the complete matrix of read counts from DiffBind,
>either
> > normalized or not, using dba.peakset with bRetrieve=TRUE. To can
set
>the
> > score to use via dba.count with peaks=NULL and
score=DBA_SCORE_READS,
>or
> > any of the other possible score values. The default score is
> > DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
> > after subtracting the reads in the control, and using the full
library
> > size (not just the reads in peaks) as a scalar.
> >
> > Cheers-
> > Rory
> >
>
>
>
Hi Rory,
From my understanding, TMM normalization will try to minimize the
fold
changes between conditions (assumes that most regions are not
differentially expressed) and works best with about equal number of up
and down regulated genes (For the second part see some of the work
done
by Kadota K. on extending TMM http://pubmed.gov/22475125)
Using full library size, one no longer assumes that overall binding
levels in all samples are similar. If the motivation behind using full
library size is that overall binding levels are very different,
wouldn't
most regions then be differentially bound and thus TMM's assumption
that
most regions are unchanged also be invalid?
Best,
Ying
On 11/6/2013 11:16 AM, Rory Stark wrote:
> Hi Ying-
>
> We actually just changed the default normalization from effective to
full
> library size in the most recent release. The reason is that while
> effective is frequently a better choice, it is based on the
assumption
> that overall binding levels in all the samples is similar. When this
> assumption is incorrect, it can result in substantially incorrect
results;
> using full library size when effective applies results in less
> catastrophically wrong answers.
>
> I will definitely be changing normalization to use effective sizes
when it
> is the right thing to do, but I have become aware that many (most?)
> DiffBind users don't change the defaults, so we determined that a
more
> conservative default was preferable.
>
> I'm not sure what you're asking regarding "try and minimize changes
> between conditions" in this context?
>
> Cheers-
> Rory
>
> On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote:
>
>> Hi Rory,
>> Could you give some insight into why TMM is used with full library
size,
>> it seems to make sense for effective library size case but where
full
>> library size is used, would it still be valid to try and minimize
>> changes between conditions?
>>
>> Best,
>> -Ying
>>
>> On 11/05/13 18:18, Rory Stark wrote:
>>> Hi Guiseppe-
>>>
>>> You can retrieve the complete matrix of read counts from DiffBind,
>> either
>>> normalized or not, using dba.peakset with bRetrieve=TRUE. To can
set
>> the
>>> score to use via dba.count with peaks=NULL and
score=DBA_SCORE_READS,
>> or
>>> any of the other possible score values. The default score is
>>> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
>>> after subtracting the reads in the control, and using the full
library
>>> size (not just the reads in peaks) as a scalar.
>>>
>>> Cheers-
>>> Rory
>>>
>>
>>
Hi,
It is true that TMM and similar methods are based on the assumption
that
most binding levels are similar across samples, and this assumption
may
not be true. However, Using raw library sizes (i.e. sum of all counts
in
a sample) also imposes an assumption that is at least as bad, which is
that the total amount of binding in each sample is constant. In other
words, using raw library size is equivalent to assuming that there is
a
fixed amount of binding that is allocated differently across the
genome
in each sample. In my opinion, if I had to choose one of those two
assumptions, I would almost always choose the former and go with TMM.
-Ryan
On 11/6/13, 11:16 AM, Rory Stark wrote:
> Hi Ying-
>
> We actually just changed the default normalization from effective to
full
> library size in the most recent release. The reason is that while
> effective is frequently a better choice, it is based on the
assumption
> that overall binding levels in all the samples is similar. When this
> assumption is incorrect, it can result in substantially incorrect
results;
> using full library size when effective applies results in less
> catastrophically wrong answers.
>
> I will definitely be changing normalization to use effective sizes
when it
> is the right thing to do, but I have become aware that many (most?)
> DiffBind users don't change the defaults, so we determined that a
more
> conservative default was preferable.
>
> I'm not sure what you're asking regarding "try and minimize changes
> between conditions" in this context?
>
> Cheers-
> Rory
>
> On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote:
>
>> Hi Rory,
>> Could you give some insight into why TMM is used with full library
size,
>> it seems to make sense for effective library size case but where
full
>> library size is used, would it still be valid to try and minimize
>> changes between conditions?
>>
>> Best,
>> -Ying
>>
>> On 11/05/13 18:18, Rory Stark wrote:
>>> Hi Guiseppe-
>>>
>>> You can retrieve the complete matrix of read counts from DiffBind,
>> either
>>> normalized or not, using dba.peakset with bRetrieve=TRUE. To can
set
>> the
>>> score to use via dba.count with peaks=NULL and
score=DBA_SCORE_READS,
>> or
>>> any of the other possible score values. The default score is
>>> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM
method,
>>> after subtracting the reads in the control, and using the full
library
>>> size (not just the reads in peaks) as a scalar.
>>>
>>> Cheers-
>>> Rory
>>>
>>
>>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Guiseppe-
I'm not sure why you want to downsample? The normalization is supposed
to
take care of differences in read depth and distribution of read
densities
amongst peaks.
edgeR/DESeq/DESeq2 do calculate overall scaling factors for each
library
as part of the normalization computation, so it may be useful to
retrieve
those to see how each library is being weighted. This would be better
than
basically reverse-enginering it by calculating the ratios between the
original values and the normalized ones.
Cheers-
R
>Date: Wed, 06 Nov 2013 10:48:56 +0000
>From: Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk="">
>Thanks a lot Rory. Do you think it would then make sense to use the
>normalised counts in the peaks to build a ratio based on the raw
count
>and then feed this ratio to, say, picard to get a downsampled bam
from
>the original bam?
>
>Best
>Giuseppe
Hi Rory
I suppose the analysis I'd like to carry out is conceptually a simpler
one than those I was looking at using Diffbind. I'd like to look at
the
existence-vs-full depletion of peaks at each interval across samples -
as opposed to continuous quantitative differences in the peak signal
at
each interval across samples.
Given 10 samples, if I take the consensus peakset generated by
DiffBind
with, say minOverlap=8, I'd like to look at each of those peak
positions
and find sample pairs where a peak is there for sample_1 but not there
for sample_2. Then a question could be: are there genetic differences
in
the motif under this peak which might cause the complete depletion?
If I had technical replicates, I image I'd be able to use diffbind as
follows
-set contrast: (sample_1r1, ..., sample_1ri) VS (sample_2r1, ...,
sample_2ri)
-dba_analyse (sample1 vs sample2)
-follow up peaks sites with complete depletion VS signal
However I do not have technical replicates and if I understand
correctly
I cannot use diffbind/edgeR in this case.
So my idea is to simple select peak positions with large overlap
across
sample after bam subsampling to a minimum common value, and then look
at
(signal/nosignal) pairs for those peak positions manually.
Would you recommend to use the overall scaling factors for each
library
in this case?
Cheers
G
On 11/06/13 11:32, Rory Stark wrote:
> Hi Guiseppe-
>
> I'm not sure why you want to downsample? The normalization is
supposed to
> take care of differences in read depth and distribution of read
densities
> amongst peaks.
>
> edgeR/DESeq/DESeq2 do calculate overall scaling factors for each
library
> as part of the normalization computation, so it may be useful to
retrieve
> those to see how each library is being weighted. This would be
better than
> basically reverse-enginering it by calculating the ratios between
the
> original values and the normalized ones.
>
> Cheers-
> R
>
>> Date: Wed, 06 Nov 2013 10:48:56 +0000
>> From: Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk="">>
>> Thanks a lot Rory. Do you think it would then make sense to use the
>> normalised counts in the peaks to build a ratio based on the raw
count
>> and then feed this ratio to, say, picard to get a downsampled bam
from
>> the original bam?
>>
>> Best
>> Giuseppe
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>