How to set the parameter "fragmentSize" in DiffBind and set parameter for paired-end CHIP-seq data?
4
0
Entering edit mode
@xieyongjun93-13638
Last seen 7.0 years ago

(1) If I have two chip samples and two input samples, and their fragmentsize are chip_1,chip_2,input_1,input_1,respectively, should I set the parameter “fragmentSize” as    config = data.frame(fragmentSize = c(chip_1 , chip_2 , input_1 , input_2)) or config = data.frame(fragmentSize = c(chip_1 , input_1 , chip_2 , input_2) ? The fragmentsize varies among the four samples, so I think it is not reasonable to set it as a constant.

 (2) There is always a warning:     

      In if (DBA$config$parallelPackage == DBA_PARALLEL_MULTICORE) { :

                          the condition has length > 1 and only the first element will be used

     Does it affect my outcomes?

 (3) My CHIP-seq data are paired-end and I set the parameter as follows:

            config = data.frame(singleEND = FLASE)

     but I always got an error similar to the above :

     In if (DBA$config$singleEND == FLASE) { :

                          the condition has length > 1 and only the first element will be used

     How to set the parameters when the data are paired-end?

diffbind software error • 2.0k views
ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 14 days ago
Cambridge, UK

For the fragment sizes, you are not setting the dba.read()parameter correctly. You should never set a value for config as a parameter.

You want to either set the parameter value directly:

> myDBA <- dba.read(myDBA,fragmentSize=c(chip_1 , chip_2 , input_1 , input_2))

or set this as the default value in the configuration before making the call to dba.count():

> myDBA$config$fragmentSize <- c(chip_1 , chip_2 , input_1 , input_2)
> myDBA <- dba.read(myDBA)

The warning about parallelization should go away if you either set the parameters directly or set it in the config portion of the DBA object as described.

Likewise, if you are using summarizeOverlaps to count PE data, you should set the configuration parameter prior to the call to dba.count():

> myDBA$config$singleEnd <- FALSE
> myDBA <- dba.read(myDBA, bUseSummarizeOverlaps=TRUE)

Cheers-

Rory

ADD COMMENT
0
Entering edit mode
@xieyongjun93-13638
Last seen 7.0 years ago

OK, thank you!

    But I have an another question about the methods used in DiffBind. When I use the method 'DEseq2' and 'edgeR' to analyze my data, I can get a much different result. The number of differential binding sites calculated by DEseq2 is about at least twice of The number of differential binding sites calculated by egdeR. Why there are so big differences between DEseq2 and egdeR? Whice one should I choose? I have two or three replicates for each exprement.

 

ADD COMMENT
0
Entering edit mode
Gord Brown ▴ 670
@gord-brown-5664
Last seen 4.0 years ago
United Kingdom

Hi,

EdgeR and DESeq2 use the same underlying statistical model, but they normalise quite differently, which accounts for most of the difference we see (and yes, we also see that they give very different answers in many cases).  if you look at the MA plots via dba.plotMA, you can usually see which one seems more believable.  My personal preference is DESeq2.

ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 14 days ago
Cambridge, UK

I agree with Gord's answer, but wanted to elaborate a bit.

The difference is most likely the normalization step. edgeR uses the TMM method which relies on a core of sites that don't systematically change their binding affinities. While this assumption of usually true for RNA-seq, there are many ChIP-seq experiments where one sample group has a completely different binding profile than the other sample group. Often, in one condition there is very little binding, but in the other condition much additional binding has been induced. For example, when we ChIP for the estrogen receptor ER, we sometimes use a control where the cells have been deprived of estrogen, the ligand required for binding. Comparing any of these samples to ones that include estrogen will violate the assumption behind TMM. As a result, the edgeR analysis will "over-normalize", driving the code of fold changes to zero. As a result, fewer of the sites appear to be differentially bound -- and in many cases, the sign of the fold change will even be flipped.

You can see if this is what is going on in your data by following Gord's suggestion of looking at the MA plots. Specifically, look at three versions of the MA plots:

> par(mfrow=c(2,2))
> dba.plotMA(myDBA,bNormalized=FALSE)
> dba.plotMA(myDBA,method=DBA_DESEQ2)
> dba.plotMA(myDBA,method=DBA_EDGER)

If this is a normalization issue, you should see that in the first, non-normalized plot, most of the fold changes are either above or below the line. Often the Y axis is asymmetric as well (extending to a higher magnitude in one direction). In the second (DESeq2) plot, there may be a small shift toward the zero line, but the one-sided trend is maintained. 

In the third (edgeR) plot, if the points are more centered around the zero line, with a more equal balance of sites with positive and negative fold changes, this shows the data have been "over-normalized" (really, just improperly normalized). You should not trust the edgeR analysis in this case. This is the reason we changed the default analysis method in DiffBind from edgeR to DESeq2.

Cheers-
Rory

ADD COMMENT

Login before adding your answer.

Traffic: 653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6