ChIPseq diffbind merging peaks
3
0
Entering edit mode
@veroniquestorme-12161
Last seen 7.6 years ago

Dear,

I am new to differential ChIPseq analysis. One condition has 6 biological replicates and the other 3. I did not perform the experiment, I just perform the analysis. I started with loading the data:

prol.10 = dba(sampleSheet="macs10_30.csv")

This resulted in a total of 7602 peaks with 47 present in at least 2 samples. This indicates that the experiment was not very reproducible. I further looked into peaks in common between biological replicates and the result was very disappointing:

dba.overlap(prol.10, prol.10$masks$elo, mode=DBA_OLAP_RATE)

resulted in 1408    0    0

and 
dba.overlap(prol.10, prol.10$masks$pro, mode=DBA_OLAP_RATE)

6206   35    2    0    0    0

My question is now: how is it possible that I still find 16 peaks to be differentially bound out of the 45 after peak merging when no peaks were found in all biological replicates? I read in another post that dba.count re-counts the overlapping reads for every consensus peak for every sample, whether or not that peak was identified in that sample. When I extract the reads, I find for 28 peaks reads higher than 5, while 0 overlapping peaks were found. I would expect reads in one biological replicate and reads below 5 for the other replicates. This is my code:

prol.10 = dba.count(prol, summits=250)

contrast.10 = dba.contrast(prol.10, categories=DBA_CONDITION)
de.10 = dba.analyze(contrast.10)

reads.10 = dba.count(prol.10, peaks=NULL, score=DBA_SCORE_READS)
bindingMatrix.10 = dba.peakset(reads.10, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

counts.10 = bindingMatrix.10[,4:ncol(bindingMatrix.10)]

Should I change the summits option?

I would very much value your input,

Thanks, Veronique

 

 

diffbind merging_peaks • 2.2k views
ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 4 weeks ago
Cambridge, UK

Hello Veronique-

The peak calling step (in this case via MACS) can be imprecise, especially in “borderline” peaks. As you use a hard threshold, some regions may just miss being identified as peaks.

You are correct that DiffBind re-counts overlapping reads for all consensus peaks in all samples. I don’t think that the summits parameter is causing an issue here, as it is serving to re-center and normalise the width of the enriched intervals. It is certainly possible that you identify 16 of the 47 consensus peaks as being significantly differential bound if the variance between the biological replicates at these loci is low.

It could be useful to look at the data in a report that includes the normalized read counts:

> report.10 <- dba.report(de.10, bCounts=TRUE)

This will give you the mean normalised read count for each of the two sample groups, their fold change, and the normalized read counts for each sample. If the read concentrations are low, and/or the absolute values of the fold changes are low, you may be detecting very small changes, or changes from no binding to very low (but consistent) binding. You can see similar data graphically using dba.plotMA().

If you like, I can take a look at the data to see if anything looks unusual. You can email me the DBA objects prol.10 before and after calling dba.count(). If the objects are more than a few MB, you can send me a link on Dropbox or something similar.

Cheers-
Rory
 

ADD COMMENT
0
Entering edit mode

Dear Rory,

I emailed the DBA objects to you last friday. I hope I used the correct email adress. Thank you for your time. 

Veronique

ADD REPLY
0
Entering edit mode

Hi Veronique-

No, I didn't get any email.

rory.stark @ cruk.cam.ac.uk

 

ADD REPLY
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 4 weeks ago
Cambridge, UK

Hi Véronique-

I had a look at the data you sent me. As you say, there is very low agreement on the peak calling:

> dba.overlap(prol.10,mode=DBA_OLAP_RATE)
[1] 7602   47    2    0    0    0    0    0    0

As you note, this is suggestive of a non-reproducible experiment.

I think I may be able to shed some light on the discrepancies you are seeing when extracting normalized and non-normalized read counts form the binding matrix. The big difference I see is due to the reads in the Control tracks. When you use score=DBA_SCORE_READS, you are getting the counts in the ChIP track only. However, by default, dba.analyze() subtracts the counts in the control tracks before normalizing. To see the count values that are actually being used, try:

> reads.10 <- dba.count(prol.10, peaks=NULL, score=DBA_SCORE_READS_MINUS)
> bindingMatrix.10 <- dba.peakset(reads.10, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

You will notice that many of the values are negative, indicating that there are more reads in the Control track than there are in the ChIP track. When doing the analysis, DiffBind will set these values to 1 before normalizing, resulting in the very low normalized count values you are seeing. Note that having large signals (or even just much deeper coverage) in the Control tracks can throw off the peak callers, so this may also account for the low level of agreement in peaksets.

You can work with just the read counts (without subtracting the Control reads) by setting bSubControl=FALSE in the call to dba.analyze():

> de.10 <- dba.analyze(contrast.10,bSubControl=FALSE)

If you do this, you should see greater agreement in the read count values.

-Rory

ADD COMMENT
0
Entering edit mode
@veroniquestorme-12161
Last seen 7.6 years ago

Dear Rory,

Thanks a lot for your time and effort, this was really helpful,

Veronique

ADD COMMENT

Login before adding your answer.

Traffic: 661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6