Question

Handling duplicates in DiffBind

loretta • 0

Hi Rory,

Until now I have not been marking or removing duplicates in my ChIP-Seq datasets, and have instead been setting bRemoveDuplicates to TRUE in dba.count. Firstly, how much does this strategy harm the detection of differential peaks? And secondly, how does setting bUseSummarizeOverlaps to TRUE compare to bRemoveDuplicates?

Thanks,

diffbind dba.count bRemoveDuplicates bUseSummarizeOverlaps • 1.8k views

updated 5.7 years ago by Rory Stark ★ 5.2k • written 5.7 years ago by loretta • 0

score 0 · Answer 1 · 2019-03-12

If you want to remove duplicates, you need to mark them before running dba.count(), whatever bUseSummarizeOverlaps is set to. If duplicates are not marked, none will be identified, even if you set bRemoveDuplicates=TRUE.

However, for differential analysis, we strongly recommend not removing duplicates. In a well-prepared ChIP-seq experiment, most of the duplicate reads will be "true" duplicates indicating high levels of enrichment. The degree to which this is true will depend on how the sequencing is done (single-end vs paired-end, read length, number of reads). If you remove duplicates, you are clipping the signal, so you might be unable to detect, for example, a difference between one sample group where 30% of the DNA is bound at a particular interval and one where 90% of the DNA is bound. It also helps to use blacklists and greylists, as many problematic duplicates are located at the blacklisted intervals.
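To make the clipping effect concrete, here is a minimal Python sketch. It is a toy simulation, not DiffBind code. It assumes single-end reads whose start positions fall uniformly within a hypothetical 200 bp interval, and it treats any reads sharing a start position as duplicates (ignoring strand). The read counts of 300 and 900 are made-up stand-ins for the 30% and 90% binding groups.

```python
import random

def count_reads(n_reads, interval_width, remove_duplicates, seed=0):
    """Count reads in one interval, optionally collapsing identical start positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(interval_width) for _ in range(n_reads)]
    if remove_duplicates:
        # At most `interval_width` distinct start positions can survive.
        return len(set(starts))
    return len(starts)

WIDTH = 200            # hypothetical interval width in bp
low, high = 300, 900   # hypothetical read counts for the two sample groups

# Without duplicate removal the 3-fold difference is preserved.
raw_ratio = count_reads(high, WIDTH, False) / count_reads(low, WIDTH, False)

# With duplicate removal both counts are capped by the interval width.
dedup_low = count_reads(low, WIDTH, True, seed=1)
dedup_high = count_reads(high, WIDTH, True, seed=2)
dedup_ratio = dedup_high / dedup_low

print("ratio keeping duplicates: ", raw_ratio)
print("ratio removing duplicates:", dedup_ratio)
```

Because no more than 200 distinct start positions exist in this toy interval, both deduplicated counts saturate near the same ceiling and the ratio between the groups shrinks well below the raw 3-fold difference.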

If your ChIP reads have a high proportion of duplicates (say, greater than 50%), there may be issues with the ChIP, leaving more artifactual duplicates, which you may be better off removing (after marking them in the BAM).
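One way to check both whether duplicates have been marked and how high the duplicate rate is would be to count reads carrying the SAM duplicate flag (bit 0x400). The sketch below is a plain-Python illustration that parses SAM-format text lines, such as those produced by `samtools view`. The function name `duplicate_fraction` and the example records are hypothetical, and excluding unmapped reads from the denominator is an assumption made here, not something the answer specifies.

```python
DUP_FLAG = 0x400      # SAM FLAG bit: PCR or optical duplicate
UNMAPPED_FLAG = 0x4   # SAM FLAG bit: read is unmapped

def duplicate_fraction(sam_lines):
    """Return the fraction of mapped reads whose FLAG has the duplicate bit set."""
    mapped = 0
    duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & UNMAPPED_FLAG:
            continue  # assumption: unmapped reads are left out of the rate
        mapped += 1
        if flag & DUP_FLAG:
            duplicates += 1
    return duplicates / mapped if mapped else 0.0

# Hypothetical records: one header, three mapped reads (one marked), one unmapped.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t1024\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t300\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
frac = duplicate_fraction(example)
print(f"duplicate fraction: {frac:.2f}")
```

A fraction of exactly zero on a real dataset more likely means that duplicates were never marked than that none exist, while a value above roughly 0.5 corresponds to the high-duplicate case described above.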