Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
0
Entering edit mode
rbronste ▴ 60
@rbronste-12189
Last seen 5.0 years ago

So Im hoping to get some opinions on what people generally set as the rowSums minimum as follows:

dds <- dds[ rowSums(counts(dds)) > 10, ]

I obtain the count matrices from ChIP-seq and ATAC-seq experiments through DiffBind and then use DESeq2 for the differential binding analysis. I guess its a bit of an arbitrary cutoff, but wondering if there is any additional experimental information people employ to set this boundary? Thanks! 

deseq2 atac-seq chipseq differential binding analysis • 1.4k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

I don't analyze much ChIP-seq so I'll let others weigh in on that. I don't think you will be losing many informative features with this filter, e.g. if you have 3 vs 3 samples that would be 10 reads distributed over 6 samples, so the best looking feature might be [3 3 4] vs [0 0 0]. It's hard to imagine even with incredibly minimal technical variability (e.g. UMI dedup) that this feature would survive multiple test correction. So it seems a "safe" filter in terms of false negatives.

ADD COMMENT
0
Entering edit mode

Thanks for the info. Another question I had was how to do it on a cell by cell basis, limiting it to > 10 in all cells of each triplet in the matrix.

ADD REPLY
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

I'll chip in on the ChIP-seq part of it (ho, ho, ho).

I use the background level of enrichment to guide the filter threshold for the average count. There's a number of ways to do this, but the easiest is to cut your genome into large bins, count the number of reads in each bin, calculate the median of the average counts across all bins and then scale that average count based on the ratio of the bin width to your median peak width. This gives you a reasonable indication of what average count you might expect in the absence of binding, assuming most genomic locations are not bound.

There are some more sophisticated strategies using local or control-based background estimates, but I find that a global estimate works pretty well. You can increase the stringency of the filter by asking for all regions that are some X-fold change above the expected background coverage, where a "good" value of X might be 3-5, depending on the strength of enrichment for your protein target (e.g., higher for TFs, lower for histone marks).

Honestly, though, any half-decent peak caller should not be calling peaks in regions that have very low coverage, so I would be surprised if there is any filtering that needs to be done at this stage. In other words, the peak-caller should have implicitly filtered for you (though there are difficult challenges with making this implicit filtering compatible with differential analysis tools like DESeq(2) and edgeR).

ADD COMMENT

Login before adding your answer.

Traffic: 591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6