Question

Clarification on Normalization Approach in DiffBind's db.analyze()

0

Entering edit mode

siklenkak • 0

@siklenkak-12153

Last seen 7.6 years ago

Hi to all,

Simple question for DiffBind that isn't clear to me despite a lot of searching and (over) thinking.

How exactly is the count data treated when one invokes db.analyze(method=DESeq2) using default parameters in current release?

My understanding is that some years ago the devs set bFullLibrarySize=TRUE as default to guard against violation of the assumption that the majority of enrichment regions were similar between groups. This has not been updated in the technical notes section 7.3 of the user guide. Therefore, currently, does db.analyze(method=DESeq2) only scale on depth of library, or is there a subsequent TMM normalization on these scaled counts?

Thank you for any responses!

Bonus question:
I realize there's no hard rules and you would need data to make a real suggestion, but is simply scaling counts by library-size be considered sufficient normalization or would you recommend TMM / RPKM required in addition (or instead of)?

All my work in DiffBind uses the default scoring system DBA_SCORE_TMM_MINUS_FULL.

diffbind chip-seq normalization deseq2 tmm • 2.4k views

ADD COMMENT • link updated 8.3 years ago by Rory Stark ★ 5.2k • written 8.3 years ago by siklenkak • 0

score 0 · Answer 1 · 2017-01-17

The TMM normalization only applies when an analysis is done using edgeR [dba.analyze(method=DBA_EDGER)]. You can use it as a score for non-analyzed data (e.g. plotting the full binding matrix), but the data are re-normalized using the method built-in to each analysis package (edgeR, DESeq2, or DESeq). For DESeq2, size factors for each library are computed using the "mean ratio method" (Anders and Huber 2010).

Thanks for catching the inconsistency in the DiffBind vignette, which still assumes that the default for dba.analyze() is bFullLibrarySize=FALSE, in which case the size factors are computed using only the read counts that overlap binding sites. When bFullLibrarySize=TRUE (the current default), the size factors are set based on the relative library sizes:

DESeq2::sizeFactors(DBAanalysis) <- libsizes/mean(libsizes)

We are currently looking at offering more flexible options for normalizing data prior to statistical analysis, as this often has a larger impact on the result that the choice of how to compute the dispersion estimations of the negative binomial.

Cheers-

Rory

score 0 · Answer 2 · 2017-01-18

The vignette is correct when bFullLibrarySize=TRUE. DESeq2::estimateSizeFactors() is only called when bFullLibrarySize=FALSE. This is because there isn't a way to tell DESeq2 about the reads in the library that did not result in counts against peaks. In the code above, libsizes contains the total number of mapped reads in the library, not just the total of the "counted" reads (ones that overlap consensus peaks). This reflects a difference in ChIP-seq and RNA-seq: in ChIP-seq, there are more reasons why you might get a relatively small proportion of reads overlapping peaks, and it can be important to take this into account when normalizing.

Bottom line: If you want to make sure that DESeq2::estimateSizeFactors() is invoked, set bFullLibrarySize=FALSE.

-R