Simple question for DiffBind that isn't clear to me despite a lot of searching and (over) thinking.
How exactly is the count data treated when one invokes db.analyze(method=DESeq2) using default parameters in current release?
My understanding is that some years ago the devs set bFullLibrarySize=TRUE as default to guard against violation of the assumption that the majority of enrichment regions were similar between groups. This has not been updated in the technical notes section 7.3 of the user guide. Therefore, currently, does db.analyze(method=DESeq2) only scale on depth of library, or is there a subsequent TMM normalization on these scaled counts?
Thank you for any responses!
Bonus question:
I realize there's no hard rules and you would need data to make a real suggestion, but is simply scaling counts by library-size be considered sufficient normalization or would you recommend TMM / RPKM required in addition (or instead of)?
All my work in DiffBind uses the default scoring system DBA_SCORE_TMM_MINUS_FULL.
The TMM normalization only applies when an analysis is done using edgeR [dba.analyze(method=DBA_EDGER)]. You can use it as a score for non-analyzed data (e.g. plotting the full binding matrix), but the data are re-normalized using the method built-in to each analysis package (edgeR, DESeq2, or DESeq). For DESeq2, size factors for each library are computed using the "mean ratio method" (Anders and Huber 2010).
Thanks for catching the inconsistency in the DiffBind vignette, which still assumes that the default for dba.analyze() is bFullLibrarySize=FALSE, in which case the size factors are computed using only the read counts that overlap binding sites. When bFullLibrarySize=TRUE (the current default), the size factors are set based on the relative library sizes:
We are currently looking at offering more flexible options for normalizing data prior to statistical analysis, as this often has a larger impact on the result that the choice of how to compute the dispersion estimations of the negative binomial.
Thanks for the quick response; however, I will follow up with a question about that.
The Anders and Huber 2010 DEseq paper describes the median ratio method, also implemented in DESeq2. As you know, this function is described as DEseq2::estimateSizeFactors(), but from what I can tell takes DESeq2::sizeFactors(DBAanalysis) <- libsizes/mean(libsizes) a step further. This estimateSizeFactors() is recommended by Anders instead of what's verbally described in the DiffBind vignette that the sizeFactors are calculated by libsizeSample / min(libsizes).
Can you confirm the vignette is wrong and the suggested estimateSizeFactors is performed when dba.analyze(bFullLibrarySize=TRUE). That or clarify that it calls DESeq2::sizeFactors(DBAanalysis) <- libsizes/mean(libsizes) that you described in this thread?
The vignette is correct when bFullLibrarySize=TRUE. DESeq2::estimateSizeFactors() is only called when bFullLibrarySize=FALSE. This is because there isn't a way to tell DESeq2 about the reads in the library that did not result in counts against peaks. In the code above, libsizes contains the total number of mapped reads in the library, not just the total of the "counted" reads (ones that overlap consensus peaks). This reflects a difference in ChIP-seq and RNA-seq: in ChIP-seq, there are more reasons why you might get a relatively small proportion of reads overlapping peaks, and it can be important to take this into account when normalizing.
Bottom line: If you want to make sure that DESeq2::estimateSizeFactors() is invoked, set bFullLibrarySize=FALSE.
Hi Rory,
Thanks for the quick response; however, I will follow up with a question about that.
The Anders and Huber 2010 DEseq paper describes the median ratio method, also implemented in DESeq2. As you know, this function is described as
DEseq2::estimateSizeFactors()
, but from what I can tell takesDESeq2::sizeFactors(DBAanalysis) <- libsizes/mean(libsizes)
a step further. ThisestimateSizeFactors()
is recommended by Anders instead of what's verbally described in the DiffBind vignette that the sizeFactors are calculated bylibsizeSample / min(libsizes).
Can you confirm the vignette is wrong and the suggested estimateSizeFactors is performed when
dba.analyze(bFullLibrarySize=TRUE).
That or clarify that it callsDESeq2::sizeFactors(DBAanalysis) <- libsizes/mean(libsizes)
that you described in this thread?Thanks for your help
Cheers,
Keith