Question

Correct use of TMM / normalization factors based on large bins

1

Entering edit mode

ATpoint ★ 4.7k

@atpoint-13662

Last seen 1 day ago

Germany

The csaw package suggests to use TMM normalization based on large, e.g. 10kb, bins across the genome if ChIP-seq samples are expected to show rather global differences / composition bias is expected. As I want to use the resulting normalization factors to scale non-standard (=non DGElist files, such as bedGraph/bigwig files with raw counts for every base of the genome) I am asking for clarification if my understanding is correct:

One creates a count matrix for the 10kb bins across the genomes, then feeds this into calcNormFactors() and obtains normalization factors. Based on the calculateCPM() and cpm() source code I think one now uses these factors to correct the library size for each sample, therefore library.size/norm.factor, and this ~~multiplied~~ (edit) divided, as Aaron explains) (/edit) by 1e+06 to get a per-million scaling factor.

Eventually one would now divide the "raw" counts by this per-million factor. In my case that could be these bigwig/bedGraph files, which is simply a four-column format with chr-start-end and $4 being the raw counts for every base in the genome of a given sample, therefore $4 / per.million.factor.

Is that correct?

csaw TMM • 2.1k views

ADD COMMENT • link 5.5 years ago ATpoint ★ 4.7k

score 2 · Accepted Answer · 2019-08-28

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 17 hours ago

The city by the bay

So close. The effective library size is library.size * norm.factor. This means that you should divide your count/coverage/etc. value by library.size * norm.factor / 1e6 to get normalized per-million equivalents.

ADD COMMENT • link 5.5 years ago Aaron Lun ★ 28k