The csaw package suggests using TMM normalization based on large (e.g. 10 kb) bins across the genome when ChIP-seq samples are expected to show rather global differences, i.e. when composition bias is expected. As I want to use the resulting normalization factors to scale non-standard files (= non-DGEList objects, such as bedGraph/bigWig files with raw counts for every base of the genome), I am asking for clarification whether my understanding is correct:
One creates a count matrix for the 10 kb bins across the genome, then feeds this into calcNormFactors() and obtains normalization factors. Based on the calculateCPM() and cpm() source code, I think one now uses these factors to correct the library size for each sample, therefore library.size * norm.factor, and this divided (edit: divided, not multiplied, as Aaron explains /edit) by 1e+06 to get a per-million scaling factor.
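As a minimal sketch of how I picture this (assuming BAM files as input; `bam.files` is a hypothetical vector of file paths, and I am using csaw's windowCounts() with bin = TRUE for the binned counting):

```r
library(csaw)
library(edgeR)

## Hypothetical input: paths to the ChIP-seq BAM files.
bam.files <- c("sample1.bam", "sample2.bam")

## Count reads into contiguous 10 kb bins across the genome.
binned <- windowCounts(bam.files, bin = TRUE, width = 10000)

## TMM normalization factors computed from the binned counts.
y <- asDGEList(binned)
y <- calcNormFactors(y)

## Effective library size and per-million scaling factor per sample.
eff.lib <- y$samples$lib.size * y$samples$norm.factors
per.million.factor <- eff.lib / 1e6
```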
Eventually one would divide the "raw" counts by this per-million factor. In my case that would be the bigWig/bedGraph files, which are simply a four-column format with chr-start-end and $4 being the raw count for every base in the genome of a given sample, therefore $4 / per.million.factor.
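Concretely, for one sample this could look as follows (a sketch assuming rtracklayer; the file name and the indexing into `per.million.factor` are placeholders):

```r
library(rtracklayer)

## Hypothetical per-base bedGraph for the first sample.
gr <- import("sample1.bedGraph", format = "bedGraph")

## Divide the raw per-base counts (the fourth column, i.e. the score)
## by that sample's per-million scaling factor.
gr$score <- gr$score / per.million.factor[1]

export(gr, "sample1.scaled.bedGraph", format = "bedGraph")
```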
Is that correct?