Question

Normalization factor in TMM method

Sara ▴ 20

@sara-9865

Germany

Hi all,

As many of you do, I apologize for asking what is probably a dumb question, and I appreciate any clarification in advance. As far as I know, calcNormFactors() produces two columns of information: lib.size and norm.factors. Multiplying these two columns together gives the effective library size. However, I don't understand how the normalization factor is calculated. Could you please explain it to me in a simple way, as I'm basically a biologist?

From what I have read, I understand that TMM_count = raw_counts / (lib.size * norm.factors). Could you please tell me what the difference is between TMM_count and FPKM values in terms of normalization by library size?

Thank you

TMM normalization edgeR • 4.0k views

updated 8.6 years ago by James W. MacDonald 68k • written 8.6 years ago by Sara ▴ 20

score 5 · Answer 1 · 2016-10-06

The basic idea is that we are trying to account for differences due to library size. Consider two samples, one with 10M reads and one with 20M reads. All else being equal, if you had a gene that was expressed at the same level in both samples, you would still expect twice as many reads in the second sample as in the first (because there are twice as many total reads).
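To make the 10M versus 20M example concrete, here is a minimal sketch in plain Python (not edgeR code). The gene counts are made-up numbers chosen for illustration, and `counts_per_million` is a hypothetical helper name.

```python
# Hypothetical example: one gene expressed at the same level in two samples
# that were sequenced to different depths (10M and 20M reads).
lib_sizes = {"sample_A": 10_000_000, "sample_B": 20_000_000}
gene_counts = {"sample_A": 500, "sample_B": 1000}  # made-up raw counts


def counts_per_million(count, lib_size):
    """Scale a raw count by the library size, expressed per million reads."""
    return count * 1_000_000 / lib_size


cpm = {s: counts_per_million(gene_counts[s], lib_sizes[s]) for s in lib_sizes}
print(cpm)  # both samples give 50.0 counts per million
```

The raw counts differ by a factor of two, but after dividing by library size the two samples agree, which is the whole point of library-size scaling.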

Dividing the counts by the library size accounts for these differences, but you can get 'compositional biases' where a set of mRNA transcripts in one sample is highly expressed and takes up a large share of the reads from a given lane. Because those transcripts take up so much of the sequencing capacity, the remaining mRNA transcripts may have lower counts simply because they were out-competed for reads. The TMM normalization accounts for that by ignoring (trimming) the genes with the most extreme values, including some of the really highly expressed ones, when it estimates the scaling factor, so when you adjust for library sizes you can arguably get a better adjustment.
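Here is a rough sketch of that idea in plain Python, under loud assumptions: it is a simplified stand-in, not what calcNormFactors() actually runs. It trims only on the log-ratios (M-values) and takes an unweighted mean, whereas edgeR also trims on overall abundance, weights the genes, and rescales the factors so that they multiply to 1 across samples. The function name `simple_tmm_factor` and the toy counts are hypothetical.

```python
import math


def simple_tmm_factor(sample, ref, trim=0.3):
    """Simplified TMM: unweighted trimmed mean of log2 ratios vs. a reference.

    Assumption: this trims only on M-values; edgeR's real method also trims
    on abundance and uses precision weights.
    """
    n_s, n_r = sum(sample), sum(ref)
    # M-value per gene: log2 ratio of library-size-scaled counts.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0  # genes with zero counts cannot give a log ratio
    )
    k = int(len(m_values) * trim)  # number of genes dropped from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Toy data: nine genes truly unchanged, one gene hogging reads in `sample`.
ref = [100] * 10            # library size 1000
sample = [50] * 9 + [550]   # library size 1000, but gene 10 dominates

factor = simple_tmm_factor(sample, ref)
effective_lib_size = sum(sample) * factor
print(factor, effective_lib_size)
```

In this toy case both library sizes are 1000, so plain library-size scaling would make the nine unchanged genes look halved in `sample`. The trimmed mean ignores the dominating gene, the factor comes out at about 0.5, and dividing by the effective library size (lib.size * norm.factors, about 500 here) puts those nine genes back on the same scale as the reference.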

FPKM goes one step further and also accounts for the length of the transcripts you are measuring. A longer transcript will usually have more reads, simply because it is longer. If you are trying to make comparisons between genes or transcripts within a sample, that might be something you care about. But in general you are looking for differences between the SAME genes in different samples, so the transcript length cancels out and doesn't matter.
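As an illustration of the length adjustment, here is a small plain-Python sketch using the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments). The gene names, lengths, and counts are invented for the example, and `fpkm` is a hypothetical helper.

```python
def fpkm(count, length_bp, lib_size):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1_000_000_000 / (length_bp * lib_size)


lib_size = 10_000_000  # hypothetical library size for one sample

# Two hypothetical genes in the SAME sample: the long one has four times
# the reads only because it is four times as long.
short_gene = fpkm(count=100, length_bp=1000, lib_size=lib_size)
long_gene = fpkm(count=400, length_bp=4000, lib_size=lib_size)
print(short_gene, long_gene)  # 10.0 10.0

# The SAME gene in two samples: the length appears in both values,
# so it cancels in the between-sample ratio.
sample_1 = fpkm(count=100, length_bp=1000, lib_size=10_000_000)
sample_2 = fpkm(count=400, length_bp=1000, lib_size=20_000_000)
print(sample_2 / sample_1)  # 2.0, same as the ratio of count / lib_size
```

The first comparison shows why length matters within a sample, and the second shows why it drops out when the same gene is compared across samples.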

Does that make sense?