Normalization factor in TMM method
Sara ▴ 20
@sara-9865

Hi all,

Like many of you, I apologize for asking what is probably a basic question, and I appreciate any clarification in advance. As far as I know, calcNormFactors() produces two columns of information: lib.size and norm.factors. Multiplying these two columns together gives the effective library size. However, I don't understand how the normalization factor is calculated. Could you please explain it in a simple way, as I'm basically a biologist?

From what I have read, I understand that TMM_count = raw_counts / (libsize * norm.factor). Could you also tell me what the difference is between TMM_count and FPKM values in terms of normalization by library size?

Thank you

 

TMM normalization edgeR • 3.7k views
@james-w-macdonald-5106

The basic idea is that we are trying to account for differences due to library size. Consider two samples, one with 10M reads, and one with 20M reads. All else equal, if you had a gene that was expressed at the same level in both samples, you still expect twice as many reads in the second sample as compared to the first (because there are twice as many total reads).

Dividing the samples by the library size accounts for these differences, but you can get 'compositional biases' where there might be a set of mRNA transcripts in one sample that are highly expressed, and they hogged up a bunch of the space on a given lane. Since they took up so much space, the remaining mRNA transcripts may have lower counts just because they got out-competed for space. The TMM normalization accounts for that, by ignoring some of the really highly expressed genes, so when you adjust for library sizes you can arguably get a better adjustment.
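To make the compositional-bias idea concrete, here is a minimal Python sketch with made-up counts. It is a deliberately simplified, unweighted version of the trimmed-mean idea and not edgeR's actual implementation: as far as I know, calcNormFactors() also trims on overall abundance, applies precision weights, and rescales the factors so that they multiply to 1 across samples. The function name `simple_tmm_factor` and all the numbers are hypothetical.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Simplified, unweighted TMM-style factor for `sample` relative to `reference`."""
    lib_s, lib_r = sum(sample), sum(reference)
    # Per-gene log2 ratios of library-size-scaled counts (the "M-values").
    # Genes with a zero count in either sample are skipped.
    m_values = sorted(
        math.log2((s / lib_s) / (r / lib_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    # Trim the same fraction of genes from both ends, then average what is left.
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts for 10 genes; both libraries contain 1000 reads in total.
reference = [100] * 10
# In `sample`, the last gene hogs 550 reads, so the other nine drop to 50 each
# even though (by construction) their true expression is unchanged.
sample = [50] * 9 + [550]

factor = simple_tmm_factor(sample, reference)
effective_lib_size = sum(sample) * factor

# Scaling by the plain library size makes the nine unchanged genes look halved;
# scaling by the effective library size brings them back in line with the reference.
naive = sample[0] / sum(sample)
adjusted = sample[0] / effective_lib_size
expected = reference[0] / sum(reference)
```

In this toy case the factor comes out below 1, which shrinks the effective library size of the sample dominated by one gene. That matches the lib.size * norm.factors product described in the question.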

FPKM goes one step further, accounting for the length of the transcripts you are measuring. A longer transcript will usually have more reads, because it's longer. And if you were trying to make comparisons between genes or transcripts within a sample, that might be something you care about. But in general you are looking for differences between the SAME genes in different samples, so the transcript length doesn't matter.
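As a rough illustration of the extra length adjustment, the sketch below uses the usual definition of FPKM (fragments per kilobase of transcript per million mapped fragments). The helper name `fpkm` and the counts and lengths are made up for the example.

```python
def fpkm(count, gene_length_bp, library_size):
    """Count scaled per kilobase of transcript and per million mapped fragments."""
    return count / (gene_length_bp / 1e3) / (library_size / 1e6)

# Two hypothetical genes in the same 10M-read library.
# The long gene gets four times the reads only because it is four times longer.
short_gene = fpkm(count=500, gene_length_bp=1_000, library_size=10_000_000)
long_gene = fpkm(count=2_000, gene_length_bp=4_000, library_size=10_000_000)
```

Both genes end up with the same FPKM, which is what you want when comparing different genes within one sample. When comparing the same gene across samples, the length term is the same on both sides and drops out.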

Does that make sense?


Thank you very much, James. That's very helpful. Sorry for one more question: when you say "samples" in "Dividing the samples by the library size accounts" in paragraph 2, do you mean the mapped reads for each gene in a given library?

thanks


Exactly. You divide the counts for each gene by the library size (in millions, because you don't want to be dealing with normalized counts of like 0.000002 or whatever).
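A small Python sketch of that per-million scaling, reusing the 10M and 20M read example from earlier in the thread. The helper name `cpm` and the gene counts are hypothetical; the optional `norm_factor` argument is an assumption showing how the lib.size * norm.factors product from the question would slot in.

```python
def cpm(count, lib_size, norm_factor=1.0):
    """Counts per million, using the effective library size lib_size * norm_factor."""
    effective_lib_size = lib_size * norm_factor
    return count / (effective_lib_size / 1e6)

# One gene expressed at the same level in two samples of different depth.
sample_a = cpm(200, 10_000_000)   # 10M-read library
sample_b = cpm(400, 20_000_000)   # 20M-read library, so twice the raw count

# Without the "per million" scaling the value is an awkwardly small fraction.
raw_fraction = 200 / 10_000_000
```

Both samples give the same value per million even though the raw counts differ by a factor of two.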
 


Thanks a lot for your great help.
