Question

edgeR normalization method

Sara ▴ 20

@sara-9865

Germany

Hi all experts,

I am a biology student who has started learning R and NGS analysis, and I have some basic questions, so please be patient with me. Regarding differential gene expression analysis of an RNA-seq experiment: as far as I have read, edgeR accepts raw counts and normalizes them with the TMM method. Is that right? However, I read a paper that used edgeR for differential expression analysis and calculated the gene fold change as log2(FPKM treatment / FPKM control). I am confused about why the authors said "FPKM". Could someone please explain where the FPKM comes from?

For statistical analysis, we need to ensure that all samples are comparable. If a box plot shows that the samples are not normally distributed and, in fact, one of the samples stands out from the rest, should we normalize these data before running the edgeR analysis?

Thank you in advance

edgeR differential expression normalization • 17k views

updated 8.7 years ago by Aaron Lun ★ 28k • written 8.7 years ago by Sara ▴ 20

score 6 · Answer 1 · 2016-03-08

FPKM = fragments per kilobase per million. To compute this, you divide the count by the exonic length of the gene (in kilobases) and by the library size (in millions of reads). This can be done using edgeR's rpkm function.
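To make the arithmetic concrete, here is a minimal sketch of the calculation described above. It is plain Python rather than edgeR code and is not how rpkm is implemented. The function name `fpkm` and all of the numbers are hypothetical and chosen only for illustration.

```python
def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of exonic length per million fragments."""
    length_kb = gene_length_bp / 1e3      # exonic length in kilobases
    lib_millions = library_size / 1e6     # library size in millions
    return count / (length_kb * lib_millions)


# Hypothetical gene: 500 fragments, 2,000 bp of exons,
# in a library of 10 million fragments.
value = fpkm(count=500, gene_length_bp=2000, library_size=10_000_000)
print(value)
```

With these made-up inputs the gene is 2 kb long and the library holds 10 million fragments, so the count is divided by 2 × 10 = 20.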

However, calculation of the FPKM is distinct from edgeR's normalization. In edgeR, the TMM method computes normalization factors that represent sample-specific biases. These factors are multiplied by the library size to yield the effective library size, i.e., the library size that we would have gotten if those biases were not present. The effective library sizes can then be used for various normalization purposes, most frequently as offsets in generalised linear models. Calculation of the FPKM is not essential to this process.

That said, if you wanted to compute FPKM values that incorporate information from TMM normalization, you would use the effective library size instead of the library size in the FPKM calculation. This is done automatically if you run a DGEList object through calcNormFactors and supply the resulting object to rpkm.
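As a rough illustration of the point above, the sketch below substitutes the effective library size (library size multiplied by the normalization factor) into the FPKM formula. It then forms the log2 ratio quoted in the question. It is plain Python, not edgeR code. The normalization factors are invented numbers rather than the output of calcNormFactors, and the helper names are hypothetical. Note that edgeR's own log-fold changes come from its fitted model, not from a ratio like this one.

```python
import math


def effective_library_size(library_size, norm_factor):
    """Library size scaled by a sample-specific normalization factor."""
    return library_size * norm_factor


def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of exonic length per million fragments."""
    return count / ((gene_length_bp / 1e3) * (library_size / 1e6))


# Hypothetical values for one gene in two samples.
lib_sizes = {"control": 10_000_000, "treatment": 12_000_000}
norm_factors = {"control": 1.25, "treatment": 0.8}   # assumed, not computed by TMM here
counts = {"control": 500, "treatment": 960}
gene_length_bp = 2000

# Effective library size per sample.
eff_sizes = {s: effective_library_size(lib_sizes[s], norm_factors[s])
             for s in lib_sizes}

# FPKM computed against the effective library size instead of the raw one.
fpkms = {s: fpkm(counts[s], gene_length_bp, eff_sizes[s]) for s in eff_sizes}

# The ratio from the question: log2(FPKM treatment / FPKM control).
log2_fc = math.log2(fpkms["treatment"] / fpkms["control"])
print(eff_sizes, fpkms, log2_fc)
```

Because the gene length is the same in both samples, it cancels in the ratio. Only the counts and the effective library sizes affect the log2 fold change.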

As for your final question: edgeR uses a negative binomial distribution, so lack of normality is not an issue. It's not exactly clear what you're making boxplots of: (normalized?) expression values across samples for each gene, or expression values across genes for each sample? I would be reluctant to define a sample as an outlier based on boxplots for a small number of genes. In any case, your options are to turn on robust=TRUE for estimateDisp or use estimateGLMRobustDisp to reduce the impact of outliers in a few genes, or to remove the offending outlier sample prior to the DE analysis if all genes are affected.