Dear all,
I am currently using the edgeR package for my research on 16S rRNA metabarcoding.
At the moment, I am focusing on TMM normalization, and I am quite confused by the way people use the calcNormFactors function.
The edgeR vignette says: "The normalization factors of all the libraries multiply to unity. A normalization factor below one indicates that a small number of high count genes are monopolizing the sequencing, causing the counts for other genes to be lower than would be usual given the library size. As a result, the library size will be scaled down, analogous to scaling the counts upwards in that library. Conversely, a factor above one scales up the library size, analogous to downscaling the counts."
From this section, I understand that the raw counts of each sample should be multiplied by the size factor.
On the other hand, I found an article that also uses the calcNormFactors function: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4625728/
In its "Normalization methods" section, it says: "Scaling factors were calculated using the calcNormFactors function in the package, and then rescaled gene counts were obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs"
From this article, I understand that the raw counts should be divided by the size factors.
Finally, this code from MetaLonDA seems to use yet another approach: https://github.com/aametwally/MetaLonDA/blob/master/R/Normalization.R
(lines 22 to 26)
What is the appropriate way to obtain normalized counts with TMM normalization in the edgeR package?
I tried the cpm function, but I am not interested in counts-per-million values; I would like to have the normalized counts themselves.
Best,
Pauline
James and Gordon, thank you very much for your answers.
Here is why I wanted the normalized counts in the first place: in my metabarcoding study, I would like to compare two kinds of results:
- Results from DE analysis, run with edgeR, with TMM normalization
- Results from beta diversity analysis (through PCoA, by calculating a distance matrix)
In my mind, it would be much more justifiable to use the same count data as input for both. This is why I wanted the normalized data: to be able to calculate a distance matrix from it.
What you need for the PCoA analysis is logCPM. That's what edgeR does when you ask for PCoA by way of the plotMDS() function -- it automatically computes logCPM from the counts using cpm() and applies PCoA to them. And note that the TMM library size normalization is utilized when the logCPM are computed, so these are complementary rather than exclusive things.
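To illustrate how the normalization factors enter the CPM values, here is a minimal sketch of the arithmetic in plain Python. It is not edgeR code (in edgeR this corresponds to calling cpm() after calcNormFactors()). The function names and the numbers are made up for illustration. The sketch also shows that scaling the library size by the factor and dividing the counts by the factor lead to the same value, which is why the vignette calls the two views analogous.

```python
def effective_library_sizes(lib_sizes, norm_factors):
    """Effective library size = raw library size * normalization factor."""
    return [size * factor for size, factor in zip(lib_sizes, norm_factors)]


def tmm_cpm(counts, lib_sizes, norm_factors):
    """Counts per million computed against the effective library sizes.

    counts is a list of rows (one per feature), each holding one value per sample.
    """
    eff = effective_library_sizes(lib_sizes, norm_factors)
    return [[c / e * 1e6 for c, e in zip(row, eff)] for row in counts]


# Made-up example: 2 features x 2 samples; the factors multiply to 1.
counts = [[10, 20], [90, 180]]
lib_sizes = [100, 200]
norm_factors = [0.8, 1.25]

cpm_values = tmm_cpm(counts, lib_sizes, norm_factors)

# Equivalent view: divide the counts by the factor first,
# then divide by the raw library size.
rescaled = [
    [c / f / n * 1e6 for c, f, n in zip(row, norm_factors, lib_sizes)]
    for row in counts
]
```

In this toy example the first sample has a factor below one, so its effective library size shrinks from 100 to 80 and its CPM values come out larger than they would with the raw library size.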
You can't input the same data values to both edgeR and PCoA, because edgeR works on counts and PCoA does not.
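To make the split concrete: the raw counts go to edgeR, while the ordination works on log-transformed values. Below is a rough Python sketch of that second path, under stated assumptions. The log-CPM formula is simplified: it adds a fixed prior count, whereas edgeR's cpm(log=TRUE) scales the prior count by library size, as far as I know. The plain Euclidean distance is also an assumption chosen for illustration; plotMDS() uses its own distance by default. All names and numbers here are hypothetical.

```python
import math


def log_cpm(counts, eff_lib_sizes, prior_count=2.0):
    """Simplified log2-CPM: a fixed prior count avoids taking log of zero."""
    return [
        [math.log2((c + prior_count) / (e + 2 * prior_count) * 1e6)
         for c, e in zip(row, eff_lib_sizes)]
        for row in counts
    ]


def sample_distances(values):
    """Euclidean distances between samples (the columns of values)."""
    n_samples = len(values[0])
    cols = [[row[j] for row in values] for j in range(n_samples)]
    return [[math.dist(a, b) for b in cols] for a in cols]


# Made-up example: 3 features x 2 samples, with effective library sizes
# (raw library size * normalization factor) already computed.
counts = [[10, 20], [90, 180], [0, 50]]
eff_lib_sizes = [80.0, 250.0]

logcpm = log_cpm(counts, eff_lib_sizes)
dist_matrix = sample_distances(logcpm)
```

The resulting distance matrix is what a PCoA routine would take as input; the DE analysis itself still starts from the untransformed counts.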