Hi,
I am working with RNA-seq data using edgeR (steps below). While discussing some preliminary data analysis and observations, I came across a question about gene length: does edgeR's trimmed mean of M-values (TMM) normalization account for gene length as well as sequencing depth and RNA composition?
While I was exploring more about this, I came across a couple of resources (links below):
- EdgeR trimmed mean of M values (TMM): accounts for sequencing depth, RNA composition, and gene length.
- A scaling normalization method for differential expression analysis of RNA-seq data: it states that gene length is generally absorbed into a certain parameter and does not get used in the inference procedure. The focus of the TMM method is on estimating the relative RNA production of two samples, essentially a global fold change, by equating the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. That said, gene length biases are acknowledged as significant in gene expression analysis.
Sample metadata
#> Samples Ind Event Treatment
#> 1 S1 I1 5m Untreated
#> 2 S2 I1 9m Treated
#> 3 S3 I2 5m Untreated
#> 4 S4 I2 9m Treated
#> 5 S5 I3 5m Untreated
#> 6 S6 I3 9m Treated
EdgeR Analysis
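library(edgeR)
# Treatment group for each sample, taken from the sample metadata
group.Treatment <- factor(Sample_metadata$Treatment)
y <- DGEList(counts = gene_counts, group = group.Treatment, remove.zeros = TRUE)
# Drop lowly expressed genes, then recompute the library sizes
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
# Compute TMM normalization factors
y <- calcNormFactors(y, method = "TMM")
# log2 counts per million, using the normalized library sizes
logCPM <- cpm(y, prior.count = 1, log = TRUE)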
I then use the logCPM values for downstream analysis, such as calculating fold changes per individual, plotting, and more.
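As a rough illustration of the per-individual fold changes mentioned above, here is a minimal base R sketch. The logCPM values and the names `toy_logCPM` and `logFC_per_ind` are made up for illustration; in the real analysis the matrix would come from `cpm()` as in the steps above. It only subtracts log2 values within each individual and is not a statistical test.

```r
# Sample metadata mirroring the table above (Event column omitted)
sample_metadata <- data.frame(
  Samples = paste0("S", 1:6),
  Ind = rep(c("I1", "I2", "I3"), each = 2),
  Treatment = rep(c("Untreated", "Treated"), times = 3),
  stringsAsFactors = FALSE
)

# Hypothetical logCPM matrix: 2 genes x 6 samples (values are invented)
toy_logCPM <- matrix(
  c(5, 3,  6, 3,  5.5, 3,  6.5, 2,  4, 3,  5, 3.5),
  nrow = 2,
  dimnames = list(c("gene1", "gene2"), sample_metadata$Samples)
)

# Split the samples by treatment and align them by individual
treated <- sample_metadata[sample_metadata$Treatment == "Treated", ]
untreated <- sample_metadata[sample_metadata$Treatment == "Untreated", ]
untreated <- untreated[match(treated$Ind, untreated$Ind), ]

# A difference of log2 values is a log2 fold change (Treated vs Untreated)
logFC_per_ind <- toy_logCPM[, treated$Samples, drop = FALSE] -
  toy_logCPM[, untreated$Samples, drop = FALSE]
colnames(logFC_per_ind) <- treated$Ind

print(logFC_per_ind)
```

Each column of `logFC_per_ind` is one individual, so the paired structure of the design (each individual sampled untreated and treated) is kept explicit.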
Thank you,
Sabiha
Gordon Smyth thank you.
Additionally, I was also reading the below article and learnt,
A scaling normalization method for differential expression analysis of RNA-seq data: I infer that gene length is generally absorbed into a certain parameter and does not get used in the inference procedure. The focus of the TMM method is on estimating the relative RNA production of two samples, essentially a global fold change, by equating the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. That said, gene length biases are acknowledged as significant in gene expression analysis.
https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25
Is this a different TMM approach? The "Sampling framework" section of the paper does discuss gene length.
I already gave you a complete answer as it affects edgeR. The paper that you quote (Robinson & Oshlack, 2010) agrees with what I told you in every respect.
TMM does not adjust for gene length, nor does it need to. The gene lengths do not enter into the TMM calculation. They are not relevant for what TMM is trying to achieve.
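A simplified base R sketch may help show why gene length drops out. It is not edgeR's implementation: the real method also trims on average abundance, uses precision weights, and rescales the factors across samples, whereas this toy version takes an unweighted trimmed mean of the log ratios. The gene lengths, expression levels, and trimming fraction below are all invented for illustration. Lengths are used only to build the toy counts; the normalization step itself sees nothing but counts and library sizes.

```r
# Toy data: expected count = expression x gene length x sequencing depth.
# All numbers are hypothetical.
gene_length <- c(geneA = 1000, geneB = 5000, geneC = 2000, geneD = 2000, geneE = 1500)
expr_ref    <- c(geneA = 0.10, geneB = 0.10, geneC = 0.20, geneD = 0.30, geneE = 0.05)
fold_change <- c(geneA = 1, geneB = 1, geneC = 1, geneD = 1, geneE = 20)
depth       <- c(ref = 1, obs = 2)  # "obs" is sequenced twice as deeply

counts <- cbind(
  ref = expr_ref * gene_length * depth[["ref"]],
  obs = expr_ref * fold_change * gene_length * depth[["obs"]]
)
lib_size <- colSums(counts)

# Gene-wise log2 ratios of library-size-scaled counts (M-values).
# Gene length appears in numerator and denominator, so it cancels.
m <- log2((counts[, "obs"] / lib_size[["obs"]]) /
          (counts[, "ref"] / lib_size[["ref"]]))

# Simplified trimming: drop the most extreme M-values at each end,
# then take an unweighted mean of the rest.
trim_frac <- 0.2
n_trim <- floor(length(m) * trim_frac)
ord <- order(m)
kept <- ord[(n_trim + 1):(length(m) - n_trim)]
norm_factor <- 2^mean(m[kept])

# Effective library size ratio after applying the scaling factor
eff_ratio <- lib_size[["obs"]] * norm_factor / lib_size[["ref"]]

print(round(m, 3))
print(norm_factor)
print(eff_ratio)
```

In this toy example geneA and geneB differ fivefold in length but have identical M-values, and the effective library size ratio recovers the twofold difference in depth despite the strongly up-regulated geneE. Gene length would matter only when comparing different genes within the same sample, which is not what the scaling factor is for.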
If you want more detailed explanations for statements made in the paper that you quote, it would be best to write to the authors of that paper. I can only answer questions about how to conduct analyses in edgeR or about the edgeR documentation.
Gordon Smyth thanks. Thank you for your response and for clarifying the role of TMM in edgeR. I truly appreciate the time and effort you took to explain this.
I should clarify that my use of edgeR often varies depending on the specific requirements. Basically, I import the raw counts into the edgeR package. While I sometimes use edgeR for differential expression analysis, there are instances where I only use it to extract logCPM values (steps above). I then incorporate these logCPM values into other tools like limma for comparative analysis or for other downstream applications.