Hi all,
I wonder of the values in lrt$fitted.values are cpm normalized values?
des <- model.matrix(~KRAS)
fit <- glmFit(y2, des)
lrt <- edgeR::glmLRT(fit)
Hi all,
I wonder of the values in lrt$fitted.values are cpm normalized values?
des <- model.matrix(~KRAS)
fit <- glmFit(y2, des)
lrt <- edgeR::glmLRT(fit)
Let's look.
> example(glmLRT) <snip> glmLRT> d <- DGEList(y) <snip> glmLRT> # Likelihood ratio tests for trend glmLRT> results <- glmLRT(fit, coef=2) <snip> > head(results$fitted.values) x0 x1 x2 gene.1 9.618113 29.449462 69.696630 gene.2 4.787029 5.444920 4.787005 gene.3 8.071594 7.879561 5.945545 gene.4 5.807543 6.605920 5.807928 gene.5 6.194186 5.686449 4.035013 gene.6 3.528931 5.335160 6.234458 > head(d$counts) x0 x1 x2 gene.1 6 44 55 gene.2 5 5 5 gene.3 7 10 5 gene.4 8 2 8 gene.5 5 8 3 gene.6 5 2 8
Does that answer your question?
To clarify James' point; the fitted values are on the same scale as the raw counts themselves. This demonstrates that they are not CPM values. If you want to convert them to CPMs, you could do something like this:
cpm(results$fitted.values, lib.size=exp(getOffset(d)))
The getOffset
call allows us to handle non-unity normalization factors.
Dear Aaron,
Considering your reply, which of the following options do you think would be the correct way to obtain normalized counts?
fit<-glmFit(y,design)
lrt<-glmLRT(fit, coef=14:2)
norm_counts <- cpm(lrt$fitted.values, lib.size = exp(getOffset(y)))
norm_counts2 <- cpm(lrt$fitted.values, normalized.lib.sizes = TRUE)
norm_counts3<- cpm(y, lib.size = exp(getOffset(y)))
norm_counts4 <- cpm(y, normalized.lib.sizes = TRUE)
Thank you very much! Is it ok to use then this normalized counts to proceed with clustering or is it better to use the log2 of the counts (using log = TRUE in the cpm function)?
Best,
Andrew
Using cpm
is the recommended way to obtain normalized expression values for other (i.e., non-DB) analyses such as clustering and dimensionality reduction. Running it as you have done with norm_counts4
is the simplest approach, so you might as well go with that. In fact, normalized.lib.sizes=TRUE
is the default anyway, so you can even save yourself some typing by leaving that argument out.
And yes, it is usually better to perform a log-transformation, see my comments here:
A: Robust transformation of raw RNA-seq counts for exploratory data analysis and hi
Note that we generally don't use the term "normalized counts" in edgeR, as this implies that you can treat the values as counts (and use them in GLM fitting, etc.). This may not be appropriate as the scaling will distort the mean-variance relationship when the library sizes are very different, e.g., a "normalized count" of 1 that was derived from an original count of 100 is much less variable than the same value derived from an original count of 1.
The help page for glmLRT explains the meaning of all the output components.
For more mathematical detail, here's the paper that describes the methodology:
http://nar.oxfordjournals.org/content/40/10/4288
The quantities that are called fitted.values in the output are denoted by the Greek letter mu in the article. Generally speaking, these quantities are for internal edgeR use and you do not need to manipulate them yourself in a practical analysis.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks! So what are the fitted values then?
Dear Gordon. What is the best way to get out cpm values when I use lib.size from my libraries and norm.factors from my spike-ins. Please see pipeline below:
s=spike-ins
counts=miRNAs
s <- read.delim("exprs.txt", sep="\t", check.names=FALSE)
df.mdp <- read.delim("miRNAs_expressed_all_samples_ALL.csv", sep="\t", check.names=FALSE, stringsAsFactors=FALSE)
counts <- as.matrix(df.mdp[,grep("\\d\\d\\d", colnames(df.mdp))])
sids <- as.character(read.table("config_braf.csv", stringsAsFactors=FALSE)[,3])
colnames(counts) <- sids
colnames(counts.N) <- sids
s <- s[,sids]
group <- sids
group[grep("\\dR", sids)] <- "Normal"
group[grep("\\dG", sids)] <- "Tumor"
group <- factor(group)
patient <- factor(sapply(str_split(sids, 'R|G'), function(x) x[[1]]))
d <- edgeR::DGEList(counts = round(counts), group=group)
d.s <- edgeR::DGEList(counts = s)
d$samples$lib.size <- colSums(d$counts) # recalculate size factors
## add spike-ins
d$counts <- rbind(d$counts, d.s$counts)
## use lib sizes from mirna and norm factors from spike-ins
method <- "TMM" # c("TMM","RLE","upperquartile","none")
method <- "RLE"
y <- calcNormFactors(d, method=method)
y.s <- calcNormFactors(d.s, method=method)
plot(y$samples$lib.size, y.s$samples$lib.size)
y$samples$lib.size <- y.s$samples$lib.size
y2 <- estimateDisp(y, trend.method="loess")
des <- model.matrix(~patient + group)
fit <- glmFit(y2, des)
lrt <- edgeR::glmLRT(fit)
tab <- topTags(lrt, n=Inf)@.Data[[1]]
You've also posted this CPM normalization issue as two separate questions, at print matrix from EdgeR that are normalized by cpm and by spike-in and Detect global differences in miRNA expression between tumor and normal using spike-ins. Please try to avoid redundancy, as we'll end up answering the same thing multiple times.