Hello!
I have a pipeline for targeted RNASeq data that includes edgeR for initial normalization followed by RUVSeq (RUVr). Output of RUVr is expected: df$W and df$NormalizedCounts. However, the normalized counts are in logCPM and I would like to back log them for data exploration purposes. This level of stats is a bit over my head (I'm a PhD physiology student) so I am struggling to understand what direction to take my code to back log them. If I simply take the normalized counts and apply 10^(x+1) I get counts that seem unreasonably high. This makes sense to me given that "The normalized counts are indeed simply the residuals from ordinary least squares regression of logY on W (with the offset if needed)". However, I don't understand how to apply that equation with the data in R to work back to the CPM. Can someone point me to some general R code or help with the concept of what to do to get back to CPM rather than logCPM?
It hadn't occurred to me to input CPM rather than logCPM which is why my df$NormalizedCounts didn't exceed 17ish in value and have negative values.
Having utilized RUVr with both CPM and logCPM, inputting the logCPM gives us output that is more reflective of what we'd expect biologically. If I used edgeR to normalize initially (which used log2 to make my logCPM input), can I 2^(df$NormalizedCounts) to back convert or is this problematic because edgeR utilizes library size?
You shouldn't use CPM or logCPM for any of the steps! To use RUVr you need to fit a preliminary model using edgeR, which takes counts as input, and then you give RUVr counts as well. This is pretty clearly spelled out in the help page for RUVr. For example
Usage:
RUVr(x, cIdx, k, residuals, center=TRUE, round=TRUE, epsilon=1, tolerance=1e-8, isLog=FALSE)
Arguments:
x: Either a genes-by-samples numeric matrix or a
SeqExpressionSet object containing the read counts.
Nothing about CPM or logCPM there, just counts. And below that it says
residuals: A genes-by-samples matrix of residuals obtained from a
first-pass regression of the counts on the covariates of
interest, usually the negative binomial deviance residuals
obtained from 'edgeR' with the 'residuals' method.
Again, no CPM or logCPM here. Just counts. And the example shows how to do it, using some data from the zebrafishRNASeq package that you can use as an example for your data.
Also, you might be confused about something. CPM is counts/million counts, which uses library size to scale normalize and logCPM is just the logs of the CPM values. And edgeR doesn't
normalize anything, really, unless you are talking about CPM or logCPM values. Usually you use TMM to estimate the library size, and that is used as an offset in the GLM rather than explicitly normalizing the data first.
Ah I think I know where the communication breakdown is occurring. When I think of RUVr, I'm including fitting the preliminary model in edgeR, not just the RUVr line; I've been following along with the help page and tried the zebrafishRNASeq package initially.
When I switch to my data I've been also using my predecessor's code: she used TMM on raw data to estimate library size df <-calcNormFactors(df), obtained the matrix of logCPM counts <- cpm(df, log=T) and then used that count matrix to fit the design as described in the help page line for line (except higher k).
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
res <- residuals(fit, type="deviance")
set4 <- RUVr(set, genes, k=3, res)
pData(set4)
Since she used a logCPM count matrix to fit a model prior to RUVr(), the normalized counts came back logCPM in appearance.
I’m guessing this pipeline is inappropriate since it essentially renormalized counts that were already normalized.
I am unfamiliar with a counts function (your counts(set)), so it's not clear what exactly you are putting into that DGEList. But if it's logCPM values, then for sure it's wrong.
Thank you James!
It hadn't occurred to me to input CPM rather than logCPM which is why my df$NormalizedCounts didn't exceed 17ish in value and have negative values.
Having utilized RUVr with both CPM and logCPM, inputting the logCPM gives us output that is more reflective of what we'd expect biologically. If I used edgeR to normalize initially (which used log2 to make my logCPM input), can I 2^(df$NormalizedCounts) to back convert or is this problematic because edgeR utilizes library size?
You shouldn't use CPM or logCPM for any of the steps! To use
RUVr
you need to fit a preliminary model usingedgeR
, which takes counts as input, and then you giveRUVr
counts as well. This is pretty clearly spelled out in the help page forRUVr
. For exampleNothing about CPM or logCPM there, just counts. And below that it says
Again, no CPM or logCPM here. Just counts. And the example shows how to do it, using some data from the
zebrafishRNASeq
package that you can use as an example for your data.Also, you might be confused about something. CPM is counts/million counts, which uses library size to scale normalize and logCPM is just the logs of the CPM values. And edgeR doesn't normalize anything, really, unless you are talking about CPM or logCPM values. Usually you use TMM to estimate the library size, and that is used as an offset in the GLM rather than explicitly normalizing the data first.
Ah I think I know where the communication breakdown is occurring. When I think of RUVr, I'm including fitting the preliminary model in edgeR, not just the RUVr line; I've been following along with the help page and tried the zebrafishRNASeq package initially.
When I switch to my data I've been also using my predecessor's code: she used TMM on raw data to estimate library size
df <-calcNormFactors(df)
, obtained the matrix of logCPMcounts <- cpm(df, log=T)
and then used that count matrix to fit the design as described in the help page line for line (except higher k).Since she used a logCPM count matrix to fit a model prior to RUVr(), the normalized counts came back logCPM in appearance.
I’m guessing this pipeline is inappropriate since it essentially renormalized counts that were already normalized.
I am unfamiliar with a
counts
function (yourcounts(set)
), so it's not clear what exactly you are putting into thatDGEList
. But if it's logCPM values, then for sure it's wrong.Thank you, it was!