I'm using the function RUVg in the RUVSeq package to analyse my RNA-seq data. The output of RUVg consists of the factors of unwanted variation W and the normalised counts N. Could someone please tell me how the normalised counts N are calculated? See the formula below.
log E[Y | W, X, O] = Wα + Xβ + O.
Are the normalised counts simply log(Y)-Wα-O? If so, how can I get α from the output of RUVg? Also, log(Y)-Wα-O wouldn't consist of integers, so why does the normalised counts matrix N produced by RUVg consist of counts?
I'm glad you figured it out yourself. I'll just post an answer anyway in case somebody else has a similar issue.
The normalized counts are indeed simply the residuals from an ordinary least squares regression of log Y on W (with the offset if needed). The output of RUV is then exponentiated (and optionally rounded) so that it can be interpreted as "normalized counts". Please note that these are intended only for visualization and exploration: the RUV model has been tested only on supervised problems, and it works better when W is included in the model of the original counts than when the normalized counts are modeled directly.
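To make the computation above concrete, here is a minimal base-R sketch on simulated data. It illustrates the idea only and is not the RUVg source: the simulated counts, the choice of the first 50 genes as negative controls, k = 1, and epsilon = 1 are all assumptions made for the example.

```r
# Simulated data (all values made up): 200 genes x 6 samples,
# where samples 4-6 carry an unwanted two-fold effect.
set.seed(1)
n_genes <- 200
n_samples <- 6
batch <- c(0, 0, 0, 1, 1, 1)
mu <- matrix(50, n_genes, n_samples)
mu[, batch == 1] <- mu[, batch == 1] * 2
counts <- matrix(rpois(n_genes * n_samples, mu), n_genes, n_samples)

epsilon <- 1                      # assumed pseudo-count to avoid log(0)
Y <- t(log(counts + epsilon))     # samples x genes, log scale
ctl <- 1:50                       # hypothetical negative control genes

# Estimate one factor of unwanted variation from the centered controls.
Yc <- scale(Y, center = TRUE, scale = FALSE)
W <- svd(Yc[, ctl])$u[, 1, drop = FALSE]

# Ordinary least squares of log Y on W, gene by gene.
alpha <- solve(t(W) %*% W, t(W) %*% Y)
resid_log <- Y - W %*% alpha

# Back to the count scale: exponentiate, round, and floor at zero.
normalized <- round(exp(resid_log) - epsilon)
normalized[normalized < 0] <- 0
normalized <- t(normalized)       # genes x samples again
```

The rounding step is why the result consists of whole numbers even though the regression itself is carried out on the log scale.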
Similarly, the alpha estimated in the first step of RUV probably shouldn't be used directly (and hence is not returned); rather, alpha and beta will be (re-)estimated by the full model on W and X. If you're using edgeR, you will find the estimated parameters in the output of glmFit (coefficients).
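As a rough sketch of where those re-estimated parameters end up, the example below fits the full model with edgeR on simulated data. The counts, the group labels, the hidden batch effect, and the choice of the first 100 genes as negative controls are all assumptions for illustration; only the overall pattern (W in the design, coefficients read from the glmFit output) comes from the answer above.

```r
library(RUVSeq)   # also attaches EDASeq and edgeR

# Simulated counts (made up): 500 genes x 8 samples with a hidden nuisance factor.
set.seed(7)
n_genes <- 500
group <- factor(rep(c("A", "B"), each = 4))
batch <- rep(c(0, 1), times = 4)
mu <- matrix(100, n_genes, 8)
mu[, batch == 1] <- mu[, batch == 1] * 1.5
counts <- matrix(rnbinom(n_genes * 8, mu = mu, size = 10), n_genes, 8,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:8)))
controls <- rownames(counts)[1:100]   # hypothetical negative control genes

# Step 1: estimate W only; alpha from this step is not returned.
ruv <- RUVg(counts, controls, k = 1)
pheno <- data.frame(group = group, ruv$W)   # the W column is named "W_1"

# Step 2: full model on X (group) and W, fitted to the original counts.
design <- model.matrix(~ group + W_1, data = pheno)
y <- calcNormFactors(DGEList(counts = counts, group = group))
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# One row per gene, one column per design term (including W_1).
coefs <- fit$coefficients
head(coefs)
```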
Hello
I was wondering whether I should use the RUV-corrected counts and RUV library sizes to create a DGE object that can be used later for DE analysis with limma-voom?
Our recommended approach is to add the factors estimated by RUV to the design matrix used with the DGE object. We illustrate this with edgeR and DESeq2 in the vignette, and it will be similar for limma-voom, e.g., use the library-size normalized log-counts (the same input you gave to RUV) and include W in the design matrix.
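For limma-voom, that recommendation might look roughly like the sketch below. It is not taken from the vignette: the simulated counts, the group labels, the negative control set, and k = 1 are assumptions, and the point is only that W enters through the design matrix while voom is run on the original counts.

```r
library(RUVSeq)   # also attaches edgeR, which in turn attaches limma

# Simulated counts (made up): 500 genes x 8 samples with a hidden nuisance factor.
set.seed(42)
n_genes <- 500
group <- factor(rep(c("A", "B"), each = 4))
batch <- rep(c(0, 1), times = 4)
mu <- matrix(100, n_genes, 8)
mu[, batch == 1] <- mu[, batch == 1] * 1.5
counts <- matrix(rnbinom(n_genes * 8, mu = mu, size = 10), n_genes, 8,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:8)))
controls <- rownames(counts)[1:100]   # hypothetical negative control genes

# Estimate the factor of unwanted variation.
ruv <- RUVg(counts, controls, k = 1)
pheno <- data.frame(group = group, ruv$W)   # the W column is named "W_1"

# Put W in the design matrix; do not replace the counts with normalized ones.
design <- model.matrix(~ group + W_1, data = pheno)
dge <- calcNormFactors(DGEList(counts = counts))
v <- voom(dge, design)
fit <- eBayes(lmFit(v, design))
top <- topTable(fit, coef = "groupB")
```

The group effect is then tested with W adjusted for, rather than on pre-corrected counts.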
I am trying to add the factors of unwanted variation "W" to the design matrix, but I am confused: the design matrix will also be used to make contrasts, fit a linear model, and perform empirical Bayes moderation, and it gives me the error below.
First, a pedantic note. It's easier and clearer to do this:
design <- model.matrix(~ 0 + Stage + Age + Sex + Sample.source + Sample.ID, dge_post$samples)
Rather than how you did it.
Second, you have Sample.ID in your design matrix twice. It won't matter because model.matrix will drop one, but it makes me wonder if that is really what you did. I say that because your error says you cannot estimate dge_post$samples$Sample, which shouldn't be a coefficient if you used the code you have supplied. An alternative is that you cropped off part of the error message, but either way it's not possible to help given the information you have provided.
Third, in this context you should use voomLmFit instead:
fit <- voomLmFit(dge_post, design, sample.weights = TRUE)
And if you don't have complete pairing (which I suspect you don't), you probably won't be able to fit Sample.ID as a fixed effect, in which case you should fit a GLS with Sample.ID as the blocking variable:
design <- model.matrix(~ 0 + Stage + Age + Sex + Sample.source, dge_post$samples) ## NO Sample.ID here
fit <- voomLmFit(dge_post, design, block = dge_post$samples$Sample.ID, sample.weights = TRUE)
fit2 <- eBayes(contrasts.fit(fit, contrast.matrix))
But again, I am just guessing because your error doesn't make sense given how you constructed the design matrix (there's no dge_post$samples$Sample used anywhere).
Can I use RUV-normalized counts in logistic regression? If not, how can I remove the unwanted variation from the gene expression matrix after estimating W?
I did start with the RUVSeq vignette, but I can't find how to export the normalized counts. I have kinetics data and I don't know how to apply DESeq to such data. Hence, I would like to export the RUVg-normalized data frame.
The way you extract normalized counts depends on what object you're working on. If you're starting with a matrix, RUVg will return a list with two elements, one of which is the matrix of normalized counts.
If you're working with a SeqExpressionSet object, then you can use the normCounts() method to extract the normalized data. This is all documented in the RUVg manual page, available from within R through the help system (for example, ?RUVg).
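Both routes can be sketched as follows, again on simulated data. The counts, group labels, negative control set, k = 1, and the output file name are assumptions for the example; the file is written to a temporary directory so the sketch does not touch your working directory.

```r
library(RUVSeq)   # also attaches EDASeq and edgeR

# Simulated counts (made up): 300 genes x 6 samples.
set.seed(3)
n_genes <- 300
group <- factor(rep(c("A", "B"), each = 3))
counts <- matrix(rnbinom(n_genes * 6, mu = 100, size = 10), n_genes, 6,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:6)))
controls <- rownames(counts)[1:50]   # hypothetical negative control genes

# Route 1: matrix in, list out.
ruv <- RUVg(counts, controls, k = 1)
norm_mat <- ruv$normalizedCounts     # the other element, ruv$W, holds the factors

# Route 2: SeqExpressionSet in, SeqExpressionSet out.
set <- newSeqExpressionSet(counts,
                           phenoData = data.frame(group,
                                                  row.names = colnames(counts)))
set <- RUVg(set, controls, k = 1)
norm_set <- normCounts(set)

# Export the normalized matrix (genes in rows, samples in columns).
out <- file.path(tempdir(), "ruvg_normalized_counts.csv")
write.csv(norm_set, file = out)
```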
Yes. It can happen since the "normalized data" are simply the residuals of a linear model, and there is no constraint in the model to force zero to stay zero. Note that in practice this will all be close to 0.
Also, keep in mind that the normalized values are on the log(x+1) scale.