Hi, I have a matrix of approximately 2 million genes, reads resulting from RSEM_readCounts with over 100 samples (over 30 treatments, each one has 3 repeats).
- What is the correct way to identify differentially expressed genes?
- I was thinking of using limma, since I found few references suggesting limma for RSEM. How should one normalize RSEM_readCounts with limma? Is there some suggested pipeline for using limma for such case?
- Normalization issue: I have also tried to use logCPM or logCPMPrior3 on the RSEMreadCounts and received 20% of the data with high peak below zero (plus a peak above zero). I have tried to do the same (logCPM or logCPMPrior3) on rounded RSEMreadCounts values but still got high peak below zero. Of course I have filtered for low counts before. Any suggestion for better normalization?
Thank you,
Arik
RSEM will report counts at the gene level too.
"...then you should use limma-voom...".
Dear James, thank you for your answer.
Regarding: "You have two million genes?". Yes (actually 1.95M) :-) . These are mixed samples of multiple genomes, a Metatranscriptome.
Regarding: "should summarize first using tximport. There is a vignette for that package that you should read." Thank you. I saw something in: https://bioconductor.riken.jp/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html#rsem
Regarding: "you shouldn't expect to have a unimodal distribution! ". A) I tried many kind of filtering ( even reduced number of transcripts by 90%) and still got a peak below zero (for logcm)). This makes biologial sense- since there are over 90 treatments, it makes sense that there is zero expression in some of them for the great majority of the genes. B) My worry is, that many of the statistical tests evaluating if genes (or transcripts) are significantly differentially expressed are based on the assumption of normal distribution. (e.g., "in limma, linear modelling is carried out on the log-CPM values which are assumed to be normally distributed ", https://www.bioconductor.org/packages/devel/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html). Do you think it is possible to apply voom on non-normal distribution?
Thank you,
Arik
The histogram of logcpm values you are looking at is completely irrelevant. limma makes no assumptions whatsoever about the distribution of expression levels between genes or transcripts. The histogram does not need to be normally distributed. The histogram could look like a camel with three humps for all limma cares.
The assumption of normality mentioned in the 1-2-3 workflow is something different, it refers to the distribution of residuals within a gene, something you cannot easily examine. Basically, you don't need to worry about this assumption. If you use expected counts from RSEM, then the assumption will be taken care of by voom and limma.
If there was any need to check assumptions by looking at histograms of log-CPM values, then that would have been explained in the 1-2-3 workflow.
Dear, Professor Smyth, Thank you very much.
Regarding: "The histogram does not need to be normally distributed.... could look like a camel with three humps ..."
I understand it is true for all limma tests not only voom, is it correct?
Regrading "distribution of residuals within a gene".
Could you please explain what does it mean? Is it something like distribution of: (Xi-Xavg), were Xi is an expression value for a gene X in a given sample i, and avg is average expression of that gene (i.e., average of row in expression matrix were cols=samples, rows = genes)?
Could you please help me and explain or send a protocol explaining:
What is the procedure needed (e.g., which normalization is needed? is rounding of decimals needed?) before using limma voom on RSEM values?
Regarding my previous request I saw in tximport vignettes the following procedure (https://bioconductor.riken.jp/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html#rsem):
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results.gz"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
y <- DGEList(txi$counts)
y <- calcNormFactors(y)
design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)
Is this procedure above enough to calculate voom limma ?
Or is there a need to do also additional normalization? Like the procedure suggested for edgeR in the same tximport vignettes?
In this regards, the vignettes describes two major possibilities for edgeR, which is the correct one for RSEM?:
1. The "bias corrected counts without an offset” method:
Use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software I guess it means one should us ethe following import command:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE, countsFromAbundance = "lengthScaledTPM")
2. The “original counts and offset” method:
library(edgeR)
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
Thank you very much again for your kind help,
Arik
None of this is at all relevent. The distribution of Xi-Xavg does not need to be normal. You do not need tximport. See my separate answer.