Question

How to use RSEM value to find differentially expressed genes

1

Entering edit mode

harelarik ▴ 60

@harelarik-13564

Last seen 3.1 years ago

Israel

Hi, I have a matrix of approximately 2 million genes, reads resulting from RSEM_readCounts with over 100 samples (over 30 treatments, each one has 3 repeats).

What is the correct way to identify differentially expressed genes?
I was thinking of using limma, since I found few references suggesting limma for RSEM. How should one normalize RSEM_readCounts with limma? Is there some suggested pipeline for using limma for such case?
Normalization issue: I have also tried to use logCPM or logCPMPrior3 on the RSEMreadCounts and received 20% of the data with high peak below zero (plus a peak above zero). I have tried to do the same (logCPM or logCPMPrior3) on rounded RSEMreadCounts values but still got high peak below zero. Of course I have filtered for low counts before. Any suggestion for better normalization?

Thank you,

Arik

RSEM limma edgeR voom normalization • 6.3k views

ADD COMMENT • link updated 2.3 years ago by Gordon Smyth 51k • written 4.2 years ago by harelarik ▴ 60

2

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

Analysing RSEM genewise expected counts in limma is straightforward. It is exactly the same as analysing regular genewise read counts and requires no special protocol, normalization or workflow. You simply follow the standard limma documentation for RNA-seq.

You must however obtain expected counts from RSEM for genes rather than for transcripts.

None of the various concerns that you mention are relevent:

There is no need for any histograms to look normal.
There is no need to round the expected counts to integers
There is no need to use tximport.
There is no need to use offsets.
There is no need for any normalization other than the usual for RNA-seq

Note that this is general advice about RSEM and limma. This does not constitute specific advice for your dataset.

ADD COMMENT • link 4.2 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Professor Smyth thank you very much,
Regarding "You must however obtain expected counts from RSEM for genes rather than for transcripts."
It is important to note that we are using meta-transcriptome data (containing hosts, and microbiome etc'), and therefore we were concerned about unifying orthologs of several organisms into one gene, if we use gene level data.

Therefore, we have selected to use (RSEM based-) isoform level counts.
Is there any issue of using isoform level counts with voom limma (discussed above). Or should we treat isoform level counts differently?
Thank you again for your help,
Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

0

Entering edit mode

Yes, definitely there is an issue with isoform level counts.

If you don't want to reduce the gene-level counts, then you just have to proceed with the counts you have. The DE analysis will not work as well as usual but you can't do anything about it. You might get useful results anyway.

Your data is very nonstandard. That's as much help as I can give you.

ADD REPLY • link 4.2 years ago • updated 2.3 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Professor Smyth, To be more correct, I meant to ask what is the problem of using isoform level counts in limma? What can I do to reduce this problem?

Thank you,

Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

0

Entering edit mode

The problem with isoform level counts is well known and has nothing to do with limma. Isoform level counts are highly variable because of ambiguity between overlapping transcripts for the same gene. This extra level of variation is absent from gene level counts. It means that the mean-variance relationship in the data that voom exploits will not be very strong.

You still may a well use limma or limma-voom as there's nothing better. If there was, then I or someone else would have told you days ago.

ADD REPLY • link 4.2 years ago • updated 2.3 years ago Gordon Smyth 51k

0

Entering edit mode

"...The problem with isoform level counts is well known and has nothing to do with limma. Isoform level counts are highly variable because of ambiguity between overlapping transcripts for the same gene. This extra level of variation is absent from gene level counts. It means that the mean-variance relationship in the data that voom exploits will not be very strong...".

Thank you for the information.

Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

0

Entering edit mode

I appreciate very much your contribution to the scientific community .

Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

score 2 · Accepted Answer · 2020-08-25

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

You have two million genes? That seems unlikely. I can't even imagine having counts for two million transcripts. Regardless, if you have 100 samples, then you should use limma-voom. See the edgeR and limma User's Guides for more information.

If you used RSEM, then you probably have transcript counts rather than gene counts. If you want to know about differentially expressed genes, then you should summarize first using tximport. There is a vignette for that package that you should read.

You seem to misunderstand what normalization is intended to do. You shouldn't expect to have a unimodal distribution! You could argue (as I normally do) that the genes with an average logCPM below zero are probably not expressed and should be filtered out. But this is a filtering issue, not normalization. Normalization is intended to reduce technical differences between samples rather than forcing a single sample to have some idealized distribution. You should expect that some genes aren't being expressed (you do, right?), in which case you should also expect a bimodal distribution for each sample.

ADD COMMENT • link 4.2 years ago James W. MacDonald 67k

0

Entering edit mode

RSEM will report counts at the gene level too.

ADD REPLY • link 4.2 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

"...then you should use limma-voom...".

Dear James, thank you for your answer.

Regarding: "You have two million genes?". Yes (actually 1.95M) :-) . These are mixed samples of multiple genomes, a Metatranscriptome.
Regarding: "should summarize first using tximport. There is a vignette for that package that you should read." Thank you. I saw something in: https://bioconductor.riken.jp/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html#rsem
Regarding: "you shouldn't expect to have a unimodal distribution! ". A) I tried many kind of filtering ( even reduced number of transcripts by 90%) and still got a peak below zero (for logcm)). This makes biologial sense- since there are over 90 treatments, it makes sense that there is zero expression in some of them for the great majority of the genes. B) My worry is, that many of the statistical tests evaluating if genes (or transcripts) are significantly differentially expressed are based on the assumption of normal distribution. (e.g., "in limma, linear modelling is carried out on the log-CPM values which are assumed to be normally distributed ", https://www.bioconductor.org/packages/devel/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html). Do you think it is possible to apply voom on non-normal distribution?

Thank you,

Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

1

Entering edit mode

The histogram of logcpm values you are looking at is completely irrelevant. limma makes no assumptions whatsoever about the distribution of expression levels between genes or transcripts. The histogram does not need to be normally distributed. The histogram could look like a camel with three humps for all limma cares.

The assumption of normality mentioned in the 1-2-3 workflow is something different, it refers to the distribution of residuals within a gene, something you cannot easily examine. Basically, you don't need to worry about this assumption. If you use expected counts from RSEM, then the assumption will be taken care of by voom and limma.

If there was any need to check assumptions by looking at histograms of log-CPM values, then that would have been explained in the 1-2-3 workflow.

ADD REPLY • link 4.2 years ago Gordon Smyth 51k

0

Entering edit mode

Dear, Professor Smyth, Thank you very much.

Regarding: "The histogram does not need to be normally distributed.... could look like a camel with three humps ..."
I understand it is true for all limma tests not only voom, is it correct?
Regrading "distribution of residuals within a gene".
Could you please explain what does it mean? Is it something like distribution of: (Xi-Xavg), were Xi is an expression value for a gene X in a given sample i, and avg is average expression of that gene (i.e., average of row in expression matrix were cols=samples, rows = genes)?
Could you please help me and explain or send a protocol explaining:
What is the procedure needed (e.g., which normalization is needed? is rounding of decimals needed?) before using limma voom on RSEM values?
Regarding my previous request I saw in tximport vignettes the following procedure (https://bioconductor.riken.jp/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html#rsem):
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results.gz"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
y <- DGEList(txi$counts)
y <- calcNormFactors(y)
design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)
Is this procedure above enough to calculate voom limma ?
Or is there a need to do also additional normalization? Like the procedure suggested for edgeR in the same tximport vignettes?
In this regards, the vignettes describes two major possibilities for edgeR, which is the correct one for RSEM?:
1. The "bias corrected counts without an offset” method:
Use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software I guess it means one should us ethe following import command:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE, countsFromAbundance = "lengthScaledTPM")
2. The “original counts and offset” method:
library(edgeR)
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)

Thank you very much again for your kind help,

Arik

ADD REPLY • link 4.2 years ago harelarik ▴ 60

0

Entering edit mode

None of this is at all relevent. The distribution of Xi-Xavg does not need to be normal. You do not need tximport. See my separate answer.

ADD REPLY • link 4.2 years ago Gordon Smyth 51k