Question

Normalisation of RNA seq data

Fiona ▴ 70

Hi everyone,

I'm struggling to understand the pros and cons of various ways of normalising RNA-seq read count data.

Although I'm not working with a specific dataset at the moment, in theory I'm thinking about read counts from replicated samples, and trying to test differential expression between different treatments in a fairly complex experimental design (one requiring mixed model analysis). At no point in the analysis will comparisons between relative expression levels of different genes be made.

I have come across 3 main ways of dealing with normalisation, and I was hoping people with more expertise than me would be willing to offer opinions/advice on which of these is best (although I appreciate the question probably won't be as straightforward as that).

1. RPKM values, then analyse these in the mixed model

2. Analysis of non-normalised read counts within an R Bioconductor package such as limma. This is problematic with the types of analyses I am considering, because the package cannot form models complex enough to account for all the mixed effects that I would like to be able to use.

3. Analysis of non-normalised read counts within an R mixed modelling package such as 'lme4', using a suitable data distribution (Poisson/quasi-Poisson) and including a sample ID term as a random factor that will account for variation in reads between replicates, within the model.

Any thoughts/advice that anyone has would be gratefully received.

Thanks very much!

rnaseq bioconductor R normalization • 1.9k views

ADD COMMENT • link updated 9.0 years ago by chris86 ▴ 420 • written 9.0 years ago by Fiona ▴ 70

I don't think RPKM values will be helpful for statistical analyses. They are meant more for visualization on a comparable scale, e.g. heatmaps.

I think you are better off normalizing your data in limma, e.g. with voom. Then use the normalized data in limma for further statistics, or import it into lme4 for further analysis.

ADD REPLY • link 9.0 years ago b.nota ▴ 370

Keep in mind that the "normalized" expression one would retrieve from voom (presumably you mean from the $E element of the EList the voom function returns) is not anything extraordinary -- it's essentially log2(cpm(counts + 0.25)), and that's it. The important thing that voom provides is the $weights from the same EList, which it then uses in the linear modeling step.
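To make the "log2 of counts per million" point concrete, here is a minimal pure-Python sketch of that transformation. It is not limma's code: the function name `voom_style_log_cpm` is hypothetical, and the default offsets (0.5 added to each count, 1 added to each library size) are taken from the voom paper's description, which differs slightly from the 0.25 quoted above, so treat them as adjustable assumptions. The sketch produces only the transformed values and says nothing about the precision weights.

```python
import math


def voom_style_log_cpm(counts, offset=0.5, lib_offset=1.0):
    """Return log2 counts-per-million for a genes x samples table.

    counts: list of rows (one per gene), each a list of per-sample counts.
    offset / lib_offset: small constants that keep log2 finite for zero
    counts (assumed values, see the text above).
    """
    n_samples = len(counts[0])
    # Library size = total count in each sample (column sum).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [
            math.log2((row[j] + offset) / (lib_sizes[j] + lib_offset) * 1e6)
            for j in range(n_samples)
        ]
        for row in counts
    ]


# Toy data: 3 genes, 2 samples; the second sample is sequenced twice as deeply.
counts = [
    [10, 20],
    [0, 5],
    [990, 1975],
]
log_cpm = voom_style_log_cpm(counts)
```

The zero count in the second gene still yields a finite value because of the offset. Nothing in this matrix records how reliable each value is, which is the information the weights carry.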

If you were to export the $E matrix from the voomed EList for analyses with another package, be sure you are using an analysis package that can also take advantage of the observational weights, otherwise ... what's the point of vooming in the first place?

ADD REPLY • link 9.0 years ago Steve Lianoglou ★ 13k

Thanks for clarifying that, Steve!

It seems the best advice is to keep the whole analysis in limma. Maybe Fiona can explain what kind of design/mixed model she needs for her analysis.

ADD REPLY • link 9.0 years ago b.nota ▴ 370

Perhaps more progress could be made if you explained your experimental design, and why it cannot be modeled with existing tools like limma. Note that limma provides mixed model-like functionality via duplicateCorrelation. Also, there are many more normalization methods than you've listed here: TMM, loess/quantile, size factor normalization, etc.
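As an illustration of what "size factor normalization" can look like, here is a pure-Python sketch of the median-of-ratios idea popularised by DESeq. It is an assumption-laden toy, not that package's implementation: the function name `median_of_ratios_size_factors` and the example counts are made up for illustration.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one scaling factor per sample from a genes x samples table."""
    n_samples = len(counts[0])
    # Keep only genes with non-zero counts in every sample, so that the
    # log geometric mean below is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each usable gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        # The median is robust to a minority of strongly changed genes.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 has twice the depth of sample 1; the last gene is
# strongly up-regulated and the fourth gene has a zero count.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 7],
    [10, 400],
]
size_factors = median_of_ratios_size_factors(counts)
normalised = [[c / s for c, s in zip(row, size_factors)] for row in counts]
```

In this toy case the two size factors differ by a factor of two, matching the difference in depth, and the up-regulated gene does not distort them because the median ignores it.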

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

score 0 · Answer 1 · 2016-02-26

I agree with the comments you have received.

RPKM is just for comparing abundance between different genes; you don't want that for any downstream statistics.
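A small numerical sketch may help show why the length term matters only for between-gene comparisons. It uses the usual definition of RPKM (reads per kilobase of gene per million mapped reads); the helper names and the example numbers are hypothetical. For a single gene compared across two samples, the gene length cancels, so the RPKM ratio equals the plain counts-per-million ratio.

```python
def cpm(count, lib_size):
    """Counts per million mapped reads."""
    return count / lib_size * 1e6


def rpkm(count, gene_length_bp, lib_size):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * lib_size)


# One gene of fixed length measured in two samples (made-up numbers).
gene_length_bp = 2000
count_a, lib_a = 150, 3_000_000
count_b, lib_b = 400, 4_000_000

# Between-sample ratio for the same gene, with and without length scaling.
ratio_cpm = cpm(count_b, lib_b) / cpm(count_a, lib_a)
ratio_rpkm = (
    rpkm(count_b, gene_length_bp, lib_b) / rpkm(count_a, gene_length_bp, lib_a)
)
```

Both ratios come out the same here, so dividing by gene length adds nothing when each gene is only ever compared with itself across treatments.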

You do need to normalise your counts, though; you can do that with limma/voom, DESeq, etc. If you need to, you can then export the normalised counts to whatever package you use for the mixed modelling, but it can all be done within limma.