Hi everyone,
I'm struggling to understand pros and cons of various ways of normalising RNA-seq read count data.
Although I'm not working with a specific dataset at the moment, in theory I'm thinking about read counts from replicated samples, and trying to test differential expression between different treatments in a fairly complex experimental design (one requiring mixed model analysis). At no point in the analysis will comparisons between relative expression levels of different genes be made.
I have come across 3 main ways of dealing with normalisation, and I was hoping people with more expertise than me would be willing to offer opinions/advice of which of these is best (although I appreciate the question probably won't be as straightforward as that).
1. RPKM values, then analyse these in the mixed model
2. Analysis of non-normalised read counts within a R BioConductor package such as limma. This is problematic with the types of analyses I am considering, because the package cannot form models complex enough to account for all the mixed effects that I would like to be able to use.
3. Analysis of non-normalised read counts within an R mixed modelling package such as 'lme4', using a suitable data distribution (Poisson/quasi-Poisson) and including a sample ID term as a random factor that will account for variation in reads between replicates, within the model.
Any thoughts/advice that anyone has would be gratefully received.
Thanks very much!
I don't think RPKM values will be helpful for statistical analyses. They are more meant for e.g., heatmaps, for equal visualization.
I think you are better of with normalizing your data in limma, with e.g., voom. Then use the normalized data in limma for further statistics, or import it to your lme4 package for further analysis.
Keep in mind that the "normalized" expression one would retrieve from
voom
(presumably you mean from the$E
element of theEList
thevoom
function returns) is not anything extraordinary -- it's essentiallylog2(cpm(counts + 0.25))
, and that's it. The important thing that voom provides is the$weights
from the sameEList
, which it then uses in the linear modeling step.If you were to export the
$E
matrix from thevoom
edEList
for analyses with another package, be sure you are using an analysis package that can also take advantage of the observational weights, otherwise ... what's the point of vooming in the first place?Thanks for clarifying that, Steve!
Seems best advice is to keep the whole analysis in limma. Maybe Fiona can explain what kind of design/mixed model she needs for her analysis.
Perhaps more progress could be made if you explained your experimental design, and why it cannot be modeled with existing tools like limma. Note that limma provides mixed model-like functionality via
duplicateCorrelation
. Also, there are many more normalization methods than you've listed here; TMM, loess/quantile, size factor normalization, etc.