Question

RNAseq analysis: what comes first, filtering or normalization

apfelbapfel ▴ 30

Hi there

please excuse my very basic questions, but I was not able to find appropriate answers using search engines.

I am trying to analyze a small RNA-seq dataset (3 vs 3 samples) to identify differentially expressed genes and do some multivariate statistics. Due to the low sample size I chose to use edgeR, but I am a bit confused. In the user's guide (https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) all steps are nicely explained, but the order seems odd to me: they first describe filtering for low read counts, which in my samples removes quite a bit from the respective libraries, and then describe TMM normalization to account for the RNA composition effect.

Is this really the right order to do it, or am I confusing things?

So first:

data_edgeR <- DGEList(counts=data_matrix[2:46079,3:10], group=group) #create DGEList for further analyses

data_edgeR$samples #looking at library sizes before filtering

keep <- rowSums(cpm(data_edgeR)>1) >= 3
data_edgeR_filtered <- data_edgeR[keep, , keep.lib.sizes=FALSE]

and then

data_TMM_normalized <- calcNormFactors(data_edgeR_filtered)

Is this correct, or the other way around?

Many thanks!

edger R rnaseq • 4.9k views

updated 6.0 years ago by Aaron Lun ★ 28k • written 6.1 years ago by apfelbapfel ▴ 30

score 2 · Answer 1 · 2018-12-29

You don't explain why it seems odd to you, which would help with explaining things.

Keep in mind that the filtering step with cpm (or with filterByExpr) uses counts-per-million, which are effectively library size-normalized. So the filtering is not being done on the raw counts. That would obviously be silly in the presence of samples with differing coverage, as the retention of a gene by the filter would depend heavily on its expression in the libraries with deeper coverage.
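To make the point about coverage concrete, here is a small base-R sketch. The counts are made up for illustration, and cpm.manual is a hand-rolled stand-in for what cpm does when all normalization factors are 1:

```r
# Toy counts: library B is sequenced 10x deeper than library A,
# but geneX makes up the same proportion of both libraries.
counts <- matrix(c(  5,   50,
                   995, 9950),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneX", "rest"), c("A", "B")))

lib.size <- colSums(counts)                    # 1000 and 10000
cpm.manual <- t(t(counts) / lib.size) * 1e6    # counts per million, no normalization factors

counts["geneX", ] >= 10    # raw-count threshold: the outcome differs between A and B
cpm.manual["geneX", ]      # CPM: the same value in both libraries
```

A threshold on the raw counts would treat geneX differently in the two libraries purely because of sequencing depth, whereas the CPM values agree.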

TMM normalization in calcNormFactors removes composition biases that remain after library size normalization. You could argue that the filtering should be performed on the TMM-normalized counts; naively, this would require calcNormFactors to be run before filtering. However, this is not desirable as low counts reduce the accuracy of TMM normalization (lots of discreteness, imprecision, see here).

If you must filter on the TMM-normalized counts (e.g., because some of your samples have extreme composition biases), the correct procedure would be to do something analogous to what we do in single-cell data analysis. That is:

1. Filter prior to TMM normalization, using the same procedure as described above. Do not recompute the lib.size field (i.e., leave keep.lib.sizes=TRUE when subsetting).
2. Run calcNormFactors on the filtered object.
3. Assign the normalization factors back to the original object. This only works if lib.size is not changed after filtering.
4. Redo the filtering step to obtain the final filtered object. cpm will now be aware of the normalization factors.
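The steps above could be sketched roughly as follows. This is only an illustration on simulated counts (the data, the object names and the CPM > 1 in at least 3 samples threshold are assumptions carried over from the question, not a recommendation), and it leaves the library sizes untouched in the last step because the procedure does not say whether to recompute them there:

```r
library(edgeR)

# Simulated counts purely for illustration (hypothetical 3 vs 3 design).
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 20, size = 2), ncol = 6)
group <- factor(rep(c("A", "B"), each = 3))
y <- DGEList(counts = counts, group = group)

# Step 1: filter, keeping the original library sizes.
keep <- rowSums(cpm(y) > 1) >= 3
y.filt <- y[keep, , keep.lib.sizes = TRUE]

# Step 2: compute TMM factors on the filtered object.
y.filt <- calcNormFactors(y.filt)

# Step 3: copy the factors back to the unfiltered object.
# This is only valid because lib.size was not changed by the filtering.
stopifnot(identical(y.filt$samples$lib.size, y$samples$lib.size))
y$samples$norm.factors <- y.filt$samples$norm.factors

# Step 4: redo the filter; cpm() now uses lib.size * norm.factors.
keep.final <- rowSums(cpm(y) > 1) >= 3
y.final <- y[keep.final, , keep.lib.sizes = TRUE]
```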

This is rather complicated for routine use and it's difficult to show that it would have any benefit over the usual approach, especially given that most datasets will have normalization factors for all samples close to 1.
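If you want to check whether it matters for your own data, one option is to look at how far the normalization factors are from 1 and how many genes change filter status between the two approaches. A rough sketch on simulated counts (hypothetical data and threshold; substitute your own DGEList):

```r
library(edgeR)

# Hypothetical simulated counts standing in for a real dataset.
set.seed(2)
counts <- matrix(rnbinom(6000, mu = 20, size = 2), ncol = 6)
y <- DGEList(counts = counts, group = factor(rep(c("A", "B"), each = 3)))

# Usual approach: filter on library size-normalized CPMs.
keep.usual <- rowSums(cpm(y) > 1) >= 3

# TMM factors computed on the filtered genes, stored on the unfiltered object.
y$samples$norm.factors <- calcNormFactors(y[keep.usual, ])$samples$norm.factors
range(y$samples$norm.factors)    # how far the factors stray from 1

# Filter again with the factors in place and tabulate any disagreements.
keep.tmm <- rowSums(cpm(y) > 1) >= 3
table(usual = keep.usual, tmm = keep.tmm)
```

If the off-diagonal cells of the table are empty or nearly so, the extra steps change little for that dataset.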