Question

when to apply quantile normalization with voom in limma/voom framework with RNA-seq data?

5

Entering edit mode

lassancejm ▴ 50

@lassancejm-9631

Last seen 9.2 years ago

Hi all,

I'm using voom and limma to analyse RNA-seq data. In general, I have followed the basic instructions regarding RNA-seq analysis presented in the limma manual.

The matrix of read counts is filtered so that I keep genes where cpm>=1 in at least 3 samples. This is done using the following code:

keep=rowSums(cpm(matrix)>=1)>=3
matrix=rnaseqMatrix_parents[keep,]

Next, I apply the TMM normalization and use the results as input for voom

DGE=DGEList(matrix)
DGE=calcNormFactors(DGE,method =c("TMM"))
v=voom(DGE,design,plot=T)

However, with one of my datasets, the plot produced by voom looks generally ok, but with a gap midway

Looking back at the manual for limma, I noticed the following note:

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example: v <- voom(counts,design,plot=TRUE,normalize="quantile")

Although I am not certain that my dataset qualifies as "very noisy" (happy to read what people have to say about it or point me to examples) I decided to give it a whirl.

v2=voom(matrix,design,plot=T, normalize="quantile")

The plot looks better than before:

I should finish by saying that I went on and perform the differential expression analysis using the two outputs from voom. Things change marginally (1570 vs 1758 DEG out of 15729 genes at 5% FDR when using the "TMM" vs "quantile" strategy). Not surprisingly, the genes differentially expressed in one configuration only are associated with Q-values close to the 0.05 cut-off.

I am a bit uncertain about using the quantile normalization route (did not find a reference where people had used it), so I am eager to have inputs/guidelines about when to decide using TMM vs quantile norm.

Input on this most welcome!

Thanks!

Jean-Marc

limma voom limma voom rna-seq quantile normalization • 20k views

ADD COMMENT • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by lassancejm ▴ 50

Gordon Smyth · Answer 1 · 2016-01-30

9

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

You ask a good question. Personally I reserve quantile normalization for really extreme situations. On the other hand, one of my close colleagues, Wei Shi, likes to use quantile normalization almost routinely, see for example:

http://www.ncbi.nlm.nih.gov/pubmed/25894659

As you have seen, it doesn't make a spectacular difference in most cases; scale normalization is simpler but quantile normalization tends to give a few more DE genes.

To me, your voom plot with TMM looks fine. The plots shows that you have some clusters of transcripts with special characteristics, evidenced by the gap and by the blob of points in the middle left of the plot. These transcripts are probably outliers in a subset of samples and this is something to explore as part of the downstream analysis, but to me is not something for the normalization to solve. It doesn't cause any serious problems to voom.

ADD COMMENT • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you for your answer(s) Gordon, and pointing to reference making use to the quantile normalization.

I will be more careful about potential outliers (we expect some heterogeneity between the samples here) as well as interactions between the variables tested here, good point.

From what I see, I gain some DE genes,but I also lose some, which worries me a bit as I find arbitrary to favor one normalization route over the other at present.

ADD REPLY • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by lassancejm ▴ 50

0

Entering edit mode

Rather than worry about the normalization too much, better to explore the data. E.g. try a BCV plot to look for dispersion outliers, or try robust=TRUE with eBayes() to downweight dispersion outliers. I like to do this:

dge <- estimateDisp(DGE, design, robust=TRUE)
plotBCV(dge)

then highlight genes for which dge$df.prior is small.

ADD REPLY • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Well, something is possibly going on here. As I wrote previously, I am expecting some heterogeneity in the dataset. More about the samples: they correspond to a particular brain region sampled in males and females of two different genetic backgrounds. So, there is probably some biological noise coming from sex, strain and the interaction as well as due to the sampling process. I read previously that high level of BCV were typically expected in brain samples.

The BCV plot linked here was produced using the code mentioned by Gordon (see his answer above).

So, to counter-balance the effect of outliers and based on the previous replies, the code should then be:

  DGE=DGEList(matrix)
  DGE=calcNormFactors(DGE,method =c("TMM"))
  v=voom(DGE,design) 
  fit=lmFit(v,design)
  fit=eBayes(fit,robust=TRUE )

Does that seem reasonable or do you recommend additional steps? Thanks again for your input!

ADD REPLY • link 9.2 years ago lassancejm ▴ 50

2

Entering edit mode

Well, the above code may be enough. But you will learn more from the BCV plot if you look to see which are the most variable genes (with small df.prior). Are they sex-linked genes on the Y and X chromosomes? Are they ribosomal genes? Are they mitochondrial? Are they clearly associated with a particular cell type? In my experience, the dispersion outliers almost always tell a story that tells you something about your experiment.

ADD REPLY • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Very good points, thanks for your advise.

ADD REPLY • link 9.2 years ago lassancejm ▴ 50

0

Entering edit mode

Just add to Gordon's reply - the quantile normalization was used along with limma/voom in the SEQC/MAQC III study and this approach has been demonstrated to perform very well for the SEQC RNA-seq data:

http://www.ncbi.nlm.nih.gov/pubmed/25150838

ADD REPLY • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by Wei Shi ★ 3.6k

Gordon Smyth · Answer 2 · 2016-01-30

6

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 12 days ago

United States

I believe the original voom paper (Law et al.) actually uses quantile normalization. although I've never used it, it would be interesting to hear if the gurus have any pointers here.

Also, TMM and quantile normalization are separate and you should always ensure that you are using TMM instead of "simple" normalization by read depth. An easy way to ensure this works as expect is to create an edgeR::DGEList with your data and call calcNormFactors on it. Use that data object to pass through voom, and not just a count matrix

Update

In the event that someone skimming this thread doesn't read through the comments, I wanted to provide an update to this answer in light of Gordon's replies to this answer:

Take note that (contrary to what I said in my answer) Gordon suggests that you actually wouldn't want to do both TMM followed by quantile normalization, but rather choose one or the other.

ADD COMMENT • link 9.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

In the voom paper, we used TMM for the HapMap and and D. melanogaster datasets and quantile for the SEQC and DBA/2J datasets. Wei Shi prepared the SEQC data and he likes to use quantile. I don't recall why Charity used quantile for the DBA/2J data.

ADD REPLY • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for chiming in, but I'm a bit confused that you are referencing TMM and quantile normalization as mutually exclusive entities.

Do you not consider these two separate things, as I mentioned in my answer? So, when you say that you used TMM for HapMap and fly, but quantile for SEQC, does this mean that when voom (essentially) calculates a cpm(counts, prior.count=0.5) of the count matrix, it's using straight-up colSums of the count matrix instead of an adjusted lib.size using TMM normalization factors in the SEQC (quantile normalized) case?

I'm pretty sure that's not what you mean, which but that's how I'm reading what you're saying ... ? Even if I were going to quantile normalize my data, I'd still want to adjust the library size by the norm factors either way ...

ADD REPLY • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by Steve Lianoglou ★ 13k

2

Entering edit mode

Hi Steve,

No I wouldn't recommend TMM before quantile normalization, but it won't make much difference. If you apply scale normalization (such as TMM) then follow it up with quantile normalization of the logCPM values, then the scale normalization will be overwritten. Scale normalization becomes essentially an additive constant on the log-scale for each library, so it will be removed by quantile normalization.

We have never recommended using multiple normalization methods together as it confuses the issue somewhat.

ADD REPLY • link 9.2 years ago Gordon Smyth 52k