Question

Limma - how to manage huge differences in library sizes

0

Entering edit mode

baus87 • 0

@baus87-13757

Last seen 6.9 years ago

Dear Bioconductor community,

i'm performing a differential expression analysis (classic approach: treatment vs control condition) in which library sizes of biological replicates of control thesis exceed of 100 times those of treatment condition. The software used, limma, is impaired in DEGs detection despite filtering threshold was set up on the basis of the lowest library size. In particular, limma found up-/down-regulated genes which actually are not (this is clearly showed up by looking at raw counts of such genes).

How can I manage it?

Thanks in advance

limma • 3.4k views

ADD COMMENT • link updated 6.9 years ago by Gordon Smyth 52k • written 6.9 years ago by baus87 • 0

0

Entering edit mode

Dear Wolfgang,

thanks for your reply. Actually it's a consequence of experimental design. We're talking about fungal pure culture (control condition) vs. an in planta fungal interaction at the very early stage of fungal growth. Regrettably, we do not have negative control. Also, the failure in proper DEGs identification it is not a consequence of a mistaken limma workflow, since i verified that this drawback does not occur when taking into account theses having similar library sizes.

ADD REPLY • link 6.9 years ago baus87 • 0

1

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 6 weeks ago

EMBL European Molecular Biology Laborat…

Is this a data quality problem or an expected consequence of the experimental design?

If it's the former, then perhaps it's best to stop here.

If it's the latter, then what is your approach to deconfound true biological effects on library size from technical variations? Do you have spike-in controls, known 'negative control' genes, etc.?

ADD COMMENT • link 6.9 years ago Wolfgang Huber ★ 13k

score 2 · Accepted Answer · 2018-05-08

The fact that the sequencing depths are so radically different is certainly undesirable, especially when library size is confounded with treatment condition. However, limma does not give DE results just because the library sizes are different.

I assume that you have RNA-seq data and you are using limma-voom. voom is specifically designed to handle library sizes that vary by an order of magnitude or more. Are you using robust=TRUE when you call eBayes()? That may help. How many replicates do you have?

There is only one situation in which limma can give too much DE, which is when you have a lot of genes with total count equal to zero in one of the treatment conditions (i.e., all the replicate treatment condition samples have count=0, but the control counts are not zero). This could easily be occurring in your study if the treated samples have small library sizes. Otherwise limma should control the error rate correctly. I don't believe that limma is so dumb that you can determine the true status of the genes yourself without any doubt by just eyeballing the counts whereas limma can't figure it out correctly. If you wish to claim this, you need to give some evidence.

I myself use edgeR instead of limma in extreme count situations when there are lots of zeros. The edgeR quasi-likelihood pipeline (described here:)

https://www.bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html

https://f1000research.com/articles/5-1438

is similar to limma, but should control the error rate correctly even when the counts are all zero on one group. This issue is discussed here:

No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data