Question

Combining multiple L2FC values and determining variance

0

Entering edit mode

David Eccles (gringer) ▴ 30

@david-eccles-gringer-7488

Last seen 5.2 years ago

New Zealand

How do I calculate the error associated with aggregated L2FC values?

I'd like to combine multiple differential expression values into a single statistic, with variance, to express how genes in the mitochondrial complexes change (or don't change) compared to wildtype cell lines under certain conditions. Here's an example of a graph I'd like to reduce to a simpler line-with-error plot (possibly with overlaid values). This is a beeswarm plot of the [non-shrunk] L2FC values produced by DESeq2:

Beeswarm of L2FC values

Here's what it looks like when I do a normal boxplot of those results in R:

Boxplot of L2FC values

But that's not right, because the points don't represent individual data values; they're similar to a difference of means. I want something that looks similar (or maybe a "thick line with indicated error" plot), but where the error represents the standard error associated with the mean L2FC. How should I calculate this error?

I notice that DESeq2 reports "lfcSE", and am wondering if I can calculate a pooled variance based on that. I had a look at the Pooled variance formula on Wikipedia, and my rough scratchings (with a bit of cancelling) suggest that in a situation with k L2FC values testing equal-size groups (i.e. where the samples used for each L2FC are the same), it ends up being:

$ n/k * \sum{i=1}^{k} ((SE{xi})^2)$

In other words, add up the squares of the standard errors, then multiply by whatever n is used in the standard error calculation, which I assume to be the total number of samples in both conditions (e.g. with 5 replicates vs 4 replicates, divide by 3), then divide by the number of L2FC values. If the L2FC values are essentially a difference-of-means test, then I see that fishing an n out of the standard error doesn't make much sense... which ties me up in knots.

Edit: here's what I've ended up with for now. I'm using the mean of L2FC values for the central value, and this function for determining the error :

    l2fcAll.se <- tapply(sub.se.tbl.genome$lfcSE, sub.se.tbl.genome$comparison,
                         function(x){sqrt(sum(x^2) / (length(x)))});

In other words, the square root of the mean square of standard error.

Beeswarm with range

I'm not confident in that (I've changed the formula a few times over the last couple of hours), and I'd like to know if anyone has any other / better ideas on what to do about this.

Thanks heaps.

deseq2 • 1.7k views

ADD COMMENT • link 5.8 years ago David Eccles (gringer) ▴ 30

score 0 · Answer 1 · 2019-06-26

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 16 hours ago

WEHI, Melbourne, Australia

What you are trying to do is called a gene set test, and you should use a gene set test method that allows for the possibility of correlations between the genes. Trying to pool errors as if the genes were independent can markedly underestimate the true variability if the genes are positively correlated (which they typically are if they're part of the same pathway). There are two major approaches to get around this, either

compute a single sample score (such as average log-expression) for the mitochondrial complex for each sample, then obtain standard errors empirically form the replicates, or
use a formal gene set test (such as roast).

ADD COMMENT • link 5.8 years ago Gordon Smyth 52k

0

Entering edit mode

Your (1) is what I'm trying to do here. Getting the average L2FC is easy, I just don't know how to properly estimate the error.

I'll pair this up with a visualisation of the normalised expression values. Again, that's an easy bit - the values are single data points (rather than relative values), and I can confidently express those as boxplots with standard error / normal range calculated in the usual way.

ADD REPLY • link 5.8 years ago David Eccles (gringer) ▴ 30

1

Entering edit mode

Getting the average L2FC is easy

As I said above, you need an average log-expression value for every sample, not an average logFC. You can't compute variability from an average log-fold-change, which is just one value across all samples.

I just don't know how to properly estimate the error.

If you have a value for every sample, then it just becomes a univariate anova or t-test. There are many examples in the literature: Figure 4AB of this paper https://doi.org/10.1038/nm.2000 gives some examples from my practice.

the values are single data points (rather than relative values), and I can confidently express those as boxplots with standard error / normal range calculated in the usual way

Yes, exactly. It has to be done like that. You an even compute an average logFC and associated standard error from the single data points. But you can't average the logFCs from DESeq2 and expect to attach a SE to that -- in my opinion that is not possible.

ADD REPLY • link 5.8 years ago Gordon Smyth 52k