How do I calculate the error associated with aggregated L2FC values?
I'd like to combine multiple differential expression values into a single statistic, with variance, to express how genes in the mitochondrial complexes change (or don't change) compared to wildtype cell lines under certain conditions. Here's an example of a graph I'd like to reduce to a simpler line-with-error plot (possibly with overlaid values). This is a beeswarm plot of the [non-shrunk] L2FC values produced by DESeq2:
Here's what it looks like when I do a normal boxplot of those results in R:
But that's not right, because the points don't represent individual data values; they're similar to a difference of means. I want something that looks similar (or maybe a "thick line with indicated error" plot), but where the error represents the standard error associated with the mean L2FC. How should I calculate this error?
I notice that DESeq2 reports "lfcSE", and am wondering if I can calculate a pooled variance based on that. I had a look at the Pooled variance formula on Wikipedia, and my rough scratchings (with a bit of cancelling) suggest that in a situation with k L2FC values testing equal-size groups (i.e. where the samples used for each L2FC are the same), it ends up being:
$ n/k * \sum{i=1}^{k} ((SE{xi})^2)$
In other words, add up the squares of the standard errors, then multiply by whatever n is used in the standard error calculation, which I assume to be the total number of samples in both conditions (e.g. with 5 replicates vs 4 replicates, divide by 3), then divide by the number of L2FC values. If the L2FC values are essentially a difference-of-means test, then I see that fishing an n out of the standard error doesn't make much sense... which ties me up in knots.
Edit: here's what I've ended up with for now. I'm using the mean of L2FC values for the central value, and this function for determining the error :
l2fcAll.se <- tapply(sub.se.tbl.genome$lfcSE, sub.se.tbl.genome$comparison,
function(x){sqrt(sum(x^2) / (length(x)))});
In other words, the square root of the mean square of standard error.
I'm not confident in that (I've changed the formula a few times over the last couple of hours), and I'd like to know if anyone has any other / better ideas on what to do about this.
Thanks heaps.
Your (1) is what I'm trying to do here. Getting the average L2FC is easy, I just don't know how to properly estimate the error.
I'll pair this up with a visualisation of the normalised expression values. Again, that's an easy bit - the values are single data points (rather than relative values), and I can confidently express those as boxplots with standard error / normal range calculated in the usual way.
As I said above, you need an average log-expression value for every sample, not an average logFC. You can't compute variability from an average log-fold-change, which is just one value across all samples.
If you have a value for every sample, then it just becomes a univariate anova or t-test. There are many examples in the literature: Figure 4AB of this paper https://doi.org/10.1038/nm.2000 gives some examples from my practice.
Yes, exactly. It has to be done like that. You an even compute an average logFC and associated standard error from the single data points. But you can't average the logFCs from DESeq2 and expect to attach a SE to that -- in my opinion that is not possible.