Limma Singlet Moderated T-Test
2
0
Entering edit mode
@andrewjskelton73-7074
Last seen 10 months ago
United Kingdom

Hi, 

I only noticed this by chance, but it peaked my curiosity and couldn't find any documentation about it. I occasionally get asked about experiments where conditions are in singlet (no replication), and my typical mandate is to run the line of "no replication, no statistics", however in the case of making the most of data available, I was aware of DESeq2's single sample method in which all samples are treated as a single group for dispersion estimation. When running the same singlet design in Limma, I expected an error when testing the model fit, but I got nothing.

So my question is, what's Limma doing in the case of a singlet experiment A-B, as my assumption was that the moderated t-test couldn't be calculated without replication?

Some code just to make a reproducible example:

library(limma)
set.seed(1234)
pheno_table            <- data.frame(Rep      = c(paste0("A",1:8)), 
                                     Variable = c("A","B","C","C","C","D","D","D"))
mat_rows               <- 100
matrix_in              <- matrix(log2(rnorm(n = (mat_rows*nrow(pheno_table)), mean = 100, sd = 30)), nrow = mat_rows)
design                 <- model.matrix(~0 + Variable, data = pheno_table)
colnames(design)       <- gsub("Variable", "", colnames(design))
contrasts              <- c("A_Vs_B" = "B-A")
contrast_mat           <- makeContrasts(contrasts = contrasts, levels = colnames(design))
colnames(contrast_mat) <- names(contrasts)
fit                    <- eBayes(contrasts.fit(lmFit(matrix_in, design),contrast_mat))
topTable(fit, coef = 1, number = 5, p.value = 0.1)

#      logFC  AveExpr        t      P.Value  adj.P.Val         B
# 92 -2.909594 6.21836 -4.935319 0.0009739746 0.09739746 -0.3183399
limma • 1.4k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 14 hours ago
The city by the bay

In your case, you're still estimating the sample variance from groups C and D. limma will use all samples to estimate the variance, even if the comparisons only involve a subset of samples. There are non-zero residual degrees of freedom in this model, so there is no need to resort to non-standard methods for variance estimation.

In particular, the "single-sample" approach used by DESeq2 is not without problems. Let's consider the application of this strategy in the context of a limma analysis. Sure, you could get an estimate of a common variance if you assume that most genes are not DE between conditions. (The same arguments apply to estimation of a mean-variance trend; I will stick with a common estimate for simplicity.) However, using this estimate in a moderated t-test would be equivalent to having infinite prior degrees of freedom, i.e., the true variance for each gene is the same. I have never seen an RNA-seq data set where this is the case - the prior d.f. is generally low, around 5-20, reflecting the differences in the variability between genes.

Overstating the prior d.f. is problematic as it suggests strong homogeneity in the true variances where there is none. This usually results in anticonservativeness, where you get more false positives than expected due to unmodelled variability in the variance estimates. The same concern applies to count-based models as well, which is why the edgeR user's guide is reluctant to suggest this strategy as option 3 in Section 2.11. But for your experiment, none of this matters: limma can directly estimate the variance as part of a standard analysis.

ADD COMMENT
0
Entering edit mode

Thanks for suggesting the edgeR manual, it's nice to have a place where it's written down that I can point people to. I think it'd be useful for Limma to possibly output a warning where singlet comparisons are made stipulating a summary of what you've said. Thanks though for the comprehensive response! 

ADD REPLY
0
Entering edit mode

Singlet comparisons (n=1 in some groups but n>1 in others) are quite routine and I don't see any reason for a warning from limma. The complications that Aaron discussed above relate to DESeq2's single sample method, which is quite a different thing, and he was explaining why it isn't used in limma. By contrast, limma's method for your experiment is standard linear modelling and does not require any such compromises.

DESeq2's single sample method and edgeR's no replicate methods are really just red herrings here. Your experiment does have replication and none of the packages (limma, edgeR or DESeq2) would need to resort to a "no replicate" method for your design.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

But you do have replication! Your experiment has 3 replicates in group C and 3 replicates in group D. Your design thereby provides 4 residual degrees of freedom from which the residual variance can be estimated and limma, like any good linear modelling software, makes use of that.

This is no different from the lm() function in the stats package. Here is a little simulated example with lm() applied to your design:

> y <- rnorm(8)
> Variable = factor(c("A","B","C","C","C","D","D","D"))
> fit <- lm(y ~ Variable)
> anova(fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
Variable   3 1.9051 0.63503  0.5263 0.6875
Residuals  4 4.8265 1.20662               

Notice that Df=4 for the residuals and the residual variance is estimated as 1.20662. Now let's use that to compute t-tests:

> summary(fit)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7275     1.0985   0.662    0.544
VariableB    -0.9125     1.5535  -0.587    0.588
VariableC    -1.4630     1.2684  -1.153    0.313
VariableD    -1.4415     1.2684  -1.136    0.319

Residual standard error: 1.098 on 4 degrees of freedom
Multiple R-squared:  0.283,     Adjusted R-squared:  -0.2547 
F-statistic: 0.5263 on 3 and 4 DF,  p-value: 0.6875

This is all quite standard. No need for any warning. limma does the same base computations as lm() but adds in Bayesian moderation.

On the other hand, if you really did have no replication, then limma would tell you:

> library(limma)
> matrix_in <- matrix(rnorm(400),100,4)
> Variable <- factor(c("A","B","C","D"))
> design <- model.matrix(~Variable)
> fit <- lmFit(matrix_in, design)
> fit <- eBayes(fit)
Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No residual degrees of freedom in linear model fits

 

Note that the same general principles apply to the edgeR and DESeq2 packages as well, and neither of them would have any difficulties with your experimental design either. Section 2.11 of the edgeR User's Guide refers to designs that have no replication at all, not to an experiment such as yours. The same goes for DESeq2's single sample method -- it is only for experiments with no replicates at all.

ADD COMMENT

Login before adding your answer.

Traffic: 829 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6