Question

fitFDist() shows poor fit if df1=1

0

Entering edit mode

Frederik Ziebell ▴ 30

@frederik-ziebell-14676

Last seen 9 months ago

Heidelberg, Germany

If df1=1 and df2 is larger than about 25, the estimate of fitFDist() results in df2 = Inf rather than giving a good estimate. Here's an example:

set.seed(1)
scale <- .3
df1 <- 1
df2 <- 50
x <- scale*rf(10^6,df1,df2)
fitFDist(x,df1)

# $scale
# [1] 0.3262758
# $df2
# [1] Inf

Conversely, if one sets df1 <- 2 or df2 <- 20, a good estimate is returned. Is this a bug? How can we get a good estimate when df1 = 1 and df2 > 25?

limma • 888 views

ADD COMMENT • link updated 4.3 years ago by Gordon Smyth 52k • written 4.3 years ago by Frederik Ziebell ▴ 30

score 1 · Answer 1 · 2020-09-16

First, let me say that df2=Inf is a legimate value for df2 that is deliberately set by fitFDist whenever the variances are more homogeneous than expected, i.e., when the variance of log(x) is smaller than expected. The infinite value does not arise from floating underflow or numerical instabilility.

For the scenario in your simulation, the function is over-estimating both df2 and the scale somewhat, but there is a reason for this. Before I explain the reason, let me explain why this over-estimation isn't usually a problem. The purpose of limma is obviously to analyse expression data rather than to solve theoretical mathematical problems. The fitFDist function is not called directly by users and the scenario of df1=1 and df2=50 is almost impossible for real data because it implies almost no replication in the experiment and virtually identical genewise variances. Even in the rare cases that data like this could occur, the over-estimation of df2 does not have a dramatic effect on the DE results. The infinite value is not used to compute p-values and there is compensating over-estimation of the scale parameter that prevents the p-values from becoming over-significant.

The over-estimation of df2 in the above context is caused by limma's robustness checking. The fitFDist function checks for very small variances that can sometimes arise from rounding of expression data. Extremely small variances are reset to a larger value, equal to 1e-5 times the median of the genewise variances. The extreme scenario that you've simulated generates very small variances with some being 1e-14 times the median. When fitFDist resets the smallest variances, it reduces the variance of the input values on the log-scale and this leads to over-estimation of df2.

I don't plan to change this behaviour because robustness is a higher priority for limma than accuracy of hyperparameters in theoretical scenarios than don't occur in practice. Your simulation generates genewise variances that are extremely variable, with the smallest being 1e-16 or less times the largest value. It isn't realistic for limma to trust variances that are so small as being an accurate reflection of biological reproducibility because expression data isn't measured and recorded to this precision in practice. Expression data would have to be recorded to more than 16 significant figures to make such an analysis meaningful.