First, let me say that df2=Inf is a legimate value for df2 that is deliberately set by fitFDist whenever the variances are more homogeneous than expected, i.e., when the variance of log(x) is smaller than expected. The infinite value does not arise from floating underflow or numerical instabilility.
For the scenario in your simulation, the function is over-estimating both df2 and the scale somewhat, but there is a reason for this. Before I explain the reason, let me explain why this over-estimation isn't usually a problem. The purpose of limma is obviously to analyse expression data rather than to solve theoretical mathematical problems. The fitFDist function is not called directly by users and the scenario of df1=1 and df2=50 is almost impossible for real data because it implies almost no replication in the experiment and virtually identical genewise variances. Even in the rare cases that data like this could occur, the over-estimation of df2 does not have a dramatic effect on the DE results. The infinite value is not used to compute p-values and there is compensating over-estimation of the scale parameter that prevents the p-values from becoming over-significant.
The over-estimation of df2 in the above context is caused by limma's robustness checking. The fitFDist function checks for very small variances that can sometimes arise from rounding of expression data. Extremely small variances are reset to a larger value, equal to 1e-5 times the median of the genewise variances. The extreme scenario that you've simulated generates very small variances with some being 1e-14 times the median. When fitFDist resets the smallest variances, it reduces the variance of the input values on the log-scale and this leads to over-estimation of df2.
I don't plan to change this behaviour because robustness is a higher priority for limma than accuracy of hyperparameters in theoretical scenarios than don't occur in practice. Your simulation generates genewise variances that are extremely variable, with the smallest being 1e-16 or less times the largest value. It isn't realistic for limma to trust variances that are so small as being an accurate reflection of biological reproducibility because expression data isn't measured and recorded to this precision in practice. Expression data would have to be recorded to more than 16 significant figures to make such an analysis meaningful.
Thank you for the clarification. We have an analysis with limited amount of data (just 1 residual degree of freedom) and figured out that the prior degrees of freedom were estimated as Inf, similar to the above scenario. I wouldn't expect that a real-valued df2 estimate would influence downstream results because df2 clearly dominates df1, but just wanted to check whether some numerical instability was causing this result.
I agree that with this limited amount of data, variance estimates are very unstable and so it is sensible that variances are set to the prior estimate.
No, it's not numerical instability. Inf is a legitimate value for df2 that is deliberatory set by fitFDist whenever the variances are more homogeneous than expected. If I modify fitFDist to omit the variance thresholding then it gives for your simulation:
Thank you for the clarification. We have an analysis with limited amount of data (just 1 residual degree of freedom) and figured out that the prior degrees of freedom were estimated as
Inf
, similar to the above scenario. I wouldn't expect that a real-valueddf2
estimate would influence downstream results becausedf2
clearly dominatesdf1
, but just wanted to check whether some numerical instability was causing this result.I agree that with this limited amount of data, variance estimates are very unstable and so it is sensible that variances are set to the prior estimate.
No, it's not numerical instability. Inf is a legitimate value for df2 that is deliberatory set by
fitFDist
whenever the variances are more homogeneous than expected. If I modifyfitFDist
to omit the variance thresholding then it gives for your simulation: