Question

Cross-validation with multiple control subgroups in limma

Ali Barry ▴ 40

@2f691b31

United Kingdom

I have a dataset with 10 condition and 20 control samples and am using limma to test for differential expression. Broadly, the groups are age/sex matched but carry added noise from complex medical histories, which are matched as closely as possible, although the matching is still far from perfect. I ran a basic limma analysis pipeline and everything worked as expected.

In the same batch, I processed a number of other samples for a separate study (some overlapping control samples, some different due to different population demographics for study #2), which will be published separately.

What I am left with is the ability to create secondary control groups, also age/sex matched, by resampling from a larger pool of possible controls. It doesn't make sense to include all the possible samples in the control group from the start, as that would skew my population demographics. Instead, given the variability in medical histories, I'd like to use resampling to test the robustness of the initial results.

Is there an existing way to approach this within limma, or would I be better off using a bootstrapping-specific hypothesis testing method?

limma multtest • 87 views

score 0 · Answer 1 · 2024-12-21

limma doesn't do cross-validation or resampling. Like most classical linear modeling procedures in statistics, limma uses a statistical model to represent the sampling variability that arises from the fact that the columns of data are considered to be sampled from a much larger population of possible samples. Like classical anova, limma does not use or need resampling to estimate the variability and, provided the distributional assumptions are correct, it cannot be improved by introducing resampling.

I don't follow the logic behind the procedure you describe. No doubt your data is much too complex to describe completely in this forum, but I never think it's a good idea to randomly choose data to analyse. I don't see why you would not analyse all the possible controls at once. limma does not need equal numbers in each demographic group -- it is perfectly able to adjust for different demographics if necessary.
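To illustrate what "adjusting for different demographics" means, here is a minimal sketch for a single gene. limma itself is an R package, so this is not limma code: it is plain Python fitting an ordinary least-squares model, without limma's empirical Bayes moderation. The sample table, the coefficients used to generate the expression values, and the helper names `solve` and `ols` are all made up for illustration. The data are noise-free on purpose so that the result is easy to check.

```python
# Illustrative only: hypothetical data, ordinary least squares for ONE gene.
# This is not limma; it only shows how covariates in the design matrix
# separate the group effect from age and sex when groups are unbalanced.

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical samples: (is_case, age, is_male). Cases are older on average
# than the full set of controls, so the groups are demographically unbalanced.
samples = [
    (1, 60, 1), (1, 65, 0), (1, 70, 1),
    (0, 30, 0), (0, 35, 1), (0, 40, 0), (0, 62, 1), (0, 66, 0), (0, 71, 1),
]

# Made-up expression values: a true case effect of 1.5 plus age and sex effects.
TRUE_EFFECT = 1.5
y = [5.0 + TRUE_EFFECT * g + 0.05 * age + 0.8 * male for g, age, male in samples]

# Design matrix columns: intercept, group, age, sex.
design = [[1.0, float(g), float(age), float(male)] for g, age, male in samples]
adjusted = ols(design, y)[1]  # group coefficient, adjusted for age and sex

# Naive comparison that ignores the covariates: difference of group means.
case_y = [yi for (g, _, _), yi in zip(samples, y) if g == 1]
ctrl_y = [yi for (g, _, _), yi in zip(samples, y) if g == 0]
naive = sum(case_y) / len(case_y) - sum(ctrl_y) / len(ctrl_y)

print(f"adjusted group effect: {adjusted:.3f}")
print(f"naive difference of means: {naive:.3f}")
```

In this toy setting the adjusted coefficient recovers the effect used to generate the data, while the unadjusted difference of means is inflated by the age imbalance. That is the sense in which all available controls can be analysed together even when their demographics differ from the cases.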

You mention "cross-validation" and "bootstrapping", but the resampling procedure that you describe does not appear to be either of those things. You seem to be simply subsampling from a collection of possible control samples.

If you are concerned about how sensitive the analysis is to the choice of control samples, you could simply repeat the limma analysis a few times with different sets of control samples and see how much the results change. There's no formal way to combine the results though. You would just compare the results yourself ad hoc. Or else analyse all the data at once, so that variability between choices is eliminated.
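The ad hoc comparison could be organised along these lines. This is only a sketch of the bookkeeping, written in plain Python on simulated data: the per-gene Welch t-statistic is a stand-in for the limma fit (which would be run in R), the control subsets are drawn at random rather than age/sex matched, and every name and number here (`top_genes`, `TOP_K`, the gene labels, the effect sizes) is hypothetical.

```python
import itertools
import random
import statistics
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is reproducible

N_GENES, N_CASES, N_POOL, N_CONTROLS, N_RUNS, TOP_K = 20, 10, 40, 20, 5, 3
genes = [f"gene{i}" for i in range(N_GENES)]

# Simulated data: two genes carry a large shift in cases, the rest are null.
shift = {g: (10.0 if g in ("gene0", "gene1") else 0.0) for g in genes}
cases = [{g: rng.gauss(shift[g], 1.0) for g in genes} for _ in range(N_CASES)]
pool = [{g: rng.gauss(0.0, 1.0) for g in genes} for _ in range(N_POOL)]

def welch_t(x, y):
    """Welch t-statistic; a simple stand-in, NOT limma's moderated t."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / (vx / len(x) + vy / len(y)) ** 0.5

def top_genes(controls):
    """Return the TOP_K genes ranked by absolute statistic for one control set."""
    score = {
        g: abs(welch_t([s[g] for s in cases], [s[g] for s in controls]))
        for g in genes
    }
    return set(sorted(genes, key=score.get, reverse=True)[:TOP_K])

# Repeat the analysis with several different control subsets.
runs = [top_genes(rng.sample(pool, N_CONTROLS)) for _ in range(N_RUNS)]

# How often each gene is selected, and how similar the runs are to each other.
counts = Counter(g for selected in runs for g in selected)
frequency = {g: counts[g] / N_RUNS for g in counts}
jaccard = [len(a & b) / len(a | b) for a, b in itertools.combinations(runs, 2)]

for g, f in sorted(frequency.items(), key=lambda kv: -kv[1]):
    print(f"{g}: selected in {f:.0%} of runs")
print(f"mean pairwise Jaccard overlap: {statistics.mean(jaccard):.2f}")
```

Selection frequencies and pairwise overlaps like these are descriptive summaries only. They do not combine the runs into a single test, which is consistent with the point above that there is no formal way to pool the results.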