I have a dataset in which I want to examine differential expression of genes in a single body fluid and its relationship with a histochemical outcome that is a repeated measure. I would prefer not to collapse this repeated measure into one variable for input into limma, because missing data would make such a composite variable biased.
Therefore, I was wondering whether there is a statistical issue with modeling the genes as independent variables rather than as the dependent variable, so that my repeated measure could serve as the outcome. That way I could run a mixed model using dream in the variancePartition package, or use duplicateCorrelation in limma?
The repeated data are a quantitative metric of post-mortem pathology across multiple sections. Not all patients have the same number of sections assessed, due to availability, etc. The genes/proteins are measured once in the serum. The goal is to identify serum biomarkers of this pathological hallmark. Given that missing values are present in the pathological data, I was concerned about generating a composite score to use as an independent variable.
Therefore, my question is how to address this: could the genes/proteins be tested one by one, each as the independent variable in a mixed model, with the p-values adjusted by FDR? Any insight into why this would not make statistical sense would be helpful for me to understand. Other suggestions/options are much appreciated. Apologies for the naivety.
Such an analysis cannot be done in limma. Sorry, I cannot tell you how to do it or even whether it is possible.
Also, the reason why gene abundance is the dependent variable and is iterated over is that there are far more genes than there are samples in the majority of data sets. So, it is not possible to fit a linear model with all genes as covariates.
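To make the point about "more genes than samples" concrete, here is a minimal sketch in plain Python. The toy matrix is hypothetical, and `matrix_rank` is a helper written only for this illustration. It shows that a design matrix with 3 samples and 5 gene covariates has rank at most 3, so 5 coefficients cannot be estimated uniquely by ordinary least squares.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix, computed by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a non-zero pivot at or below the current pivot row.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Hypothetical toy data: 3 samples (rows) and 5 genes (columns).
X = [
    [2, 1, 0, 3, 1],
    [1, 4, 2, 0, 5],
    [0, 2, 7, 1, 3],
]
n_samples, n_genes = len(X), len(X[0])
rank = matrix_rank(X)

# The rank cannot exceed the number of samples, so it is below the
# number of gene coefficients we would need to estimate.
print(f"samples={n_samples}, genes={n_genes}, rank={rank}")
```

Because the rank is bounded by the number of samples, adding more genes as covariates never adds information that could identify the extra coefficients. This is why genes are usually fitted one at a time, or with a penalty.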
Thanks. The proposal was not to include all proteins as covariates in one model but to iterate with one protein serving as the independent variable in each model and then correcting the p-values from all models.
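As a rough illustration of the "correct the p-values from all models" step, the sketch below applies the Benjamini-Hochberg FDR adjustment to one p-value per protein. It is plain Python with no dependencies. The protein names and p-values are made up, and the mixed-model fits that would produce them are assumed to have happened elsewhere.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per protein-specific mixed model.
pvals = {
    "protein_A": 0.001,
    "protein_B": 0.02,
    "protein_C": 0.03,
    "protein_D": 0.5,
}
names = list(pvals)
fdr = dict(zip(names, benjamini_hochberg([pvals[n] for n in names])))

for name in names:
    print(f"{name}: p={pvals[name]:.3f}, BH-adjusted={fdr[name]:.3f}")
```

The adjustment only controls the false discovery rate across the set of per-protein tests. It says nothing about whether each individual mixed model is well specified.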
An alternative to that is to use the glmnet package to fit a regularized regression using all proteins at once.