I am looking at a proteomic dataset and comparing it to multiple variables of interest. If I want to use sva to calculate surrogate variables, the obvious option is to calculate them for each variable of interest individually in independent models.
But is there a way to include multiple variables of interest in the model, so that sva calculates latent effects that are present across all of the variables of interest?
By variables of interest I mean the response variables. Ultimately, we want to characterize the different response variables based on proteomic predictor(s), but I would like to adjust for latent variation that is shared amongst all response variables.
If I understand you correctly, you plan to use the proteomic data as predictors rather than outcomes. SVA is meant for the converse, where the proteomics values are the outcome and there might be batch effects or other unobserved variables that affect the proteomics measurements.
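For reference, here is a minimal sketch of the conventional setup described above, where the proteomic matrix is the outcome passed to sva. The data are simulated and the names pheno_data, edata, age and biomarker are hypothetical placeholders, not anything from your analysis; it assumes the Bioconductor sva package is installed.

```r
library(sva)

set.seed(1)
n_samples  <- 40
n_proteins <- 200

# Simulated phenotype table: one variable of interest plus one known covariate
# (all names here are made up for illustration)
pheno_data <- data.frame(
  age       = rnorm(n_samples, mean = 50, sd = 10),
  biomarker = rnorm(n_samples)
)

# Simulated proteins x samples matrix; a hidden batch shifts the first 50 proteins
batch <- rep(c(0, 1), length.out = n_samples)
edata <- matrix(rnorm(n_proteins * n_samples), nrow = n_proteins)
edata[1:50, ] <- sweep(edata[1:50, ], 2, 2 * batch, "+")

# Full model: known covariates plus the variable of interest
mod  <- model.matrix(~ age + biomarker, data = pheno_data)
# Null model: known covariates only
mod0 <- model.matrix(~ age, data = pheno_data)

# Surrogate variables are estimated from the proteomic matrix (the outcome)
svobj <- sva(edata, mod, mod0)
svobj$n.sv
```

The point of the sketch is only that edata sits on the outcome side: the surrogate variables come from structure in the protein measurements, with the terms in mod treated as signal.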
Yes, we have several biomarkers and a proteomic set, so we are characterizing each biomarker by its proteomic association(s). The use for surrogate variables in this case would be to account for unobserved batch effects across samples. So we can generate svas for each biomarker independently (with the whole proteomic set as the expression set).
But the thought occurred that, with multiple biomarkers tested on the same samples, the true unobserved sampling batch effects are presumably those that appear across multiple independently measured biomarkers, i.e.
for i in [biomarker[1], ..., biomarker[n]] --> biomarker[i] ~ covs + shared_svas + protein
To generate the svas, typically it would be
but putting
mod <- model.matrix(~ covs + biomarker[1] + ... + biomarker[n], data = pheno_data)
would... generate svas that adjust for latent variables that occur between any individual biomarker or combination of biomarkers and the prots (like a union), without telling me which biomarker it varies across? Or generate svas that adjust for latent variables represented in the association of all biomarkers and the prots (like an intersection), i.e. latent variation only if it shows up in all the biomarkers, not just one or a few?
I would prefer the latter, as you would just have one set of svas that could be shared for all biomarker ~ covs + shared_svas + prot_data comparisons, instead of sva[i][1], sva[i][2], ... for each biomarker[i].
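To make the proposal concrete, here is a rough sketch of the mechanics only, on simulated data: one sva call with every biomarker in mod, then the same surrogate variables reused in each per-biomarker regression. All names (pheno_data, edata, age, biomarker1, biomarker2, prot1, fit_one) are hypothetical, and the code does not by itself settle the union versus intersection question; it only shows what such a call would look like, assuming the sva package is installed.

```r
library(sva)

set.seed(2)
n_samples  <- 60
n_proteins <- 300

# Hypothetical phenotype table: two biomarkers plus one known covariate
pheno_data <- data.frame(
  age        = rnorm(n_samples, mean = 50, sd = 10),
  biomarker1 = rnorm(n_samples),
  biomarker2 = rnorm(n_samples)
)

# Simulated proteins x samples matrix with a hidden batch effect
batch <- rep(c(0, 1), length.out = n_samples)
edata <- matrix(rnorm(n_proteins * n_samples), nrow = n_proteins)
edata[1:100, ] <- sweep(edata[1:100, ], 2, 2 * batch, "+")
rownames(edata) <- paste0("prot", seq_len(n_proteins))

# Full model lists every biomarker; null model keeps only the known covariates
mod  <- model.matrix(~ age + biomarker1 + biomarker2, data = pheno_data)
mod0 <- model.matrix(~ age, data = pheno_data)

svobj <- sva(edata, mod, mod0)

# One shared set of surrogate variables (NULL if sva finds none)
shared_svs <- if (svobj$n.sv > 0) {
  sv <- as.matrix(svobj$sv)
  colnames(sv) <- paste0("sv", seq_len(ncol(sv)))
  sv
} else {
  NULL
}

# biomarker ~ covs + shared_svas + protein, for one biomarker and one protein
fit_one <- function(biomarker, protein) {
  design <- data.frame(
    y       = pheno_data[[biomarker]],
    age     = pheno_data$age,
    protein = edata[protein, ]
  )
  if (!is.null(shared_svs)) design <- cbind(design, shared_svs)
  coef(summary(lm(y ~ ., data = design)))["protein", ]
}

biomarkers <- c("biomarker1", "biomarker2")
results <- lapply(biomarkers, fit_one, protein = "prot1")
names(results) <- biomarkers
results
```

Note that in this setup the surrogate variables are still estimated from the proteomic matrix, with everything in mod treated as signal, so it is worth checking whether that matches what you mean by latent variation shared across biomarkers.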