Question

Generating surrogate variables across multiple variables of interest

Robert • 0

I am looking at a proteomic dataset and comparing it to multiple variables of interest. If I want to use sva to calculate surrogate variables, the obvious option is to calculate them for each variable of interest individually in independent models.

But is there a way to include multiple variables of interest in the model, such that sva would calculate latent effects that are present across all of the variables of interest?

sva • 1.0k views

ADD COMMENT • link 6 months ago Robert • 0

score 0 · Answer 1 · 2024-07-22

James W. MacDonald 68k

If you can fit a model that includes all the predictor variables you might want to use, yet remains identifiable, then sure. But if that is the case, why fit individual models instead of the one?

ADD COMMENT • link 6 months ago James W. MacDonald 68k

By variables of interest, I mean the response variables. Ultimately, we want to characterize the different response variables based on proteomic predictor(s), but I would like to adjust for latent variation that is shared among all response variables.

ADD REPLY • link 6 months ago Robert • 0

If I understand you correctly, you plan to use the proteomic data as predictors rather than outcomes. SVA is meant for the converse, where the proteomics values are the outcome and there might be batch effects or other unobserved variables that affect the proteomics measurements.
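The orientation described here can be written out as a rough sketch of the kind of model SVA works with. The symbols are illustrative assumptions, not notation from this thread: x_ij is the abundance of protein i in sample j, s_j holds the known variables for sample j (the columns of the model matrix), g_kj is the k-th unobserved factor in sample j, and e_ij is noise.

```latex
x_{ij} = \mathbf{b}_i^{\top} \mathbf{s}_j + \sum_{k=1}^{K} \gamma_{ik}\, g_{kj} + e_{ij}
```

Under this reading, the surrogate variables are estimates of the g_k, so they describe unobserved structure in the proteomic measurements rather than in the biomarkers.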

ADD REPLY • link 6 months ago James W. MacDonald 68k

Yes, we have several biomarkers and a proteomic set, so we are characterizing each biomarker by its proteomic association(s). The use for surrogate variables in this case would be to account for unobserved batch effects across samples. So we can generate svas for each biomarker independently (with the whole proteomic set as the expression set).

But the thought occurred that, with multiple biomarkers measured on the same samples, the true unobserved sampling batch effects might be those that appear across multiple independently measured biomarkers, i.e. for each i in 1, ..., n: biomarker[i] ~ covs + shared_svas + protein

To generate the svas, typically it would be

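library(sva)
# covs stands in for the known covariates; biomarker_i is one biomarker column of pheno_data
mod0 <- model.matrix(~ covs, data = pheno_data)
mod <- model.matrix(~ covs + biomarker_i, data = pheno_data)
svobj <- sva(prot_data, mod, mod0) # prot_data: proteins x samples matrix; svas for biomarker i only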

But what would putting mod <- model.matrix(~ covs + biomarker[1] + ... + biomarker[n], data = pheno_data) do? Would it generate svas that adjust for latent variables occurring between any individual biomarker or combination of biomarkers and the prots (like a union), without telling me which biomarker they vary across? Or would it generate svas that adjust for latent variables represented in the association of all biomarkers with the prots (like an intersection), i.e. only latent variation that shows up in all the biomarkers, not just one or a few?

I would prefer the latter, as you would just have a set of svas that could be shared for all biomarker ~ covs + shared_svas + prot_data comparisons instead of sva[i][1], sva[i][2], ... for each biomarker[i].
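The shared-svas setup described above could be sketched roughly as follows. This is a minimal illustration on simulated data, not a recommendation: the variable names (age, biomarker_1 to biomarker_3, hidden_batch) and all values are hypothetical, and the sketch does not by itself settle the union-versus-intersection question. It only shows the mechanics of putting every biomarker into the full model, estimating one set of surrogate variables from the proteomic matrix, and reusing them in a per-biomarker model. The sva step runs only if the sva package is installed.

```r
# Simulated stand-ins; every name and value here is hypothetical.
set.seed(1)
n_samples <- 40
n_proteins <- 200

pheno_data <- data.frame(
  age = rnorm(n_samples, mean = 50, sd = 10),
  biomarker_1 = rnorm(n_samples),
  biomarker_2 = rnorm(n_samples),
  biomarker_3 = rnorm(n_samples)
)

# Proteins in rows, samples in columns, plus one hidden batch effect.
hidden_batch <- rep(c(0, 1), each = n_samples / 2)
prot_data <- matrix(rnorm(n_proteins * n_samples), nrow = n_proteins)
prot_data <- prot_data + outer(rnorm(n_proteins), hidden_batch)

# Null model: known covariates only. Full model: covariates plus all biomarkers.
mod0 <- model.matrix(~ age, data = pheno_data)
mod <- model.matrix(~ age + biomarker_1 + biomarker_2 + biomarker_3,
                    data = pheno_data)

if (requireNamespace("sva", quietly = TRUE)) {
  # One set of surrogate variables estimated from the proteomic matrix.
  svobj <- sva::sva(prot_data, mod, mod0)
  if (svobj$n.sv > 0) {
    svs <- as.data.frame(svobj$sv)
    names(svs) <- paste0("SV", seq_len(ncol(svs)))
    # Reuse the same svas in a per-biomarker model, here for the first protein.
    fit_data <- cbind(pheno_data, svs, protein = prot_data[1, ])
    fit <- lm(reformulate(c("age", names(svs), "protein"),
                          response = "biomarker_1"),
              data = fit_data)
    print(coef(summary(fit))["protein", ])
  }
}
```

The same svs data frame could then be bound to the model data for biomarker_2 and biomarker_3, which is the "one shared set" arrangement asked about here.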

ADD REPLY • link 6 months ago Robert • 0