Question

surrogate variables from sva independent of model variables?

1

Entering edit mode

pixie ▴ 10

@pixie-13383

Last seen 7.4 years ago

Hi,

I'm trying to get understand the full vs null model specifications in the sva package.
My understanding is the following:

full model = Variable-of-interest + all confounders
null model = all confounders

The resulting SVs should then capture noise that is not captured by my confounders. Is this correct?
What I observe though is that when I run sva, the resulting SVs relate stronger to the confounders, when these were included in the model, vs a scenario, where I didn't include the confounders in the models.

Below a reproducible example using the bladderbatch data. "batch" is in this case a measured confounder (not my variable-of-interest).

library(sva)
library(bladderbatch)
data(bladderdata)
library(pamr)
library(limma)
pheno = pData(bladderEset)
edata = exprs(bladderEset)

mod = model.matrix(~as.factor(cancer), data=pheno)
mod0 = model.matrix(~1,data=pheno)
n.sv = num.sv(edata,mod,method="leek")
n.sv
[1] 2

svobj = sva(edata,mod,mod0,n.sv=n.sv)

Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5

Now, let's see how the derived SVs relate to the batch variable (a covariate not included in the model).

summary(aov(svobj$sv ~ as.factor(pheno$batch)))

 Response 1 :
                       Df  Sum Sq  Mean Sq F value    Pr(>F)    
as.factor(pheno$batch)  4 0.34913 0.087284  6.9734 0.0001426 ***
Residuals              52 0.65087 0.012517                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Response 2 :
                       Df  Sum Sq  Mean Sq F value   Pr(>F)   
as.factor(pheno$batch)  4 0.29016 0.072541  5.3141 0.001153 **
Residuals              52 0.70984 0.013651                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Ok, let's include "batch" as a covariate and see whether the derived SVs are independent of that covariate.

mod = model.matrix(~as.factor(cancer) + as.factor(batch), data=pheno)
mod0 = model.matrix(~as.factor(batch),data=pheno)
svobj = sva(edata,mod,mod0,n.sv=n.sv)

Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5  

summary(aov(svobj$sv ~ as.factor(pheno$batch)))

 Response 1 :
                       Df  Sum Sq  Mean Sq F value    Pr(>F)    
as.factor(pheno$batch)  4 0.36353 0.090882   7.425 8.269e-05 ***
Residuals              52 0.63647 0.012240                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Response 2 :
                       Df  Sum Sq  Mean Sq F value    Pr(>F)    
as.factor(pheno$batch)  4 0.42935 0.107339  9.7812 5.631e-06 ***
Residuals              52 0.57065 0.010974                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I don't get why the association between the SVs and batch is stronger in the second example, in which batch was included in the models. I thought batch should have been "protected". (note: in my own data, the covariate is age; I just used batch here for reproducibility)

Thanks, Esther

sva • 1.5k views

ADD COMMENT • link 7.4 years ago pixie ▴ 10

0

Entering edit mode

I have the same problem. Did you manage to find an answer Esther?

ADD REPLY • link 5.1 years ago sara.blocquiaux • 0