Hi,
I'm trying to get understand the full vs null model specifications in the sva package.
My understanding is the following:
full model = Variable-of-interest + all confounders
null model = all confounders
The resulting SVs should then capture noise that is not captured by my confounders. Is this correct?
What I observe though is that when I run sva, the resulting SVs relate stronger to the confounders, when these were included in the model, vs a scenario, where I didn't include the confounders in the models.
Below a reproducible example using the bladderbatch data. "batch" is in this case a measured confounder (not my variable-of-interest).
library(sva) library(bladderbatch) data(bladderdata) library(pamr) library(limma) pheno = pData(bladderEset) edata = exprs(bladderEset) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") n.sv [1] 2 svobj = sva(edata,mod,mod0,n.sv=n.sv) Number of significant surrogate variables is: 2 Iteration (out of 5 ):1 2 3 4 5
Now, let's see how the derived SVs relate to the batch variable (a covariate not included in the model).
summary(aov(svobj$sv ~ as.factor(pheno$batch))) Response 1 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch) 4 0.34913 0.087284 6.9734 0.0001426 *** Residuals 52 0.65087 0.012517 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Response 2 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch) 4 0.29016 0.072541 5.3141 0.001153 ** Residuals 52 0.70984 0.013651 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Ok, let's include "batch" as a covariate and see whether the derived SVs are independent of that covariate.
mod = model.matrix(~as.factor(cancer) + as.factor(batch), data=pheno) mod0 = model.matrix(~as.factor(batch),data=pheno) svobj = sva(edata,mod,mod0,n.sv=n.sv) Number of significant surrogate variables is: 2 Iteration (out of 5 ):1 2 3 4 5 summary(aov(svobj$sv ~ as.factor(pheno$batch))) Response 1 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch) 4 0.36353 0.090882 7.425 8.269e-05 *** Residuals 52 0.63647 0.012240 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Response 2 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch) 4 0.42935 0.107339 9.7812 5.631e-06 *** Residuals 52 0.57065 0.010974 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I don't get why the association between the SVs and batch is stronger in the second example, in which batch was included in the models. I thought batch should have been "protected". (note: in my own data, the covariate is age; I just used batch here for reproducibility)
Thanks, Esther
I have the same problem. Did you manage to find an answer Esther?