Question

Bioconductor SVA: two variables of interest

0

Entering edit mode

sara.blocquiaux • 0

@sarablocquiaux-21717

Last seen 4.9 years ago

Dear Bioconductor list & Jeff Leek,

I am using the sva function in the SVA package to estimate surrogate variables in my methylation data (EPIC beadchip, Illumina). The main goal is to define the effect of time & group (and their interaction) on methylation. I was wondering how to incorporate two variables of interest and their interaction?

I tried the following code:

form = ~as.factor(time) + as.factor(group) + (as.factor(time)*as.factor(group)) mod = model.matrix(form, data=pd) mod0 = model.matrix(~1, data=pd) svobj = sva(dat=betafilt, mod=mod, mod0=mod0)

I get the following error: Error in solve.default(t(mod) %*% mod): Lapack routine dgesv: system is exactly singular: U[7,7] = 0

Not including the interaction term, does give me 8 surrogate variables, using the 'be-method'. Using 'Leeks' version, 0 surrogate variables are returned.

This raises the following questions:

- Is it necessary to include the interaction term?
- Should I remove all terms in the null-model, as they are all variables of interest and not just adjustment variables? - Which method is preferred: 'be' or 'leek'?

Many thanks in advance!

Sara

sva • 1.4k views

ADD COMMENT • link updated 5.2 years ago by James W. MacDonald 67k • written 5.2 years ago by sara.blocquiaux • 0

score 1 · Answer 1 · 2019-09-10

If your model matrix (note that model.matrix(~A + B + A*B) is the exact same thing as model.matrix(~A*B), and for that matter model.matrix(~ A + B + A:B)) is singular, that means that you have too many coefficients for the number of samples, and you cannot estimate them all. Without more information about what you are trying to do it's impossible to say why, but your design isn't full rank, and you won't be able to use that to analyze your data anyway.

That said, you have other problems that need attention. You shouldn't be using the beta values in this instance, as they are strictly bounded on (0,1), and are not expected to have anything remotely resembling a normal distribution. I tend to use the M-values for all statistics, and the beta values for all plotting. There are packages that will model the betas directly, but converting to M-values and using normal based methods makes the most sense to me.

If you are using time as a factor rather than as a linear covariate, then the interaction term isn't really interpretable as such. Well, it's interpretable, but not by itself. Usually what one would do is find all the significant interaction terms and then use plots to figure out what is going on.

However, most methylation analyses use genomically adjacent CpGs to infer regional changes in methylation, because the probes tend to be noisy, and getting consistent signal from a bunch of adjacent CpGs is arguably a stronger signal than a single CpG, and because biologically you expect there to be a somewhat smooth increase/decrease in methylation over a genomic region rather than a single CpG being affected in isolation. It's hard (I would argue non-sensical) to try to do this with an interaction term, because the log fold change of an interaction can mean any number of things, whereas the log fold change between two groups has only one meaning (same for a slope, if time were continuous).

In other words, most methylation analyses look for groups of adjacent CpGs that are all doing the similar thing (e.g., all up-regulated in a group). But say you have a set of CpGs that all have a positive log fold change for the interaction term. This doesn't mean that one group is up-regulated vs the other, because it's a comparison of comparisons (e.g., (grp1time2 - grp1time1) - (grp2time2 - grp2time1)), and a positive log fold change just means the quantity in the first parenthesis is larger than the second. That can come about in any number of ways, so a group of CpGs with positive log fold changes doesn't necessarily mean something consistent is happening in that genomic region.

If you suspect an interaction, you are probably better off doing a stratified approach for each group where the log fold change is interpretable rather than fitting an interaction term with all samples. Regardless, for SVA you always want to keep the coefficients of interest in mod, because if you don't you might regress them out as surrogate vectors!