I have RNA-seq data (5 subjects are measured on 4 time points) and would like to do a SVA first to be able to include potential confounders into the statistical model (Deseq2 pipeline).
I am having troubles how to define my null and full model in the SVA:
Full model ~ TIME + SUBJECT.ID
Null model ~ SUBJECT.ID
OR
Null model ~ 1
Should the subjects ID be treated as a factor of interest or as a confounding factor?
You should probably use just an intercept for your null model. In general, if you have repeated measures (which I assume you do, given the subject ID), AND given that you have complete repeated measures (where you have measurements from each subject at each time), then the subject-specific changes are orthogonal to the measure of interest, and blocking on subject is the way to go. It also makes it easier to interpret your coefficients.
Put a different way, sva is intended to generate surrogate variables for unobserved variability. The subjects are by definition observed, so if you wanted to use the sva package to do something with them, you could consider them to be batch effects and use ComBat (note that I am not advocating this, but just noting that sva is for things you don't observe and ComBat is for things you know about.)
Yes, I have repeated measures. Subject.ID is not my factor of interest, but of course I want to include it in the model. For one subject, I have two missing time points though.
I do not want ComBat to correct for Subject.ID, but rather want SVAseq to find confounding factors (other than Time and Subject.ID). The design model I intend to use in deseq is: ~Subject + SV1 + ... + Time.
I will use the null model ~1 in SVAseq, as suggested.
I would say the answer is to include SUBJECT.ID in the null model because, as argued by Jeff Leek, author of SVA, in this thread about a similar design case, SUBJECT.ID will be used in the ultimate linear model you intend to fit to test for the effect of your variable of interest.
But subject.ID is not just a covariate, it is a random factor. So it is still not clear to me whether to include it in the null model or not. Not including it in the null model, makes it kind of a variable of interest itself.
If SUBJECT.ID is a random factor, then you should not put it into the design matrix and use duplicateCorrelation() and the arguments correlation and block in the call to lmFit(); see section on Multi-level experiments from the limma User's Guide. If you don't need surrogate variables, then you can just follow that documentation.
The complication comes when you want to combine it with surrogate variables estimated with SVA. You can try to have a full model with TIME only and the null with the intercept. Then, estimate surrogate variables, paste them into the design matrix and proceed with the duplicateCorrelation() blocking on SUBJECT.ID. However, it may happen that SVA has already estimated part of the SUBJECT.ID variablity and this may lead to problems with duplicateCorrelation(); see this thread about that possibility. So, I'd suggest to include SUBJECT.ID in the full and null models that you give to SVA (next to TIME), just to ensure that the SUBJECT.ID variability is not picked up by SVA. Then, place TIME and the surrogate variables in a new design matrix, i.e., without SUBJECT.ID, and proceed with duplicateCorrelation() blocking on SUBJECT.ID.
Thanks James.
Yes, I have repeated measures. Subject.ID is not my factor of interest, but of course I want to include it in the model. For one subject, I have two missing time points though.
I do not want ComBat to correct for Subject.ID, but rather want SVAseq to find confounding factors (other than Time and Subject.ID). The design model I intend to use in deseq is: ~Subject + SV1 + ... + Time.
I will use the null model ~1 in SVAseq, as suggested.