I want to associate gene expression of some people with their metabolite levels. In my case metabolite levels are measured as a function of ~400 features. My question is if I can use sva, if the variable of interest is matrix instead of a vector? In all the exaples I saw variable of interest was always a vector like cancer status, etc.. Thanks!
This is a great question! SVA isn't specifically designed for this scenario but here is a hack that you could try:
(1) Perform the SVD on the metabolite data, take the top 2-3 singular vectors (the ones that explain 80%+ of the variation)
(2) Treat these as the "primary variables" in the sva analysis of the expression data.
(3) Get the resulting surrogate variables and include them in the models relating metabolites to expression levels.
There is no theory to guarantee this would work (although probably a good methods paper in there if someone wants to write it), but its pretty clear it would capture most of the variation in the metabolite data and protect you from removing the signal while removing the batch effects.
The only tricky part would be if the metabolites and expression were processed in batches together. Then you would be protecting batch effects from the metabolites in the expression data.
The analysis you suggested makes sense. However I am not sure if I can apply it for the following reason:
As first step you are suggesting to do SVD on metabolite matrix and take first couple of SV that explain %80 of the variance. In my metabolite data it takes first 22 SV to reach the level where %80 of the variance is explained. I could use all these 22 SV as primary variables but I am afraid of over parameterising the model. The metabolite data I am using is standardised across samples and then the features if it makes any difference.. Below are the cumulative variance explained by first 30 SVs.
If I use 22 SVs as primary variables in my full model, number of unknown effect are estimated to be
412 for expression data
0 for log2(expression data)
2 for scaled(log2(expression data)) *scaling makes each gene to have 0 mean, unit variance
I don't know why the number of surrogate variable estimates are so different.. If I use svaseq with rpkm data I will get 412 surrogate variable estimates which is too high to use as covariables in downstream statistical analysis. If I use sva with scaled and log2 transformed rpkm data I will get 2 covariables to use which is much more feasible for downstream analysis On the other hand I would like to justify why I choose the latter but I can't.
So overall;
1)Do you think using 22 SV is too much?
2)Would you use sva with scaled and transformed rpkm data? if so why?
Thanks for the answer Jeff,
The analysis you suggested makes sense. However I am not sure if I can apply it for the following reason:
As first step you are suggesting to do SVD on metabolite matrix and take first couple of SV that explain %80 of the variance. In my metabolite data it takes first 22 SV to reach the level where %80 of the variance is explained. I could use all these 22 SV as primary variables but I am afraid of over parameterising the model. The metabolite data I am using is standardised across samples and then the features if it makes any difference.. Below are the cumulative variance explained by first 30 SVs.
[1] 33.11876 41.41801 49.04494 54.30329 58.29662 61.36211 64.06303 65.89266 67.58878 69.18557
[11] 70.56121 71.80918 72.98200 73.93578 74.81452 75.68446 76.50778 77.31598 78.08510 78.77824
[21] 79.45655 80.07261 80.63256 81.17656 81.70586 82.19804 82.67386 83.14807 83.57942 83.99930
If I use 22 SVs as primary variables in my full model, number of unknown effect are estimated to be
412 for expression data
0 for log2(expression data)
2 for scaled(log2(expression data)) *scaling makes each gene to have 0 mean, unit variance
I don't know why the number of surrogate variable estimates are so different.. If I use svaseq with rpkm data I will get 412 surrogate variable estimates which is too high to use as covariables in downstream statistical analysis. If I use sva with scaled and log2 transformed rpkm data I will get 2 covariables to use which is much more feasible for downstream analysis On the other hand I would like to justify why I choose the latter but I can't.
So overall;
1)Do you think using 22 SV is too much?
2)Would you use sva with scaled and transformed rpkm data? if so why?
Thanks a lot,
Reyhan