Dear Bioconductor community,
I have some gene expression microarray data on which I would like to apply a machine learning method and build a classifier for a binary outcome (disease status). From the literature and various papers I have found several packages and methods in R, but I would also like to add continuous variables alongside the genes when training my classifier. As I have no experience with this specific topic: is this approach generally appropriate for any classification model (e.g. random forests, SVMs), or is it restricted to specific methods or R packages that can handle it? I know of the caret R package, which implements various methods, but my main concern is the "validity" of this "multivariate" approach.
Any ideas or suggestions would be greatly appreciated!
Dear Steve,
Thank you for your answer! As I am a newbie in machine learning (although I have read and searched many tutorials and papers, such as http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/), my main aim here is to get some feedback from experienced users in the field: with "simple" methods such as random forests, can I use other continuous variables (such as clinical data) along with my gene expression microarray data when building the classifier on the training set? Or is the only "valid" solution for this purpose a generalized linear model (such as glmnet with the elastic net penalty)?
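To make the question concrete, here is a minimal sketch of what I have in mind, on simulated data. The object names (`expr`, `clin`, `status`), the covariates `age` and `bmi`, and all the values are made up for illustration, and I am assuming the randomForest package is installed. I am not claiming this is the recommended way, only showing the kind of combined predictor table I mean.

```r
library(randomForest)

set.seed(1)
n_samples <- 60
n_genes <- 200

# Simulated expression matrix: samples in rows, genes in columns
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# Simulated continuous clinical covariates (hypothetical names)
clin <- data.frame(age = rnorm(n_samples, mean = 60, sd = 10),
                   bmi = rnorm(n_samples, mean = 26, sd = 4))

# Binary outcome as a factor, so randomForest fits a classifier
status <- factor(rep(c("control", "disease"), each = n_samples / 2))

# Bind genes and clinical variables into a single predictor table
x <- cbind(as.data.frame(expr), clin)

fit <- randomForest(x = x, y = status, ntree = 500, importance = TRUE)
print(fit)

# Importance measures for the two clinical covariates
importance(fit)[c("age", "bmi"), ]
```

Since the outcome here is random, the out-of-bag error should be close to chance; the point is only that the genes and the clinical variables enter the model as columns of the same table.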
Regarding the second part of your answer, I know that scaling is essential when combining different groups of variables (as in my case), so that each variable has unit variance, and that it is implemented in various methods (in the caret package it is available through the preProcess argument of train). But my main concern is still the first part of your answer.
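For the scaling step, this is roughly how I understand it, as a small base R sketch on simulated data. The variable names and the train/test split are hypothetical; the idea is that the centering and scaling values are estimated on the training set only and then reused on the test set. As far as I understand, passing `preProcess = c("center", "scale")` to `train` in caret does the equivalent inside each resampling iteration.

```r
set.seed(1)

# Hypothetical predictor table: two "genes" and one clinical variable
# measured on a very different scale
x <- data.frame(gene1 = rnorm(40, mean = 8, sd = 1),
                gene2 = rnorm(40, mean = 10, sd = 2),
                age = rnorm(40, mean = 60, sd = 10))

# Simple train/test split
train_idx <- sample(nrow(x), 30)
x_train <- x[train_idx, ]
x_test <- x[-train_idx, ]

# Center and scale using the training set only
x_train_sc <- scale(x_train)
centers <- attr(x_train_sc, "scaled:center")
scales <- attr(x_train_sc, "scaled:scale")

# Apply the training-set parameters to the test set
x_test_sc <- scale(x_test, center = centers, scale = scales)

# Each training column now has mean 0 and standard deviation 1
round(colMeans(x_train_sc), 3)
round(apply(x_train_sc, 2, sd), 3)
```

The test set columns will not have exactly unit variance after this, which I believe is expected, since they are scaled with the training-set values.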
Best,
Efstathios