Hello everyone,
I have 17,000 variables (SNV frequencies, many of them zero) for 40 patients. Each patient is labeled by his or her response to a treatment: 13 responses, 27 no-responses. I want to extract a subset of SNVs with strong predictive power.
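For reference, here is a simulated stand-in with the same shape as my data, so the code below is reproducible (the values and sparsity pattern are placeholders, not my real data):

set.seed(1)
n <- 40; p <- 17000
## Sparse non-negative "frequencies": many exact zeros, as in my SNV matrix
x <- matrix(rbinom(n * p, 1, 0.1) * runif(n * p), nrow = n, ncol = p)
y <- factor(c(rep("response", 13), rep("noresponse", 27)))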
Because the set of variables is so large, there are strong correlations among them, which is why I am considering the adaptive lasso. I used the glmnet R package with ridge-estimated initial coefficients and the following R code:
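To be explicit about what I mean by adaptive lasso with ridge initial estimates (the standard formulation; my code uses $\gamma = 1$):

$$\hat\beta = \arg\min_{\beta}\left\{-\ell(\beta) + \lambda \sum_{j=1}^{p} w_j\,|\beta_j|\right\}, \qquad w_j = \frac{1}{|\hat\beta_j^{\,\mathrm{ridge}}|^{\gamma}},$$

where $\ell(\beta)$ is the binomial log-likelihood; in glmnet the $w_j$ enter through penalty.factor.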
library(cvTools)
library(glmnet)
library(doParallel)            ## needed because cv.glmnet is called with parallel = TRUE
registerDoParallel(cores = 2)  ## register a foreach backend, otherwise glmnet runs sequentially

err.test.response <- c()
err.test.noresponse <- c()
nbiters <- 50

for (i in 1:nbiters) {
  ## Outer cross-validation: k random folds
  kflds <- 8
  flds <- cvFolds(length(y), K = kflds)
  pred.test <- c()   ## predicted classes
  class.test <- c()  ## true classes
  for (j in 1:kflds) {
    ## Test indices for fold j: cvFolds stores the random permutation in
    ## $subsets, so observations must be looked up through it, not via $which alone
    test.idx <- flds$subsets[flds$which == j, 1]
    ## Train
    x.train <- x[-test.idx, ]
    y.train <- y[-test.idx]
    ## Test
    x.test <- x[test.idx, ]
    y.test <- y[test.idx]
    ## Adaptive weights vector from an initial ridge fit (gamma = 1)
    cv.ridge <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 0,
                          standardize = FALSE, parallel = TRUE, nfolds = 7)
    w3 <- 1 / abs(matrix(coef(cv.ridge, s = cv.ridge$lambda.min)[, 1][2:(ncol(x) + 1)]))^1
    w3[w3[, 1] == Inf] <- 999999999  ## cap infinite weights where the ridge coefficient is zero
    ## Adaptive lasso: ridge-based weights enter through penalty.factor
    cv.lasso <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 1,
                          standardize = FALSE, parallel = TRUE,
                          type.measure = "class", penalty.factor = w3, nfolds = 7)
    ## Prediction on the held-out fold
    pred.test <- c(pred.test, predict(cv.lasso, x.test, s = "lambda.1se", type = "class"))
    class.test <- c(class.test, as.character(y.test))
  }
  ## Per-class prediction error for this repetition
  err.test.noresponse <- c(err.test.noresponse,
                           1 - sum(pred.test == "noresponse" & class.test == "noresponse") /
                               sum(class.test == "noresponse"))  # no-response error vector
  err.test.response <- c(err.test.response,
                         1 - sum(pred.test == "response" & class.test == "response") /
                             sum(class.test == "response"))      # response error vector
}
mean(err.test.noresponse)  ## mean no-response prediction error
mean(err.test.response)    ## mean response prediction error
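One detail I am unsure about: cvFolds draws folds without stratification, so with only 13 responders some outer folds can contain very few (or even zero) of them. A minimal sketch of what I mean by stratified folds (stratified.folds is a hypothetical helper of mine, not a cvTools function):

## Assign each class to folds separately so every fold keeps roughly
## the overall 13/27 response/no-response proportion
stratified.folds <- function(y, k) {
  which.fold <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    which.fold[idx] <- sample(rep_len(1:k, length(idx)))
  }
  which.fold
}
fold.id <- stratified.folds(y, k = 8)
## then split with x[fold.id != j, ] / x[fold.id == j, ] instead of the cvFolds bookkeeping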
Is an external cross-validation like this a sound way to evaluate the predictive power of the adaptive lasso on my data?
My results are not conclusive at all: I get mean(err.test.noresponse) = 0.15 and mean(err.test.response) = 0.88, so the model fails to identify the responders. Do you have an idea why my results are so bad, and how I could improve them?
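One direction I have been considering (my own heuristic, not something I have validated) is rebalancing the classes with observation weights in the inner fits, since glmnet accepts a weights argument:

## Inverse-frequency weights: upweight the minority "response" class so both
## classes contribute equally to the binomial loss (assumption on my part)
obs.w <- ifelse(y.train == "response",
                sum(y.train == "noresponse") / sum(y.train == "response"),
                1)
cv.lasso <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 1,
                      standardize = FALSE, type.measure = "class",
                      penalty.factor = w3, weights = obs.w, nfolds = 7)

Would that be a reasonable thing to try here?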
Thanks for your help and your ideas,
Corentin
Nobody has an idea?