Hi, when I look through articles for methods to identify predictors of overall survival, the most common approach is a glmnet Cox model with a Lasso penalty and cross-validation. For the cross-validation, people use either 10-fold cross-validation repeated 1000 times or LOOCV.
I read about the glmnet method and searched forums for code, and I found two different approaches for the 10-fold cross-validation: the first runs cv.glmnet repeatedly and then fits a final glmnet model at the resulting lambda.min; the second (from an article) records which covariates are active in each of 1000 cv.glmnet runs and takes the most frequent set as the final model. I also understood that setting the nfolds argument of cv.glmnet to the number of rows (samples) of the response matrix y performs LOOCV, and that LOOCV needs no repetitions because there is no random fold assignment. If I have misunderstood any of this, please correct me.
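For example, my understanding is that the run-to-run variation comes only from the random fold assignment inside cv.glmnet, so fixing the folds should make a single call reproducible. A minimal sketch of what I mean (my_folds is just my own name; I am using the list form of CoxExample from recent glmnet versions):
library(glmnet)
data(CoxExample)                        # recent glmnet versions return a list
x <- CoxExample$x[1:500, ]              # (older versions load x and y directly, as in my code below)
y <- CoxExample$y[1:500, ]
set.seed(1)
my_folds <- sample(rep(1:10, length.out = nrow(y)))   # one fixed 10-fold assignment
cv1 <- cv.glmnet(x, y, family = "cox", alpha = 1, foldid = my_folds)
cv2 <- cv.glmnet(x, y, family = "cox", alpha = 1, foldid = my_folds)
identical(cv1$lambda.min, cv2$lambda.min)             # TRUE: same folds, same lambda.min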
I tried the code below and got similar but not identical results, so I wonder which approach is more trustworthy in this case. Note: Approach 3 (LOOCV) actually ran faster than the others, even though the usual disadvantage of LOOCV is that it is time-consuming. Is that because of the 1000 repetitions in the other approaches?
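To check the timing point, I compared one 10-fold run against one LOOCV run with system.time (a rough sketch, continuing with the x and y from above; numbers will of course vary by machine). My reasoning is that Approaches 1 and 2 repeat the 10-fold call 1000 times, so their total cost is roughly 1000 times the first line below, while Approach 3 is a single, heavier call:
t_10fold <- system.time(cv.glmnet(x, y, family = "cox", nfolds = 10, alpha = 1))
t_loocv  <- system.time(cv.glmnet(x, y, family = "cox", nfolds = nrow(y), alpha = 1, grouped = TRUE))
rbind(ten_fold_once = t_10fold, loocv_once = t_loocv)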
Thanks in advance.
library(glmnet)
library(dplyr)
data(CoxExample)   # note: recent glmnet versions return a list, so use x <- CoxExample$x; y <- CoxExample$y
x <- x[1:500, ]
y <- y[1:500, ]
## Approach 1: 10-fold cross-validation repeated 1000 times, then a final glmnet fit at the best averaged lambda
lambdas <- NULL
for (i in 1:1000) {
  fit <- cv.glmnet(x, y, family = "cox", nfolds = 10, alpha = 1, grouped = TRUE)
  errors <- data.frame(fit$lambda, fit$cvm)
  lambdas <- rbind(lambdas, errors)
}
# take the mean cvm for each lambda over the 1000 repetitions
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
# select the lambda with the smallest mean cvm
bestindex <- which(lambdas[2] == min(lambdas[2]))
bestlambda <- lambdas[bestindex, 1]
# and now run glmnet once more with that lambda
fit <- glmnet(x, y, family = "cox", alpha = 1, lambda = bestlambda)
coef.min <- coef(fit)          # the fit has a single lambda, so no s argument is needed
active.min <- which(coef.min != 0)
active.min
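# (Sketch) A quick visual check of Approach 1: the averaged CV curve and the chosen lambda.
# The column names Group.1 / x come from aggregate()'s default output above.
plot(log(lambdas$Group.1), lambdas$x, type = "l",
     xlab = "log(lambda)", ylab = "mean partial-likelihood deviance over 1000 runs")
abline(v = log(bestlambda), lty = 2)   # the lambda selected above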
## Approach 2: 10-fold cross-validation repeated 1000 times, then the frequency of each set of active covariates
coefs <- NULL
active.mins <- list()
for (i in 1:1000) {
  cvfit <- cv.glmnet(x, y, family = "cox", nfolds = 10, alpha = 1, grouped = TRUE)
  coef.min <- coef(cvfit, s = "lambda.min")   # cvfit is a cv.glmnet object, so s = "lambda.min" is valid here
  coefs <- cbind(coefs, coef.min)
  active.min <- which(coef.min != 0)
  active.mins <- c(active.mins, list(active.min))
}
# how often each distinct set of active covariates appears across the 1000 runs
table(unlist(lapply(active.mins, paste, collapse = " ")))
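# (Sketch) The loop above already stacks the 1000 coefficient vectors into 'coefs',
# so the per-covariate selection frequency can be read off directly: in how many of
# the 1000 lambda.min models does each covariate have a non-zero coefficient?
selection.freq <- rowSums(coefs != 0)
sort(selection.freq, decreasing = TRUE)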
## Approach 3: LOOCV via nfolds = nrow(y), no repetitions
cvfit <- cv.glmnet(x, y, family = "cox", nfolds = nrow(y), alpha = 1, grouped = TRUE)
cv.coef.min <- coef(cvfit, s = "lambda.min")
cv.active.min <- which(cv.coef.min != 0)
cv.active.min
# refit glmnet at the selected lambda.min and extract the active set
fit <- glmnet(x, y, family = "cox", alpha = 1, lambda = cvfit$lambda.min)
coef.min <- coef(fit)          # again a single-lambda fit, so no s argument is needed
active.min <- which(coef.min != 0)
active.min
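As a minimal check of my claim that LOOCV needs no repetitions (cvfit2 is just my own name): with nfolds = nrow(y) every observation is its own fold, so there is no random fold assignment and a second run should give the same lambda.min.
cvfit2 <- cv.glmnet(x, y, family = "cox", nfolds = nrow(y), alpha = 1, grouped = TRUE)
identical(cvfit$lambda.min, cvfit2$lambda.min)   # expected TRUE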
Results
Approach 1:
[1] 1 2 3 4 5 6 7 8 9 10 13 17 18 24 27 29
Approach 2:
[1] 1 2 3 4 5 6 7 8 9 10 13 17 18 24 27 29 (this was the most frequent set, appearing in 966/1000 runs)
Approach 3:
[1] 1 2 3 4 5 6 7 8 9 10 13 14 17 18 24 27 29