Question

Kernel Ridge Regression in R for Drug-Target Interaction

ረ • 0

@-10147

Last seen 6.7 years ago

I want to run Kernel Ridge Regression on a set of kernels I have computed, but I do not know how to do this in R. I found the constructKRRLearner function from the CVST package, but the manual is not clear at all, especially for me as a complete beginner in Machine Learning. The function needs an x and a y, but I have no idea what to input there, as I only have a data frame that holds the pairwise kernel computed as the Kronecker product between drugs and proteins.

How can I do a Kernel Ridge Regression task in R?

Ideally I also want to visualize my data points and then illustrate the regression line on the plot! For instance like this:

http://scikit-learn.org/stable/_images/plot_kernel_ridge_regression_0011.png

MORE INFO ON MY DATASET

I have a drug-target interaction (DTI) data set. The data set comprises 100 drug compounds (rows) and 100 protein kinase targets (columns). There are some NaNs (missing values) in this data set. Values in this data set reflect how tightly a compound binds to a target.

I have drugs' SMILES and CHEMBL IDs.

I have the proteins' (targets') sequences and UNIPROT IDs.

For drugs [100 drugs]: I converted the drug SMILES to an SDFset, and then I computed the fingerprints for each drug using OpenBabel. Based on these fingerprints I computed Tanimoto kernels for all possible pairs of drugs (using the "fpSim" function), e.g. Drug 1 with Drug 2, 3, 4, ... 100, then Drug 2 with Drug 1, 3, 4, ... 100 and so on until Drug 99 with Drug 100. I named this BASE_DRUG_KERNELS

For proteins: I had the protein sequences, so I computed Smith-Waterman scores for all combinations of protein pairs, e.g. Protein 1 with Protein 2, 3, ... 100, then Protein 2 with Protein 1, 3, 4, ... 100 and so on until Protein 99 with Protein 100. I named this BASE_PROTEIN_KERNELS

Then I computed the Kronecker product of BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, which gave me a 10,000 x 10,000 matrix of 100,000,000 elements. I named this matrix KRONECKER_PRODUCTS

I wish to run Kernel Ridge Regression on the matrix KRONECKER_PRODUCTS.

regression kernel ridge r • 4.0k views

updated 8.6 years ago by Guido Kraemer • 0 • written 9.0 years ago by ረ • 0

score 1 · Answer 1 · 2016-04-24

It's been a while since I've looked around, but I don't know where you might find such an implementation.

The kernlab package is where I'd go for most kernel-related things. While it doesn't have KRR, it does have SVR as well as Gaussian processes. Judging from your picture of the relative performance of KRR and SVR, and from this comparison in scikit-learn between KRR and Gaussian processes, it doesn't seem like you're missing much if you don't use KRR to begin with.

The nuclear option for all kernel-related things is to look at the shogun toolbox. It has been a rather long time since I tried to get it up and running in R, but you can do it, and it has all of the functionality you are looking for, and likely much more.

Lastly, given that KRR in particular is more readily accessible in Python via scikit-learn, I might just try to use it through there. At the minimum you can compare its performance vs. SVR and Gaussian processes to see if it's worth looking for KRR implementations in R (again, if you really want it, the shogun toolbox has you covered) -- or you can always just leave that part in Python if nothing else will do :-)

score 0 · Answer 2 · 2016-09-28

The CVST package uses cross validation for parameter selection with the CV() and fastCV() functions. But you have to provide the data in a certain form so that the CV() function understands what you are trying to do -- unfortunately this also applies if you are not using cross validation.

First you must construct a CVST.data object (see the documentation of constructData) with constructData(x, y) where x are your predictors and y is your target variable.

Something like this (make sure x is a matrix and y a vector; CVST is not very good at parameter checking):

iris.krr.data <- constructData(as.matrix(iris[,1:3]), iris[,4])

If you do not want to use cross-validation, simply provide the parameters as a list. Here getN returns the number of samples in your data set, which matters if you use cross-validation:

p <- list(kernel="rbfdot", sigma=100, lambda=.1/getN(iris.krr.data))

The actual KRR is done by an object that has to be initialized first:

krr <- constructKRRLearner()

Then learn the KRR:

m <- krr$learn(iris.krr.data, p)

And then to predict:

pred <- krr$predict(m, iris.krr.data)

Check the mean squared error:

mean((pred - iris.krr.data$y)^2)
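Since the question also asks about plotting the regression curve, here is a minimal base-R sketch of what KRR computes on one-dimensional toy data. It does not use CVST; the data, the rbf() helper, and the gamma and lambda values are made up for illustration, and CVST may scale its parameters differently.

```r
# Toy 1-D data: a noisy sine curve (made up for illustration).
set.seed(1)
x <- sort(runif(60, 0, 5))
y <- sin(x) + rnorm(60, sd = 0.2)

# Gaussian (RBF) kernel between two numeric vectors.
# rbf() and gamma are hypothetical names, not part of CVST.
rbf <- function(a, b, gamma = 1) exp(-gamma * outer(a, b, "-")^2)

lambda <- 0.1                 # ridge penalty (arbitrary choice)
K <- rbf(x, x)                # n x n training kernel matrix

# Dual coefficients: solve (K + lambda * I) alpha = y
alpha <- solve(K + lambda * diag(length(x)), y)

# Predict on a dense grid so the fitted curve can be drawn.
grid <- seq(0, 5, length.out = 200)
fit <- as.vector(rbf(grid, x) %*% alpha)

# Scatter plot of the data with the fitted curve on top.
if (interactive()) {
  plot(x, y, pch = 16, col = "grey40", main = "Kernel ridge regression")
  lines(grid, fit, col = "red", lwd = 2)
}
```

The same plotting idea only works directly when the input is one-dimensional; for pairwise drug-target data a plot of predicted against observed values is probably the closest equivalent.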

I do not think that KRR can deal with missing values, so you will probably have to remove observations that contain missing values.

KRR is very slow and memory intensive when your dataset gets large. You might want to try constructFastKRRLearner() from the DRR package: replace constructKRRLearner() with it and provide an additional parameter nblocks. Higher values of nblocks trade a little bit of accuracy for speed and lower memory consumption.
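For the drug-target case, where the kernel matrix is already computed, removing missing observations means dropping the corresponding rows and columns of the pairwise kernel. Below is a minimal base-R sketch under stated assumptions: it uses small random stand-ins for BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, an arbitrary lambda, and solves the KRR system directly rather than going through CVST.

```r
set.seed(42)
nd <- 5; np <- 4   # toy sizes; the real data would be 100 x 100

# Toy positive semi-definite base kernels built from random features
# (stand-ins for BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS).
Kd <- tcrossprod(matrix(rnorm(nd * 3), nd))
Kp <- tcrossprod(matrix(rnorm(np * 3), np))

# Toy affinity matrix (drugs in rows, proteins in columns) with missing values.
Y <- matrix(rnorm(nd * np), nd, np)
Y[sample(nd * np, 4)] <- NA

# Pairwise kernel. Row (i - 1) * np + k of K corresponds to drug i, protein k,
# so the labels must be flattened row by row (hence the transpose) to match.
K <- kronecker(Kd, Kp)
y <- as.vector(t(Y))

obs  <- which(!is.na(y))   # drug-target pairs with a measured value
miss <- which(is.na(y))    # pairs to be predicted

# Train on observed pairs only: solve (K_oo + lambda * I) alpha = y_o
lambda <- 0.5              # arbitrary; would normally be tuned by cross-validation
alpha <- solve(K[obs, obs] + lambda * diag(length(obs)), y[obs])

# Predict missing pairs from their kernel values against the observed pairs.
pred_miss <- as.vector(K[miss, obs, drop = FALSE] %*% alpha)

# Put the predictions back into the drugs x proteins layout.
Y_hat <- t(Y)
Y_hat[miss] <- pred_miss
Y_hat <- t(Y_hat)
```

At the full size the pairwise kernel is 10,000 x 10,000, which is about 800 MB as doubles, and the direct solve scales cubically with the number of observed pairs, so this is only practical as a starting point.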

Disclaimer: I am the author of the DRR package.