MPI-based Nested Cross-Validation for scikit-learn

If you are working with machine learning, at some point you have to choose hyper-parameters for your model of choice and do cross-validation to estimate how well the model generalizes to unseen data. Usually, you want to avoid over-fitting on your data when selecting hyper-parameters to get a less biased estimate of the model’s true performance. Therefore, the data you do hyper-parameter search on has to be independent of the data you use to assess a model’s performance. If you want to know what happens if you perform both tasks on the same data, have a look at the chapter The Wrong and Right Way to Do Cross-validation in the excellent book The Elements of Statistical Learning.

For instance, scikit-learn’s Support Vector Regression class has at least two hyper-parameters, the penalty weight C and the choice of kernel. Depending on the kernel, additional hyper-parameters need to be considered. Traditionally, people do an exhaustive grid search over a pre-defined set of values for each parameter and choose the setting that performed best. In fact, that is exactly what sklearn.grid_search.GridSearchCV does. In the end, what you get is the average score on the hold-out data with the best parameters. However, you don’t want to report that number, because you essentially cheated: by repeatedly using the hold-out data with different parameter settings to evaluate your model’s performance, you over-fit to the hold-out data as well.

It is important that the numbers you report in the end come from data you used only once to measure performance. To avoid the pitfalls of GridSearchCV, you essentially have to nest GridSearchCV within another cross-validation such as StratifiedKFold. That way, the grid search only uses the training data of the outer cross-validation loop, and results are reported on the outer test set, which was not used for the grid search.
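
To make the nesting concrete, here is a minimal sketch in plain scikit-learn (written against the newer sklearn.model_selection API, which did not exist when this post was written; variable names such as outer_cv are mine, and KFold takes the place of StratifiedKFold because this is a regression problem):

import numpy
from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

data = load_boston()
X, y = data['data'], data['target']

param_grid = {'C': [0.1, 1.0, 10.0]}
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # The grid search only ever sees the outer training data.
    search = GridSearchCV(SVR(), param_grid, cv=3,
                          scoring='neg_mean_absolute_error')
    search.fit(X[train_idx], y[train_idx])
    # The selected model is evaluated on the untouched outer test fold.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

# Report the mean across outer folds; no data point was used for
# both selecting and scoring a model.
print(numpy.mean(outer_scores))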

Obviously, this can become computationally very demanding: if you do 10-fold cross-validation on the outside and 3-fold cross-validation for the grid search, you need to train a total of 30 models for each parameter configuration. Luckily, inner and outer cross-validation can easily be parallelized. GridSearchCV can do this with the n_jobs parameter. However, for large-scale analyses you want to use a cluster to process data in parallel.
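
On a single machine, passing n_jobs when constructing the inner grid search is all it takes (a sketch along the lines of the one above; -1 requests all available cores):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'C': [0.1, 1.0, 10.0]}
# n_jobs=-1 fits the candidate parameter settings and inner folds
# in parallel on all cores of the local machine.
search = GridSearchCV(SVR(), param_grid, cv=3, n_jobs=-1)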

This is where the Message Passing Interface (MPI) comes in. It is a standardized protocol for parallel computing. In MPI terms, all processes are organized in groups, which are managed by a communicator, and each MPI process gets an ID, or rank, within its communicator. Usually, the process with rank zero acts as the master that distributes the work and collects the results again.
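
As a toy illustration of these terms, the following mpi4py sketch (the work items are made up) scatters chunks of work from rank zero to all processes and gathers the results back:

from mpi4py import MPI

comm = MPI.COMM_WORLD       # communicator spanning all started processes
rank = comm.Get_rank()      # this process' ID within the communicator
size = comm.Get_size()      # total number of processes

if rank == 0:
    # Master: split a list of work items into one chunk per process.
    items = list(range(20))
    chunks = [items[i::size] for i in range(size)]
else:
    chunks = None

work = comm.scatter(chunks, root=0)    # every rank receives its chunk
result = sum(work)                     # stand-in for the real computation
results = comm.gather(result, root=0)  # master collects partial results

if rank == 0:
    print(results)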

I implemented a nested grid search for scikit-learn estimators that distributes the work using MPI. I’m not an expert in MPI, so there might be more efficient solutions, but it gets the job done for me. Using it is very similar to GridSearchCV:

from mpi4py import MPI
import numpy
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from grid_search import NestedGridSearchCV

data = load_boston()
X = data['data']
y = data['target']

estimator = SVR(max_iter=1000, tol=1e-5)

# Exhaustive grid: exponentially spaced values for C and gamma,
# and two choices of kernel.
param_grid = {'C': 2. ** numpy.arange(-5, 15, 2),
              'gamma': 2. ** numpy.arange(3, -15, -2),
              'kernel': ['poly', 'rbf']}

# 5-fold outer cross-validation with a 3-fold inner
# cross-validation for the grid search.
nested_cv = NestedGridSearchCV(estimator, param_grid, 'mean_absolute_error',
                               cv=5, inner_cv=3)
nested_cv.fit(X, y)

# Only the master process (rank 0) holds the collected results.
if MPI.COMM_WORLD.Get_rank() == 0:
    for i, scores in enumerate(nested_cv.grid_scores_):
        scores.to_csv('grid-scores-%d.csv' % (i + 1), index=False)

    print(nested_cv.best_params_)

To run this example, execute

mpiexec python example.py
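
By default, every process started by the MPI launcher participates. Standard mpiexec implementations also accept a -n flag to set the number of processes explicitly, for example

mpiexec -n 4 python example.py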

Your final result is stored in the best_params_ attribute, which is a pandas data frame that contains the selected hyper-parameters, the average performance across all inner cross-validation folds (score (Validation)), and the performance on the outer test fold (score (Test)):

fold  score (Validation)      C     gamma  kernel  score (Test)
1              -7.252490    0.5  0.000122     rbf     -4.178257
2              -5.662221  128.0  0.000122     rbf     -5.445915
3              -5.582780   32.0  0.000122     rbf     -7.066123
4              -6.306561    0.5  0.000122     rbf     -6.059503
5              -6.174779  128.0  0.000122     rbf     -6.606218

Complete results of the grid search are stored in the grid_scores_ attribute, which is a list of data frames, one for each outer cross-validation fold.
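
Building on the example above, the per-fold frames can, for instance, be stacked into a single frame with pandas for inspection (a usage sketch; it assumes all frames share the same columns):

import pandas

# One row per parameter setting and outer fold; the outermost
# index level identifies the fold.
all_scores = pandas.concat(
    nested_cv.grid_scores_,
    keys=range(1, len(nested_cv.grid_scores_) + 1))
print(all_scores.head())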

The code is available at https://github.com/sebp/scikit-learn-mpi-grid-search and depends on mpi4py, numpy, pandas, and scikit-learn.

Note that I only tried this out with Python 3.4. For further details, please check out the inline documentation.
