Reader level: Intermediate

Disclaimer: This post is a work in progress and will be updated periodically. It is not meant to be a comprehensive overview of the topic, but rather an introduction to AutoML and some of the associated tools and techniques.

Overview

Finding a model that works for a specific problem or a class of problems can be a time-consuming task. Usually, an engineer or a scientist determines what model class to use either based on their prior knowledge of the problem at hand or by evaluating several models and picking the best one. Each of these model classes has 'hyperparameters', i.e. parameters of the learning algorithm itself that have to be tuned, and determining these can be tedious. The goal of Automated Machine Learning is to automate this process of model selection and to find the combination of hyperparameters that not only gives you good performance but also generalizes well (holy grail!). Keep in mind that this does not eliminate the need for a trained engineer/scientist to guide the process and critically evaluate the results. The sections below explain some of the concepts associated with this idea of Automated Machine Learning.

Hyperparameter optimization

A number of methods exist for performing hyperparameter optimization of machine learning algorithms. The goal is to find the combination of hyperparameters that gives the best results on the validation set, i.e. the best accuracy or the least amount of error, referred to here as a score. These methods can roughly be categorized into the following:

  • Grid search
  • Random search
  • Genetic algorithm
  • Bayesian optimization

While grid search is one of the most popular methods, where you work through a grid of hyperparameter combinations to find an optimal one, it is quite expensive. Random search can be less expensive: instead of searching the grid systematically, you sample points at random, and this has been shown to perform just as well as grid search or even better. Genetic-algorithm-based search can be used to find optimal candidates; such an approach is used in the data science library TPOT. Bayesian optimization algorithms have become quite popular recently, and two widely used approaches are SMAC and TPE.
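
To make the difference concrete, here is a minimal sketch contrasting grid search and random search with scikit-learn; the estimator and the parameter ranges are illustrative choices, not recommendations.

# Minimal sketch: grid search vs. random search with scikit-learn.
# The SVC estimator and the parameter ranges below are illustrative only.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Grid search: evaluates every combination in the grid (3 x 3 = 9 configurations).
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-2, 1e-1]}, cv=3)
grid.fit(X, y)

# Random search: samples a fixed budget of configurations from distributions.
rand = RandomizedSearchCV(SVC(),
                          {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-4, 1e0)},
                          n_iter=9, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)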

Bayesian algorithms (paper) have not been without their share of controversy, with some suggesting that they are no better than random search. However, it has been shown that while Bayesian optimization algorithms suffer from slow starts on large datasets, when they are warm-started on smaller datasets and the problem size is adaptively increased, they tend to outperform pure random search (paper).

Bayesian algorithms can be roughly classified into three types based on the regression model used. They differ in the 'surrogate' function, also sometimes called a response surface, that is used to approximate the ML model. The response surface is built gradually from the evaluation history. This surrogate has to be easier to optimize than the ML model itself, and it gives us an estimate of where the next hyperparameter configuration lies in the search process.

  • Gaussian Processes
  • Random Forests
  • Tree Parzen Estimators

‘Acquisition functions’ are used to define a balance between exploring new areas of the hyperparameter space and exploiting areas that already have favorable solutions. Many forms have been proposed for these functions; one popular criterion is the Expected Improvement (EI):

$$ EI(x) = \int_{-\infty}^{f_{best}} (f_{best} - f) \, p(f|x) \, df $$

where ‘x’ is the hyperparameter configuration for which we seek an optimal value, and ‘f’ is the objective function whose value indicates the algorithm's performance. Note that ‘f’ is a metric whose value is to be minimized, i.e. the lower the value the better the model; for example, ‘f’ could be the error, whereas if one were to use accuracy as the metric we would want it to be maximized. Here, p(f|x) is the probability of ‘f’ given ‘x’ evaluated using the probabilistic surrogate model. The goal is to pick the ‘x’ that has the maximal EI. To summarize, these algorithms perform the following steps to optimize hyperparameters:

  • Select a number of points at random and run the ML model on them to initialize a probabilistic regression model (the surrogate for the real ML model), e.g. a GP, or Random Forests used to approximate a GP.
  • Using the surrogate regression model and a suitable ‘acquisition function’, optimize the acquisition function to get a new point. This is far cheaper than running the real ML model.
  • Evaluate the newly identified point with the real ML model, update the surrogate regression model, and repeat the step above.
  • The search usually runs for a fixed time budget, after which the optimization is terminated and the performance of the chosen hyperparameters is evaluated.

Bayesian optimization works even if the cost function ‘f’ is non-convex.
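
To make the loop concrete, here is a minimal sketch using a Gaussian Process surrogate from scikit-learn and the EI criterion above. The 1-D objective function, the search bounds, and the candidate-sampling step are illustrative stand-ins for a real train-and-validate run; a proper implementation would optimize the acquisition function more carefully.

# Minimal sketch of the surrogate/acquisition loop, assuming a 1-D hyperparameter
# and a made-up objective standing in for "train the ML model and return the
# validation error" (lower is better).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2   # placeholder for the real (expensive) model run

def expected_improvement(x_cand, gp, f_best):
    mu, sigma = gp.predict(x_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # EI for minimization

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))          # 1. initialize with a few random points
y = objective(X).ravel()

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # surrogate model
    candidates = rng.uniform(-2, 2, size=(1000, 1))             # 2. optimize EI (cheap)
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    y_next = objective(x_next).ravel()                          # 3. run the real model once
    X, y = np.vstack([X, x_next]), np.concatenate([y, y_next])  # 4. update and repeat

print("best x:", X[np.argmin(y)], "best f:", y.min())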

Sequential Model-based Algorithm Configuration (SMAC)

This excellent paper covers the topic in detail; I briefly summarize SMAC here. SMAC improves on the Sequential Model-Based Optimization (SMBO) algorithm, which iterates between building a model and gathering additional data.

The SMBO algorithm uses Gaussian Processes (GPs) to fit a response surface model. It uses the Expected Improvement (EI) criterion, ‘which is high in regions of low predictive mean and high predictive variance’. The next point selected is the one with the highest EI. The time required to perform this consists of the time required to run the model and the time required to compute the EI. In practice, more than one potential configuration is selected in each iteration.

SMAC extends this to work with multiple instances and introduces Random Forests (RFs) and approximate GPs as surrogates in SMBO instead of a full GP. For a Random Forest, the mean at each point is the prediction of the forest, while the variance at each point is the variance of the predictions of the individual trees in the forest. So while Random Forests are not generally considered probabilistic models, one can use these frequentist estimates of mean and variance to approximate a GP (as in Auto-WEKA); see the sketch below. SMBO also could not handle categorical variables; SMAC resolves this by using a weighted Hamming distance for the kernel function instead of the weighted squared (Euclidean) distance. For a problem with both continuous and categorical attributes, a combined kernel is created using both the Euclidean and Hamming distances. The authors prove that this is indeed a valid kernel in the paper cited above. They also use a projected process approximation of a GP instead of a full GP to speed up the process.
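
A minimal sketch of the random-forest surrogate idea, using scikit-learn's RandomForestRegressor; the toy history of (configuration, score) pairs stands in for real model evaluations. The per-tree spread provides the variance estimate that an EI-style criterion needs.

# Sketch: a random forest as a frequentist stand-in for a GP surrogate.
# The mean is the forest's prediction; the variance is the spread of the
# individual trees' predictions (SMAC / Auto-WEKA style).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_mean_variance(forest, X_cand):
    # Collect each tree's prediction for every candidate configuration.
    per_tree = np.stack([tree.predict(X_cand) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.var(axis=0)

# Toy history of (configuration, observed score) pairs; placeholders for real runs.
rng = np.random.default_rng(0)
X_hist = rng.uniform(0, 1, size=(30, 2))
y_hist = np.sin(5 * X_hist[:, 0]) + X_hist[:, 1]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_hist, y_hist)
mu, var = rf_mean_variance(forest, rng.uniform(0, 1, size=(5, 2)))
print(mu, var)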

Tree of Parzen Estimators (TPE)

The tree structure in the name implies that the conditional dependencies of the hyperparameters are exploited, i.e. when certain values are selected for some hyperparameters, the values that the remaining hyperparameters can take are constrained. This forms a tree-like structure of dependencies (slide ‘Tuning of feedforward neural networks’). TPE models p(configuration|score) instead of p(score|configuration) as we saw in SMAC/SMBO. It replaces the Gaussian density with non-parametric densities from which configurations can be drawn. We have two densities, l(x) and g(x), built from the configuration points whose scores lie below and above a threshold ‘f_threshold’, respectively. Instead of a GP surrogate, we use these densities to compute a score.

$$ p(x|f) = \begin{cases} l(x), & f < f_{threshold} \\ g(x), & f \geq f_{threshold} \end{cases} $$

As more configuration points are evaluated, the densities are updated. Configuration points are selected based on the following EI:

$$ EI(x) \propto \left( \gamma + \dfrac{g(x)}{l(x)}(1 - \gamma) \right)^{-1} $$

Here γ is an algorithm parameter, usually set to around 0.15. The above means what we would intuitively expect: points with a higher density under l(x) (the density estimate below the threshold) and a lower density under g(x) (the density estimate above the threshold), i.e. points that maximize the ratio l(x)/g(x), have a higher EI and are more likely to be selected as the next hyperparameter configuration ‘x’.

If the surrogate, i.e. the pair of densities, is correct or close to it, then the points selected by maximizing the EI above should give us hyperparameter configurations that perform best on the true objective function, i.e. the ML model itself. One downside, though: Tree of Parzen Estimators model each hyperparameter independently and therefore cannot capture interactions between hyperparameters. The article does a very good job of describing this in detail.
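
Here is a rough sketch of the TPE selection step for a single continuous hyperparameter, using scipy's gaussian_kde for the two densities. Real TPE implementations use per-hyperparameter Parzen windows with priors and handle the tree structure of the search space, but the l(x)/g(x) ratio logic is the same.

# Sketch of the TPE selection step for one continuous hyperparameter.
# l(x): density of configurations with scores below the gamma-quantile threshold.
# g(x): density of the remaining (worse) configurations.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_hist = rng.uniform(0, 1, size=200)                           # evaluated configurations
f_hist = (x_hist - 0.3) ** 2 + 0.05 * rng.normal(size=200)     # placeholder scores (lower is better)

gamma = 0.15
f_threshold = np.quantile(f_hist, gamma)
l = gaussian_kde(x_hist[f_hist < f_threshold])                 # "good" density
g = gaussian_kde(x_hist[f_hist >= f_threshold])                # "bad" density

# Draw candidates from l(x) and keep the one maximizing l(x)/g(x),
# which is equivalent to maximizing the TPE form of EI.
candidates = l.resample(64).ravel()
ratio = l(candidates) / np.maximum(g(candidates), 1e-12)
x_next = candidates[np.argmax(ratio)]
print("next configuration to evaluate:", x_next)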

Neural Architecture Search

A recent paper covering developments in Neural Architecture Search (NAS) can be found here: NAS survey. For a more extensive literature review, visit the AutoML website: AutoML literature review.

Auto-Keras

As of this writing, this tool (Auto-Keras) is still fairly new, pre-release, and missing a lot of features. I include it here because it has the potential to be a very powerful open-source alternative to commercial cloud-based AutoML solutions. The advantage of Auto-Keras is that it uses ‘network morphism’ to reduce training time, instead of training a network from scratch at each iteration. The authors also point out that most NAS algorithms work towards increasing the size of the network without any ability to explore smaller architectures; Auto-Keras apparently overcomes this deficiency by retaining the ability to revisit and explore smaller architectures found earlier in the iterative process. This blog post provides a good overview of how to run the examples and some of the issues the author faced with this tool.

The fundamental challenge is that, unlike hyperparameter optimization, the Neural Architecture Search space is not a Euclidean space, which makes it difficult to apply a Gaussian Process directly. The authors propose an ‘edit-distance’ kernel to deal with this difficulty, the edit-distance being the number of operations required to change one network into another. This is analogous to the number of operations required to morph one graph to resemble another. If the edit-distance is defined as

$$ EditDistance = d(f_a,f_b) $$

then the kernel function can be written as

$$ K(f_a,f_b) = e^{- \rho^2 (d(f_a,f_b))} $$

where ρ is a mapping function applied to the distance. The key is to approximate the edit-distance using the following formula, since computing it exactly is a computationally intractable problem:

$$ d(f_a,f_b) = D_l (L_a,L_b) + \lambda D_s(S_a, S_b) $$

where Dl is the minimum distance for morphing architecture fa into fb if skip-connections are ignored, and Ds is the approximate distance required to morph the skip-connections of fa into those of fb. The details of how these are computed are not covered here; interested readers can refer to the paper linked above.
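
As a toy illustration of how the pieces above fit together (and not the paper's actual distance computation), the kernel value follows directly once approximate layer and skip-connection distances are available; D_l, D_s, lam, and rho below are stand-ins.

# Toy illustration of the kernel form only; D_l, D_s, lam and rho are stand-ins,
# not the paper's actual edit-distance computation.
import math

def edit_distance(D_l, D_s, lam=1.0):
    # d(f_a, f_b) = D_l(L_a, L_b) + lambda * D_s(S_a, S_b)
    return D_l + lam * D_s

def kernel(d, rho=math.sqrt):
    # K(f_a, f_b) = exp(-rho(d)^2), assuming rho^2(d) denotes the squared
    # mapped distance; rho here is an arbitrary placeholder mapping.
    return math.exp(-rho(d) ** 2)

print(kernel(edit_distance(D_l=2.0, D_s=1.0, lam=0.5)))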

Numerical optimization techniques such as gradient-based methods do not work in a tree-structured space like the architecture search space. Algorithms such as TreeBO and NASBOT have been proposed to address this specific problem. The authors propose a new hybrid approach that combines features of tree-structured search and simulated annealing.

Implementations

Examples of how to perform hyperparameter optimization and neural architecture search are shown below with Auto-sklearn (Auto-sklearn paper) and Auto-Keras, respectively.

Auto-sklearn

The example below shows how to run hyperparameter optimization on the Boston housing dataset.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import autosklearn.regression

# Load the Boston housing dataset and create a train/test split.
boston = load_boston()
X_train, X_test, y_train, y_test = \
        train_test_split(boston.data, boston.target, random_state=1)

automl = autosklearn.regression.AutoSklearnRegressor()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
# This gives you the results back in a dictionary
results_dict = automl.cv_results_
# prints the statistics
print(automl.sprint_statistics())
# print the models that were trained
print(automl.show_models())
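
The search above runs for a fixed default time budget; to control the budget explicitly, the regressor accepts time limits, as sketched below (the values are illustrative).

# Same search with explicit time budgets (the values are illustrative).
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,  # total seconds allotted to the search
    per_run_time_limit=300,        # seconds allowed per candidate pipeline
)
automl.fit(X_train, y_train)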

Running the above produces the following output:

auto-sklearn results:
Dataset name: fdd1924fd43daf1502d116d6da914420
Metric: r2
Best validation score: 0.880841
Number of target algorithm runs: 676
Number of successful target algorithm runs: 662
Number of crashed target algorithm runs: 8
Number of target algorithms that exceeded the memory limit: 6
Number of target algorithms that exceeded the time limit: 0

[(0.540000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'no_encoding', 'imputation:strategy': 'median', 'preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'adaboost', 'rescaling:__choice__': 'minmax', 'preprocessor:feature_agglomeration:affinity': 'euclidean', 'preprocessor:feature_agglomeration:linkage': 'average', 'preprocessor:feature_agglomeration:n_clusters': 58, 'preprocessor:feature_agglomeration:pooling_func': 'max', 'regressor:adaboost:learning_rate': 0.17811908210425184, 'regressor:adaboost:loss': 'square', 'regressor:adaboost:max_depth': 8, 'regressor:adaboost:n_estimators': 377},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.220000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'no_encoding', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'polynomial', 'regressor:__choice__': 'ard_regression', 'rescaling:__choice__': 'minmax', 'preprocessor:polynomial:degree': 3, 'preprocessor:polynomial:include_bias': 'False', 'preprocessor:polynomial:interaction_only': 'True', 'regressor:ard_regression:alpha_1': 7.942024519019753e-05, 'regressor:ard_regression:alpha_2': 1.4691455454157323e-10, 'regressor:ard_regression:fit_intercept': 'True', 'regressor:ard_regression:lambda_1': 6.273455521456825e-10, 'regressor:ard_regression:lambda_2': 8.129782496225096e-08, 'regressor:ard_regression:n_iter': 300, 'regressor:ard_regression:threshold_lambda': 1630.9116363987964, 'regressor:ard_regression:tol': 0.0017370820165403767},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.140000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'one_hot_encoding', 'imputation:strategy': 'median', 'preprocessor:__choice__': 'polynomial', 'regressor:__choice__': 'sgd', 'rescaling:__choice__': 'minmax', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:polynomial:degree': 3, 'preprocessor:polynomial:include_bias': 'False', 'preprocessor:polynomial:interaction_only': 'False', 'regressor:sgd:alpha': 0.0001359889935213488, 'regressor:sgd:average': 'False', 'regressor:sgd:eta0': 0.07052716024285549, 'regressor:sgd:fit_intercept': 'True', 'regressor:sgd:learning_rate': 'invscaling', 'regressor:sgd:loss': 'squared_loss', 'regressor:sgd:penalty': 'l1', 'regressor:sgd:tol': 0.0005734508309511284, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.010000000000000004, 'regressor:sgd:power_t': 0.25},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.040000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'no_encoding', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'no_preprocessing', 'regressor:__choice__': 'gradient_boosting', 'rescaling:__choice__': 'minmax', 'regressor:gradient_boosting:learning_rate': 0.46095010873848963, 'regressor:gradient_boosting:loss': 'huber', 'regressor:gradient_boosting:max_depth': 8, 'regressor:gradient_boosting:max_features': 0.9143471505632927, 'regressor:gradient_boosting:max_leaf_nodes': 'None', 'regressor:gradient_boosting:min_impurity_decrease': 0.0, 'regressor:gradient_boosting:min_samples_leaf': 11, 'regressor:gradient_boosting:min_samples_split': 13, 'regressor:gradient_boosting:min_weight_fraction_leaf': 0.0, 'regressor:gradient_boosting:n_estimators': 106, 'regressor:gradient_boosting:subsample': 0.42311590902545987, 'regressor:gradient_boosting:alpha': 0.9389512496772583},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.020000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'one_hot_encoding', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'random_trees_embedding', 'regressor:__choice__': 'ridge_regression', 'rescaling:__choice__': 'robust_scaler', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:random_trees_embedding:bootstrap': 'True', 'preprocessor:random_trees_embedding:max_depth': 7, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 2, 'preprocessor:random_trees_embedding:min_samples_split': 7, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 79, 'regressor:ridge_regression:alpha': 0.0010507086341767701, 'regressor:ridge_regression:fit_intercept': 'True', 'regressor:ridge_regression:tol': 0.09150923931712328, 'rescaling:robust_scaler:q_max': 0.9941949444067358, 'rescaling:robust_scaler:q_min': 0.25635232814287207, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.2945010490264204},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.020000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'one_hot_encoding', 'imputation:strategy': 'median', 'preprocessor:__choice__': 'polynomial', 'regressor:__choice__': 'sgd', 'rescaling:__choice__': 'minmax', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:polynomial:degree': 3, 'preprocessor:polynomial:include_bias': 'True', 'preprocessor:polynomial:interaction_only': 'False', 'regressor:sgd:alpha': 5.274048148124689e-07, 'regressor:sgd:average': 'False', 'regressor:sgd:eta0': 0.06853316582334407, 'regressor:sgd:fit_intercept': 'True', 'regressor:sgd:learning_rate': 'invscaling', 'regressor:sgd:loss': 'squared_loss', 'regressor:sgd:penalty': 'elasticnet', 'regressor:sgd:tol': 0.0006412478800169259, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.00023940988165439893, 'regressor:sgd:l1_ratio': 0.14999999999999974, 'regressor:sgd:power_t': 0.25},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
(0.020000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'one_hot_encoding', 'imputation:strategy': 'median', 'preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gradient_boosting', 'rescaling:__choice__': 'minmax', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'False', 'preprocessor:feature_agglomeration:affinity': 'manhattan', 'preprocessor:feature_agglomeration:linkage': 'complete', 'preprocessor:feature_agglomeration:n_clusters': 274, 'preprocessor:feature_agglomeration:pooling_func': 'max', 'regressor:gradient_boosting:learning_rate': 0.07986384032740823, 'regressor:gradient_boosting:loss': 'ls', 'regressor:gradient_boosting:max_depth': 7, 'regressor:gradient_boosting:max_features': 0.73532303005895, 'regressor:gradient_boosting:max_leaf_nodes': 'None', 'regressor:gradient_boosting:min_impurity_decrease': 0.0, 'regressor:gradient_boosting:min_samples_leaf': 8, 'regressor:gradient_boosting:min_samples_split': 12, 'regressor:gradient_boosting:min_weight_fraction_leaf': 0.0, 'regressor:gradient_boosting:n_estimators': 129, 'regressor:gradient_boosting:subsample': 0.9289357260148249},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
  ]

Auto-Keras

This paper describes how Auto-Keras goes about performing Neural Architecture Search (NAS) using Bayesian optimization. Keep in mind that this is different from finding an optimal combination of hyperparameters, since a network morphism is performed at each step, which makes the Bayesian optimization process more complex. The following snippet is from the Auto-Keras website and illustrates how easy it is to set up an architecture search using the framework. Depending on the problem size, this can take days or weeks to run.

from keras.datasets import mnist
from autokeras.image_supervised import ImageClassifier

if __name__ == '__main__':
    # Load MNIST and add a channel dimension so the images are (28, 28, 1).
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(x_train.shape + (1,))
    x_test = x_test.reshape(x_test.shape + (1,))

    # Search architectures for up to 12 hours, then retrain the best one
    # found and evaluate it on the test set.
    clf = ImageClassifier(verbose=True)
    clf.fit(x_train, y_train, time_limit=12 * 60 * 60)
    clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
    y = clf.evaluate(x_test, y_test)
    print(y)

The following output was obtained after roughly 24 hours of training and evaluation (the default setting), at which point the search hit the specified time limit and was terminated.

/Users/srijithraj/anaconda3/envs/tensorflow_class/lib/python3.6/site-packages/tqdm/autonotebook/__init__.py:14: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  " (e.g. in jupyter console)", TqdmExperimentalWarning)



Initializing search.
Initialization finished.


+----------------------------------------------+
|               Training model 0               |
+----------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           0            |   2.1007456893101333   |        0.98336         |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|               Training model 1               |
+----------------------------------------------+


/Users/srijithraj/anaconda3/envs/tensorflow_class/lib/python3.6/site-packages/autokeras/bayesian.py:151: UserWarning: Predicted variances smaller than 0. Setting those variances to 0.
  warnings.warn("Predicted variances smaller than 0. "



+--------------------------------------------------------------------------+
|    Father Model ID     |                 Added Operation                 |
+--------------------------------------------------------------------------+
|           0            |           ('to_add_skip_model', 1, 5)           |
+--------------------------------------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           1            |   2.355205422639847    |        0.98268         |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|               Training model 2               |
+----------------------------------------------+

+--------------------------------------------------------------------------+
|    Father Model ID     |                 Added Operation                 |
+--------------------------------------------------------------------------+
|           0            |          ('to_conv_deeper_model', 9, 3)         |
+--------------------------------------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           2            |   1.8618361109867692   |        0.98668         |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|               Training model 3               |
+----------------------------------------------+

+--------------------------------------------------------------------------+
|    Father Model ID     |                 Added Operation                 |
+--------------------------------------------------------------------------+
|                        |           ('to_add_skip_model', 1, 5)           |
|                        |          ('to_conv_deeper_model', 5, 3)         |
|           2            |           ('to_add_skip_model', 5, 23)          |
|                        |          ('to_concat_skip_model', 1, 5)         |
|                        |         ('to_concat_skip_model', 5, 23)         |
+--------------------------------------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           3            |   1.6021900983527302   |        0.98876         |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|               Training model 4               |
+----------------------------------------------+

+--------------------------------------------------------------------------+
|    Father Model ID     |                 Added Operation                 |
+--------------------------------------------------------------------------+
|                        |          ('to_conv_deeper_model', 9, 3)         |
|                        |          ('to_conv_deeper_model', 5, 3)         |
|                        |          ('to_conv_deeper_model', 9, 3)         |
|           3            |         ('to_concat_skip_model', 5, 24)         |
|                        |           ('to_add_skip_model', 5, 9)           |
|                        |          ('to_add_skip_model', 21, 24)          |
+--------------------------------------------------------------------------+

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           4            |   1.5797022448852658   |        0.98916         |
+--------------------------------------------------------------------------+

Time limit for model search is reached. Ending the model search.

Loading and training the best model recorded from the search.
