
AutoML Introduction

2020/04/04

As machine learning develops, the demand for AutoML (Automated Machine Learning) grows. This article splits the topic into two parts: AutoML in classic machine learning and AutoML in deep learning.

After completing this tutorial, you will know:

  1. AutoML in both classic machine learning and deep learning.
  2. A peek at two open-source frameworks, Auto-Sklearn and Auto-Keras.
  3. Two algorithms, Bayesian Optimization and Network Morphism.
  4. The ideas behind Neural Architecture Search.

AutoML in Classic Machine Learning

CASH

AutoML has many sub-functionalities, including automatic data analysis, automatic feature engineering, automatic normalization and regularization, automatic feature selection, automatic algorithm selection, and automatic hyperparameter optimization (HPO). Recent studies focus on the CASH[1] problem, that is, Combined Algorithm Selection and Hyperparameter optimization.

CASH was proposed by Thornton et al. in Auto-WEKA[1]. The solution they provided was Bayesian Optimization[2].

Bayesian Optimization

Bayesian Optimization (BO) is one way to find the extrema (both maximum and minimum values) of a function. Other options include grid search, random search, and gradient descent. Unlike gradient descent, BO can optimize functions that are non-convex and non-differentiable.

In BO, the black-box function we optimize is called the objective function. This function is usually costly to evaluate, so we introduce a surrogate function that estimates the mean and standard deviation of the objective. These estimates are then used by the acquisition function to choose the next sample point, balancing exploitation and exploration.

BO searches for extrema within a bounded search space.

The process is summarized as follows:

  1. Sample some points x in the search space.
  2. Compute the corresponding y values with the objective function.
  3. Update the surrogate function using (x, y).
  4. Find the next x using the acquisition function.
  5. Go back to step 2 until an extremum is found.

The following flowchart demonstrates the process.

Bayesian Optimization Process

Now let’s look at an example. Suppose we have the following function:

Bayesian Optimization Objective

The search space is [0, 1]. First, two points, x=0 and x=1, are selected, and their y values are calculated with the objective function.

Bayesian Optimization Iteration 0

Then the mean and standard deviation are estimated by the Gaussian process, which serves as the surrogate function. UCB (Upper Confidence Bound) is used as the acquisition function: it simply takes the point where the upper confidence bound (red in the plot) is highest as the next sample.

The BO continues…

Bayesian Optimization Iteration 1

Bayesian Optimization Iteration 2

Bayesian Optimization Iteration 3

After four iterations, the max value is found.

The above pictures were generated by a notebook from my GitHub repo Machine Learning Step by Step.
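To make the loop concrete, below is a minimal sketch using scikit-learn's GaussianProcessRegressor as the surrogate and UCB as the acquisition function. The objective here is a stand-in, not the exact function from the notebook:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Stand-in objective; assume it is expensive to evaluate.
    return np.sin(3 * x) + x

X = np.array([[0.0], [1.0]])               # the two initial samples
y = objective(X).ravel()
grid = np.linspace(0, 1, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(alpha=1e-6)  # jitter keeps repeated points stable
for _ in range(4):
    gp.fit(X, y)                                    # update the surrogate
    mean, std = gp.predict(grid, return_std=True)
    ucb = mean + 2.0 * std                          # acquisition: UCB
    x_next = grid[np.argmax(ucb)]                   # next sample point
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmax(y)][0], "best y:", y.max())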

Auto-Sklearn

Auto-Sklearn[3] uses Bayesian Optimization to find the best algorithm and hyperparameters. Beyond that, it proposes two improvements. The first is meta-learning: Auto-Sklearn collected many datasets, extracted meta-features from them, and recorded the best hyperparameters found on each. The mapping from dataset meta-features to good hyperparameters is then learned, which narrows the search space and makes the algorithm run faster on a new dataset. The second improvement is ensembling: rather than keeping only the single best model, Auto-Sklearn combines the models found during the search, which improves accuracy.

I installed Auto-Sklearn and tried the Titanic survival and Boston house price datasets. Each run took one hour on an i7 CPU. The scores were 0.76 and 0.85, respectively. Not bad.

Auto-Sklearn focuses on the CASH problem; it does not handle data preprocessing. I had to remove string features such as names and ticket numbers in advance, and categorical features such as embarkation port and gender have to be transformed into Boolean or one-hot values.
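For illustration, a minimal preprocessing sketch for the Titanic data could look like this (column names follow the Kaggle CSV; adjust them to your copy of the data):

import pandas as pd

df = pd.read_csv("train.csv")

# Drop free-text columns that Auto-Sklearn cannot use directly.
df = df.drop(columns=["Name", "Ticket", "Cabin"])

# One-hot encode categorical columns such as gender and embarkation port.
df = pd.get_dummies(df, columns=["Sex", "Embarked"])

X_train = df.drop(columns=["Survived"])
y_train = df["Survived"]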

Sample code for auto-sklearn

import autosklearn.classification

# Search over candidate models and hyperparameters within the default time budget.
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, y_train)
predictions = cls.predict(X_test)

Official Website: https://automl.github.io/auto-sklearn/master/

AutoML in Deep Learning

Neural Architecture Search (NAS)[4] is the main research topic in deep learning automation. The rough idea is to describe a deep model as a string, so that an RNN can predict the next layer from the layers generated so far. NAS names this RNN the controller. The original NAS used reinforcement learning to train the controller's parameters, with the child network's accuracy as the reward.

Neural Architecture Search

Neural Architecture Search RNN
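The toy sketch below illustrates only the reinforcement-learning loop: the "controller" is a bare softmax over three layer types, and the reward function is a stand-in for training a child network, so none of the names or numbers come from the paper:

import numpy as np

rng = np.random.default_rng(0)
choices = ["conv3x3", "conv5x5", "maxpool"]  # toy layer vocabulary
logits = np.zeros(len(choices))              # "controller" parameters

def reward(arch):
    # Stand-in for training the child network and measuring accuracy.
    return 0.9 if arch.count("conv3x3") >= 2 else 0.5

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(choices), size=3, p=probs)  # sample a 3-layer model
    arch = [choices[i] for i in idx]
    r = reward(arch)
    # REINFORCE: raise the log-probability of the sampled choices,
    # scaled by the reward minus a crude baseline of 0.7.
    grad = -probs * len(idx)
    for i in idx:
        grad[i] += 1.0
    logits += 0.1 * (r - 0.7) * grad

probs = np.exp(logits) / np.exp(logits).sum()
print({c: round(p, 2) for c, p in zip(choices, probs)})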

Some architectures have branches. NAS uses a set of binary classifiers to decide whether layer N connects to each of the preceding N-1 layers.

The following are the original LSTM cell (upper left) and two LSTM structures that NAS found.

Neural Architecture Search LSTM

Xin He et al.[5] surveyed NAS research. For the controller, grid search, random search, reinforcement learning, evolutionary algorithms, Bayesian optimization, and gradient descent have all been used. Low-fidelity training, transfer learning, surrogate models, and early stopping are used to make evaluation more efficient.

The following is a comparison of all the NAS implementations:

Neural Architecture Search Compare

Network Morphism

During NAS, a great number of architectures are tested. If each new network could inherit parameters from previously trained ones, the process would be much more efficient. Network Morphism[6] provides such a mechanism. It calls the old network the parent network and the new one the child network; the child inherits knowledge from the parent.

Network Morphism

There are five cases in Network Morphism: linear, non-linear, stand-alone width, stand-alone kernel, and subnet.

Linear Case

Network Morphism Linear

A new layer, layer l, is inserted between layer l-1 and layer l+1. B_{l-1} and B_{l+1} are the data blobs of the two existing layers, related by the weight matrix G: B_{l+1} = B_{l-1} · G. To insert the new layer without changing the function, we simply split G into two matrices, F_l and F_{l+1}, with G = F_l · F_{l+1}.
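A minimal numpy sketch of this; the factorization chosen here (F_l = G, F_{l+1} = I) is just the simplest one that preserves the function:

import numpy as np

rng = np.random.default_rng(0)
B_prev = rng.normal(size=(1, 8))   # B_{l-1}: output of layer l-1
G = rng.normal(size=(8, 4))        # parent weights: B_{l+1} = B_{l-1} @ G

# Split G into F_l and F_{l+1} such that F_l @ F_{l+1} == G.
F_l = G
F_next = np.eye(4)

parent_out = B_prev @ G
child_out = (B_prev @ F_l) @ F_next
assert np.allclose(parent_out, child_out)  # the morph preserves the function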

Non-Linear Case

Network Morphism Non-Linear

To insert a non-linear layer between two layers, a new activation, the parametric activation, is introduced. It starts as the identity function and learns to become a regular activation afterward; it is a wrapper around other activation functions. The equation is as follows:

Network Morphism Parametric Activation
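As a sketch, one simple parameterization is p_a(x) = (1 - a) * x + a * phi(x): the identity at a = 0 and the wrapped activation phi at a = 1. The paper's exact form may differ, so treat this as illustrative:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def p_activation(x, a, phi=relu):
    # a = 0 -> identity, so the layer can be inserted without changing
    # the network; a is then learned toward a real non-linearity.
    return (1.0 - a) * x + a * phi(x)

x = np.linspace(-2, 2, 5)
print(p_activation(x, a=0.0))  # identical to x
print(p_activation(x, a=1.0))  # identical to relu(x)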

Stand-alone Width

Network Morphism Stand-alone Width

Working through the equations in the chart above, we finally obtain the following:

Network Morphism Stand-alone Width
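In the same spirit, the sketch below widens a hidden layer by duplicating a unit and splitting its outgoing weights so that the function is unchanged. This is a common construction for width morphing, not the paper's exact derivation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 5))
W1 = rng.normal(size=(5, 3))   # input -> hidden (3 units)
W2 = rng.normal(size=(3, 2))   # hidden -> output

# Duplicate hidden unit 0, then halve its outgoing weights on both copies.
W1_wide = np.hstack([W1, W1[:, [0]]])
W2_wide = np.vstack([W2, W2[[0], :] / 2.0])
W2_wide[0, :] /= 2.0

parent = (x @ W1) @ W2
child = (x @ W1_wide) @ W2_wide
assert np.allclose(parent, child)  # same function, wider layer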

Kernel

Network Morphism Stand-alone Kernel

For kernels, we simply zero-pad: a small kernel padded with zeros computes exactly the same output, so the child network can inherit the parent's weights directly.
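A quick numpy/scipy check of this fact, zero-padding a 3x3 kernel to 5x5:

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
k3 = rng.normal(size=(3, 3))
k5 = np.pad(k3, 1)  # 5x5 kernel; the added zeros contribute nothing

out3 = correlate2d(img, k3, mode="same")
out5 = correlate2d(img, k5, mode="same")
assert np.allclose(out3, out5)  # the larger kernel computes the same output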

Subnet

Network Morphism Subnet

First, we insert the layers one by one; this is sequential subnet morphing. We then stack subnets in parallel by simply splitting G into a group of n matrices, each equal to G/n.
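A tiny sketch of the parallel case: splitting G into n equal parts and summing the branch outputs reproduces the original layer exactly:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 6))
G = rng.normal(size=(6, 4))

n = 2
parts = [G / n] * n                     # each branch carries G/n
branch_sum = sum(x @ P for P in parts)  # run branches in parallel, add up
assert np.allclose(x @ G, branch_sum)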

Auto-Keras

Auto-Keras is an open-source AutoML framework built on Keras. It uses Bayesian optimization and network morphism, and it can train CNNs, RNNs, and traditional DNNs.

Sample code:

import autokeras as ak

# Search for an image-classification architecture, then train and predict.
clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)

Official Website: https://autokeras.com/

Where to use AutoML?

AutoML suits mature scenarios such as traditional ML, image recognition, object detection, OCR, and text analysis, because in these areas the set of options to try is limited and well understood.

For immature problems, such as much audio-related research, it is not yet clear what to try, so AutoML is not applicable there.

Reference

[1] C. Thornton, F. Hutter, H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD’13, pages 847–855, 2013.

[2] E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.

[3] M. Feurer et al. Efficient and robust automated machine learning. In Proc. of NIPS'15, 2015.

[4] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.

[5] X. He, K. Zhao, and X. Chu. AutoML: a survey of the state-of-the-art. CoRR, abs/1908.00709, 2019.

[6] T. Wei et al. Network morphism. In Proc. of ICML'16, pages 564–572, 2016.
