Week 1 — Hyperparameter tuning: searching for the best architecture
Overview
This course focuses on tools and techniques to effectively manage modeling resources and best serve batch and real-time inference requests.
- Effective search strategies for the best model that will scale for various serving needs
- Constraining model complexity and hardware requirements
- Optimizing and managing compute, storage, and I/O needs
- We'll be going through TensorFlow Model Analysis (TFMA)
Neural Architecture Search
- Neural architecture search (NAS) is a technique for automating the design of artificial neural networks
- It helps find the optimal architecture
- This is a search over a huge space
- AutoML provides algorithms to automate this search
Types of parameters in ML models
- Trainable parameters
- Learned by the algorithm during training
- e.g., weights of a neural network
- Hyperparameters
- Set before launching the learning process
- Not updated in each training step
Hyperparameters are of two types:
- Model hyperparameters which influence model selection such as the number and width of hidden layers
- Algorithm hyperparameters which influence the speed and quality of the learning algorithm such as the learning rate for Stochastic Gradient Descent (SGD) and the number of nearest neighbors for a k Nearest Neighbors (KNN) classifier.
- e.g., the learning rate (an algorithm hyperparameter) or the number of units in a dense layer (a model hyperparameter)
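To make the distinction concrete, here is a minimal Keras sketch (the specific values are arbitrary placeholders): the hidden-layer width is a model hyperparameter, while the SGD learning rate is an algorithm hyperparameter.

from tensorflow import keras

# Model hyperparameter: the width of the hidden layer (fixed before training).
HIDDEN_UNITS = 64
# Algorithm hyperparameter: the SGD learning rate (also fixed before training).
LEARNING_RATE = 0.01

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(HIDDEN_UNITS, activation='relu'),
    keras.layers.Dense(10),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=LEARNING_RATE),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# The weights inside the Dense layers are the trainable parameters; they are
# the only values updated at each training step.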
Manual hyperparameter tuning is not scalable
The process of finding the optimal set of hyperparameters is called hyperparameter tuning or hypertuning.
- Hyperparameters can be numerous even for small models
- e.g., shallow DNN
- Architecture choices
- Activation functions
- Weight initialization strategy
- Optimization hyperparameters such as the learning rate and stopping condition
- Tuning them manually can be a real brain teaser
- Tuning them well helps boost model performance.
Automating hyperparameter tuning with Keras Tuner
- Automation is key.
- Keras Tuner:
- Hyperparameter tuning with TensorFlow 2.0
- Many search methods available
Keras Autotuner Demo
- Does the model need more or fewer hidden units to perform well?
- How does model size affect the convergence speed?
- Is there a trade-off between convergence speed, model size, and accuracy?
- Search automation is the natural path to take
- Keras Tuner's built-in search functionality will help!
- Keras Tuner has four tuners available with built-in strategies:
RandomSearch
Hyperband
BayesianOptimization
Sklearn
import tensorflow as tf
from tensorflow import keras
import keras_tuner as kt

def model_builder(hp):
    '''
    Builds the model and sets up the hyperparameters to tune.

    Args:
        hp - Keras tuner object

    Returns:
        model with hyperparameters to tune
    '''
    # The model you set up for hypertuning is called a hypermodel.
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=(28, 28)))

    # Tune the number of units in the first Dense layer: 16 to 512 in steps of 16.
    hp_units = hp.Int('units', min_value=16, max_value=512, step=16)
    model.add(keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(10))

    # The final layer outputs logits, so from_logits=True.
    model.compile(
        optimizer='adam',
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])

    return model
# Keras Tuner supports multiple search strategies; here we use Hyperband.
# (The four built-in tuners are listed above: RandomSearch, Hyperband,
# BayesianOptimization, and Sklearn.)
tuner = kt.Hyperband(
    model_builder,
    objective='val_accuracy',
    max_epochs=10,
    factor=3,
    directory='my_dir',
    project_name='intro_to_kt')

# Callback configuration: stop early when validation loss stops improving.
stop_early = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5)

tuner.search(
    X_train, y_train,
    epochs=50,
    validation_split=0.2,
    callbacks=[stop_early])
Search Output
The search reports the best value found for each hyperparameter; in this run, 48 hidden units. We can then build and train a model with those values:
best_hps=tuner.get_best_hyperparameters()[0]
h_model = tuner.hypermodel.build(best_hps)
h_model.fit(X_train, y_train, epochs=NUM_EPOCHS, validation_split=0.2)
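To confirm what the search selected (the 48 units mentioned above), you can inspect the best hyperparameters directly; results_summary prints the top trials:

# Inspect the value the search selected for the 'units' hyperparameter.
print(best_hps.get('units'))  # e.g., 48 in the run described above

# Print a summary of the best trials found during the search.
tuner.results_summary()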
AutoML — Intro to AutoML (Automated Machine Learning)
AutoML is aimed at developers with very little ML experience, letting them make use of ML models and techniques by automating the entire workflow end to end.
Neural Architecture Search (❤️ of AutoML)
Strictly speaking, NAS refers to the process of automating architecture engineering.
Three main parts:
- Search space: Defines the range of architecture which can be represented.
- Search strategy: Defines how we explore the search space.
- Performance estimation strategy: Defines how we measure and compare the performance of candidate architectures.
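The three parts fit together in a simple loop. Here is a conceptual sketch (the object and method names are illustrative pseudocode, not a real library API):

def neural_architecture_search(search_space, search_strategy, estimator, n_trials):
    '''Conceptual NAS loop tying the three components together.'''
    best_arch, best_score = None, float('-inf')
    for _ in range(n_trials):
        # The search strategy proposes an architecture from the search space.
        arch = search_strategy.propose(search_space)
        # The performance estimation strategy scores it (e.g., validation accuracy).
        score = estimator.estimate(arch)
        # Feedback guides the next proposal (Bayesian, evolutionary, RL, ...).
        search_strategy.update(arch, score)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch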
Understanding Search Spaces
Types of Search Space:
- Macro
- Micro
Node: A node is a layer in a neural network.
An arrow from layer $\text{L}_i$ to $\text{L}_j$ indicates that layer $\text{L}_j$ receives the output of $\text{L}_i$ as input.
Macro Architecture Search Space
A macro search space contains the individual layers and connection types of a neural network.
- A NAS searches within that space for the best model, building the model layer by layer.
In a chain-structured Neural Network Architecture (NNA), the search space is parametrized by:
- The operation every layer can execute
- Hyperparameters associated with the operation
- The number n of sequential, fully-connected layers
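As a sketch of this chain-structured parametrization using Keras Tuner from the demo above (the hyperparameter names and ranges here are illustrative), the tuner can choose the depth, each layer's width, and each layer's activation:

def chain_model_builder(hp):
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=(28, 28)))
    # The number of sequential layers is itself part of the search space.
    for i in range(hp.Int('num_layers', min_value=1, max_value=5)):
        model.add(keras.layers.Dense(
            # Hyperparameters associated with each layer's operation.
            units=hp.Int(f'units_{i}', min_value=16, max_value=256, step=16),
            activation=hp.Choice(f'activation_{i}', ['relu', 'tanh'])))
    model.add(keras.layers.Dense(10))
    model.compile(
        optimizer='adam',
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])
    return model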
Micro Architecture Search Space
In a micro search space, NAS builds a neural network from cells, where each cell is a smaller network.
- Cells are stacked to produce the final network.
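A minimal sketch of the cell-stacking idea with the Keras functional API (the cell shown here is a hand-written residual block; in real NAS, the structure inside the cell is what gets searched):

from tensorflow import keras

def cell(x, filters):
    '''A small sub-network that will be repeated to form the full model.'''
    y = keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = keras.layers.Conv2D(filters, 3, padding='same')(y)
    # Residual connection, as in many discovered cells.
    return keras.layers.Activation('relu')(keras.layers.Add()([x, y]))

inputs = keras.Input(shape=(32, 32, 16))
x = inputs
# Stack the same cell several times to produce the final network.
for _ in range(3):
    x = cell(x, filters=16)
outputs = keras.layers.Dense(10)(keras.layers.GlobalAveragePooling2D()(x))
model = keras.Model(inputs, outputs)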
Search Strategies
- Find the architecture that produces the best performance
A few search strategies
- Grid Search
- Random Search
- Bayesian Optimization
- Evolutionary Algorithms
- Reinforcement Learning
Grid Search & Random Search
Grid Search
- Exhaustive search approach on fixed grid values
Random Search:
- Select the next options randomly within the search space.
Both are suitable for small search spaces.
Both quickly fail as the search space grows.
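The difference is easy to see in code; a sketch over a toy two-hyperparameter space:

import itertools
import random

learning_rates = [0.001, 0.01, 0.1]
num_units = [32, 64, 128]

# Grid search: exhaustively try every combination on the fixed grid (9 trials).
grid_trials = list(itertools.product(learning_rates, num_units))

# Random search: sample each hyperparameter independently; it is not tied to
# a grid, so continuous ranges can be sampled directly.
random_trials = [(10 ** random.uniform(-3, -1), random.choice(num_units))
                 for _ in range(5)]

With k values for each of d hyperparameters, the grid needs k^d trials, which is why both approaches break down as the space grows.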
Bayesian Optimization
- Assumes that a specific probability distribution underlies the performance.
- Tested architectures constrain the probability distribution and guide the selection of the next option.
- This way, promising architectures can be stochastically determined and tested.
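Keras Tuner exposes this strategy directly; swapping it in for the Hyperband tuner from the earlier demo is a small change (max_trials here is an arbitrary choice):

tuner = kt.BayesianOptimization(
    model_builder,               # same hypermodel as in the demo above
    objective='val_accuracy',
    max_trials=20,               # number of architectures to try
    directory='my_dir',
    project_name='bayes_opt_kt')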
Evolutionary Methods
- Candidate architectures are treated as a population: the best performers are selected and mutated or recombined to produce the next generation, while poor performers are removed.
Reinforcement Learning
- The agent's goal is to maximize a reward.
- The available options are selected from the search space.
- The performance estimation strategy determines the reward.
Reinforcement Learning for NAS
- A neural network can be specified by a variable-length string.
- Hence, an RNN can be used to generate that string; this RNN is referred to as the controller.
- After training the specified model (the "child network") on real data, we measure its accuracy on a validation set.
- This accuracy then determines the reinforcement learning reward.
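A conceptual sketch of that loop (all names here are illustrative pseudocode, not a real API):

def rl_nas_loop(controller, n_iterations):
    for _ in range(n_iterations):
        # The controller RNN samples a variable-length string that
        # specifies a child network architecture.
        arch_string = controller.sample_architecture()
        # Train the child network on real data...
        child = build_and_train(arch_string)
        # ...and use its validation accuracy as the reward.
        reward = child.validation_accuracy()
        # Update the controller with a policy-gradient step so that
        # high-reward architectures become more likely to be sampled.
        controller.reinforce(arch_string, reward)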
Neural Architecture Search
If you wish to dive more deeply into neural architecture search, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Measuring AutoML Efficacy
Performance Estimation Strategy
The search strategies in neural architecture search need to estimate the performance of generated architectures, so that they can in turn generate even better performing architectures.
The simplest approach is to train each candidate and measure its validation accuracy, but doing this for every architecture is computationally expensive, which motivates the cost-reduction strategies below.
Strategies to Reduce the Cost
- Lower fidelity estimates
- Learning Curve Extrapolation
- Weight Inheritance/Network Morphisms
Lower fidelity estimates
Lower fidelity or precision estimates try to reduce training time by reframing the problem to make it easier to solve.
This is done in one or more of these ways (see the sketch after this list):
- Training on a subset of the data
- Using lower-resolution images
- Using fewer filters per layer and fewer cells
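A sketch of the first two reductions, assuming an MNIST-like dataset:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Lower fidelity 1: train candidates on a small subset of the data.
x_small, y_small = x_train[:5000], y_train[:5000]

# Lower fidelity 2: downscale images (28x28 -> 14x14) to cut compute per step.
x_small = tf.image.resize(x_small[..., tf.newaxis], [14, 14])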
Learning Curve Extrapolation
This approach is based on the assumption that you have mechanisms to predict the learning curve reliably.
- Extrapolates based on initial learning
- Removes poor performers
- Progressive NAS uses a similar approach: it trains a surrogate model and uses it to predict performance from architectural properties.
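A toy sketch of the extrapolation idea, fitting a simple saturating curve to the first few epochs (real systems use far more sophisticated parametric models):

import numpy as np

# Validation accuracy observed over the first 5 epochs of a candidate.
epochs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
val_acc = np.array([0.60, 0.72, 0.78, 0.81, 0.83])

# Fit acc ~ a + b / epoch by linear least squares (b will be negative).
A = np.stack([np.ones_like(epochs), 1.0 / epochs], axis=1)
(a, b), *_ = np.linalg.lstsq(A, val_acc, rcond=None)

# Extrapolate to epoch 50; prune the candidate if the prediction is poor.
predicted_final = a + b / 50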
Weight Inheritance/Network Morphisms
- Initialize weights of new architectures based on previously trained architectures.
- Similar to transfer learning
- Uses Network Morphism
- Modifies the architecture without changing the underlying function.
- New network inherits knowledge from parent network
- Computational speed up: only a few days of GPU usage
- Network size is not inherently bounded
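A minimal NumPy sketch of one such morphism, in the spirit of Net2Net's "deeper" operator (the sizes are arbitrary): a new identity-initialized layer leaves the parent's function unchanged, so the child starts from inherited knowledge rather than from scratch.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Parent network: one hidden layer with inherited (already trained) weights.
W1 = np.random.randn(8, 16)

def parent(x):
    return relu(x @ W1)

# Morphism: insert a new layer initialized to the identity. Because
# relu(h @ I) == relu(h) for non-negative h, the function is unchanged.
W_new = np.eye(16)

def child(x):
    return relu(relu(x @ W1) @ W_new)

x = np.random.randn(4, 8)
assert np.allclose(parent(x), child(x))  # same function, deeper network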
AutoML on the Cloud
Amazon SageMaker Autopilot
Automatically trains and tunes the model for classification or regression based on your data.
Microsoft Azure AutoML
Google Cloud AutoML
How do these Cloud offerings perform AutoML?
- We don't know (or can't say) and they're not about to tell us.
- The underlying algorithms will be similar to what we've learned.
- The algorithms will evolve with the state of the art
AutoML
If you wish to dive more deeply into AutoML, feel free to check out these cloud-based tools. You won’t have to read these to complete this week’s practice quizzes.