Skip to content

Week 1 — Intro to Model Serving and Infrastructure

Introduction to Model Serving

What exactly is Serving a Model?

Model Serving Patterns

Components used by inference process:

  • A model,
  • An interpreter
  • Input data

ML worflows

Batch/static Learning: model training is performed offline on a set of already collected data.

Online/Dynamic learning: model is regularly being retrained as new data arrives as is in the case of time series data.

Batch Inference: when the inference is not done instantly and is pushed in a batch to be done.

Realtime Inference: predictions are generated in real time using available data at the time of the request.

Important Metrics

  1. Latency: Delay between user's action and response of application to user's action. Latency of the whole process, starting from sending data to server, performing inference using model and returning response

Minimal latency is a key requirement to maintain customer satisfaction

  1. Throughput: Number of successful requests served per unit time say one second

    In some applications only throughput is important and not latency.

  2. Cost

    The cost associated with each inference should be minimised

    Important infrastructure requirements that are expensive:

    • CPU
    • Hardware Accelerators like GPU
    • Caching infrastructure for faster data retrieval

Minimizing latency, Maximising Throughput

Minimizing latency

  • Airline Recommendation Service
  • Reduce latency for user satisfaction

Maximising Throughput

  • Airline recommendation service faces high load of inference requests per second.

Scale infrastructure (number of servers, caching requirements etc.) to meet requirements

Balance Cost, Latency and Throughput

  • Cost increases as infrastructure is scaled
  • In applications where latency and throughput can suffer slightly:
    • Reduce costs by GPU sharing
    • Multi-model serving etc..,
    • Optimizing models used for inference

Introduction to Model Serving Infrastructure

Optimizing Models for Serving

As Model Complexity Increases Cost Increases

Increased requirements means Increased cost and increases hardware requirement management of larger model registries leading to higher support and maintenance burden.

Balancing Cost and Complexity

The challenge for ML practitioners is to balance complexity and cost

Optimizing and Satisficing Metrics

Use of Accelerators in Serving Infrastructure

Maintaining Input Feature Lookup

  • Prediction request to your ML model might not provide all features required for prediction
  • For example, estimating how long food delivery will require accessing features from a data store:
    • Incoming orders (not included in request)
    • Outstanding orders per minute in the past hour
  • Additional pre-computed or aggregated features might be read in real-time form a data store.
  • Providing that data store is a cost

NoSQL Databases: Caching and Feature Lookup

Google Cloud Memorystore: In memory cache, sub-millisecond read latency

Google Cloud Firestore: Scaleable, can handle slowly changing data, millisecond read latency

Google Cloud Bigtable: Scaleable, handles dynamically changing data, millisecond read latency

Amazon DynamoDB: Single digit millisecond read latency, in memory cache available.

Deployment Options

Two options:

  1. Data centers
  2. Embedded devices

Running in Huge Data Centers

Constrained Environment: Mobile Phone

Restrictions in a Constrained Environement

Prediction Latency is Almost Always Important

  • Opt for on-device inference whenever possible
    • Enhances user experience by reducing the response time of your app

Improving Prediction Latency and Reducing Resource Costs

For mobile deployed models

Web Applications for Users

Serving systems for easy deployment

  • Centralized model deployment
  • Predictions as service

Eliminates need for custom web applications by getting the mode deployed in just a few lines of code. They can also make is easy to rollback/update models on the fly.


TensorFlow Serving

Advantages of Serving with a Managed Service

Installing TensorFlow Serving

  • Docker Image:

    • Easiest and most recommended method
    docker pull tensorflow/serving
    docker pull tensorflow/serving:latest-gpu

Install TensorFlow Serving

  • Building From Source
  • Install using Aptitude (apt-get) on a Debian-based Linux system
!echo "deb stabletensorflow-model-server tensorflow-model-server-universal" |
tee /etc/apt/source.list.d/tensorflow-serving.list && \
curl https://storage/ | apt-key add -

!apt update

!apt-get install tensorflow-model-server

Import the MNIST Dataset

mnist = tf.keras.datasets.mnist
(train_image, train_labels), (test_images, test_labels) = mnist.load_data()

# Scale the values of the arrays below to be between 0.0 and 1.0
train_images = train_images / 255.0
test_images = test_images / 255.0

# Reshape the arrays below
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)

print("\ntrain_images.shape: {}, of {}".format(
  train_images.shape train_images.dtype))
print("test_images.shape: {}, of {}".format(
  test_images.shape test_images.dtype))

Look at a sample Image

idx = 42

plt.imshow(test_images[idx].reshape(28, 28),
plt.title("True label: {}".format(test_labels[idx]), fontdict={'size': 16})

Build a Model

# Create a model
model = tf.keras.Sequential([
      input_shape=(28, 28, 1),
      filters=8, kernel_size=3,
      strides=2, activation='relu',
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='softmax')


Train the Model

# Configure the model for training

hisory =, train_labels, epochs=epochs)

Evaluate the Model

# evaluate the model on the test iamges
results_eval = model.evaluate(test_images, test_labels, verbose=0)

for metric, value in zip(model.metrics_names, results_eval):
        print(metric + ': {:.3}'.format(value))

Save the Model

MODEL_DIR = tempfile.gettempdir()
version = 1
export_path = os.path.join(MODEL_DIR, str(version))

if os.path.isdir(export_path):
    print('\n Already saved a model, clearning up\n')
    !rm -r {export_path}, save_format='tf')

print('\nexport_path = {}'.format(export_path))
!ls -l {export_path}

Launch Your Saved Model

os.environ['MODEL_DIR'] = MODEL_DIR

%%bash --bg
nohup tensorflow_model_server \
    --rest_apt_port=8501 \
    --model_name=digits_model \
    --model_base_path="${MODEL_DIR}" > server.log 2>&1
!tail server.log

Sending an Inference Request

data = json.dumps({
  "signature_name": "serving_default",
  "isntances": test_images[0:3].tolist()})

headers = {"content-type": "application/json"}

json_response =
  data=data, headers=headers)

predictions = json.loads(json_response.text)['predictions']

Plot Predictions

plt.figure(figsize=(10, 15))

for i in range(3):
  plt.subplot(1, 3, i+1)
  plt.imshow(test_iamges[i].reshape(28, 28),
  color = 'green' if np.argmax(predictions[i]) == test_labels[i]  else 'red'
  plt.title('Prediction: {}\n True Label: {}'.format(
    test_labels[i]), color=color)