Week 3 — ML Experiment Management and Workflow Automation & MLOps Methodology

ML Experiments Management and Workflow Automation

Experiment Tracking

The need for rigorous processes and reproducible results creates a need for experiment tracking.

Why experiment tracking?

ML projects have far more branching and experimentation than typical software projects
Debugging in ML is difficult and time-consuming
Small changes can lead to drastic changes in a model's performance and resource requirements
Running experiments can be time-consuming and expensive

What does it mean to track experiments?

Enable you to duplicate a result
Enable you to meaningfully compare experiments
Manage code/data versions, hyperparameters, environment, metrics
Organize them in a meaningful way
Make them available to access and collaborate on within your organization

Simple Experiments with Notebooks

Notebooks are great tools
Notebook code is usually not promoted to production
Tools for managing notebook code
- nbconvert (.ipynb → .py conversion)
- nbdime (diffing)
- jupytext (conversion+versioning)
- neptune-notebooks (versioning+diffing+sharing)

Smoke testing for Notebooks

jupyter nbconvert --to script train_model.ipynb
python train_model.py

Not Just One Big File

Modular code, not monolithic
Collections of independent and versioned files
Directory hierarchies or monorepos
Code repositories and commits

Tracking Runtime Parameters

Config files

Parameter values can be tracked along with the other code files

data:
    train_path: '/path/to/my/train.csv'
    valid_path: '/path/to/my/valid.csv'

model:
    objective: 'binary'
    metric: 'auc'
    learning_rate: 0.1
    num_boost_round: 200
    num_leaves: 60
    feature_fraction: 0.2

Or on the command line

Passing parameters on the command line requires additional code to save the values and associate them with the experiment. This is an extra burden, but it also makes the values available for analysis and visualization, rather than having to parse them out of a specific commit.

python train_evaluate.py \
    --train_path '/path/to/my/train.csv' \
    --valid_path '/path/to/my/valid.csv' \
    --objective 'binary' \
    --metric 'auc' \
    --learning_rate 0.1 \
    --num_boost_round 200 \
    --num_leaves 60 \
    --feature_fraction 0.2

Log Runtime Parameters

Example of code that saves runtime parameters set from the command line:

import argparse
import neptune

parser = argparse.ArgumentParser()
parser.add_argument('--number_trees')
parser.add_argument('--learning_rate')
args = parser.parse_args()

# Log every parsed argument as a parameter of the experiment.
neptune.create_experiment(params=vars(args))
...
# experiment logic
...
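
The same idea can be sketched without any tracking service. The example below is a minimal illustration, not how any particular tool works: it parses two parameters and writes them to a JSON file in a per-run directory. The function names, the `params.json` file name, the `run_001` directory, and the default values are all assumptions made for this sketch.

```python
import argparse
import json
import tempfile
import time
from pathlib import Path


def parse_params(argv=None):
    """Parse runtime parameters from the command line (or from a given list)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--number_trees", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    return parser.parse_args(argv)


def log_params(args, run_dir):
    """Write the parsed parameters to <run_dir>/params.json and return the path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"logged_at": time.time(), "params": vars(args)}
    path = run_dir / "params.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


# Example: log one run's parameters into a temporary directory.
args = parse_params(["--number_trees", "200", "--learning_rate", "0.05"])
params_path = log_params(args, Path(tempfile.mkdtemp()) / "run_001")
saved = json.loads(params_path.read_text())
```

Because the values end up in a structured file rather than only in a shell history or a commit, they can later be loaded for comparison across runs.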

Tools for Experiment Tracking

Data Versioning

Data reflects the world, and the world changes
Experimental changes include changes in data
Tracking, understanding, comparing, and duplicating experiments includes data
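
One simple way to make data part of an experiment record is to store a content fingerprint of each dataset file. The sketch below is an assumption-laden illustration, not the mechanism of any of the tools listed in these notes: it hashes a file with SHA-256 so that two runs can be checked for whether they used identical data. The file names and contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example: two versions of a small, made-up CSV file.
data_dir = Path(tempfile.mkdtemp())
v1 = data_dir / "train_v1.csv"
v2 = data_dir / "train_v2.csv"
v1.write_text("id,label\n1,0\n2,1\n")
v2.write_text("id,label\n1,0\n2,1\n3,1\n")

fingerprint_v1 = dataset_fingerprint(v1)
fingerprint_v2 = dataset_fingerprint(v2)
```

Logging the fingerprint next to the hyperparameters makes it possible to tell whether a change in metrics came from the code or from the data.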

Tools for Data Versioning

Neptune
Pachyderm
Delta Lake
Git LFS
Dolt
lakeFS
DVC
ML-Metadata

Experiment tracking to compare results

As you gain experience with the tools, you'll get more comfortable
Log every metric that you might care about
Tag experiments with a few consistent tags which are meaningful to you
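
To illustrate why consistent metrics and tags matter, here is a minimal sketch of comparing runs held in memory. The `Run` class, the `best_run` helper, and all run names, parameters, and metric values are hypothetical; a real tracking tool would provide its own query interface.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    name: str
    params: dict
    metrics: dict
    tags: set = field(default_factory=set)


def best_run(runs, metric, tag=None):
    """Return the run with the highest value of `metric`, optionally filtered by tag."""
    candidates = [
        r for r in runs
        if metric in r.metrics and (tag is None or tag in r.tags)
    ]
    if not candidates:
        raise ValueError("no runs match the filter")
    return max(candidates, key=lambda r: r.metrics[metric])


# Made-up runs for illustration only.
runs = [
    Run("run_001", {"learning_rate": 0.1}, {"auc": 0.81}, {"baseline"}),
    Run("run_002", {"learning_rate": 0.05}, {"auc": 0.84}, {"lgbm"}),
    Run("run_003", {"learning_rate": 0.01}, {"auc": 0.83}, {"lgbm"}),
]
```

A comparison like `best_run(runs, "auc", tag="lgbm")` only works if every run logged the same metric name and used the same tag vocabulary.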

Example: Logging metrics using TensorBoard

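from datetime import datetime
from tensorflow import keras

# One log directory per run, named by its start time.
logdir = "logs/image/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tensorboard_callback = keras.callbacks.TensorBoard(
  log_dir=logdir, histogram_freq=1)
# log_confusion_matrix is a user-defined function called at the end of each epoch.
cm_callback = keras.callbacks.LambdaCallback(on_epoch_end=log_confusion_matrix)

model.fit(..., callbacks=[tensorboard_callback, cm_callback])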

Organizing model development

Search through & visualize all experiments
Organize into something digestible
Make data shareable and accessible
Tag and add notes that will be more meaningful

Tooling for Teams

Vertex TensorBoard

Managed service with enterprise-grade security, privacy, and compliance
Persistent, shareable link to your experiment dashboard
Searchable list of all experiments in a project

Experiments are iterative in nature

Creative iterations for ML experimentation
Define a baseline approach
Develop, implement, and evaluate to get metrics
Assess the results, and decide on next steps
Latency, cost, fairness, GDPR, etc.

Experiment Tracking

Learn more about experiment tracking by checking these two resources out:

Introduction to MLOps

Data Scientists vs Software Engineers

Data Scientists

Often work on fixed datasets
Focused on model metrics
Prototyping on Jupyter notebooks
Expert in modeling techniques and feature engineering
Model size, cost, latency, and fairness are often ignored

Software Engineers

Build a product
Concerned about cost, performance, stability, schedule
Identify quality through customer satisfaction
Must scale solution, handle large amounts of data
Detect and handle error conditions, preferably automatically
Consider requirements for security, safety, fairness
Maintain, evolve, and extend the product over long periods

Growing Need for ML in products and Services

Large datasets
Inexpensive on-demand compute resources
Increasingly powerful accelerators for ML
Rapid advances in many ML research fields (such as computer vision, natural language understanding, and recommendation systems)
Businesses are investing in their data science teams and ML capabilities to develop predictive models that can deliver business value to their customers

Key problems affecting ML efforts today

We've been here before

In the 90s, Software Engineering was siloed
Version control was weak, and CI/CD didn't exist
Software was slow to ship; now it ships in minutes
Is that ML today?

Today's perspective

Models blocked before deployment
Slow to market
Manual Tracking
No reproducibility or provenance
Inefficient collaboration
Unmonitored models

Bridging ML and IT with MLOps

Continuous Integration (CI): Testing and validating code, components, data, data schemas, and models
Continuous Delivery (CD): Not only about deploying a single software package or a service, but a system which automatically deploys another service (model prediction service)
Continuous Training (CT): A new process, unique to ML systems, that automatically retrains candidate models for testing and serving
Continuous Monitoring (CM): Catching errors in production systems, and monitoring production inference data and model performance metrics tied to business outcomes.

ML Solution Lifecycle

Standardizing ML processes with MLOps

ML Lifecycle Management
Model Versioning & iteration
Model Monitoring and Management
Model Governance
Model Security

MLOps Methodology

MLOps Level 0

What defines an MLOps process' maturity?

The level of automation of ML pipelines determines the maturity of the MLOps process
As maturity increases, the available velocity for the training and deployment of new models also increases
Goal is to automate training and deployment of ML models into the core software system, and provide monitoring.

MLOps level 0: Manual process

The process of developing and deploying the model is manual. This creates a disconnect between the ML and operations teams. It also opens the door to training-serving skew.

A new model version is probably deployed only a couple of times a year. Because code changes are so infrequent, Continuous Integration (CI), and often even unit testing, is ignored entirely.

A level 0 process is concerned only with deploying the trained model as a prediction service. It also does not track or log model predictions and actions, which are required for detecting model degradation and other behavioral drift.

Challenges for MLOps level 0

Need for actively monitoring the quality of your model in production
Retraining your production models with new data
Continuously experimenting with new implementations to improve the data and model

MLOps Levels 1&2

MLOps level 1: ML pipeline automation

One of the main goals of level 1 is to perform continuous training of the model.

That requires introducing automated data and model validation steps to the pipeline, as well as pipeline triggers and metadata management.

Notice how the transition from one step to another in the experiment orchestration is automated.
Models are automatically retrained using fresh data based on live pipeline triggers.
The pipeline implementation that is used in the development or experimentation is also used in the pre-production and production environment.
The components need to be modularized, ideally be containerized.
An ML pipeline in production continuously delivers new models that are trained on new data to prediction services
Compared to level 0, where we were just deploying the model, here we are deploying the whole training pipeline, which automatically and recurrently runs to serve the trained model.

When we deploy the pipeline to production, one or more of the triggers automatically executes the pipeline.
The pipeline expects new live data to produce a new model version that is trained on the new data, so automated data validation and model validation steps are required in ML pipelines.
Data validation decides whether to retrain the model or stop the execution of the pipeline:

This decision is made automatically: retraining proceeds only if the data is deemed valid. For example, data schema mismatches are considered anomalies; in that case the pipeline execution should be stopped and a notification raised for the team to investigate.
Model validation and evaluation are done before promoting the model to production.

The newly trained model needs to be assessed on test data, and its evaluation metrics compared with those of the current model in production.

The performance also needs to be consistent across different slices of the data.
In addition to offline model validation, a newly deployed model undergoes online model validation, in either a canary deployment or an A/B testing setup, during the transition to serving predictions for online traffic.
Feature Store: A feature store is a centralized repository where you standardize the definition, storage, and access of features for training and serving.
- A feature store also lets you discover and reuse available feature sets instead of recreating the same or similar ones. By maintaining features and their related metadata, it avoids having similar features with different definitions.
Metadata store: This is where information about each execution of the pipeline is recorded in order to help with data and artifact lineage, reproducibility, and comparisons. This can help us debug errors and anomalies. In case of an interruption, it also allows you to resume execution seamlessly.
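
The two automated gates described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the behavior of any specific pipeline framework: the schema is a dict of column names to Python types, metrics are dicts with an `overall` key plus one key per data slice, and every function name, column name, and number is hypothetical.

```python
def validate_schema(rows, expected_schema):
    """Return a list of anomalies: missing columns or values of the wrong type."""
    anomalies = []
    for i, row in enumerate(rows):
        for column, expected_type in expected_schema.items():
            if column not in row:
                anomalies.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                anomalies.append(f"row {i}: '{column}' is not {expected_type.__name__}")
    return anomalies


def should_promote(candidate_metrics, production_metrics, min_slice_score):
    """Promote only if the candidate beats production overall and every slice clears the floor."""
    better_overall = candidate_metrics["overall"] > production_metrics["overall"]
    slices_ok = all(
        score >= min_slice_score
        for name, score in candidate_metrics.items()
        if name != "overall"
    )
    return better_overall and slices_ok


def run_pipeline(rows, expected_schema, train_and_evaluate,
                 production_metrics, min_slice_score):
    """Return 'stopped', 'rejected', or 'promoted' for one triggered pipeline run."""
    if validate_schema(rows, expected_schema):
        return "stopped"  # data anomalies: halt and notify the team
    candidate_metrics = train_and_evaluate(rows)
    if should_promote(candidate_metrics, production_metrics, min_slice_score):
        return "promoted"
    return "rejected"


# Made-up inputs for illustration only.
schema = {"age": int, "clicked": int}
good_rows = [{"age": 31, "clicked": 1}, {"age": 45, "clicked": 0}]
bad_rows = [{"age": "31", "clicked": 1}]  # wrong type for 'age'
production = {"overall": 0.85}


def fake_trainer(rows):
    """Stand-in for training plus offline evaluation; returns fixed metrics."""
    return {"overall": 0.86, "slice_mobile": 0.84, "slice_desktop": 0.88}
```

The data gate runs before any training cost is paid, and the model gate checks slices as well as the overall metric, so a model that improves on average but regresses on one slice is not promoted.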

MLOps level 2: CI/CD pipeline automation

The truth is that at the current stage of the development of MLOps best practices, level two is still somewhat speculative

The diagram presents one of the current architectures, focused on enabling rapid and reliable updates of the pipelines in production.

MLOps Resources

If you want to learn more about MLOps check this blog out, and visit this curated list of references for more information, ideas, and tools.

Developing Components for an Orchestrated Workflow

Pre-built and standard components, and 3 styles of custom components
Components can also be containerized
Examples of things you can do with TFX components:
- Data augmentation, upsampling, or downsampling
- Anomaly detection based on confidence intervals or autoencoder reconstruction error
- Interfacing with external systems like help desks for alerting and monitoring and more...

Anatomy of a TFX component

Components are essentially composed of a component specification and executor class packaged inside a component class.

Component specification

The component's input and output contract and parameters used for component execution.

Executor class

Provides the implementation for the component's processing

Component Class

Combines the component specification with the executor to create a TFX component

TFX components at runtime

When a pipeline runs a TFX component the component is executed in three phases:

The driver uses the component specification to retrieve the required artifacts from the metadata store and pass them into the component.
The executor performs the component's work.
The publisher uses the component specification and the results from the executor to store the component's output in the metadata store.

Custom components usually only require changes to the executor class. Changes to the driver or publisher should only be necessary if we want to change the interaction between the pipeline's components and the metadata store.

If we want to change the inputs, outputs, or parameters of a component, we only need to modify the component specification.
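
The three phases can be mimicked in a few lines of plain Python. This is a toy model for intuition only, not TFX code: the `MetadataStore` class is an in-memory dict rather than ML Metadata, the spec is a plain dict, and `scale_executor` is a made-up executor.

```python
class MetadataStore:
    """A toy in-memory stand-in for a metadata store (not the ML Metadata API)."""

    def __init__(self):
        self._artifacts = {}

    def put(self, name, value):
        self._artifacts[name] = value

    def get(self, name):
        return self._artifacts[name]


def run_component(spec, executor, store):
    """Run one component in three phases: driver, executor, publisher."""
    # Driver: resolve the inputs named in the spec from the metadata store.
    inputs = {name: store.get(name) for name in spec["inputs"]}
    # Executor: do the component's actual work.
    outputs = executor(inputs, spec["parameters"])
    # Publisher: record the outputs named in the spec back into the store.
    for name in spec["outputs"]:
        store.put(name, outputs[name])
    return outputs


def scale_executor(inputs, parameters):
    """Hypothetical executor: multiply every example by a constant factor."""
    return {"scaled_examples": [x * parameters["factor"] for x in inputs["examples"]]}


store = MetadataStore()
store.put("examples", [1, 2, 3])
spec = {
    "inputs": ["examples"],
    "outputs": ["scaled_examples"],
    "parameters": {"factor": 10},
}
run_component(spec, scale_executor, store)
```

Swapping in a different executor changes the work done without touching the driver or publisher steps, which mirrors why most custom components only need a new executor.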

Types of custom components:

Python function-based custom components
- Only require a Python function for the executor, with a decorator and argument annotations.
Container-based custom components
- Provide the flexibility to integrate code written in any language into your pipeline by wrapping the component inside a Docker container.
Fully custom components
- Let us build components by defining the component specification, executor, and component interface classes.

Python function-based components

In this style you write a function that is decorated and annotated with type hints.

The type hints describe the input artifacts, output artifacts, and parameters of your component.

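from tfx.dsl.component.experimental.annotations import (
    InputArtifact, OutputArtifact, OutputDict, Parameter)
from tfx.dsl.component.experimental.decorators import component
from tfx.types.standard_artifacts import Model

@component
def MyValidationComponent(
  model: InputArtifact[Model],
  blessing: OutputArtifact[Model],
  accuracy_threshold: Parameter[int] = 10,
  ) -> OutputDict(accuracy=float):
  '''My simple custom model validation component.'''

  # evaluate_model and write_output_blessing are user-defined helpers.
  accuracy = evaluate_model(model)
  if accuracy >= accuracy_threshold:
    write_output_blessing(blessing)

  return {
    'accuracy': accuracy
  }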

Container-based components

To create one we need to specify a Docker container image that includes our component dependencies.

from tfx.dsl.component.experimental import container_component, placeholders
from tfx.types import standard_artifacts

grep_component = container_component.create_container_component(
  name='FilterWithGrep',
  inputs={'text': standard_artifacts.ExternalArtifact},
  outputs={'filtered_text': standard_artifacts.ExternalArtifact},
  parameters={'pattern': str},
  ...
  image='google/cloud-sdk:278.0.0',
  command=[
    'sh', '-exec',
    ...
    ...
    '--pattern', placeholders.InputValuePlaceholder('pattern'),
    '--text', placeholders.InputUriPlaceholder('text'),
    '--filtered-text',
    placeholders.OutputUriPlaceholder('filtered_text'),
  ],
)

There are other parts of the configuration, such as the container image name and, optionally, the image tag. For the body of the component, the command parameter defines the container entry point command line. The command line can use placeholder objects that are replaced at compilation time with the inputs, outputs, or parameters.

Fully custom components

Define custom component spec, executor class, and component class
Component reusability
- Reuse a component spec and implement a new executor that derives from an existing component

Defining input and output specifications

These inputs and outputs are wrapped in channels, essentially dictionaries of typed parameters for input and output artifacts.

PARAMETERS is a dictionary of additional execution parameters that are passed into the executor and are not metadata artifacts.

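from typing import Text

from tfx import types
from tfx.types import standard_artifacts
from tfx.types.component_spec import ChannelParameter, ExecutionParameter

class HelloComponentSpec(types.ComponentSpec):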
  INPUTS = {
    # This will be a dictionary with input artifacts, including URIs
    'input_data': ChannelParameter(type=standard_artifacts.Examples),
  }
  OUTPUTS = {
    # This will be a dictionary which this component will populate
    'output_data': ChannelParameter(type=standard_artifacts.Examples),
  }
  PARAMETERS = {
    # These are parameters that will be passed in the call to create
    # an instance of this component
    'name': ExecutionParameter(type=Text),
  }

Implement the executor

import json
import os
from typing import Any, Dict, List, Text

import tensorflow as tf
from tfx import types
from tfx.dsl.components.base import base_executor
from tfx.types import artifact_utils
from tfx.utils import io_utils

class Executor(base_executor.BaseExecutor):
  def Do(self, input_dict: Dict[Text, List[types.Artifact]],
          output_dict: Dict[Text, List[types.Artifact]],
          exec_properties: Dict[Text, Any]) -> None:
    ...
    # Map each split name to the directory that holds its input files.
    split_to_instance = {}
    for artifact in input_dict['input_data']:
      for split in json.loads(artifact.split_names):
        uri = os.path.join(artifact.uri, split)
        split_to_instance[split] = uri
    # Copy every file of every split from the input to the output artifact.
    for split, instance in split_to_instance.items():
      input_dir = instance
      output_dir = artifact_utils.get_split_uri(
                      output_dict['output_data'], split)
      for filename in tf.io.gfile.listdir(input_dir):
        input_uri = os.path.join(input_dir, filename)
        output_uri = os.path.join(output_dir, filename)
        io_utils.copy_file(src=input_uri, dst=output_uri, overwrite=True)

Make the component pipeline-compatible

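from typing import Optional, Text

from tfx import types
from tfx.dsl.components.base import base_component
from tfx.dsl.components.base.executor_spec import ExecutorClassSpec
from tfx.types import channel_utils
from tfx.types import standard_artifacts
from hello_component import executor

class HelloComponent(base_component.BaseComponent):
    # HelloComponentSpec is the component spec class defined earlier.
    SPEC_CLASS = HelloComponentSpec
    EXECUTOR_SPEC = ExecutorClassSpec(executor.Executor)

    def __init__(self,
      input_data: types.Channel = None,
      output_data: types.Channel = None,
      name: Optional[Text] = None):
      if not output_data:
        # Default output: an Examples artifact with the same splits as the input.
        examples_artifact = standard_artifacts.Examples()
        examples_artifact.split_names = input_data.get()[0].split_names
        output_data = channel_utils.as_channel([examples_artifact])

      spec = HelloComponentSpec(input_data=input_data, output_data=output_data, name=name)
      super(HelloComponent, self).__init__(spec=spec)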

Assemble into a TFX pipeline

def _create_pipeline():
  ...
  example_gen = CsvExampleGen(input_base=examples)

  hello = component.HelloComponent(
    input_data=example_gen.outputs['examples'],
    name='HelloWorld')

  statistics_gen = StatisticsGen(
    examples=hello.outputs['output_data'])
  ...
  return pipeline.Pipeline(
    ...
    components=[example_gen, hello, statistics_gen, ...],
    ...
  )

Architecture for MLOps using TFX, Kubeflow Pipelines, and Cloud Build

To learn more about MLOps using TFX please check this document out.

Model Management and Deployment Infrastructure

Managing Model Versions

Why versioning ML Models?

In normal software development, teams and individuals rely on version control software to manage and control changes to their code. This helps them stay in sync with each other, roll back if new changes cause havoc, and assess their development.

Similarly, in ML model development, model versioning helps teams keep track of changes to code, data, and configs so that they can reproduce results and collaborate.

How ML Models are versioned?

How software is versioned

Version: MAJOR.MINOR.PATCH

MAJOR: Contains incompatible API changes
MINOR: Adds functionality in a backwards compatible manner
PATCH: Makes backwards compatible bug fixes.

ML models versioning

No uniform standard accepted yet
Different organizations have different meanings and conventions

A Model Versioning Proposal

Version: MAJOR.MINOR.PIPELINE

MAJOR: Incompatibility in data or target variable
MINOR: Model performance is improved
PIPELINE: Pipeline of model training is changed
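
A small sketch of how such a version string might be parsed and incremented follows. The `ModelVersion` class is hypothetical, and resetting the lower fields to zero on a MAJOR or MINOR bump is an assumption borrowed from semantic versioning; the proposal above does not specify it.

```python
from typing import NamedTuple


class ModelVersion(NamedTuple):
    major: int
    minor: int
    pipeline: int

    @classmethod
    def parse(cls, text):
        """Parse a 'MAJOR.MINOR.PIPELINE' string into a ModelVersion."""
        major, minor, pipeline = (int(part) for part in text.split("."))
        return cls(major, minor, pipeline)

    def bump(self, change):
        """Return the next version for a 'major', 'minor', or 'pipeline' change."""
        if change == "major":      # incompatible data or target variable
            return ModelVersion(self.major + 1, 0, 0)
        if change == "minor":      # improved model performance
            return ModelVersion(self.major, self.minor + 1, 0)
        if change == "pipeline":   # changed training pipeline
            return ModelVersion(self.major, self.minor, self.pipeline + 1)
        raise ValueError(f"unknown change type: {change}")

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.pipeline}"


current = ModelVersion.parse("1.4.2")
```

For example, `str(current.bump("minor"))` gives `"1.5.0"` under the reset assumption stated above.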

TFX uses pipeline execution versioning. In this style, a new version is defined with each successfully run training pipeline. Models will be versioned regardless of changes to model architecture, input, or output.

Retrieving older models

Can an ML framework be leveraged to retrieve previously trained models?
An ML framework may be versioning models internally

What is model lineage?

Model lineage is a set of relationships among the artifacts that resulted in the trained model.

Artifacts

Artifacts are the information needed to preprocess data and generate results (code, data, config, model).

To build model artifacts, you have to be able to track the code that builds them and the data, including preprocessing operations, that the model was trained and tested on.

ML orchestration frameworks (like TFX) may store operations and data artifacts to recreate the model. Model lineage usually only includes the artifacts and operations that were part of model training. Post-training artifacts and operations are usually not part of the lineage.
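
Lineage can be pictured as a graph from each artifact to the artifacts it was produced from. The sketch below is illustrative only: the `produced_from` mapping and every artifact name in it are invented, and a real orchestration framework would record these relationships in its metadata store.

```python
# Each artifact maps to the artifacts it was directly produced from.
# All names here are made up for illustration.
produced_from = {
    "model:v3": ["train_examples:v7", "trainer_code:abc123", "config:v2"],
    "train_examples:v7": ["raw_data:2021-06", "preprocess_code:def456"],
    "raw_data:2021-06": [],
    "trainer_code:abc123": [],
    "preprocess_code:def456": [],
    "config:v2": [],
}


def lineage(artifact, graph):
    """Return every upstream artifact of `artifact`, found by depth-first traversal."""
    seen = set()
    stack = list(graph.get(artifact, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


model_lineage = lineage("model:v3", produced_from)
```

The result contains the training data, the raw data it came from, the preprocessing and training code, and the config, which is the set of things needed to recreate the model.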

What is model registry?

A model registry is a central repository for storing trained models.

Supports various operations of the ML model development lifecycle
Promotes model discovery, model understanding, and model reuse
Integrated into OSS and commercial ML platforms

Metadata stored by model registry

Metadata usually includes:

Model versions
Model serialized artifacts
Free text annotations and structured properties.
Links to other ML artifact and metadata stores
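
A toy in-memory registry makes the listed metadata concrete. This is a sketch under assumptions, not the API of any registry product: `ModelEntry`, `ModelRegistry`, the `stage` field, and all model names, URIs, and annotation values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    name: str
    version: int
    artifact_uri: str                                  # where the serialized model lives
    annotations: dict = field(default_factory=dict)    # free-form properties
    stage: str = "none"


class ModelRegistry:
    """Toy in-memory registry keyed by (model name, version)."""

    def __init__(self):
        self._entries = {}

    def register(self, name, artifact_uri, **annotations):
        """Store a new version of `name`; versions count up from 1."""
        latest = max((v for (n, v) in self._entries if n == name), default=0)
        entry = ModelEntry(name, latest + 1, artifact_uri, annotations)
        self._entries[(name, entry.version)] = entry
        return entry

    def get(self, name, version):
        return self._entries[(name, version)]

    def search(self, **annotations):
        """Return entries whose annotations match every given key/value pair."""
        return [
            e for e in self._entries.values()
            if all(e.annotations.get(k) == v for k, v in annotations.items())
        ]

    def promote(self, name, version, stage):
        self.get(name, version).stage = stage


registry = ModelRegistry()
first = registry.register("churn", "models/churn/1", team="growth", auc=0.81)
second = registry.register("churn", "models/churn/2", team="growth", auc=0.84)
registry.promote("churn", 2, "production")
```

Search over annotations supports discovery, and the stage field is one simple way to model an approval step before deployment.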

Capabilities Enabled by Model Registries

Model search/discovery and understanding
Approval/Governance
Collaboration/Discussion
Streamlined deployments
Continuous evaluation and monitoring

Examples of Model Registries

Azure ML Model registry
SAS model manager
MLflow Model Registry
Google AI platform
Algorithmia

ML Model Management

Take a deeper dive into managing ML model versions by checking this blog out.

Continuous Delivery

Continuous Delivery helps promote robust deployment.

What is Continuous Integration (CI)

What is Continuous Delivery (CD)

CI/CD Infrastructure

Unit Testing in CI

In unit testing, we test that each component in the pipeline produces the expected artifacts.

In addition to unit testing our code, following the standard practices of software development, there are two additional types of unit tests when doing CI for ML:

Unit testing Data
Unit testing Model performance

Unit Testing Input Data

Unit testing of data is not the same as performing data validation on your raw features.
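
One possible reading, which is an assumption of this sketch rather than something these notes state, is that unit tests for data exercise the feature engineering code on small hand-made inputs, including edge cases, whereas data validation checks the raw incoming data itself. The `bucketize_age` function and its thresholds are invented for illustration.

```python
import unittest


def bucketize_age(age):
    """Hypothetical feature engineering step: map an age to a coarse bucket."""
    if age is None or age < 0:
        return "unknown"
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


class BucketizeAgeTest(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(bucketize_age(10), "minor")
        self.assertEqual(bucketize_age(30), "adult")
        self.assertEqual(bucketize_age(70), "senior")

    def test_bucket_boundaries(self):
        self.assertEqual(bucketize_age(17), "minor")
        self.assertEqual(bucketize_age(18), "adult")
        self.assertEqual(bucketize_age(65), "senior")

    def test_missing_and_invalid_values(self):
        self.assertEqual(bucketize_age(None), "unknown")
        self.assertEqual(bucketize_age(-1), "unknown")


# Run the tests programmatically, as a CI step might.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BucketizeAgeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these run in seconds on mocked inputs, so they fit in CI, while validation of the real data runs inside the pipeline.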

Unit Testing Model performance

ML Unit Testing Considerations

Infrastructure validation

Infrastructure validation acts as an early warning layer before pushing a model into production, to avoid issues with models that might not run or might perform badly when actually serving requests in production.

When to apply infrastructure validation

Before starting CI/CD as part of model training
It can also occur as part of CI/CD, as a last check to verify that the model is deployable to the serving infrastructure.

TFX InfraValidator

TFX InfraValidator takes the model, launches a sandboxed model server with the model, and checks whether it can be successfully loaded and, optionally, queried.
InfraValidator uses the same model server binary, the same resources, and the same server configuration as production.
InfraValidator only interacts with the model server in the user-configured environment to see if it works as expected. Configuring this environment correctly ensures that infra validation passing or failing is indicative of whether the model would be servable in the production serving environment.

Continuous Delivery

Explore this website to learn more about continuous delivery.

Progressive Delivery

Progressive Delivery is essentially an improvement over Continuous Delivery.

Complex Model Deployment Scenarios

Progressive delivery usually involves having multiple versions deployed at the same time so that comparisons in performance can be made

You can deploy multiple models performing same task
Deploying competing models, as in A/B testing
Deploying as shadow models, as in Canary testing

Blue/Green deployment

Traffic passes through a load balancer that directs it to the current live environment, called blue. Meanwhile, a new version is deployed to the green environment, which acts as a staging setup where a series of tests are conducted to ensure performance and functionality.

After the tests pass, traffic is directed to the green deployment. If a problem arises, traffic can be moved back to the blue version.

No Downtime
Quick rollback & reliable
Smoke testing in production environment
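
The switch-and-rollback mechanics can be sketched with a toy router. This is a minimal illustration, assuming models are plain callables and the smoke test is a function that returns True or False; `BlueGreenRouter` and the two stand-in models are hypothetical, and a real setup would reconfigure a load balancer instead.

```python
class BlueGreenRouter:
    """Routes all traffic to one live environment; the other one is staging."""

    def __init__(self, blue_model):
        self.environments = {"blue": blue_model, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, model):
        """Deploy a new version to the environment that receives no traffic."""
        self.environments[self.idle] = model

    def switch_if_healthy(self, smoke_test):
        """Flip traffic to the idle environment only if its model passes the smoke test."""
        candidate = self.environments[self.idle]
        if candidate is not None and smoke_test(candidate):
            self.live = self.idle
            return True
        return False

    def rollback(self):
        """Send traffic back to the previously live environment."""
        self.live = self.idle

    def predict(self, request):
        return self.environments[self.live](request)


def model_v1(request):
    return "v1"


def model_v2(request):
    return "v2"


router = BlueGreenRouter(model_v1)
router.deploy_to_idle(model_v2)
```

Because the old version stays deployed in the idle environment after the switch, rolling back is just another flip of the live pointer.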

Canary deployment

Similar to blue/green deployment, but instead of switching the entire incoming traffic from blue to green all at once, traffic is switched gradually.

As traffic begins consuming the new version, its performance is monitored. If necessary, the deployment can be stopped and reversed with no downtime and minimal exposure of users to the new version.

Eventually all traffic is transferred to the new version.
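
A gradual rollout with an automatic reversal can be sketched as follows. The traffic steps, the error-rate limit, and the `observe_error_rate` callback are assumptions made for this example; in practice the health signal would come from production monitoring.

```python
import random


def route(rng, canary_fraction):
    """Send a request to the new version with probability `canary_fraction`."""
    return "new" if rng.random() < canary_fraction else "current"


def canary_rollout(fractions, observe_error_rate, max_error_rate):
    """Increase the new version's traffic share step by step.

    Returns the final traffic fraction: the last step if every step stayed
    healthy, or 0.0 if the rollout was reversed.
    """
    for fraction in fractions:
        if observe_error_rate(fraction) > max_error_rate:
            return 0.0  # reverse the rollout: all traffic back to the current version
    return fractions[-1]


# Made-up rollout schedule and health signals.
steps = [0.05, 0.25, 0.5, 1.0]
healthy = canary_rollout(steps, lambda fraction: 0.01, max_error_rate=0.02)
unhealthy = canary_rollout(
    steps,
    lambda fraction: 0.01 if fraction < 0.5 else 0.08,
    max_error_rate=0.02,
)

# Roughly 10% of requests should reach the new version at a 0.1 fraction.
rng = random.Random(0)
share = sum(route(rng, 0.1) == "new" for _ in range(10_000)) / 10_000
```

In the unhealthy case the problem only shows up at the 50% step, so at most half of the traffic ever saw the faulty version before the rollout was reversed.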

Live Experimentation

Model metrics are usually not exact matches for business objectives
Examples: Recommender systems
- Model trained on clicks
- Business wants to maximise profit
- Example: Different products have different profit margins

Live Experimentation: A/B Testing

In A/B testing we have at least two different models (or n) and we compare the business results between them to select the model that gives the best business performance.

Users are divided into two groups
Users are randomly routed to different models in environment
You gather business results from each model to see which one is performing better
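
The steps above can be sketched as follows. Note one substitution: instead of drawing a random group per request, this sketch assigns groups by hashing the user ID, which is a common way to keep each user on the same model for the whole experiment while still spreading users across groups. The function names, the salt, and the profit figures are all made up.

```python
import hashlib


def assign_group(user_id, groups=("A", "B"), salt="experiment-1"):
    """Deterministically map a user to a group by hashing the salted user ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]


def mean_profit_by_group(events):
    """Average the business metric per group from (group, profit) pairs."""
    totals, counts = {}, {}
    for group, profit in events:
        totals[group] = totals.get(group, 0.0) + profit
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up business results gathered from each model's users.
events = [("A", 1.0), ("A", 3.0), ("B", 4.0), ("B", 2.0), ("B", 6.0)]
means = mean_profit_by_group(events)
winner = max(means, key=means.get)

assigned = {assign_group(f"user-{i}") for i in range(1000)}
```

This compares raw averages only; a real experiment would also check that the difference is statistically significant before picking a winner.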

Progressive Delivery

Explore more about progressive delivery with Kubernetes operators allowing for minimum downtime and easy rollbacks in this documentation.
