Week 1 — Collecting, labeling, and validating data
Overview
This course is all about data. In the first week we'll go through collecting our data, labeling it, and validating it. Along the way we'll also get familiar with the TensorFlow Extended (TFX) framework and data pipelines.
"Data is the hardest part of ML and the most important piece to get right...broken data is the most common cause of problems in production ML systems" - Scaling Machine Learning at Uber with Michelangelo - Uber
The production setting for ML systems is very different from the academic one, where the data is fixed, already cleaned, and ready to be experimented with.
Production ML = ML development + software development
Beyond the ML itself, modern software development practices require production systems to account for:
- Scalability
- Extensibility
- Configuration
- Consistency & reproducibility
- Best practices
- Safety & security
- Modularity
- Testability
- Monitoring
Challenges in production-grade ML
- Build integrated ML systems
- Continuously operate them in production
- Handle continuously changing data
- Optimize compute resource costs
ML pipelines
An ML pipeline is a software architecture for automating, monitoring, and maintaining the ML workflow, from data to a trained model.
A directed acyclic graph (DAG) is a directed graph that has no cycles.
ML pipeline workflows are usually expressed as DAGs.
Pipeline orchestration frameworks are responsible for scheduling the various components in an ML pipeline according to the DAG's dependencies; essentially, they help automate the pipeline.
Examples: Airflow, Argo, Celery, Luigi, Kubeflow
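To make the DAG idea concrete, here is a minimal, hedged sketch of a three-step pipeline in Airflow (one of the orchestrators listed above); the DAG id, task ids, and placeholder step functions are all hypothetical:

```python
# Minimal Airflow DAG sketch: ingest -> validate -> train, with no cycles.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():      # placeholder: pull raw data from a source
    print("ingesting data")


def validate():    # placeholder: check the ingested data
    print("validating data")


def train():       # placeholder: fit a model on the validated data
    print("training model")


with DAG(dag_id="ml_pipeline", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)
    t_train = PythonOperator(task_id="train", python_callable=train)

    # DAG edges: downstream tasks run only after their dependencies succeed.
    t_ingest >> t_validate >> t_train
```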
TensorFlow Extended (TFX)
End-to-end platform for deploying production ML pipelines.
TFX production components are designed for scalable, high-performance machine learning tasks.
tion="Components represented by orange blocks." >}}
Collecting Data — Importance of Data
In the ML arena, data is a first-class citizen.
- Software 1.0: Explicit instructions were given to the computer
- Software 2.0:
- Specify some goal on the behaviour of a program
- Find a solution using optimization techniques
- Good data is key for success
- Code in Software = Data in ML
Models aren't magic wands; they are statistical tools, so they require meaningful data:
- Maximize predictive content
- Remove non-informative data
- Ensure feature space coverage
Key Points
- Understand users, translate user needs into data problems
- What kind of/how much data is available
- What are the details and issues of your data
- What are your predictive features
- What are the labels you are tracking
- What are your metrics
- Ensure data coverage and high predictive signal
- Source, store and monitor quality data responsibly
A few issues that may arise while collecting data:
- Inconsistent formatting
  - Is zero "0", "0.0", or an indicator of a missing measurement, sea level, or a bad sensor? (see the sketch after this list)
- Compounding errors from other ML models
- Monitor data sources for system issues and outages
- Outliers
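To illustrate the zero-ambiguity point above, here is a small pandas sketch over hypothetical sensor readings: parsing the raw strings explicitly keeps missing measurements distinct from genuine zero values:

```python
# Hypothetical sensor readings: "0" and "0.0" are real values,
# while "" and None are missing measurements.
import pandas as pd

readings = pd.DataFrame({"elevation_m": ["0", "0.0", "", "12.5", None]})

# errors="coerce" turns unparseable/empty entries into NaN instead of 0,
# so missing data is not silently confused with sea level.
readings["elevation_m"] = pd.to_numeric(readings["elevation_m"], errors="coerce")

print(readings["elevation_m"].isna().sum())  # 2 missing measurements
print((readings["elevation_m"] == 0).sum())  # 2 genuine zero readings
```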
Measure data effectiveness
- Intuition about data value can be misleading
- Which features have predictive value and which ones do not?
- Feature engineering helps to maximise the predictive signals
- Feature selection helps to measure the predictive signals (see the sketch below)
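As a hedged sketch of measuring predictive signal, scikit-learn's mutual information estimator can score features; the synthetic dataset below is purely illustrative:

```python
# One informative feature and one pure-noise feature; mutual information
# should score the noise feature near zero, marking it for removal.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)          # carries the signal
noise = rng.normal(size=n)                # carries none
y = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

X = np.column_stack([informative, noise])
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(["informative", "noise"], scores.round(3))))
```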
Responsible Data: Security, Privacy & Fairness
- Data collection and management isn't just about your model
- Give user control of what data can be collected
- Is there a risk of inadvertently revealing user data?
- Compliance with regulations and policies (e.g. GDPR)
Data privacy concerns the proper usage, collection, retention, deletion, and storage of data.
- Protect personally identifiable information
- Aggregation - replace unique values with a summary value
- Redaction - remove some data to create a less complete picture (both are illustrated in the sketch below)
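A minimal sketch of both tactics on a hypothetical user table; the column names and age bands are illustrative only:

```python
import pandas as pd

users = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "age": [23, 37, 58],
    "city": ["Oslo", "Oslo", "Bergen"],
})

# Redaction: drop the direct identifier entirely.
users = users.drop(columns=["email"])

# Aggregation: replace exact ages with coarse bands (a summary value).
users["age_band"] = pd.cut(users["age"], bins=[0, 30, 50, 120],
                           labels=["<30", "30-50", "50+"])
users = users.drop(columns=["age"])

print(users)  # city and age band remain; exact ages and emails are gone
```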
Commit to fairness
- Make sure your models are fair
- Group fairness, equal accuracy
- Bias in human labeled and/or collected data.
- ML models can amplify biases.
Reducing bias: Design fair labelling systems
- Accurate labels are necessary for supervised learning
- Labeling can be done by:
- Automation (logging or weak supervision)
- Humans (aka "Raters", often semi-supervised)
How ML systems can fail users
- Representational harm: A system will amplify or reflect a negative stereotype about particular groups.
- Opportunity denial: When a system makes predictions that have negative real life consequences that could result in lasting impacts.
- Disproportionate product failure: When the effectiveness of the model is skewed, so that failures or errors occur more frequently for particular groups of users and skewed outputs are generated.
- Harm by disadvantage: A system will infer disadvantageous associations between different demographic characteristics and user behaviour.
Types of human raters
- Generalists (usually by crowdsourcing tools)
- Subject Matter Experts (requires specialized tools, like X-Rays)
- Your users (derived labels, e.g. tagging photos)
Key points
- Ensure rater pool diversity
- Investigate rater context and incentives
- Evaluate rater tools
- Manage cost
- Determine freshness requirements
Labeling data — Data and Concept Change in Production ML
Detecting problems with deployed models
- Data and scope changes
- Monitor models and validate data to find problems early
Changing ground truth: label new training data
The ground truth may change slowly (over months or years), faster (over weeks), or very fast (within days, hours, or minutes). If the ground truth is changing very fast, you have a hard problem: it becomes important to retrain as soon as possible, using direct feedback or weak supervision to obtain fresh labels.
Key points
- Model performance decays over time
- Data and concept drift
- Model retraining helps to improve performance
- Data labeling for changing ground truth and scarce labels
Process Feedback and Human Labeling
Methods
- Process Feedback (Direct Labeling): e.g., Actual vs predicted click-through
- Human Labeling: e.g., Cardiologists labeling MRI images
- Semi-Supervised Labeling
- Active Learning
- Weak Supervision (see the sketch after this list)
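As a toy, hedged illustration of weak supervision, heuristic "labeling functions" can emit noisy labels (or abstain) and be combined by majority vote; the spam heuristics below are hypothetical:

```python
# Weak supervision sketch: each labeling function votes SPAM/HAM or abstains.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_offer(text):       # keyword heuristic
    return SPAM if "limited offer" in text.lower() else ABSTAIN


def lf_many_exclamations(text):    # punctuation heuristic
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_short_reply(text):          # short personal replies are usually ham
    return HAM if len(text.split()) < 5 else ABSTAIN


def weak_label(text):
    votes = [lf(text) for lf in (lf_contains_offer,
                                 lf_many_exclamations,
                                 lf_short_reply)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN             # no heuristic fired; leave unlabeled
    return max(set(votes), key=votes.count)


print(weak_label("LIMITED OFFER!!! Click now!!!"))  # -> 1 (SPAM)
print(weak_label("See you at noon"))                # -> 0 (HAM)
```

In practice, frameworks such as Snorkel learn to weight and denoise such labeling functions rather than taking a plain majority vote.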
Why is labeling important in production ML?
- Using data already available to the business/organisation
- Frequent model retraining
- Labeling is an ongoing and critical process
- Creating a training dataset requires labels
Direct labeling — continuous creation of training dataset
Advantages:
- Continuous creation of the training dataset
- Labels evolve quickly
- Captures strong label signals
Disadvantages:
- Hindered by the inherent nature of the problem
- Failure to capture ground truth
- Largely custom designed
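A hedged pandas sketch of direct labeling for the click-through example: served predictions are joined with later click events, so actual user behaviour becomes the label. The table and column names are hypothetical:

```python
import pandas as pd

# Log of served predictions and a log of observed click events.
predictions = pd.DataFrame({
    "impression_id": [1, 2, 3],
    "predicted_ctr": [0.9, 0.2, 0.6],
})
clicks = pd.DataFrame({"impression_id": [1]})  # only impression 1 was clicked

# Left join: impressions without a click event get label 0.
labeled = predictions.merge(
    clicks.assign(clicked=1), on="impression_id", how="left")
labeled["clicked"] = labeled["clicked"].fillna(0).astype(int)

print(labeled)  # each served prediction now carries a ground-truth label
```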
Open-Source log analysis tools
Logstash: Free and open source data processing pipeline
- Ingests data from a multitude of sources
- Transforms it
- Sends it to your favourite "stash"
Fluentd: Open source data collector
Unifies data collection and consumption
Cloud log analytics
Google Cloud Logging
- Data and events from Google Cloud and AWS
- BindPlane: logs from application components, on-premise and hybrid cloud systems
AWS ElasticSearch
Azure Monitor
Human labeling
"Raters" examine data and assign labels manually
- More labels
- Pure supervised learning
Disadvantages:
- Quality consistency: many datasets are difficult for humans to label
- Slow
- Expensive
- Small dataset curation
Validating Data — Detecting Data issues
Concept drift: the change in the statistical properties of the labels over time; the mapping $x\rightarrow y$ changes.
Data drift: changes in the data over time, such as data collected once a day; the distribution of the features ($x$) changes.
Data skew: the difference between two static versions, or different sources, such as the training set and the serving set.
Detecting distribution skew
Dataset shift occurs when the joint probability of $x$ (features) and $y$ (labels) is not the same during training and serving:
$$ P_\text{train}(y,x) \not = P_\text{serve}(y, x) $$
Covariate shift refers to the change in distribution of input variables present in training and serving data.
Marginal distribution/probability of the features is not the same during training and serving.
$$ P_\text{train}(y|x) = P_\text{serve}(y|x) \\ P_\text{train}(x) \not = P_\text{serve}(x) $$
Concept shift refers to a change in the relationship between the input and output variables, as opposed to a change in the data distribution or the input itself.
$$ P_\text{train}(y|x) \not = P_\text{serve}(y|x) \\ P_\text{train}(x) = P_\text{serve}(x) $$
Skew detection workflow
TensorFlow Data Validation (TFDV)
- Understand, validate and monitor ML data at scale
- Used to analyze and validate petabytes of data at Google every day
- Proven track record in helping TFX users maintain the health of their ML pipelines
TFDV capabilities
- Generates data statistics and browser visualizations
- Infers the data schema
- Performs validity checks against schema
- Detects training/serving skew
- Schema skew
- Feature skew
- Distribution skew
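A minimal sketch of these capabilities with the TFDV API; the CSV locations are hypothetical:

```python
import tensorflow_data_validation as tfdv

# Generate descriptive statistics for training and serving data.
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")

# Infer a schema from the training statistics and inspect it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validity check: flag serving data that does not conform to the schema.
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```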
Skew - TFDV
- Supported for categorical features
- The degree of data drift is expressed in terms of the L-infinity distance (Chebyshev distance):
$$ D_\text{Chebyshev}(x, y) = \max_i(|x_i - y_i|) $$
- Set a threshold to receive warnings (see the sketch below)
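Continuing the TFDV sketch above, skew detection is enabled per feature by setting an L-infinity threshold on its skew comparator; `payment_type` is a hypothetical categorical feature:

```python
# Warn when the L-infinity distance between the training and serving
# distributions of this feature exceeds 0.01.
feature = tfdv.get_feature(schema, "payment_type")
feature.skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
```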
Schema skew
Serving and training data don't conform to same schema:
- For example, `int != float`
Feature skew
Training feature values are different from the serving feature values:
- Feature values are modified between training and serving time
- Transformation applied only in one of the two instances
Distribution skew
Distribution of serving and training dataset is significantly different:
- Faulty sampling method during training
- Different data source for training and serving data
- Trend, seasonality, changes in data over time
Key points
TFDV: descriptive statistics at scale, with embedded Facets visualizations.
It provides insight into:
- What are the underlying statistics of your data
- How do your training, evaluation, and serving dataset statistics compare
- How can you detect and fix data anomalies
Week 1 references
Week 1: Collecting, Labeling and Validating Data
This is a compilation of optional resources including URLs and papers appearing in lecture videos. If you wish to dive more deeply into the topics covered this week, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Papers
Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013
Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2020). Challenges in deploying machine learning: A survey of case studies. http://arxiv.org/abs/2011.09926
ML code fraction:
Sculley, D., Holt, G., Golovin, D., Davydov, E., & Phillips, T. (2015). Hidden technical debt in machine learning systems. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf