Building Machine Learning Powered Applications: Going from Idea to Product
My notes and highlights on the book.
Author: Emmanuel Ameisen
- Part I. Find the Correct ML Approach
- Ch1. From Product Goal to ML Framing
- Ch2. Create a Plan
- Part II. Build a Working Pipeline
- Ch3. Build your first end-to-end pipeline
- Ch4. Acquire an initial dataset
- Part III. Iterate on Models
- Ch5. Train and evaluate your model
- Ch6. Debug your ML problems
- Ch7. Using classifiers for writing recommendations
- Part IV. Deploy and Monitor
- Ch8. Considerations when deploying models
- Ch9. Choose Your Deployment Option
- Ch10. Build Safeguards for Models
- Ch11. Monitor and update models
Part I. Find the Correct ML Approach
Ch1. From Product Goal to ML Framing
ML is particularly useful to build systems for which we are unable to define a heuristic solution
Start from a concrete business problem, determine whether it requires ML, then work on finding the ML approach that will allow you to iterate as rapidly as possible
- Framing your product goal in an ML paradigm
- Evaluating the feasibility of that ML task
Estimating the challenge of data acquisition ahead of time is crucial in order to succeed
Data availability scenarios
- Labeled data exists
- Weakly labeled data exists
- Unlabeled data exists
- We need to acquire data
“Having an imperfect dataset is entirely fine and shouldn’t stop you. The ML process is iterative in nature, so starting with a dataset and getting some initial results is the best way forward, regardless of the data quality.”
The Simplest Approach: being the algorithm
Start with a human heuristic and then build a simple model: initial baseline = first step toward a solution -> Great way to inform what to build next
What to focus on in an ML project
Find the impact bottleneck: piece of the pipeline that could provide the most value if improved
Imagine that the impact bottleneck is solved: would it be worth the effort you estimate it would take?
Which modeling techniques to use
Spend the manual effort to look at inputs and outputs of your model: see if anything looks weird. Looking at your data helps you think of good heuristics, models and ways to reframe the product
Ch2. Create a Plan
Measuring Success
First model: simplest model that could address a product’s needs -> generating and analyzing results is the fastest way to make progress in ML
- Baseline: heuristics based on domain knowledge
- Simple model
- Complex model
You don’t always need ML: even features that could benefit from ML can often simply use a heuristic for their first version (you may realize that you don’t need ML at all)
Business Performance
Product metrics: goals of your product or feature. Ultimately the only ones that matter, all other metrics should be used as tools to improve product metrics
Updating an app to make a modeling task easier
- Change an interface so that a model’s results can be omitted if they are below a confidence threshold
- Present a few other predictions or heuristics in addition to model’s top prediction
- Communicate to users that the model is still in an experimental phase and give them opportunities to provide feedback
“A product should be designed with reasonable assumptions of model performance in mind. If a product relies on a model being perfect to be useful, it is very likely to produce inaccurate or even dangerous results”
Freshness and Distribution Shift
Distribution of the data shifts -> model often needs to change in order to maintain the same level of performance
Leverage Domain Expertise
Best way to devise heuristics -> see what experts are currently doing. Most practical applications are not entirely novel. How do people currently solve the problem you are trying to solve?
Second best way -> look at your data. Based on your dataset, how would you solve this task if you were doing it manually?
Examining the data
EDA: process of visualizing and exploring a dataset -> to get an intuition for a given business problem. Crucial part of building any data product
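A minimal EDA sketch in pandas (the file path and column names are placeholders, not from the book):
import pandas as pd

df = pd.read_csv("dataset.csv")        # placeholder path
print(df.describe())                   # summary statistics for numeric columns
print(df.isnull().sum())               # missing values per column
print(df["label"].value_counts())      # class balance for a hypothetical "label" column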
Stand on the Shoulders of giants
- Reproduce existing results
- Build on top of them
To make regular progress: start simple
- Start with the simplest model that could address your requirements
- Build an end-to-end prototype including this model
- Judge its performance: optimization metrics and product goal
Looking at the performance of a simple model on an initial dataset is the best way to decide what task should be tackled next
Diagnose Problems
Write analysis and exploration functions:
- Visualize examples the model performs the best and worst on
- Explore data
- Explore model results
Part II. Build a Working Pipeline
Ch3. Build your first end-to-end pipeline
First iteration: lackluster by design. Goal: allow us to have all the pieces of a pipeline in place:
- prioritize which ones to improve next
- identify the impact bottleneck
“Frequently, your product is dead even if your model is successful” - Monica Rogati
Test your workflow
Evaluate:
- usefulness of the current user experience
- results of your handcrafted model
Finding the impact bottleneck
Next challenge to tackle:
- iterating on the way we present results to the users; or
- improving model performance by identifying key failure points
Ch4. Acquire an initial dataset
Understanding your data well leads to the biggest performance improvements
Iterate on datasets
Data gathering, preparation and labeling should be seen as an iterative process, just like modeling
ML engineering: engineering + ML = products
Choosing an initial dataset, regularly updating it, and augmenting it is the majority of the work
Data: best source of inspiration to develop new models and the first place to look for answers when things go wrong
Models only serve as a way to extract trends and patterns from existing data. Don’t overestimate the impact of working on the model and underestimate the value of working on the data
Before noticing predictive trends, start by examining quality
Data quality rubric
Format
Validate that you understand the way in which the data was processed
Quality
Notice the quality ahead of time -> missing labels, weak labels
Quantity and distribution
Estimate:
- is there enough data?
- are feature values within a reasonable range?
Summary statistics
Identifying differences in distributions between classes of data early: will either make our modeling task easier or prevent us from overestimating the performance of a model that may just be leveraging one particularly informative feature.
Data leakage
Using training and validation data for vectorizing/preprocessing can cause data leakage -> leveraging info from outside the training set to create training features
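A minimal scikit-learn sketch of avoiding this kind of leakage: fit the vectorizer on the training split only (the toy texts are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["first document", "second document", "third document", "fourth document"]
train_texts, val_texts = train_test_split(texts, test_size=0.25, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit on the training split only
X_val = vectorizer.transform(val_texts)          # reuse the fitted vocabulary, no refitting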
Clustering
As with dimensionality reduction: additional way to surface issues and interesting data points
Let data inform features and models
The more data you have and the less noisy your data is, the less feature engineering work you usually have to do
Feature crosses
Feature generated by multiplying (crossing) two or more features -> nonlinear combination of features -> allows our model to discriminate more easily
Giving your model the answer
New binary feature that takes a nonzero value only when the relevant combination of values appears
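A minimal pandas sketch of both ideas, a feature cross and an explicit binary flag (column names and thresholds are illustrative):
import pandas as pd

df = pd.DataFrame({"day_of_week": [5, 6, 2], "hour": [23, 10, 9]})
# Feature cross: a nonlinear combination of two existing features
df["day_x_hour"] = df["day_of_week"] * df["hour"]
# "Giving the model the answer": flag the combination of values we believe matters
df["is_weekend_night"] = ((df["day_of_week"] >= 5) & (df["hour"] >= 22)).astype(int)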
Robert Munro: how do you find, label and leverage data
Uncertainty sampling
Identify examples that your model is most uncertain about and find similar examples to add to the training set
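A minimal sketch of uncertainty sampling for a binary classifier (assumes a fitted scikit-learn-style model and an unlabeled feature matrix X_unlabeled):
import numpy as np

probs = model.predict_proba(X_unlabeled)[:, 1]   # probability of the positive class
uncertainty = np.abs(probs - 0.5)                # closest to 0.5 = most uncertain
to_label_next = np.argsort(uncertainty)[:100]    # indices of 100 examples to label next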
“Error model”
Use the mistakes your model makes as labels: “predicted correctly” or “predicted incorrectly”. Use the trained error model on unlabeled data and label the examples that it predicts your model will fail on
“Labeling model”
To find the best examples to label next. Identify data points that are most different from what you’ve already labeled and label those
Validation
While you should use strategies to gather data, you should always randomly sample from your test set to validate your model
Part III. Iterate on Models
Ch5. Train and evaluate your model
The simplest appropriate model
Not the best approach: try every possible model, benchmark and pick the one with the best results on a test set
Simple model
- Quick to implement: won’t be your last
- Understandable: debug easily
- Deployable: fundamental requirement for an ML-powered application
Model explainability and interpretability: the ability of a model to expose the reasons behind its predictions
Test set
“While using a test set is a best practice, practitioners sometimes use the validation set as a test set. This increases the risk of biasing a model toward the validation set but can be appropriate when running only a few experiments”
Data leakage
- Temporal data leakage
- Sample contamination
Always investigate the results of a model, especially if it shows surprisingly strong performance
Bias variance trade-off
- Underfitting: weak performance on the training set = high bias
- Overfitting: strong performance on the training set, but weak performance on the validation set = high variance
Evaluate your model: look beyond accuracy
- Contrast data and predictions
- Confusion matrix: see whether our model is particularly successful on certain classes and struggles on some others
- ROC Curve: plot a threshold on it to have a more concrete goal than simply getting the largest AUC score
- Calibration Curve: whether our model’s output probability represents its confidence well. Shows the fraction of positive examples as a function of the confidence of our classifier (sketch below)
- Dimensionality reduction for errors: identify a region in which a model performs poorly and visualize a few data points in it
- The top-k method
- k best performing examples: identify features that are successfully leveraged by a model
- k worst performing examples: on train: identify trends in data the model fails on, identify additional features that would make them easier for a model. On validation: identify examples that significantly differ from the train data
- k most uncertain examples: on train: often a symptom of conflicting labels. On validation: can help find gaps in your training data
Top-k implementation: book’s Github repository
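A minimal sketch of the calibration curve mentioned above, using scikit-learn (assumes a fitted model, validation features X_val, and binary labels y_val):
from sklearn.calibration import calibration_curve

probs = model.predict_proba(X_val)[:, 1]
frac_positives, mean_predicted = calibration_curve(y_val, probs, n_bins=10)
# A well-calibrated model has frac_positives close to mean_predicted in every bin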
Evaluate Feature Importance
- Eliminate or iterate on features that are currently not helping the model
- Identify features that are suspiciously predictive, which is often a sign of data leakage
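One simple, model-agnostic way to do this (not necessarily the book’s approach) is permutation importance from scikit-learn; model, X_val, y_val, and feature_names are assumed to exist:
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, importance in zip(feature_names, result.importances_mean):
    print(f"{name}: {importance:.3f}")  # suspiciously high values can hint at leakage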
Black-box explainers
Attempt to explain a model’s predictions independently of its inner workings, e.g., LIME and SHAP
Ch6. Debug your ML problems
Software Best Practices
KISS principle: building only what you need
Most software applications: strong test coverage = high confidence the app is functioning well. ML pipelines can pass many tests but still give entirely incorrect results. The pipeline doesn’t just have to run; it should produce accurate predictive outputs
Progressive approach, validate:
- Data flow
- Learning capacity
- Generalization and inference
Make sure your pipeline works for a few examples, then write tests to make sure it keeps functioning as you make changes
Visualization steps
Inspect changes at regular intervals
- Data loading: Verify data is formatted correctly
- Cleaning and feature selection: remove any unnecessary information
- Feature generation: check that the feature values are populated and that the values seem reasonable
- Data formatting: shapes, vectors
- Model output: first look if the predictions are the right type or shape, then check if the model is actually leveraging the input data
Separate your concerns
Modular organization: separate each function so that you can check that it individually works before looking at the broader pipeline. Once broken down, you’ll be able to write tests
Test your ML code
Source code on book’s Github repository
- Test data ingestion
- Test data processing
- Test model outputs
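A minimal pytest-style sketch of these three kinds of tests (the ml_pipeline module and its load_data, preprocess, and predict helpers are hypothetical placeholders, not the book’s code):
import numpy as np
from ml_pipeline import load_data, preprocess, predict  # hypothetical module

def test_data_ingestion():
    df = load_data("fixtures/sample.csv")
    assert {"text", "label"}.issubset(df.columns)  # expected columns are present
    assert len(df) > 0                             # dataset is not empty

def test_data_processing():
    features = preprocess(["a short example"])
    assert not np.isnan(features).any()            # no missing values after processing

def test_model_outputs():
    prediction = predict("a short example")
    assert 0.0 <= prediction <= 1.0                # output falls in the expected range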
Debug training: make your model learn
Contextualize model performance: generate an estimate of what an acceptable error for the task is by labeling a few examples yourself
Task difficulty
- The quantity and diversity of data you have: the more diverse/complex the problem, the more data the model needs to learn from
- How predictive the features are: make the data more expressive to help the model learn better
- The complexity of your model: simplest model is good to quickly iterate, but some tasks are entirely out of reach of some models
Debug generalization: make your model useful
- Data Leakage: if you are surprised by validation performance, inspect the features; fixing a leakage issue will lead to lower validation performance, but a better model
- Overfitting: model performs drastically better on the training set than on the test set. Add regularization or data augmentation
- Dataset redesign: use k-fold cross validation to alleviate concerns that data splits may be of unequal quality
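A minimal sketch of the k-fold idea above, using scikit-learn (model, X, and y are assumed to exist):
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)  # 5 folds, each used once for validation
print(scores.mean(), scores.std())           # a large std hints at splits of unequal quality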
“If your models aren’t generalizing, your task may be too hard. There may not be enough information in your training examples to learn meaningful features that will be informative for future data points. If that is the case, then the problem you have is not well suited for ML”
Ch7. Using classifiers for writing recommendations
Part IV. Deploy and Monitor
Production ML pipelines need to be able to detect data and model failures and handle them with grace -> proactively
Ch8. Considerations when deploying models
- How was the data you are using collected?
- What assumptions is your model making by learning from this dataset?
- Is this dataset representative enough to produce a useful model?
- How could the results of your work be misused?
- What is the intended use and scope of your model?
Data Concerns
Data ownership
- Collection
- Usage and permission
- Storage
Data bias
Datasets: results of specific data collection decisions -> lead to datasets presenting a biased view of the world. ML models learn from datasets -> will reproduce these biases
- Measurement errors or corrupted data
- Representation
- Access
Test sets
Build a test set that is inclusive, representative, and realistic -> proxy for performance in production -> improve the chances that every user has an equally positive experience
Models are trained on historical data -> state of the world in the past. Bias most often affects populations that are already disenfranchised. Working to eliminate bias -> help make systems fairer for the people who need it most
Modeling Concerns
Feedback loops
Users follow a model’s recommendation -> future models make the same recommendation -> models enter a self-reinforcing feedback loop
To limit negative effects of feedback loops -> choose a label that is less prone to creating such a loop
Inclusive model performance
Look for performance on a segment of the data, instead of only comparing aggregate performance
Adversaries
Regularly update models
Some types of attacks:
- Fool models into a wrong prediction (most common)
- Use a trained model to learn about the data it was trained on
Chris Harland: Shipping Experiments
When giving advice, the cost of being wrong is very high, so precision is the most useful
Ch9. Choose Your Deployment Option
Server-side deployment
Setting up a web server that can accept requests from clients, run them through an inference pipeline, and return the results. The server represents a central failure point for the application and can be costly if the product becomes popular
Streaming API workflow
Endpoint approach
- Quick to implement
- Requires infrastructure to scale linearly with the current number of users (1 user = 1 separate inference call)
- Required when strong latency constraints exist (info the model needs is available only at prediction time and model’s prediction is required immediately)
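A minimal sketch of such an endpoint using Flask (the route, payload format, and run_inference placeholder are illustrative, not the book’s code):
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_inference(features):
    return 0.5  # placeholder: call the trained model here

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()               # features sent by the client
    prediction = run_inference(features)
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()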
Batch Predictions
Inference pipeline as a job that can be run on multiple examples at once. Store predictions so they can be used when needed
- Appropriate when you have access to the features needed for a model before the model’s prediction is required
- Easier to allocate and parallelize resources
- Faster at inference time since results have been precomputed and only need to be retrieved (similar gains to caching)
Hybrid Approach
- Precompute as many cases as possible
- At inference either retrieve precomputed results or compute them on the spot if they are not available or are outdated
- Have to maintain both a batch and a streaming pipeline (more system complexity)
Client-side deployment
Run all computations on the client, eliminating the need for a server to run models. Models are still trained in the same manner and are sent to the device for inference
- Reduces the need to build infra
- Reduces the quantity of data that needs to be transferred between the device and the server
- Reduces network latency (app may even run without internet)
- Removes the need for sensitive information to be transferred to a remote server
If the time it would take to run inference on device is larger than the time it would take to transmit data to the server to be processed, consider running your model in the cloud
On-device deployment is only worthwhile if the latency, infrastructure, and privacy benefits are valuable enough to invest the engineering effort (simplifying a model)
Browser side
Some libraries use browsers to have the client perform ML tasks
TensorFlow.js
Train and run inference in JavaScript in the browser for most differentiable models, even those trained in other languages such as Python
Federated Learning: a hybrid approach
Each client has their own model. Each model learns from its user’s data and sends aggregated (and potentially anonymized) updates to the server. The server leverages all updates to improve its model and distills this new model back to individual clients. Each user receives a model personalized to their needs, while still benefiting from aggregate information about other users
Ch10. Build Safeguards for Models
No matter how good a model is, it will fail on some examples -> engineer a system that can gracefully handle such failures
Check inputs
- Very different data from train
- Some features missing
- Unexpected types
Input checks are part of the pipeline -> change the control flow of a program based on the quality of inputs
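A minimal sketch of an input check that changes the control flow (field names, ranges, and the run_inference placeholder are illustrative):
REQUIRED_FIELDS = {"text", "user_id"}

def run_inference(data):
    return 0.5  # placeholder for the real model

def validate_input(data):
    if not REQUIRED_FIELDS.issubset(data):      # some features missing
        return False
    if not isinstance(data["text"], str):       # unexpected type
        return False
    if not (1 <= len(data["text"]) <= 10_000):  # value outside the expected range
        return False
    return True

def handle_request(data):
    if not validate_input(data):
        # Change the control flow: fall back to a heuristic instead of the model
        return {"prediction": None, "fallback": "heuristic"}
    return {"prediction": run_inference(data)}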
Model outputs
Prediction falls outside an acceptable range -> consider not displaying it
Acceptable outcome: not only whether the outcome is plausible -> also depends on whether the outcome would be useful for the user
Model failure fallbacks
Flag cases that are too hard and encourage user to provide an easier input (e.g. well-lit photo)
Detecting errors:
- Track the confidence of a model
- Build an additional model tasked with detecting examples a main model is likely to fail on
Filtering model
- ML version of input tests
- Binary classifier
- Estimate how well a model will perform on an example without running the model on it
- Decrease the likelihood of poor results and improve resource usage
- Catch:
- qualitatively different inputs
- inputs the model struggled on
- adversarial inputs meant to fool the model
- Minimum criteria:
- should be fast (reduce the computational burden)
- should be good at eliminating hard cases
The faster your filtering model is, the less effective it needs to be
Engineer for Performance
Scale to multiple users
ML inference is horizontally scalable: add more servers to keep response time reasonable when the number of requests increases
Caching for ML
Storing the results of function calls -> future calls with the same parameters simply retrieve the stored results
Caching inference results
Least recently used (LRU) cache: keep track of the most recent inputs to a model and their corresponding outputs
- not appropriate if each input is unique
functools
Python module that provides a default implementation of an LRU cache that you can use with a simple decorator
from functools import lru_cache

@lru_cache(maxsize=128)
def run_model(data):
    # Note: lru_cache requires hashable arguments (e.g., strings or tuples)
    # Insert slow model inference below
    pass
Caching by indexing
Cache other aspects of the pipeline that can be precomputed. Easier if a model does not rely only on user inputs
“Caching can improve performance, but it adds a layer of complexity. The size of the cache becomes an additional hyperparameter to tune depending on your application’s workload. In addition, any time a model or the underlying data is updated, the cache needs to be cleared in order to prevent it from serving outdated results”
Model and data life cycle management
ML application:
- produces reproducible results
- is resilient to model updates
- is flexible enough to handle significant modelling and data processing changes
Reproducibility
Each model/dataset pair should be assigned a unique identifier -> should be logged each time a model is used in production
Resilience
- production pipeline should aim to update models without significant downtime
- if a new model performs poorly, we’d like to be able to roll back to the previous one
Data Processing and DAGs
Directed acyclic graph (DAG): can be used to represent our process of going from raw data to trained model -> each node represents a processing step and each edge represents a dependency between two nodes
DAGs help systematize, debug, and version a pipeline -> can become a crucial time saver
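A minimal sketch of the idea: represent the pipeline as a DAG of named steps and run them in dependency order (the step names are illustrative; real projects typically use a workflow tool such as Airflow):
from graphlib import TopologicalSorter  # Python 3.9+

# Each node is a processing step; each edge is a dependency between two steps
dag = {
    "clean_data": {"load_raw_data"},
    "generate_features": {"clean_data"},
    "train_model": {"generate_features"},
}

steps = {
    "load_raw_data": lambda: print("loading raw data"),
    "clean_data": lambda: print("cleaning"),
    "generate_features": lambda: print("building features"),
    "train_model": lambda: print("training"),
}

for step_name in TopologicalSorter(dag).static_order():
    steps[step_name]()  # run each step only after its dependencies have run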
Ask for feedback
- Explicitly asking for feedback (display the model’s prediction accompanied by a way for users to judge and correct it)
- Measuring implicit signals
User feedback is a good source of training data and can be the first way to notice a degradation in performance
Chris Moody: Empowering Data Scientists to Deploy Models
- Make humans and algorithms work together: spend time thinking about the right way to present information
- Canary deployment -> start deploying the new version to one instance and progressively update instances while monitoring performance
“Ownership of the entire pipeline leads individuals to optimize for impact and reliability, rather than model complexity”
Ch11. Monitor and update models
Monitoring saves lives
Monitoring: track the health of a system. For models: performance and quality of their predictions
Monitor to inform refresh rate
Detect when a model is not fresh anymore and needs to be retrained. Retraining events happen when accuracy dips below a threshold.
Monitor to detect abuse
Anomaly detection to detect attacks
Choose what to monitor
Commonly monitor metrics such as the average time it takes to process a request, the proportion of requests that fail to be processed, and the amount of available resources
Performance Metrics
- Track changes in the input distribution (feature drift)
- Monitor the input distribution (summary statistics)
- Monitor distribution shifts
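A minimal sketch of monitoring one feature for drift with a two-sample Kolmogorov–Smirnov test from SciPy (the threshold and the two arrays of feature values are assumptions):
from scipy.stats import ks_2samp

# Compare a feature's distribution at training time vs. in recent production traffic
statistic, p_value = ks_2samp(train_feature_values, recent_feature_values)
if p_value < 0.01:  # distributions differ significantly: possible feature drift
    print("Input distribution shift detected; consider retraining")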
Counterfactual evaluation: aims to evaluate what would have happened if we hadn’t acted on a model’s predictions -> not acting on a random subset of examples allows us to observe an unbiased distribution of the positive class. By comparing model predictions to true outcomes for this random data, we can begin to estimate a model’s precision and recall
Business metrics
Product metrics should be closely monitored
CI/CD for ML
- CI: letting multiple developers regularly merge their code back into a central codebase
- CD: improving the speed at which new versions of software can be released
CI/CD for ML: make it easier to deploy new models or update existing ones
“Releasing updates quickly is easy; the challenge comes in guaranteeing their quality (…) There is no substitute for live performance to judge the quality of a model”
Shadow mode: deploying a new model in parallel to an existing one. When running inference, both models’ predictions are computed and stored, but the application only uses the prediction of the existing model
- estimate a new model’s performance in a production environment without changing the user experience
- test the infrastructure required to run inference for a new model (may be more complex)
- but can’t observe the user’s response to the new model
A/B Testing and Experimentation
Goal: maximize chances of using the best model, while minimizing the cost of trying out suboptimal models
Expose a sample of users to a new model, and the rest to another. Larger control group (current model) and a smaller treatment group (new version we want to test). Run for a sufficient amount of time -> compare the results for both groups and choose the better model
Choosing groups and duration
- Users in both groups should be as similar as possible -> any difference in outcome can be attributed to the model, not to a difference between cohorts
- Treatment group should be:
- large enough: statistically meaningful conclusion
- small as possible: limit exposure to a potentially worse model
- Duration of the test:
- too short: not enough information
- too long: risk losing users
Estimating the better variant
Decide on the size of each group and the length of the experiment before running it
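A minimal sketch of one common way to size the groups before the experiment (not the book’s code), using statsmodels’ power analysis for two proportions; the baseline and expected rates are assumptions:
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.10   # conversion rate of the current model (assumed)
expected_rate = 0.12   # rate we hope the new model reaches (assumed)

effect_size = proportion_effectsize(expected_rate, baseline_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"About {int(n_per_group)} users needed in each group")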
Building the infrastructure
Branching logic: decides which model to run depending on a given field’s value (harder if a model is accessible to logged-out users)
Other approaches
Multiarmed bandits: more flexible approach, can test variants continually and on more than two alternatives. Dynamically update which model to serve based on how well each option is performing
Contextual multiarmed bandits: go even further, by learning which model is a better option for each particular user
“The majority of work involved with building ML products consists of data and engineering work”