# The Hundred-Page Machine Learning Book

My notes and highlights on the book.

Author: Andriy Burkov

# Table of Contents

- Table of Contents
- 1 Introduction
- 2 Notation and Definitions
- 3 Fundamental Algorithms
- 4 Anatomy of a Learning Algorithm
- 5 Basic Practice
- 6 Neural Networks and Deep Learning
- 7 Problems and Solutions
- 8 Advanced Practice
- 9 Unsupervised Learning
- 10 Other Forms of Learning
- 11 Conclusion

# 1 Introduction

## What is Machine Learning

Process of solving a practical problem:

- Gathering a dataset
- Algorithmically building a statistical model based on that dataset to be used somehow to solve the practical problem

## Supervised Learning

Dataset is a collection of **labeled examples**

Goal is to use the dataset to produce a model that takes a feature vector as input and outputs information that allows deducing the label for this feature vector

## Unsupervised Learning

Dataset is a collection of **unlabeled examples**

Goal is to create a model that takes a feature vector as input and either transforms it into another vector or into a value that can be used to solve a practical problem

## Semi-supervised Learning

Dataset contains both labeled and unlabeled examples. Usually unlabeled quantity » labeled quantity

Goal is the same as supervised learning. When you add unlabeled examples, you add more information about your problem, a larger sample reflects better the probability distribution the data we labeled came from

## Reinforcement Learning

Machine “lives” in an environment and is capable of perceiving the **state** as a vector of features. Machine can execute **actions** in every state. Different actions bring different **rewards** and could also move the machine to another state.

The goal of RL algorithm is to learn a **policy**. A policy is a function that takes the feature vector of a state as input and outputs an optimal action to execute. The action is optimal if it maximizes the expected average reward

## Why the Model Works on New Data

*PAC (“probably approximately correct”) learning*: theory that helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier

# 2 Notation and Definitions

# 3 Fundamental Algorithms

# 4 Anatomy of a Learning Algorithm

## Building Blocks of a Learning Algorithm

- a loss function
- an optimization criterion based on the loss function
- an optimization routine leveraging training data to find a solution to the optimization criterion

## Gradient Descent

Iterative optimization algorithm for finding the minimum of a function

Find a *local minimum*: Starts at some random point and takes steps proportional to the negative of the gradient of the function at the current point

Gradient descent proceeds in **epochs**. Epoch: using the training set entirely to update each parameter

The learning rate controls the size of an update

Regular gradient descent is sensitive to the choice of the learning rate and slow for large datasets

Minibatch stochastic gradient descent (SGD): speed up the computation by approximating the gradient descent using smaller batches (subsets) of the training data.

Upgrades to SGD:

- Adagrad
- Momentum
- RMSprop
- Adam

## How Machine Learning Engineers Work

You don’t implement algorithms yourself, you use libraries, most of which are open source -> *scikit-learn*

## Learning Algorithms’ Particularities

- different hyperparameters
- some can accept categorical features
- some allow the data analyst to provide weightings for each class -> influence the decision boundary
- some given a feature vector only output the class -> others the probability
- some allow for online learning
- some can be used for both classification and regression

# 5 Basic Practice

## Feature Engineering

Transforming raw data into a dataset. Labor-intensive process, demands creativity and domain knowledge

Highly informative features = high predictive power

**Low bias**: predicts the training data well

## One-Hot Encoding

Transform categorical feature into several binary ones -> increase the dimensionality of the feature vectors

## Binning

Also called bucketing. Convert a continuous feature into multiple binary features (bins or buckets), based on value range

Can help the learning algorithm to learn using fewer examples

## Normalization

Converting the actual range of values into a standard range of values, typically in the interval [-1, 1] or [0, 1]

Can increase speed of learning. Avoid numerical overflow

## Standardization

Also called **z-score normalization**. Values are rescaled so that they have the properties of a standard normal distribution with `mean=0`

and `stdev=1`

If feature has outliers -> prefer standardization than normalization

Feature rescaling -> usually benefical to most learning algorithms

## Dealing with Missing Features

- Remove the examples (if data big enough)
- Using algorithm that can deal with missing features
- Data imputation

## Data Imputation Techniques

- Average
- Median
- Value outside the normal range (i.e., -1 for data in [0, 1])
- Use the missing value as target for a regression problem
- If data large enough: add binary indicator (another column) for each feature with missing value

Use the same data imputation technique to fill the missing values on the test set you used to complete the training data

## Learning Algorithm Selection

- Explainability
- In-memory vs. out-of-memory
- Number of features and examples
- Categorical vs. numerical features
- Nonlinearity of the data
- Training speed
- Prediction speed
- Test on the validation set

## Three Sets

- Training set
- Validation set
- Test set

Shuffle the examples and split the dataset into three subsets. Training set is usually the biggest one, use it to build the model. Validation and test sets are roughly the same sizes, much smaller than the training set. The learning algorithm cannot use these two subsets to build the model -> those two are also often called *holdout sets*

Why two holdout sets? We use the validation set to choose the learning algorithm and find the best hyperparameters. We use the test set to assess the model before putting it in production

## Underfitting and Overfitting

**High bias**: model makes many mistakes on the training data -> **underfitting**. Reasons:

- model is too simple for the data
- the features are not informative enough

Solutions:

- try a more complex model
- engineer features with higher predictive power

**Overfitting**: model predicts very well the training data but poorly the data from at least one of the two holdout sets. Reasons:

- model is too complex for the data
- too many features but a small number of training examples

**High variance**: error of the model due to its sensitivity to small fluctuations in the training set

The model learn the idiosyncrasies of the training set: the noise in the values of features, the sampling imperfection (due to small dataset size) and other artifacts extrinsic to the decision problem at hand but present in the training set

Solutions:

- try a simpler model
- reduce the dimensionality of the dataset
- add more training data
- regularize the model

## Regularization

Methods that force the learning algorithm to build a less complex model. Often leads to slightly higher bias but significantly reduces the variance -> **bias-variance tradeoff**

**L1 regularization**: produces a sparse model, most of its parameters equal to zero. Makes feature selection by deciding which features are essential for prediction.*Lasso regularization***L2 regularization**: penalizes larger weights, if your only goal is to decrease variance, L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.*Ridge Regularization***Elastic net regularization**: combine L1 and L2

Neural networks also benefit from two other regularization techniques:

- Dropout
- Batch Normalization

Also non-mathematical methods have a regularization effect: data augmentation and early stopping

## Model Performance Assessment

Model *generalizes well*: model performs well on predicting the test set

Overfitting: error on the test data is *substantially higher* then the error obtained in the training data

### Confusion Matrix

Table that summarizes how successful the classification model is at predicting examples belonging to various classes

Used to calculate two other metrics: precision and recall

### Precision/Recall

**Precision**: ratio of correct positive predictions to the overall number of positive predictions: TP/(TP+FP)**Recall**: ratio of correct positive predictions to the overall number of positive examples in the dataset: TP/(TP+FN)

In practice, almost always have to choose between high precision or high recall -> usually impossible to have both

- assign a higher weighting to the examples of a specific class
- tune hyperparameters to maximize precision or recall on the validation set
- vary the decision threshold for algorithms that return probabilities of classes

### Accuracy

Number of correctly classified examples divided by the total number of classified examples: (TP+TN)/(TP+TN+FP+FN)

Useful metric when errors in predicting all classes are equally important

### Cost-Sensitive Accuracy

When different classes have different importances

Assign a cost (positive number) to both types of mistakes: FP and FN. Then compute the counts TP, TN, FP, FN as usual and multiply the counst for FP and FN by the corresponding cost before calculating the accuracy normally

### Area under the ROC Curve (AUC)

ROC curve (“receiver operating characteristic”, comes from radar engineering): use a combination of the **true positive rate** (define exactly as recall) and **false positive rate** (proportion of negative examples predicted incorrectly) to build up a summary picture of the classification performance

TPR = TP/(TP+FN)

FPR = FP/(FP+TN)

ROC curvers can only be used to assess classifiers that return some confidence score (or a probability) of prediction

The higher the **area under the ROC curve (AUC)** the better the classifier. AUC > 0.5 -> better than a random classifier. AUC = 1 -> perfect classifier -> TPR closer to 1 while keeping FPR near 0

### Hyperparameter Tuning

**Grid Search**: when you have enough data for a validation set and the number of hyperparameters and their range is not too large**Random Search**: instead of providing discrete set of values to explore, you provide a statistical distribution for each hyperparameter from which values are randomly samples and set the total number of combinations you want to try**Bayesian hyperparameter optimization**: use past evaluation results to choose the next values to evaluate**Gradient-based techniques****Evolutionary optimization techniques**

### Cross-Validation

When you have few training examples, it could be prohibitive to have both validation and test set. You would prefer to use more data to train the model. In such case, you only split your data into training and test. Then you use **cross-validation** on the training set to simulate a validation set

# 6 Neural Networks and Deep Learning

# 7 Problems and Solutions

# 8 Advanced Practice

## Handling Imbalanced Datasets

- Set the cost of misclassification of examples of the minority class higher
- oversampling -> make multiple copies of the example of some class
- undersampling -> randomly remove from training set some examples of the majority class
- synthetic minority oversampling technique (SMOTE)
- adaptive synthetic sampling method (ADASYN)
- algorithms less sensitive to imbalanced datasets: Decision trees, Random Forest, Gradient Boosting

## Combining Models

Ensemble models, typically combine models of the same nature. Boost performance by combining hundreds of weak models. We can sometimes get an additional performance gain by combining strong models made with different learning algorithms (two or three models):

- averaging
- majority vote
- stacking

Stacking: building a meta-model that takes the output of base models as input. Make sure your stacked model performs better on the validation set than each of the base models you stacked. When severaluncorrelatedstrong models agree they are more likely to agree on the correct outcome

## Training Neural Networks

- Challenge to convert your data into the input the network can work with (i.e., resize images, word embeddings)
- The choice of specific NN architecture is a difficult one
- Decide the number of layers, their type and size
- Regularization

## Advanced Regularization

For NNs, besides L1 and L2 regularization:

- Dropout: for each pass, temporarily exclude at random some units frrom the computation
- Early Stopping: stop training once observe a decreased performance on the validation set
- Batch Normalization: standardize the outputs of each layer
- Data augmentation: create a synthetic example from an original by applying various transformations

## Handling Multiple Inputs

Multimodal data -> e.g., input is an image and text and binary output indicates whether the text describes this image

It’s hard to adapt shallow learning algorithms to work with multimodal data -> train one shallow model on the image and another one in the text

## Handling Multiple Outputs

Some problems you would like to predict multiple outputs for one input -> sometimes can convert into a multi-label classification problem -> Subnetworks

## Transfer Learning

Pick an existing model trained on some dataset, and adapt this model to predict examples from another dataset, different from the one the model was built on

- Build a deep model on the original big dataset
- Compile a much smaller labeled dataset for your second model
- Remove the last one or several layers from the first model
- Replace the removed layers with new layers adapted for the new problem
- “Freeze” the parameters of the layers remaining from the first model
- Use your smaller labeled dataset and gradient descent to train the parameters of only the new layers

## Algorithmic Efficiency

**Big O notation**: classify algorithms according to how their running time or space requirements grow as the input size grows. Complexity measured in the worst case

- avoid using loops whenever possible
- use appropriate data structures (e.g., if order is not important, use a set instead of a list)
- use dict (hashmap) -> allows you to define a collection of key-value pairs with very fast lookups for keys
- use Scientific Python packages -> many methods implemented in C
- if you need to iterate over a vast collection of elements, use generators that create a function that return one element at a time rather than all the elements at once
*cProfile*package to find inefficiencies*multiprocessing*package*PyPy*,*Numba*

# 9 Unsupervised Learning

# 10 Other Forms of Learning

## Metric Learning

You can create a metric that would work better for your dataset

One-shot learning with siamese networks and triplet loss can be seen as metric learning problem

## Learning to Rank

Supervised learning problem (e.g., optimization of search results returned by a search engine for a query)

Three approaches:

- pointwise
- parwise
- listwise

State of the art rank learning algorithm:

LambdaMART. Listwise approach -> one popular metric that combines both precision and recall is calledmean average precision (MAP)

In typical supervised learning algorithm, we optimize the cost instead of the metric (usually metrics are not differentiable). In LambdaMART the metric is optmized directly

## Learning to Recommend

**Content-based filtering**: learning what users like based on the description of the content they consume -> user can be trapped in a “filter bubble”**Collaborative filtering**: recommendations to one user are computed based on what other users consume or rate -> content of the item consumed is ignored -> huge and extremely sparse matrix

Real-world recommender systems -> hybrid approach

### Factorization Machines (FM)

Explicity designed for sparse datasets. Users and items are encoded as one-hot vectors

### Denoising Autoencoders (DAE)

NN that reconstructs its input from the bottleneck layer. Ideal tool to build a recommender system: input is corrupted by noise while the output shouldn’t be

Idea: new items a user could like are seen as if they were removed from the complete set by some corruption process -> goal of the denoising autoencoder is to reconstruct those removed items

Another effective collaborative-filtering model is an FFNN with two inputs and one output

## Self-Supervised Learning: Word Embeddings

Word embeddings: feature vectors that represent words -> similar words have similar feature vectors

**word2vec**: pretrained embeddings for many languages are available to download online. **skip-gram**

Self-supervised: the labeled examples get extracted from the unlabeled data such as text

# 11 Conclusion

## Topic Modeling

Prevalent unsupervised learning problem. **Latent Dirichlet Allocation (LDA)** -> You decide how many topics are in your collection, the algorithm assigns a topic to each word in this collection. To extract the topics from a document -> count how many words of each topic are present in that document

## Gaussian Process (GP)

Supervised learning method that competes with kernel regression

## Generalized Linear Models (GLM)

Generalization of the linear regression to modeling various forms of dependency between the input feature vector and the target

## Probabilistic Graphical Models (PGM)

One example: Conditional Random Fields (CRF) -> model the input sequence of words and relationships between the features and labels in this sequence as a sequential *dependency graph*

**Graph**: structure consisting of a colletion of nodes and edges that join a pair of nodes

PGMs are also know under names of Bayesian networks, belief networks and probabilistic independence networks

## Markov Chain Monte Carlo (MCMC)

If you work with graphical models and want to sample examples from a very complex distribution defined by the dependency graph. MCMC is a class of algorithms for sampling from any probability distribution defined mathematically

## Generative Adversarial Networks (GAN)

Class of NN used in unsupervised learning. System of two neural networks contesting with each other in a *zero-sum game* setting

## Genetic Algorithms (GA)

Numerical optimization technique used to optimize undifferentiable optimization objective functions. Use concepts from evolutionary biology to search for a global optimum (minimum or maximum) of an optimization problem, by mimicking evolutionary biological processes

GA allow finding solutions to any measurable optimization criteria (i.e., optimize hyperparameters of a learning algorithm -> typically much slower than gradient-based optimization techniques)

## Reinforcement Learning (RL)

Solves a very specific kind of problem where the decision making is sequential. There’s an agent acting in a unknown environment. Each action brings a reward and moves the agent to another state of the envinronment. The goal of the agent is to optimize its long-term reward