Page 1

DIGITAL TWIN AI and Machine Learning: The Bias-Variance Tradeoff and Model Selection

Prof. Andrew D. Bagdanov
andrew.bagdanov AT unifi.it

Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Firenze

13/19 November 2020

Page 2

Outline

Introduction

The Bias-Variance Decomposition

Cross-validation

Hyperparameter Selection

Reflections

Page 3

Introduction

Page 4

Overview

- Today will bring to a close the first part of this course.
- We will first look at a very important concept: the Bias-Variance Decomposition.
- This gives us a conceptual model of estimator performance in terms of generalization error.
- With this in hand, we will then look at some concrete tools we can use to manage the trade-off between model bias and variance:
  - Cross-validation: the basis for understanding bias and variance.
  - Learning curves: to understand how model variance depends on training split size.
  - Validation curves: to understand how hyperparameters affect model performance.
  - Model selection: to systematically explore hyperparameter settings through grid search.

Page 5

The Bias-Variance Decomposition

Page 6

Bias-variance in a Nutshell

The Math

- The bias-variance decomposition is a way of analyzing a model's expected generalization error as the sum of three terms: the bias, the variance, and the irreducible error resulting from noise in the problem itself.

- Say observations of a true function f(x) are generated as y = f(x) + ε, where ε is noise with zero mean and variance σ², and we fit an estimate f̂(x) from training data.

- It can be shown that the expected squared error is equal to:

  E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ², where

  Bias[f̂(x)] = E[f̂(x)] − f(x)
  Var[f̂(x)] = E[f̂(x)²] − E[f̂(x)]²
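To make this concrete, here is a minimal simulation sketch (not from the slides; the sine target, noise level, and polynomial degree are all assumptions): refit an estimator on many noisy training sets drawn from the same distribution and estimate the bias and variance of its prediction at one test point.

  # Sketch: numerically estimating bias^2 and variance of an estimator
  # by refitting it on many noisy training sets drawn from y = f(x) + eps.
  import numpy as np

  rng = np.random.default_rng(0)
  f = np.sin                       # assumed "true" function f(x)
  sigma = 0.3                      # noise standard deviation
  x_test = 1.5                     # a single test point x
  n_trials, n_samples, degree = 500, 30, 3

  preds = []
  for _ in range(n_trials):
      x = rng.uniform(0, np.pi, n_samples)
      y = f(x) + rng.normal(0, sigma, n_samples)   # noisy targets y = f(x) + eps
      coeffs = np.polyfit(x, y, degree)            # fit a degree-3 polynomial
      preds.append(np.polyval(coeffs, x_test))     # record f-hat(x_test)

  preds = np.array(preds)
  bias2 = (preds.mean() - f(x_test)) ** 2          # (E[f-hat(x)] - f(x))^2
  variance = preds.var()                           # E[f-hat^2] - E[f-hat]^2
  print(f'bias^2 = {bias2:.4f}, variance = {variance:.4f}, sigma^2 = {sigma**2:.4f}')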

Page 7

Bias-variance in a Nutshell

The Main Point

- As we increase model complexity:
  - Bias decreases: a better fit to the data.
  - Variance increases: the fitted model varies more with the data.
- Imagine the hierarchy of polynomial models:
  - f(x) = c
  - f(x) = ax + c
  - f(x) = ax² + bx + c
  - ...
- As we go up in this hierarchy, model complexity increases and bias decreases.
- But the model parameters estimated from data will fluctuate wildly with changing data, even if drawn from the same distribution; the sketch below makes this concrete.
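A small companion sketch (the setup is assumed, not from the slides): refit polynomials of increasing degree on fresh draws from the same distribution and watch the spread of the fitted leading coefficient grow with degree.

  # Sketch: coefficient instability grows with polynomial degree.
  import numpy as np

  rng = np.random.default_rng(1)
  for degree in (1, 3, 9):
      lead_coeffs = []
      for _ in range(200):
          x = rng.uniform(-1, 1, 20)
          y = np.sin(2 * x) + rng.normal(0, 0.2, 20)       # same distribution each draw
          lead_coeffs.append(np.polyfit(x, y, degree)[0])  # leading coefficient
      print(f'degree {degree}: leading-coefficient std = {np.std(lead_coeffs):.2f}')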

Page 8

Bias-variance in a Nutshell

Visualizing Bias and Variance

Page 9

Cross-validation

Page 10

Robust Evaluation of Model Performance

Cross-validation Overview

- Learning the parameters of any model and testing it on the same data is a methodological mistake.
- A model that just memorizes the labels of the training samples would have a perfect score.
- But, of course, it would fail miserably to classify any samples not yet seen.
- This is an extreme example of what is called overfitting.
- To avoid it, it is common practice when performing supervised machine learning to hold out part of the available data as a test set (as we have done since the beginning); a minimal sketch of this hold-out split follows.
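A minimal hold-out sketch using sklearn's train_test_split (the synthetic dataset is an assumption, standing in for real data):

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split

  # Synthetic data standing in for a real dataset.
  X, y = make_classification(n_samples=1000, random_state=0)

  # Hold out 25% of the samples as a test set never touched during training.
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)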

Page 11

Robust Evaluation of Model Performance

A Useful Flowchart

- Here is a flowchart of the cross-validation workflow for training:

Page 12

Robust Evaluation of Model Performance

Validation Set

- When evaluating different hyperparameter settings, there is still a risk of overfitting on the test set.
- If we tweak parameters until the estimator is optimal, knowledge about the test set can "leak" into the model, and evaluation metrics no longer reflect generalization performance.
- To solve this problem, yet another part of the dataset can be held out as a validation set: we train on the training set, then evaluate on the validation set, and when the model seems to work well the final evaluation is done on the test set.
- However, by partitioning the available data into three sets, we drastically reduce the number of samples available for learning.
- Moreover, the results can depend on a particular random choice of train and validation sets.

Page 13

Robust Evaluation of Model Performance

Enter, Cross-validation

- A solution to this problem is a procedure called cross-validation.
- A test set should still be held out for final evaluation, but the validation set is no longer needed.
- The basic approach is called k-fold cross-validation: the training set is split into k equally-sized, smaller sets, and the following procedure is followed for each of the k folds:
  1. A model is trained using k − 1 of the folds as training data;
  2. The resulting model is validated on the remaining part of the data by computing a performance measure such as accuracy on it.
- Important: the average performance over the k folds gives us a lower bound on the generalization of the model to unseen data. (A hand-rolled version of this loop is sketched below.)
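For intuition, the k-fold loop can be written out by hand with sklearn.model_selection.KFold; a sketch under assumed data and model choices:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import KFold
  from sklearn.svm import LinearSVC

  X, y = make_classification(n_samples=600, random_state=0)
  scores = []
  for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
      # 1. Train on k-1 of the folds...
      model = LinearSVC(max_iter=5000).fit(X[train_idx], y[train_idx])
      # 2. ...and validate on the remaining fold.
      scores.append(model.score(X[val_idx], y[val_idx]))
  print(f'mean accuracy over {len(scores)} folds: {sum(scores) / len(scores):.3f}')

In practice the cross_val_score helper shown on the next slides does exactly this bookkeeping for us.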

Page 14

Robust Evaluation of Model Performance

Cross-validation (continued)

- Here is a diagram explaining the k-fold process:

Page 15

Robust Evaluation of Model Performance

Cross-validation (continued)

- In sklearn we can easily do k-fold cross-validation using the sklearn.model_selection.cross_val_score function:

  from sklearn.model_selection import cross_val_score
  from sklearn.svm import LinearSVC

  model = LinearSVC(C=100, verbose=3)
  scores = cross_val_score(model, X_tr, y_tr, cv=3,
                           verbose=3, n_jobs=4)

- Some parameters to pay attention to:
  - cv: number of folds to use.
  - verbose: logging level, useful to have feedback for long runs.
  - n_jobs: number of parallel jobs to use.
  - scoring: function to use for scoring (defaults to model.score()).
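The returned scores is an array with one score per fold; a typical usage sketch is to report the mean and spread:

  print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')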

Page 16

Robust Evaluation of Model Performance

Cross-validation: Analysis

- Cross-validation is a powerful tool for understanding how models (might) generalize.
- As we will see next, it is the basis for hyperparameter evaluation and selection.
- Problem: cross-validation is expensive, as multiple models must be fit to multiple splits of the data.

Page 17

Hyperparameter Selection

Page 18

Finding the Best Model

Hyperparameter Selection

- Up to now we have used cross-validation only to obtain a more reliable estimate of the performance of our estimator.
- By training multiple times on random train/validation splits we make the most of the available data.
- But this still leaves open the question of how to effectively select the hyperparameters of our model.
- Up to now we have used models that have relatively few hyperparameters.
- When we look at deep models based on neural networks, however, there will be significantly more.
- Fortunately, cross-validation also gives us a tool for robustly estimating performance over a grid of hyperparameters.

Page 19

Finding the Best Model

Hyperparameter Selection: Validation Curves

- An excellent way to get an overview of the sensitivity of a model to one hyperparameter is to plot a validation curve.

  from sklearn.model_selection import validation_curve

  (train_scores, val_scores) = validation_curve(
      LinearSVC(), X_tr, y_tr,
      param_name="C", param_range=[0.1, 1.0, 10, 100, 1000],
      cv=3)

  val_scores

  array([[0.85090745, 0.85135743, 0.82478248],
         [0.85180741, 0.84535773, 0.85238524],
         [0.84910754, 0.85300735, 0.83408341],
         [0.83620819, 0.84640768, 0.85358536],
         [0.85525724, 0.86170691, 0.84983498]])

Page 20

Finding the Best Model

Hyperparameter Selection: Validation Curves

- Useful: validation_curve returns the cross-validated scores for all folds for all parameters.
- This allows us to make useful plots, as in the sketch below:

See: https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
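A plotting sketch in the spirit of that example (the matplotlib styling choices are ours; train_scores and val_scores come from the validation_curve call above):

  import matplotlib.pyplot as plt
  import numpy as np

  C_range = [0.1, 1.0, 10, 100, 1000]
  # Average over the cv folds (axis 1) to get one curve per score type.
  plt.semilogx(C_range, np.mean(train_scores, axis=1), 'o-', label='training score')
  plt.semilogx(C_range, np.mean(val_scores, axis=1), 'o-', label='validation score')
  plt.xlabel('C')
  plt.ylabel('accuracy')
  plt.legend()
  plt.show()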

Page 21

Finding the Best Model

Hyperparameter Selection: Learning Curves

- An important factor in the variance of any model is the size of the training split.
- According to Geoffrey Hinton: "More labeled data is the best possible model regularizer..."
- Using sklearn.model_selection.learning_curve() we can evaluate model performance as a function of training split size:

  from sklearn.model_selection import learning_curve

  train_sizes, train_scores, test_scores = learning_curve(
      LinearSVC(), X_tr, y_tr, cv=3)
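As with validation_curve, each score array has one row per training size and one column per fold; a summary sketch:

  # train_sizes holds the absolute training-set sizes tried by learning_curve.
  for n, tr, te in zip(train_sizes,
                       train_scores.mean(axis=1),
                       test_scores.mean(axis=1)):
      print(f'{n:5d} training samples: train accuracy {tr:.3f}, validation accuracy {te:.3f}')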

Page 22

Finding the Best Model

Hyperparameter Selection: Learning Curves

- Again, this returns all scores for all folds for all training set sizes.
- From these we can produce nice plots like:

See: https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

Page 23

Finding the Best Model

Hyperparameter Selection: Grid Search

- Grid search is an unsophisticated, brute-force technique that works very well in practice.
- There are two main variations: Uniform and Random Grid Search; a sketch of the random variant follows.
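For the random variant, sklearn provides RandomizedSearchCV, which samples a fixed budget of settings from distributions rather than exhausting a grid; a minimal sketch (the log-uniform range for C is an assumption):

  from scipy.stats import loguniform
  from sklearn.model_selection import RandomizedSearchCV
  from sklearn.svm import LinearSVC

  # Sample 10 values of C from a log-uniform distribution over [1e-3, 1e3].
  search = RandomizedSearchCV(LinearSVC(max_iter=2000),
                              {'C': loguniform(1e-3, 1e3)},
                              n_iter=10, cv=3, random_state=0)
  # search.fit(X_tr, y_tr)  # X_tr, y_tr as elsewhere in these slides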

Page 24

Finding the Best Model

Hyperparameter Selection: Grid Search (continued)

- The first thing to do is understand which hyperparameters are of interest.
- This almost always requires a detailed perusal of the documentation.
- Consider a linear SVM with hinge loss: the model essentially has only one hyperparameter, the C used to weight model complexity versus empirical loss:

  f(x) = wᵀx + b

  L(D; w, b) = ||w||² + C ∑_{(x,y)∈D} max(0, 1 − y f(x)),  minimized over w and b

  class(x) = −1 if f(x) ≤ 0; +1 if f(x) > 0

Page 25

Finding the Best Model

Hyperparameter Selection: Grid Search (continued)

- The key class in sklearn is sklearn.model_selection.GridSearchCV.
- What we must provide to GridSearchCV is a grid of parameters to search:

  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import GridSearchCV

  model = LinearSVC(max_iter=2000)
  param_grid = {'C': [0.001, 0.1, 1.0, 10, 20, 50, 100, 1000]}
  search = GridSearchCV(model, param_grid, cv=3, verbose=3, n_jobs=4)
  search.fit(X_tr, y_tr)

  test_score = accuracy_score(y_te, search.best_estimator_.predict(X_te))
  print(f'Best parameters: {search.best_params_}')
  print(f'Best cross-val score: {search.best_score_}')
  print(f'Score on test set: {test_score}')

Fitting 3 folds for each of 8 candidates, totalling 24 fits...

Page 26

Finding the Best Model

Hyperparameter Selection: Grid Search (continued)

- What if we have more hyperparameters?
- For example, in LinearSVC we can also choose the type of penalty (L1 or L2).
- Well, we can just add them to the grid:

  ...
  param_grid = {'C': [0.001, 0.1, 1.0, 10, 20, 50, 100, 1000],
                'penalty': ['l1', 'l2']}
  ...

Fitting 3 folds for each of 16 candidates, totalling 48 fits
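One caveat: in LinearSVC not every (penalty, loss) combination is supported (penalty='l1' requires loss='squared_hinge' and dual=False), so a grid like this may need matching settings. After fitting, the full table of results can be inspected through cv_results_; a sketch using pandas:

  import pandas as pd

  # One row per parameter combination, with its mean cross-validated score.
  results = pd.DataFrame(search.cv_results_)
  print(results[['param_C', 'param_penalty', 'mean_test_score']]
        .sort_values('mean_test_score', ascending=False)
        .head())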

Page 27

Reflections

Page 28

Model Selection

- Model selection is a fundamental fact of life when working with machine learning algorithms.
- Most of the models we have seen so far are low-variance models: they perform fairly stably over a range of hyperparameter settings.
- This is why these models are the tried-and-true techniques for supervised learning: they often just work.
- In the next part of the course we will start looking at neural network models.
- They can achieve significantly better performance...
- ...at the cost, however, of significantly complicating the model selection process.

Page 29

The Bias-Variance Decomposition

- Nutshell: the more complex the model f̂(x) is, the more data points it will capture, and the lower the bias will be; however, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
- Caveat: the bias-variance decomposition is useful as a conceptual model; in practice the bias and variance of a model are difficult to estimate.

Page 30

Model Selection Lab

- The laboratory notebook for today:

http://bit.ly/DTwin-ML5
