COMP 551 -Applied Machine Learningwlh/comp551/slides/06-eval_and_reg.pdf · Receiver operating...
Transcript of COMP 551 -Applied Machine Learningwlh/comp551/slides/06-eval_and_reg.pdf · Receiver operating...
COMP 551 - Applied Machine LearningLecture 6 --- Evaluation and RegularizationWilliam L. Hamilton(with slides and content from Joelle Pineau)* Unless otherwise noted, all material posted for this course are copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.
William L. Hamilton, McGill University and Mila 1
Clarifications on generative classification§ The proper pdf for the multivariate Gaussian is:
§ For continuous variables, we are not computing proper probabilities but instead we are comparing the probability densities of different solutions (via pdfs). § The “probability” of a specific point in a continuous distribution is always zero!
§ But we can compare the relative likelihoods of different points and model parameter settings by comparing the pdf values.
William L. Hamilton, McGill University and Mila 2
Cancels out in LDA computation, but it is still
important to include!
Classification in the real world
William L. Hamilton, McGill University and Mila
Not spam
Spam
?
3
Multi-class classification
William L. Hamilton, McGill University and Mila
§ Generally two options:
1. Learn a single classifier that can produce 20 distinct output values.
2. Learn 20 different 1-vs-all binary classifiers.
4
Multi-class classification
William L. Hamilton, McGill University and Mila
§ Generally two options:1. Learn a single classifier that can produce 20 distinct output values.2. Learn 20 different 1-vs-all binary classifiers.
§ Option 1 assumes you have a multi-class version of the classifier.
§ For Naïve Bayes, compute P(y|x)for each class, and select the class with highest probability.
5
Multi-class classification
William L. Hamilton, McGill University and Mila
§ Generally two options:1. Learn a single classifier that can produce 20 distinct output values.2. Learn 20 different 1-vs-all binary classifiers.
§ Option 1 assumes you have a multi-class version of the classifier.
§ For Naïve Bayes, compute P(y|x)for each class, and select the class with highest probability.
§ Option 2 applies to all binary classifiers, so more flexible. But often slower (need to learn many classifiers), and creates an imbalance problem (the target class has relatively fewer data points, compared to the aggregation of the other classes.)
6
Classification in the real world
William L. Hamilton, McGill University and Mila
§ Different objectives:§ Selecting the right model for a problem.
§ Testing performance of a new algorithm.
§ Evaluating impact on a new application.
7
Minimizing the error
§ Find the low point in the validation error:220 7. Model Assessment and Selection
0 5 10 15 20 25 30 35
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Model Complexity (df)
Pred
ictio
n Er
ror
High Bias Low BiasHigh VarianceLow Variance
FIGURE 7.1. Behavior of test sample and training sample error as the modelcomplexity is varied. The light blue curves show the training error err, while thelight red curves show the conditional test error ErrT for 100 training sets of size50 each, as the model complexity is increased. The solid curves show the expectedtest error Err and the expected training error E[err].
Test error, also referred to as generalization error, is the prediction errorover an independent test sample
ErrT = E[L(Y, f(X))|T ] (7.2)
where both X and Y are drawn randomly from their joint distribution(population). Here the training set T is fixed, and test error refers to theerror for this specific training set. A related quantity is the expected pre-diction error (or expected test error)
Err = E[L(Y, f(X))] = E[ErrT ]. (7.3)
Note that this expectation averages over everything that is random, includ-ing the randomness in the training set that produced f .
Figure 7.1 shows the prediction error (light red curves) ErrT for 100simulated training sets each of size 50. The lasso (Section 3.4.2) was usedto produce the sequence of fits. The solid red curve is the average, andhence an estimate of Err.
Estimation of ErrT will be our goal, although we will see that Err ismore amenable to statistical analysis, and most methods effectively esti-mate the expected error. It does not seem possible to estimate conditional
Train error
Validation error
William L. Hamilton, McGill University and Mila 8
Not all errors are created equal
William L. Hamilton, McGill University and Mila
§ Not all errors have equal impact!
§ Different types of mistakes, particularly with classification tasks.
§ Incorrectly classifying a tumor is very different from accidently allowing a spam
email though the filter!
9
Not spam
Spam
?
Datasets can be imbalanced
William L. Hamilton, McGill University and Mila
§ Most real world data is not “balanced”, meaning that there classes are not equally distributed.
§ It is easy to get high accuracy on imbalanced data!
§ E.g., if 99% of emails are not spam, then we can get 99% accuracy by
simply predicting that everything is not spam!
10
Not spam
Spam
?
Example 1
§ Real-world study on predicting seizures for EEG signals.
§ Claim the technique is “clinically ready for mass screening”
William L. Hamilton, McGill University and Mila 11
Example 1
Accuracy =Truepositives+TrueNegatives/TotalnumberofexamplesSensitivity =Truepositives/TotalnumberofactualpositivesSpecificity =Truenegatives/Totalnumberofactualnegatives
§ Real-world study on predicting seizures for EEG signals.
§ Claim the technique is “clinically ready for mass screening”
William L. Hamilton, McGill University and Mila 12
Performance metrics for classification§ Not all errors have equal impact!§ There are different types of mistakes, particularly in the
classification setting.§ E.g., Consider the diagnostic of a disease. Two types of mis-diagnostics:
§ Patient does not have disease but received positive diagnostic (Type I error);§ Patient has disease but it was not detected (Type II error).
§ E.g. Consider the problem of spam classification:§ A message that is not spam is assigned to the spam folder (Type I error);§ A message that is spam appears in the regular folder (Type II error).
§ Question: How many Type I errors are you willing to tolerate, for a reasonable rate of Type II errors?
William L. Hamilton, McGill University and Mila 13
Example 2
§ Another real-world study on predicting seizures for EEG signals.
§ >85% sensitivity.
§ “Warning rate” of 0.35/hr.
William L. Hamilton, McGill University and Mila 14
Example 3
§ Another real-world study on predicting seizures for EEG signals.
§ 95% sensitivity.
§ “False alarm rate” of 0.15/hr.
William L. Hamilton, McGill University and Mila 15
Terminology
§ Type of classification outputs:§ True positive (m11): Example of class 1 predicted as class 1.
§ False positive (m01): Example of class 0 predicted as class 1. Type 1 error.
§ True negative (m00): Example of class 0 predicted as class 0.
§ False negative (m10): Example of class 1 predicted as class 0. Type II error.
§ Total number of instances: m = m00 + m01 + m10 + m11
William L. Hamilton, McGill University and Mila 16
Terminology
§ Type of classification outputs:§ True positive (m11): Example of class 1 predicted as class 1.
§ False positive (m01): Example of class 0 predicted as class 1. Type 1 error.
§ True negative (m00): Example of class 0 predicted as class 0.
§ False negative (m10): Example of class 1 predicted as class 0. Type II error.
§ Total number of instances: m = m00 + m01 + m10 + m11
§ Error rate: (m01 + m10) / m§ If the classes are imbalanced (e.g. 10% from class 1, 90% from class 0),
one can achieve low error (e.g. 10%) by classifying everything as coming
from class 0!
William L. Hamilton, McGill University and Mila 17
Confusion matrix
§ Many software packages output a confusion matrix:
William L. Hamilton, McGill University and Mila 18
Confusion matrix
§ Many software packages output this matrix:
§ Be careful! Sometimes the format is slightly different, e.g.,
Confusion matrix
• Confusion matrix gives more information than error rate:
m00 m01
m10 m11
�
• Many software packages (eg. Weka) output this matrix
• Varying the parameter of the algorithm produces a curve
COMP-652, Lecture 12 - October 18, 2012 11
William L. Hamilton, McGill University and Mila 19
Common measures§ Accuracy = (TP+ TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives
= TP / (TP+ FP)
§ Recall = True positives / Total number of actual positives
= TP / (TP + FN)
William L. Hamilton, McGill University and Mila 20
Common measures§ Accuracy = (TP+ TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives
= TP / (TP+ FP)
§ Recall = True positives / Total number of actual positives
= TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives
= TN / (FP + TN)
Textclassification
Medicine
William L. Hamilton, McGill University and Mila 21
Common measures§ Accuracy = (TP+ TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives
= TP / (TP+ FP)
§ Recall = True positives / Total number of actual positives
= TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives
= TN / (FP + TN)
§ False positive rate = FP / (FP + TN)
Textclassification
Medicine
William L. Hamilton, McGill University and Mila 22
Common measures§ Accuracy = (TP+ TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives
= TP / (TP+ FP)
§ Recall = True positives / Total number of actual positives
= TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives
= TN / (FP + TN)
§ False positive rate = FP / (FP + TN)
§ F1 measure
Textclassification
Medicine
William L. Hamilton, McGill University and Mila 23
Trade-off: False positives and false negatives§ Often have a trade-off between false positives and false negatives.
§ E.g. Consider 30 different classifiers trained on a class. Classify a new sample as positive if K classifiers output positive. Vary K between 0 and 30.
Example: Tree bagging
• 30 decision trees, classify an example as positive if K trees classify it aspositive
• Vary K between 0 and 30
!"#$%&'()*+),'-./.01)23''/)!"#$%&'()*+),'-./.01)23''/)-01/234-2',)56)5#77.17-01/234-2',)56)5#77.17
8&#//.96)#/)%0/.2.:').9);)042)09)*+)23''/)8&#//.96)#/)%0/.2.:').9);)042)09)*+)23''/)%3',.-2)%0/.2.:'<))=#36);<%3',.-2)%0/.2.:'<))=#36);<
COMP-652, Lecture 12 - October 18, 2012 12
Precision-recall
• Similar concept to AUC curves, but used in retrieval tasks
• Precision is true positive / total number of documents retrieved
• Recall is true positives / all positives
• In medical applications we use instead sensitivity and selectivity, whichare the recall for the two classes
!"#$%&%'()*#$+,,)-"+./!"#$%&%'()*#$+,,)-"+./!,'0)"#$+,,)'()/'"%1'(0+,)+2%&3)."#$%&%'()'()!,'0)"#$+,,)'()/'"%1'(0+,)+2%&3)."#$%&%'()'()4#"0%$+,)+2%&3)+(5)4+"6)0/#)0/"#&/',5)7'")8+9%(:)4#"0%$+,)+2%&3)+(5)4+"6)0/#)0/"#&/',5)7'")8+9%(:).'&%0%4#)."#5%$0%'(&);'")4+"6)<=.'&%0%4#)."#5%$0%'(&);'")4+"6)<=
COMP-652, Lecture 12 - October 18, 2012 17
William L. Hamilton, McGill University and Mila 24
Receiver operating characteristic (ROC)§ Consider simple 2-class classification (blue = negative class; red = positive class).
§ Consider the decision boundary (vertical line inthe left figure), where you predict Negative on the left of the boundary and predict Positive on the right .
§ Changing that boundary defines the ROC curve on the right.
Predictpositive
Predictnegative
William L. Hamilton, McGill University and Mila 25
Understanding the ROC curve§ Consider simple 2-class classification (blue = negative class; red = positive class).
§ Consider the decision boundary (vertical line inthe left figure), where you predict Negative on the left of the boundary and predict Positive on the right .
§ Changing that boundary defines the ROC curve on the right.
Predictpositive
Predictnegative
Diagonal line corresponds to a random classifier.
William L. Hamilton, McGill University and Mila 26
Building the ROC curve§ In many domains, the empirical ROC curve will be non-convex
(red line). Take the convex hull of the points (blue line).
To build the ROC curve: 1. Train a classifier.2. Vary the decision boundary.3. Compute FP and TP rate for
each different decision boundary.
ROC convex hull
• Suppose we have two hypotheses h1 and h2 along the ROC curve.
• We can always use h1 with probability p and h2 with probability (1 � p)and the performance will interpolate between the two
• So we can always match any point on the convex hull of an empiricalROC curve !"#$#%&'()$*+,,!"#$#%&'()$*+,,
!"#$#%&'()$*+,,
"-./.&0,$!"#$#+-'(
COMP-652, Lecture 12 - October 18, 2012 15
FP
TP
William L. Hamilton, McGill University and Mila 27
Using the ROC curve§ To compare 2 algorithms over a range of classification thresholds,
consider the Area Under the (ROC) Curve (AUC).§ A perfect algorithm has AUC=1.§ A random algorithm has AUC=0.5.§ Higher AUC doesn’t mean all performance measures are better.
ROC convex hull
• Suppose we have two hypotheses h1 and h2 along the ROC curve.
• We can always use h1 with probability p and h2 with probability (1 � p)and the performance will interpolate between the two
• So we can always match any point on the convex hull of an empiricalROC curve !"#$#%&'()$*+,,!"#$#%&'()$*+,,
!"#$#%&'()$*+,,
"-./.&0,$!"#$#+-'(
COMP-652, Lecture 12 - October 18, 2012 15
William L. Hamilton, McGill University and Mila 28
Using the ROC curve§ To compare 2 algorithms over a range of classification thresholds,
consider the Area Under the (ROC) Curve (AUC).§ A perfect algorithm has AUC=1.§ A random algorithm has AUC=0.5.§ Higher AUC doesn’t mean all performance measures are better.
ROC convex hull
• Suppose we have two hypotheses h1 and h2 along the ROC curve.
• We can always use h1 with probability p and h2 with probability (1 � p)and the performance will interpolate between the two
• So we can always match any point on the convex hull of an empiricalROC curve !"#$#%&'()$*+,,!"#$#%&'()$*+,,
!"#$#%&'()$*+,,
"-./.&0,$!"#$#+-'(
COMP-652, Lecture 12 - October 18, 2012 15
Intuition: AUC is equivalent to the probability of ranking a random positive example higher than a random negative example!
William L. Hamilton, McGill University and Mila 29
What about regression?§ Evaluation for regression is generally more straightforward than
classification, but there are still choices!
§ Common examples:
§ Mean Squared Error:
§ Root Mean Squared Error:
§ Mean Absolute Error:
§ (R)MSE penalizes larger errors more (e.g., being off by 10 is more than
twice as bad as being off by 5).
MSE =1
n
nX
i=1
(yi � yi)2
MAE =1
n
nX
i=1
|yi � yi|
William L. Hamilton, McGill University and Mila 30
RMSE =
rPni=1(yi � yi)2
n
Our goal: Minimize validation error38 2. Overview of Supervised Learning
High Bias
Low Variance
Low Bias
High Variance
Prediction
Error
Model Complexity
Training Sample
Test Sample
Low High
FIGURE 2.11. Test and training error as a function of model complexity.
be close to f(x0). As k grows, the neighbors are further away, and thenanything can happen.
The variance term is simply the variance of an average here, and de-creases as the inverse of k. So as k varies, there is a bias–variance tradeoff.
More generally, as the model complexity of our procedure is increased, thevariance tends to increase and the squared bias tends to decrease. The op-posite behavior occurs as the model complexity is decreased. For k-nearestneighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade biasoff with variance in such a way as to minimize the test error. An obviousestimate of test error is the training error 1
N
!i(yi − yi)2. Unfortunately
training error is not a good estimate of test error, as it does not properlyaccount for model complexity.
Figure 2.11 shows the typical behavior of the test and training error, asmodel complexity is varied. The training error tends to decrease wheneverwe increase the model complexity, that is, whenever we fit the data harder.However with too much fitting, the model adapts itself too closely to thetraining data, and will not generalize well (i.e., have large test error). Inthat case the predictions f(x0) will have large variance, as reflected in thelast term of expression (2.46). In contrast, if the model is not complexenough, it will underfit and may have large bias, again resulting in poorgeneralization. In Chapter 7 we discuss methods for estimating the testerror of a prediction method, and hence estimating the optimal amount ofmodel complexity for a given prediction method and training set.
§ Overly simple model:§ High training error and
high test error.§ Overly complex model:
§ Low training error but high test error.
William L. Hamilton, McGill University and Mila 31
Our goal: Minimize validation error38 2. Overview of Supervised Learning
High Bias
Low Variance
Low Bias
High Variance
Prediction
Error
Model Complexity
Training Sample
Test Sample
Low High
FIGURE 2.11. Test and training error as a function of model complexity.
be close to f(x0). As k grows, the neighbors are further away, and thenanything can happen.
The variance term is simply the variance of an average here, and de-creases as the inverse of k. So as k varies, there is a bias–variance tradeoff.
More generally, as the model complexity of our procedure is increased, thevariance tends to increase and the squared bias tends to decrease. The op-posite behavior occurs as the model complexity is decreased. For k-nearestneighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade biasoff with variance in such a way as to minimize the test error. An obviousestimate of test error is the training error 1
N
!i(yi − yi)2. Unfortunately
training error is not a good estimate of test error, as it does not properlyaccount for model complexity.
Figure 2.11 shows the typical behavior of the test and training error, asmodel complexity is varied. The training error tends to decrease wheneverwe increase the model complexity, that is, whenever we fit the data harder.However with too much fitting, the model adapts itself too closely to thetraining data, and will not generalize well (i.e., have large test error). Inthat case the predictions f(x0) will have large variance, as reflected in thelast term of expression (2.46). In contrast, if the model is not complexenough, it will underfit and may have large bias, again resulting in poorgeneralization. In Chapter 7 we discuss methods for estimating the testerror of a prediction method, and hence estimating the optimal amount ofmodel complexity for a given prediction method and training set.
§ Overly simple model:§ High bias
§ Overly complex model:§ High variance
William L. Hamilton, McGill University and Mila 32
Bias vs variance in machine learning§ Assume that y =f(x)+) where ) is zero mean and finite variance *+§ Then for any learned model we can decompose the error as:
§ where
E[(y � f(x))2] = (Bias[f(x)])2 +Var[f(x)] + �2
Bias[f(x)] = E[f(x)]� f(x)
Var[f(x)] = E[f(x)2]� E[f(x)]2
William L. Hamilton, McGill University and Mila 33
Bias vs variance in machine learning§ Assume that y =f(x)+) where ) is zero mean and finite variance *+§ Then for any learned model we can decompose the error as:
§ where E[(y � f(x))2] = (Bias[f(x)])2 +Var[f(x)] + �2
Bias[f(x)] = E[f(x)]� f(x)
Var[f(x)] = E[f(x)2]� E[f(x)]2
How far off is our mean prediction?
How wildly do our predictions vary?
William L. Hamilton, McGill University and Mila 34
Bias vs variance in machine learning
§ Overly simple model:§ High bias§ Underfitting
§ Overly complex model:§ High variance§ Overfitting
William L. Hamilton, McGill University and Mila 35
Overfitting, bias, and variance
§ What to do when we are overfitting?
§ Remove some features.
§ Simplify the model.
§ Reduce the variance at the cost of some bias!
§ This is called regularization.
William L. Hamilton, McGill University and Mila 36
Overfitting, bias, and variance
Which models have higher bias, which have higher variance?
William L. Hamilton, McGill University and Mila 37
Ridge regression (aka L2-regularization)§ Constrains the weights by imposing a penalty on their size:
ŵridge =argminw { ∑i=1:n(yi - wTxi)2 +λ∑j=0:mwj2 }where λ can be selected manually, or by cross-validation.
§ Do a little algebra to get the solution: ŵridge =(XTX+λI)-1XTY§ The ridge solution is not equivariant under scaling of the data, so
typically need to normalize the inputs first.
§ Ridge gives a smooth solution, effectively shrinking the weights, but
drives few weights to 0.
William L. Hamilton, McGill University and Mila 38
Ridge regression (aka L2-regularization)§ Constrains the weights by imposing a penalty on their size:
ŵridge =argminw { ∑i=1:n(yi - wTxi)2 +λ∑j=0:mwj2 }where λ can be selected manually, or by cross-validation.
§ I.e., Err(w)=∑i=1:n(yi - wTxi)2 +λ(||w||2)2§ Can also simply add the penalty to gradient descent:
∂Err(w)/∂w =2(XTXw – XTY)+2 λ w
Usual gradient for linear regression
Gradient from the L2-penalty term
William L. Hamilton, McGill University and Mila 39
Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:
ŵlasso =argminW {∑i=1:n(yi - wTxi)2 +λ∑j=1:m|wj|}
§ Now the objective is non-linear in the output y, and there is no closed-form solution. Need to solve a quadratic programming problem instead.
§ More computationally expensive than Ridge regression.
§ Effectively sets the weights of less relevant input features to zero.
William L. Hamilton, McGill University and Mila 40
Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:
ŵlasso =argminW {∑i=1:n(yi - wTxi)2 +λ∑j=1:m|wj|}
§ I.e., Err(w)=∑i=1:n(yi - wTxi)2 +λ||w||1§ Can also simply add the penalty to gradient descent:
∂Err(w)/∂w =2(XTXw – XTY)+λ sign(w)
Usual gradient for linear regression
(Sub)gradient of the L1-penalty term. sign(x)=1 if x >0, sign(x)=-1 if x < 0,
and sign(x)=0 if x=0.
William L. Hamilton, McGill University and Mila 41
Comparing Ridge and Lasso
4
Joelle Pineau 7
Comparing Ridge and Lasso
COMP-598: Applied Machine Learning
Solving L1 regularization
• The optimization problem is a quadratic program
• There is one constraint for each possible sign of the weights (2n
constraints for n weights)
• For example, with two weights:
minw1,w2
mX
j=1
(yj � w1x1 � w2x2)2
such that w1 + w2 ⇥ ⇤
w1 � w2 ⇥ ⇤
�w1 + w2 ⇥ ⇤
�w1 � w2 ⇥ ⇤
• Solving this program directly can be done for problems with a smallnumber of inputs
COMP-652, Lecture 3 - September 8, 2011 31
Visualizing L1 regularization
w1
w2
w?
• If ⌅ is big enough, the circle is very likely to intersect the diamond atone of the corners
• This makes L1 regularization much more likely to make some weightsexactly 0
COMP-652, Lecture 3 - September 8, 2011 32
Ridge Lasso
L2 Regularization for linear models revisited
• Optimization problem: minimize error while keeping norm of the weightsbounded
minw
JD(w) = minw
(�w � y)T (�w � y)
such that wTw ⇥ ⇤
• The Lagrangian is:
L(w,⌅) = JD(w)�⌅(⇤�wTw) = (�w�y)T (�w�y)+⌅wTw�⌅⇤
• For a fixed ⌅, and ⇤ = ⌅�1, the best w is the same as obtained byweight decay
COMP-652, Lecture 3 - September 8, 2011 27
Visualizing regularization (2 parameters)
w1
w2
w?
w⇤ = (�T�+ ⌅I)�1�y
COMP-652, Lecture 3 - September 8, 2011 28
Joelle Pineau 8
A quick look at evaluation functions
• We call L(Y,fw(x)) the loss function.
– Least-square / Mean squared-error (MSE) loss:
L(Y, fw(X)) = = ∑i=1:n ( yi - wTxi)2
• Other loss functions?
– Absolute error loss: L(Y, fw(X)) = ∑i=1:n | yi – wTxi |
– 0-1 loss for classification: L(Y, fw(X)) = ∑i=1:n I ( yi ≠ fw(xi) )
• Different loss functions make different assumptions.
– Squared error loss assumes the data can be approximated by a global linear model.
COMP-598: Applied Machine Learning
Contours of equalregression error
Contours of equalmodel complexity penalty
§ Ridge regression will tend to lower all weights.
§ Lasso will tend to set some weights to 0.
§ Combining Lasso and Ridge regression is called the “elastic” net.
William L. Hamilton, McGill University and Mila 42
Bias, variance, and regularization
William L. Hamilton, McGill University and Mila
§ Regularization (e.g, Ridge and Lass) decrease variance at the expense of some bias.
§ Regularization reduces overfitting but can cause underfitting.
§ The L2 and L1 penalties can be added to any model (e.g., logistic regression). Easy to incorporate with gradient descent.
43
The concept of inductive bias
William L. Hamilton, McGill University and Mila
§ Statistical “bias” is not a bad thing in ML!§ It is all about having the right inductive bias. § ML is all about coming up with different models that
have different inductive biases. § Optional reading: “The Need for Biases in Learning
Generalizations"
44