Epilogue: what have you learned this semester?

Transcript of "Epilogue: what have you learned this semester?" (CS545, Fall 2016)

Page 1

Epilogue: what have you learned this semester?

[Figure: decision tree classifying email as spam or ham based on the presence of the words 'Viagra' and 'lottery', together with a two-dimensional scatter plot.]

Page 2

What did you get out of this course?

■  What skills have you learned in this course that you feel would be useful?
■  What are the most important insights you gained this semester?
■  What advice would you give future students?
■  What was your biggest challenge in this course?
■  What would you like me to do differently?

Page 3

What I hope you got out of this course

The machine learning toolbox
■  Formulating a problem as an ML problem
■  Understanding a variety of ML algorithms
■  Running and interpreting ML experiments
■  Understanding what makes ML work – theory and practice

Page 4

Learning scenarios we covered

■  Classification: discrete/categorical labels
■  Regression: continuous labels
■  Clustering: no labels

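A minimal scikit-learn sketch of these three scenarios on synthetic data; the dataset generators and model choices here are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: discrete labels
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print("classification accuracy:", clf.score(Xc, yc))

# Regression: continuous labels
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("regression R^2:", reg.score(Xr, yr))

# Clustering: no labels
Xu, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xu)
print("cluster sizes:", np.bincount(labels))
```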

Page 5

A variety of learning tasks

■  Supervised learning
■  Unsupervised learning
■  Semi-supervised learning
   – Access to a lot of unlabeled data
■  Multi-label classification
   – Each example can belong to multiple classes
■  Multi-task classification
   – Solving multiple related tasks

Page 6

A variety of learning tasks

■  Outlier/novelty detection
   – Novelty: anything that is not part of the normal behavior of a system.
■  Reinforcement learning
   – Learn actions to maximize payoff
■  Structured output learning

Other Learning Aides

1. Nonlinear dimension reduction
   [Figure: two-dimensional data in (x1, x2) before and after a nonlinear transformation.]

2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, ...

3. Removing noisy data
   [Figure: data plotted by intensity and symmetry, before and after removing noisy examples.]

4. Advanced validation techniques: Rademacher and permutation penalties – more efficient than CV, more convenient and accurate than VC.

© AML. Creator: Malik Magdon-Ismail. Learning Aides: 16/16

Page 7

Learning in structured output spaces

Handle prediction problems with complex output spaces
■  Structured outputs: multivariate, correlated, constrained

General way to solve many learning problems

Examples taken from Ben Taskar’s 07 NIPS tutorial

Page 8

Local vs. Global


Global classification takes advantage of correlations and satisfies the constraints in the problem

Page 9

Other techniques

■  Graphical models (conditional random fields, Bayesian networks)
■  Bayesian model averaging

Page 10

The importance of features and their representation

Choosing the right features is one of the most important aspects of applying ML.

What you can do with features:
■  Normalization
■  Selection
■  Construction
■  Filling in missing features
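A minimal scikit-learn sketch chaining several of these feature operations (imputation, normalization, selection) in front of a classifier; the particular transformers, parameters, and variable names are illustrative assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

# Fill in missing values, normalize, select features, then classify.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill in missing features
    ("scale", StandardScaler()),                  # normalization
    ("select", SelectKBest(f_classif, k=10)),     # feature selection
    ("clf", LinearSVC()),
])
# Usage (X_train, y_train, X_test, y_test are assumed to exist):
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)
```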

Page 11

[Figure: linear decision boundary with weight vector w; wᵀx + b > 0 on one side of the boundary and wᵀx + b < 0 on the other.]

Types of models

Geometric
■  Ridge regression, SVM, perceptron
■  Neural networks
Distance-based
■  K-nearest-neighbors
Probabilistic
■  Naïve Bayes
Logical models: tree/rule based
■  Decision trees

Ensembles

[Figure: decision tree classifying email as spam or ham based on the words 'Viagra' and 'lottery', and the corresponding probabilistic model P(Y = spam | 'Viagra', 'lottery').]
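As a reminder of the geometric model family above, a minimal NumPy sketch of the linear decision rule sign(wᵀx + b); the weights, bias, and points are made up for illustration.

```python
import numpy as np

# A geometric (linear) classifier: predict by which side of the
# hyperplane w^T x + b = 0 the point falls on.
w = np.array([1.5, -2.0])   # illustrative weight vector
b = 0.5                     # illustrative bias
X = np.array([[1.0, 0.2],
              [0.1, 1.0]])

scores = X @ w + b
y_hat = np.where(scores > 0, +1, -1)
print(scores, y_hat)
```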

Page 12

Loss + regularization

Many of the models we studied are based on a cost function of the form:

loss + regularization

Example: Ridge regression

$$\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda\,\mathbf{w}^\top\mathbf{w}$$
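A minimal NumPy sketch of ridge regression fit via its closed-form solution; setting the gradient of the cost above to zero gives w = (XᵀX + NλI)⁻¹Xᵀy. The data and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Minimize (1/N) * sum_i (y_i - w^T x_i)^2 + lam * w^T w."""
    N, d = X.shape
    # Zero gradient: (X^T X + N*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)

# Illustrative data (no bias term, for simplicity).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(ridge_fit(X, y, lam=0.1))
```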

Page 13

Loss + regularization for classification

SVM (hinge loss + L2 regularizer):

$$\frac{C}{N}\sum_{i=1}^{N} \max\left[\,1 - y_i\, h_{\mathbf{w}}(\mathbf{x}_i),\ 0\,\right] + \frac{1}{2}\,\mathbf{w}^\top\mathbf{w}$$

The hinge loss is a margin-maximizing loss function.

Can use other regularizers: the L1 norm $\|\mathbf{w}\|_1$ leads to very sparse solutions and is non-differentiable.

Elastic net regularizer: $\alpha\,\|\mathbf{w}\|_1 + (1-\alpha)\,\|\mathbf{w}\|_2^2$
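A minimal NumPy sketch evaluating this objective for a linear model h_w(x) = wᵀx, together with the elastic net penalty; the data, weights, and parameter values are made up for illustration.

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    """(C/N) * sum_i max(1 - y_i * w^T x_i, 0) + 0.5 * w^T w."""
    margins = y * (X @ w)
    hinge = np.maximum(1.0 - margins, 0.0)
    return C * hinge.mean() + 0.5 * (w @ w)

def elastic_net_penalty(w, alpha=0.5):
    """alpha * ||w||_1 + (1 - alpha) * ||w||_2^2."""
    return alpha * np.abs(w).sum() + (1.0 - alpha) * (w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X[:, 0] - X[:, 1])      # illustrative +/-1 labels
w = np.array([1.0, -1.0, 0.0])

print(svm_objective(w, X, y, C=1.0))
print(elastic_net_penalty(w, alpha=0.5))
```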

Page 14

Loss + regularization for classification

SVM (hinge loss + L2 regularizer):

$$\frac{C}{N}\sum_{i=1}^{N} \max\left[\,1 - y_i\, h_{\mathbf{w}}(\mathbf{x}_i),\ 0\,\right] + \frac{1}{2}\,\mathbf{w}^\top\mathbf{w}$$

Logistic regression (log loss + L2 regularizer):

$$\frac{1}{N}\sum_{i=1}^{N} \log\left(1 + \exp(-y_i\, h_{\mathbf{w}}(\mathbf{x}_i))\right) + \frac{\lambda}{2}\,\mathbf{w}^\top\mathbf{w}$$

AdaBoost can be shown to optimize the exponential loss:

$$\frac{1}{N}\sum_{i=1}^{N} \exp(-y_i\, h_{\mathbf{w}}(\mathbf{x}_i))$$
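A minimal NumPy sketch comparing the three surrogate losses on the same margins m = y·h_w(x); the grid of margin values is purely illustrative.

```python
import numpy as np

m = np.linspace(-2, 2, 9)            # margins y_i * h_w(x_i)

hinge = np.maximum(1.0 - m, 0.0)     # SVM
log_loss = np.log1p(np.exp(-m))      # logistic regression
exp_loss = np.exp(-m)                # AdaBoost

for row in zip(m, hinge, log_loss, exp_loss):
    print("m=%+.2f  hinge=%.3f  log=%.3f  exp=%.3f" % row)
```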

Page 15

Loss + regularization for regression

Ridge regression (closed-form solution; sensitive to outliers):

$$\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda\,\mathbf{w}^\top\mathbf{w}$$

Lasso (sparse solutions; non-differentiable):

$$\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda\,\|\mathbf{w}\|_1$$

Can use alternative loss functions.
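A minimal scikit-learn sketch contrasting the two penalties on synthetic data; the dataset and the regularization strengths (`alpha`) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                  # only 3 relevant features
y = X @ w_true + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefs:", np.round(ridge.coef_, 2))   # shrunk, but typically nonzero everywhere
print("lasso coefs:", np.round(lasso.coef_, 2))   # typically exact zeros on irrelevant features
```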

Page 16

Comparison of learning methods

TABLE 10.1. Some characteristics of different learning methods.
Key: ▲ = good, ◆ = fair, ▼ = poor.

Characteristic                                        Neural Nets   SVM   Trees   MARS   k-NN, Kernels
Natural handling of data of "mixed" type                   ▼         ▼      ▲       ▲         ▼
Handling of missing values                                 ▼         ▼      ▲       ▲         ▲
Robustness to outliers in input space                      ▼         ▼      ▲       ▼         ▲
Insensitive to monotone transformations of inputs          ▼         ▼      ▲       ▼         ▼
Computational scalability (large N)                        ▼         ▼      ▲       ▲         ▼
Ability to deal with irrelevant inputs                     ▼         ▼      ▲       ▲         ▼
Ability to extract linear combinations of features         ▲         ▲      ▼       ▼         ◆
Interpretability                                           ▼         ▼      ◆       ▲         ▼
Predictive power                                           ▲         ▲      ▼       ◆         ▲

Table 10.1 from "Elements of Statistical Learning"
[Slide annotation: "random forests"]

Page 17

Comparison of learning methods

[The same Table 10.1 from "Elements of Statistical Learning" shown again.]

Page 18

The scikit-learn algorithm cheat sheet


http://scikit-learn.org/stable/tutorial/machine_learning_map/

Page 19

https://medium.com/@chris_bour/an-extended-version-of-the-scikit-learn-cheat-sheet-5f46efc6cbb#.g942x8l3d

Page 20

Applying machine learning

Always try multiple models (a sketch of such a comparison follows below)
■  What would you start with?

If accuracy is not high enough:
■  Design new features
■  Collect more data
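A minimal scikit-learn sketch of trying multiple models on the same data with cross-validation; the dataset and the particular list of models are illustrative assumptions, not a recommendation from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "decision tree": DecisionTreeClassifier(),
    "k-nearest-neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(),
}

# 5-fold cross-validated accuracy for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```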