Validation methods - PyData Israel

Validation methods Nathaniel Shimoni PyData Israel 16/2/2017

Page 1: Validation methods - PyData Israel

Validation methods

Nathaniel Shimoni, PyData Israel, 16/2/2017

Page 2: Validation methods - PyData Israel

Validation techniques (overview diagram)

• Why? Model selection, hyper-parameter tuning (𝝁, 𝜽, 𝜸, 𝜷, 𝜶, 𝜹), early stopping
• Basics: train-test split, Kfold, leave one out (LOO), leave P out
• Grouped data: group Kfold, leave one group out
• Time series: sliding window, anchored sliding window, time based group Kfold
• Unbalanced data: stratified methods

Page 4: Validation methods - PyData Israel

We use validation to balance two things:

1. Fit our training data as well as we can (while…)
2. Generalize well to get the best performance on unseen data (i.e. refrain from over-fitting)

We use validation to:
• Select the best model
• Select the best hyper-parameters
• Stop the training process early
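As a rough illustration of two of these uses, the sketch below picks the model with the lowest validation error and finds an early-stopping round from a list of validation losses. The function name `early_stopping_round`, the `patience` parameter and all numbers are assumptions made up for this example; they do not come from the talk.

```python
def early_stopping_round(val_losses, patience=3):
    """Return (best_round, best_loss), stopping the scan once the validation
    loss has not improved for `patience` consecutive rounds."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_loss


# Hypothetical validation losses recorded after each training round.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.40]
best_round, best_loss = early_stopping_round(losses, patience=3)

# Model selection: keep the candidate with the lowest validation error.
# The model names and error values are invented for illustration.
val_error = {"model_a": 0.21, "model_b": 0.17, "model_c": 0.19}
best_model = min(val_error, key=val_error.get)
```

Note that the scan stops before it reaches the last loss value, so a later improvement is never seen. That is the usual trade-off when choosing `patience`.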

Page 6: Validation methods - PyData Israel

Train test split

The most basic validation technique

It is based on a hold-out sample

We split the training data randomly and test our performance on the unseen data
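A minimal sketch of such a random hold-out split, using only the standard library. The helper name `train_test_split_indices` and its parameters are hypothetical; in practice a library routine such as scikit-learn's `train_test_split` would normally be used instead.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and hold out the last `test_fraction` of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_samples * test_fraction))
    split = n_samples - n_test
    return indices[:split], indices[split:]


# 20 samples, a quarter of them held out for testing.
train_idx, test_idx = train_test_split_indices(20, test_fraction=0.25, seed=42)
```

The model is fitted on `train_idx` only and scored on `test_idx`, which it has never seen.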

Page 7: Validation methods - PyData Israel

The train-test split validation method is very common. Its main benefits:
• Computational efficiency
• Simplicity

But it has two disadvantages:
• We lose a large amount of data (the held-out part is not used for training)
• It might suffer from skew / bias

Cross Validation

[Diagram: the data divided into Fold 1, Fold 2 and Fold 3]

Stratification

Select folds in a way that keeps an equal proportion of the target variable in each fold
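The fold idea and its stratified variant can be sketched in plain Python as follows. Both function names are hypothetical and shuffling is left out for brevity; scikit-learn's `KFold` and `StratifiedKFold` are the usual production choices.

```python
from collections import defaultdict


def kfold_indices(n_samples, n_folds=3):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def stratified_kfold_indices(labels, n_folds=3):
    """Like k-fold, but deal each class's samples round-robin across the folds
    so every fold keeps roughly the same class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, sorted(folds[k])


# Made-up unbalanced target: 9 negatives and 3 positives.
labels = [0] * 9 + [1] * 3
positives_per_fold = [
    sum(labels[i] for i in val)
    for _, val in stratified_kfold_indices(labels, n_folds=3)
]
```

With this toy target every validation fold receives one of the three positives, whereas a plain split by position could leave a fold with none.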

Page 9: Validation methods - PyData Israel

What if the samples are drawn from different groups?

Can we retrain for new groups?
• Yes: use Kfold / stratified Kfold
• No: use group-based folding methods
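A group-based folding method keeps all samples of a group on the same side of every split, so the validation score reflects performance on groups the model has never seen. The sketch below is one simple way to do that; the function name and the group labels are invented for illustration, and scikit-learn provides `GroupKFold` and `LeaveOneGroupOut` for real use.

```python
def group_kfold_indices(groups, n_folds=3):
    """Yield (train, validation) index lists such that every group appears
    in exactly one validation fold and never on both sides of a split."""
    unique_groups = sorted(set(groups))
    # Assign whole groups to folds in round-robin order.
    fold_of_group = {g: n % n_folds for n, g in enumerate(unique_groups)}
    for k in range(n_folds):
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != k]
        val = [i for i, g in enumerate(groups) if fold_of_group[g] == k]
        yield train, val


# Hypothetical group label per sample (e.g. one label per person or device).
groups = ["ann", "ann", "bob", "bob", "bob", "cat", "dan", "dan"]
splits = list(group_kfold_indices(groups, n_folds=3))
```

Setting `n_folds` to the number of distinct groups gives leave one group out.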

Page 12: Validation methods - PyData Israel

Time series data?

Are we predicting a specific time frame?
Are we predicting future events?
• Use time based folds / split
• Use sliding window
• Use anchored sliding window
• Random split is more like an imputation problem

[Diagram: a Jan to Dec timeline marked with train and validation periods]
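The two window schemes can be sketched with one generator, assuming the data is already ordered by time and indexed by period. The function name `time_window_splits` and the window sizes are assumptions chosen for this example; scikit-learn's `TimeSeriesSplit` offers a comparable splitter.

```python
def time_window_splits(n_periods, train_size, val_size=1, anchored=False):
    """Yield (train, validation) period indices in chronological order.

    Sliding window: the training window has a fixed length and moves forward.
    Anchored sliding window: training always starts at period 0 and grows.
    """
    start = 0
    while start + train_size + val_size <= n_periods:
        train_end = start + train_size
        train_start = 0 if anchored else start
        yield (list(range(train_start, train_end)),
               list(range(train_end, train_end + val_size)))
        start += val_size


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Six months of training, then two months of validation, moving forward in time.
sliding = list(time_window_splits(len(months), train_size=6, val_size=2))
anchored = list(time_window_splits(len(months), train_size=6, val_size=2,
                                   anchored=True))
```

In both schemes every validation period lies strictly after its training periods, which is what separates forecasting from the imputation-like setting of a random split.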
