Machine Learning for Knowledge Dissemination in Creative Economies Krzysztof Pampuch


Page 1: Machine Learning for Knowledge Dissemination in Creative ...dbis.fberg.tuke.sk/public/media/0134/pampuch-ml.pdf · Science Machine learning (ML) is a category of algorithm that allows

Machine Learning for Knowledge Dissemination in Creative Economies

Krzysztof Pampuch

Page 2:

• What is machine learning?

• Basic terminology

• Systematics of ML methods

• How to measure the quality of our model

• Selected methods of ML

• What does ML look like in everyday practice?

Page 3:

Statistics

Computer Science

Machine learning (ML) is a category of algorithms that allows software applications to become more accurate in predicting outcomes without being explicitly programmed.

Page 4:

Observation no.   Sepal length   Sepal width   Petal length   Petal width   Label
1                 5.1            3.5           1.4            0.2           Setosa
2                 4.9            3.0           1.4            0.2           Setosa
3                 6.4            3.5           4.5            1.2           Versicolor
…                 …              …             …              …             …
100               5.9            3.0           5.0            1.8           Virginica

Rows: observations

Columns 2-5: features (predictors)

Last column: label (predicted variable)

Page 5:

The McCulloch-Pitts neuron (1943)

The neuron of Frank Rosenblatt (1957)

Learning concept:

Page 6:

Machine learning

• unsupervised: clustering, dimensionality reduction

• supervised: classification, regression

• reinforcement learning

Page 7:

Quantitative

• can be expressed using specific units of measurement

Qualitative

• can be described only in words, can't be ordered

Page 8:

Criteria:

• Efficiency

• Stability

  • For other samples

  • Over time

• Interpretability

Page 9:

• We split the dataset into:

  • Train set - used for training a model

  • Validation set - used to choose the best model

  • Test set - used to make sure that our model is stable

train validation test
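
A minimal sketch of such a three-way split, assuming NumPy arrays and a 60/20/20 proportion. The function name `train_val_test_split`, the fractions and the seed are illustrative assumptions, not values given on the slide.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the observations and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # random order of the observations
    n_test = int(round(test_frac * len(X)))
    n_val = int(round(val_frac * len(X)))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]         # everything that is left
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data (hypothetical): 100 observations with 2 features each.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100)
train, val, test = train_val_test_split(X, y)
```

The model is fitted on `train`, competing models are compared on `val`, and `test` is touched only once at the end.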

Page 10:

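[Figure: k-fold cross-validation; in each round a different part of the data is the test set and the rest is the training set]

Each observation is used exactly once for testing and k-1 times for training

The quality of the model is the mean of the scores computed on all k test sets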
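
The procedure can be sketched as follows. This is a simplified illustration with NumPy; `k_fold_indices` and `cross_val_score` are hypothetical helper names, and the "model" in the usage example is just the training mean.

```python
import numpy as np

def k_fold_indices(n_obs, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every observation is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_obs), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_score(fit, score, X, y, k=5):
    """Average the score obtained on each of the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Toy usage: the "model" is the training mean, the score is the mean squared error.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
mse = cross_val_score(
    fit=lambda X_tr, y_tr: y_tr.mean(),
    score=lambda model, X_te, y_te: np.mean((y_te - model) ** 2),
    X=X, y=y, k=5,
)
```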

Page 11:

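Expected error on a test set:

$E\left[(y_i - \hat{y}_i)^2\right] = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$

$\mathrm{Var}(\hat{y}_i)$ - variance

$\mathrm{Bias}(\hat{y}_i)$ - bias

$\mathrm{Var}(\varepsilon)$ - variance of a random component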

• The bias reflects the error we make when approximating reality with a model

• The variance reflects how much the prediction would change if a different set of data were used to learn the model

• The variance of the random component is independent of the process being modeled and is irreducible

• Best situation: negligible bias and variance

Page 12:

The more "flexible" the method, the lower the bias

$E\left[(y_i - \hat{y}_i)^2\right] = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$

Page 13:

The more "flexible" the method, the higher the variance

$E\left[(y_i - \hat{y}_i)^2\right] = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$
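
The trade-off between bias and variance can be illustrated with a small simulation. This is a sketch under assumptions that are not on the slides: the true function is a sine, the noise is Gaussian, and "flexibility" is the degree of a fitted polynomial.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_repeats=300, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, 30)
    f_true = np.sin(2 * np.pi * x)           # assumed "reality"
    preds = np.empty((n_repeats, x.size))
    for r in range(n_repeats):
        # A fresh noisy data set each time, as if we had collected new data.
        y = f_true + rng.normal(0.0, noise_sd, size=x.size)
        preds[r] = Polynomial.fit(x, y, degree)(x)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

rigid_bias, rigid_var = bias_variance(degree=1)        # rigid model
flexible_bias, flexible_var = bias_variance(degree=7)  # flexible model
```

In this setup the straight line should show the larger squared bias and the degree-7 polynomial the larger variance.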

Page 15:

• Goal: to fit a linear function to our data

• $y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon$

• How to find the model coefficients?

• By minimizing the cost function: $L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

• Disadvantages: sensitivity to outliers, poor modeling of nonlinear relationships
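
A minimal sketch of finding the coefficients by least squares, using NumPy's `np.linalg.lstsq`. The synthetic data and the true coefficients (1, 2, -3) are assumptions made up for the illustration.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared errors."""
    X1 = np.column_stack([np.ones(len(X)), X])   # column of ones for the intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    """Evaluate the fitted linear function."""
    return beta[0] + X @ beta[1:]

# Synthetic data (hypothetical): y = 1 + 2*x1 - 3*x2 + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta = fit_linear_regression(X, y)
y_hat = predict(beta, X)
```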


Page 16:

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

• Values in the range [0; 1]

• Interpretation: how much of the variance of the data does the model explain?

• $\bar{y}$: mean value of $y$
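
The coefficient can be computed directly from its definition. A short sketch with NumPy, where `r_squared` is an illustrative helper name:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])     # model reproduces the data exactly
mean_only = r_squared([1, 2, 3], [2, 2, 2])   # model always predicts the mean
```

A perfect model gives 1, and a model that always predicts the mean gives 0.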

Page 19:

• Misclassification rate: $MR = 1 - \frac{\sum_{i} f_{ii}}{\sum_{i,j} f_{ij}}$

• Accuracy: $ACC = 1 - MR$

• Multi-class log-loss: $MLL = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$

• ROC, AUC, F-measure: $F_1 = \frac{2TP}{2TP + FP + FN}$

Confusion matrix (multi-class):

                      True value
                      0          1          2
Predicted value   0   $f_{00}$   $f_{01}$   $f_{02}$
                  1   $f_{10}$   $f_{11}$   $f_{12}$
                  2   $f_{20}$   $f_{21}$   $f_{22}$

Confusion matrix (binary):

                        True value
                        1/T     0/N
Predicted value   1/T   $TP$    $FP$
                  0/N   $FN$    $TN$
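
The measures above can be sketched in a few lines of NumPy. The function names are illustrative, the confusion matrix uses the slide's convention (rows are predicted values, columns are true values), and the clipping constant `eps` in the log-loss is an assumption added to avoid log(0).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """f[i, j] = number of observations predicted as class i whose true class is j."""
    f = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        f[p, t] += 1
    return f

def accuracy(f):
    return np.trace(f) / f.sum()        # correct predictions lie on the diagonal

def misclassification_rate(f):
    return 1.0 - accuracy(f)

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    y_true = np.asarray(y_true)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Toy labels (hypothetical): one observation of class 1 is predicted as class 2.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
f = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(f)
mr = misclassification_rate(f)
```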

Page 20:

[Figure panels: K-means, DBSCAN]

Page 21:

[Workflow diagram: Data, Feature engineering, Train set, Test set, Model, Learning, Model validation]

Page 22:

• Data almost never has the desired format

• Often we have to acquire data from many sources

• Volume, inflow rate

• Examples of problems

• Storage of terabytes of data

• Data from various DBMS + external data

• Data refreshing and retention

• Consistency of data types

• Unstructured data

• Character encoding, number and date formats

Page 23:

• The most time-consuming activity

• The type of processing required depends on the type of data and the problem

• Generating features – manual vs automatic:

• Examples of generation of the features:

[Diagram, time axis: preprocessing, dimensionality reduction, prediction]

Text

• Regular expressions

• Tokenization

• Lemmatization

• Bag-of-words

• TF-IDF

Customer data

• Total payments

• Balance on accounts

• Number of logins

• Demographic data

Audio / video

• Signal framing

• LPC, MFCC

• Color/gradient histograms

• SIFT, SURF

• Bag-of-words
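
For the text case, the chain from regular expressions through bag-of-words to TF-IDF can be sketched with the standard library only. This is one common TF-IDF variant (term frequency times the log of the inverse document frequency), chosen here as an assumption; the three sample documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Regular-expression tokenization into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)               # the bag-of-words of this document
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors

docs = ["the cat sat", "the dog sat down", "the bird"]
vocab, vectors = tf_idf(docs)
```

A token that occurs in every document, such as "the" here, gets weight 0, while rarer tokens get higher weights.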

Page 24:

• High dimensionality of the space of features:

• Degrades the predictive power of models

• Introduces redundancy (variable correlation)

• Leads to overfitting

• Requires larger data sets to achieve the same goal

• Increases the computational effort

• And besides… decision-makers do not like complex models and many variables

• So let’s reduce the dimensionality!

• Principle of operation (most often):

• The most accurate reproduction of the data in a space of lower dimensionality

• The best possible highlighting of the information that differentiates the values of the predicted variable

Feature extraction: $(x_1, x_2, \ldots, x_n) \xrightarrow{f} (y_1, y_2, \ldots, y_k)$

Feature selection: $(x_1, x_2, \ldots, x_n) \rightarrow (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$

$k < n$
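
The difference between the two approaches can be sketched with NumPy. PCA stands in for feature extraction and a simple "keep the k highest-variance columns" rule stands in for feature selection. Both choices are assumptions made for the illustration; the slides do not name specific methods.

```python
import numpy as np

def pca_extract(X, k):
    """Feature extraction: project centered data onto the first k principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T                  # new features y_1..y_k, mixtures of all x

def select_by_variance(X, k):
    """Feature selection: keep k of the original columns, unchanged."""
    keep = np.sort(np.argsort(X.var(axis=0))[::-1][:k])
    return X[:, keep], keep

# Toy data (hypothetical): 5 features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([1.0, 5.0, 0.1, 3.0, 0.5])

Y = pca_extract(X, k=2)                   # shape (50, 2), new coordinates
X_sel, keep = select_by_variance(X, k=2)  # shape (50, 2), original columns
```

Extraction builds new variables from all the inputs, whereas selection returns a subset of the original variables, which is often easier to interpret.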
