Machine Learning for Knowledge Dissemination in Creative Economies
Krzysztof Pampuch
• What is machine learning?
• Basic terminology
• Systematics of ML methods
• How to measure the quality of our model
• Selected methods of ML
• What does ML look like in everyday practice?
[Diagram: machine learning at the intersection of Statistics and Computer Science]
Machine learning (ML) is a category of algorithm that allows software applications to become more accurate in predicting outcomes without being explicitly programmed.
Observation no. | Sepal length | Sepal width | Petal length | Petal width | Label
1   | 5.1 | 3.5 | 1.4 | 0.2 | Setosa
2   | 4.9 | 3.0 | 1.4 | 0.2 | Setosa
3   | 6.4 | 3.5 | 4.5 | 1.2 | Versicolor
…   | …   | …   | …   | …   | …
100 | 5.9 | 3.0 | 5.0 | 1.8 | Virginica
Rows are observations; columns are features (predictors); the last column is the label (predicted variable).
The McCulloch-Pitts neuron (1943)
The neuron of Frank Rosenblatt (1957)
The concept of learning:
Machine learning
• supervised: classification, regression
• unsupervised: clustering, dimensionality reduction
• reinforcement learning
quantitative
• can be expressed using specific units of measurement
qualitative
• can only be described in words; cannot be ordered
Criteria:
• Efficiency
• Stability
• For other samples
• Over time
• Interpretability
• We split the dataset into:
• Train set - used for training a model
• Validation set - used to choose the best model
• Test set - used to make sure that our model is stable
[Diagram: the data split into train | validation | test]
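As a rough sketch of this split, here is a minimal NumPy helper; the 60/20/20 proportions, the function name, and the fixed random seed are illustrative assumptions, not part of the original material.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data and split it into train / validation / test parts.

    The 60/20/20 proportions are an illustrative assumption, not a rule.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))           # random order of observations
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

Shuffling before splitting avoids a split that follows any ordering already present in the data.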
[Diagram: k-fold cross-validation - in each of the k rounds a different fold serves as the test set, the remaining folds as the training set]
Each observation is used exactly once for testing and k-1 times for training.
The quality of the model is the mean of the scores over all k test folds.
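The procedure above can be sketched in a few lines of NumPy; the `fit`/`score` callables and their signatures are assumptions of this sketch, standing in for whatever model and metric are in use.

```python
import numpy as np

def k_fold_score(X, y, fit, score, k=5, seed=0):
    """k-fold cross-validation: each observation is used exactly once for
    testing and k-1 times for training; return the mean fold score.

    fit(X_train, y_train) -> model and score(model, X_test, y_test) -> float
    are caller-supplied callables (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return float(np.mean(scores))
```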
The expected error on a test set:
$E[(y_i - \hat{y}_i)^2] = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$
$\mathrm{Var}(\hat{y}_i)$ - variance
$\mathrm{Bias}(\hat{y}_i)$ - bias
$\mathrm{Var}(\varepsilon)$ - variance of the random component
• The bias reflects the error we make when approximating reality with a model
• The variance reflects how much the prediction would change if a different data set were used to train the model
• The variance of the random component is independent of the modelled process and is irreducible
• Best situation: negligible bias and variance
The more "flexible" the method, the lower the bias and the higher the variance:
$E[(y_i - \hat{y}_i)^2] = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$
• Goal: to fit a linear function to our data
$y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon$
• How to find the model coefficients?
• By minimizing the cost function:
$L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
• Disadvantages: sensitivity to outliers, poor modelling of nonlinear relationships
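A minimal sketch of the least-squares fit, using NumPy's `lstsq` to minimise the cost function above; the helper names are illustrative assumptions.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares: minimise L = sum_i (y_i - yhat_i)^2.
    A column of ones is prepended so beta[0] is the intercept beta_0."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def predict_linear(beta, X):
    """Evaluate yhat = beta_0 + sum_i beta_i * x_i for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta
```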
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
• Values in the range [0; 1]
• Interpretation: how much of the variance of the data does the model explain?
• $\bar{y}$ is the mean value of $y$
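The $R^2$ formula can be computed directly from predictions; a minimal sketch:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot: the share of the variance of y
    that the model explains."""
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect model gives 1; a model no better than predicting the mean gives 0.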
• Misclassification Rate: $MR = 1 - \frac{\sum_i f_{ii}}{\sum_{i,j} f_{ij}}$
• Accuracy: $ACC = 1 - MR$
• Multi-class log-loss: $MLL = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$
• ROC, AUC, F-measure: $F_1 = \frac{2TP}{2TP + FP + FN}$
Confusion matrix (predicted value in rows, true value in columns):

                 True 0   True 1   True 2
Predicted 0      f00      f01      f02
Predicted 1      f10      f11      f12
Predicted 2      f20      f21      f22
Binary confusion matrix:

                   True 1/T   True 0/N
Predicted 1/T      TP         FP
Predicted 0/N      FN         TN
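The accuracy and F1 formulas above follow directly from the binary confusion-matrix entries; a minimal sketch (the function name is an assumption):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy and F1 from the binary confusion matrix:
    ACC = 1 - MR,  F1 = 2*TP / (2*TP + FP + FN)."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total            # ACC = 1 - misclassification rate
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
    return acc, f1
```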
Clustering methods: K-means, DBSCAN
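A minimal sketch of K-means (Lloyd's algorithm) in NumPy; the random initialisation from the data points and the fixed iteration count are simplifying assumptions of this example.

```python
import numpy as np

def k_means(X, k=2, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points (a simple, assumed initialisation)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances of every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```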
Data → Feature engineering → Train set / Test set → Model learning → Model validation
• Data almost never has the desired format
• Often we have to acquire data from many sources
• Volume, inflow rate
• Examples of problems
• Storage of terabytes of data
• Data from various DBMS + external data
• Data refreshing and retention
• Consistency of data types
• Unstructured data
• Character encoding, number and date formats
• The most time-consuming activity
• The type of processing required depends on the type of data and the problem
• Generating features – manual vs automatic:
• Examples of generation of the features:
[Diagram: share of time spent on preprocessing, dimensionality reduction and prediction]
Text:
• regular expressions • tokenization • lemmatization • bag-of-words • TF-IDF
Customer data:
• total payments • balance on accounts • number of logins • demographic data
Audio / video:
• signal framing • LPC, MFCC • color/gradient histograms • SIFT, SURF • bag-of-words
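As a sketch of the bag-of-words idea for text, assuming naive lowercase whitespace tokenisation (no regular expressions or lemmatisation, which a real pipeline would add):

```python
from collections import Counter

def bag_of_words(docs):
    """Turn each document into a vector of word counts over a shared
    vocabulary (sorted for a deterministic feature order)."""
    tokens = [doc.lower().split() for doc in docs]       # naive tokenisation
    vocab = sorted(set(w for t in tokens for w in t))
    vectors = []
    for t in tokens:
        counts = Counter(t)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors
```

TF-IDF would then reweight these raw counts by how rare each word is across documents.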
• High dimensionality of the space of features:
• Degrades the predictive power of models
• Introduces redundancy (variable correlation)
• Leads to overfitting
• Requires larger data sets to achieve the same goal
• Increases the computational effort
• And besides… decision-makers do not like complex models and many variables
• So let’s reduce the dimensionality!
• Principle of operation (most often):
• The most accurate reproduction of data in the space of lower dimensionality
• The best possible highlighting of information differentiating the predicted value of variables
Feature extraction: $(x_1, x_2, \ldots, x_n) \xrightarrow{f} (y_1, y_2, \ldots, y_k)$
Feature selection: $(x_1, x_2, \ldots, x_n) \to (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$
with $k < n$ in both cases.
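Feature extraction as described (mapping n features to k < n new ones) can be illustrated with PCA computed via the SVD; choosing PCA as the mapping f is an assumption of this sketch, one common option among many.

```python
import numpy as np

def pca(X, k):
    """Feature extraction sketch: project centred data onto the k
    directions of largest variance (the principal components)."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # new features y_1 .. y_k
```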