Introduction to Machine Learning
Bert Kappen
SNN, Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
September 1, 2014
Based on the book
Machine Learning: A Probabilistic Perspective
Kevin P. Murphy
Bert Kappen Inl. ML
Big data
• Science: neuroscience, astronomy, medical
• Internet (facebook, yahoo, google, IBM)
• Advertisement
• ...
Machine Learning: "a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!)".
Types of machine learning:
• Predictive or Supervised: classification, regression, function fitting
• Descriptive or Unsupervised: clustering, pattern recognition, visualization, 'discovery'
• Reinforcement learning (not this course)
Supervised Learning/Classification
Learn a mapping y = f(x) from inputs x to outputs y, with y ∈ {1, . . . ,C}; x are 'features'.
Predict y for previously unseen x (generalization).
We require probabilities for this: p(y|x,D) with D the previous data. This is a vector of length C such that
∑y p(y|x,D) = 1.
Use the MAP estimate ŷ = argmaxy p(y|x,D) as the most likely prediction.
Use the probabilities to quantify uncertainty. Examples: Jeopardy and ad placement.
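A minimal sketch of the MAP prediction, using made-up posterior values for C = 3 classes:

```python
import numpy as np

# Illustrative posterior p(y|x, D) over C = 3 classes (values are made up).
p = np.array([0.2, 0.5, 0.3])

# MAP estimate: the most probable class label.
y_map = int(np.argmax(p))

# The probability of the chosen class quantifies how certain the prediction is.
confidence = float(p[y_map])
```

Here y_map is 1 and the confidence is 0.5, so the prediction is far from certain even though it is the best guess.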
Document classification
The aim is to classify documents into a number of classes. Compute p(y|x,D) with x some representation of the text. Examples: spam filtering, classify as spam yes/no (ch. 8); news feeds, classify topics.
Text is often represented as bag of words: xi j = 1 iff word j appears in document i.
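A minimal sketch of building such a binary bag-of-words matrix, using a made-up three-document corpus:

```python
import numpy as np

# Hypothetical mini-corpus; the vocabulary is built from the documents themselves.
docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap meeting"]
vocab = sorted({w for d in docs for w in d.split()})

# Binary bag-of-words: X[i, j] = 1 iff word j appears in document i.
X = np.array([[1 if w in d.split() else 0 for w in vocab] for d in docs])
```

Each row of X is the bit-vector representation of one document; word order and word counts are discarded.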
Subset of 1000 documents and 100 words from USENET. Each row is a document (represented as a bag-of-words bit vector), each column is a word. The red lines separate the 4 classes, which are comp, rec, sci, talk.
Flower classification
Feature extraction is a non-trivial first step. Features extracted: sepal length and width, and petal length and width.
Flower classification
[Figure: pairwise scatter plots of the four iris features: sepal length, sepal width, petal length, petal width]
We see that it is easy to distinguish setosas (red circles) from the other two classes by just checking their petal length. However, distinguishing versicolor from virginica is slightly harder; any decision will need to be based on at least two features.
No features: Handwritten character recognition
[Figure: example handwritten digits with true class labels 7, 2, 1, 0, 4, 1, 4, 9, 5]
Data from the National Institute of Standards (the MNIST data set) consists of 60,000 training and 10,000 test images.
Most methods are insensitive to the spatial layout of the pixels: a classifier trainedon permuted data works equally well (ex. 1.1).
Face detection and face recognition
Face detection: classify areas that likely contain a face.
Face recognition: classify face areas to identify persons
Regression
Is like classification, but with y continuous.
Examples:
• tomorrow's stock value based on today's stock values and other attributes
• the location of a robot based on sensor data
• the survival probability of a patient with certain characteristics
• ...
[Figure: polynomial fits of degree 1 and degree 2 to the same data]
(Over)fitting the data: find the best model to describe a given data set.
Generalization: find a model that best predicts on new data.
Unsupervised learning
Goal is to find ’interesting structure’ in the data.
[Figure: scatter plot of weight versus height]
Build a model of the form p(x|θ) with θ a (set of) parameters and x (high-dimensional) data.
Unsupervised learning is more like human or animal learning because it does not require labeled data:
When we're learning to see, nobody's telling us what the right answers are - we just look. Every so often, your mother says "that's a dog", but that's very little information. You'd be lucky if you got a few bits of information - even one bit per second - that way. The brain's visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it's no use learning one bit per second. You need more like 10^5 bits per second. And there's only one place you can get that much information: from the input itself. (Hinton 1996).
Discovering clusters
Height and weight of 210 persons. a) Unsupervised: only the data points. b) Supervised: also the class labels (male and female).
Choose the number of clusters K and compute p(K|D). The model with K* = argmaxK p(K|D) is the best model.
Assign a label to each data point: z*i = argmaxz=1,2 p(z|xi,D) when K = 2 (right figure).
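As a simplified sketch of the assignment step: given two cluster centers (all numbers made up for the example), each point is assigned to its nearest center, a hard-assignment stand-in for argmax over z:

```python
import numpy as np

# Two illustrative cluster centers, e.g. the mean (height, weight) per cluster.
centers = np.array([[65.0, 130.0],
                    [72.0, 190.0]])

# Three made-up data points.
X = np.array([[64.0, 125.0],
              [73.0, 200.0],
              [66.0, 140.0]])

# Distance of every point to every center, then pick the nearest:
# a simplified analogue of z*_i = argmax_z p(z | x_i, D) with K = 2.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
z = dists.argmin(axis=1)
```

The probabilistic version in the slides additionally weighs how plausible each cluster is, rather than using hard Euclidean distance alone.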
Examples of clustering problems
Examples (refs in book):
• The autoclass system discovered a new type of star based on clustering astro-physical measurements
• In e-commerce, users are clustered into groups based on their purchasing orother behavior and then selectively targeted for advertisement
• In biology, flow cytometry data is clustered into groups to discover different subpopulations of cells.
Dimension reduction
Often high dimensional data can be reduced to a lower dimension that capturesthe ’essence’ of the problem.
A simple method is called principal component analysis (PCA). In the example, the 3-dimensional data lie in 2 dimensions and their variability can approximately be characterized by 1 number (latent factor).
PCA is a linear dimension reduction method. Non-linear methods also exist.
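A minimal PCA sketch via the SVD of the centered data matrix, on synthetic 3-d data that is essentially 1-dimensional (the data and seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-d data generated from a single latent factor plus small noise.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, -1.0, 0.5]]) + 0.01 * rng.normal(size=(200, 3))

# PCA: center the data, then take the top singular directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of variance captured by each principal component.
explained = s**2 / np.sum(s**2)

# 1-d representation: projection onto the first principal direction.
scores = Xc @ Vt[0]
```

Because the data was generated from one latent factor, the first component captures nearly all of the variance.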
Dimension reduction
a) 25 randomly chosen 64 × 64 pixel images from the Olivetti face database. b) The mean and the first three principal component basis vectors (eigenfaces).
Advantages of dimension reduction: higher classification accuracy, better generalization, visualization.
Discovering graph structure
Represent the dependencies between variables in a graph G: nodes represent variables, and edges represent direct dependence between variables.
We learn a distribution over graphs from data, p(G|D), and choose the best one: argmaxG p(G|D).
Biology: measure the phosphorylation status of some proteins in a cell (Sachs et al. 2005) and construct the interdependencies between the different proteins.
[Figure: learned protein interaction graph, lambda = 7.00, 18 edges]
Finance: model the covariance between a large number of different stocks, using sparse priors.
Matrix completion
A questionnaire where people have not filled in all questions. Image inpainting based on pairwise Markov random fields.
Collaborative filtering: xi j denotes how much user i likes movie j.
[Figure: users × movies rating matrix with many missing entries]
Netflix 2006 prize of 1 M$: 18k movies and 500k users, with 1% of entries observed. See section 27.6.2.
Basic concepts of machine learning
Parametric versus non-parametric models: does the model have a fixed set of parameters, or does the number of parameters grow with the amount of data?
Example: K nearest neighbor classifier (KNN).
p(y = c|x,D) = (1/K) ∑i∈NK(x,D) I(yi = c)
NK(x,D) is the set of K nearest neighbors of x in the total data set D, and I is the indicator function. The distance is typically Euclidean, but other distances can be used.
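The formula above can be sketched directly (the tiny data set is made up for illustration):

```python
import numpy as np

def knn_predict_proba(x, X, y, K, C):
    """p(y = c|x, D): fraction of the K nearest neighbors of x with label c."""
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to all points
    nearest = np.argsort(dists)[:K]         # indices of the K nearest neighbors
    return np.bincount(y[nearest], minlength=C) / K

# Tiny made-up data set with C = 2 classes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

p = knn_predict_proba(np.array([0.05, 0.05]), X, y, K=3, C=2)
```

With K = 3, two of the three nearest neighbors have label 0, so p = [2/3, 1/3]; the argmax gives the predicted label 0.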
Two-dimensional example with 3 classes and K = 10. a) data. b) p(y = 1|x,D). c) p(y = 2|x,D). d) argmaxy p(y|x,D).
Curse of dimensionality
The KNN model performs poorly in high dimensions.
[Figure: edge length of the sub-cube versus fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10]
The reasoning:
• Estimating p(y|x,D) requires a fixed fraction f of the data in the neighborhood of x; otherwise there are no statistics.
• Assume the data is uniformly distributed in the N-dimensional unit hypercube.
• The fraction of data in a smaller hypercube with edge length ε is f = ε^N, so εN(f) = f^(1/N).
• When N is large, the neighborhood almost coincides with the unit hypercube, even for small f.
• Thus, KNN looks at neighbors very far away from x, which are not good predictors for the class membership of x.
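The edge-length formula makes the effect concrete:

```python
def edge_length(f, N):
    """Edge length eps_N(f) = f**(1/N) of the sub-cube that contains a
    fraction f of uniformly distributed data in the N-dim unit hypercube."""
    return f ** (1.0 / N)

# Even a tiny fraction f = 0.01 requires a neighborhood spanning most of
# the cube once N is large.
e1 = edge_length(0.01, 1)     # 1% of each axis in 1 dimension
e10 = edge_length(0.01, 10)   # already ~63% of each axis
e100 = edge_length(0.01, 100) # ~96% of each axis
```

So in 100 dimensions, the "neighborhood" holding only 1% of the data stretches across almost the entire cube.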
Parametric models
Models with a fixed number of parameters tend to suffer less from the curse of dimensionality.
Linear regression:
y(x) = ∑j wj xj + ε
ε is the error between the predicted output wT x and the actual output y. ε is often modelled as a Gaussian random variable, ε ∼ N(µ, σ^2) (see ch. 2).
The probability is
p(y|x, θ) = N(y|µ(x), σ^2(x))
In the simplest case µ(x) = wT x and the noise is fixed, σ^2(x) = σ^2. Then θ = (w, σ^2) are the parameters of the model.
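Under this Gaussian noise model, maximizing the likelihood of w is equivalent to least squares. A minimal sketch on synthetic data (the true weights, noise level, and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data y = w^T x + eps with eps ~ N(0, sigma^2).
w_true = np.array([2.0, -1.0])
sigma = 0.1
X = rng.normal(size=(100, 2))
y = X @ w_true + sigma * rng.normal(size=100)

# Maximum-likelihood estimate of w under Gaussian noise = least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum-likelihood estimate of the noise variance from the residuals.
sigma2_hat = float(np.mean((y - X @ w_hat) ** 2))
```

With 100 samples, w_hat lands close to the generating weights and sigma2_hat close to sigma^2 = 0.01.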
Linear regression with basis functions
We can model non-linear relations by using (fixed) non-linear functions φ j(x). Then
p(y|x, θ) = N(y|wTφ(x), σ^2)
φ(x) = (1, x, x^2, . . . , x^d). a) d = 14. b) d = 20.
Many popular methods (support vector machines, neural networks, tree methods)extend this basic idea (ch. 14 and 16).
When d is too large, the model overfits on the data: poor generalization.
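The overfitting symptom is easy to see numerically: training error keeps dropping as the degree d grows, even though generalization eventually gets worse. A sketch on a made-up target function:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth target function (made up for the example).
x = np.linspace(-1.0, 1.0, 30)
y = 5.0 * np.sin(3.0 * x) + rng.normal(scale=1.0, size=x.size)

def train_mse(x, y, d):
    """Fit phi(x) = (1, x, ..., x^d) by least squares; return the mean
    squared error on the training data."""
    Phi = np.vander(x, d + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))

# Training error is non-increasing in d, because higher-degree models
# contain the lower-degree ones.
errors = [train_mse(x, y, d) for d in (1, 2, 5, 14)]
```

A low training error therefore says nothing by itself; only performance on held-out data reveals overfitting.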
Logistic regression
In the case of classification, the output y is binary, y ∈ {0, 1}. We construct a classifier in two steps:
• construct µ(x) = σ(wT x) with
σ(x) = 1 / (1 + e^(−x))
0 ≤ µ(x) ≤ 1 can be interpreted as a probability.
• Define p(y|x,w) = Ber(y|µ(x)), with Ber(y|p) = p^y (1 − p)^(1−y) the Bernoulli distribution (ch. 2)
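The two steps above in a minimal sketch (the weight vector and input are illustrative values, not fitted parameters):

```python
import numpy as np

def sigmoid(a):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(y, mu):
    """Ber(y|mu) = mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu**y * (1.0 - mu)**(1 - y)

# Step 1: mu(x) = sigma(w^T x), interpreted as p(y = 1 | x, w).
w = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])    # illustrative input
mu = float(sigmoid(w @ x))  # w^T x = 0 here, so mu = 0.5

# Step 2: the Bernoulli likelihood of each possible label.
p1 = bernoulli(1, mu)
p0 = bernoulli(0, mu)
```

Note mu and 1 − mu always sum to 1, so the Bernoulli output is a proper distribution over the two labels.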
Example: SAT scores
Classify whether students pass or fail a class (y) based on their SAT scores (x).
p(yi = 1|xi,w) = σ(w0 + w1xi)
Black dots show the training data. The red circles plot p(yi = 1|xi, w) with w the parameters estimated from the training data (ch. 8). Green dots: two students with identical SAT scores perform differently in the class.
The decision rule y(x) = 1 ⇔ p(y = 1|x) > 0.5 is not perfect: the data is not linearly separable.
Model selection
When varying K we obtain different KNN classifiers: large K averages more data points and gives smoother models.
[Figure: predicted labels for K = 1 and K = 5, classes c1, c2, c3]
Misclassification rate err(f,D) = (1/N) ∑i I(f(xi) ≠ yi) on training data (blue) and on test data using cross validation (red), as a function of K.
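The misclassification rate is just the fraction of wrong predictions. A minimal sketch with a made-up data set and a deliberately bad constant classifier:

```python
import numpy as np

def misclassification_rate(f, X, y):
    """err(f, D) = (1/N) * sum_i I(f(x_i) != y_i)."""
    return float(np.mean([f(xi) != yi for xi, yi in zip(X, y)]))

# Made-up check: a classifier that always predicts class 0 gets half of
# these four labels wrong.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
err = misclassification_rate(lambda xi: 0, X, y)
```

For model selection, this quantity is evaluated on held-out data (cross validation), since the training value alone rewards overfitting.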
No free lunch
All models are wrong, but some models are useful (George Box 1987).
We can empirically choose the best model for our problem.
There is no universally best model (no free lunch theorem).
Outline of course (tentative)
1. Introduction
2. Probability
(a) review of probability theory
(b) discrete distributions
(c) continuous distributions
(d) joint distributions (Gaussian, Dirichlet)
(e) transformation of random variables
(f) Monte Carlo sampling
(g) Information theory
3. Generative models for discrete data
(a) Bayesian concept learning
(b) beta-binomial model
(c) Dirichlet model
(d) Naive Bayes classifiers
4. Gaussian models
(a) Maximum likelihood models
(b) Discriminant analysis
(c) Inference in jointly Gaussian distributions
5. Bayesian statistics
(a) Summarizing posterior distributions
(b) Bayesian model selection
(c) (Priors)
(d) Hierarchical Bayes
(e) Empirical Bayes
6. ...