Introduction to Machine Learning
Introduction to Machine Learning
Jinhyuk Choi
Human-Computer Interaction Lab @ Information and Communications University
Contents
Concepts of Machine Learning
Multilayer Perceptrons
Decision Trees
Bayesian Networks
What is Machine Learning?
Large storage / large amounts of data
Data that looks random but contains certain patterns
Web log data
Medical record
Network optimization
Bioinformatics
Machine vision
Speech recognition…
No complete identification of the underlying process
A good or useful approximation is enough
What is Machine Learning? Definition
Programming computers to optimize a
performance criterion using example data or past
experience
Role of Statistics
Inference from a sample
Role of Computer science
Efficient algorithms to solve the optimization problem
Representing and evaluating the model for inference
Descriptive (training) / predictive (generalization)
Learning from Human-generated data??
What is Machine Learning? Concept Learning
• Inducing general functions from specific training examples (positive or negative)
• Looking for the hypothesis that best fits the training examples
• Concepts:
- describing some subset of objects or events defined over a larger set
- a boolean-valued function
Objects: eyes, nose, legs, reproductive ability, …; inanimate objects, …
Concept "Bird": wings, beak, feathers, …
Concept as a boolean function: Bird(animal) → true or false
What is Machine Learning? Concept Learning
Inferring a boolean-valued function from training examples of its input and output
Positive examples
Negative examples
Hypothesis 1
Hypothesis 2
Concept
Web log data
Medical record
Network optimization
Bioinformatics
Machine vision
Speech recognition…
What is Machine Learning? Learning Problem Design
Do you enjoy sports ?
Learn to predict the value of “EnjoySports” for an arbitrary day, based on the value of its other attributes
What problem?
Why learning?
Attribute selection
Effective?
Enough?
What learning algorithm?
Applications
Learning associations
Classification
Regression
Unsupervised learning
Reinforcement learning
Examples (1)
TV program preference inference based on web usage data
Web page #1
Web page #2
Web page #3
Web page #4
….
Classifier
TV Program #1
TV Program #2
TV Program #3
TV Program #4
….
What are we supposed to do at each step?
Examples (2): from a HW of Neural Networks Class (KAIST-2002)
Function approximation of the Mexican-hat function f(x1, x2), with x1, x2 ∈ [−1, 1]
Examples (3): from a HW of Machine Learning Class (ICU-2006)
Face image classification
Examples (4): from a HW of Machine Learning Class (ICU-2006)
Examples (5): from a HW of Machine Learning Class (ICU-2006)
Sensay
Examples (6)
A. Krause et al., "Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable
Computing”, ISWC 2005
#1. Multilayer Perceptrons
Neural Network?
VS. Adaline
MLP
SOM
Hopfield network
RBFN
Bifurcating neuron networks
…
Multilayer Networks of Sigmoid Units
• Supervised learning
• 2-layer
• Fully connected
Really looks like the brain??
Sigmoid Unit
The back-propagation algorithm
Network model
Input layer x_i, hidden layer y_j, output layer o_k
Hidden units: y_j = σ(Σ_i v_ji x_i), with input-to-hidden weights v_ji
Output units: o_k = σ(Σ_j w_kj y_j), with hidden-to-output weights w_kj
Error function: E(v, w) = ½ Σ_k (t_k − o_k)²
Stochastic gradient descent
Gradient-Descent Function Minimization
Gradient-descent function minimization
In order to find a vector parameter x that minimizes a function f(x):
Start with a random initial value x = x_0.
Determine the direction of steepest descent in the parameter space from the gradient ∇f = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n).
Move a step in that direction: x(i+1) = x(i) − η ∇f.
Repeat the above two steps until there is no more change in x (a sketch of this loop follows below).
For gradient-descent to work…
The function to be minimized should be continuous.
The function should not have too many local minima.
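A minimal MATLAB sketch of this procedure (not from the original slides), applied to a simple quadratic f whose gradient is known in closed form:
f     = @(x) (x(1) - 3)^2 + (x(2) + 1)^2;     % function to be minimized
gradf = @(x) [2*(x(1) - 3); 2*(x(2) + 1)];    % its gradient
x   = randn(2, 1);                            % random initial value x0
eta = 0.1;                                    % step size (learning rate)
for i = 1:1000
    g = gradf(x);
    x = x - eta * g;                          % step opposite to the gradient
    if norm(g) < 1e-6, break; end             % stop when x no longer changes
end
disp(x)                                       % approaches the minimizer [3; -1]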
Back-propagation
Derivation of back-propagation algorithm
Adjustment of w_kj:
∂E/∂w_kj = ∂/∂w_kj [ ½ Σ_k (t_k − o_k)² ]
         = −(t_k − o_k) ∂o_k/∂w_kj
         = −(t_k − o_k) o_k (1 − o_k) y_j
With δ_k ≡ o_k (1 − o_k)(t_k − o_k), the update is
Δw_kj = −η ∂E/∂w_kj = η δ_k y_j
Derivation of back-propagation algorithm
Adjustment of v_ji:
∂E/∂v_ji = ∂/∂v_ji [ ½ Σ_k (t_k − o_k)² ]
         = −Σ_k (t_k − o_k) o_k (1 − o_k) w_kj ∂y_j/∂v_ji
         = −[ Σ_k δ_k w_kj ] y_j (1 − y_j) x_i
With δ_j ≡ y_j (1 − y_j) Σ_k δ_k w_kj, the update is
Δv_ji = −η ∂E/∂v_ji = η δ_j x_i
Backpropagation
Batch learning vs. Incremental learning
Batch standard backprop proceeds as follows:
  Initialize the weights W.
  Repeat the following steps:
    Process all the training data D_L to compute the gradient of the average error function AQ(D_L, W).
    Update the weights by subtracting the gradient times the learning rate.
Incremental standard backprop can be done as follows (a sketch follows below):
  Initialize the weights W.
  Repeat the following steps for j = 1 to N_L:
    Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).
    Update the weights by subtracting the gradient times the learning rate.
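For illustration only, a MATLAB sketch of one incremental update for the 2-layer sigmoid network derived above; the layer sizes, learning rate, and training case are made up:
sigm = @(a) 1 ./ (1 + exp(-a));               % sigmoid unit
nIn = 3; nHidden = 4; nOut = 2; eta = 0.1;
v = 0.1 * randn(nHidden, nIn);                % input-to-hidden weights v_ji
w = 0.1 * randn(nOut, nHidden);               % hidden-to-output weights w_kj
x = rand(nIn, 1); t = [1; 0];                 % one (hypothetical) training case

y = sigm(v * x);                              % forward pass: hidden activations y_j
o = sigm(w * y);                              % forward pass: outputs o_k

delta_k = o .* (1 - o) .* (t - o);            % output deltas
delta_j = y .* (1 - y) .* (w' * delta_k);     % hidden deltas

w = w + eta * delta_k * y';                   % delta_w_kj = eta * delta_k * y_j
v = v + eta * delta_j * x';                   % delta_v_ji = eta * delta_j * x_i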
Training
Overfitting
#2. Decision Trees
Introduction
Divide & conquer
Hierarchical model
Sequence of recursive splits
Decision node vs. leaf node
Advantage
Interpretability
IF-THEN rules
Divide and Conquer
Internal decision nodes
  Univariate: uses a single attribute, x_i
    Numeric x_i: binary split: x_i > w_m
    Discrete x_i: n-way split for n possible values
  Multivariate: uses all attributes, x
Leaves
  Classification: class labels, or proportions
  Regression: numeric; the average of r, or a local fit
Learning
  Construction of the tree using training examples
  Looking for the simplest tree among the trees that code the training data without error
  Based on heuristics; finding the optimal tree is NP-complete
  "Greedy": find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)
Classification Trees
Split is the main procedure for tree construction, guided by an impurity measure
For node m, N_m instances reach m, and N_m^i of them belong to class C_i:
  P̂(C_i | x, m) ≡ p_m^i = N_m^i / N_m
Node m is pure if p_m^i is 0 or 1
A measure of impurity is entropy (the goal is to make nodes pure):
  I_m = − Σ_{i=1..K} p_m^i log2 p_m^i
Representation
Each node specifies a test of some attribute of the instance
Each branch corresponds to one of the possible values for this attribute
Best Split
If node m is pure, generate a leaf and stop; otherwise split and continue recursively
Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to class C_i
  P̂(C_i | x, m, j) ≡ p_mj^i = N_mj^i / N_mj
  I'_m = − Σ_{j=1..n} (N_mj / N_m) Σ_{i=1..K} p_mj^i log2 p_mj^i
Find the variable and split that minimize impurity (among all variables, and among split positions for numeric variables)
Q) “Which attribute should be tested at the root of the tree?”
Top-Down Induction of Decision Trees
Entropy “Measure of uncertainty”
“Expected number of bits to resolve uncertainty”
Suppose Pr{X = 0} = 1/8.
If the other events are equally likely, the number of events is 8. To indicate one out of so many events, one needs lg 8 bits.
Consider a binary random variable X s.t. Pr{X = 0} = 0.1.
  The expected number of bits: 0.1 lg(1/0.1) + 0.9 lg(1/0.9)
In general, if a random variable X has c values with probabilities p_1, …, p_c:
  The expected number of bits: H = Σ_{i=1..c} p_i lg(1/p_i) = − Σ_{i=1..c} p_i lg p_i
Entropy: Example
14 examples (9 positive, 5 negative)
Entropy 0: all members positive or all negative
Entropy 1: equal numbers of positive and negative
0 < Entropy < 1: unequal numbers of positive and negative
Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
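A one-line MATLAB check of this value (added for illustration, not part of the original slides):
p = [9 5] / 14;               % class proportions for [9+, 5-]
H = -sum(p .* log2(p))        % prints 0.9403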
Information Gain
Measures the expected reduction in entropy caused by partitioning the examples
Information Gain
ICU-Student tree
(figure: candidate decision tree over the attributes Gender, Height, IQ)
Splitting on Gender at the root:
• # of samples = 100
• # of positive samples = 50
• Entropy = 1
Branches: Male / Female
Left side:
• # of samples = 50
• # of positive samples = 40
• Entropy = 0.72
Right side:
• # of samples = 50
• # of positive samples = 10
• Entropy = 0.72
On average:
• Entropy = 0.5 * 0.72 + 0.5 * 0.72 = 0.72
• Reduction in entropy = 1 − 0.72 = 0.28 = information gain (a sketch follows below)
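A small MATLAB sketch of the same computation (illustrative; the counts are the ones on this slide):
entropy = @(p) -sum(p(p > 0) .* log2(p(p > 0)));   % entropy of a proportion vector
H_root  = entropy([50 50] / 100);                  % root entropy = 1
H_left  = entropy([40 10] / 50);                   % left branch  ~ 0.72
H_right = entropy([10 40] / 50);                   % right branch ~ 0.72
gain = H_root - 0.5*H_left - 0.5*H_right           % information gain ~ 0.28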
Training Examples
Selecting the Next Attribute
Partially learned tree
Hypothesis Space Search
Hypothesis space: the set of
all possible decision trees
The search for a decision tree is guided by the information gain measure.
Occam’s razor ??
Overfitting
• Why “over”-fitting?
– A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well
Avoiding over-fitting the data
Two classes of approaches to avoid overfitting:
  Stop growing the tree earlier.
  Post-prune the tree after overfitting.
OK, but how to determine the optimal size of a tree?
  Use validation examples to evaluate the effect of pruning (stopping).
  Use a statistical test to estimate the effect of pruning (stopping).
  Use a measure of complexity for encoding the decision tree.
Approaches based on the first criterion (validation examples):
  Reduced-error pruning
  Rule post-pruning
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
#3. Bayesian Networks
Bayes' Rule (Introduction)
posterior = likelihood × prior / evidence
P(C | x) = p(x | C) P(C) / p(x)
For two classes:
  P(C=0) + P(C=1) = 1
  p(x) = p(x | C=1) P(C=1) + p(x | C=0) P(C=0)
  P(C=0 | x) + P(C=1 | x) = 1
Bayes' Rule: K > 2 Classes (Introduction)
P(C_i | x) = p(x | C_i) P(C_i) / p(x)
           = p(x | C_i) P(C_i) / Σ_{k=1..K} p(x | C_k) P(C_k)
with P(C_i) ≥ 0 and Σ_{i=1..K} P(C_i) = 1
Choose C_i if P(C_i | x) = max_k P(C_k | x)
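A tiny MATLAB illustration of this rule (the priors and likelihoods below are made-up numbers, not from the slides):
prior = [0.3 0.5 0.2];                        % P(C_i), i = 1..K
lik   = [0.10 0.40 0.70];                     % p(x | C_i) for one observed x
post  = lik .* prior / sum(lik .* prior);     % P(C_i | x) by Bayes' rule
[~, i_star] = max(post);                      % choose the class with max posterior
disp(post); disp(i_star)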
Bayesian Networks (Introduction)
Graphical models, probabilistic networks: causality and influence
Nodes are hypotheses (random variables), and the probabilities correspond to our belief in the truth of the hypothesis
Arcs are direct influences between hypotheses
The structure is represented as a directed acyclic graph (DAG): a representation of the dependencies among the random variables
The parameters are the conditional probabilities on the arcs
A B.N. needs only a small set of probabilities, relating only neighboring nodes, rather than all possible combinations of circumstances
Bayesian Networks (Introduction)
Learning: inducing a graph
  From prior knowledge
  From structure learning
  Estimating parameters (e.g. with EM)
Inference: beliefs from evidence
  Especially among the nodes not directly connected
Structure (Introduction)
Initial configuration of a BN:
  Root nodes: prior probabilities
  Non-root nodes: conditional probabilities given all possible combinations of direct predecessors
Example (figure: DAG over nodes A, B, C, D, E):
  P(a), P(b)
  P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b)
  P(e|d), P(e|¬d)
  P(c|a), P(c|¬a)
Causes and Bayes' Rule (Introduction)
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
P(R|W) = P(W|R) P(R) / P(W)
       = P(W|R) P(R) / [ P(W|R) P(R) + P(W|¬R) P(¬R) ]
       = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
(causal direction: R → W; diagnostic direction: W → R)
Causal vs Diagnostic Inference (Introduction)
Causal inference: If the sprinkler is on, what is the probability that the grass is wet?
P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)
= P(W|R,S) P(R) + P(W|~R,S) P(~R)
= 0.95*0.4 + 0.9*0.6 = 0.92
Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on?
  P(S|W) = 0.35 > 0.2 = P(S)
  P(S|R,W) = 0.21
Explaining away: Knowing that it has rained decreases the probability that the sprinkler is on.
Bayesian Networks: Causes (Introduction)
Causal inference:
P(W|C) = P(W|R,S) P(R,S|C) + P(W|¬R,S) P(¬R,S|C) + P(W|R,¬S) P(R,¬S|C) + P(W|¬R,¬S) P(¬R,¬S|C)
using the fact that P(R,S|C) = P(R|C) P(S|C)
Diagnostic: P(C|W) = ?
Bayesian Nets: Local Structure (Introduction)
P(F | C) = ?
P(X_1, …, X_d) = Π_{i=1..d} P(X_i | parents(X_i))
Bayesian Networks: Inference (Introduction)
P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)
P(C,F) = Σ_S Σ_R Σ_W P(C,S,R,W,F)
P(F|C) = P(C,F) / P(C)   (not efficient!)
Belief propagation (Pearl, 1988)
Junction trees (Lauritzen and Spiegelhalter, 1988)
Independence assumption
Inference: Evidence & Belief Propagation
Evidence: values of observed nodes, e.g. V3 = T, V6 = 3
Our belief in what the value of Vi 'should' be changes.
This belief is propagated, as if the CPTs became:
  V3=T: 1.0, V3=F: 0.0
  P(V6 | V2):
    V6=1: 0.0 (V2=T), 0.0 (V2=F)
    V6=2: 0.0 (V2=T), 0.0 (V2=F)
    V6=3: 1.0 (V2=T), 1.0 (V2=F)
(figure: network over nodes V1 … V6)
Belief Propagation
Messages:
  Going down an arrow: sum out the parent ("causal" π message)
  Going up an arrow: apply Bayes' law ("diagnostic" λ message)
Bayes' law: P(A|B) = P(B|A) P(A) / P(B), with 1/α as the normalizing constant
* some figures from: Peter Lucas, BN lecture course
The Messages
• What are the messages?
• For simplicity, let the nodes be binary
Example: V1 → V2, with
  P(V1=T) = 0.8, P(V1=F) = 0.2
  P(V2|V1):
    V2=T: 0.4 (V1=T), 0.9 (V1=F)
    V2=F: 0.6 (V1=T), 0.1 (V1=F)
The message passes on information. What information? Observe:
  P(V2) = P(V2|V1=T) P(V1=T) + P(V2|V1=F) P(V1=F)
The information needed is the prior of V1, π(V1)
Messages capture information passed from parent to child
The Messages
• We know what the π messages are
• What about λ?
Assume E = {V2} and compute by Bayes' rule:
  P(V1|V2) = P(V1) P(V2|V1) / P(V2) = α P(V1) P(V2|V1)
The information not available at V1 is P(V2|V1), to be passed upwards by a λ-message. Again, this is not in general exactly the CPT, but the belief based on the evidence down the tree.
Belief Propagation
(figure: node V with parents U1, U2 and children V1, V2; π and λ messages are exchanged along each arc)
Evidence & Belief
(figure: network over nodes V1 … V6; evidence entered at some nodes updates the beliefs at the others)
Works for classification ??
Naive Bayes’ Classifier
Given C, xj are independent:
p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
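A minimal MATLAB sketch of naive Bayes classification for binary attributes (all probabilities below are made up for illustration):
prior = [0.6 0.4];             % P(C=1), P(C=2)
pxc   = [0.8 0.3;              % P(x_j = 1 | C): rows are attributes j = 1..3,
         0.5 0.9;              % columns are classes C = 1, 2
         0.2 0.4];
x = [1; 0; 1];                 % one observation (hypothetical)

lik = zeros(1, 2);
for c = 1:2
    pc     = pxc(:, c);
    lik(c) = prod(pc.^x .* (1 - pc).^(1 - x));   % p(x|C) = p(x1|C) p(x2|C) p(x3|C)
end
post = prior .* lik / sum(prior .* lik)          % P(C | x)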
Application Procedures: For Classification
MLP
Data collection & Pre-processing (Training data / Test data)
Decision node selection (output node)
Network training
Generalization
Parameter tuning & Pruning
Final network
Decision Trees
Data collection & Pre-processing (Training data / Test data)
Decision attribute selection
Tree construction
Pruning
Final tree
Bayesian Networks
Data collection & Pre-processing (Training data / Test data)
Structure configuration
Prior knowledge
Parameter learning
Decision node selection
Inference (classification)
Evidence & belief
Final network
Simulation
Simulation Packages
WEKA (JAVA)
http://www.cs.waikato.ac.nz/ml/weka/
FullBNT (MATLAB)
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
MSBNx
http://research.microsoft.com/msbn/
MATLAB Neural Networks Toolbox
http://www.mathworks.com/products/neuralnet/
C4.5
http://www.rulequest.com/Personal/
WEKA
FullBNT
clear all
N = 4; % number of nodes
dag = zeros(N,N); % network structure shell
C = 1; S = 2; R = 3; W = 4; % naming each node
dag(C,[R S]) = 1; % specify the network structure
dag(R,W) = 1;
dag(S,W) = 1;
%discrete_nodes = 1:N;
node_sizes = 2*ones(1,N); % number of values each node can take
%node_sizes = [4 2 3 5];
%onodes = [];
%bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);
%C = bnet.names('cloudy'); % bnet.names is an associative array
%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
%%%%%% Specified Parameters
%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
%bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
%bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
%bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
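The slide stops after defining the structure. A possible continuation (assuming the commented-out CPDs above are actually specified) would run junction-tree inference with FullBNT:
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
engine = jtree_inf_engine(bnet);              % junction-tree inference engine
evidence = cell(1, N);                        % no observations yet
evidence{W} = 2;                              % observe W = true (state 2)
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);             % posterior P(S | W = true)
marg.T                                        % display the probability table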
MSBNx
References
Textbooks
Ethem ALPAYDIN, Introduction to Machine Learning, The MIT Press, 2004
Tom Mitchell, Machine Learning, McGraw Hill, 1997
Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003
Materials
Serafín Moral, Learning Bayesian Networks, University of Granada, Spain
Zheng Rong Yang, Connectionism, Exeter University
KyuTae Cho, Jeong KiYoo, HeeJin Lee, Uncertainty in AI, Probabilistic Reasoning, Especially for Bayesian Networks
Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford University
Recommended Textbooks
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
Haykin, Simon S., Neural networks : a comprehensive foundation, Prentice Hall, 1999
Jensen, Finn V., Bayesian networks and decision graphs, Springer, 2007