A classification learning example: Predicting when Russell will wait for a table --similar to book preferences, predicting credit card fraud, predicting when people are likely to respond to junk mail


Page 1:

A classification learning example: Predicting when Russell will wait for a table

--similar to book preferences, predicting credit card fraud, predicting when people are likely to respond to junk mail

Page 2:

Inductive Learning (Classification Learning)

• Given a set of labeled examples and a space of hypotheses
  – Find the rule that underlies the labeling
    • (so you can use it to predict future unlabeled examples)
  – Tabula rasa, fully supervised

• Idea:
  – Loop through all hypotheses
    • Rank each hypothesis in terms of its match to the data
    • Pick the best hypothesis

• Main variations:
  – Bias: what "sort" of rule are you looking for?
    • If you are looking for only conjunctive hypotheses, there are just 3^n
  – Search:
    • Greedy search (decision tree learner)
    • Systematic search (version space learner)
    • Iterative search (neural net learner)

The main problem is that the space of hypotheses is too large.

Given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses; for 6 features, that is 2^64 = 18,446,744,073,709,551,616 hypotheses.

It can be shown that the sample complexity of PAC learning is proportional to 1/ε, log(1/δ), and log |H|; a standard form of the bound is m ≥ (1/ε)(ln |H| + ln(1/δ)).
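To get a feel for these quantities, here is a quick sanity check (a sketch assuming the standard form of the bound given above; the function name is ours). Note how the log |H| term keeps the bound manageable even for the doubly-exponential hypothesis space:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta))
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

n = 6
h_size = 2 ** (2 ** n)   # all boolean functions of 6 features: 2^64 hypotheses
print(pac_sample_bound(h_size, epsilon=0.1, delta=0.05))   # 474 examples suffice
```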

Page 3:

Page 4:

Bias & Learning Accuracy: Why is Simple Better?

[Figure: learning curves of training error and test (prediction) error; the y-axis is the fraction incorrectly classified.]

• Having a weak bias (large hypothesis space)
  – Allows us to capture more concepts
  – ..but increases learning cost
  – May lead to over-fitting

Note that the goal of a compression algorithm is to drive down the training error, but the goal of a learning algorithm is to drive down the test error.

Page 5:

Using different biases to predict Russell's waiting habits

[Figure: a Bayes net with class node "Russell waits (RW)" and attribute nodes "Wait time?", "Patrons?", and "Friday?". CPT for P(Patrons | RW):]

  Patrons   RW=T   RW=F
  full      0.5    0.3
  some      0.2    0.3
  none      0.3    0.4

Naïve Bayes (Bayes net learning) -- examples are used to learn the topology and the CPTs

Neural nets -- examples are used to learn the topology and the edge weights

Decision trees -- examples are used to learn the topology and the order of questions, e.g.:
  If Patrons=full and Day=Friday then wait (0.3/0.7)
  If WaitTime>60 and Reservation=no then wait (0.4/0.9)

Association rules -- examples are used to learn the support and confidence of association rules

SVMs

K-nearest neighbors

Page 6:

Learning Decision Trees --- How?

Basic idea:
-- Pick an attribute
-- Split the examples in terms of that attribute
-- If all examples are +ve, label Yes and terminate
-- If all examples are –ve, label No and terminate
-- If some are +ve and some are –ve, continue splitting recursively

(Special case: decision stumps -- if you don't feel like splitting any further, return the majority label.)
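A minimal sketch of this recursive procedure (assumptions: each example is a dict of attribute values plus a boolean 'label' key, and pick_attribute is any attribute-selection heuristic, such as the information gain defined on Page 9):

```python
from collections import Counter

def majority_label(examples):
    return Counter(e['label'] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, pick_attribute):
    labels = {e['label'] for e in examples}
    if labels == {True}:               # all +ve -> label Yes
        return True
    if labels == {False}:              # all -ve -> label No
        return False
    if not attributes:                 # decision stump: majority label
        return majority_label(examples)
    a = pick_attribute(examples, attributes)   # e.g. by information gain
    tree = {'split_on': a, 'children': {}}
    for v in {e[a] for e in examples}:         # split on attribute a
        subset = [e for e in examples if e[a] == v]
        rest = [x for x in attributes if x != a]
        tree['children'][v] = learn_tree(subset, rest, pick_attribute)
    return tree
```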

Which one to pick?

Page 7: A classification learning example Predicting when Rusell will wait for a table --similar to book preferences, predicting credit card fraud, predicting.

Depending on the order in which we pick attributes, we can get smaller or bigger trees.

Which tree is better? Why do you think so?

Page 8:

Basic idea:
-- Pick an attribute
-- Split the examples in terms of that attribute
-- If all examples are +ve, label Yes and terminate
-- If all examples are –ve, label No and terminate
-- If some are +ve and some are –ve, continue splitting recursively
-- If no attributes are left to split on, label with the majority element

Would you split on Patrons or Type?

Page 9:

The Information Gain Computation

[Figure: splitting on feature f_k sends the (N+, N–) examples at a node into k children holding (N1+, N1–), (N2+, N2–), …, (Nk+, Nk–).]

P+ = N+ / (N+ + N–),   P– = N– / (N+ + N–)

Entropy before the split:
  I(P+, P–) = –P+ log(P+) – P– log(P–)
(the expected number of comparisons needed to tell whether a given example is +ve or –ve)

Residual information after the split:
  Σ_{i=1..k} [Ni+ + Ni–] / [N+ + N–] · I(Pi+, Pi–)

The difference is the information gain. So, pick the feature with the largest information gain, i.e., the smallest residual information.

In general, given k mutually exclusive and exhaustive events E1…Ek whose probabilities are p1…pk, the "information" content (entropy) is defined as Σ_i –p_i log2(p_i). A split is good if it reduces the entropy.

Page 10:

A simple example:

  Ex   Masochistic   Anxious   Nerdy   HATES EXAM
  1    F             T         F       Y
  2    F             F         T       N
  3    T             F         F       N
  4    T             T         T       Y

V(M) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1
V(A) = 2/4 · I(1, 0) + 2/4 · I(0, 1) = 0
V(N) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1

where
  I(1/2, 1/2) = –1/2 · log(1/2) – 1/2 · log(1/2) = 1/2 + 1/2 = 1
  I(1, 0) = –1 · log(1) – 0 · log(0) = 0   (taking 0 · log 0 = 0)

So Anxious is the best attribute to split on. Once you split on Anxious, the problem is solved.
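These numbers can be checked mechanically (a sketch; residual_info is the post-split residual information V defined on Page 9):

```python
import math

def I(p_plus, p_minus):
    """Entropy of a boolean node; 0*log(0) is taken as 0."""
    total = 0.0
    for p in (p_plus, p_minus):
        if p > 0:
            total -= p * math.log2(p)
    return total

def residual_info(splits):
    """splits: one (n_plus, n_minus) pair per child node."""
    n = sum(np + nm for np, nm in splits)
    return sum(((np + nm) / n) * I(np / (np + nm), nm / (np + nm))
               for np, nm in splits)

# Children grouped by the attribute's value, from the table above:
print(residual_info([(1, 1), (1, 1)]))  # V(M) = 1.0
print(residual_info([(2, 0), (0, 2)]))  # V(A) = 0.0
print(residual_info([(1, 1), (1, 1)]))  # V(N) = 1.0
```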

Page 11:

Page 12:

Learning curves… Given N examples, partition them into a training set Ntr and a set of test instances Ntest.
For i = 1 to |Ntr|:
  For each subset Ns of Ntr of size i:
    Train the learner on Ns
    Test the learned hypothesis on Ntest and compute the accuracy (% correct)

Evaluating the Decision Trees

Russell domain vs. the "majority" function (say yes if a majority of the attributes are yes)

Lesson: Every bias makes some concepts easier to learn and others harder to learn…

m-fold cross-validation: Split the N examples into m equal-sized parts. For i = 1..m: train with all parts except the ith, and test with the ith part.
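A minimal sketch, where train and test are hypothetical stand-ins for any learner and any accuracy measure:

```python
def m_fold_cv(examples, m, train, test):
    """Average test accuracy over m folds."""
    folds = [examples[i::m] for i in range(m)]   # m roughly equal parts
    scores = []
    for i in range(m):
        held_out = folds[i]
        training = [e for j, f in enumerate(folds) if j != i for e in f]
        h = train(training)                 # train with all parts except the ith
        scores.append(test(h, held_out))    # test with the ith part
    return sum(scores) / m
```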

Page 13:

Problems with the Info. Gain Heuristic

• Feature correlation: we are splitting on one feature at a time
  – The Costanza party problem
  – No obvious easy solution…
• Overfitting: we may look too hard for patterns where there are none
  – E.g., coin tosses classified by the day of the week, the shirt I was wearing, the time of the day, etc.
  – Solution: don't consider splitting if the information gain given by the best feature is below a minimum threshold
    • Can use the χ² test for statistical significance
    • Will also help when we have noisy samples…
• We may prefer features with very high branching
  – E.g., branching on the "universal time string" in the Russell restaurant example, or on social security number to look for patterns in who will get an A
  – Solution: "gain ratio" -- the ratio of the information gain with attribute A to the information content of answering the question "What is the value of A?"
    • The denominator is smaller for attributes with smaller domains.

Page 14:

Decision Stumps

• Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all –ve training examples
  – Could happen either because the examples are noisy and mis-classified, or because you want to stop before reaching pure leaves
• When you reach such a node, you return the majority label as the decision
• (We can associate a confidence with that decision using P+ and P–)

[Figure: splitting on feature f_k sends the (N+, N–) examples into children (N1+, N1–), (N2+, N2–), …, (Nk+, Nk–). At such a leaf, P+ = N1+ / (N1+ + N1–).]
Sometimes, the best decision tree for a problem could be a decision stump (see coin toss example next)

Page 15:

Bayes Network Learning

• Bias: the relation between the class label and the class attributes is specified by a Bayes network
• Approach:
  – Guess the topology
  – Estimate the CPTs
• Simplest case: Naïve Bayes
  – Topology of the network is: the "class label" causes all the attribute values, independently
  – So all we need to do is estimate the CPTs P(attribute | class)
• In the Russell domain, P(Patrons | willwait):
  – P(Patrons=full | willwait=yes) = (#training examples where Patrons=full and willwait=yes) / (#training examples where willwait=yes)
• Given a new case, we use Bayes rule to compute the class label

[Figure: the same Bayes net and CPT for P(Patrons | RW) as on Page 5.]

Class label is the disease; attributes are symptoms

Page 16:

Naïve Bayesian Classification

• Problem: classify a given example E into one of the classes among [C1, C2, …, Cn]
  – E has k attributes A1, A2, …, Ak, and each Ai can take d different values
• Bayes classification: assign E to the class Ci that maximizes P(Ci | E)
    P(Ci | E) = P(E | Ci) P(Ci) / P(E)
• P(Ci) and P(E) are a priori knowledge (or can be easily extracted from the set of data)
• Estimating P(E | Ci) is harder
  – Requires P(A1=v1 ∧ A2=v2 ∧ … ∧ Ak=vk | Ci)
  – Assuming d values per attribute, we would need n·d^k probabilities
• Naïve Bayes assumption: assume all attributes are independent given the class, so
    P(E | Ci) = Π_j P(Aj=vj | Ci)
  – The assumption is BOGUS, but it seems to WORK (and needs only n·d·k probabilities)

Page 17:

NBC in terms of BAYES networks..

[Figure: the NBC independence assumption vs. a more realistic network with dependencies among attributes.]

Page 18:

Estimating the probabilities for NBC

Given an example E described as A1=v1, A2=v2, …, Ak=vk, we want to compute the class of E:
– Calculate P(Ci | A1=v1 ∧ A2=v2 ∧ … ∧ Ak=vk) for all classes Ci, and say that the class of E is the one for which P(.) is maximum
– P(Ci | A1=v1 ∧ … ∧ Ak=vk) = Π_j P(Aj=vj | Ci) · P(Ci) / P(A1=v1 ∧ … ∧ Ak=vk)
  (the denominator is a common factor across all classes, so it can be ignored when taking the argmax)

Given a set of N training examples that have already been classified into n classes Ci:
  Let #(Ci) be the number of examples that are labeled Ci
  Let #(Ci, Aj=vj) be the number of examples labeled Ci that have attribute Aj set to value vj
Then
  P(Ci) = #(Ci) / N
  P(Aj=vj | Ci) = #(Ci, Aj=vj) / #(Ci)
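A minimal sketch of these counting estimates and the resulting classifier (function names are ours; examples are assumed to be (attribute-tuple, class) pairs):

```python
from collections import Counter, defaultdict

def train_nbc(examples):
    """Frequency-based estimates: P(Ci) and P(Aj=vj | Ci)."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)   # (class, attr_index) -> value counts
    for attrs, c in examples:
        for j, v in enumerate(attrs):
            attr_counts[(c, j)][v] += 1
    n = len(examples)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    return priors, attr_counts, class_counts

def classify(attrs, priors, attr_counts, class_counts):
    """argmax_c P(c) * prod_j P(Aj=vj | c); P(E) is the common factor."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(attrs):
            score *= attr_counts[(c, j)][v] / class_counts[c]
        if score > best_score:
            best, best_score = c, score
    return best
```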

Page 19:

Example:

P(willwait=yes) = 6/12 = 0.5
P(Patrons="full" | willwait=yes) = 2/6 = 0.333
P(Patrons="some" | willwait=yes) = 4/6 = 0.666
Similarly, we can show that P(Patrons="full" | willwait=no) = 0.666

P(willwait=yes | Patrons=full)
  = P(Patrons=full | willwait=yes) · P(willwait=yes) / P(Patrons=full)
  = k · 0.333 · 0.5
P(willwait=no | Patrons=full) = k · 0.666 · 0.5

(k = 1/P(Patrons=full) is the common factor)

Page 20:

Using M-estimates to improve probability estimates

• The simple frequency-based estimate of P(Ai=vj | Ci) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don't contain the rare cases is quite high)
• Solution: use the M-estimate
    P(Ai=vj | Ci) = [#(Ci, Ai=vj) + m·p] / [#(Ci) + m]
  – p is the prior probability of Ai taking the value vj
    • If we don't have any background information, assume a uniform probability (that is, 1/d if Ai can take d values)
  – m is a constant, called the "equivalent sample size"
    • If we believe that our sample set is large enough, we can keep m small; otherwise, keep it large
    • Essentially we are augmenting the #(Ci) real samples with m more virtual samples drawn according to the prior probability of how Ai takes values
  – Popular values are p = 1/|V| and m = |V|, where V is the vocabulary (e.g., in text classification)

Also, to avoid underflow errors, add logarithms of probabilities instead of multiplying the probabilities (see the sketch below).
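A sketch of both tricks (hypothetical helper names):

```python
import math

def m_estimate(count_cv, count_c, p, m):
    """P(Ai=vj | Ci) = (#(Ci, Ai=vj) + m*p) / (#(Ci) + m)."""
    return (count_cv + m * p) / (count_c + m)

def log_score(prior, cond_probs):
    """Sum logs instead of multiplying probabilities (avoids underflow)."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# e.g. a value never seen with this class in 10 examples, d = 3 values:
print(m_estimate(0, 10, p=1/3, m=3))   # 0.0769... instead of a hard 0
```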

Page 21:

How Well (and WHY) DOES NBC WORK?

• The Naïve Bayes classifier is darned easy to implement
  – Good learning speed and classification speed
  – Modest space storage
  – Supports incrementality
• It seems to work very well in many scenarios
  – Lots of recommender systems (e.g., the Amazon books recommender) use it
  – Peter Norvig, the director of Machine Learning at Google, said "Naïve Bayes" when asked what sort of technology they use
• But WHY?
  – NBC's estimate of the class probability is quite bad
    • BUT classification accuracy is different from probability-estimate accuracy
  – [Domingos & Pazzani, 1996] analyze this

Page 22:

(This slide repeats Page 5: using different biases to predict Russell's waiting habits -- Naïve Bayes, neural nets, decision trees, association rules, SVMs, k-nearest neighbors.)

Page 23:

Decision Surface Learning (aka Neural Network Learning)

• Idea: since classification is really a question of finding a surface to separate the +ve examples from the –ve examples, why not directly search in the space of possible surfaces?
• Mathematically, a surface is a function
  – Need a way of learning functions
  – "Threshold units"

Page 24:

A "neural net" is a collection of differentiable threshold units with interconnections.

Feed forward (uni-directional connections):
  – Single layer: any linear decision surface can be represented by a single-layer neural net
  – Multi-layer: any "continuous" decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net

Recurrent (bi-directional connections):
  – Can act as associative memory

[Figure: a threshold unit with inputs I1, I2, weights w1, w2, and threshold t=k; output = 1 if w1·I1 + w2·I2 > k, and 0 otherwise.]

Page 25:

A Threshold Unit …is sort of like a neuron.

[Figure: threshold functions -- the hard step function and a differentiable (sigmoid) alternative.]

The "Brain" Connection

Page 26:

Perceptron Networks

What happened to the "threshold"? -- It can be modeled as an extra weight with a static input:

[Figure: a unit with inputs I1, I2, weights w1, w2, and threshold t=k is equivalent to a unit with an extra weight w0 = k on a static input I0 = –1 and threshold t=0.]

Page 27:

Perceptron Learning

• Perceptron learning algorithm:
  Loop through the training examples
  – If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the jth input unit by α·Ij, where Ij is the jth input value and α is a learning rate
  – If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the jth input unit by α·Ij
  – Otherwise, do nothing
  Until "convergence"

This is an iterative search!
  -- node → network weights
  -- goodness → error
Actually a "gradient descent" search (a sketch of one pass follows).
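A minimal sketch of one update of this rule (assuming the w0 = k, I0 = –1 threshold encoding from the previous slide):

```python
def perceptron_step(weights, inputs, target, alpha=0.1):
    """One perceptron update; inputs[0] is the fixed -1 'threshold' input."""
    activation = 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0
    if activation == 1 and target == 0:    # output 1 but should be 0
        weights = [w - alpha * x for w, x in zip(weights, inputs)]
    elif activation == 0 and target == 1:  # output 0 but should be 1
        weights = [w + alpha * x for w, x in zip(weights, inputs)]
    return weights                         # otherwise unchanged
```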

A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html

Page 28:

Perceptron Learning as Gradient Descent Search in the Weight-Space

Squared error over the training data:
  E = 1/2 Σ_d (T_d − O_d)²,   with output O = g(Σ_i W_i I_i)

Gradient with respect to weight W_j:
  ∂E/∂W_j = −Σ_d (T_d − O_d) · g′(Σ_i W_i I_i) · I_j

giving the weight update
  W_j ← W_j + η · (T − O) · g′(Σ_i W_i I_i) · I_j

For the sigmoid function g(x) = 1 / (1 + e^(−x)), the derivative is g′(x) = g(x)(1 − g(x)).

Often a constant learning rate parameter η is used instead.
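A sketch of one gradient step for a single sigmoid unit, following the update above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_step(weights, inputs, target, eta=0.1):
    """W_j <- W_j + eta * (T - O) * g'(net) * I_j, with g'(x) = g(x)(1 - g(x))."""
    net = sum(w * x for w, x in zip(weights, inputs))
    out = sigmoid(net)
    grad = (target - out) * out * (1.0 - out)   # from the squared-error gradient
    return [w + eta * grad * x for w, x in zip(weights, inputs)]
```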

Page 29:

Can Perceptrons Learn All Boolean Functions? -- Are all boolean functions linearly separable?

Page 30:

Comparing Perceptrons and Decision Trees on the Majority Function and the Russell Domain

[Figure: learning curves -- the perceptron outperforms decision trees on the majority function, while decision trees outperform the perceptron on the Russell domain.]

The majority function is linearly separable.. the Russell domain apparently is not....

Encoding: one input unit per attribute; the unit takes as many distinct real values as the size of the attribute's domain.

Page 31:

Max-Margin Classification & Support Vector Machines

• Any line that separates the +ve and –ve examples is a solution
• And perceptron learning finds one of them
  – But could we have a preference among these?
  – We may want the line that provides the maximum margin (equidistant from the nearest +ve/–ve examples)
• The nearest +ve and –ve examples "holding up" the line are called support vectors
• This changes the problem into an optimization problem
  – Quadratic programming can be used to directly find such a line

Learning is optimization after all!

Page 32:

Lagrangian Dual

Page 33:

Two ways to learn non-linear decision surfaces

1. First transform the data into a higher-dimensional space and find a linear surface there
   – Such a surface is guaranteed to exist
   – Then transform it back to the original space
   – The TRICK is to do this without explicitly doing the transformation
2. Learn non-linear surfaces directly (as multi-layer neural nets)
   – The trick is to do the training efficiently
   – Back-propagation to the rescue..

Page 34:

Linear Separability in High Dimensions

“Kernels” allow us to consider separating surfaces in high-D without first converting all points to high-D

Page 35:

Kernelized Support Vector Machines

• Turns out that it is not always necessary to first map the data into high-D, and then do linear separation

• The quadratic programming formulation for SVM winds up using only the pair-wise dot product of training vectors

• Dot product is a form of similarity metric between points

• If you replace that dot product by any non-linear function, you will, in essence, be transforming data into some high-dimensional space and then finding the max-margin linear classifier in that space

– Which will correspond to some wiggly surface in the original dimension

• The trick is to find the RIGHT similarity function

– Which is a form of prior knowledge

Page 36:

(This slide repeats Page 35 and adds an example polynomial kernel:)

Polynomial kernel:
  K(A, A′) = ((100A − 1)(100A′ − 1) − 0.5)^6
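For illustration, here is a sketch of a generic polynomial kernel (the common (a·b + 1)^d form, not necessarily the one above) and the pair-wise kernel matrix that the QP formulation consumes:

```python
def poly_kernel(a, b, degree=3):
    """A generic polynomial kernel K(a, b) = (a.b + 1)^degree (illustrative)."""
    return (sum(x * y for x, y in zip(a, b)) + 1) ** degree

def kernel_matrix(points, k):
    """All pair-wise 'similarities' -- the only thing the QP formulation needs."""
    return [[k(p, q) for q in points] for p in points]

X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(kernel_matrix(X, poly_kernel))
```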

Page 37:

Domain-knowledge & Learning

• Classification learning is a problem addressed both by people from AI (machine learning) and by statisticians
• Statistics folks tend to "distrust" domain-specific bias
  – Let the data speak for itself…
  – ..but this is often futile. The very act of "describing" the data points introduces bias (in terms of the features you decided to use to describe them..)
• …but much human learning occurs because of strong domain-specific bias..
• Machine learning is torn by these competing influences..
  – In most current state-of-the-art algorithms, domain knowledge is allowed to influence learning only through relatively narrow avenues/formats (e.g., through "kernels")
    • Okay in domains where there is very little (if any) prior knowledge (e.g., which parts of proteins perform which cellular functions)
    • ..restrictive in domains where human expertise already exists..

"Those who ignore easily available domain knowledge are doomed to re-learn it…" -- Santayana's brother

Page 38:

Multi-layer Neural Nets

How come back-prop doesn't get stuck in local minima? One answer: it is actually hard for local minima to form in high dimensions, as the "trough" has to be closed in all dimensions.

Page 39:

[Figure: learning curves on the Russell domain -- multi-layer networks vs. decision trees vs. the perceptron.]

Multi-layer network learning can learn the Russell domain… but does so slowly…

Page 40:

Practical Issues in Multi-layer network learning

• For multi-layer networks, we need to learn both the weights and the network topology
  – Topology is fixed for perceptrons
• If we go with too many layers and connections, we can get over-fitting as well as sloooow convergence
  – Optimal brain damage: start with more hidden layers and connections than needed; after a network is learned, remove the nodes and connections that have very low weights, and retrain

Page 41:

K-nearest-neighbor: the test example's class is determined by the class of the majority of its k nearest neighbors. Need to define an appropriate distance measure:
  -- sort of easy for real-valued vectors
  -- harder for categorical attributes

Other impressive applications:
  -- no-hands across America
  -- learning to speak

Humans make 0.2% errors; "Neumans" (postmen) make 2%.

Page 42:

Page 43:

Page 44:

Page 45:

The true hypothesis eventually dominates… the probability of indefinitely producing uncharacteristic data → 0.

Page 46:

Bayesian prediction is optimal (Given the hypothesis prior, all other predictions are less likely)

Page 47:

Also, remember the Economist article that shows that humans have strong priors..

Page 48:

..note that the Economist article says humans are able to learn from few examples only because of priors..

Page 49:

So, BN learning is just probability estimation! (as long as data is complete!)

Page 50:

(This slide repeats Page 21: How Well (and WHY) DOES NBC WORK?)

Page 51:

Page 52:

Reinforcement Learning

Based on slides from Bill Smart: http://www.cse.wustl.edu/~wds/

Page 53:

What is RL?

“a way of programming agents by reward and punishment without needing to specify how the

task is to be achieved”

[Kaelbling, Littman, & Moore, 96]

Page 54:

Basic RL Model

1. Observe state, st

2. Decide on an action, at

3. Perform action

4. Observe new state, st+1

5. Observe reward, rt+1

6. Learn from experience
7. Repeat

Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent

[Figure: the agent-world loop -- the agent sends actions (A) to the world and observes states (S) and rewards (R).]

Page 55:

An Example: Gridworld

Canonical RL domain
• States are grid cells
• 4 actions: N, S, E, W
• Reward of +1 for entering the top right cell
• –0.01 for every other move

Maximizing the sum of rewards ⇔ taking the shortest path
• In this instance

Page 56:

The Promise of Learning

Page 57:

The Promise of RL

Specify what to do, but not how to do it
• Through the reward function
• Learning "fills in the details"

Better final solutions
• Based on actual experiences, not programmer assumptions

Less (human) time needed for a good solution

Page 58:

Learning Value Functions

We still want to learn a value function• We’re forced to approximate it iteratively• Based on direct experience of the world

Four main algorithms• Certainty equivalence• Temporal Difference (TD) learning• Q-learning• SARSA

Page 59:

Certainty Equivalence

Collect experience by moving through the world
• s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4, a4, r5, s5, ...

Use these to estimate the underlying MDP
• Transition function, T: S × A → S
• Reward function, R: S × A × S → ℝ

Compute the optimal value function for this MDP
• And then compute the optimal policy from it

Page 60:

Temporal Difference (TD)

TD-learning estimates the value function directly
• Don't try to learn the underlying MDP

Keep an estimate of V(s) in a table
• Update these estimates as we gather more experience
• Estimates depend on the exploration policy, π
• TD is an on-policy method

[Sutton, 88]

Page 61:

TD-Learning Algorithm

1. Initialize V(s) to 0, ∀s
2. Observe state, s
3. Perform action, π(s)
4. Observe new state, s′, and reward, r
5. V(s) ← (1 − α)V(s) + α(r + γV(s′))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
• How much attention do we pay to new experiences
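A tabular sketch of this loop (env_step and policy are hypothetical stand-ins for the world and the fixed policy π):

```python
from collections import defaultdict

def td0(env_step, policy, alpha=0.1, gamma=0.9, steps=10000, s=0):
    """Tabular TD(0): learn V under a fixed policy, no MDP model needed."""
    V = defaultdict(float)          # V(s) initialized to 0 for all s
    for _ in range(steps):
        a = policy(s)               # perform action pi(s)
        s_next, r = env_step(s, a)  # observe new state and reward
        V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
        s = s_next
    return V
```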

Page 62:

TD-Learning

V(s) is guaranteed to converge to V*(s)
• After an infinite number of experiences
• If we decay the learning rate: Σt αt = ∞ and Σt αt² < ∞
• αt = c / (c + t) will work

In practice, we often don't need value convergence
• Policy convergence generally happens sooner

Page 63:

Actor-Critic Methods

TD only evaluates a particular policy• Does not learn a better policy

We can change the policy as we learn V• Policy is the actor• Value-function estimate is the critic

Success is generally dependent on the starting policy being “good enough”

[Figure: actor-critic architecture -- the policy (actor) sends actions a to the world; the world returns state s and reward r; the value-function estimate (critic) feeds a value signal V back to the actor.]

[Barto, Sutton, & Anderson, 83]

Page 64:

Q-Learning

Q-learning iteratively approximates the state-action value function, Q

• Again, we're not going to estimate the MDP directly
• Learns the value function and policy simultaneously

Keep an estimate of Q(s, a) in a table
• Update these estimates as we gather more experience
• Estimates do not depend on the exploration policy
• Q-learning is an off-policy method

[Watkins & Dayan, 92]

Page 65:

Q-Learning Algorithm

1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it
4. Observe next state, s′, and reward, r
5. Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_a′ Q(s′, a′))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
• We need to decay this, just like TD
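A sketch of one update (the environment interaction loop is left out; actions is the set of actions available in s′):

```python
import random
from collections import defaultdict

# Q(s, a) initialized to small random values for every (s, a) we touch
Q = defaultdict(lambda: random.uniform(-0.01, 0.01))

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the max over next actions,
    regardless of which action the exploration policy will actually take."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```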

Page 66:

Picking Actions

We want to pick good actions most of the time, but also do some exploration
• Exploring means that we can learn better policies
• But we want to balance known good actions with exploratory ones
• This is called the exploration/exploitation problem

Page 67:

Picking Actions

ε-greedy
• Pick the best (greedy) action with probability 1 − ε
• Otherwise, pick a random action

Boltzmann (soft-max)
• Pick an action based on its Q-value:
    P(a | s) = e^(Q(s,a)/τ) / Σ_a′ e^(Q(s,a′)/τ),   where τ is the "temperature"
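Sketches of both selection rules (assuming a tabular Q indexed by (s, a)):

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau=1.0):
    """P(a|s) proportional to exp(Q(s,a)/tau); tau is the 'temperature'."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```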

Page 68:

SARSA

SARSA iteratively approximates the state-action value function, Q

• Like Q-learning, SARSA learns the policy and the value function simultaneously

Keep an estimate of Q(s, a) in a table
• Update these estimates based on experiences
• Estimates depend on the exploration policy
• SARSA is an on-policy method
• The policy is derived from the current value estimates

Page 69:

SARSA Algorithm

1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it (just like Q-learning)
4. Observe next state, s′, and reward, r
5. Q(s, a) ← (1 − α)Q(s, a) + α(r + γ Q(s′, π(s′)))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
• We need to decay this, just like TD
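A sketch of the update; the only difference from Q-learning is that the target uses the action the policy actually takes next, not the max:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: a_next is the action the current (exploring) policy
    actually picks in s_next -- contrast with Q-learning's max over actions."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
```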

Page 70:

On-Policy vs. Off Policy

On-policy algorithms
• The final policy is influenced by the exploration policy
• Generally, the exploration policy needs to be "close" to the final policy
• Can get stuck in local maxima

Off-policy algorithms
• The final policy is independent of the exploration policy
• Can use arbitrary exploration policies
• Will not get stuck in local maxima (given enough experience)

Page 71:

Convergence Guarantees

The convergence guarantees for RL are "in the limit"
• The word "infinite" crops up several times

Don't let this put you off
• Value convergence is different from policy convergence
• We're more interested in policy convergence
• If one action is really better than the others, policy convergence will happen relatively quickly

Page 72:

Rewards

Rewards measure how well the policy is doing
• Often correspond to events in the world
  – Current load on a machine
  – Reaching the coffee machine
  – Program crashing
• Everything else gets a 0 reward
  (these are sparse rewards)

Things work better if the rewards are incremental
• For example, the distance to the goal at each step
  (these are dense rewards)
• These reward functions are often hard to design

Page 73:

The Markov Property

RL needs a set of states that are Markov
• Everything you need to know to make a decision is included in the state
• Not allowed to consult the past

Rule of thumb
• If you can calculate the reward function from the state without any additional information, you're OK

[Figure: a gridworld from start S to goal G with a key K; the state must distinguish "not holding key" from "holding key".]

Page 74:

But, What’s the Catch?

RL will solve all of your problems, but• We need lots of experience to train from• Taking random actions can be dangerous• It can take a long time to learn• Not all problems fit into the MDP framework

Page 75:

Learning Policies Directly

An alternative approach to RL is to reward whole policies, rather than individual actions

• Run whole policy, then receive a single reward• Reward measures success of the whole policy

If there are a small number of policies, we can exhaustively try them all

• However, this is not possible in most interesting problems

Page 76:

Policy Gradient Methods

Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}
• Running the policy with a particular θ results in a reward, rθ
• Estimate the reward gradient, ∂R/∂θi, for each θi
• Update each parameter: θi ← θi + η · ∂R/∂θi
  (η is another learning rate)

Page 77:

Policy Gradient Methods

This results in hill-climbing in policy space
• So it's subject to all the problems of hill-climbing
• But we can also use tricks from search, like random restarts and momentum terms

This is a good approach if you have a parameterized policy
• Typically faster than value-based methods
• "Safe" exploration, if you have a good policy
• Learns locally-best parameters for that policy

Page 78:

An Example: Learning to Walk

RoboCup legged league• Walking quickly is a big advantage

Robots have a parameterized gait controller• 11 parameters• Controls step length, height, etc.

Robots walk across soccer pitch and are timed• Reward is a function of the time taken

[Kohl & Stone, 04]

Page 79:

An Example: Learning to Walk

Basic idea:
1. Pick an initial θ = {θ1, θ2, ..., θ11}
2. Generate N testing parameter settings by perturbing θ: θ_j = {θ1 + δ1, θ2 + δ2, ..., θ11 + δ11}, with each δi ∈ {−δ, 0, +δ}
3. Test each setting, and observe the rewards: θ_j → r_j
4. For each θi, calculate the average reward over the settings where θi was perturbed by +δ, by 0, and by −δ, and set θi′ to θi + δ, θi, or θi − δ according to which average reward is largest
5. Set θ ← θ′, and go to 2
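A sketch of one iteration of this scheme (run_policy is a hypothetical evaluator returning the observed reward for a parameter vector):

```python
import random

def hill_climb_step(theta, delta, run_policy, n_tests=20):
    # Evaluate n_tests random perturbations of theta; each parameter i is
    # nudged by -delta[i], 0, or +delta[i].
    trials = []
    for _ in range(n_tests):
        signs = [random.choice((-1, 0, 1)) for _ in theta]
        reward = run_policy([t + s * d for t, s, d in zip(theta, signs, delta)])
        trials.append((signs, reward))
    new_theta = list(theta)
    for i in range(len(theta)):
        # Average reward when theta_i was perturbed by -delta, 0, +delta ...
        avgs = {}
        for s in (-1, 0, 1):
            rs = [r for signs, r in trials if signs[i] == s]
            if rs:
                avgs[s] = sum(rs) / len(rs)
        # ... and move theta_i in the direction with the largest average.
        best = max(avgs, key=avgs.get)
        new_theta[i] = theta[i] + best * delta[i]
    return new_theta
```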

Page 80:

An Example: Learning to Walk

Video: Nate Kohl & Peter Stone, UT Austin

Initial Final

Page 81:

Value Function or Policy Gradient?

When should I use policy gradient?• When there’s a parameterized policy• When there’s a high-dimensional state space• When we expect the gradient to be smooth

When should I use a value-based method?• When there is no parameterized policy• When we have no idea how to solve the problem

Page 82:

Summary for Part I

Background• MDPs, and how to solve them• Solving MDPs with dynamic programming • How RL is different from DP

Algorithms• Certainty equivalence• TD• Q-learning• SARSA• Policy gradient