
V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 1

An Introduction to Reinforcement Learning

Presenter:

Verena Rieser, vrieser@coli.uni-sb.de

Course:

Classification and Clustering, WS 2005

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 2

Contents

Part 1: The main ideas of RL

Part 2: The general framework of RL

Part 3: Automatic Optimization of Dialogue Management (Application)

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 3

Reinforcement Learning

[Figure: Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence, Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience.]

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 4

Part 1: The Idea of Reinforcement Learning

Learning from interaction with environment to achieve some goal

Example 1: A baby playing. No teacher; a sensorimotor connection to the environment.

Cause and effect, action and consequences: how to achieve some goal.

Example 2: Learning to hold a conversation, etc. We find out the effects of our actions only later.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 5

Supervised Learning

[Diagram: Inputs → Supervised Learning System → Outputs]

Training Info = desired (target) outputs

Error = (target output – actual output)

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 6

Reinforcement Learning

[Diagram: Inputs → RL System → Outputs ("actions")]

Training Info = evaluations (“rewards” / “penalties”)

Objective: get as much reward as possible

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 7

RL: How does it work?

Learning a mapping from situations to actions in order to maximize a scalar reward/reinforcement signal.

How?

• Try out actions to learn which produces the highest reward: trial-and-error search.

• Actions affect the immediate reward and all subsequent rewards: delayed effects, delayed rewards.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 8

Exploration/Exploitation Trade-off

High rewards come from repeating previously well-rewarded actions: EXPLOITATION (= greedy).

BUT: which actions are best? We must also try actions not tried before: EXPLORATION (= non-greedy).

Must do both!

The exploitation/exploration trade-off also depends on the lifetime of the agent.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 9

ε-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
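To make the trade-off concrete, here is a minimal sketch of ε-greedy action selection on a k-armed bandit, in the spirit of the testbed above. It is illustrative only: the arm count, ε, the Gaussian rewards, and the sample-average update are assumptions, not the original experiment.

```python
import random

# Minimal epsilon-greedy bandit sketch (illustrative, not the original testbed code).
K = 10                      # number of arms
EPSILON = 0.1               # exploration probability
true_values = [random.gauss(0, 1) for _ in range(K)]   # unknown "true" arm rewards

estimates = [0.0] * K       # estimated value of each arm
counts = [0] * K            # how often each arm was pulled

for step in range(1000):
    if random.random() < EPSILON:
        arm = random.randrange(K)                        # EXPLORATION: random arm
    else:
        arm = max(range(K), key=lambda a: estimates[a])  # EXPLOITATION: greedy arm
    reward = random.gauss(true_values[arm], 1)           # noisy reward from the chosen arm
    counts[arm] += 1
    # incremental sample-average update of the estimate
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("best arm (estimated):", max(range(K), key=lambda a: estimates[a]))
```

With ε = 0 the loop is purely greedy (exploitation only); a small ε forces occasional exploration, which is what lets the estimates of untried arms improve.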

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 10

Part 2: Framework of RL

• Temporally situated

• Continual learning and planning

• The objective is to affect the environment

• The environment is stochastic and uncertain

[Figure: the agent-environment loop. The Agent sends an action to the Environment; the Environment returns a new state and a reward.]

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 11

Elements of RL

• Policy: what to do

• Reward: what is good

• Value: what is good because it predicts reward

• Model: what follows what

[Figure: Policy, Reward, Value, Model of environment.]

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 12

General RL Algorithm

i. Initialise learner’s internal state

ii. Do forever (!?):

a. Observe current state s

b. Choose action a using some evaluation function

c. Execute action a

d. Let r be the immediate reward and s’ the new state

e. Update the internal state based on s, a, r, s’
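As a sketch, the loop above can be instantiated with tabular Q-learning as the update in step (e); the slide itself does not commit to a particular update rule, and the env interface (reset(), step(), actions) and the learning parameters below are illustrative assumptions.

```python
import random
from collections import defaultdict

def run_rl_loop(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Steps i-ii of the slide, with a Q-learning update as the 'internal state' update.
    `env` is assumed to provide reset() -> state, step(action) -> (next_state, reward, done),
    and a list of discrete `actions`; these names are illustrative, not from the slide."""
    Q = defaultdict(float)                      # i. initialise the learner's internal state
    for _ in range(episodes):                   # ii. (bounded here instead of "forever")
        s = env.reset()                         # a. observe current state s
        done = False
        while not done:
            if random.random() < epsilon:       # b. choose action a (epsilon-greedy evaluation)
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)           # c./d. execute a, observe reward r and new state s'
            best_next = max(Q[(s2, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # e. update from (s, a, r, s')
            s = s2
    return Q
```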

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 13

To solve the problem mathematically:

• Formulate it as Markov Decision Process (MDP) or Partially Observable Markov Decision Process (POMDP)

• Maximize the state-value and action-value functions using the Bellman optimality equation

• Use approximation methods to solve the Bellman equation, such as dynamic programming, Monte Carlo methods and temporal-difference learning.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 14

The Bellman Equation

• The Bellman optimality equation estimates "how good" it is to be in a state s.

$V^*(s) = \max_a Q^{\pi^*}(s,a)$ [figure (a)]

$Q^*(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]$ [figure (b)]

$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$

"What actions are available?" "How good are those actions?"
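A tiny numeric sketch of the $V^\pi$ equation: on a made-up two-state MDP (all numbers invented for illustration), repeatedly applying the backup converges to the fixed point that satisfies the equation above.

```python
# Tiny illustrative check of the Bellman equation for V^pi on a made-up 2-state MDP
# (states, transition probabilities, rewards and gamma are all invented for this sketch).
gamma = 0.9
states = ["A", "B"]
# Under a fixed policy pi: P[s][s2] and R[s][s2] for the single action pi(s) takes in s
P = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.6, "B": 0.4}}
R = {"A": {"A": 1.0, "B": 0.0}, "B": {"A": 2.0, "B": 0.5}}

V = {s: 0.0 for s in states}
for _ in range(200):  # iterate the Bellman backup until it (approximately) converges
    V = {s: sum(P[s][s2] * (R[s][s2] + gamma * V[s2]) for s2 in states) for s in states}

print(V)  # the fixed point satisfies V(s) = sum_s' P[s][s'] * (R[s][s'] + gamma * V(s'))
```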

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 15

Summary: Key Features of RL

• The learner is not told which actions to take

• Trial-and-error search

• Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)

• The need to explore and exploit

• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 16

Interactive Exercise:

Help me to annotate the example “a dog catching a stick” with concepts from RL.

Explain: How would an artificial dog learn to catch the stick using RL?

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 17

Part 3: Application for CoLi

Diane J. Litman and Michael S. Kearns and Satinder Singh and Marilyn A. Walker:

Automatic Optimization of Dialogue Management.

In: Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 18

Dialogue Management

Motivation:

• Agent wants to achieve some goal

• Non-trivial choices based on the internal state

• Usability should be guaranteed by iterative prototyping

DM is costly!

Why not "simply" learn the optimal choices?

• Formulate dialogue as an MDP

• Represent the environment (= states)

• Define a set of possible dialogue strategies (= actions)

• Evaluate actions (= reward)

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 19

The NJFun System

1) Represent a dialogue strategy as a mapping from a state space S to a set of dialogue acts

2) Deploy an initial training system which generates exploratory training data w.r.t. S

3) Construct an MDP model from the training data

4) Use value iteration to learn the optimal strategy (a generic sketch follows below)

5) Evaluate the system against a hand-coded strategy
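Step 4 can be sketched as generic value iteration; this is not the NJFun code, and the P and R dictionaries are assumed to hold the transition probabilities and rewards estimated from the exploratory training data in step 3.

```python
def value_iteration(states, actions, P, R, gamma=0.95, iters=500):
    """Generic value iteration (step 4). P[s][a] is a dict {s2: prob}, R[s][a] a scalar reward,
    and actions[s] the actions available in s; these data structures are illustrative assumptions."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Bellman optimality backup over every state
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in actions[s])
             for s in states}
    # the learned strategy: pick the action with the highest backed-up value in each state
    policy = {s: max(actions[s],
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return policy, V
```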

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 20

NJFun: Action Space

Initiative

• User: the system asks open questions with an unrestricted grammar for recognition

• System: the system uses directed prompts with restricted grammars

• Mixed: the system uses directed prompts with non-restricted grammars

Confirmation

• Explicit: the system asks the user to verify an attribute

• No confirmation: the system does not generate a confirmation prompt

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 21

NJFun: State Space

{Greet}: whether the system has greeted the user or not (0,1)

{Attr}: which attr the system is trying to obtain or verify (1=activity, 2=location, 3=time, 4=done)

{Conf}: ASR confidence after obtaining value for an attribute (0,1,2,3,4)

{Val}: whether system has obtained a value for an attribute (0,1)

{Times}: number of times the system has asked for the attribute

{Gram}: type of grammar most recently used to obtain the attribute

{Hist}: “trouble-in-past”
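One way to picture this state space is as a fixed-length feature vector; the sketch below uses a Python NamedTuple whose fields mirror the slide, but the encoding and the example values are invented for illustration.

```python
from typing import NamedTuple

class NJFunState(NamedTuple):
    """One point in the NJFun state space, with the features listed above.
    (Representation is illustrative; the original system's encoding may differ.)"""
    greet: int   # 0/1: has the system greeted the user?
    attr: int    # 1=activity, 2=location, 3=time, 4=done
    conf: int    # ASR confidence bucket (0-4)
    val: int     # 0/1: value obtained for the attribute?
    times: int   # how often the attribute has been asked for
    gram: int    # type of grammar most recently used
    hist: int    # "trouble in past" flag

# example state right after the greeting, before any attribute is filled (values invented)
s0 = NJFunState(greet=1, attr=1, conf=0, val=0, times=0, gram=0, hist=0)
```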

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 22

Example

S1: Welcome to NJFun. How may I help you?

s[greet=1] - a[user initiative]

U1: I’d like to find *um* wine tasting in Lambertville.

s[conf=2, val=1]

• S2a: Did you say you are interested in wine tasting in Lambertville?

s’[attr=(1,2), times=1] - a[explicit confirmation]

• S2b: At what time?

s’[attr=3] - a[no confirmation]

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 23

NJFun: Optimizing the strategy

NJFun’s initial strategy: "Exploratory for Initiative and Confirmation" (EIC); it chooses randomly between the possible actions in each state.

Data: 54 subjects for training, 21 for testing.

Binary reward function: 1 if the system queries the DB with all specified attributes, 0 otherwise (sketched below).

Results: a large and significant improvement for expert users and a non-significant degradation for novices.
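The binary reward can be sketched as a simple function over the final state of a dialogue; the representation below (a dict of attribute values) is an assumption for illustration, not the paper's implementation.

```python
def binary_reward(final_state, specified_attrs):
    """The binary task reward, sketched as a function (names are illustrative):
    1 if the system queried the database with values for all attributes the user specified,
    0 otherwise."""
    return 1 if all(final_state.get(attr) is not None for attr in specified_attrs) else 0

# reward is 1 only when activity, location and time were all filled before the DB query
print(binary_reward({"activity": "wine tasting", "location": "Lambertville", "time": "2pm"},
                    ["activity", "location", "time"]))  # -> 1
```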

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 24

Discussion

How general are the features? What about dialogues in other domains (e.g. information seeking vs. tutorial dialogue)?

What about the algorithm? Why can't we use supervised learning?

Do we really save costs?

• Stochastic user models for training

• "Bootstrap" an initial system from training data

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 25

Additional Slides

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 26

Simple Learning Taxonomy

Supervised Learning: a "teacher" provides the required response to inputs. The desired behaviour is known.

Unsupervised Learning: the learner looks for patterns in the input. There is no "right" answer.

Reinforcement Learning: the learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 27

RL vs. SL

The main problem facing an SL system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations.

An SL system cannot be said to learn to control its environment, because it follows, rather than influences, the instructive information it receives. Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 28

RL vs. Unsupervised Learning

Unsupervised learning: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should not be smaller than n).

RL: plan your decisions to achieve some goal in the future; delayed rewards.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 29

A More Formal Definition of the RL Framework...

POLICY: $\pi(s,a) = P\{a_t = a \mid s_t = s\}$

Given that the situation at time t is s, the policy gives the probability that the agent's action will be a.

Reward function

Defines goal, and immediate good or bad experience

Value function

Estimate of total future long-term reward.

(We want actions that lead to states of high value, not necessarily high immediate reward!)

Model of environment

Maps states and actions onto states, $S \times A \to S$. If in state $s_1$ we take action $a_2$, the model predicts $s_2$ (and sometimes the reward $r_2$).

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 30

Markov Property

A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.

For example: the current position and velocity of a cannonball is all that matters for its future flight. It doesn't matter how that position and velocity came about.

This is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path," or history, of signals that have led up to it.

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 31

MDPs vs. POMDPs

Major difference: how they represent uncertainty.

In MDPs the state space is in general represented as vectors describing information slots, each associated with a discrete value.

POMDPs explicitly model uncertainty by maintaining a belief state, a distribution over MDP states, in the absence of knowing the state exactly.
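For reference, the belief state is updated with the standard textbook rule $b'(s') \propto O(o \mid s', a) \sum_s T(s' \mid s, a)\, b(s)$; the sketch below is generic, and the T and O dictionaries are illustrative assumptions, not part of the slides.

```python
def update_belief(belief, action, observation, T, O):
    """Standard POMDP belief update (generic textbook formula, not system-specific):
    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).
    `T[s][a]` is a dict {s2: prob}, `O[a][s2]` a dict {obs: prob}; both are illustrative."""
    reachable = {s2 for s in belief for s2 in T[s][action]}
    new_belief = {}
    for s2 in reachable:
        prior = sum(T[s][action].get(s2, 0.0) * b for s, b in belief.items())
        new_belief[s2] = O[action][s2].get(observation, 0.0) * prior
    norm = sum(new_belief.values())
    return {s2: p / norm for s2, p in new_belief.items()} if norm > 0 else new_belief
```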

V. Rieser: An Introduction to Reinforcement Learning. C&C, 2005. 32

Some Notable RL Applications

• TD-Gammon (Tesauro): the world's best backgammon program

• Elevator Control (Crites & Barto): a high-performance down-peak elevator controller

• Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls

In general, RL is applicable to all (?) goal-oriented optimization tasks.