Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich

Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich: Reinforcement Learning via Practice and Critique Advice. AAAI-2010, Atlanta, GA

Page 1:

Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich

Reinforcement Learning via Practice and Critique Advice

AAAI-2010 Atlanta, GA

Page 2:

PROBLEM: RL usually takes a long time to learn a good policy.

RESEARCH QUESTION: Can we make RL perform better with outside help, such as critique or advice from a teacher, and if so, how?

GOALS:

Non-technical users as teachers

Natural interaction methods

[Diagram: Reinforcement Learning (RL) agent, Environment (state, action, reward), and Teacher (behavior, advice)]

Page 5:

Critique Data Loss L(θ,C)

Imagine: our teacher is an Ideal Teacher (provides all good actions).

O(s_i): the set of all good actions in state s_i

All actions in O(s_i) are equally good.

Any action not in O(s_i) is suboptimal according to the Ideal Teacher.

[Diagram: Ideal Teacher and Advice Interface. Through the Advice Interface, some good actions are labeled, some bad actions are labeled, and some actions are left unlabeled.]

Page 6:

‘Any Label Learning’ (ALL)

Imagine: our teacher is an Ideal Teacher (provides all good actions).

O(s_i): the set of all good actions in state s_i

All actions in O(s_i) are equally good.

Any action not in O(s_i) is suboptimal according to the Ideal Teacher.

[Diagram: Ideal Teacher and Advice Interface]

Learning Goal: Find a probabilistic policy, or classifier, that has a high probability of returning an action in O(s) when applied to s.

ALL Likelihood L_ALL(θ,C): based on the probability of selecting an action in O(s_i) given state s_i.
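The likelihood formula itself did not survive in this transcript. A minimal sketch, assuming the ALL likelihood is the product over critiqued states of the policy's total probability mass on the good set, i.e. L_ALL(θ,C) = Π_i Σ_{a in O(s_i)} π_θ(a | s_i), could look as follows. The softmax policy, the function names, and the toy scores are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of action scores into a probability distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def all_log_likelihood(action_probs, good_sets):
    """Log of the product over states of P(policy picks some action in O(s_i)).

    action_probs: one probability vector per state (hypothetical policy output).
    good_sets: one collection of action indices per state, in the role of O(s_i).
    """
    total = 0.0
    for probs, good in zip(action_probs, good_sets):
        mass_on_good = sum(probs[a] for a in good)  # probability of landing in O(s_i)
        total += np.log(mass_on_good)
    return total

# Two toy states with three actions each (scores are made up for illustration).
probs = [softmax(np.array([2.0, 2.0, 0.0])), softmax(np.array([0.0, 1.0, 3.0]))]
good = [{0, 1}, {2}]
print(all_log_likelihood(probs, good))
```

Working in log space keeps the product numerically stable; maximizing this quantity rewards any policy that concentrates its probability inside the good set, without preferring one good action over another.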

Page 7:

Critique Data Loss L(θ,C)

Coming back to reality: not all teachers are ideal!

The sets of actions that a teacher labels as good and as bad provide only partial evidence about O(s_i).

[Figure: Advice Interface]

What about the naïve approach of treating the labeled good set as the true set O(s_i)?

Difficulties: when there are actions outside the labeled good set that are equally good compared to those in it, the learning problem becomes even harder.

We want a principled way of handling the situation where either the labeled good set or the labeled bad set can be empty.
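The difficulty with the naïve approach can be seen in a toy case. The sketch below is illustrative only: the four-action state, the label sets, and the helper `set_probability` are hypothetical and do not come from the slides.

```python
def set_probability(action_probs, actions):
    """Probability that the policy picks any action in `actions`."""
    return sum(action_probs[a] for a in actions)

# Hypothetical state with four actions; actions 0 and 1 are both truly good.
true_good = {0, 1}
labeled_good = {0}  # a non-ideal teacher marked only one of them

# A policy that always picks action 1 is a good policy in this state.
policy = [0.0, 1.0, 0.0, 0.0]

ideal_score = set_probability(policy, true_good)     # judged against O(s_i)
naive_score = set_probability(policy, labeled_good)  # judged against the labeled set only

print(ideal_score, naive_score)
```

Judged against the true good set the policy is perfect, but the naïve objective gives it no credit because the teacher happened not to label the action it chose.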

Page 9:

Experimental Setup

Our domain: micro-management in tactical battles in the real-time strategy (RTS) game Wargus.

5 friendly footmen against a group of 5 enemy footmen (Wargus AI).

Two battle maps, which differed only in the initial placement of the units.

Both maps had winning strategies for the friendly team and were of roughly the same difficulty.

[Figures: Map 1, Map 2]

Page 10:

Advice Interface

Difficulty: fast pace and multiple units acting in parallel.

Our setup: provide end-users with an Advice Interface that allows them to watch a battle and pause it at any moment.

Page 11:

User Study

Goal is to evaluate two systems:

1. Supervised System = no practice session

2. Combined System = includes practice and critique

The user study involved 10 end-users: 6 with a CS background and 4 with no CS background.

Each user trained both the supervised and combined systems: 30 minutes total for supervised, 60 minutes for combined due to the additional time for practice.

Since repeated runs are not practical, results are qualitative.

To provide statistical results, we first present simulated experiments.

Page 12:

Simulated Experiments

After the user study, we selected the worst- and best-performing users on each map when training the combined system.

Total Critique data: User#1: 36, User#2: 91, User#3: 115, User#4: 33.

For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.

We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.
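The data-set construction above can be sketched as follows. This is a minimal illustration under stated assumptions: critiques are taken in their recorded order, cut points are rounded when the count does not divide evenly by four, and the function name `cumulative_subsets` is hypothetical. None of these details are specified on the slide.

```python
def cumulative_subsets(critiques, n_segments=4):
    """Return n_segments nested data sets holding 1/n, 2/n, ..., n/n of the data.

    When the data does not divide evenly, the cut points are rounded; that
    rounding rule is an assumption, not something the slides specify.
    """
    total = len(critiques)
    subsets = []
    for k in range(1, n_segments + 1):
        cut = round(total * k / n_segments)
        subsets.append(critiques[:cut])
    return subsets

# Critique counts per user as reported on the slide; the items are placeholders.
counts = {"User#1": 36, "User#2": 91, "User#3": 115, "User#4": 33}
for user, n in counts.items():
    sizes = [len(s) for s in cumulative_subsets(list(range(n)))]
    print(user, sizes)
```

Each data set contains the previous one, so a curve plotted over the four sets shows the effect of adding critique data rather than of swapping it.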

Page 13:

Simulated Experiments Results: Benefit of Critiques (User #1)

RL is unable to learn a winning policy (i.e. achieve a positive value).

Page 14:

Simulated Experiments Results: Benefit of Critiques (User #1)

With more critiques, performance increases slightly.

Page 15:

Simulated Experiments Results: Benefit of Critiques (User #1)

As the amount of critique data increases, performance improves for a fixed number of practice episodes.

RL did not exceed a health difference of 12 on any map, even after 500 trajectories.

Page 16:

Simulated Experiments Results: Benefit of Practice (User #1)

Even with no practice, the critique data was sufficient to outperform RL.

RL did not exceed a health difference of 12.

Page 17:

Simulated Experiments Results: Benefit of Practice (User #1)

With more practice, performance also increases.

Page 18:

Simulated Experiments Results: Benefit of Practice (User #1)

Our approach is able to leverage practice episodes to improve the effectiveness of a given amount of critique data.

Page 19:

Results for Actual User Study

Goal is to evaluate two systems:

1. Supervised System = no practice session

2. Combined System = includes practice and critique

The user study involved 10 end-users: 6 with a CS background and 4 with no CS background.

Each user trained both the supervised and combined systems: 30 minutes total for supervised, 60 minutes for combined due to the additional time for practice.

Page 20:

Results of User Study

[Bar chart: Health Difference Achieved by Users. x-axis: Health Difference (-50, 0, 50, 80, 100); y-axis: Number of Users (0 to 10); series: Supervised, Combined]

Page 21:

Results of User Study

Comparing to RL:

9 out of 10 users achieved a performance of 50 or more using the Supervised System.

6 out of 10 users achieved a performance of 50 or more using the Combined System.

Users performed better than RL using either the Supervised or the Combined System.

RL did not exceed a health difference of 12 on any map, even after 500 trajectories.

[Bar chart: Health Difference Achieved by Users. x-axis: Health Difference (-50, 0, 50, 80, 100); y-axis: Number of Users (0 to 10); series: Supervised, Combined]

Page 22:

Results of User Study

Frustrating problems for users:

Users experienced large delays (not an issue in many realistic settings).

The policy returned after practice was sometimes poor and seemed to be ignoring advice (perhaps practice sessions were too short).

Comparing Combined and Supervised:

The end-users had slightly greater success with the supervised system than with the combined system.

More users were able to achieve performance levels of 50 and 80 using the supervised system.

[Bar chart: Health Difference Achieved by Users. x-axis: Health Difference (-50, 0, 50, 80, 100); y-axis: Number of Users (0 to 10); series: Supervised, Combined]

Page 23:

Future Work

An important part of our future work will be to conduct further user studies in order to pursue the most relevant directions including:

Studying more sophisticated user models that further approximate real users.

Enriching the forms of advice.

Studying how the learner can actively request advice from the teacher.

Designing realistic user studies.

Increasing the stability of autonomous practice.

Page 24:

Questions?