Deciding Under Probabilistic Uncertainty. Russell and Norvig: Sect. 17.1-3, Chap. 17. CS121 – Winter 2003.
Non-deterministic vs. Probabilistic Uncertainty
Non-deterministic model: an action may lead to any outcome in {a, b, c}; the best decision is the one that is best for the worst case (~ adversarial search).

Probabilistic model: an action leads to outcomes {a (pa), b (pb), c (pc)} with known probabilities; the best decision is the one that maximizes expected utility.
One State/One Action Example

Action A1 from state S0 leads to states S1, S2, S3 with probabilities 0.2, 0.7, 0.1 and utilities 100, 50, 70.

U(S0) = 100 x 0.2 + 50 x 0.7 + 70 x 0.1 = 20 + 35 + 7 = 62
One State/Two Actions Example

Action A1 from S0: states S1, S2, S3 with probabilities 0.2, 0.7, 0.1 and utilities 100, 50, 70. Action A2 from S0: states S2, S4 with probabilities 0.2, 0.8, where U(S4) = 80.

• U1(S0) = 62
• U2(S0) = 74
• U(S0) = max{U1(S0), U2(S0)} = 74
Introducing Action Costs

Same actions as before, but now A1 costs 5 and A2 costs 25:

• U1(S0) = 62 - 5 = 57
• U2(S0) = 74 - 25 = 49
• U(S0) = max{U1(S0), U2(S0)} = 57
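The computations on these slides can be sketched in a few lines. The helper name `ev` is illustrative, not from the course, and A2's low-probability branch is read as landing in the state with utility 50, which matches U2(S0) = 74:

```python
# Expected-utility choice between two actions, with action costs.
# The helper name `ev` is illustrative only.

def ev(outcomes):
    """Expected utility of an action: sum of P(outcome) * U(outcome)."""
    return sum(p * u for p, u in outcomes)

# A1 from S0: utilities 100, 50, 70 with probabilities 0.2, 0.7, 0.1; cost 5
u1 = ev([(0.2, 100), (0.7, 50), (0.1, 70)]) - 5   # 62 - 5 = 57
# A2 from S0: utilities 50, 80 with probabilities 0.2, 0.8; cost 25
u2 = ev([(0.2, 50), (0.8, 80)]) - 25              # 74 - 25 = 49

u_s0 = max(u1, u2)                                # best action is A1
```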
Example: Finding Juliet
[Map: Charles' office, Juliet's office, and the conference room, with travel times of 10 min, 5 min, and 10 min between them]
Example: Finding Juliet
States:
• S0: Romeo in Charles' office
• S1: Romeo in Juliet's office and Juliet here
• S2: Romeo in Juliet's office and Juliet not here
• S3: Romeo in conference room and Juliet here
• S4: Romeo in conference room and Juliet not here

Actions:
• GJO (go to Juliet's office)
• GCR (go to conference room)
Utility Computation
[Decision tree over states 0-4 under actions GJO and GCR, with action costs -10, -10, -10, -15, -10 (travel times), alongside the office map from the previous slide]
n-Step Decision Process
• There is a single initial state
• States reached after i steps are all different from those reached after j ≠ i steps
• Each state i has its reward R(i)
• Each state reached after n steps is terminal
• The goal is to maximize the sum of rewards
n-Step Decision Process
Utility of state i:

U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)

Example: two possible actions from i, a1 and a2:
• a1 leads to k11 with probability P11 = P(k11 | a1.i) or to k12 with probability P12 = P(k12 | a1.i)
• a2 leads to k21 with probability P21 or to k22 with probability P22
• U(i) = R(i) + max{P11 U(k11) + P12 U(k12), P21 U(k21) + P22 U(k22)}
n-Step Decision Process
For j = n-1, n-2, …, 0 do:
  For every state si attained after step j:
    Compute the utility of si
    Label that state with the corresponding best action

Utility of state i:

U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)

Best choice of action at state i (the optimal policy):

π*(i) = arg max_a Σ_k P(k | a.i) U(k)
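The backward sweep can be sketched on a made-up two-step process; the state names, probabilities, and rewards below are invented for illustration:

```python
# Backward induction for an n-step decision process (sketch).
# States are (step, name) pairs; trans[state][action] lists (prob, next_state).

R = {(0, 'i'): 0, (1, 'k11'): 10, (1, 'k12'): 4, (1, 'k21'): 8, (1, 'k22'): 6}
trans = {
    (0, 'i'): {
        'a1': [(0.5, (1, 'k11')), (0.5, (1, 'k12'))],
        'a2': [(0.9, (1, 'k21')), (0.1, (1, 'k22'))],
    },
}

U, policy = {}, {}
# Terminal states (reached after step n): utility is just the reward
for s in R:
    if s not in trans:
        U[s] = R[s]
# Sweep backward over decision states: j = n-1, ..., 0
for s, actions in sorted(trans.items(), reverse=True):
    best_a, best_v = max(
        ((a, sum(p * U[k] for p, k in outs)) for a, outs in actions.items()),
        key=lambda av: av[1])
    U[s] = R[s] + best_v      # U(i) = R(i) + max_a sum_k P(k|a.i) U(k)
    policy[s] = best_a        # label the state with its best action
```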
n-Step Decision Process …
… with costs on actions instead of rewards on states.

Utility of state i:

U(i) = max_a (Σ_k P(k | a.i) U(k) - Ca)

Best choice of action at state i:

π*(i) = arg max_a (Σ_k P(k | a.i) U(k) - Ca)
[The finding-Juliet decision tree (states 0-4, steps 0-3), illustrating the backward computation of utilities and best actions]
Target Tracking
• The robot must keep the target in view
• The target's trajectory is not known in advance
• The environment may or may not be known
States Are Indexed by Time
• State = (robot-position, target-position, time)
• Action = (stop, up, down, right, left)
• Outcome of an action = 5 possible states (one per possible target move), each with probability 0.2

Example: from ([i,j], [u,v], t), action right leads to:
• ([i+1,j], [u,v], t+1)
• ([i+1,j], [u-1,v], t+1)
• ([i+1,j], [u+1,v], t+1)
• ([i+1,j], [u,v-1], t+1)
• ([i+1,j], [u,v+1], t+1)

Each state has 25 successors (5 robot actions x 5 outcomes each).
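Successor generation for one robot action can be sketched as below; this is an illustrative fragment that omits grid bounds and visibility checks:

```python
# Successors of a tracking state under one robot action (here: right).
# State = (robot, target, t); the target makes one of 5 moves, each with
# probability 0.2. Bounds and visibility checks are omitted for brevity.

MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # stop, right, left, up, down

def successors(state, robot_move):
    (i, j), (u, v), t = state
    ri, rj = i + robot_move[0], j + robot_move[1]
    return [(((ri, rj), (u + du, v + dv), t + 1), 0.2) for du, dv in MOVES]

succ = successors(((3, 2), (5, 5), 0), (1, 0))
# 5 outcomes per robot action; with 5 robot actions, 25 successors per state
```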
h-Step Planning Process
Planning horizon: h

"Terminal" states:
• States where the target is not visible
• States at depth h

Reward function:
• Target visible: +1
• Target not visible: 0

Maximizing the sum of rewards ~ maximizing escape time.

With discounting: R(state) = γ^t, where 0 < γ < 1.
h-Step Planning Process
Planning horizon: h

The planner computes the optimal policy over this tree of states, but only the first step of the policy is executed. Then everything is repeated again (sliding horizon).
h-Step Planning Process

h is chosen so that the optimal policy over the tree can be computed within one increment of time.
Example With No Planner
Example With Planner
Other Example
h-Step Planning Process

The optimal policy over this tree is not the optimal policy that would have been computed if a prior model of the environment had been available, along with an arbitrarily fast computer.
Simple Robot Navigation Problem
• In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
  • With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)
  • With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
  • With probability 0.1 the robot moves left one square (if the robot is already in the leftmost column, then it does not move)
Markov Property
The transition properties depend only on the current state, not on previous history (how that state was reached)
Sequence of Actions
• Planned sequence of actions: (U, R)
• U is executed, then R is executed

[4x3 grid, rows 1-3 and columns 1-4]

Starting at [3,2], the possible states after U is executed are [3,3], [4,2], and [3,2].
Histories
• Planned sequence of actions: (U, R)
• U has been executed; R is executed
• There are 9 possible sequences of states, called histories, and 6 possible final states for the robot!

Tree of states: [3,2] → {[3,2], [3,3], [4,2]} → {[3,1], [3,2], [3,3], [4,1], [4,2], [4,3]}
Probability of Reaching the Goal
P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2])

• P([3,3] | U.[3,2]) = 0.8
• P([4,2] | U.[3,2]) = 0.1
• P([4,3] | R.[3,3]) = 0.8
• P([4,3] | R.[4,2]) = 0.1
• P([4,3] | (U,R).[3,2]) = 0.8 x 0.8 + 0.1 x 0.1 = 0.65

Note the importance of the Markov property in this derivation.
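The derivation is a two-step marginalization over the intermediate state; the Markov property is what lets each term factor:

```python
# P([4,3] | (U,R).[3,2]) by conditioning on the state reached after U,
# exactly as in the slide's derivation.

p_33_after_U = 0.8   # P([3,3] | U.[3,2])
p_42_after_U = 0.1   # P([4,2] | U.[3,2])
p_43_from_33 = 0.8   # P([4,3] | R.[3,3])
p_43_from_42 = 0.1   # P([4,3] | R.[4,2])

# Sum over intermediate states (the [3,2] branch cannot reach [4,3] in one R)
p_goal = p_43_from_33 * p_33_after_U + p_43_from_42 * p_42_after_U  # 0.65
```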
Utility Function
• The robot needs to recharge its batteries
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• [4,3] and [4,2] are terminal states
• Reward of a terminal state: +1 for [4,3], -1 for [4,2]
• Reward of a non-terminal state: -1/25
• Utility of a history: sum of rewards of traversed states
• Goal: maximize the utility of the history
Histories are potentially unbounded and the same state can be reached many times.
Utility of an Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories:

U = Σ_h Uh P(h)
![Page 45: Deciding Under Probabilistic Uncertainty Russell and Norvig: Sect. 17.1-3,Chap. 17 CS121 – Winter 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062803/56649f4a5503460f94c6bdce/html5/thumbnails/45.jpg)
Optimal Action SequenceOptimal Action Sequence
• The optimal sequence is the one with maximal expected utility
• But is the optimal action sequence what we want to compute?

NO!! Except if the sequence is executed blindly (open-loop strategy).
Reactive Agent Algorithm

Assumes the current state is observable.

Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← choose action (given s)
  Perform a
Utility of state i:

U(i) = max_a (Σ_k P(k | a.i) U(k) - Ca)

Best choice of action at state i:

π*(i) = arg max_a (Σ_k P(k | a.i) U(k) - Ca)
Policy (Reactive/Closed-Loop Strategy)
• A policy is a complete mapping from states to actions
Reactive Agent Algorithm

Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← π(s)
  Perform a
Optimal Policy
• A policy is a complete mapping from states to actions
• The optimal policy π* is the one that always yields a history (ending at a terminal state) with maximal expected utility

This makes sense because of the Markov property.

Note that [3,2] is a "dangerous" state that the optimal policy tries to avoid.
Optimal Policy

This problem is called a Markov Decision Problem (MDP).

How to compute π*?
The trick used in target tracking (indexing states by time) can be applied … but would yield a large tree and a sub-optimal policy.
First-Step Analysis
Simulate one step:

U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
π*(i) = arg max_a Σ_k P(k | a.i) U(k)

(the principle of Maximum Expected Utility)
What is the Difference?
π*(i) = arg max_a Σ_k P(k | a.i) U(k)

U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
Value Iteration

Initialize the utility of each non-terminal state si to U0(i) = 0.
For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + max_a Σ_k P(k | a.i) Ut(k)
Value Iteration

[Plot: Ut([3,1]) as a function of t, rising toward about 0.611 by t ≈ 30]

Converged utilities on the 4x3 grid:

  row 3:  0.812  0.868  0.918   +1
  row 2:  0.762         0.660   -1
  row 1:  0.705  0.655  0.611  0.388

Note the importance of terminal states and of the connectivity of the state-transition graph.

(Not very different from indexing states by time.)
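Value iteration on this grid can be written directly from the update rule. A sketch, assuming the standard layout (obstacle at [2,2], terminals [4,3] and [4,2], 0.8/0.1/0.1 transition noise):

```python
# Value iteration on the 4x3 grid: terminals [4,3] = +1 and [4,2] = -1,
# an obstacle at [2,2], step reward -1/25, and noisy 0.8/0.1/0.1 moves.

STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
DELTA = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
PERP = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}  # perpendicular slips

def move(s, a):
    nx, ny = s[0] + DELTA[a][0], s[1] + DELTA[a][1]
    return (nx, ny) if (nx, ny) in STATES else s  # blocked moves stay put

def outcomes(s, a):
    yield 0.8, move(s, a)            # intended direction
    for side in PERP[a]:
        yield 0.1, move(s, side)     # sideways slip

U = {s: TERMINAL.get(s, 0.0) for s in STATES}
for _ in range(60):                  # enough sweeps to converge here
    U = {s: U[s] if s in TERMINAL else
         -1 / 25 + max(sum(p * U[k] for p, k in outcomes(s, a)) for a in DELTA)
         for s in STATES}
```

After convergence, U[(3,1)] lands near the 0.611 shown on the slide's plot.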
Policy Iteration
Pick a policy π at random
Repeat:
  Compute the utility of each state for π:
    Uπ(i) = R(i) + Σ_k P(k | π(i).i) Uπ(k)
  (a set of linear equations, often a sparse system)
  Compute the policy π' given these utilities:
    π'(i) = arg max_a Σ_k P(k | a.i) Uπ(k)
  If π' = π then return π, else π ← π'
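The loop can be sketched on a tiny invented MDP. Evaluation here is done by iterating the equations to a fixed point; on large problems one would solve the sparse linear system directly, as the slide notes:

```python
# Policy iteration on a small made-up MDP. The evaluation step iterates
# U_pi(i) = R(i) + sum_k P(k | pi(i).i) U_pi(k) to a fixed point.

R = {'s0': 0.0, 's1': 0.0, 'goal': 1.0}
P = {  # P[state][action] = list of (probability, next_state)
    's0': {'a': [(0.9, 's1'), (0.1, 's0')], 'b': [(1.0, 's0')]},
    's1': {'a': [(0.8, 'goal'), (0.2, 's0')], 'b': [(1.0, 's1')]},
}

def evaluate(pi, sweeps=200):
    U = {'s0': 0.0, 's1': 0.0, 'goal': R['goal']}  # terminal fixed at its reward
    for _ in range(sweeps):
        for s in P:
            U[s] = R[s] + sum(p * U[k] for p, k in P[s][pi[s]])
    return U

pi = {s: 'b' for s in P}              # pick a policy at random
while True:
    U = evaluate(pi)
    new_pi = {s: max(P[s], key=lambda a, s=s: sum(p * U[k] for p, k in P[s][a]))
              for s in P}
    if new_pi == pi:
        break                         # policy is stable: return pi
    pi = new_pi
```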
Application of First-Step Analysis: Computing the Probability of Folding (p_fold) of a Protein
A conformation in the Unfolded state reaches the Folded state with probability p_fold and returns to the Unfolded state with probability 1 - p_fold. (The example protein pictured is HIV integrase.)
Computation Through Simulation
10K to 30K independent simulations
Capture the stochastic nature of molecular motion by a probabilistic roadmap

[Roadmap graph: nodes vi, vj connected by edges with transition probabilities Pij]
Edge Probabilities

Follow the Metropolis criterion:

  Pij = (1/Ni) exp(-ΔEij / kB T)  if ΔEij > 0
  Pij = 1/Ni                      otherwise

where Ni is the number of neighbors of vi, ΔEij is the energy difference between vj and vi, kB is Boltzmann's constant, and T is the temperature.

Self-transition probability:

  Pii = 1 - Σ_{j≠i} Pij
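The criterion can be sketched as follows; the energies and the temperature constant below are made-up numbers, and `edge_probs` is an illustrative helper:

```python
import math

# Metropolis edge probabilities for one roadmap node (sketch).
# kT stands in for kB * T; the values are arbitrary illustrations.

kT = 0.6

def edge_probs(E_i, E_neighbors, kT=kT):
    """P_ij = (1/N_i) exp(-dE/kT) if dE > 0, else 1/N_i; P_ii gets the rest."""
    n = len(E_neighbors)
    probs = []
    for E_j in E_neighbors:
        dE = E_j - E_i
        probs.append(math.exp(-dE / kT) / n if dE > 0 else 1.0 / n)
    p_self = 1.0 - sum(probs)   # self-transition absorbs the remainder
    return probs, p_self

probs, p_self = edge_probs(1.0, [0.5, 1.2, 3.0])
# Downhill moves get the full 1/N_i; uphill moves are exponentially damped
```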
First-Step Analysis

F: Folded set; U: Unfolded set.

[Node i with self-transition probability Pii and transition probabilities Pij, Pik, Pil, Pim to its neighbors j, k, l, m]
Let fi = p_fold(i). After one step:

fi = Pii fi + Pij fj + Pik fk + Pil fl + Pim fm

with fi = 1 for nodes in the Folded set and fi = 0 for nodes in the Unfolded set.

• One linear equation per node; the solution gives p_fold for all nodes
• No explicit simulation run
• All pathways are taken into account
• Sparse linear system
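A minimal sketch of this first-step analysis on a made-up 1-D chain of conformations, sweeping the linear equations to a fixed point rather than calling a sparse solver:

```python
# p_fold by first-step analysis on an invented 1-D chain of 5 conformations:
# node 0 is in the Unfolded set (f = 0), node 4 is in the Folded set (f = 1),
# and each interior node hops to either neighbor with probability 0.5.

N = 5
f = [0.0] * N
f[N - 1] = 1.0                      # folded boundary condition
for _ in range(2000):               # sweep f_i = sum_j P_ij f_j to a fixed point
    for i in range(1, N - 1):       # boundary values stay fixed
        f[i] = 0.5 * f[i - 1] + 0.5 * f[i + 1]
# For a symmetric chain p_fold grows linearly: f -> [0, 0.25, 0.5, 0.75, 1]
```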
Partially Observable Markov Decision Problem
• Uncertainty in sensing: a sensing operation returns multiple states, with a probability distribution
Example: Target Tracking
There is uncertainty in the robot's and target's positions, and this uncertainty grows with further motion.

There is a risk that the target escapes behind the corner, requiring the robot to move appropriately.

But there is a positioning landmark nearby. Should the robot try to reduce its position uncertainty?
Summary

• Probabilistic uncertainty
• Utility function
• Optimal policy
• Maximal expected utility
• Value iteration
• Policy iteration