Testing Stochastic Processes Through Reinforcement Learning
![Page 1: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/1.jpg)
Testing Stochastic Processes Through Reinforcement Learning
François Laviolette, Sami Zhioua, Josée Desharnais
NIPS Workshop, December 9th, 2006
![Page 2: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/2.jpg)
Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion
![Page 3: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/3.jpg)
Stochastic Program Verification

- Specification (LMP): an MDP without rewards
- Implementation

[Diagram: an example LMP with states s0-s6 and probabilistic transitions such as a[0.5], a[0.3], b[0.9], and c]

How far is the Implementation from the Specification? (Distance or divergence)

- The Specification model is available.
- The Implementation is available only for interaction (no model).
![Page 4: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/4.jpg)
1. Non-deterministic trace equivalence

Two systems are trace equivalent iff they accept the same set of traces.

[Diagram: two labelled transition systems P and Q]

T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}
T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}

2. Probabilistic trace equivalence

Two systems are trace equivalent iff they accept the same set of traces, with the same probabilities.

[Diagram: two probabilistic systems P and Q, with transitions such as a[2/3], a[1/3], a[1/4], a[3/4], b[1/2], c[1/2]]

P: a 7/12, aa 5/12, aac 1/6, bc 2/3, ...
Q: a 1, aa 1/2, aac 0, bc 0, ...
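Probabilistic trace equivalence can be checked mechanically once both models are known. A minimal sketch, assuming a hypothetical dictionary encoding of a probabilistic transition system (the states and probabilities below are illustrative, not the P and Q of the slide):

```python
# Each state maps an action to a list of (probability, next_state) pairs.
# This encoding and the numbers are made up for illustration.
P = {
    "s0": {"a": [(2/3, "s1"), (1/3, "s2")]},
    "s1": {"a": [(1/4, "s3")], "b": [(2/3, "s4")]},
    "s2": {"b": [(1.0, "s5")], "c": [(1.0, "s6")]},
}

def trace_prob(system, state, trace):
    """Probability of running the whole trace from `state`."""
    if not trace:
        return 1.0
    action, rest = trace[0], trace[1:]
    total = 0.0
    for prob, nxt in system.get(state, {}).get(action, []):
        total += prob * trace_prob(system, nxt, rest)
    return total

print(trace_prob(P, "s0", ["a"]))        # both a-branches succeed
print(trace_prob(P, "s0", ["a", "b"]))   # 2/3 * 2/3 + 1/3 * 1 = 7/9
print(trace_prob(P, "s0", ["c"]))        # no c enabled at s0
```

Two systems are then trace equivalent exactly when `trace_prob` agrees on every trace.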
![Page 5: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/5.jpg)
Testing (Trace Equivalence)

The system is a black box. When a button is pushed (action execution), either:
- the button goes down (transition), or
- the button does not go down (no transition).

Grammar (trace equivalence): t ::= ω | a.t

Observations: when a test t is executed, several observations are possible: O_t.

[Diagram: black box with buttons a, b, ..., z; example LMP fragment with states s0, s3 and transitions a[0.2], a[0.5], b[0.7]]

Example: for t = a.b, the observations O_t (failing at a, or passing a and then failing or passing b) occur with probabilities 0.3, 0.56, and 0.14.
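Executing such a linear test against a black box can be simulated. A sketch under assumed encodings: `make_black_box`, `run_test`, and the 0.7/0.8 probabilities are all hypothetical, chosen only to show the button-pushing protocol:

```python
import random

def make_black_box(system, start):
    """Return a push(action) function that hides the system's model."""
    state = [start]
    def push(action):
        # crude sampling: assumes at most one (prob, next) entry per action
        for prob, nxt in system.get(state[0], {}).get(action, []):
            if random.random() < prob:
                state[0] = nxt
                return True        # the button goes down
        return False               # the button does not go down
    return push

# Hypothetical implementation under test
impl = {"s0": {"a": [(0.7, "s3")]}, "s3": {"b": [(0.8, "s4")]}}

def run_test(push, trace):
    """Observation: which actions succeeded ('v') before the first failure ('x')."""
    for i, action in enumerate(trace):
        if not push(action):
            return [a + "v" for a in trace[:i]] + [action + "x"]
    return [a + "v" for a in trace]    # the whole test succeeded

random.seed(0)
result = run_test(make_black_box(impl, "s0"), ["a", "b"])
print(result)
```

Repeating `run_test` on fresh boxes and counting each distinct observation estimates the probabilities in O_t.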
![Page 6: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/6.jpg)
Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion
![Page 7: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/7.jpg)
Why Reinforcement Learning?

[Diagram: an LMP with states s0-s8 and probabilistic transitions, next to the corresponding MDP with transition probabilities such as 0.5, 0.2, 0.9, 0.3, and 0.7]

- Reinforcement Learning is particularly efficient in the absence of the full model.
- Reinforcement Learning can deal with bigger systems.

Analogy:
- LMP ↔ MDP
- Trace ↔ Policy
- Divergence ↔ Optimal Value (V*)
![Page 8: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/8.jpg)
A Stochastic Game towards RL

[Diagram: sequences of successes (S) and failures (F) observed on three LMPs -- the Implementation, the Specification, and a clone of the Specification -- with associated rewards]

- Reward (+1) when the Implementation and the Specification observations agree.
- Reward (-1) when the Specification and its clone observations agree.
![Page 9: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/9.jpg)
MDP Definition

The MDP is built from the Specification LMP:
- States
- Actions
- Next-state probability distribution

[Diagram: the Implementation and Specification LMPs with states s0-s10 and actions a, b, c, and the induced MDP with transition probabilities such as 0.5, 0.2, 0.9, 0.3, 0.7, 0.8 and a Dead state]
![Page 10: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/10.jpg)
Divergence Computation

[Diagram: success/failure observation sequences rewarded +1, 0, or -1; the Implementation and Specification LMPs; and the induced MDP with a Dead state]

The optimal value V*(s0) measures the divergence:
- 0 : Equivalent
- 1 : Different
![Page 11: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/11.jpg)
Symmetry Problem

[Diagram: success/failure sequences on the Implementation vs. the Specification, rewarded +1 and -1]

Fix: create two variants for each action a:
- a success variant of a
- a failure variant of a

Example: the Implementation offers a[1]; the Specification and its clone offer a[0.5].

Protocol: select an action and make a prediction (success or failure), then execute the action.
- If the prediction equals the observation: compute and give the reward.
- If the prediction differs from the observation: give reward 0.

Without predictions, the agreement probability is the same on both sides (0*.5*.5 + 1*.5*.5 = .25), so the difference between the systems goes undetected.
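The failure mode and its fix can be checked numerically. A sketch: the probabilities 1 vs. 0.5 follow the slide's example, but the Monte Carlo framing and the exact reward bookkeeping are simplifications:

```python
import random

# Impl fires action `a` with probability 1, Spec with probability 0.5.
P_IMPL, P_SPEC = 1.0, 0.5

def obs(p):
    return random.random() < p          # True = the button goes down

def naive_value(n=200_000):
    """+1 when Impl and Spec agree, -1 when Spec and its clone agree."""
    total = 0
    for _ in range(n):
        if obs(P_IMPL) == obs(P_SPEC):      # Impl vs Spec
            total += 1
        if obs(P_SPEC) == obs(P_SPEC):      # two independent Spec draws (Spec vs clone)
            total -= 1
    return total / n

def fixed_value(n=200_000, pred=True):
    """Success variant: the reward only counts when obs matches the prediction."""
    total = 0
    for _ in range(n):
        if obs(P_IMPL) == pred and obs(P_SPEC) == pred:
            total += 1
        if obs(P_SPEC) == pred and obs(P_SPEC) == pred:
            total -= 1
    return total / n

random.seed(0)
print(naive_value())   # near 0: the difference is invisible
print(fixed_value())   # near 0.25 = 1*0.5 - 0.5*0.5: the difference is detected
```

For equivalent systems both estimators stay near 0, so only the fixed scheme separates the two cases.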
![Page 12: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/12.jpg)
The Divergence (with the symmetry problem fixed)
Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP.
V*(s0) ≥ 0, and
V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.
![Page 13: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/13.jpg)
Implementation and PAC Guarantee

- There exists a PAC guarantee for the Q-Learning algorithm, but Fiechter's algorithm has a simpler PAC guarantee.
- Besides, it is possible to obtain a lower bound thanks to the Hoeffding inequality.

Implementation:
- Discount factor γ = 0.8
- Action selection: softmax (temperature decreasing from 0.8 to 0.01)
- RL algorithm: Q-Learning, with learning rate decreasing according to the function 1/x
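A minimal Q-Learning sketch with the settings listed above (discount 0.8, softmax selection with a decreasing temperature, 1/x learning rate). The toy episodic MDP is illustrative, not the induced MDP of the slides:

```python
import math
import random
from collections import defaultdict

GAMMA = 0.8   # discount factor, as on the slide

# Toy episodic MDP: state -> action -> (next_state, reward); "end" is terminal.
MDP = {
    "s0": {"a": ("s1", 0.0), "b": ("end", 0.2)},
    "s1": {"a": ("end", 1.0)},
}

def softmax_pick(qvals, actions, tau):
    """Sample an action with probability proportional to exp(Q/tau)."""
    weights = [math.exp(qvals[a] / tau) for a in actions]
    r = random.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]

def q_learning(episodes=5000):
    Q = defaultdict(lambda: defaultdict(float))
    visits = defaultdict(int)
    for ep in range(1, episodes + 1):
        tau = max(0.01, 0.8 * 0.999 ** ep)     # temperature: 0.8 down to 0.01
        state = "s0"
        while state != "end":
            a = softmax_pick(Q[state], list(MDP[state]), tau)
            nxt, reward = MDP[state][a]        # deterministic toy transitions
            visits[(state, a)] += 1
            alpha = 1.0 / visits[(state, a)]   # learning rate decreasing as 1/x
            best_next = max(Q[nxt].values(), default=0.0) if nxt in MDP else 0.0
            Q[state][a] += alpha * (reward + GAMMA * best_next - Q[state][a])
            state = nxt
    return Q

random.seed(0)
Q = q_learning()
print(round(Q["s0"]["a"], 2))   # converges toward GAMMA * 1.0 = 0.8
```

Here the learned Q-value at the start state plays the role of V*(s0) in the divergence computation.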
![Page 14: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/14.jpg)
Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion
![Page 15: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/15.jpg)
Testing (Bisimulation)

The system is a black box.

Grammar (bisimulation), with replication: t ::= ω | a.t | (t1, ... , tn)

[Diagram: black box with buttons a, b, ..., z; example LMP fragment with states s0, s3 and transitions a[0.2], a[0.5], b[0.7]]

Example: for t = a.(b,b), the observations O_t (failing at a, or passing a followed by one of the four success/failure combinations of the replicated b) occur with probabilities P_{t,s0} = 0.3, 0.518, 0.042, 0.042, and 0.098.
![Page 16: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/16.jpg)
New Equivalence Notion: "By-Level Equivalence"

[Diagram: two systems P and Q with transitions such as a[1/3], a[2/3], b[1/3], c[2/3]]
![Page 17: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/17.jpg)
K-Moment Equivalence

- 1-moment (trace): t ::= ω | a.t
- 2-moment: t ::= ω | a^k.t, k ≤ 2
- 3-moment: t ::= ω | a^k.t, k ≤ 3

X_{tr,a} is a random variable such that Pr(X_{tr,a} = p_i) is the probability to perform the trace tr and make a transition to a state that accepts action a with probability p_i. Two systems are "by-level" equivalent iff these random variables are equal in distribution.

Recall: the kth moment of X is E(X^k) = Σ_i ( x_i^k · Pr(X = x_i) ).
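The recalled formula is easy to state in code; the distribution below is a made-up example, not one of the slide's systems:

```python
def kth_moment(dist, k):
    """E[X^k] = sum_i x_i^k * Pr(X = x_i), for dist = [(value, prob), ...]."""
    return sum(x ** k * p for x, p in dist)

# X takes the acceptance probability 1/4 with prob 1/2 and 3/4 with prob 1/2
X = [(0.25, 0.5), (0.75, 0.5)]
print(kth_moment(X, 1))   # mean: 0.5
print(kth_moment(X, 2))   # second moment: 0.3125
```

Comparing k-th moments for k up to some bound is what distinguishes k-moment equivalence from plain trace equivalence (k = 1).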
![Page 18: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/18.jpg)
Ready Equivalence and Failure Equivalence

1. Ready Equivalence

Two systems are Ready equivalent iff for any trace tr and any set of actions A, they have the same probability to run tr successfully and reach a process accepting all actions from A.

Test t ::= ω | a.t | {a1, .. , an}

[Diagram: two systems P and Q; for the pair (<a>, {b,c}), P gives 2/3 and Q gives 1/2]

2. Failure Equivalence

Two systems are Failure equivalent iff for any trace tr and any set of actions A, they have the same probability to run tr successfully and reach a process refusing all actions from A.

Test t ::= ω | a.t | {a1, .. , an}

[Diagram: the same P and Q; for the pair (<a>, {b,c}), P gives 1/3 and Q gives 1/2]
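Both notions reduce to one computation on a known model. A sketch, reusing a hypothetical dictionary encoding (the system below is illustrative, not the slide's P):

```python
# state -> action -> list of (probability, next_state) pairs (made-up example)
P = {
    "s0": {"a": [(0.5, "s1"), (0.5, "s2")]},
    "s1": {"b": [(1.0, "s3")], "c": [(1.0, "s4")]},
    "s2": {"b": [(1.0, "s5")]},
}

def ready_prob(system, state, trace, A, refusing=False):
    """Probability of running `trace` and reaching a state that accepts
    every action in A (Ready) or refuses every action in A (Failure)."""
    if not trace:
        enabled = set(system.get(state, {}))
        ok = A.isdisjoint(enabled) if refusing else A <= enabled
        return 1.0 if ok else 0.0
    action, rest = trace[0], trace[1:]
    return sum(p * ready_prob(system, nxt, rest, A, refusing)
               for p, nxt in system.get(state, {}).get(action, []))

print(ready_prob(P, "s0", ["a"], {"b", "c"}))                 # only the s1 branch accepts both
print(ready_prob(P, "s0", ["a"], {"c"}, refusing=True))       # only the s2 branch refuses c
```

Two systems are Ready (resp. Failure) equivalent when these values coincide for every (trace, A) pair.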
![Page 19: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/19.jpg)
Barb Equivalence

1. Barb acceptance

Test t ::= ω | a.t | {a1, .. , an}.t

[Diagram: two systems P and Q; the barb (<a,b>, <{a,b},{b,c}>) is observed with probability 2/3]

2. Barb refusal

Test t ::= ω | a.t | {a1, .. , an}.t

[Diagram: the same P and Q; the barb (<a,b>, <{b,c},{b,c}>) is observed with probability 1/3]
![Page 20: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/20.jpg)
Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion
![Page 21: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/21.jpg)
Application on MDPs

[Diagram: two MDPs (MDP 1 and MDP 2) with states s0-s9, actions a, b, c, transition probabilities, and rewards r1-r8]

- Case 1: the reward space contains 2 values (binary): 0 and 1.
- Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}.
- Case 3: the reward space is very large (continuous): w.l.o.g. [0,1].
![Page 22: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/22.jpg)
Application on MDPs

Case 1: the reward space contains 2 values (binary).
- r = 0 is treated as a failure (F); r = 1 as a success (S).

Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}.
- Split each action by reward value: a_r1, a_r2, a_r3, a_r4, a_r5 (and likewise b_r1, ..., b_r5), each ending in F or S.

Case 3: the reward space is very large (continuous).
- Intuition: a reward r = 3/4 behaves like 1 with probability 3/4 and 0 with probability 1/4.
- After executing an action with reward r, pick a reward value (ranVal) uniformly at random: if ranVal < r, the outcome is a success (S); otherwise a failure (F).
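The Case 3 trick can be sketched in a few lines; `binarize` and the Monte Carlo check are illustrative (the name `ranVal` follows the slide):

```python
import random

def binarize(r):
    """Replace a continuous reward r in [0,1] by a success/failure draw.

    ranVal is drawn uniformly in [0,1); success (S) iff ranVal < r,
    so the success probability is exactly r.
    """
    ran_val = random.random()
    return ran_val < r     # True = S, False = F

# Monte Carlo check: with r = 3/4 the success frequency approaches 3/4.
random.seed(0)
n = 100_000
freq = sum(binarize(0.75) for _ in range(n)) / n
print(freq)
```

This keeps the expected reward unchanged while reducing Case 3 to the binary Case 1.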
![Page 23: Testing Stochastic Processes Through Reinforcement Learning](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813eda550346895da94bd5/html5/thumbnails/23.jpg)
Current and Future Work

- Application to different equivalence notions: Failure equivalence, Ready equivalence, Barb equivalence, etc.
- Experimental analysis on realistic systems
- Applying the approach to compute the divergence between HMMs, POMDPs, and probabilistic automata
- Studying the properties of the divergence