Soar-RL: Reinforcement Learning and Soar
Transcript of a presentation by Shelley Nason.
Reinforcement Learning
Reinforcement learning: learning how to act so as to maximize the expected cumulative value of a (numeric) reward signal.
In Soar terminology, RL learns operator comparison knowledge.
A learning method for low-knowledge situations
- Non-explanation-based, trial-and-error learning: RL does not require any model of operator effects to improve action choice.
- Additional requirement: rewards.
- The RL component should therefore be automatic and general-purpose, ultimately avoiding:
  - task-specific hand-coding of features
  - hand-decomposed task or reward structure
  - programmer tweaking of learning parameters
  - and so on
Q-values
- Q(s,a): the expected discounted sum of future rewards, given that the agent takes action a from state s and follows a particular policy thereafter.
- Given the optimal Q-function, selecting the action with the highest Q-value at each state yields an optimal policy.

[Diagram: a trajectory state → action → state → action → …, with each step's reward discounted (reward, λ·reward, …).]
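In standard notation, the definition on this slide reads as follows (this restatement is mine; the slides write the discount factor as λ on the diagram above and as γ later):

$$Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right]$$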
Representing the Q-function
In Soar-RL, the Q-function is stored as productions that test the state and operator and assert numeric preferences:

```
sp {RL-rule
   (state <s> ^operator <o> +)
   ...
-->
   (<s> ^operator <o> = 0.33231)}
```

During the decision phase, the Q-value of an operator O is taken to be the sum of all numeric preferences asserted for O.
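A minimal sketch of that summation in Python (the preference records here are my illustration, not Soar's internal representation):

```python
from dataclasses import dataclass

@dataclass
class NumericPreference:
    operator: str    # which proposed operator the preference is for
    value: float     # the number asserted by one RL rule

def q_value(operator: str, prefs: list[NumericPreference]) -> float:
    """Q-value of an operator = sum of the numeric preferences asserted for it."""
    return sum(p.value for p in prefs if p.operator == operator)

# e.g. two RL rules fire for the operator "fill":
prefs = [NumericPreference("fill", 0.33231), NumericPreference("fill", 0.1)]
assert abs(q_value("fill", prefs) - 0.43231) < 1e-9
```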
Learning
- Q-value represented as a composition: State × Action → Set of Features → Value.
- Q-learning: move the value of $Q(s_t, a_t)$ toward $r_t + \gamma \max_a Q(s_{t+1}, a)$.
- Bootstrapping: update the prediction at one step using the prediction at the next step.

[Diagram: a state → action → state transition with its reward, annotated with the two subproblems: automatic feature generation, and Q-learning (updating the parameters of a linear function).]
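A sketch of that update when the Q-value is a sum of rule weights (Python; splitting the TD error evenly among the firing rules is an assumption of this sketch, and all names are illustrative):

```python
GAMMA = 0.9   # discount factor (illustrative value)
ALPHA = 0.1   # learning rate (illustrative value)

def q(weights: dict, firing: frozenset) -> float:
    """Q(s,a) = sum of the weights of the RL rules firing for (s,a)."""
    return sum(weights[r] for r in firing)

def q_learning_update(weights, firing, reward, next_firing_sets):
    """Move Q(s_t,a_t) toward r_t + gamma * max_a Q(s_{t+1},a)."""
    target = reward + GAMMA * max(
        (q(weights, f) for f in next_firing_sets), default=0.0)
    td_error = target - q(weights, firing)   # bootstrapped prediction error
    for r in firing:                          # share it among the firing rules
        weights[r] += ALPHA * td_error / len(firing)

# usage sketch: two rules fire for the chosen operator
w = {"RL-1": 0.0, "RL-2": 0.0, "RL-3": 0.2}
q_learning_update(w, frozenset({"RL-1", "RL-2"}), reward=-1.0,
                  next_firing_sets=[frozenset({"RL-3"})])
```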
Current Work: Automatic Feature Generation
Constructing rule conditions with which to associate values:
- Since values are stored with RL rules, some RL rule must fire for every state-action pair for bootstrapping to work.
- Enough distinctions are required that the agent will not confuse state-action pairs with significantly different Q-values.
- We want rules that take advantage of opportunities for generalization.
Waterjug Task: a reasonable set of rules
One for each state-action pair, for instance:

```
sp {RL-0003
   :rl
   (state <s> ^jug <j1> ^jug <j2> ^operator <o> +)
   (<j1> ^volume 5 ^contents 5)
   (<j2> ^volume 3 ^contents 0)
   (<o> ^name fill ^jug <j2>)
-->
   (<s> ^operator <o> = 0)}
```
46 RL rules in total.

[Diagram: the waterjug state space, states labeled (contents of 5-jug, contents of 3-jug): (0,0), (0,3), (3,0), (3,3), (5,1), (0,1), (1,0), (1,3), (4,0), (4,3), (5,2), (0,2), (2,0), (2,3), (5,0), (5,3), with rewards +1 and -1 marked.]
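For concreteness, a toy Python restatement of the task dynamics behind this state space (my sketch; the actual agent is written as Soar productions like the one above):

```python
# A state is (contents of 5-jug, contents of 3-jug).
CAP = (5, 3)

def _with(state, idx, val):
    s = list(state)
    s[idx] = val
    return tuple(s)

def successors(state):
    """Yield (operator, next_state) for every applicable waterjug operator."""
    for i in (0, 1):
        j = 1 - i
        a, b = state[i], state[j]
        if a < CAP[i]:                        # fill jug i
            yield (f"fill {CAP[i]}", _with(state, i, CAP[i]))
        if a > 0:                             # empty jug i
            yield (f"empty {CAP[i]}", _with(state, i, 0))
        if a > 0 and b < CAP[j]:              # pour jug i into jug j
            amt = min(a, CAP[j] - b)
            yield (f"pour {CAP[i]} into {CAP[j]}",
                   _with(_with(state, i, a - amt), j, b + amt))
```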
How can rules be generated automatically? A rule could be built from WM, for instance:

```
sp {RL-1
   :rl
   (state <s> ^name waterjug ^jug <j1> ^jug <j2> ^operator <o4> +
      ^superstate nil ^type state)
   (<j1> ^contents 0 ^free 5 ^volume 5)
   (<j2> ^contents 0 ^free 3 ^volume 3)
   (<o4> ^name fill ^jug <j2>)
-->
   (<s> ^operator <o4> = 0)}
```
But we want generalization…

[Diagram: the same waterjug state space, with rewards +1 and -1 marked.]
Adaptive representations
The system constructs the feature set so that there are more distinctions in the parts of the state-action space that require them. Two directions (a toy sketch of the first follows this list):
- Specific-to-general: collect instances and cluster them according to similar values.
- General-to-specific: add distinctions when an area with a single representation appears to contain multiple values.
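A toy illustration of the specific-to-general direction (entirely my sketch; the talk names the idea but gives no algorithm):

```python
def cluster_by_value(instances, tol=0.05):
    """Greedily cluster (state, action) instances whose observed values are
    similar. instances: list of ((state, action), value) pairs."""
    clusters = []                        # each cluster: [running_mean, members]
    for sa, v in sorted(instances, key=lambda x: x[1]):
        if clusters and abs(v - clusters[-1][0]) <= tol:
            mean, members = clusters[-1]
            members.append(sa)
            clusters[-1][0] = mean + (v - mean) / len(members)
        else:
            clusters.append([v, [sa]])   # start a new cluster
    return clusters
```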
General-to-specific
Our most general rules are built from operator proposals:

```
sp {RL-1
   :rl
   (state <s1> ^name waterjug ^jug <j1> ^operator <o1> +)
   (<j1> ^free 3)
   (<o1> ^jug <j1> ^name fill)
-->
   (<s1> ^operator <o1> = 0)}
```

Such a rule is generated only when no RL rule fires for the selected operator.
Specialization: an example of an overgeneral representation
"If a jug has contents 3, then pour from that jug into the other jug."

[Diagram: the waterjug state space, with rewards +1 and -1 marked; the rule above applies in several of these states.]
Predicted Q-values at the state (3,0)

[Chart: predicted Q-value vs. update # (1–10), one series per action out of (3,0) (to (5,0), (0,0), (0,3), and (3,3)), with two points marked "Goal achieved".]
Predicted Q-value for state (3,0)

[Chart: the same four series over a much longer horizon, predicted Q-value vs. update # (1–2041).]
How to fix: add the following rule
"If there is a jug with volume 3 and contents 3, pour this jug into the other jug."

[Diagram: the waterjug state space again.]
How to fix: add the following rule (continued)

[Chart: Q-values at (3,0) vs. update # (1–175) for the same four actions, after the rule is added.]
Designing a specialization procedure
1. How do we decide whether to specialize a given rule?
2. Given that we have chosen to specialize a rule, what conditions should we add to it?
3. (optional) In what cases, if any, should a rule be eliminated?
Question 2 (what conditions to add to a rule): proposed answer
We are trying an activation-based scheme (a sketch follows this list):
- When an (instantiated) rule R decides to specialize, it finds the most activated WME, w = (ID ATTR VALUE).
- It traces upward through WM to find a shortest path from ID to some identifier in the rule's instantiation.
- w and the WMEs in the trace are added to the conditions of R to form a new rule R'.
- If R' is not a duplicate of some existing rule, R' is added to the Rete (without removing R).
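A minimal sketch of the shortest-path trace, assuming WM is just a set of (id, attr, value) triples; the breadth-first search and data structures are my illustration, not Soar's internals:

```python
from collections import deque

def trace_to_instantiation(wm, start_id, instantiation_ids):
    """wm: iterable of (id, attr, value) WME triples.
    BFS 'upward' (from a WME's value to its id) starting at start_id, stopping
    at the first identifier already tested by the rule's instantiation;
    returns the WMEs along one shortest such path (possibly empty)."""
    came_by = {start_id: None}            # node -> WME used to reach it
    frontier = deque([start_id])
    while frontier:
        node = frontier.popleft()
        if node in instantiation_ids:     # reached an anchor in the rule
            path = []
            while came_by[node] is not None:
                wme = came_by[node]       # wme == (node, attr, child)
                path.append(wme)
                node = wme[2]             # step back toward start_id
            return path
        for wme in wm:                    # parents: WMEs whose value is node
            pid, _attr, val = wme
            if val == node and pid not in came_by:
                came_by[pid] = wme
                frontier.append(pid)
    return None                           # start_id not connected to the rule
```

The new rule R' would then test w together with the WMEs on the returned path, which anchors w to an identifier the rule already matches.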
Question 1 (should a given rule be specialized?): proposed answer
Track the weights (numeric preferences) of the rules. The weights should converge once the rules are sufficient, so stop specializing when the weights stop moving (one way to operationalize this is sketched below).
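A simple test for "the weights stop moving" (my sketch; the window size and threshold are arbitrary choices, not values from the talk):

```python
def still_moving(weight_history, window=20, threshold=0.01):
    """A rule remains a candidate for specialization while the spread of its
    recent weight values exceeds a small threshold."""
    if len(weight_history) < window:
        return True                 # too little evidence to call it converged
    recent = weight_history[-window:]
    return max(recent) - min(recent) > threshold
```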
Tracking weights of rule RL-3

[Chart: weight of rule RL-3 vs. update # (1–511).]
[Chart: number of steps per run vs. run # (1–100).]
Total # RL rules by run

[Chart: total number of RL rules vs. run #.]
Conclusions
Nuggets: it did work, at least on Waterjug; that is, the agent came to follow the optimal policy.
Coal: it makes too many rules:
1. It specializes rules that don't need specialization.
2. Specializations are not always useful, since they are chosen heuristically.
3. It will try to explain non-determinism by making rules.