INC 551 Artificial Intelligence
Transcript of INC 551 Artificial Intelligence
![Page 1: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/1.jpg)
INC 551 Artificial Intelligence
Lecture 9
Introduction to Machine Learning
![Page 2: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/2.jpg)
What is machine learning (or computer learning)?
In practice, learning means finding a suitable function to map inputs to outputs.
In terms of objective, it means the computer adapting itself to the data that is fed in.
![Page 3: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/3.jpg)
Definition of Learning
A computer program is said to “learn” from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell, 1997
![Page 4: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/4.jpg)
To learn =
To change parameters in the world model
![Page 5: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/5.jpg)
Deliberative Agent
[Diagram: the agent senses and perceives the environment, consults its world model to make a decision, and acts back on the environment.]

How do we create a world model that represents the real world?
![Page 6: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/6.jpg)
Car Model
Throttle amount (x) → Speed (v)

v = 3x² + 2x + 5 (the slide's example)
v = ax² + bx + c (general quadratic model)
![Page 7: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/7.jpg)
Learning as Function Mapping

Find a better function mapping. The target function is Add+3, so input 2 should map to 5:

Add+3: 2 → 4 (performance error = 1)
Add+3 (after adapting): 2 → 4.8 (performance error = 0.2)

The learner adapts to reduce the error.
![Page 8: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/8.jpg)
Learning Design Issues
1. Components of the performance to be learned
2. Feedback (supervised, reinforcement, unsupervised)
3. Representation (function, tree, neural net, state-action model, genetic code)
![Page 9: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/9.jpg)
Types of Learning
• Supervised Learning: a teacher tells the student what is right and what is wrong, like learning in a classroom.
• Reinforcement Learning: learning by doing; the student chooses what to try, and the teacher only says whether the outcome was good or bad.
• Unsupervised Learning: there is no teacher at all; the student can separate things into two groups, but still does not know which group is good and which is bad.
![Page 10: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/10.jpg)
Supervised Learning
In general the data consist of a training set and a test set. The training set is used for learning; the test set is used for evaluation. These data are labeled with their type (A, B, C, ...).

Learning phase: training set (features + type) → Learner
Usage phase: test set (features) → Learner → type (the answer)
![Page 11: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/11.jpg)
Graph Fitting
Find a function f that is consistent with all samples.
![Page 12: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/12.jpg)
x    f(x)   type
1    3.2    B
1.5  5.6    A
4    2.3    B
2    -3.1   B
7    4.4    A
5.5  1.1    B
5    4.2    A
Data Mapping
![Page 13: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/13.jpg)
[Figure: three candidate curves, labeled 1-3, fitted through the data points]

Use the Least Mean Square algorithm.
![Page 14: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/14.jpg)
Overfit
Ockham’s razor principle
“Prefer the simplest”
![Page 15: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/15.jpg)
Least Mean Square Algorithm
Let y = wᵀx.

We can find the weight vector recursively using

w(n+1) = w(n) + μ x(n) [ y(n) − wᵀ(n) x(n) ]

where n = time step, μ = step size.
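As a concrete illustration, here is a minimal sketch of this update in MATLAB, run on the data from the example on the following slides (the zero initialization and the step size μ = 0.01 are my own choices, not from the slides):

```matlab
% LMS sketch: fit y ~ w'*[1; x] to the example data.
x = [1 2 7 4 -3.2];
y = [2.1 4.2 13.7 8.1 -6.5];

w  = [0; 0];    % weight vector [bias; slope] (arbitrary initial guess)
mu = 0.01;      % step size

for epoch = 1:8                    % one epoch = one pass over the data
    for n = 1:length(x)
        xn = [1; x(n)];            % augmented input [1; x(n)]
        e  = y(n) - w' * xn;       % error: y(n) - w'(n) x(n)
        w  = w + mu * e * xn;      % LMS update
    end
end
disp(w')    % the weights head toward the underlying line y ~ 2x
```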
![Page 16: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/16.jpg)
MATLAB Example: match the (x, y) pairs
x = [1 2 7 4 -3.2]
y = [2.1 4.2 13.7 8.1 -6.5]

Epoch 1 [plot: data points and the line fitted after epoch 1]
![Page 17: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/17.jpg)
Same data, Epoch 2 [plot: the fitted line after epoch 2]
![Page 18: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/18.jpg)
Same data, Epoch 3 [plot: the fitted line after epoch 3]
![Page 19: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/19.jpg)
Same data, Epoch 4 [plot: the fitted line after epoch 4]
![Page 20: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/20.jpg)
Same data, Epoch 8 [plot: the fitted line after epoch 8]
![Page 21: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/21.jpg)
Neural Network
1 brain = 100,000,000,000 neurons
Neuron model
![Page 22: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/22.jpg)
Mathematical Model of Neuron
![Page 23: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/23.jpg)
Activation Function
Step function Sigmoid function
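For concreteness, a minimal sketch of the two activation functions in MATLAB (the threshold at 0 and the plotting range are my own choices):

```matlab
% Step and sigmoid activation functions.
step    = @(a) double(a >= 0);        % hard threshold: outputs 0 or 1
sigmoid = @(a) 1 ./ (1 + exp(-a));    % smooth squashing: outputs in (0,1)

a = -5:0.1:5;
plot(a, step(a), a, sigmoid(a));
legend('step', 'sigmoid');
```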
![Page 24: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/24.jpg)
Network of Neurons
Networks can be divided into two kinds:
1. Single-layer feed-forward network (perceptron)
2. Multilayer feed-forward network
![Page 25: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/25.jpg)
Single layer Network
[Diagram: inputs x0, x1, x2 with weights W0, W1, W2 feeding a single output y]

y = Σ_{j=0}^{n} W_j x_j

Assume there is no activation function; the output is then a linear equation.
![Page 26: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/26.jpg)
When the activation function is a step, the network acts as a straight line that separates the classes; where that line falls depends on the weights w_j.
![Page 27: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/27.jpg)
Perceptron Algorithm
Derived from the least-squares method; used to adjust a neuron's weights appropriately.
α = learning rate
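The update rule itself is shown in the slide's figure; the standard perceptron rule it refers to is w_j ← w_j + α (t − y) x_j, with target t and output y. A minimal sketch (the AND task, zero initialization, and α = 0.1 are my own choices):

```matlab
% Perceptron learning sketch: learn the AND of two binary inputs.
X = [0 0; 0 1; 1 0; 1 1];    % inputs, one example per row
t = [0; 0; 0; 1];            % targets (logical AND)
w = zeros(3, 1);             % weights [bias; w1; w2]
alpha = 0.1;                 % learning rate

for epoch = 1:20
    for i = 1:size(X, 1)
        xi = [1; X(i, :)'];               % augmented input
        y  = double(w' * xi >= 0);        % step activation
        w  = w + alpha * (t(i) - y) * xi; % perceptron weight update
    end
end
```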
![Page 28: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/28.jpg)
Multilayer Neural Network
Hidden layers are added. The network learns with the Back-Propagation algorithm.
![Page 29: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/29.jpg)
Back-Propagation Algorithm
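As a rough illustration only (not the slide's figure), here is a minimal back-propagation sketch for one hidden layer with sigmoid units and squared error; the XOR task, network size, and learning rate are my own choices:

```matlab
% Back-propagation sketch: 2 inputs -> 2 sigmoid hidden units -> 1 output.
sigmoid = @(a) 1 ./ (1 + exp(-a));

X = [0 0; 0 1; 1 0; 1 1];    % XOR training set (made-up task)
t = [0; 1; 1; 0];

W1 = randn(2, 3);    % hidden weights, one row per unit: [bias x1 x2]
W2 = randn(1, 3);    % output weights: [bias h1 h2]
eta = 0.5;           % learning rate

for epoch = 1:5000
    for i = 1:4
        xb = [1; X(i, :)'];       % input with bias term
        h  = sigmoid(W1 * xb);    % forward pass: hidden activations
        hb = [1; h];
        y  = sigmoid(W2 * hb);    % forward pass: network output

        d2 = (y - t(i)) * y * (1 - y);              % output-layer delta
        d1 = (W2(:, 2:3)' * d2) .* h .* (1 - h);    % back-propagated deltas

        W2 = W2 - eta * d2 * hb'; % gradient-descent updates
        W1 = W1 - eta * d1 * xb';
    end
end
```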
![Page 30: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/30.jpg)
Learning Progress
The training set is fed in many times; each pass is called an epoch.
![Page 31: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/31.jpg)
Types of Learning
• Supervised Learning: a teacher tells the student what is right and what is wrong, like learning in a classroom.
• Reinforcement Learning: learning by doing; the student chooses what to try, and the teacher only says whether the outcome was good or bad.
• Unsupervised Learning: there is no teacher at all; the student can separate things into two groups, but still does not know which group is good and which is bad.
![Page 32: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/32.jpg)
Source:
Reinforcement Learning: An Introduction. Richard Sutton and Andrew Barto. MIT Press, 2002.
![Page 33: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/33.jpg)
Supervised Learning: inputs → Supervised Learning System → outputs
Training info = desired (target) outputs (features/class)

Reinforcement Learning: inputs → RL System → outputs (“actions”)
The environment returns evaluations (“rewards” / “penalties”).
![Page 34: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/34.jpg)
Properties of RL
• Learner is not told which actions to take
• Trial-and-Error search
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment
![Page 35: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/35.jpg)
Model of RL
[Diagram: the agent sends an action to the environment; the environment returns the new state and a reward.]

Key components: state, action, reward, and transition.
![Page 36: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/36.jpg)
Agent: I am at state 134 and choose action 12.
Environment: You get 29 points and move to state 113.
Agent: I am at state 113 and choose action 8.
Environment: You get 33 points and move to state 35.
Agent: I am at state 35 and choose action 8.
Environment: You get 72 points and move to state 134.
Agent: I am at state 134 and choose action 4.
Environment: You get 66 points and move to state 35.
Agent: I am at state 35 and choose action 2.
Environment: You get 53 points and move to state 88.
...
(State, Action, Reward, Next state)
![Page 37: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/37.jpg)
Example: Tic-Tac-Toe
[Figure: a tree of tic-tac-toe positions, alternating x's moves and o's moves]

Assume an imperfect opponent: he/she sometimes makes mistakes.
![Page 38: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/38.jpg)
1. Make a table with one entry per state:
   State → V(s): estimated probability of winning
   ordinary positions: .5 (initial guess)
   a won position: 1
   a lost position: 0
   a drawn position: 0

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states, and just pick the next state with the highest estimated probability of winning, the largest V(s).
![Page 39: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/39.jpg)
s = the state before the move; s′ = the state after the move.

We increment each V(s) toward V(s′), a backup:

V(s) ← V(s) + α [ V(s′) − V(s) ]

where α is a small positive fraction (e.g., 0.1), the step-size parameter.
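In code the backup is a single line; a minimal sketch (the ten-state value vector and the particular states are made up for illustration):

```matlab
% One backup of V(s) toward V(s'), the value of the state after the move.
V = 0.5 * ones(1, 10);   % initial guess: every state worth 0.5
V(10) = 1;               % suppose state 10 is a won position
alpha = 0.1;             % step-size parameter

s = 3; s_next = 10;                         % a move from state 3 to 10
V(s) = V(s) + alpha * (V(s_next) - V(s));   % V(3): 0.5 -> 0.55
```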
![Page 40: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/40.jpg)
Table Generalizing Function
State → V table:
s1 → V(s1)
s2 → V(s2)
s3 → V(s3)
...
sN → V(sN)

This is just like a function mapping.
![Page 41: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/41.jpg)
Value Table
[Figure: a one-dimensional state-value table (V-table), one value per state, next to a two-dimensional action-value table (Q-table), one value per state-action pair.]
![Page 42: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/42.jpg)
Examples of RL Implementations
TD-Gammon (Tesauro, 1992–1995):
• Start with a random network
• Play very many games against itself
• Learn a value function from this simulated experience
• Action selection by 2–3 ply search

TD error: V_{t+1} − V_t
![Page 43: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/43.jpg)
10 floors, 4 elevator cars
STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls
ACTIONS: stop at, or go by, next floor
REWARDS: roughly, –1 per time step for each person waiting
Conservatively about 10²² states
Crites and Barto, 1996
Elevator Dispatching
![Page 44: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/44.jpg)
Issues in Reinforcement Learning
• Trade-off between exploration and exploitation: ε-greedy, softmax
• Algorithms to find the value function under delayed reward: Dynamic Programming, Monte Carlo, Temporal Difference
![Page 45: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/45.jpg)
n-Armed Bandit Problem
Slot Machine
[Figure: a slot machine with five levers, numbered 1–5]

A slot machine has several levers, and they do not pay equal rewards.
Suppose we play for a while and reach this summary:
lever 1: played 26 times, reward 4 baht/play
lever 2: played 14 times, reward 3 baht/play
lever 3: played 10 times, reward 2 baht/play
lever 4: played 16 times, reward 102 baht/play
![Page 46: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/46.jpg)
Exploration and Exploitation
Two problems arise:
• Should we keep trying lever 5?
• How accurate are the reward averages collected so far?

Exploitation means using what has been learned: keep playing lever 4.
Exploration means continuing to survey, trying the little-tested options more.
The two must be balanced. Always exploiting is called being greedy.
![Page 47: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/47.jpg)
ε-greedy Action Selection
Greedy: a_t = a_t* = argmax_a Q_t(a)
i.e., choose the action with the highest estimated payoff.

ε-greedy:
a_t = a_t* with probability 1 − ε
a_t = a random action with probability ε
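A minimal sketch of the rule in MATLAB, using the reward estimates from the slot-machine example (ε = 0.1 is my own choice):

```matlab
% epsilon-greedy action selection.
Q = [4 3 2 102 0];    % current reward estimates Qt(a) for 5 levers
epsilon = 0.1;

if rand < epsilon
    a = randi(length(Q));    % explore: pick a lever uniformly at random
else
    [~, a] = max(Q);         % exploit: pick the greedy lever (here, 4)
end
```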
![Page 48: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/48.jpg)
Test: 10-armed Bandit Problem
• n = 10 possible actions
• Each true action value is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1)
• Each reward is also normal: r_t ~ N(Q*(a_t), 1)
• 1000 plays
• repeat the whole thing 2000 times and average the results
![Page 49: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/49.jpg)
Results
![Page 50: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/50.jpg)
SoftMax Action Selection
SoftMax chooses the greedy action in proportion to the size of its reward estimate: the higher the estimated reward, the higher the probability of being chosen, computed from the Gibbs-Boltzmann distribution.

Choose action a on play t with probability

e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

where τ is the “computational temperature”.
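A minimal sketch of SoftMax selection (the estimates and τ = 10 are my own choices; subtracting max(Q) before exponentiating is a standard numerical safeguard):

```matlab
% SoftMax (Gibbs-Boltzmann) action selection.
Q   = [4 3 2 102 0];    % current reward estimates Qt(a)
tau = 10;               % computational temperature

p = exp((Q - max(Q)) / tau);        % unnormalized Boltzmann weights
p = p / sum(p);                     % selection probabilities
a = find(rand < cumsum(p), 1);      % sample one action from p
```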
![Page 51: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/51.jpg)
Algorithms to find the Value Function
• Incremental implementation
• Markov decision process (MDP)
• Value function characteristics
• Bellman's equation
• Solution methods
![Page 52: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/52.jpg)
Incremental Implementation
The value function is an average of the rewards received over many plays:

Q_k = (r_1 + r_2 + … + r_k) / k

Computed this way, Q must be recomputed every time a reward arrives, and all past rewards must be stored.

The incremental approach instead updates Q as each reward comes in:

Q_{k+1} = Q_k + (1 / (k+1)) [ r_{k+1} − Q_k ]

This is a common form for update rules:

NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
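A minimal sketch of the incremental average with a made-up reward stream; after the loop, Q equals the mean of the rewards without any of them having been stored:

```matlab
% Incremental estimate of the average reward (no past rewards stored).
Q = 0;  k = 0;                 % current estimate and sample count
rewards = [4 2 102 3 4];       % rewards as they arrive (made-up values)

for i = 1:length(rewards)
    k = k + 1;
    % NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    Q = Q + (1/k) * (rewards(i) - Q);
end
disp(Q)    % equals mean(rewards)
```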
![Page 53: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/53.jpg)
Policy
A policy is the way of choosing an action based on the state the agent is in.

Policy at step t, π_t:
a mapping from states to action probabilities
π_t(s, a) = probability that a_t = a when s_t = s

Goal: to maximize the total reward (how is that reward computed?).
![Page 54: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/54.jpg)
Types of Tasks
Episodic tasks can be divided into episodes, e.g., games or mazes:
R_t = r_{t+1} + r_{t+2} + … + r_T, where T is the terminal state.

Non-episodic tasks have no end point, so a discount method is used:
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
where γ, 0 ≤ γ ≤ 1, is the discount rate.
![Page 55: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/55.jpg)
Markov Decision Process
Using RL assumes that the problem is modeled as a Markov Decision Process (MDP), a model made up of four key parts:

State, Action, Reward, Transition

State = s
Action = a
Reward: R^a_{ss′} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ ] for all s, s′ ∈ S, a ∈ A(s)
Transition: P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } for all s, s′ ∈ S, a ∈ A(s)
![Page 56: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/56.jpg)
An MDP can be written as a state-transition diagram.
![Page 57: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/57.jpg)
Value Function

• The value of a state is the expected return starting from that state; it depends on the agent's policy:

State-value function for policy π:
V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

Action-value function for policy π:
Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
![Page 58: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/58.jpg)
Bellman Equation for a Policy
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
    = r_{t+1} + γ R_{t+1}

So: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]

Or, without the expectation operator:

V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

This computes the value function for a given policy π.
![Page 59: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/59.jpg)
Example: Grid World
Actions = {up, down, left, right}
From point A, any action moves the agent to A′ with reward +10.
From point B, any action moves the agent to B′ with reward +5.
Hitting the wall gives reward −1; every other move gives reward 0.

The value function can then be found as in figure (b), using γ = 0.9.
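A sketch of computing this value function by iterating the Bellman equation for the equiprobable random policy; I assume the standard 5×5 grid from Sutton & Barto, with A at row 1, column 2 (A′ at row 5, column 2) and B at row 1, column 4 (B′ at row 3, column 4):

```matlab
% Iterative policy evaluation on the 5x5 grid world, gamma = 0.9.
gamma = 0.9;  N = 5;
moves = [-1 0; 1 0; 0 -1; 0 1];      % up, down, left, right
V = zeros(N);

for sweep = 1:500                    % enough sweeps to converge
    Vnew = zeros(N);
    for r = 1:N
        for c = 1:N
            for m = 1:4              % random policy: each action prob 1/4
                if r == 1 && c == 2          % state A: any action -> A'
                    rew = 10; nr = 5; nc = 2;
                elseif r == 1 && c == 4      % state B: any action -> B'
                    rew = 5;  nr = 3; nc = 4;
                else
                    nr = r + moves(m, 1);  nc = c + moves(m, 2);
                    if nr < 1 || nr > N || nc < 1 || nc > N
                        rew = -1; nr = r; nc = c;   % wall: stay, reward -1
                    else
                        rew = 0;
                    end
                end
                Vnew(r, c) = Vnew(r, c) + 0.25 * (rew + gamma * V(nr, nc));
            end
        end
    end
    V = Vnew;
end
disp(V)    % V at A comes out largest, around 8.8
```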
![Page 60: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/60.jpg)
Optimal Value Function

• For finite MDPs, policies can be partially ordered:
  π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ S

• There is always at least one policy (possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.

• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s) for all s ∈ S

• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
![Page 61: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/61.jpg)
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
      = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]
![Page 62: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/62.jpg)
Bellman Optimality Equation for Q*
Q*(s, a) = E[ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a ]
         = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]
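Turning this equation into an algorithm gives value iteration: repeatedly apply the max-backup until the values stop changing. A generic sketch (the array convention P(s,a,s′), R(s,a,s′) and the random toy MDP are my own, for illustration only):

```matlab
% Value-iteration sketch from the Bellman optimality equation.
nS = 3;  nA = 2;  gamma = 0.9;
P = rand(nS, nA, nS);                    % toy transition model P(s,a,s')
P = P ./ repmat(sum(P, 3), [1 1 nS]);    % normalize over s'
R = rand(nS, nA, nS);                    % toy rewards R(s,a,s')

V = zeros(nS, 1);
for sweep = 1:200
    for s = 1:nS
        q = zeros(nA, 1);
        for a = 1:nA
            for s2 = 1:nS                % expected backup over s'
                q(a) = q(a) + P(s, a, s2) * (R(s, a, s2) + gamma * V(s2));
            end
        end
        V(s) = max(q);                   % Bellman optimality backup
    end
end
```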
![Page 63: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/63.jpg)
Why Optimal State-Value Functions are Useful?
Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, a one-step-ahead search produces the long-term optimal actions.
Example: Grid World
![Page 64: INC 551 Artificial Intelligence](https://reader035.fdocuments.net/reader035/viewer/2022081417/568136c6550346895d9e6332/html5/thumbnails/64.jpg)
The task of RL, then, is to find the optimal value function.