Reinforcement Learning - 4. Model-free reinforcement Learning
Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement...
Transcript of Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement...
![Page 1: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/1.jpg)
Markov Models andReinforcement Learning
Stephen G. Ware
CSCI 4525 / 5525
![Page 2: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/2.jpg)
Camera Vacuum World (CVW)
• 2 discrete rooms with cameras that detect dirt.
• A mobile robot with a vacuum.
• The goal is to ensure both rooms are clean.
A B
![Page 3: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/3.jpg)
Camera Vacuum World
Observable:
Agents:
Deterministic:
Episodic:
Static:
Discrete:
Fully
Single
Deterministic
Episodic
Static
Discrete
A B
![Page 4: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/4.jpg)
CVW State Space
A B
A B
A B ……
……
……
……
Suck
Left
Right
![Page 5: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/5.jpg)
Deterministic Process
A deterministic process describes how the world transitions from one state to another by taking actions. It can be described as a graph whose nodes are states and whose edges are actions.
Most state spaces that we have considered so far (e.g. CVW) are deterministic processes because there is no uncertainty about the outcomes of an action.
![Page 6: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/6.jpg)
Making Decisions with a Deterministic Process
A decision process is a process in which some state has been labeled the start state, some states have been labeled as goal states, and a rational agent is expected to find a way to reach the goal from the start.
We call a solution to a process a policy. A policy is a function which, for any given current state, specifies which action to take next.
![Page 7: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/7.jpg)
CVW Policy
𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐴, 𝐴 𝑑𝑖𝑟𝑡𝑦, 𝐵 𝑑𝑖𝑟𝑡𝑦 = 𝑆𝑢𝑐𝑘𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐴, 𝐴 𝑐𝑙𝑒𝑎𝑛, 𝐵 𝑑𝑖𝑟𝑡𝑦 = 𝑅𝑖𝑔ℎ𝑡𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐵, 𝐴 𝑑𝑖𝑟𝑡𝑦, 𝐵 𝑐𝑙𝑒𝑎𝑛 = 𝐿𝑒𝑓𝑡
… for every possible state
![Page 8: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/8.jpg)
Finding Deterministic Policies
Just as many problems considered so far can be modeled using a deterministic processes, many search-based solutions we have considered are ways of finding a policy.
In general, the problem of solving a deterministic decision process is equivalent to classical planning.
![Page 9: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/9.jpg)
CVW
• 2 discrete rooms with cameras that detect dirt.
• A mobile robot with a vacuum.
• The goal is to ensure both rooms are clean.
A B
![Page 10: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/10.jpg)
Stochastic CVW (SCVW)
• 2 discrete rooms with cameras that detect dirt.
• A mobile robot with a vacuum that fails to clean 10% of the time.
• The goal is to ensure both rooms are clean.
A B
![Page 11: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/11.jpg)
CVW
Observable:
Agents:
Deterministic:
Episodic:
Static:
Discrete:
Fully
Single
Deterministic
Episodic
Static
Discrete
A B
![Page 12: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/12.jpg)
SCVW
Observable:
Agents:
Deterministic:
Episodic:
Static:
Discrete:
Fully
Single
Stochastic
Episodic
Static
Discrete
A B
![Page 13: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/13.jpg)
CVW State Space
A B
A B
A B ……
……
……
……
Suck
Right
Left
![Page 14: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/14.jpg)
SCVW State Space
A B
A B
A B ……
……
……
……
Suck 10%Suck 90%
Right 100%
Left 100%
![Page 15: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/15.jpg)
Stochastic Process
A stochastic process describes how the world transitions from one state to another by taking actions with uncertain effects. It can be described as a graph whose nodes are states and whose edges are actions (annotated with the probability the action will occur in that way).
SCVW is a stochastic processes because the outcome of an action cannot be known in advance (but once taken, the outcome is known).
![Page 16: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/16.jpg)
Markov Process (Markov Chain)
When the probability of transitioning to a next state depends only on the previous state and the action taken, we say a stochastic process has the Markov Property.
This property is named for Andrey Markov, a famous mathematician who studied stochastic processes.
![Page 17: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/17.jpg)
Markov Process (Markov Chain)
In other words, we don’t need to know all the past states you have been in or all the past actions you have taken.
We only need to know your current state and the action you intend to take. From that, we can predict (stochastically) which next state you will be in after taking the action.
![Page 18: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/18.jpg)
Markov Chains
Markov chains can be used to model simple real world processes.
One common application is text mining. Each node in the graph represents a word (𝑤) that appears in the text. An edge leads from 𝑤1 to 𝑤2 if, somewhere in the corpus, we see 𝑤1 followed by 𝑤2. The probability of an edge represents the percentage of times 𝑤2 is seen to come after 𝑤1.
![Page 19: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/19.jpg)
King James ProgrammingA bot trained on two corpuses, The King James Bible and Structure and Interpretation of Computer Programs, that generates random saying such as:
“Jesus saith unto them, Ye know that the relationship between Fahrenheit and Celsius temperatures is Such a constraint.”
“By running the test with more and more in knowledge and in all things approving ourselves as the ministers of the LORD, and they provoked him to jealousy with that which is good. He that doeth good is of God: but the calf of the sin offering,
and the other registers that need to be immediately sworn and notarized.”
![Page 20: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/20.jpg)
Markov Decision Process (MDP)
A Markov Decision Process is a decision process based on a Markov chain.
An agent cannot always predict the result of an action. Thus, any policy for solving an MDP must account for all states that an agent might accidentally end up in.
This can be thought of a classical planning but where things sometimes go wrong.
![Page 21: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/21.jpg)
SCVW Policy
𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐴, 𝐴 𝑑𝑖𝑟𝑡𝑦, 𝐵 𝑑𝑖𝑟𝑡𝑦 = 𝑆𝑢𝑐𝑘𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐴, 𝐴 𝑐𝑙𝑒𝑎𝑛, 𝐵 𝑑𝑖𝑟𝑡𝑦 = 𝑅𝑖𝑔ℎ𝑡𝜋 𝑅𝑜𝑏𝑜𝑡 𝑎𝑡 𝐵, 𝐴 𝑑𝑖𝑟𝑡𝑦, 𝐵 𝑐𝑙𝑒𝑎𝑛 = 𝐿𝑒𝑓𝑡
… for every possible state
![Page 22: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/22.jpg)
SCVW
• 2 discrete rooms with cameras that detect dirt.
• A mobile robot with a vacuum that fails to clean 10% of the time.
• The goal is to ensure both rooms are clean.
A B
![Page 23: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/23.jpg)
Hidden SCVW (HSCVW)
• 2 discrete rooms with cameras that detect dirt 95% of the time dirt is present.
• A mobile robot with a vacuum that fails to clean 10% of the time.
• The goal is to ensure both rooms are clean.
A B
![Page 24: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/24.jpg)
SCVW
Observable:
Agents:
Deterministic:
Episodic:
Static:
Discrete:
Fully
Single
Stochastic
Episodic
Static
Discrete
A B
![Page 25: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/25.jpg)
HSCVW
Observable:
Agents:
Deterministic:
Episodic:
Static:
Discrete:
Partially
Single
Stochastic
Episodic
Static
Discrete
A B
![Page 26: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/26.jpg)
HSCVW State Space
A B
Camera A reports dirt.Camera B reports dirt.
A B A B
A B
The robot is in room A.
0.25%90.25%
4.75% 4.75%
![Page 27: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/27.jpg)
Hidden Markov Model (HMM)
A Hidden Markov Model is a process which is assumed to operate according to a Markov process, but whose state is not directly observable.
Based on whatever observations are available, an agent must maintain a probability distribution of possible current states and their likelihoods.
![Page 28: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/28.jpg)
Partially Observable Markov Decision Process (POMDP)
A Partially Observable Markov Decision Process is a decision process based on a hidden Markov model.
An agent does not know the actual state of the world, but can guess it based on observations. It must choose a policy which is expected to maximize the chance of reaching a solution.
![Page 29: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/29.jpg)
Markov Processes
How an Agent Makes Decisions in that World
Planning
Markov Decision Process
Partially Observable Markov Decision Process
How the World Works
Deterministic Process →
Markov Chain →
Hidden Markov Model →
![Page 30: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/30.jpg)
Reinforcement Learning
Reinforcement Learning is a kind of machine learning in which labeled data is not available, but for which periodic feedback (in the form of rewards and punishments) is available.
An agent takes actions, observes its reward or punishment, and eventually learns which actions lead to success and which lead to failure.
For example, considering training a dog.
![Page 31: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/31.jpg)
RL as Supervised Learning
Reinforcement learning can be thought of as a kind of supervised learning in which the agent must generate its own labeled data.
![Page 32: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/32.jpg)
RL and POMDP’s
Most reinforcement learning algorithms are designed to learn optimal policies for MDP's and POMDP's.• The agent has observations which tell it which
states it might be in (and how likely). It assumes the process is a Markov process.
• The agent takes actions and observes rewards or punishments.
• The agent eventually learns how to behave in the environment so as to maximize its reward.
![Page 33: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/33.jpg)
Exploration vs. Exploitation
When an agent takes actions whose outcomes are relatively unknown, it is called exploration.
When an agent takes actions whose outcomes are known to produce high rewards, it is called exploitation.
One of the central problems in RL is striking a balance between exploration (learning new information) and exploitation (using known information).
![Page 34: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/34.jpg)
𝑛-Armed Bandits Problem
Imagine you have $1000 that you intend to spend on a row of slot machines (one-armed bandits) in Las Vegas. How can you spend that money so as to maximize the amount of money you have left at the end of the day?
![Page 35: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/35.jpg)
𝑛-Armed Bandits Problem
We assume each machine is controlled by a Markov process.
Win Lose1% 45%
55%
99%
![Page 36: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/36.jpg)
𝑛-Armed Bandits Problem
But we can’t see it; we can only observe the results (wins or losses), so it is a hidden Markov model.
![Page 37: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/37.jpg)
𝑛-Armed Bandits Problem
Our task is to find an optimal policy—that is, to solve a partially observable Markov decision process.
![Page 38: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/38.jpg)
𝑛-Armed Bandits Problem
Say you have played Machine A 20 times. You have won 10 times and lost 10 times. You have played Machine B 1 time and lost. Is it logical to conclude that you should never play Machine B because, so far, you have lost 100% of the time?
![Page 39: Markov Models and Reinforcement Learning · 2019-05-05 · Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but](https://reader030.fdocuments.net/reader030/viewer/2022040216/5f0b825e7e708231d430deb4/html5/thumbnails/39.jpg)
𝑛-Armed Bandits Problem
No. You are exploiting Machine A too early without exploring Machine B enough. Machine B might have a better win/loss ratio than A. You need to gather more information.