An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level...
![Page 1: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go](https://reader033.fdocuments.net/reader033/viewer/2022042302/5ecdcb4f0334f65af775958a/html5/thumbnails/1.jpg)
An Introduction to Reinforcement Learning and the AlphaZero AI
James Frost
Data Platform Director
Quorum
About the speaker
• James Frost
• Data Platform Director at Quorum, an Edinburgh-based IT consultancy
• Recently completed an MSc in Data Science at Dundee University
• Final-year project was to build a backgammon AI using techniques influenced by DeepMind's AlphaGo. This AI achieved human Grandmaster level.
Session agenda
• An introduction to reinforcement learning concepts
• Monte-Carlo learning
• Neural networks as function approximators
• Issues with reinforcement learning and Deep Neural Networks
• DeepMind and AlphaGo
What is reinforcement learning?
Types of Machine Learning
Supervised Learning
Reinforcement Learning
Unsupervised Learning
Supervised Learning:
Starts with a dataset of known examples. The engine then trains “by example”.
“A” is for Apple…
… not an apple!
Unsupervised Learning:
Starts with a dataset where the categories might not be known and looks for patterns / similarities / clusters which may be of interest.
For example, customer segmentation or fraud investigation
Reinforcement Learning:
Is based on an agent interacting with the environment, and getting feedback in the form of a reward mechanism.
How do dogs learn?
All training should be reward based. Giving your dog something they really like such as food, toys or praise when they show a particular behaviour means that they are more likely to do it again.
RSPCA Website
Principles of reinforcement learning
Agent
Environment
action A_t
reward R_t
state S_{t+1}
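The loop in the diagram can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, `random_agent` and `run_episode` are hypothetical names invented here, and the coin-guessing game is a stand-in for a real environment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, five times per episode."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state S_0 (here just the step counter)

    def step(self, action):
        coin = self.rng.choice([0, 1])
        reward = 1 if action == coin else -1  # reward R_t
        self.t += 1
        done = self.t >= 5
        return self.t, reward, done  # next state S_{t+1}, reward, end flag


def random_agent(state, rng):
    """A policy that ignores the state and guesses at random."""
    return rng.choice([0, 1])


def run_episode(env, agent, seed=0):
    """Agent picks action A_t, environment answers with R_t and S_{t+1}."""
    rng = random.Random(seed)
    state = env.reset()
    total, done = 0, False
    while not done:
        action = agent(state, rng)
        state, reward, done = env.step(action)
        total += reward
    return total


total_reward = run_episode(CoinFlipEnv(), random_agent)
print("total reward:", total_reward)
```

A learning agent would replace `random_agent` with something that uses the rewards it has seen to choose better actions.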
Rewards
• A reward R_t is a scalar feedback signal
• Indicates the value of carrying out step t
Money won or lost, e.g. poker or stock market
Win (+1) or Loss (-1)
Kill John Connor (+10,000)
Getting to destination (+100)
Falling over (-50)
Taking a step (-1)
Agent
The Agent generally has the following components:
• Model – the agent's representation of the environment
• Policy – how the agent behaves
• Value function – estimate of how good a state or action is
Model
• Environment state is the environment's private representation
• Often not visible to the agent
• Model is the agent's representation of the environment state, built up through observation
Value Functions
Almost all reinforcement learning algorithms involve estimating value functions that estimate how good it is for the agent to be in a given state
…the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values
(Sutton, 2017)
Value Functions / Policy - chess
A sample value function for chess might be the estimated chance of winning from that position.
Chess policy:
• From each state, calculate all legal moves
• Of the possible moves, pick the one that leads to the state with the highest value
• OR, from each state, pick the move with the highest action value
If the value estimates are exact, this greedy policy is the optimal policy.
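As a rough sketch, both variants of the policy are a one-line `max` over the legal moves. The function names and the toy number game below are hypothetical; a real chess version would need a move generator and a value function scored from the mover's point of view.

```python
def greedy_move(state, legal_moves, next_state, value):
    """Pick the move leading to the successor state with the highest value."""
    return max(legal_moves(state), key=lambda m: value(next_state(state, m)))


def greedy_action(state, legal_moves, action_value):
    """Pick the move with the highest action value directly."""
    return max(legal_moves(state), key=lambda m: action_value(state, m))


# Toy stand-in for a game: the state is a number and a move adds 1, 2 or 3.
toy_state_values = {1: 0.2, 2: 0.7, 3: 0.4}
toy_action_values = {(0, 1): 0.2, (0, 2): 0.7, (0, 3): 0.4}


def toy_legal_moves(state):
    return [1, 2, 3]


def toy_next_state(state, move):
    return state + move


best_by_state = greedy_move(0, toy_legal_moves, toy_next_state,
                            lambda s: toy_state_values[s])
best_by_action = greedy_action(0, toy_legal_moves,
                               lambda s, m: toy_action_values[(s, m)])
print(best_by_state, best_by_action)  # both pick move 2
```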
Principles of reinforcement learning
1. Accurately estimating the value function is critical for reinforcement learning
But how do we do that?
Technique 1 – Monte Carlo Learning
• Play a large number of games at random.
• Record how many times each state is seen, and how many games were won from that state (or action). This lets us know the estimated value function.
• This is the evaluation step for the random policy
X O
X
O
41% win rate from here
Action values State values
-0.41 -0.54 -0.41
-0.54 X -0.54
-0.41 -0.54 -0.41
0.32 0.19 0.32
0.19 0.48 0.19
0.32 0.19 0.32
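A minimal sketch of this evaluation step for tic-tac-toe, assuming both players move uniformly at random and that a state's value is simply the fraction of games through that state which X went on to win. The function names are made up for illustration, and the numbers it produces will not match the slide exactly.

```python
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_random_game(rng):
    """Play one random game; return the states visited and the winner."""
    board, visited, player = [" "] * 9, [], "X"
    while True:
        empty = [i for i, cell in enumerate(board) if cell == " "]
        if not empty:
            return visited, None  # board full: draw
        board[rng.choice(empty)] = player
        visited.append("".join(board))
        won = winner(board)
        if won:
            return visited, won
        player = "O" if player == "X" else "X"


def evaluate_random_policy(n_games, seed=0):
    """Estimate, for every state seen, the fraction of games X won from it."""
    rng = random.Random(seed)
    seen, x_wins = defaultdict(int), defaultdict(int)
    for _ in range(n_games):
        visited, won = play_random_game(rng)
        for state in visited:
            seen[state] += 1
            if won == "X":
                x_wins[state] += 1
    return {state: x_wins[state] / seen[state] for state in seen}


values = evaluate_random_policy(2000)
centre_opening = " " * 4 + "X" + " " * 4
print("X win rate after opening in the centre:", values[centre_opening])
```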
Now use Monte-Carlo Control
• Now play another batch of games, but this time act “greedily” with respect to the values estimated under the previous policy.
• Record how many times each state is seen, and how many games were won from that state
100% win rate from here
X O
X
O
-0.14 -0.82 -0.14
-0.82 X -0.82
-0.14 -0.82 -0.14
0.07 0.08 0.07
0.08 0.20 0.08
0.07 0.08 0.07
Don’t get too greedy…
• However, by only picking the best-looking moves (acting greedily) we can miss moves that would turn out to be better.
• So we need to introduce an element of exploration.
• ε-greedy learning works as follows:
• With probability (1-ε) make a greedy move
• With probability ε move at random
• ε is often reduced as the number of episodes increases. With a suitable schedule (for example ε = 1/k on episode k), tabular Monte-Carlo control converges to the optimal policy.
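The selection rule is short enough to sketch directly. The names `epsilon_greedy` and `decayed_epsilon` are illustrative, and the 1/k schedule is just one common choice rather than the only valid one.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """action_values maps each legal action to its current value estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))  # explore: random move
    return max(action_values, key=action_values.get)  # exploit: greedy move


def decayed_epsilon(episode):
    """Example schedule: epsilon = 1/k on episode k (counted from 1)."""
    return 1.0 / episode


estimates = {"corner": 0.32, "edge": 0.19, "centre": 0.48}
rng = random.Random(0)
picks = [epsilon_greedy(estimates, decayed_epsilon(k), rng) for k in range(1, 201)]
print("share of greedy picks:", picks.count("centre") / len(picks))
```

Early episodes are almost entirely random; later ones are almost entirely greedy.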
Learning cycle
• Policy evaluation / policy improvement is the core concept of reinforcement learning
• By acting greedily with respect to the value function we can create a new, improved policy.
• Iterating this process will trend towards the optimal policy
[Diagram: a cycle of policy evaluation and policy improvement converging on the optimal policy]
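To make the cycle concrete, here is a small sketch of Monte-Carlo control through self-play. It uses a game that is not in the talk, chosen only because it fits in a few lines: a pile of stones, each player removes 1 to 3, and whoever takes the last stone wins. Evaluation is the running average of game outcomes per (pile, move) pair; improvement is acting ε-greedily on those averages. All names and parameter values are assumptions.

```python
import random
from collections import defaultdict


def legal(stones):
    return [take for take in (1, 2, 3) if take <= stones]


def mc_control(start=10, episodes=20000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # average outcome for the player making the move
    n = defaultdict(int)    # visit counts
    for _ in range(episodes):
        stones, player, history = start, 0, []
        while stones > 0:
            moves = legal(stones)
            if rng.random() < epsilon:  # explore
                take = rng.choice(moves)
            else:                       # improve: act greedily on q
                take = max(moves, key=lambda t: q[(stones, t)])
            history.append((player, stones, take))
            stones -= take
            player = 1 - player
        game_winner = 1 - player  # whoever took the last stone
        for mover, pile, take in history:  # evaluate: update running averages
            outcome = 1.0 if mover == game_winner else -1.0
            n[(pile, take)] += 1
            q[(pile, take)] += (outcome - q[(pile, take)]) / n[(pile, take)]
    return q


def greedy_policy(q, max_stones):
    return {s: max(legal(s), key=lambda t: q[(s, t)])
            for s in range(1, max_stones + 1)}


q = mc_control()
policy = greedy_policy(q, 10)
print(policy)
```

With small piles the learned policy takes every remaining stone, which is the winning move; larger piles depend on how well the averages have settled.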
Problems with large state spaces
Neural Networks
Unfortunately, for most useful problems we can’t store a value for every possible state.
• Chess has roughly 10^47 states
• Go has roughly 10^170 states
• How many states would it take to record every possible scenario for a driverless car or Terminator robot?
So we need some form of function approximator.
Neural networks as value approximators
[Figure: neural network used as a value approximator, shown with numeric input values]
Monte-Carlo Learning and Neural Networks
• Same principles as tic-tac-toe
• Play a number of games at random
• Sample states (or state / action pairs) from the games, each paired with the reward the game eventually led to, discounted by the number of steps between that state and the end of the game
• Use these samples to feed into the neural network for training
• Now repeat the process, but instead of random play, use the neural network to predict the best moves, while still picking some moves at random (ε-greedy)
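The sampling and training steps might look roughly like this. Two assumptions are made loudly: the discount factor `gamma` is an invented parameter, and a linear model trained by gradient descent stands in for the neural network so the sketch stays dependency-free.

```python
def discounted_targets(states, final_reward, gamma=0.9):
    """Pair each state with the final reward, discounted once per step
    between that state and the end of the game."""
    last = len(states) - 1
    return [(s, final_reward * gamma ** (last - i)) for i, s in enumerate(states)]


def sgd_step(weights, features, target, lr=0.1):
    """One gradient step on squared error for a linear value estimate
    (a stand-in for a neural-network training step)."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = target - prediction
    return [w + lr * error * x for w, x in zip(weights, features)]


# One finished game: three states (as feature vectors), won at the end (+1).
game = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
samples = discounted_targets(game, final_reward=1.0, gamma=0.5)
# targets: 0.25 for the first state, 0.5 for the second, 1.0 for the last

weights = [0.0, 0.0]
for _ in range(200):
    for features, target in samples:
        weights = sgd_step(weights, features, target)
print(weights)
```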
Deep Neural Networks as Value Functions
Deep Neural Networks (DNNs) would appear to be a great candidate for a value function approximator. However, these can suffer from the following causes of instability:
• correlations present in the sequence of observations
• small updates to action-value estimates (Q) may significantly change the policy
Atari
• DeepMind paper from 2015
• Taught a deep neural network to play Atari games, such as Breakout and Kung-Fu Master
• Achieved above human level in over half of the games, and in some cases superhuman performance (e.g. Breakout)
• The network was trained using a technique based on Q-learning, but introduced two important concepts:
• Experience replay
• Target Q network
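For reference, the plain tabular Q-learning update that the Atari work builds on can be sketched as below. This is the textbook rule, not the paper's network training code; `alpha` (learning rate) and `gamma` (discount) are illustrative values.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.99):
    """Move Q(state, action) towards reward + gamma * max over next actions."""
    # A terminal next state has no actions, so its value counts as 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)
q_table[("s1", "right")] = 1.0
updated = q_learning_update(q_table, "s0", "right", 0.0, "s1",
                            ["left", "right"], alpha=0.5, gamma=0.9)
print(updated)  # 0.5 * (0.0 + 0.9 * 1.0) = 0.45
```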
Source: Human-level control through deep reinforcement learning. Nature, 518. https://doi.org/10.1038/nature14236
Experience Replay and target Q network
Agent
Environment
action A_t
reward R_t, state S_{t+1}
(S_t, A_t, R_t, S_{t+1}), (S_{t-1}, A_{t-1}, R_{t-1}, S_t), (S_{t-2}, A_{t-2}, R_{t-2}, S_{t-1}), (S_{t-3}, A_{t-3}, R_{t-3}, S_{t-2}), (S_{t-4}, A_{t-4}, R_{t-4}, S_{t-3}), (S_{t-5}, A_{t-5}, R_{t-5}, S_{t-4})
Replay Buffer
Random samples to learn new policy
Agent v2
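A minimal sketch of the two ideas in the diagram, using only the standard library. `ReplayBuffer` and `sync_target` are hypothetical names, the weights are plain lists rather than a real network, and the capacity and sync interval are arbitrary.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (S_t, A_t, R_t, S_{t+1}) transitions."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall out
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks up the correlation between
        # consecutive observations.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def sync_target(online_weights, target_weights, step, every=1000):
    """Copy the online weights into the target network every `every` steps,
    so the learning targets stay fixed in between."""
    if step % every == 0:
        target_weights[:] = online_weights
    return target_weights


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, "noop", 0.0, t + 1)
batch = buffer.sample(2)
print(len(buffer), batch)
```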
Effect of Target Q and Experience Replay
AlphaGo
AlphaGo
• Initially trained on a database of expert human games
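• Then improved through self-play
• After months of training beat Lee Sedol
• AlphaGo Zero was trained from completely random play, with no human games
• After 36 hours of training AlphaGo Zero had overtaken AlphaGo Lee (the version that beat Lee Sedol), and after 72 hours it beat AlphaGo Lee 100-0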
“Each time you put knowledge into a system you are actually handicapping it”
David Silver (Silver, 2015)
AlphaGo Zero
DeepMind’s AlphaGo Zero project applied the following principles:
• uses only the raw board position as input features
• uses a simple Monte-Carlo Tree Search (MCTS) to evaluate positions and sample moves
• uses a residual neural network architecture
• uses a single dual-headed network, with one head for the policy (move probabilities) and one for the value (expected outcome)
Monte-Carlo tree search
Source: Mastering the game of Go without human knowledge. https://doi.org/10.1038/nature24270
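The selection step of the search can be sketched as follows. It is a simplified reading of the rule described in the paper cited above: each child move is scored by its average value so far plus an exploration bonus that grows with the network's prior and shrinks with the visit count. The dictionary layout and the `c_puct` value are assumptions made for illustration.

```python
import math


def puct_select(children, c_puct=1.5):
    """children maps each move to {"N": visit count, "W": total value,
    "P": prior probability from the policy head}."""
    total_visits = sum(child["N"] for child in children.values())

    def score(child):
        q = child["W"] / child["N"] if child["N"] > 0 else 0.0
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1 + child["N"])
        return q + u

    return max(children, key=lambda move: score(children[move]))


# An unvisited move with a decent prior gets tried before a well-explored one.
early = {"a": {"N": 10, "W": 5.0, "P": 0.5},
         "b": {"N": 0, "W": 0.0, "P": 0.5}}
# Once both are well explored, the higher average value wins.
late = {"a": {"N": 100, "W": 90.0, "P": 0.5},
        "b": {"N": 100, "W": 10.0, "P": 0.5}}
print(puct_select(early), puct_select(late))
```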
AlphaZero Success
• Within 4 hours had mastered chess from first principles, beating Stockfish, then the world's strongest chess engine.
• Within 2 hours had beaten Elmo at shogi.
• (self-play games were generated on 5,000 first-generation TPUs)
What is most important?
Data
OR
Algorithm
Reinforcement learning concepts summary
1. Value estimation is critical to the success of a reinforcement learning algorithm.
2. Monte-Carlo learning is a relatively simple technique to get started with and can be applied to a wide range of problems
3. Balancing exploration vs exploitation is critical
4. Deep neural networks can be unstable with reinforcement learning. Deep Q Networks with Experience Replay can help stabilise this.
Where next
• The reinforcement learning community is now focussing on games with imperfect information and much deeper strategies:
• No-limit Texas Hold’em poker – DeepStack
• StarCraft II – AlphaStar
• Dota 2 – OpenAI Five
• OpenSpiel
• Ultimately the aim is to apply deep learning to real-world scenarios:
• Energy efficiency
• Self-driving cars
• Protein folding
• Medical diagnosis
• Materials research
• Artificial General Intelligence
References / Useful links
Netflix AlphaGo documentary
Silver, D. (2015) UCL course on RL.
Sutton, R. S. and Barto, A. G. (2017) Reinforcement Learning: An Introduction.
OpenSpiel - https://github.com/deepmind/open_spiel
DeepMind (2015) Human-level control through deep reinforcement learning. Nature, 518.
Deepmind (2017) Mastering the game of Go without human knowledge
Deepmind (2018) AlphaZero: Shedding new light on the grand games of chess, shogi and Go, deepmind.com.
Quorum Confidential
Thank You
Quorum Network Resources Ltd
18 Greenside Lane, Edinburgh, EH1 3AH
www.qnrl.com
Reg. No. SC 196645, Registered Office: 18 Greenside Lane, Edinburgh, EH1 3AH