Reinforcement Learning
1. Introduction
University of Edinburgh, School of Informatics
13/01/2015
Admin: Contact
Lecturer: Michael Herrmann
IPAB, School of Informatics
[email protected]
Informatics Forum 1.42, 651 7177
Tutors: Mihai ([email protected])
        Thomas ([email protected])
13/01/2015 Michael Herrmann RL 1
Admin: Time split
Lectures (≤20h): Tuesday and Friday 12:10 - 13:00 (AT, LT2)
Assessment: Homework and Exam = 10% + 10% + 80%
HW1 (10h): out by 5 February, due 26 February
(possible topic: Q-learning: a learning agent in a box-world)
HW2 (10h): out by 5 March, due 26 March
(possible topic: continuous-space RL)
Reading/Self-study/Preparation of tutorial problems (32h)
Tutorials (8h), starting in week 3
Revision (20h)
Admin: Tutorials
Mondays or Thursdays, 13:10 - 14:00, in AT 4.12
Topics, tentatively:
T1 [Bandit problems] – week of 26th Jan
T2 [Q-learning] – week of 2nd Feb
T3 [MC methods] – week of 9th Feb
T4 [TD methods] – week of 16th Feb
T5 [POMDP] – week of 23rd Feb
T6 [continuous RL] – week of 2nd Mar
T7 [practical aspects] – week of 9th Mar
T8 [revision] – week of 16th Mar
We’ll assign questions (a combination of pen&paper and computational exercises – you attempt them before the sessions) and give an opportunity to run some simulations
The tutor will discuss and clarify the concepts underlying the exercises
Tutorials are not assessed; gain feedback from participation
Admin: Resources
Webpage: www.informatics.ed.ac.uk/teaching/courses/rl
Lecture slides will be uploaded as they become available
Main Readings:
1 R. Sutton and A. Barto, Reinforcement Learning, MIT Press,1998, webdocs.cs.ualberta.ca/~sutton/book/the-book.html
2 Csaba Szepesvari: Algorithms for Reinforcement Learning,Morgan & Claypool, 2010.
3 S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics, MITPress, 2006 (Chapters 14 – 16)
4 Reinforcement Learning Warehousehttp://reinforcementlearning.ai-depot.com/
5 Wikipedia (see also external links), Scholarpedia
6 Research papers (later)
Background: mathematics, statistics, machine learning, matlab, ...
Reinforcement Learning
What does “RL” mean (in Machine Learning)?
Learning given only percepts (states) and occasional rewards (or punishments)
Generation and evaluation of a policy, i.e. a mapping from states to actions
A form of active learning
Neither really supervised nor unsupervised
A microcosm for the entire AI problem
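A policy as "a mapping from states to actions" can be made concrete in a few lines; the states and actions below are invented for illustration, not taken from the lecture:

```python
# Minimal sketch: a policy as a mapping from states to actions.
# The state names and actions here are illustrative only.

policy = {
    "start": "right",
    "corridor": "right",
    "junction": "up",
    "near_goal": "up",
}

def act(state):
    """Look up the action the current policy prescribes for a state."""
    return policy[state]

print(act("junction"))  # up
```

RL is then about *generating and evaluating* such a mapping from rewards alone, rather than writing it down by hand as done here.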
“The use of punishments and rewards can at best be a part of theteaching process” (A. Turing)
Russell and Norvig: AI, Ch.21
What else is RL?
I. Psychology: Operant conditioning
Example: Whenever a rat presses a button, it gets a treat. If the rat starts pressing the button more often, the treat serves to positively reinforce this behaviour.
Example: Whenever a rat moves to a certain area of the cage, it gets a mild electric shock. If the rat starts moving to this area less frequently, the shock serves as a (positive) punishment.
en.wikipedia.org/wiki/Operant_conditioning
en.wikipedia.org/wiki/Reinforcement
Example: Shaping
Shaping is reinforcement of successive approximations to a desired instrumental response. In training a rat to press a lever, for example, simply turning toward the lever is reinforced at first. Then, only turning and stepping toward it is reinforced. The outcomes of one set of behaviours start the shaping process for the next set, and the outcomes of that set prepare the shaping process for the set after that, and so on. As training progresses, the reinforced response becomes progressively more like the desired behaviour; each subsequent behaviour is a closer approximation of the final behaviour.
en.wikipedia.org/wiki/Reinforcement#Shaping
II. Brain science: Dopamine
An “important effect of dopamine is as a ’teaching’ signal. When a motor response is followed by an increase in dopamine activity, the basal ganglia circuit is altered in a way that makes the same response easier to evoke when similar situations arise in the future. This is a form of operant conditioning, in which dopamine plays the role of a reward signal.”
en.wikipedia.org/wiki/Dopamine#The_substantia_nigra_dopamine_system_and_motor_control
What can we learn from RL in the brain?
A system that solves complex problems may have to be complex itself.
“Until recently, rather little has been done to find out how animals behave, whether in the wild or captivity, when they have nothing particular to do.” D. E. Berlyne, Conflict, Arousal, and Curiosity, McGraw-Hill, 1960.
⇒ Intrinsically-motivated RL (A. Barto)
[Figure: time course of the reward and value processes]
M. Häggström, A. Gillies and P. J. Lynch
What can we learn from RL in the brain?
Functional Connectivity in the Brain
O. Sporns, Nature Neuroscience (2014)
III. Applications in Management, Education, Social science
B. F. Skinner’s radical behaviorism: All learning results fromreinforcement by repeated expected consequences.
Reactions (some random quotes)
“The concept and use of rewards has been misunderstood for centuries!” (R. H. Richardson)
“Negative reinforcement has its place in good management as well.” (Neil Kokemuller)
“If a leader chooses to rely heavily on rewards and punishments to meet his or her objectives, the leader must: determine what is “good” and “bad,” ... Although that might be possible, the best outcome that could be hoped for would be mediocrity.” (William Stinnett)
“How to discipline without stress, punishment or rewards.” “Education is about motivation.” (Marvin Marshall)
“You can’t motivate the unmotivated.” (Dan Gould)
Conclusion: Start with simple questions and simple systems.
Arthur Samuel (1959): Computer Checkers
Search tree: board positions reachable from the current state; paths are followed as indicated by a scoring function
Scoring function: based on the position of the board at any given time; tries to measure the chance of winning for each side in the given position
The program chooses its move based on a minimax strategy
Self-improvement: remembering every position it had already seen, along with the terminal value of the reward function; it played thousands of games against itself as another way of learning
First program to play any board game at a relatively high level
The earliest successful machine learning research
wikipedia and Russell and Norvig: AI, Ch.21
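The minimax idea above can be sketched on a toy game tree (the tree and its scores are invented for illustration; this is not a checkers engine):

```python
# Minimal sketch of minimax with a scoring function, in the spirit of
# Samuel's checkers player. The "game" is a toy: a state is either a
# terminal score (a number) or a list of successor states.

def minimax(node, maximising):
    """Score a game tree whose leaves are scoring-function values."""
    if isinstance(node, (int, float)):
        return node  # leaf: value of the scoring function
    scores = [minimax(child, not maximising) for child in node]
    return max(scores) if maximising else min(scores)

# The maximiser picks the branch whose worst-case (opponent-minimised)
# outcome is best: here the left branch guarantees a score of 3.
tree = [[3, 5], [1, 9]]
print(minimax(tree, True))  # 3
```

Samuel's self-improvement step then adjusted the scoring function itself from the outcomes of self-play, which is what makes this learning rather than pure search.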
More history
SNARC: Stochastic Neural Analog Reinforcement Calculator (M. Minsky, 1951)
A. Samuel (1959): Computer Checkers
B. Widrow and T. Hoff (1960) adapted D. O. Hebb’s neural learning rule (1949) for RL: the delta rule
Cart-pole problem (D. Michie and R. A. Chambers, 1968)
Relation between RL and MDPs (P. Werbos, 1977)
A. Barto, R. Sutton, P. Brouwer (1981): Associative RL
Q-learning (C. J. C. H. Watkins, 1989)
Russell and Norvig: AI, Ch.21
Aspects of RL (outlook)
MAB, MDP, DP, MC, SARSA, TD(λ), SMDP, POMDP, ...
Exploration
Active learning and machine learning
Structure of state and action spaces
Continuous domains: function approximation
Complexity, optimality, efficiency, numerics
[psychology, neuroscience]
Generic Examples
Motor learning in young animals: no teacher; sensorimotor connection to the environment
Language acquisition
Learning to
  drive a car
  hold a conversation
  cook
  play games: backgammon, checkers, chess
  play a musical instrument
Applications
Problem solving: find a policy (states → actions) that maximises reward
Scheduling, control, operations research, HCI, economics
Practical approach to the problem
Many ways to approach these problems
Unifying perspective: stochastic optimisation over time
Given:
  an environment to interact with
  a goal
Formulate a cost (or reward)
Objective: maximise rewards over time
The catch: the reward signal is not always available, but the optimisation is over time (selecting entire paths)
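A small sketch of what "maximise rewards over time" means, assuming the standard discounted-return formulation (introduced properly later in the course); the reward sequences are invented for illustration:

```python
# Sketch: "maximise rewards over time" is often formalised as the
# discounted return G = sum_t gamma**t * r_t along an entire path.
# The reward sequences below are illustrative values only.

def discounted_return(rewards, gamma=0.9):
    """Accumulate rewards along a trajectory, discounting later ones."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A path that pays off only at the end can still beat one with small
# immediate rewards -- this is why entire paths are compared.
patient = [0.0, 0.0, 0.0, 10.0]  # reward only in the final step
greedy = [1.0, 1.0, 1.0, 1.0]    # small reward every step

print(discounted_return(patient))  # 10 * 0.9**3 = 7.29
print(discounted_return(greedy))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```

This is the sense in which the optimisation is "over time": a single scalar objective scores a whole path, not each action in isolation.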
Example: Inventory Control
Objective: Minimise total inventory cost
Decisions:
How much to order?When to order?
Components of Total Cost
1 Cost of items
2 Cost of ordering
3 Cost of carrying or holding inventory
4 Cost of stockouts
5 Cost of safety stock (extra inventory held to help avoid stockouts)
The Economic Order Quantity Model - How Much to Order?
Assumptions
1 Demand is known and constant
2 Lead time is known and constant
3 Receipt of inventory is instantaneous
4 Quantity discounts are not available
5 Variable costs are limited to: ordering cost and carrying (or holding) cost
6 If orders are placed at the right time, stockouts can be avoided
Inventory Level Over Time Based on EOQ Assumptions
Economic order quantity, Ford W. Harris, 1913
EOQ Model Total Cost
At the optimal order quantity (Q*): (Carrying cost)′ = (Ordering cost)′

    Q* = √(2 D C_o / C_h)

D: demand; C_o: ordering cost; C_h: carrying (holding) cost
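A small sketch of the EOQ formula above; the demand and cost figures are invented for illustration:

```python
# Sketch of the EOQ formula Q* = sqrt(2 * D * Co / Ch) from the slide.
# The demand and cost figures below are made-up illustrative numbers.
import math

def eoq(demand, ordering_cost, holding_cost):
    """Economic order quantity minimising carrying + ordering cost."""
    return math.sqrt(2 * demand * ordering_cost / holding_cost)

D, Co, Ch = 1000.0, 50.0, 2.5  # units/year, cost/order, cost/unit/year
q_star = eoq(D, Co, Ch)
print(q_star)  # sqrt(2 * 1000 * 50 / 2.5) = 200.0

# At Q*, annual ordering cost (D/Q * Co) equals annual carrying cost
# (Q/2 * Ch) -- the derivative condition stated on the slide.
print(D / q_star * Co, q_star / 2 * Ch)  # both equal 250.0
```

The equality of the two cost terms at Q* is exactly the (Carrying cost)′ = (Ordering cost)′ condition from the slide.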
Realistically, how much to order?
What if these assumptions didn’t hold?

Demand is known and constant
Lead time (latency) is known and constant
Receipt of inventory is instantaneous
Quantity discounts are not available
Variable costs are limited to: ordering cost and carrying (or holding) cost
If orders are placed at the right time, stockouts can be avoided

The result may require a more detailed stochastic optimisation.
Properties of RL learning tasks
Does not assume that you know a model of the environment
Associativity: the value of an action depends on the state
Active learning: the environment’s response affects our subsequent actions and thus future responses
Delayed reward: we find out the effects of our actions later
Credit assignment problem: upon receiving a reward, which actions were responsible for it?
Conclusions
Stochastic optimisation – make decisions! Over time; it may not be immediately obvious how we’re doing
Some notion of cost/reward is implicit in the problem – defining this, and the constraints on defining it, is key!
Often we may need to work with models that can only generate sample traces from experiments
Summary: The Setup for RL
Agent is:
Temporally situated
Continual learning and planning
Objective is to affect the environment by state-dependent actions
Environment is uncertain, stochastic
Key Features of RL
Learner is not told which actions to take
Trial-and-Error search
Possibility of delayed reward
Sacrifice short-term gains for greater long-term gains
The need to explore and to exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment
What is Reinforcement Learning?
A paradigm in Artificial Intelligence
Goal-oriented learning (maximise reward!)
Learning from interaction
Learning about, from, and while interacting with an external environment
Learning what to do — how to map situations to actions — so as to maximise a numerical reward signal
Can be thought of as stochastic optimisation over time
Relation to Metaheuristic Optimisation (MO)
“Natural computing” includes algorithms such as GA, ES, DE, PSO, or ACO
For example, genetic algorithms (GA) maximise a fitness function by selecting, mutating and reproducing certain individuals in a population
The individuals are described as strings over an alphabet, e.g. {G,A,T,C}
The gradient of the fitness function is not known
The population is a sample of points in the space of all strings
Difference between MO and RL
                 RL                              MO
representation   usually a single agent          population
                 that must be restarted
learning steps   reward is received after        fitness is evaluated in
                 each action                     each generation
dynamics         Markovian state transitions     population moves in
                                                 configuration space
global search    trial and error guided by       trial and error, may be
                 earlier experiences             guided by correlations
local search     policy gradient                 hill climbing
Both are stochastic optimisation algorithms. Both can find their respective global optimum under certain (quite artificial) conditions.
Note that many combinations are possible, e.g. collective RL, or MO of the hardware of a robot that learns by RL.
A bit of Maths
Geometric series:
    s_n = ∑_{i=0}^{n} p^i

Sum of an infinite geometric series (for |p| < 1):
    lim_{n→∞} s_n = 1 / (1 − p)

Averages: arithmetic mean, mathematical expectation, median, mode

Moving average:
    ā_{i₀} = (1/N) ∑_{i=0}^{N−1} a_{i₀−i}

Weighted (moving) average:
    ā_{i₀} = (1 / ∑_{i=0}^{N−1} α_i) ∑_{i=0}^{N−1} α_i a_{i₀−i}

Exponentially weighted average:
    ā_{i₀} = (1 − α) ∑_{i=0}^{∞} α^i a_{i₀−i}
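The moving and exponentially weighted averages above can be sketched directly; the data values are invented for illustration:

```python
# Sketch of the averages on this slide, using made-up sample data.
# For an infinite history the exponential weights sum to 1 by the
# geometric series: (1 - a) * sum_i a**i = 1 for |a| < 1.

def moving_average(xs, N):
    """Plain moving average over the last N values of xs."""
    return sum(xs[-N:]) / N

def exp_weighted_average(xs, alpha):
    """(1 - alpha) * sum_i alpha**i * x_{t-i}, newest value first."""
    return (1 - alpha) * sum(
        alpha**i * x for i, x in enumerate(reversed(xs))
    )

data = [2.0, 4.0, 6.0, 8.0]
print(moving_average(data, 2))          # (6 + 8) / 2 = 7.0
print(exp_weighted_average(data, 0.5))  # 0.5*8 + 0.25*6 + 0.125*4 + 0.0625*2
```

The exponentially weighted form needs no stored history, which is one reason closely related update rules appear throughout RL.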