Post on 13-Dec-2015
REINFORCEMENT LEARNING
04/18/23 1
Group 11
Ashish Meena 04005006
Rohitashwa Bhotica 04005010
Hansraj Choudhary 04d05005
Piyush Kedia 04d05009
Outline
- Introduction
- Learning Models
- Motivation
- Reinforcement Learning Framework
- Q-Learning Algorithm
- Applications
- Summary
Introduction
- Machine Learning
  ◦ Construction of programs that automatically improve with experience.
- Types of Learning
  ◦ Supervised Learning
  ◦ Unsupervised Learning
  ◦ Reinforcement Learning
Supervised Learning
- Training data: (X, Y), i.e. (features, label).
- Predict Y, minimizing some loss.
- Regression, Classification.
- Example
  ◦ Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
Unsupervised Learning
- Training data: X (features only).
- Find "similar" points in the high-dimensional X-space.
- Clustering.
- Example
  ◦ From DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
Reinforcement Learning
- Training data: (S, A, R), i.e. (State, Action, Reward).
- Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
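[Figure: Reinforcement Learning. The agent, following its policy, sends an action to the environment; the environment returns the new state and a reward. The interaction produces the sequence s0 --(a0, r0)--> s1 --(a1, r1)--> s2 --(a2, r2)--> ...]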
Motivation
- Supervised and unsupervised learning fail in many situations.
- Example
  ◦ FLIGHT CONTROL SYSTEMS: given the set of all sensor readings at a given time, decide how the flight controls should function.
  ◦ With supervised learning, the labels for many feature vectors (sensor readings) are unknown.
  ◦ By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.
Learning by Interaction
• Learning to ride a bicycle
• Actions:
  • Turn handle bars RIGHT.
  • Turn handle bars LEFT.
• Rewards:
  • Positive if the cycle is perfectly balanced.
  • Negative if the angle between the cycle and the ground decreases.
Inspiration from Natural Learning
- Dopamine
  ◦ Neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates.
  ◦ Regulates and controls behavior by inducing pleasurable effects equivalent to rewards.
  ◦ Motivates us to proactively perform certain activities.
Markov Decision Processes
- A discrete set of environment states S.
- A discrete set of agent actions A.
- At each discrete time step t, the agent observes state $s_t \in S$ and chooses action $a_t \in A$.
- It receives reward $r_t$.
- The state changes to $s_{t+1}$.
- Markov assumptions
  ◦ $s_{t+1} = \delta(s_t, a_t)$
  ◦ $r_t = r(s_t, a_t)$
  ◦ That is, the next state and the reward depend only on the current state and action.
  ◦ δ and r are usually unknown to the learner.
  ◦ δ and r may also be non-deterministic.
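The definitions above can be sketched in Python for a deterministic case. The three-state corridor, the state and action names, and the reward of 100 are assumptions chosen for illustration; they are not taken from the slides.

```python
# Hypothetical three-state corridor: s0 - s1 - G, where G is the goal.
STATES = ["s0", "s1", "G"]
ACTIONS = ["left", "right"]

# delta(s, a): deterministic next-state function; it depends only on s and a.
DELTA = {
    ("s0", "left"): "s0", ("s0", "right"): "s1",
    ("s1", "left"): "s0", ("s1", "right"): "G",
    ("G", "left"): "G", ("G", "right"): "G",
}

# r(s, a): immediate reward; only the move into G pays off (assumed value).
REWARD = {(s, a): 0 for s in STATES for a in ACTIONS}
REWARD[("s1", "right")] = 100


def step(state, action):
    """One interaction step: return (r_t, s_{t+1}) for the pair (s_t, a_t)."""
    return REWARD[(state, action)], DELTA[(state, action)]


def rollout(start, actions):
    """Apply a fixed action sequence and return the visited (s, a, r) triples."""
    trajectory, state = [], start
    for action in actions:
        reward, next_state = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Calling `rollout("s0", ["right", "right"])` yields the kind of (state, action, reward) training data that a reinforcement learner observes. In a real task the learner would only see the outputs of `step`, not the tables behind it.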
Model-free v/s Model-based Methods
- Model-free Methods
  ◦ Eliminate the need to know the model.
  ◦ Learn a policy without estimating the model.
  ◦ Example: Q-Learning
- Model-based Methods
  ◦ The model consists of the Transition Probability Function and the Reward Function.
  ◦ Interactively estimate the model and calculate the policy.
  ◦ Example: Dyna
Precise Goal
- Maximization of long-term reward.
- An optimal policy has to be learnt in order to do this.
- Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.
Models of Optimal Behaviour
- Three main models: finite horizon, infinite horizon discounted, and average reward.
- Finite Horizon Model
  ◦ Optimize the reward for the next h steps.
  ◦ $r_t$ represents the reward for the (t+1)th action performed.
  ◦ h = 1 represents a greedy strategy.
Models of Optimal Behaviour (2)
- Infinite Horizon Discounted Model
  ◦ Models future rewards as being less valuable than present ones.
  ◦ Has an infinite horizon, i.e. considers infinitely many actions in the future.
  ◦ Makes use of a discount factor γ between 0 and 1, which represents the importance of the future.
  ◦ This is the model used in the Q-Learning algorithm.
Models of Optimal Behaviour (3)
- Average-Reward Model
  ◦ Reward per action is considered.
  ◦ Infinite future is taken into account.
  ◦ Rewards in the future are equally valuable.
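The three models of optimal behaviour can be compared on a single observed reward sequence. The sketch below is illustrative only: the function names are hypothetical, expectations are replaced by one sample trajectory, and the average-reward model (a limit over an infinite future) is approximated by the mean of a finite sequence.

```python
def finite_horizon_return(rewards, h):
    """Finite horizon model: sum of the rewards of the next h actions."""
    return sum(rewards[:h])


def discounted_return(rewards, gamma):
    """Discounted model: reward r_t is weighted by gamma**t (0 <= gamma < 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_reward(rewards):
    """Average-reward model, approximated on a finite sequence of rewards."""
    return sum(rewards) / len(rewards)


# A hypothetical trajectory in which only the third action is rewarded.
rewards = [0, 0, 100]
greedy = finite_horizon_return(rewards, 1)     # h = 1 looks one step ahead only
three_step = finite_horizon_return(rewards, 3)
discounted = discounted_return(rewards, 0.9)
average = average_reward(rewards)
```

With h = 1 the delayed reward is invisible, which is why that setting corresponds to a greedy strategy. The discounted model sees the reward but values it less than an immediate one.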
The Learning Task
- Execute actions, observe results and
  ◦ learn a policy π : S → A maximizing $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$ for all states in S.
• The infinite horizon discounted model is being used here.
Value Function
- For each possible policy π the agent can adopt, the value function is
  $$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$
- In terms of the value function, the learning task can be reformulated as learning the optimal policy π* such that
  $$\pi^* = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in S$$
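For a deterministic world, the value of a policy can be computed by following the policy and summing the discounted rewards. The sketch below assumes a hypothetical three-state corridor (s0, s1, goal G), a reward of 100 for entering G, γ = 0.9, and a truncation of the infinite sum after 100 steps; none of these come from the slides.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic world: delta(s, a) and r(s, a) as lookup tables.
DELTA = {
    ("s0", "left"): "s0", ("s0", "right"): "s1",
    ("s1", "left"): "s0", ("s1", "right"): "G",
    ("G", "left"): "G", ("G", "right"): "G",
}
REWARD = {key: 0 for key in DELTA}
REWARD[("s1", "right")] = 100


def policy_value(policy, state, gamma=GAMMA, horizon=100):
    """Approximate V^pi(state) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):  # truncated infinite sum
        action = policy[state]
        total += discount * REWARD[(state, action)]
        state = DELTA[(state, action)]
        discount *= gamma
    return total


# Two candidate policies pi : S -> A.
ALWAYS_RIGHT = {"s0": "right", "s1": "right", "G": "right"}
ALWAYS_LEFT = {"s0": "left", "s1": "left", "G": "left"}

# The optimal policy among the candidates maximizes V^pi in every state.
candidates = {"always_right": ALWAYS_RIGHT, "always_left": ALWAYS_LEFT}
best_name = max(candidates, key=lambda name: policy_value(candidates[name], "s0"))
```

Enumerating policies like this is only feasible for tiny problems, which is one reason the Q function is learnt instead.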
Learning the Value Function
  $$\pi^*(s) = \arg\max_{a} \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$
- Here V* is the value function of the optimal policy π*.
- Using V* to select actions is only possible when the agent has perfect knowledge of both δ and r.
- Thus the need to define the Q function arises.
The Q Function
- The evaluation function Q(s, a) is defined as
  $$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$$
- If the Q function is learnt, the optimal action can be selected without knowing δ and r:
  $$\pi^*(s) = \arg\max_{a} Q(s, a)$$
Learning the Q function
- Notice the relationship between Q and V*:
  $$V^*(s) = \max_{a'} Q(s, a')$$
- Rewriting the value of Q:
  $$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$$
Learning the Q-value
- FOR each <s, a> DO
  ◦ Initialize table entry: $\hat{Q}(s, a) \leftarrow 0$
- Observe current state s
- WHILE (true) DO
  ◦ Select action a and execute it
  ◦ Receive immediate reward r
  ◦ Observe new state s'
  ◦ Update the table entry for $\hat{Q}(s, a)$ using the training rule:
    $$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$
  ◦ Move: record the transition from s to s', i.e. $s \leftarrow s'$
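The table-based procedure can be sketched in Python for a deterministic world. Everything specific here is an assumption made for illustration: the four-state corridor, the reward of 100 for entering the goal G, γ = 0.9, purely random action selection, and the use of episodes that restart from a random state whenever G is reached.

```python
import random

GAMMA = 0.9  # assumed discount factor
STATES = ["s0", "s1", "s2", "G"]  # hypothetical corridor; G is absorbing
ACTIONS = ["left", "right"]


def step(state, action):
    """Environment, hidden from the learner: return (reward, next_state)."""
    if state == "G":
        return 0, "G"
    i = STATES.index(state)
    j = max(i - 1, 0) if action == "left" else i + 1
    next_state = STATES[j]
    return (100 if next_state == "G" else 0), next_state


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry Q^(s, a) to 0.
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES[:-1])        # random non-goal start state
        while s != "G":
            a = rng.choice(ACTIONS)        # select an action and execute it
            r, s_next = step(s, a)         # receive reward, observe new state
            # Training rule: Q^(s, a) <- r + gamma * max_a' Q^(s', a')
            q[(s, a)] = r + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next                     # move to the new state
    return q


def greedy_policy(q):
    """pi*(s) = argmax_a Q(s, a) for every non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES[:-1]}
```

Note that `q_learning` never reads the transition or reward tables directly; it only uses the observed `(r, s_next)` pairs, which is what makes the method model-free. Random exploration is the simplest choice here, and other action-selection strategies could be substituted.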
Q Learning Illustration
- Consider that an optimal strategy is being learnt for the given set of states and rewards for actions.
- The reward for every action is 0 unless the action moves into the goal state.
[Figure: grid of states with goal state G. Each action is labelled with its immediate reward: 100 for the actions entering G, 0 for all others.]
Q Learning Illustration (2)
- The actual values of V* and Q, taking γ = 0.9, are:
[Figure: V*(s) and Q(s, a) values for each state and action.]
Learning Step
- Since an absorbing state exists, learning will consist of a series of episodes, each with a random start state.
  $$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$
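The single update in this learning step can be checked numerically. The helper name `q_backup` is hypothetical; the numbers (immediate reward 0, γ = 0.9, and the three current estimates 63, 81 and 100 for the actions available in the new state) are the ones used in the step above.

```python
GAMMA = 0.9


def q_backup(reward, next_q_values, gamma=GAMMA):
    """One application of the rule Q^(s, a) <- r + gamma * max_a' Q^(s', a')."""
    return reward + gamma * max(next_q_values)


# Moving right from s1 to s2: reward 0, current estimates for s2's actions.
updated = q_backup(0, [63, 81, 100])
```

Only the largest estimate in the new state matters: changing 63 or 81 would leave the update unchanged as long as 100 remains the maximum.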
Uncertain State Transitions
- A state transition probability function $T(s, a, s')$ is used.
  ◦ It denotes the probability of the transition from state s to s' when action a is performed.
- The Q function is modified to
  $$Q(s, a) = r(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q(s', a')$$
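The modified Q function replaces the single successor state by an expectation over successors. A minimal sketch follows, in which the function name, the dictionary layout for T, and all numbers in the example are assumptions made for illustration.

```python
def expected_backup(q, transitions, reward, state, action, actions, gamma=0.9):
    """r(s, a) + gamma * sum over s' of T(s, a, s') * max_a' Q(s', a')."""
    expected_future = sum(
        prob * max(q[(next_state, a2)] for a2 in actions)
        for next_state, prob in transitions[(state, action)].items()
    )
    return reward[(state, action)] + gamma * expected_future


# Hypothetical example: "right" from s1 succeeds with probability 0.8
# and leaves the agent in s1 with probability 0.2.
ACTIONS = ["left", "right"]
T = {("s1", "right"): {"s2": 0.8, "s1": 0.2}}
R = {("s1", "right"): 0}
Q = {
    ("s1", "left"): 40.0, ("s1", "right"): 50.0,
    ("s2", "left"): 60.0, ("s2", "right"): 100.0,
}

value = expected_backup(Q, T, R, "s1", "right", ACTIONS)
```

Here the expected future value is 0.8 × 100 + 0.2 × 50 = 90, so the backed-up value is 0.9 × 90 = 81. When T assigns probability 1 to a single successor, the expression reduces to the deterministic rule.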
Applications
- Cell Phone Channel Allocation
- Cobot: A Social Reinforcement Learning Agent
- Car Simulation: Using Reinforcement Learning
- Network Packet Routing
- Elevator Scheduling
- Use of RL to improve the performance of natural language question answering systems
- Reinforcement Learning Methods for Military Applications
Cell Phone Channel Allocation
- Learns channel allocations for cell phones
  ◦ Channels are limited
  ◦ Allocations affect adjacent cells
  ◦ Want to minimize dropped and blocked calls
Cell Phone Channel Allocation Cont…
- States
  ◦ Occupied and unoccupied channels for each cell
  ◦ Availability: number of free channels for each cell
- Actions
  ◦ Call arrival
    - Evaluate the possible free channels
    - Assign the one that has the highest value
  ◦ Call termination
    - Free the channel
    - Consider reassigning each ongoing call to the just-released channel
    - Perform the reassignment (if any) with the highest value
- Rewards and Values
  ◦ Reward is the number of ongoing calls
[Figure: Performance of FA, BDCL, and RL]
FA = fixed assignment method; BDCL = borrowing with directional channel locking; RL = reinforcement learning
Cobot: A Social Reinforcement Learning Agent
- Cobot is a software agent
  ◦ Applies RL in a complex, human, online social chat-based environment
- The goal is to interact with other members and to become a real part of its social fabric
- Takes certain actions on its own initiative
- Any user can reward or punish it
- Cobot has an incremental database of "social statistics"
  ◦ e.g. how frequently and in what ways users interacted with one another; it provides summaries of these statistics as a service
Cobot Cont…
- States
  ◦ One state space corresponding to each user
  ◦ Each state space contains a number of features holding statistics about that particular user
- Actions
  ◦ Null Action: choose to remain silent for this time period.
  ◦ Topic Starters: introduce a conversational topic.
  ◦ Social Commentary: make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly."
Cobot cont…
- The RL Reward Function
  ◦ Reward verb, e.g. hug
  ◦ Punish verb, e.g. spank
  ◦ These verbs give a numerical (positive and negative, respectively) training signal to Cobot
Car Simulation: Using Reinforcement Learning
- The drivers do not know the track information beforehand
- A driver takes appropriate action to avoid bumping into the wall by learning from past experience and the given rewards
- States
  ◦ The state of the car is represented by two variables:
    - the distance of the car to the left wall of the track
    - the car's velocity towards the right wall
Car Simulation cont…
- Action
  ◦ Turn left or right to go in the right direction
- Reward
  ◦ The car is given a negative reward for
    - bumping into the wall
    - going backwards
  ◦ and a positive reward for
    - going in the correct direction
Conclusion
- The basic reinforcement learning model consists of:
  ◦ a set of environment states S
  ◦ a set of actions A and
  ◦ a set of scalar "rewards"
- The learner takes actions in an environment so as to maximize its long-term reward.
- Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique.
- Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training.
References
- Bill Smart, "Reinforcement Learning: A User's Guide", Department of Computer Science and Engineering, Washington University in St. Louis, ICAC 2005. http://www.cse.wustl.edu/~wds/
- Tom Mitchell, Machine Learning, McGraw Hill, 1997.
- Harmon, M., Harmon, S., "Reinforcement Learning: A Tutorial", Wright State University, 2000.
- Rich Sutton, "Reinforcement Learning: Tutorial", AT&T Labs. http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt
- Kaelbling, L.P., Littman, M.L., and Moore, A.W., "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, 4, 1996.
- http://en.wikipedia.org/wiki/Reinforcement_learning
- Charles Isbell, Christian Shelton, Michael Kearns, Satinder Singh and Peter Stone, "A Social Reinforcement Learning Agent", in Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. Winner of Best Paper Award.