Post on 08-Feb-2017
BLAZING THE TRAILS BEFORE BEATING THE PATH:
SAMPLE-EFFICIENT MONTE-CARLO PLANNING
KATSUKI OHTO
@NIPS2016-YOMI
2017/1/19
INTRODUCED PAPER
• Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning (J.-B. Grill, M. Valko and R. Munos)
• NIPS 2016 accepted paper (poster session)
• Abstract starts with “You are a robot…”
• http://papers.nips.cc/paper/6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning
TRAILBLAZER
• Nested-fashion Monte-Carlo planning algorithm
• Problem settings:
  MDP (contains MAX nodes and AVG nodes)
  Actions per state: finite
  State transition candidates: finite or infinite
• Strong theoretical guarantee
[Figure: tree diagram with MAX and AVG nodes]
AIM
• Input: an MDP (Markov Decision Process) (discount factor $\gamma$, maximum number of valid actions $K$), $\varepsilon$ ($> 0$), $\delta$ ($0 < \delta < 1$)
• Output: estimated value of the current state
• Aim: get a good estimate $\hat{V}$ of the true value $V$ of the current state, such that $P(|\hat{V} - V| > \varepsilon) \le \delta$ ($P(A)$ means the probability of $A$), with the minimum number of calls to the generative model (state transition function)
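To make "number of calls to the generative model" concrete, here is a minimal Python sketch that wraps a sampler and counts how often it is queried. The names `CountingGenerativeModel` and `toy_sampler`, and the toy dynamics, are hypothetical illustrations and do not come from the paper.

```python
import random

_RNG = random.Random(0)  # fixed seed so the toy run is repeatable


class CountingGenerativeModel:
    """Wraps a sampler and counts how many transitions were requested."""

    def __init__(self, sample_fn):
        # sample_fn(state, action) -> (next_state, reward); hypothetical signature
        self._sample_fn = sample_fn
        self.calls = 0

    def sample(self, state, action):
        self.calls += 1  # one call = one unit of sample complexity
        return self._sample_fn(state, action)


def toy_sampler(state, action):
    """Hypothetical toy dynamics: a fair coin picks the next state."""
    next_state = _RNG.choice([0, 1])
    reward = 1.0 if next_state == action else 0.0  # reward stays in [0, 1]
    return next_state, reward


model = CountingGenerativeModel(toy_sampler)
for _ in range(5):
    model.sample(state=0, action=1)
print(model.calls)  # 5
```

A planner would be handed `model.sample` in place of the raw sampler, so that `model.calls` reports its cost after planning.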
1-PLAYER TREE MODEL IN STOCHASTIC ENVIRONMENT
• Each MAX node represents an opportunity to choose an action
• Each AVG node represents a stochastic state transition
[Figure: tree diagram with MAX and AVG nodes]
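As an illustration of how MAX and AVG nodes interact, the following Python sketch computes exact depth-limited values on a tiny MDP by alternating a maximum over actions with an expectation over next states. This is plain exhaustive recursion, not TrailBlazer itself. The MDP in `TRANSITIONS`, the value of `GAMMA`, and the function names are all hypothetical.

```python
GAMMA = 0.5  # discount factor (illustrative value)

# Hypothetical toy MDP: TRANSITIONS[state][action] is a list of
# (probability, next_state, reward) triples.
TRANSITIONS = {
    0: {
        "safe": [(1.0, 0, 0.4)],
        "risky": [(0.5, 1, 1.0), (0.5, 0, 0.0)],
    },
    1: {"stay": [(1.0, 1, 0.0)]},
}


def max_node(state, depth):
    """MAX node: the agent picks the action with the highest value."""
    if depth == 0:
        return 0.0
    return max(avg_node(state, action, depth) for action in TRANSITIONS[state])


def avg_node(state, action, depth):
    """AVG node: expectation over the stochastic next states."""
    return sum(
        prob * (reward + GAMMA * max_node(next_state, depth - 1))
        for prob, next_state, reward in TRANSITIONS[state][action]
    )


print(max_node(0, 1))  # one step ahead, "risky" is better
print(max_node(0, 2))  # two steps ahead, "safe" is better
```

The recursion visits every branch, which is only feasible for tiny trees; sampling-based planners replace the exact expectation with averages over sampled transitions.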
ALGORITHM OVERVIEW
• Global initialization
  set the global values
  set the arguments of the root node
• Recursive algorithm
ALGORITHM OVERVIEW 2
• In both MAX nodes and AVG nodes, the arguments are $m$ (desired branching factor) and $\varepsilon$ (admissible estimation error)
• If $m$ is large, we can search many children, but it takes much time (dilemma)
• If $\varepsilon$ is small, we can search deeply, but it takes much time (dilemma)
ALGORITHM FOR AVG NODES
• Input: $m$ and $\varepsilon$
• Output: estimated value
• If the admissible error is large, ignore the successive reward
• Fill the transition samples (and store the immediate reward)
• Search all of the sampled next states
• Return averaged immediate reward + estimated successive reward
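The AVG-node steps can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the paper's exact pseudocode: rewards are assumed to lie in [0, 1], the "large error" threshold 1 / (1 - gamma) follows from that assumption, and passing each child its multiplicity and the relaxed error eps / gamma is a choice made for this sketch. The class `AvgNode` and the callbacks `sample_transition` and `child_value` are hypothetical names.

```python
from collections import Counter


class AvgNode:
    def __init__(self, sample_transition, child_value, gamma):
        self.sample_transition = sample_transition  # () -> (next_state, reward)
        self.child_value = child_value              # (next_state, m, eps) -> float
        self.gamma = gamma
        self.next_states = []   # transition samples, kept between calls
        self.reward_sum = 0.0   # sum of the stored immediate rewards

    def run(self, m, eps):
        # Assumption: rewards in [0, 1], so every value is at most
        # 1 / (1 - gamma). With an admissible error that large,
        # returning 0 is already accurate enough.
        if eps >= 1.0 / (1.0 - self.gamma):
            return 0.0
        # Fill the transition samples up to m, storing immediate rewards.
        while len(self.next_states) < m:
            next_state, reward = self.sample_transition()
            self.next_states.append(next_state)
            self.reward_sum += reward
        # Search every distinct next state among the first m samples.
        successive = 0.0
        for next_state, count in Counter(self.next_states[:m]).items():
            value = self.child_value(next_state, count, eps / self.gamma)
            successive += value * count / m
        mean_reward = self.reward_sum / len(self.next_states)
        return mean_reward + self.gamma * successive


node = AvgNode(lambda: ("s1", 1.0), lambda s, m, eps: 2.0, gamma=0.5)
print(node.run(m=4, eps=0.1))  # 1.0 + 0.5 * 2.0 = 2.0
```

Because the samples are stored on the node, a later call with a smaller or equal `m` reuses them and makes no new calls to the generative model.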
ALGORITHM FOR MAX NODES
• Input: $m$ and $\varepsilon$
• Output: estimated value
• Fill the candidate action pool with all valid actions
• $U$ is a value like the standard error of the estimate
• Search the candidate actions repeatedly until “Only 1 action left” or “Error might be small”
• If “Error might be small”, then return the estimated value of the best action; else search the best action 1 more time carefully
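The MAX-node loop can be sketched in Python as an action-elimination procedure. The confidence width used here, `width_scale / sqrt(ell)`, is a simple stand-in chosen for this sketch and is not the formula from the paper; likewise the elimination rule and the error passed to the children are assumptions. The function name `max_node_value` and the `children` mapping are hypothetical.

```python
import math


def max_node_value(children, m, eps, width_scale=1.0):
    """children maps each action to a callable (m, eps) -> estimated value."""
    pool = list(children)  # candidate action pool: all valid actions
    ell = 1                # samples requested from each child this round
    while True:
        # Stand-in confidence width that shrinks as ell grows (assumption).
        u = width_scale / math.sqrt(ell)
        estimates = {a: children[a](ell, u) for a in pool}
        # Keep actions whose optimistic value reaches the best pessimistic one.
        best_lower = max(estimates.values()) - u
        pool = [a for a in pool if estimates[a] + u >= best_lower]
        if len(pool) == 1 or u < eps:
            break
        ell += 1
    best = max(pool, key=lambda a: estimates[a])
    if len(pool) > 1:
        # "Error might be small": the survivors are close, return the best.
        return estimates[best]
    # "Only 1 action left": search it one more time with the full budget.
    return children[best](m, eps)


children = {"left": lambda m, eps: 1.0, "right": lambda m, eps: 0.0}
print(max_node_value(children, m=1000, eps=0.1))  # 1.0
```

In this toy run the children are deterministic, so "right" is eliminated once the width drops below half the gap between the two values, and "left" is then queried once more with the full `(m, eps)`.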
SAMPLE COMPLEXITY OF TRAILBLAZER
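• Sample complexity is a measure of an algorithm's performance (here, the number of calls to the generative model)
• If N (the number of next states) is finite, a sample complexity bound holds under a condition (details in the paper)
• Otherwise, a bound holds under a condition that involves a measure of the difficulty of identifying near-optimal nodes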