Particle Filter on Episode for Learning Decision Making Rule

Ryuichi Ueda (Chiba Inst. of Technology), Kotaro Mizuta (RIKEN BSI), Hiroshi Yamakawa (DWANGO), Hiroyuki Okada (Tamagawa Univ.)


navigation problems in the real world

Not only robots but also animals solve them. Mammals have specialized cells for spatial recognition in their brains, especially around the hippocampus, e.g. place cells. These cells show a different reaction at each place of the environment -> evidence for the existence of maps in the brain.

Place cells [O'Keefe71] (http://en.wikipedia.org/wiki/Place_cell)

July 6th, 2016, IAS-14, Shanghai

map vs. memory

Mammals have maps in their brains. Maps of environments are also a concern in robotics: SLAM has been one of the most important topics, and there are studies that introduce the function of the hippocampus into robotics, e.g. RatSLAM [Milford08].

How about memory?

Memory is also handled in the hippocampus. Sequences of memory are reduced to maps (or state space models). Unlike mammals, robots can record their memory for a long time if they have TB-level storage.


the purpose

Our intuition: if memory is the source of maps, robots should be able to decide their actions not from a map but directly from memory. Some knowledge about the handling of memory in the hippocampus and its surroundings will help this attempt.

Goals: to implement a learning algorithm that directly utilizes memory, the particle filter on episode (PFoE), and to validate it with an actual robot.

related works

Episode-based reinforcement learning [Unemi 1999]: its base idea is identical to that of PFoE. PFoE simplifies the implementation and enables real-time calculation.

RatSLAM [Milford08]: an algorithm for robotics that utilizes knowledge about the hippocampus and its surroundings.

outline of PFoE

While repeating a task for learning, a robot stores events. An event is a set of sensor readings, actions, and rewards (given by someone) obtained at a discrete time step. The episode is the sequence of these events. The degree of recall of each event is represented as a probability.

[Figure: the episode along the time axis, with states s, actions a, and rewards (+1/-1) at each past time step up to the present time, and the belief drawn over the events]
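As a concrete illustration, the event and episode described above could be represented as follows. This is a minimal sketch, not the authors' implementation; the `Event` fields and the `record` helper are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Event:
    sensors: tuple   # sensor readings at this time step
    action: str      # action chosen at this time step
    reward: float    # reward given at this time step (e.g. +1, -1, or 0)

episode = []  # the episode: the time-ordered sequence of events

def record(sensors, action, reward):
    """Extend the episode by one event at the current discrete time step."""
    episode.append(Event(tuple(sensors), action, reward))
```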

decision with the belief and the episode

An action is chosen by calculating expectation values.

[Figure: the belief over the episode at decision time, with states, actions, and rewards]

When the robot recalls events that led to a +1 reward, it may obtain the reward again by choosing the same action as at those times. When it recalls events that led to a -1 reward, it should change its action to avoid that reward.
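One plausible reading of this slide is that each particle "recalls" a past time step, and the action and reward of the event that followed are used to score candidate actions by their expected reward. The sketch below follows that reading; the scoring rule and the dict-based event representation are assumptions, not the paper's exact formula.

```python
def choose_action(episode, particles, actions):
    """Pick the action with the highest expected reward under the belief.

    particles: list of (position_on_time_axis, weight) pairs.
    episode entries: dicts with "action" and "reward" keys (illustrative).
    """
    expectation = {a: 0.0 for a in actions}
    for pos, weight in particles:
        nxt = pos + 1
        if nxt < len(episode):
            e = episode[nxt]  # the event that followed the recalled one
            expectation[e["action"]] += weight * e["reward"]
    return max(expectation, key=expectation.get)
```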

representation with particles

The belief is represented with particles, so the cost stays O(N) in the number of particles even if the episode has infinite length. A particle has two variables: its position on the time axis and its weight.

[Figure: particles placed along the time axis of the episode, forming the belief up to the present time]
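A particle therefore carries only the two variables named on the slide. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Particle:
    position: int   # index of the recalled event on the episode's time axis
    weight: float   # weight of this particle in the belief

# N particles over the first 5 time steps, uniformly weighted
particles = [Particle(position=t, weight=1.0 / 5) for t in range(5)]
```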

operation of PFoE: motion update

When the current time goes to the next time step, the episode is extended by an additional event and the particles simply shift to their next time steps.

[Figure: the belief before and after an action; the new event is added to the episode and the particle positions are shifted]
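The motion update described above can be sketched in a few lines. This is a hedged illustration, assuming dict-based events and particles; the function name is not from the paper.

```python
def motion_update(episode, particles, new_event):
    """Advance to the next time step."""
    episode.append(new_event)      # the episode is extended by the new event
    for p in particles:
        p["position"] += 1         # each particle shifts to its next time step
```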

operation of PFoE: sensor update

The event related to each particle is compared to the latest one, and the weights are reduced according to the difference. After the reduction, the weights are normalized and the particles are resampled. When the sum of the weights before normalization is under a threshold, all particles are replaced (a reset). How should they be reset?

[Figure: each particle's event e in the episode is compared with the latest event; the difference in sensor readings, the reward, or the action reduces the particle's weight]
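A self-contained sketch of this sensor update, under stated assumptions: the slide only says weights are reduced "responding to the difference", so the exponential penalty on the sensor difference below is an assumed likelihood, and the weighted redraw stands in for whatever resampling scheme the authors used.

```python
import math
import random

def sensor_update(episode, particles, threshold=0.2):
    """Reweight, normalize, and resample; return False if a reset is needed."""
    latest = episode[-1]
    for p in particles:
        recalled = episode[p["position"]]
        diff = sum(abs(a - b) for a, b in zip(recalled["sensors"], latest["sensors"]))
        p["weight"] *= math.exp(-diff)        # reduce weight by the difference
    total = sum(p["weight"] for p in particles)
    if total < threshold:
        return False                          # sum too small: caller performs a reset
    for p in particles:
        p["weight"] /= total                  # normalize
    # simplified resampling: weighted draw with replacement
    n = len(particles)
    positions = random.choices([p["position"] for p in particles],
                               weights=[p["weight"] for p in particles], k=n)
    particles[:] = [{"position": pos, "weight": 1.0 / n} for pos in positions]
    return True
```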

operation of PFoE: retrospective resets

This reset is inspired by the retrospective activity of place cells: when a rat recalls past events, its place cells become active as if the rat virtually moves.

Algorithm:
1. Place particles randomly.
2. Replay the motion update and the sensor update for M steps with the past M events up to the current time.

[Figure: particles placed M steps before the current time are moved and compared against the last M events of the episode]
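The two steps above can be sketched as follows. This is an assumed, self-contained illustration: the random placement range, the exponential weight penalty, and the resampling are all simplifications standing in for the actual motion and sensor updates.

```python
import math
import random

def retrospective_reset(episode, n_particles, m_steps):
    """Scatter particles randomly, then replay the last M events."""
    horizon = len(episode) - m_steps
    # 1. place particles randomly on the time axis before the replay window
    particles = [{"position": random.randrange(max(1, horizon)),
                  "weight": 1.0 / n_particles} for _ in range(n_particles)]
    # 2. replay the motion and sensor updates with the past M events
    for event in episode[-m_steps:]:
        for p in particles:
            p["position"] += 1                              # motion update
            recalled = episode[min(p["position"], len(episode) - 1)]
            diff = sum(abs(a - b) for a, b in
                       zip(recalled["sensors"], event["sensors"]))
            p["weight"] *= math.exp(-diff)                  # sensor update
        total = sum(p["weight"] for p in particles) or 1.0
        weights = [p["weight"] / total for p in particles]
        positions = random.choices([p["position"] for p in particles],
                                   weights=weights, k=n_particles)
        particles = [{"position": pos, "weight": 1.0 / n_particles}
                     for pos in positions]
    return particles
```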

experiments

The robot is a micromouse with 4 range sensors. The environment is a T-maze with a reward at one of its arms. The robot chooses a turn-right or a turn-left action at the T-junction. The state transition is simplified to cycles of 4 events: the robot records an event when it is placed on the initial position, when it reaches the T-junction, when it turns right or left, and when it reaches an end of an arm.


[Figure: the micromouse with the directions of its sensors, and the T-maze with a marker of the reward]

tasks of experiments

A periodical task: the reward is put on the right or the left alternately, giving cycles of 8 events. A discrimination task: the reward is put on the side where the robot is placed at first; right or left is chosen randomly, so the task is not periodical. Settings: 1000 particles; 50 trials per episode, repeated for 5 sets.

periodical task with/without the retrospective reset

Retrospective resets reallocate particles effectively.

[Figure: learning curves of the periodical task with a random reset and with the retrospective reset]

discrimination task: comparison of thresholds for retrospective resets

A higher threshold gives signs of learning: particles are replaced frequently and go over the cyclic state transition. But the learning is not perfect.

[Figure: discrimination-task results with reset thresholds of 0.2 (resets not frequent) and 0.5 (resets frequent)]

conclusion

Particle Filter on Episode (PFoE) estimates the relation between the current situation and the past, is capable of real-time learning, and does not require an environmental model except for the Bayes model in the sensor update.

Experimental results: PFoE works on an actual robot. The simple periodical task can be learned within 20 trials. The discrimination task can be partially learned (75% success). It seems that the idea of retrospective resetting should be extended for non-periodical tasks (future work).

periodical task again, with a different threshold

This experiment checks for ill effects of the high threshold for retrospective resetting in the periodical task.

Result: no ill effects can be seen.

[Figure: periodical-task results with thresholds of 0.2 and 0.5]