Learning to Navigate Through Crowded Environments


Peter Henry¹, Christian Vollmer², Brian Ferris¹, Dieter Fox¹
Tuesday, May 4, 2010

¹ University of Washington, Seattle, USA
² Ilmenau University of Technology, Germany

The Goal

Enable robot navigation within crowded environments

Motivation

Robots should move naturally and predictably within crowded environments
Move amongst people in a socially transparent way
More efficient and safer motion
Humans trade off various factors:
To move with the flow
To avoid high-density areas
To walk on the left/right side
To reach the goal

Challenge

Humans naturally balance between various factors
It is relatively easy to list the factors
But people cannot specify how they make the tradeoff
Previous work typically uses heuristics, with parameters that are hand-tuned:
Shortest path with collision avoidance [Burgard et al., AI 1999]
Track and follow a single person [Kirby et al., HRI 2007]
Follow people moving in the same direction [Mueller et al., CogSys 2008]

Contribution

Learn how humans trade off various factors
A framework for learning to navigate as humans do within crowded environments
An extension of Maximum Entropy Inverse Reinforcement Learning [Ziebart et al., AAAI 2008] to incorporate:
A limited locally observable area
Dynamic crowd flow features

Markov Decision Processes

States
Actions
Rewards / Costs
(Transition Probabilities)
(Discount Factor)

[Figure: example MDP with states S0-S3 leading to a goal state]
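As a minimal sketch of these components (the type and field names here are our own assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[object]                      # S
    actions: Dict[object, List[object]]       # actions[s_i] -> reachable states s_j
    cost: Callable[[object, object], float]   # cost(s_i, s_j); rewards are negative costs
    # Transition probabilities and the discount factor are parenthesized on the
    # slide (de-emphasized in this deterministic grid setting), so this sketch
    # omits them.
```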

Navigating in a Crowd as an MDP

States $s_i$
In the crowd scenario: a grid cell plus an orientation
Actions $a_{i,j}$ from $s_i$ to $s_j$
In the crowd scenario: move to an adjacent cell
Cost: an unknown linear combination of action features
Cost weights to be learned: $\theta$
Path: $\tau$
Features: $f_\tau$

$$\mathrm{cost}(\tau \mid \theta) = \theta \cdot f_\tau = \sum_{a_{i,j} \in \tau} \theta \cdot f_{a_{i,j}}$$
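A minimal sketch of this cost, assuming the per-action feature vectors are already available (function and parameter names are ours):

```python
import numpy as np

def path_cost(theta, path_features):
    """Cost of a path tau as a linear combination of action features.

    theta:          learned weight vector
    path_features:  one feature vector f_{a_{i,j}} per action on the path
    """
    # cost(tau | theta) = sum over actions a_{i,j} in tau of theta . f_{a_{i,j}}
    return sum(float(np.dot(theta, f)) for f in path_features)
```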

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL):
Given: the MDP structure and a set of example paths
Find: the reward function resulting in the same behavior
(Also called "Inverse Optimal Control")
IRL has previously been applied with success:
Lane changing [Abbeel, ICML 2004]
Parking lot navigation [Abbeel, IROS 2008]
Driving route choice and prediction [Ziebart, AAAI 2008]
Pedestrian route prediction [Ziebart, IROS 2009]

Maximum Entropy IRL

Exponential distribution over paths:

$$P(\tau \mid \theta) = \frac{e^{-\theta \cdot f_\tau}}{\sum_{\tau'} e^{-\theta \cdot f_{\tau'}}}$$

Learning:

$$\theta^* = \arg\max_\theta \sum_{\tau \in T} \log P(\tau \mid \theta)$$

Gradient: match observed and expected feature counts:

$$\nabla F = \tilde{f} - \sum_{a_{i,j}} D_{a_{i,j}} f_{a_{i,j}}$$
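A minimal sketch of this gradient, assuming the expected action visitation frequencies $D_{a_{i,j}}$ have already been computed under the current $\theta$ (e.g. by the forward pass of MaxEnt IRL); all names here are ours:

```python
import numpy as np

def maxent_gradient(observed_counts, visitation, features):
    """MaxEnt IRL gradient: observed minus expected feature counts.

    observed_counts: empirical feature counts f~ from the demonstrated paths
    visitation:      dict action -> expected visitation frequency D_{a_{i,j}}
    features:        dict action -> feature vector f_{a_{i,j}} (numpy arrays)
    """
    expected = sum(visitation[a] * features[a] for a in features)
    return observed_counts - expected

# Gradient step on the demonstration log-likelihood:
#   theta += learning_rate * maxent_gradient(f_obs, D, f)
```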

Locally Observable Features

It is unrealistic to assume the agent has global knowledge of the crowd
Contrast: the Continuum Crowds simulator explicitly finds a global solution for the entire crowd
We do assume knowledge of the map itself
Training: provide flow features only within a small radius around the current position
This assumes that these are the features available to the "expert"
A single demonstration path becomes many small demonstrations of locally motivated paths

Locally Observable Dynamic Features

Crowd flow changes as the agent moves
Locally observable dynamic feature training (a code sketch follows this list):
1. Update the flow features within the local horizon
2. Compute the feature gradient within the grid
3. Perform a stochastic update of the weights
4. Take the next step of the observed path
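A minimal sketch of this training loop; observe_local_flow and local_gradient are hypothetical helpers standing in for the feature extraction and the horizon-restricted gradient defined on the following slides:

```python
def train_locally_observable(theta, demo_path, horizon, lr=0.01):
    """One pass over a demonstration, updating theta step by step (sketch)."""
    for t in range(len(demo_path)):
        # 1. Update flow features within the local horizon around the agent
        features_t = observe_local_flow(demo_path[t], horizon)    # hypothetical
        # 2. Compute the feature gradient within the grid
        grad = local_gradient(theta, features_t,                  # hypothetical
                              demo_path[t:t + horizon])
        # 3. Stochastic update of the weights
        theta = theta + lr * grad
        # 4. Take the next step of the observed path (advance the loop)
    return theta
```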

Locally Observable Dynamic IRL

The path probability decomposes into many short paths over the current features in the locally observable horizon:

$$P(\tau \mid \theta) = \frac{1}{Z(\theta)} \, e^{-\theta \cdot \sum_t \sum_{0 \le h < H} f^t_{a_{t+h}}}$$

The outer sum decomposes the path over timesteps $t$; the inner sum covers the local horizon $H$, with $f^t$ the features for actions within the horizon at time $t$.

Locally Observable Dynamic Gradient

Uses the current estimate of the features at time $t$
Computes the gradient only within the local horizon $H$

$$\nabla F^t = \tilde{f}^t - \sum_{a_{i,j} \in H} D^t_{a_{i,j}} f^t_{a_{i,j}}$$

Here $\tilde{f}^t$ are the observed features within $H$, and the sum gives the expected features for actions within $H$.

Map and Features

Each grid cell encompasses 8 oriented states
This allows flow features relative to the agent's orientation
Features:
Distance
Crowd flow speed and direction
Crowd density
(many others are possible…)
The features were chosen as being reasonable to obtain from current sensors
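As an illustration of this state representation (the type and feature names are our own assumptions, not from the slides):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    x: int
    y: int
    heading: int  # one of 8 orientations, in 45-degree increments

def action_features(state, flow_speed, flow_heading, density, distance):
    """Illustrative feature vector for one action. Crowd flow is expressed
    relative to the agent's heading, so moving with, against, or across
    the flow yields different feature values."""
    relative_flow = (flow_heading - state.heading) % 8
    return [distance, flow_speed, relative_flow, density]
```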

Crowd Simulator [Continuum Crowds, Treuille et al., SIGGRAPH 2006]

Simulator Environment

Experimental Setup

We used ROS [Willow Garage] to integrate the crowd simulator with the IRL learner and planner:
1. Extract individual crowd traces and observable features
2. Learn feature weights with our IRL algorithm
3. Use the weights for a simulated robot in test scenarios
Planning is A* search
Re-planning occurs at every grid cell with updated features (a sketch of this loop follows)
The robot is represented to the crowd simulator as just another person, for realistic reactions from the crowd
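A minimal sketch of the re-planning loop, assuming the learned weights theta and hypothetical helpers astar and observe_local_flow:

```python
import numpy as np

def navigate(start, goal, theta, horizon):
    """Plan with A* over the learned edge costs, re-planning every grid cell."""
    state = start
    while state != goal:
        features = observe_local_flow(state, horizon)         # hypothetical
        def edge_cost(action):
            return float(np.dot(theta, features[action]))     # theta . f_a
        path = astar(state, goal, edge_cost)                  # hypothetical
        state = path[1]  # execute one step, then re-plan with fresh features
    return state
```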

Quantitative Results

Measure similarity to the "human" path:
Shortest Path (baseline): ignores the crowd
Learned Path: the path from our learned planner
Mean / Maximum Difference: over all path cells, the difference to the closest "human" path cell

                     Shortest Path   Learned Path   Improvement
Mean Difference           1.4             0.9           35%
Maximum Difference        3.3             2.3           30%

(The difference is significant at the p = 0.05 level.)

Mall Scenario (Video)

Lane Formation (Video)

Future Work

Train on real crowd data
Overhead video + tracking?
Wearable sensors to mimic robot sensor input?
Implement on an actual robot
Is the method effective for raw sensor data?
Which are the most useful features?
Pedestrian prediction
Compare with / incorporate other recent work [Ziebart, IROS 2009]

Conclusion

We have presented a framework for learning to imitate human behavior from example traces
We learn weights that produce paths matching observed behavior from whatever features are made available
Our inverse reinforcement learning algorithm handles locally observable dynamic features
The resulting paths are more similar to observed human paths