
Natural Tasking of Robots Based on Human Interaction Cues

Brian Scassellati, Bryan Adams, Aaron Edsinger, Matthew Marjanovic
MIT Artificial Intelligence Laboratory

[Figure: face-finding pipeline — Locate Target → Foveate Target → Apply Face Filter → Software Zoom → Feature Extraction, with stage times of 300 msec and 66 msec.]

Current Research: Joint Reference and Simple Mimicry

Our team at the MIT Artificial Intelligence Laboratory is building robotic systems that use natural social conventions as an interface. We believe that these systems will enable anyone to teach the robot to perform simple tasks. The robot will be usable without special training or programming skills, and it will be able to act in novel and dynamic situations.

We originally outlined a sequence of behavioral tasks, listed in the chart below, that will allow our robots to learn new tasks from a human instructor. In the chart, behaviors in bold text have been completed; behaviors in italic text have been partially implemented.

[Chart: the developmental sequence of behavioral tasks, running from Schema Creation through four development areas to the project Goals.]

Development areas: Development of Social Interaction; Development of Commonsense Knowledge; Development of Sequencing; Development of Coordinated Body Actions.

Charted behaviors: Face Finding, Eye Contact, Gaze Direction, Gaze Following, Recognizing Pointing, Speech Prosody, Intentionality Detector, Recognizing Instructor’s Knowledge States, Recognizing Beliefs, Desires, and Intentions, Arm and Face Gesture Recognition, Facial Expression Recognition, Familiar Face Recognition, Directing Instructor’s Attention, Vocal Cue Production, Motion Detector, Depth Perception, Object Saliency, Object Segmentation, Object Permanence, Body Part Segmentation, Human Motion Models, Long-Term Knowledge Consolidation, Expectation-Based Representations, Attention System, Turn Taking, Task-Based Guided Perception, Action Sequencing, VOR/OKR, Kinesthetic Body Representation, Mapping Robot Body to Human Body, Self-Motion Models, Reaching Around Obstacles, Object Manipulation, Active Object Exploration, Tool Use, Robot Teaching, Line-of-Sight Reaching, Simple Grasping, Smooth Pursuit and Vergence, Multi-Axis Orientation, Social Script Sequencing, Instructional Sequencing.


Our current research focuses on building the perceptual and motor primitives that will allow the robot to detect and respond to natural social cues. In the past year, we have developed systems that respond to human attention states and that mimic the movement of any animate object by tracing a similar trajectory with the robot’s arm.

[Diagram: system overview — Visual Input → pre-attentive filters (f) → Visual Attention → Trajectory Formation → ToBY → Animate Objects → Face/Eye Finder → Gaze Direction, with animate trajectories also routed to the Arm Primitives for Reaching/Pointing.]

Trajectories are selected based on the inherent object saliency, the instructor’s attentional state, and the animacy judgment. These trajectories are mapped from visual coordinates to a set of primitive arm postures. The trajectory can then be used to allow the robot to perform object-centered actions (such as pointing) or process-centered actions (such as repeating the trajectory with its own arm).
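
To make the selection-and-mapping stage concrete, here is a minimal sketch under stated assumptions: the cue weights, the 4-DOF joint vectors, and the bilinear interpolation between corner postures are all illustrative choices, not the poster's actual implementation.

```python
import numpy as np

# Hypothetical primitive postures at the corners of the normalized visual
# field; each is a joint-angle vector for an assumed 4-DOF arm.
PRIMITIVE_POSTURES = {
    "00": np.array([0.0, 0.0, 0.0, 0.0]),
    "10": np.array([0.8, 0.1, 0.0, 0.2]),
    "01": np.array([0.0, 0.7, 0.3, 0.0]),
    "11": np.array([0.8, 0.7, 0.3, 0.2]),
}

def score_trajectory(saliency, instructor_attended, animate):
    """Combine the three selection cues into one score.
    The relative weights here are illustrative assumptions."""
    return saliency + 2.0 * float(instructor_attended) + 3.0 * float(animate)

def visual_to_arm(point, postures=PRIMITIVE_POSTURES):
    """Bilinearly interpolate primitive arm postures from a visual
    coordinate normalized to [0, 1] x [0, 1] (an assumed mapping)."""
    x, y = point
    return ((1 - x) * (1 - y) * postures["00"] + x * (1 - y) * postures["10"]
            + (1 - x) * y * postures["01"] + x * y * postures["11"])

def trajectory_to_arm_path(trajectory):
    """Map a visual trajectory to a sequence of arm postures."""
    return [visual_to_arm(pt) for pt in trajectory]
```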

The attention of the instructor is monitored by a system that finds faces (using a color filter and shape metrics), orients to the instructor, and extracts salient features at a distance of 20 feet.
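
A rough sketch of the face-finding step might look like the following; the HSV skin-tone thresholds and the circularity cutoff are assumptions standing in for the poster's unspecified color filter and shape metrics.

```python
import cv2
import numpy as np

# Skin-tone range in HSV; these thresholds are illustrative assumptions.
SKIN_LO = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

def find_face_candidates(bgr_frame, min_area=400):
    """Return bounding boxes of skin-colored, roughly round blobs."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, SKIN_LO, SKIN_HI)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    candidates = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area:
            continue
        perim = cv2.arcLength(c, True)
        # Shape metric: circularity is 1.0 for a perfect disk; faces are
        # roughly elliptical, so moderately round blobs are accepted.
        circularity = 4.0 * np.pi * area / (perim * perim)
        if circularity > 0.5:
            candidates.append(cv2.boundingRect(c))
    return candidates
```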

[Images: three animacy test sequences — a moving hand, a rolling chair, and an “animate” chair.]

The “theory of body” module (ToBY) is a set of agents, each of which incorporates a rule of naïve physics. These rules estimate how objects move under natural conditions. In the images shown above, trajectories that obey these rules are judged to be inanimate (shown in red), while those that display self-propelled movement (like the moving hand or the “animate” chair being pushed with a rod) are judged animate (shown in green).
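
One way to picture the agent-voting idea is the sketch below: two toy naïve-physics agents (constant velocity and ballistic motion) score how well they explain a trajectory, and motion that no agent explains is called self-propelled. Both agents and the threshold are assumptions for illustration, not ToBY's actual rule set.

```python
import numpy as np

def constant_velocity_error(traj):
    """Mean unexplained acceleration under a constant-velocity model."""
    accel = np.diff(traj, n=2, axis=0)
    return float(np.mean(np.linalg.norm(accel, axis=1)))

def gravity_error(traj, g=(0.0, 9.8)):
    """Mean residual acceleration after removing a constant gravity term."""
    accel = np.diff(traj, n=2, axis=0)
    return float(np.mean(np.linalg.norm(accel - np.asarray(g), axis=1)))

def judge_animacy(traj, threshold=1.0):
    """Judge a trajectory animate if no naive-physics agent accounts
    for it (illustrative threshold; traj is a (T, 2) array of image
    positions with at least three samples)."""
    traj = np.asarray(traj, dtype=float)
    errors = [constant_velocity_error(traj), gravity_error(traj)]
    return min(errors) > threshold   # True -> self-propelled / animate
```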

[Diagram: multi-hypothesis tracking loop — Feature Extraction → Matching → Generate k-best Hypotheses → Hypothesis Management (pruning, merging) → Generate Predictions → Delay → back to Matching.]

[Diagram: attention activation computed as a weighted sum (w) of the Skin, Saturation, Motion, and Habituation feature maps.]

The system operates in a sequence of stages (a code sketch of the full pipeline follows this list):

• Visual input is filtered pre-attentively.

• An attention mechanism selects salient targets in each image frame.

• Targets are linked together into trajectories by a motion correspondence procedure.

• The “theory of body” module (ToBY) looks for objects that are self-propelled (animate).

• Faces are located in animate stimuli.

• Features such as the eyes and mouth are extracted to provide head orientation.

• Animate visual trajectories are mapped to arm movements.
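
Since the poster gives no code, here is a minimal, runnable skeleton of that control flow. Every function is a hypothetical stand-in named only for this sketch, so the sequence of stages can be read end to end.

```python
# Hypothetical stand-ins for the pipeline components described above.
def preattentive_filters(frame):  return {"skin": frame, "motion": frame}
def select_targets(maps):         return [(0, 0)]       # salient points
def link_trajectories(targets):   return [targets]      # correspondence
def is_animate(trajectory):       return True           # ToBY stand-in
def find_face(frame, trajectory): return (0, 0, 10, 10) # face box
def head_orientation(face):       return 0.0            # gaze angle
def mimic_with_arm(trajectory):   pass                  # arm primitives

def process_frame(frame):
    """One pass through the perception-to-action pipeline."""
    maps = preattentive_filters(frame)          # 1. pre-attentive filtering
    targets = select_targets(maps)              # 2. attention
    for traj in link_trajectories(targets):     # 3. motion correspondence
        if not is_animate(traj):                # 4. ToBY animacy judgment
            continue
        face = find_face(frame, traj)           # 5. faces in animate stimuli
        if face is not None:
            gaze = head_orientation(face)       # 6. eye/mouth features
        mimic_with_arm(traj)                    # 7. trajectory -> arm
```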

The attention system produces a set of target points for each frame in the image sequence. These points are connected across time by the multi-hypothesis tracking algorithm developed by Cox and Hingorani. The system maintains multiple hypotheses for each possible trajectory, which allows ambiguous data to be resolved as further information arrives.
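
The sketch below shows one heavily simplified update step in that spirit: each hypothesis branches on every gated detection (or a miss), and only the k best survive pruning. Cox and Hingorani's algorithm adds efficient k-best assignment and track merging that are omitted here; the gate and cost values are assumptions.

```python
import numpy as np

def extend_hypotheses(hypotheses, detections, k_best=10, gate=30.0):
    """One update of a simplified multi-hypothesis tracker.
    Each hypothesis is (trajectory, cost); trajectory is a list of
    (x, y) points. Gating keeps the branching factor bounded."""
    extended = []
    for traj, cost in hypotheses:
        last = traj[-1]
        extended.append((traj, cost + gate))    # missed-detection branch
        for det in detections:
            d = float(np.hypot(det[0] - last[0], det[1] - last[1]))
            if d < gate:                         # ignore implausible jumps
                extended.append((traj + [det], cost + d))
    extended.sort(key=lambda h: h[1])
    return extended[:k_best]                     # prune to the k best
```

Keeping only the k best hypotheses is what makes it practical to defer ambiguous assignments: a bad match is carried along cheaply until later frames rule it out.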

Visual input is processed by a set of parallel pre-attentive filters including skin tone, color saturation, motion, and disparity filters. The attention system combines the filtered images using weights that are influenced by high-level task constraints. The attention system also incorporates a habituation mechanism and biases the robot’s attention based on the attention of the instructor.
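
A minimal sketch of that weighted combination, assuming dictionary-keyed feature maps and a simple decay-and-boost habituation rule (both assumptions for illustration):

```python
import numpy as np

def attention_activation(maps, weights, habituation, decay=0.95):
    """Combine pre-attentive feature maps into one activation map.
    `maps` and `weights` are dicts keyed by feature name (skin,
    saturation, motion, disparity); task constraints would adjust
    `weights`. The habituation map suppresses regions the robot has
    already attended to."""
    activation = sum(weights[name] * maps[name] for name in maps)
    activation = activation - habituation
    target = np.unravel_index(np.argmax(activation), activation.shape)
    habituation *= decay                # old habituation fades
    habituation[target] += 1.0          # newly attended region habituates
    return target, activation
```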


Future Research

More Complex Mimicry

One future direction for our work is to look at more complex forms of social learning. We will explore a wider range of tasks and ways to sequence learned actions into more complex behaviors, and we will work on building systems that truly imitate, that is, systems that follow the intent of an action rather than its form.

Understanding Self

We will also explore ideas about how to build representations of the robot’s own body and of the actions that it is capable of performing. The robot should recognize its own arm as it moves through the world, and even be able to recognize its own movements in a mirror through the temporal correlation between its commanded movements and the motion it observes.
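
A toy version of that correlation test is sketched below; the lag search and the normalized correlation score are illustrative assumptions, not a committed design.

```python
import numpy as np

def self_recognition_score(motor_signal, visual_motion, max_lag=5):
    """Correlate the robot's commanded movement with observed motion.
    A consistently high correlation at some short lag suggests the
    motion is the robot's own arm, seen directly or in a mirror."""
    m = (motor_signal - motor_signal.mean()) / (motor_signal.std() + 1e-9)
    best = 0.0
    for lag in range(max_lag + 1):
        v = visual_motion[lag:lag + len(m)]
        if len(v) < len(m):
            break
        v = (v - v.mean()) / (v.std() + 1e-9)
        best = max(best, float(np.mean(m * v)))  # correlation at this lag
    return best  # near 1.0 -> likely self-generated motion
```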

New Head and Hands

[Images: photographs of the robot’s new head and new hands.]