A Multi-View Real-Time Human Pose Estimation System Using...

A Multi-View Real-Time Human Pose Estimation System Using A Hybrid Direct Model-Use Approach

BY: MOHAMED OREABA

Outline

Problem Definition and Motivation

System Design Choices (From L.R):
Human Model (based or free)
View Point (single vs. multiple)
Data Acquisition (different sensors)
Tracking Model (based or free)
Feature Selection
System Initialization
Model-Based Methods (Generic – application specific)

Proposed System Design

Evaluation

Problem Definition: Where does pose estimation fit in computer vision?

Problem Definition

Pose estimation, in general, means determining the position and orientation of an object.

Human pose estimation tries to find the position and orientation of every body joint in a still image or in each frame of a video sequence.

How is it different from Human detection?

[Figure: 10 points; 19 points]

Motivation:

Video surveillance is used in many places such as critical infrastructure, public transportation, office buildings, parking lots, and homes.

However, manually monitoring these cameras is becoming impractical.

Therefore, approaches for automatic video surveillance, including outdoor human activity analysis, are needed.

Design Choices: Model

Model-Free approaches:

Learn a mapping between appearance and body pose.

Lead to fast performance and accurate results for certain actions (e.g., walking poses).

Agarwal, A.; Triggs, B. Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 44–58.

Rogez, G.; Orrite, C.; Martínez-del Rincón, J. A spatio-temporal 2D-models framework for human pose recovery in monocular sequences. Pattern Recognit. 2008, 41, 2926–2944.

Design Choices: Model

Model-Based approaches:

These approaches employ human knowledge to recover the body pose.

They take into account the appearance and structure of the human body, depending on the viewpoint.

They take into account the human motion related to the activity being carried out.

So, which Human Model do you think I should use in this HPE system? Why?

Design Choices: View Point

Single View Point:
Provides less information about the human body.
Makes self-occlusion difficult to handle.

Multiple View Points:
Provide rich information about the human body.
Multi-view camera systems have the advantage that they enable full 3D reconstruction of the human body.
Handle self-occlusion to some extent.
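As a rough illustration of why a second view makes 3D reconstruction possible, the sketch below triangulates one joint with the midpoint method: each camera contributes a viewing ray, and the estimate is the midpoint of the shortest segment joining the two rays. The function name, camera centres, and ray directions are illustrative assumptions, not part of the proposed system.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two viewing rays.

    c1, c2: camera centres (x, y, z).
    d1, d2: ray directions towards the observed joint (any length).
    Returns the midpoint of the shortest segment joining the two rays.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(c1, c2)]  # vector from camera 2 to camera 1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]
```

For example, with hypothetical cameras at (0, 0, 0) and (4, 0, 0) whose rays point along (2, 0, 5) and (-2, 0, 5), the two rays meet at (2, 0, 5). A single ray alone leaves the depth along that ray undetermined.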

So, which do you think I should use in this HPE system? Why?

Design Choices: Data Acquisition

Why use 2D imaging devices? Why not use Kinect, for example?

A single 3D imaging device, such as a ToF sensor or Kinect, acquires only the 3D surface structure visible from that single viewpoint.

In addition, the Kinect sensor has several drawbacks when applied in this system.

The number of imaging devices is another data acquisition design choice.

So, which do you think I should use in this HPE system?

Design Choices: Tracking Model

Tracking-Based Approaches:

Tracking steps can be considered as a mapping from the input space to the body model.

The body model configuration contains both static parameters (i.e., shape and size of each body component) and dynamic parameters (i.e., mean and orientation of each body component). The static parameters are estimated in the initialization step.

Methods differ in how they use and implement the mapping procedures.
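The split between static and dynamic parameters can be sketched as a small data structure. This is only an illustration: the class names, the "ellipsoid" shape label, and the choice of a three-component orientation are assumptions, not the representation used by the proposed system.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class StaticParams:
    """Shape and size of one body component, fixed after initialization."""
    shape: str  # hypothetical label, e.g. "ellipsoid"
    size: Vec3  # extent along the component's own axes


@dataclass
class DynamicParams:
    """Mean and orientation of one body component, updated every frame."""
    mean: Vec3
    orientation: Vec3  # assumed here to be three rotation angles


@dataclass
class BodyModel:
    static: Dict[str, StaticParams]
    dynamic: Dict[str, DynamicParams] = field(default_factory=dict)

    def update(self, name: str, mean: Vec3, orientation: Vec3) -> None:
        """Record one tracking step; static parameters stay untouched."""
        if name not in self.static:
            raise KeyError(f"unknown body component: {name}")
        self.dynamic[name] = DynamicParams(mean, orientation)
```

A tracker built on such a structure would set `static` once during initialization and call `update` for each component in every frame.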

Single Frame-Based Approaches (Tracking-Free):

Methods that have no tracking step are called single frame-based methods.

Because the tracker in tracking-based methods can be lost over long sequences, multiple hypotheses at each frame can be used to improve the robustness of tracking.

The single frame-based approach addresses a more difficult problem because it makes no assumptions about temporal coherence.

However, most of the time, tracking-based methods encounter the issue of initialization or re-initialization of the tracked model.

So, which Tracking Model do you think I should use in this HPE system? Why?

Design Choices: Feature Selection

2D Features from Multi-View:
Approaches that use 2D features, e.g., color, edges, and silhouette.

3D Features from Multi-View:
Approaches that use 3D features reconstructed from multiple views, e.g., volumetric (voxel) data.

Since the real body pose is in 3D, using voxel data can help avoid the repeated projection of the 3D body model onto the image planes to compare against the extracted 2D features.

Several methods are based on voxel data, which indicates that voxel data is a strong cue for body pose estimation.

Voxel reconstruction is computationally expensive, but efficient techniques for this task have been developed.
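A minimal sketch of voxel reconstruction by silhouette carving is shown below. To stay short it assumes orthographic views aligned with the grid axes, which is a strong simplification of a calibrated multi-camera setup; the function and variable names are hypothetical.

```python
# Which two voxel coordinates remain when looking along each axis.
PROJECT = {
    "x": lambda i, j, k: (j, k),
    "y": lambda i, j, k: (i, k),
    "z": lambda i, j, k: (i, j),
}


def carve_voxels(n, silhouettes):
    """Keep the voxels whose projection lies inside every silhouette.

    n: grid resolution per axis.
    silhouettes: dict mapping a view axis ("x", "y" or "z") to an
    n-by-n binary mask seen along that axis.
    Returns the set of occupied (i, j, k) voxel indices.
    """
    occupied = set()
    for i in range(n):
        for j in range(n):
            for k in range(n):
                keep = True
                for axis, mask in silhouettes.items():
                    u, v = PROJECT[axis](i, j, k)
                    if not mask[u][v]:
                        keep = False  # outside one silhouette: carve away
                        break
                if keep:
                    occupied.add((i, j, k))
    return occupied
```

On a 3 x 3 x 3 grid with a single view whose mask contains only the centre pixel, a whole column of three voxels survives; adding a second view with the same mask leaves only the centre voxel. The triple loop also shows where the cost comes from: every voxel is tested against every view.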

So, which do you think I should use in this HPE system? Why?

Design Choices: Initialization

Static parameters, such as the shape and size of each body component.
Dynamic parameters, such as the mean and orientation of each body component.
Static parameters are estimated in the initialization step.

Automatic Initialization:
Some methods have an automatic initialization step.
The user is asked to start in a specific pose (e.g., a stretch pose) to aid the automatic initialization.

Manual Initialization:
Other methods require a priori known or manually initialized static parameters (such as the shape and size of the head), for example from a database.

This prior knowledge is used to design a hierarchical growing procedure for initialization.

So, which Initialization mode do you think I should use in this HPE system? Why?

Evaluation:

The system input can be:
Either a video sequence,
Or a direct feed from the installed multi-view camera system.

Types of evaluation:
Visual only
Synthesized ground truth and joint position error
Public dataset
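The joint position error can be computed in several ways. One common form, assumed here, is the mean Euclidean distance between estimated and ground-truth joint positions; the function name and the dictionary layout are illustrative.

```python
import math


def mean_joint_position_error(estimated, ground_truth):
    """Average Euclidean distance between matching joints.

    Both arguments map a joint name to an (x, y, z) position
    expressed in the same units and coordinate frame.
    """
    if estimated.keys() != ground_truth.keys():
        raise ValueError("joint sets differ")
    if not estimated:
        raise ValueError("no joints to compare")
    total = sum(math.dist(estimated[j], ground_truth[j]) for j in estimated)
    return total / len(estimated)
```

For instance, if the head is off by 3 units and a hand by 4 units, the mean error is 3.5 units. With synthesized ground truth the reference positions are known exactly, so this value can be reported per frame or averaged over a sequence.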


Evaluation: Two Evaluation Schemes

The first one (using video input):

For our own algorithm:
We will try to report our results on two publicly available datasets, the INRIA Xmas Motion Acquisition Sequences (IXMAS) Multi-View Human Action Dataset and the i3DPost Multi-View Human Action and Interaction Dataset, and compare them with other methods that used both datasets for their evaluation.

For our own dataset:
We will try to report the results of existing implementations on our synthesized dataset.

The second one (using interactive HPE):
We will try to report the real-time performance and joint position error in our specific surveillance application.


Thank You
