Improving real-time human pose estimation from multi-view video


Improving real-time human pose estimation from multi-view video

Thesis for the degree of Philosophiae Doctor

Trondheim, May 2013

Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

Rune Havnung Bakken


NTNU
Norwegian University of Science and Technology

Thesis for the degree of Philosophiae Doctor

Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

© Rune Havnung Bakken

ISBN 978-82-471-4406-0 (printed ver.)
ISBN 978-82-471-4407-7 (electronic ver.)
ISSN 1503-8181

Doctoral theses at NTNU, 2013:150

Printed by NTNU-trykk


Abstract

Capturing human motion is a key problem in computer vision, because of the wide range of applications that can benefit from the acquired data. Motion capture is used to identify people by their gait, for interacting with computers using gestures, for improving the performance of athletes, for diagnosis of orthopaedic patients, and for creating virtual characters with more natural looking motions in movies and games. These are but a few of the possible applications of human motion capture.

In some of the application areas mentioned above it is important that the data acquisition is unconstrained by the markers or wearable sensors traditionally used in commercial motion capture systems. Furthermore, there is a need for low latency and real-time performance for certain applications, for instance in perceptive user interfaces and gait recognition. Human pose estimation is defined as the process of estimating the configuration of the underlying skeletal structure of the human body.

In this dissertation several algorithms that together form a real-time pose estimation pipeline are proposed. Images captured with a calibrated multi-camera system are input to the pipeline, and the 3D positions of 25 joints in the global coordinate frame are the resulting output. The steps of the pipeline are: a) subtract the background from the images to create silhouettes; b) reconstruct the volume occupied by the performer from the silhouette images; c) extract skeleton curves from the volume; d) identify extremities and segment the skeletal curves into body parts; and e) fit a model to the labelled data. The pipeline can initialise automatically, and can recover from errors in estimation.

There are four main contributions of the research effort presented in this dissertation: a) a toolset for evaluating shape-from-silhouette-based pose estimation algorithms using synthetic motion sequences generated from real motion capture data; b) a fully parallel thinning algorithm implemented on the Graphics Processing Unit (GPU) that can skeletonise voxel volumes in real time; c) a real-time pose estimation algorithm that builds a tree structure segmented into body parts from skeleton data; and d) a constraint algorithm that can fit an articulated model to a labelled tree structure in real time.


Preface

This dissertation is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfilment of the requirements for the degree philosophiae doctor. The work described in the dissertation has been conducted at the Faculty of Informatics and e-Learning of Sør-Trøndelag University College (HiST) in Trondheim, Norway. The work was funded by a PhD research grant from HiST.

Acknowledgements

First, I would like to thank my main supervisor professor Torbjørn Skramstad, and co-supervisors associate professors Jan H. Nilsen and Bjørn Sæther, for their support and advice. I am very grateful to professor Adrian Hilton for inviting me to visit the Centre for Vision, Speech, and Signal Processing at the University of Surrey, and for the discussions we have had. I am thankful for the feedback and encouragement I have received from Odd Erik Gundersen and Jon Olav Hauglid in the days and hours before crucial deadlines. I wish to thank the co-authors of the published papers included in the dissertation, and although it did not always feel that way when preparing camera ready versions, I am grateful for the constructive feedback from the anonymous reviewers of the published papers which helped improve the final result. Many thanks to Jørgen Løland who took the time to proofread this manuscript.

I thank my former colleagues at the Faculty of Informatics and e-Learning of Sør-Trøndelag University College for creating a pleasant environment in which to work, in particular my fellow PhD candidates Knut Arne Strand and Puneet Sharma who I have shared joys and woes with. Finally, my sincerest gratitude goes to my family and friends who have always supported me, and spurred me on to finish this thing.


Contents

I Context 1

1 Introduction 3
1.1 Motivation and background 3
1.2 Research questions 4
1.3 Research methodology 5
1.4 Research context 7
1.5 Contributions 7
1.6 Publications 8
1.7 Document structure 10

2 Computer vision-based human motion capture and analysis 13
2.1 Input modalities 13
2.2 Application areas 14
2.3 Surveys 16

3 State of the art in human pose estimation 19
3.1 Models and inference 19
3.2 Real-time multi-view approaches 24
3.3 Evaluation 30

II Research results 31

4 Research focus 33
4.1 Assumptions and limitations 33
4.2 Quantitative constraints 34
4.3 Prototype design 37

5 Published papers 39
5.1 Introductory publications 39
5.2 Main publications 42


6 Quantitative and qualitative results 49
6.1 Motion sequences 49
6.2 Processing time 49
6.3 Body parts identification 51
6.4 Accuracy and precision of joint position estimates 52
6.5 Qualitative results 58
6.6 Supplementary results 58

III Evaluation and conclusion 67

7 Discussion 69
7.1 Comparative evaluation 69
7.2 Research questions revisited 70
7.3 Advantages and drawbacks 73

8 Conclusion and future work 79
8.1 Contributions 79
8.2 Future work 81

Bibliography 85

Glossary 99

IV Publications 101

Paper 1 Semi-automatic camera calibration 103

Paper 2 Background subtraction using GPGPU 117

Paper 3 Synthetic data for shape-from-silhouette 131

Paper 4 Real-time skeletonisation 145

Paper 5 Pose estimation using skeletonisation 155

Paper 6 Model fitting to skeletal data 167

Paper 7 View-independent gait recognition 179

V Appendices 191

Appendix A Statements of co-authorship 193


Appendix B Errata 201


List of Figures

1.1 Design science research methodology 6

4.1 Accuracy and precision 36
4.2 Tentative design 38

6.1 Mean 3D error per frame for extremities 54
6.2 Articulated model with bone names 55
6.3 Articulated model with joint names 56
6.4 Box-and-whisker plots of errors of joint positions 57
6.5 Four frames from walk sequence 59
6.6 Model fitted for walk sequence 60
6.7 Valid frame from HumanEva 61
6.8 Invalid frame from HumanEva 61
6.9 Incorrectly labelled frame from HumanEva 62
6.10 Valid frame from IXMAS 63
6.11 Invalid frame from IXMAS 63
6.12 Incorrectly labelled frame from IXMAS 64
6.13 Challenging poses from CVSSP-3D 65
6.14 Cartwheel sequence from children dataset 66

7.1 Relations between papers and research questions 71
7.2 Fast motion 75
7.3 Shoulder displacement 76
7.4 Problem with crouching pose 76
7.5 Problem with feet close together 77


List of Tables

1.1 Relations between research questions and contributions 8
1.2 Relations between research papers and contributions 10

3.1 Comparison of real-time pose estimation methods 25

4.1 Assumptions in motion analysis research 34

6.1 Motion sequences used in experiments 50
6.2 Skeletonisation speed-up using GPGPU 50
6.3 Processing times for skeletonisation 51
6.4 Processing time for body parts segmentation 52
6.5 Processing time for pose estimation 52
6.6 Percentages of incorrectly labelled frames 53
6.7 Bone ratios used to scale articulated model 53
6.8 Comparison of stature estimates 54


Part I

Context


Chapter 1

Introduction

The main topic of this dissertation is computer vision-based human motion capture, specifically real-time human pose estimation. This chapter introduces the motivation behind the work, the research questions that have been answered, and the research methodology used.

1.1 Motivation and background

Computer vision-based human motion capture is a well established area of research that has seen a lot of activity over the past two decades, as a number of surveys show (Moeslund et al., 2006; Ji and Liu, 2010; Sigal and Black, 2010). The activity is motivated by a wide range of applications for robust motion capture, and the high complexity of the task gives rise to many still unsolved problems.

The best-known approach for capturing human motion is the use of reflecting markers placed at anatomical landmarks. Infrared cameras capture light reflected from the markers, and a three-dimensional point cloud is generated. This approach has several drawbacks: the markers are invasive and can cause experimental artefacts, and recreating motion from the point cloud usually involves time-consuming manual post-processing. The non-invasive nature of computer vision-based techniques makes them attractive alternatives.

Motion capture can be used to identify people by their gait, for interacting with computers using gestures, for improving the performance of athletes, for diagnosis of orthopaedic patients, and for creating virtual characters with more natural looking motions in movies and games. There is a need for low latency and real-time performance in some applications, for instance in perceptive user interfaces and gait recognition.

The input for computer vision-based motion capture methods can be from one (monocular), two (stereoscopic) or multiple (multi-view) cameras. There are strengths and weaknesses with all three options. Multi-view approaches have fewer problems with occlusions, but there is increased complexity with calibration and synchronisation of the cameras.

Human motion is governed by an underlying articulated skeletal structure, and in the context of computer vision-based human motion analysis (Moeslund et al., 2006) define pose estimation as the process of finding an approximate configuration of this structure in one or more frames. The configuration can be described by joint angles or positions of the joints relative to some global frame of reference.
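To make the joint-position representation concrete, a minimal sketch in Python follows; the joint names and the choice of a dictionary keyed by name are illustrative assumptions, not the exact model used later in this dissertation.

from dataclasses import dataclass, field

@dataclass
class Pose:
    # Maps a joint name to its 3D position (x, y, z) in the global
    # coordinate frame, e.g. in metres.
    joints: dict = field(default_factory=dict)

pose = Pose()
pose.joints["left_elbow"] = (0.42, 1.31, 0.05)
pose.joints["right_knee"] = (0.10, 0.48, -0.02)

A joint-angle representation would instead store one rotation per bone relative to its parent in the kinematic chain.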

A common approach for estimating the pose of a subject is to track body parts through the motion sequence, often involving the use of the pose estimate from the current and preceding frames to predict the pose in the next frame. This works well in many cases, but can fail for fast motion patterns where the change from one frame to the next is too great for the tracker to follow. Automatic recovery from tracking failures, particularly in real time, is challenging.

Before the motion of the subject can be captured the algorithm must be initialised: an initial interpretation of the current scene must be made before further processing can commence. In many cases this is done manually, and in a comparison of five state-of-the-art real-time pose estimation methods, (Michoud et al., 2007a) found that none of them could initialise automatically in real time.

1.2 Research questions

Based on the motivation presented above the main research question is formulated:

How can real-time markerless multi-view computer vision-based human pose estimation be improved?

This can be split into several sub-questions:

RQ1 Current situation: What is the state of the art in the field of computer vision-based human pose estimation?

RQ2 Data acquisition: How can three dimensional information about the subject be recovered from multi-view video in real time?

RQ3 Estimation:

(a) How can the underlying skeletal structure of the human body be recovered more accurately from the three dimensional reconstructed data?

(b) How can the configuration of the skeletal articulation be reliably estimated from the skeleton data with automatic recovery from errors?


(c) How can the pose estimation algorithm be initialised automatically without prior knowledge of the motion pattern or a specific start pose?

RQ4 Performance and applicability:

(a) How should the performance of the pose estimation algorithm be evaluated?

(b) How can the proposed pose estimation approach be used in a specific application?

The research questions are revisited in section 7.2 with a discussion of how they were answered.

1.3 Research methodology

Computer vision techniques are constructed artefacts, and in that sense computer vision is a science of the artificial, or a design science. Consequently, the design science research methodology (Vaishnavi and Kuechler, 2007) was used in this research project. As can be seen in figure 1.1, the design science method is an iterative process. New discoveries and knowledge from later phases are fed back into the start of the cycle and the process is begun anew. This way the design is constantly evolved and refined into the final artefact. The stages of the design science research methodology and how they were used in this project are described below.

Awareness of problem The research effort originates from the awareness of an interesting problem. This awareness can stem from new developments in the field, or by studying related disciplines from which existing methods can be applied in novel ways. The result of this phase is a proposal for a research project, in this case a PhD project.

Suggestion Existing knowledge or the theoretical base of the problem area is used to suggest possible solutions to the problem. This knowledge was acquired through a literature review of the field, and used to formulate a tentative design for a solution to the problem. This also involves making necessary assumptions about the solution and limiting the scope of the task to be manageable within the constraints of the project. Research questions RQ1 and RQ2 were answered in this phase.

Development During the development phase the tentative design evolves into a complete design of the solution. An artefact is implemented based on the design. A number of new algorithms were developed to answer RQ3.


[Figure 1.1 depicts the design science research cycle: the process steps Awareness of problem, Suggestion, Development, Evaluation, and Conclusion; their outputs Proposal, Tentative design, Artifact, Performance measures, and Results; and the knowledge flows (circumscription, and operation and goal knowledge) feeding back to the start of the cycle.]

Figure 1.1 – The general methodology of design science research, from (Vaishnavi and Kuechler, 2007).

Evaluation The proposal defines a number of criteria for expected behaviour of the artefact. In the evaluation phase the performance of the artefact is assessed and compared against those criteria. Deviations from the expected behaviour are analysed, and hypotheses are formed about the shortcomings of the current solution. As the circumscription arrows in figure 1.1 show, these hypotheses lead to new awareness of problems, and a new iteration of the design cycle begins and an improved design is created. Concretely, for this project, the algorithms developed in the previous phase were evaluated using publicly available datasets, and their accuracy, robustness, and processing time assessed, thus answering RQ4a.

Conclusion Eventually the research effort is concluded, at which point the artefact still might not fully satisfy all expected behaviour, but is considered good enough. The results are consolidated and the new knowledge gained is published. Anomalous behaviour can serve as starting points for new research efforts. The output of this phase was seven research papers and this dissertation. This phase resulted in a tentative answer to RQ4b.

Patterns are defined by (Vaishnavi and Kuechler, 2007) as general techniques for approaching various types of problems. They present patterns for tackling the stages of the design science research methodology, and the principles outlined in those patterns were employed in the research effort presented in this dissertation.

1.4 Research context

The bulk of the work behind this dissertation was performed at the Faculty of Informatics and e-Learning (AITeL) of Sør-Trøndelag University College. The work has been loosely associated with a project on marker-based motion capture at AITeL, and there has been some exchange of ideas between the two projects, particularly on camera calibration and data acquisition. Some of the main ideas behind the dissertation were developed during a four month stay at the Centre for Vision, Speech, and Signal Processing at the University of Surrey in 2010.

1.5 Contributions

The main contributions of this research effort are:

C1 A set of tools that makes it easy to produce silhouette image sequences with arbitrary camera setups by employing sequences from the Carnegie Mellon University (CMU) motion capture database. Furthermore, the toolset generates a synthetic ground truth that can be used for evaluation of new shape-from-silhouette-based methods. The toolset is made freely available to facilitate comparative evaluation.

C2 A novel skeleton-based pose estimation method:

(a) A General-Purpose computation on Graphics Processing Units (GPGPU) implementation of a fully parallel thinning algorithm, that is well suited for use in human pose estimation, and achieves real-time performance.

(b) A pose estimation method based on constructing a tree structure from a skeletonised visual hull. The configuration of the skeletal structure is independently estimated in each frame, which overcomes limitations of tracking, and facilitates automatic initialisation and recovery from erroneous estimates. The method achieves real-time performance on a variety of different motion sequences.


(c) A novel method for fitting a skeletal model to skeletonised voxel data based on constrained dynamics. The method achieves real-time performance on a variety of motion sequences.

Additionally, there are two minor contributions:

C3 A semi-automatic camera calibration tool that simplifies calibration using a checkerboard pattern. The tool is made freely available online.

C4 A discussion on how the proposed pose estimation algorithm in C2b can be used in a gait recognition framework.

Relations between the research questions and contributions are shown in table 1.1.

Table 1.1 – Relations between research questions and contributions.

     | RQ2 | RQ3a | RQ3b | RQ3c | RQ4a | RQ4b
C1   |     |      |      |      |  •   |
C2a  |     |  •   |      |      |      |
C2b  |     |  •   |  •   |  •   |      |
C2c  |     |      |  •   |      |      |
C3   |  •  |      |      |      |      |
C4   |     |      |      |      |      |  •

1.6 Publications

As mentioned in section 1.3 the research effort has resulted in seven published papers. The papers have been divided into two categories based on the significance of their contributions.

1.6.1 Introductory publications

The first two papers were published in peer-reviewed national conference proceedings:

Paper 1 Rune Havnung Bakken, Bjørn Gunnar Eilertsen, Gustavo Ulises Matus, Jan Harald Nilsen, Semi-automatic Camera Calibration Using Coplanar Control Points, In Proceedings of The Norwegian Informatics Conference (NIK) 2009.
This paper presented a semi-automatic camera calibration tool using a checkerboard pattern as control points.


Paper 2 Eirik Fauske, Lars Moland Eliassen, and Rune Havnung Bakken, A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA, In Proceedings of The Norwegian Artificial Intelligence Symposium (NAIS) 2009.
This paper presented a comparison of GPGPU implementations of three background subtraction algorithms using the Compute Unified Device Architecture (CUDA) library.

1.6.2 Main publications

The five main papers were published in peer-reviewed international conference proceedings:

Paper 3 Rune Havnung Bakken, Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods, In Proceedings of The International Symposium on Visual Computing (ISVC) 2012.
This paper presented a toolset for generating synthetic multi-view silhouettes of a moving avatar. The data can be used for planning, development, and comparative evaluation of shape-from-silhouette-based pose estimation algorithms.

Paper 4 Rune Havnung Bakken and Lars Moland Eliassen, Real-time 3D Skeletonisation in Computer Vision-Based Human Pose Estimation Using GPGPU, In Proceedings of The International Conference on Image Processing Theory, Tools and Applications (IPTA) 2012, The special session on High Performance Computing in Computer Vision Applications (HPC-CVA).
This paper presented a GPGPU implementation of a fully parallel thinning algorithm based on the critical kernels framework. Selected as the best paper of HPC-CVA.

Paper 5 Rune Havnung Bakken and Adrian Hilton, Real-time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences, In Proceedings of The International Conference on Computer Vision Theory and Applications (VISAPP) 2012.
This paper presented an algorithm that constructs a tree structure labelled with body parts based on a skeletonised visual hull.

Paper 6 Rune Havnung Bakken and Adrian Hilton, Real-time Pose Estimation Using Constrained Dynamics, In Proceedings of The Conference on Articulated Motion and Deformable Objects (AMDO) 2012.
This paper presented an algorithm based on constraint dynamics for fitting a skeletal model to the labelled tree structure.


Paper 7 Rune Havnung Bakken and Odd Erik Gundersen, View-Independent Human Gait Recognition Using CBR and HMM, In Proceedings of The Scandinavian Conference on Artificial Intelligence (SCAI) 2011.

This paper outlined a hybrid Hidden Markov Model (HMM)/Case-Based Reasoning (CBR) gait recognition framework where the pose estimation algorithm is used to retrieve view-independent 3D gait features.

The relations between the published papers and the contributions are listed in table 1.2.

Table 1.2 – Relations between research papers and contributions.

     | Paper 1 | Paper 2 | Paper 3 | Paper 4 | Paper 5 | Paper 6 | Paper 7
C1   |         |         |    •    |         |         |         |
C2a  |         |         |         |    •    |         |         |
C2b  |         |         |    ◦    |    ◦    |    •    |         |
C2c  |         |         |         |    ◦    |    ◦    |    •    |
C3   |    •    |         |         |         |         |         |
C4   |         |         |         |         |         |         |    •

• main contribution of paper   ◦ contribution built on results from paper

1.7 Document structure

The dissertation is organised as follows:

Part I Establishes the context of the research effort presented in this dissertation.

Chapter 1 This introductory chapter.

Chapter 2 Gives a broad overview of the research field on human motion analysis.

Chapter 3 Examines state-of-the-art research on computer vision-based human pose estimation, with a special emphasis on real-time approaches.

Part II Presents the main results from the research effort.

Chapter 4 Defines assumptions, limitations and constraints on the research effort. Furthermore, a tentative design based on the constraints is presented.


Chapter 5 Summarises the seven research papers published during the research effort.

Chapter 6 Recapitulates the qualitative and quantitative results of the published papers, and presents additional results from experiments on other datasets that were not ready at the time of publication of the papers.

Part III Evaluates the results that were produced.

Chapter 7 Compares the research effort to the state of the art, and discusses strengths and weaknesses of the approach.

Chapter 8 Concludes the dissertation, examines the contributions in detail and outlines directions for future work.

Part IV The seven published papers.


Chapter 2

Computer vision-based human motion capture and analysis

Human motion capture is the process of registering human movement and estimating posture over a period of time. Accurately capturing human motion is a complex task with many possible applications, like automatic surveillance, input for animation in the entertainment industry and biomechanical analysis.

2.1 Input modalities

There are several technologies in use for capturing human motion. Non-optical systems use active sensors to capture the motion, including inertial (Roetenberg et al., 2009), electro-mechanical (Animazoo) and magnetic (Ascension) sensors. Optical motion capture is based on using images from cameras as input, and uses computer vision and pattern recognition techniques to find the position, orientation, limb angles and gestures of the moving subject(s).

Marker-based optical systems Vision-based approaches can be further subdivided into different categories based on which assumptions are made about the number of people being captured, the background behind the subjects, and whether the subjects are wearing some sort of markers that make segmentation from the background easier. The majority of current commercial motion capture systems are based on using reflecting markers at anatomical landmarks and infrared cameras, thus simplifying the segmentation of the target from the background. Examples include (Qualisys), (Vicon) and (Motion Analysis).

Marker-less optical systems The established marker-based motion capture systems have certain drawbacks that have led researchers to look for alternatives. The use of markers or special clothing is invasive, and as (Mundermann et al., 2006c) point out this can restrict natural movements and cause unacceptable experimental artefacts when used in biomechanical motion analysis. Marker-based systems usually require a complex studio setting to work effectively, thus restricting the range of possible applications. As (Azad et al., 2006) point out, marker-based motion capture usually requires time consuming manual post-processing, making such systems less suitable for real time applications. For motion capture in natural environments like automatic surveillance, using markers is simply not possible. Commercial multi-camera marker-less systems have reached maturity in recent years (Organic Motion).

Time-of-flight cameras A time-of-flight (ToF) camera is a scanner-less light detection and ranging (LIDAR) sensor. In contrast to a laser scanner which scans the image sequentially, the ToF camera measures all pixels of the image simultaneously. Laser light pulses are projected onto the scene, and the delay before the reflected light reaches the camera sensor is used to estimate the distance. ToF cameras typically have low lateral resolution (320×240), and they are still expensive. Because of interference between the sensors it is complicated to combine multiple ToF cameras, but it is possible with for instance time multiplexing. The DepthSense sensor (SoftKinetic) is an example of a low-cost ToF camera intended for use in home entertainment.
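The distance estimate follows directly from the round-trip time of each light pulse: with $c$ the speed of light and $\Delta t$ the measured delay between emission and detection, the per-pixel depth is

\[ d = \frac{c \, \Delta t}{2}, \]

so a delay of 10 ns, for example, corresponds to a distance of about 1.5 m.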

Structured light sensors A structured light camera (Scharstein and Szeliski, 2003) projects a known light pattern onto the scene, and estimates distance based on distortion of the pattern. The Microsoft Kinect sensor is a low-cost structured light sensor introduced in late 2010, and its impact on recent developments in pose estimation research will be discussed further in section 3.2.1.

Thermographic cameras Thermographic cameras operate in the infrared spectrum, and measure the temperature of objects in the scene. Humans typically have a higher temperature than their surroundings, so the use of thermal cameras can simplify the detection of people in the scene. Example thermographic cameras for use in research are produced by (FLIR).

2.2 Application areas

One of the main reasons behind the high research activity on motion capture during the last decades is a wide range of possible applications, divided into three broad categories by (Moeslund and Granum, 2001): surveillance, control and analysis.


2.2.1 Surveillance

Automatic surveillance systems are a very active research area within the computer vision community. Potential applications of such systems include:

Identification with behavioural biometrics Video surveillance is ever more common in public places. Robust automatic identification of wanted suspects based on gait could be a valuable aid to law enforcement. A bank robber was apprehended in Denmark after the police was able to identify him by his gait (Lynnerup and Vedel, 2005).

Certain security-sensitive areas require that access is restricted to only a select group of people. Using biometrics to identify people with access is one potential solution. Gait is one such biometric that can be used in conjunction with others such as height and facial features.

Analysis of crowd flux Automatic surveillance techniques can be used to analyse crowd flux in public areas, and used for planning to avoid congestion. A method for measuring the utilisation of sports arenas was presented by (Gade et al., 2012).

Discovery of abnormal behaviour An automatic surveillance system with action recognition could be used to discover abnormal and potentially criminal behaviour in for instance car parks. Relations between people and objects were modelled by (Clapes et al., 2012) and used to detect theft.

2.2.2 Control

Captured motion can be used as control input in various applications:

Gesture-based user interfaces Recognised gestures are used in emerging multi-modal user interfaces.

Character animation The computer gaming industry has a long tradition of using captured motion to animate avatars in games. Motion capture is used in movies to make the movements of virtual actors more realistic.

Telepresence Captured motion can be used for remote manipulation in telepresence systems, for instance for tele-rehabilitation (Obdrzalek et al., 2012) where the patient and doctor can be at geographically distant locations.


2.2.3 Analysis

The focus of analysis applications is performance quantification, to determine how well a motion is performed:

Sports Motion capture is used to evaluate athlete and team performance, and for analysing the mechanics of injuries. A system for analysing the performance of divers was introduced by (Li et al., 2007).

Biomechanical analysis in medicine Captured motion can be used for clinical evaluation of motion patterns for treatment and prevention of diseases that are influenced by changes in movement patterns.

Content-based storage and retrieval Analysis of movements can be used to index and search for particular actions in video content. An approach for skimming and description of 3D videos was presented by (Tung and Matsuyama, 2009).

2.3 Surveys

In the last two decades a plethora of surveys on research in motion capture has been conducted and published. They all have different focus, and are divided into two groups. General reviews are concerned with all aspects of human motion capture without regard for any specific application. Specialised reviews deal with specific application areas or focus on particular aspects of the motion capture process.

2.3.1 General reviews

Early research was focused on low-level vision processing such as segmentation, tracking, pose recovery and trajectory estimation. This is reflected in surveys from that period, like (Aggarwal et al., 1994) and (Aggarwal et al., 1998), who discussed articulated and elastic non-rigid motion.

An early survey presented by (Aggarwal and Cai, 1997) focused on three major areas: motion analysis involving human body parts, tracking of human motion using single and multiple cameras, and recognising human activities from image sequences. Another early review by (Gavrila, 1999) grouped methods into 2D approaches with or without explicit shape models, and 3D approaches.

The most comprehensive reviews to date were those by (Moeslund and Granum, 2001) and (Moeslund et al., 2006). They gave broad overviews of the research field and used a taxonomy based on the different functional stages of the motion capture process. A similar taxonomy was used by (Wang and Singh, 2003), a three-way division of the motion capture process: human detection, tracking and understanding activities.

In addition to full-body human tracking, (Wang and Singh, 2003) also considered tracking of the face and hands. Methods were divided into two classes by (Poppe, 2007): model-based and model-free. Model-based approaches are further subdivided into top-down and bottom-up approaches. Combinations of different methods are also discussed. A brief survey with a high-level description of commonly used methods and possible application areas was presented by (Vasconcelos and Tavares, 2008). Recent surveys by (Ji et al., 2008; Ji and Liu, 2010) considered view-invariant approaches for pose estimation and behaviour recognition.

2.3.2 Specialised reviews

A number of reviews have focused on narrower topics related to human motion analysis.

Action and gesture recognition Recognising actions and behaviour is essential for many applications, and consequently a number of surveys have dealt with this specifically. An early review that focused on action recognition was presented by (Cedras and Shah, 1995). They split the process into two steps: extracting motion information and organising it in a motion model, and matching some unknown input to the constructed model. Learning and understanding activity using model-based approaches for recognition in cognitive computer vision was examined by (Buxton, 2003).

A survey on high-level processing of human action was presented by (Aggarwal and Park, 2004). The survey was divided into human body modelling, level of detail needed to understand human motion, approaches for recognition and high-level recognition schemes with domain knowledge.

A comparison of action in the contexts of computer vision, artificial intelligence and robotics was presented by (Kruger et al., 2007). They focused on representation, recognition, synthesis and understanding of action.

Gesture recognition is a key input modality in perceptual and multi-modal user interfaces, and has received a lot of attention in the human-computer interaction community. Surveys on the use of gestures in human-computer interaction were presented by (Wu and Huang, 1999), (Porta, 2002) and (Jaimes and Sebe, 2005).

Surveillance Visual surveillance in dynamic scenes is an active research area in computer vision. A comprehensive review of recent work on visual surveillance, covering modelling of complex environments, detection of motion, classification of moving objects, tracking, understanding and description of behaviours, and human identification was given by (Hu et al., 2004), while (Zhan et al., 2008) presented an overview of advances in crowd analysis.

Biomechanical analysis Biomechanical analysis of motion for use in medicine and sport has been explored. For instance, (Mundermann et al., 2006b) gave an overview of the evolution of methods for biomechanical applications in a clinical context. Approaches for vision-based motion analysis in sport were reviewed by (Barris and Button, 2008).


Chapter 3

State of the art in human pose estimation

Human pose estimation is a key problem in computer vision, because of the wide range of applications that can benefit from the acquired data. As mentioned in section 1.1, human pose estimation is defined as the process of estimating the configuration of the underlying skeletal structure of the human body from image evidence. Single-frame approaches can estimate the pose from an image (or set of images from multiple cameras) from observations in a single time step. However, pose estimation is often an integral part of a motion tracking framework. Other terms used are articulated pose estimation, human pose recovery, and body configuration recovery, or articulated pose tracking if temporal information is used.

3.1 Models and inference

This section examines different methodological approaches for modelling the human pose estimation problem, and how the body configuration can be inferred. Methods are divided into categories loosely based on the taxonomy used in (Sigal, 2011).

3.1.1 Generative model-based approaches

The goal of generative model-based approaches, also called analysis-by-synthesis, is to model all aspects of the underlying process behind the images of a moving person. Such a complete model is not practical. Assumptions and simplifications must be made, for instance about subject appearance, the environment in which the motion takes place, or the type of motion performed. Still, this class of methods is by far the most popular approach for multi-view pose estimation, with a generative model used for prediction of the most likely pose or sequence of poses.

Human pose estimation is often formulated as a probabilistic problem where the goal is to find the posterior distribution that encodes the probability of a particular pose given a set of image features. In the generative paradigm the posterior distribution is expressed as the product of a prior distribution multiplied with an image likelihood. The prior distribution is typically learned from labelled pose data.
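In standard notation, with $\mathbf{x}$ denoting the pose parameters and $\mathbf{z}$ the extracted image features, this factorisation reads

\[ p(\mathbf{x} \mid \mathbf{z}) \propto p(\mathbf{z} \mid \mathbf{x}) \, p(\mathbf{x}), \]

where $p(\mathbf{z} \mid \mathbf{x})$ is the image likelihood and $p(\mathbf{x})$ is the prior learned from labelled pose data.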

Human motion is non-linear in nature, and the CONDENSATION algorithm (Isard and Blake, 1998), which is capable of tracking multiple non-linear hypotheses, has been the foundation for many research efforts in the past decade. The algorithm, also called the particle filter, works by sampling the probability distribution describing the motion, where each particle is one sample point. A recurring problem with particle filter based approaches was that the high dimensionality of the articulated model (usually >20 degrees of freedom) required a very high number of particles, which in turn increased computational complexity and made the processing very slow. The introduction of hierarchical search strategies, such as the annealed particle filter (Deutscher et al., 2000), drastically improved the processing time.
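As a concrete illustration of the select-diffuse-measure cycle, the following is a minimal sketch of a single particle filter time step in Python; the additive Gaussian motion model and the likelihood interface are simplifying assumptions made for illustration, not the formulation of any specific method cited here.

import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_noise=0.05):
    # particles:  (N, D) array, each row one sampled pose hypothesis.
    # weights:    (N,) normalised weights from the previous frame.
    # likelihood: function mapping a pose vector to p(image | pose).
    n = len(particles)
    # Select: resample hypotheses according to the previous weights.
    idx = np.random.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Diffuse: perturb each hypothesis with the motion model.
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # Measure: re-weight by the image likelihood and normalise.
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    return particles, weights

With more than 20 pose dimensions, covering the state space this way requires very many particles, which is exactly the cost problem that annealed and hierarchical variants address.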

Different hierarchical search strategies have been developed recently. The interacting simulated annealing (ISA) approach (Gall et al., 2007, 2008, 2009, 2010) is a generalisation of the annealed particle filter where different resampling procedures can be used. An energy functional describing the model-to-image alignment using textures and silhouettes was defined, and minimised with ISA. The subject's pose was then estimated by autoregression over a rigid motion history. Motion priors were learned, and worked as soft constraints on the anatomical structure.

Illumination invariance was the focus of the annealed particle filtering framework presented by (Lu et al., 2010). The CIELab colour model was used to improve robustness under changes in lighting conditions.

Building flexible models that can be adjusted to fit different subjects has been the focus of some research. Automatic adjustment of a generic model to become subject specific was presented by (Alcoverro et al., 2010). The model was then used in a hierarchical particle filtering framework. A scalable human model with levels-of-detail was employed by (Canton-Ferrer et al., 2011), and the subject was tracked with a hierarchical annealed particle filter. The self-training approach presented by (Bandouch et al., 2012) could learn an anthropometric model automatically from image evidence. Branched iterative hierarchical sampling was used to track the subject.

Recently, particle swarm optimisation (PSO) has been employed (Ivekovic and Trucco, 2006) to perform the stochastic search for tracking the upper body. The particles are placed in a swarm intelligence framework and can interact with each other. Hierarchical PSO (John et al., 2009) extended the earlier work to track full-body motion. Adaptive PSO (John et al., 2010) reduced the number of particles required and thus improved processing time. Furthermore, the adaptive PSO algorithm could initialise automatically and recover from errors. Annealing was combined with PSO by (Zhang et al., 2010) and (Kwolek et al., 2011b) to reduce inconsistencies between the observed human pose and the estimated configuration of the 3D model.

3.1.2 Discriminative model-based approaches

Also called training-based, feature-based, and pose regression, discriminative approaches learn the posterior distribution directly from training data and do not use a prior distribution. Discriminative methods are typically faster than generative methods, but do not generalise well to motions not seen during training.

The mixture of experts (MoE) approach (Sminchisescu et al., 2007) solved visual ambiguities by combining the output of several classifiers. Different classifiers were assigned to different regions in space. In ambiguous regions multiple experts competed to find the solution that best explained the observations.

The pose-regression method presented by (Pourdamghani et al., 2012) modelled a lower-dimensional manifold as a graph. By utilising the manifold structure of the input data, unlabelled data could be used for semi-supervised learning, and the amount of training required was reduced.

A regression-based method for estimating the orientation of a subject from a single silhouette was presented by (Pierard and Droogenbroeck, 2012). Regression and supervised learning were employed, and 18 different shape descriptors were evaluated for extracting features from the silhouette images.

3.1.3 Exemplar-based approaches

Rather than using training data to learn a model, exemplar-based approaches use the training data directly to infer the pose. The exemplars are stored in a database or pose dictionary, and for each frame of a motion sequence, a lookup is done in the database to find the pose most similar to the image evidence.
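The per-frame lookup can be sketched as a nearest-neighbour search; the Euclidean distance on feature vectors used below is an illustrative assumption, since the methods discussed in this section each define their own descriptors and matching costs.

import numpy as np

def lookup_pose(descriptor, exemplar_features, exemplar_poses):
    # descriptor:        (D,) feature vector extracted from the current frame.
    # exemplar_features: (N, D) feature vectors of the stored exemplars.
    # exemplar_poses:    length-N sequence of corresponding pose parameters.
    dists = np.linalg.norm(exemplar_features - descriptor, axis=1)
    return exemplar_poses[int(np.argmin(dists))]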

The data-driven framework presented by (Germann et al., 2011) could utilise low-resolution, loosely calibrated cameras in uncontrolled environments. First, a rough pose was estimated by matching silhouette images with a large exemplar database. The estimated pose was then refined in a temporal optimisation step across the motion sequence.

The single-frame pose recovery framework presented by (Hofmann and Gavrila, 2009, 2011) first matched candidate 3D poses with a library of 2D pose exemplars. In the next step the candidate poses were re-projected into each camera view to rank them according to the image likelihood. Temporal integration over the pose candidates for each frame was done to find pose trajectories that both matched the observations well, and exhibited smooth motion.

Rather than matching image features to exemplars, (Tanaka et al., 2007) presented an approach based on extracting the skeleton from the reconstructed visual hull. A graph built from the skeleton data was then matched to a database of exemplar graphs to identify body parts. Finally, curvature analysis across the motion sequence was used to estimate joint positions.

Recently, (Rogez et al., 2012) presented a hybrid classification algorithm that combined cascades of rejectors and randomised forests. Cascade classifiers can quickly reject a large number of negative candidates. Each hierarchical cascade was randomly sampled to add robustness to noise and prevent over-fitting. A database of images with corresponding human poses was used to train the ensembles of cascades.

3.1.4 Part-based approaches

The basic idea behind probabilistic assembly-of-parts is to search for the constituent parts of a more complex object, and then combine them in the way that best fits the observations. In the case of human pose estimation the locations of single body parts are detected first, and then integrated into a configuration for the entire body. This bottom-up approach has most commonly been used for 2D pose estimation in monocular images. An attractive feature of part-based approaches is that the pose can be estimated from a single image. Temporal information can be integrated to make the estimate consistent across an entire sequence.

The pictorial structures (PS) approach (Felzenszwalb and Huttenlocher, 2005) has been the basis of a lot of recent research because of its simplicity and efficiency. Implementations like (Andriluka et al., 2009) could handle millions of potential configurations for each body part. Several extensions to the PS paradigm have been presented recently.
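For reference, the pictorial structures matching problem is commonly written as minimising a sum of per-part appearance costs and pairwise deformation costs over a part configuration $L = (l_1, \dots, l_n)$:

\[ L^{*} = \arg\min_{L} \Big( \sum_{i=1}^{n} m_i(l_i) + \sum_{(i,j) \in E} d_{ij}(l_i, l_j) \Big), \]

where $m_i(l_i)$ measures how poorly part $i$ matches the image at location $l_i$, and $d_{ij}$ penalises unlikely relative placements of parts connected by an edge $(i, j)$ in the kinematic tree $E$. For tree-structured models this minimisation can be solved efficiently with dynamic programming, which is what makes handling millions of candidate locations per part feasible.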

The parts used in the PS paradigm are rigid; however, the articulated structures whose configurations are estimated with PS typically consist of deformable parts. Extending PS with deformable parts was addressed by (Zuffi et al., 2012). The parts in the deformable structures model were represented by low-dimensional shape deformation spaces, and pairwise potentials between the parts describe how they deform with different poses. Because object boundaries were more accurately modelled by deformable parts, the image likelihood could be made more discriminative than with rigid parts.

Human pose co-estimation (Eichner and Ferrari, 2012) extended PS to estimate the same pose among multiple subjects, for instance dancers performing coordinated dance moves. In a similar vein (Andriluka and Sigal, 2012) extended PS to model interactions between multiple performers. Each performer was taken into account as context for the motions of the other performer, and this was used for 3D pose estimation from monocular images of dancing couples.

As mentioned earlier, part-based approaches have most commonly been used in monocular pose estimation; however, there are some examples of part-based multi-view methods as well. Typically, body part detectors have been used to provide bottom-up information for automatic initialisation and recovery in particle-based belief propagation tracking frameworks. Examples include (Sigal et al., 2004), (Gupta et al., 2008), and (Chen and Ji, 2010).

3.1.5 Indirect model-based approaches

Indirect model-based approaches do not use a motion model directly for prediction, but rather employ an a priori model as a reference for interpretation of the observations. Correspondence-based methods have been very popular. The main idea is to establish correspondences between extracted image features and 3D locations of an articulated model. Once the correspondences have been found an optimisation algorithm is used to minimise an error function between the data points and the model.

A generic model was refined to fit the subject using a Bayesian network by (Mikic et al., 2003). Visual hulls reconstructed from multiple views were segmented into body parts, and correspondences with the model found. The model was aligned with the data by minimisation of the Mahalanobis distance between corresponding body parts.

Fitting a kinematic model to the skeletal data is an approach taken by several researchers. A tracking and pose estimation framework where Laplacian Eigenmaps were used to segment voxel data and extract a skeletal structure consisting of spline curves was presented by (Sundaresan and Chellappa, 2006, 2008). Correspondences between the spline curves and a skeletal model were established and the joint configuration found using nonlinear optimisation. The pose estimation framework described by (Moschini and Fusiello, 2008, 2009) used a hierarchical articulated iterative closest point (ICP) algorithm to fit a stick figure model to a set of data points on skeleton curves.
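The core of such a fit can be sketched as a single rigid ICP iteration (closest-point matching followed by a least-squares alignment via the Kabsch algorithm); the cited hierarchical articulated variants apply steps of this kind per limb along the kinematic chain rather than to the model as a whole.

import numpy as np

def icp_step(model_pts, data_pts):
    # Match each model point to its closest data point.
    d2 = ((model_pts[:, None, :] - data_pts[None, :, :]) ** 2).sum(axis=-1)
    matched = data_pts[d2.argmin(axis=1)]
    # Solve for the least-squares rotation and translation (Kabsch).
    mc, dc = model_pts.mean(axis=0), matched.mean(axis=0)
    H = (model_pts - mc).T @ (matched - dc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dc - R @ mc
    return model_pts @ R.T + t  # model points after this iteration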

Biomechanical analysis was the target application of the approach presented by (Corazza et al., 2010). A subject specific model was acquired with a laser scanner. Correspondences between the model and segmented visual hulls were determined using articulated ICP, and the optimal fit was found with the Levenberg-Marquardt algorithm. The approach was highly accurate and achieved 3D errors on the order of 1.5 cm per joint.

Pose estimation from multi-view video usually requires that the cameras are synchronised; however, (Elhayek et al., 2012) presented an approach that utilised an unsynchronised camera system. An appearance-based articulated model consisting of 3D Sums-of-Gaussians (SoGs) was fitted to image data represented by 2D SoGs. The model was projected into the image planes for similarity measurement. An energy functional was defined over the entire motion sequence and minimised using conditioned gradient descent. Because the images were not captured simultaneously for all the camera views, the temporal resolution of the estimates was increased.

3.1.6 Model-free approaches

There has been some research on methods that forgo any model knowledge to infer articulated poses from motion sequences.

In the model-free motion capture method proposed by (Chu et al., 2003), volume sequences were transformed into a pose-invariant intrinsic space, removing pose-dependent nonlinearities. Principal curves in the intrinsic space were then projected back into Euclidean space to form a skeleton representation of the subject. The algorithm required three passes through pre-recorded volume sequences.

Both (Brostow et al., 2004) and (Theobalt et al., 2004) sought to find the skeletal articulation of arbitrary moving subjects. Both approaches enforce temporal consistency for the entire structure during the motion sequence. Neither focused on human pose estimation specifically, however, and no inferences were made about which parts of the extracted skeletons correspond to limbs in a human skeletal structure.

3.2 Real-time multi-view approaches

The rest of this chapter will deal with the core topic of this dissertation: real-time multi-view pose estimation. Estimating pose in real time opens up new possibilities in applications of human motion analysis. Gait recognition, perceptive user interfaces, and performance-based animation are examples of applications that benefit from interactive frame rates and low latency. However, with an upper bound on the time available for processing of each frame some tradeoffs must be made. In this section research on real-time pose estimation from the past decade is examined in detail. Table 3.1 summarises the findings.

Early forays into real-time pose estimation were made by (Cheung et al., 2000) and (Luck et al., 2001, 2002). Both were correspondence-based approaches that consisted of fitting models to voxel data obtained with shape-from-silhouette algorithms.

A model consisting of six ellipsoidal shells, one for the torso, one for the head, and one for each limb, was fitted to surface voxels by (Cheung et al., 2000). The coarse model was not suitable for accurate pose estimation, and was mainly intended as an automatic initialisation procedure for a more advanced algorithm.

A physics-based pose estimation approach was presented by (Luck et al., 2001, 2002). Spring forces were modelled from the voxels of each body part to corresponding segments of a kinematic chain. The partition of the visual hull in the previous time step was used to segment the voxels into body parts in the current time step, after initialisation with a special pose. It was assumed that only small


Table 3.1 – Comparison of real-time pose estimation methods.

Citation | Frame rate | Advantages | Shortcomings
(Cheung et al., 2000) | 16 fps | Model-free automatic initialisation | Coarse body part estimate not suited for estimation of joint locations
(Luck et al., 2002) | 20 fps | Automatic generation of model from initialisation pose | Special pose for initialisation, limited tracking capabilities
(Caillette et al., 2008) | 10–30 fps | Fast/complex motions, automatic initialisation and recovery | Limited to motions seen during training
(Tangkuampien and Suter, 2006) | 10 fps | Works with uncalibrated cameras | Requires prior knowledge of tracked motion
(Okada et al., 2006) | 25 fps | Requires only two camera views | Limited by exemplars in pose dictionary
(Menier et al., 2006) | 1 fps | Can track fast motion | Low frame rate, not real-time initialisation
(Michoud et al., 2007c) | 30 fps | Automatic initialisation | Assumptions about subject's clothing because of skin colour detection
(Wu et al., 2008) | 30 fps | Single-frame estimation | Strong heuristics, limited number of poses
(Bergh et al., 2009) | 30 fps | Single-frame estimation | Can only classify poses seen during training
(Hirai et al., 2010) | 60 fps | Works with loose-fitting clothing, high frame rate | Limited to motions seen during training
(Raynal et al., 2010) | 3–17 fps | Automatic initialisation | Only labels body parts, no joint locations
(Pan et al., 2011) | 10–12 fps | Automatic initialisation | Limited to upright poses
(Zhang and Seah, 2011) | 4–9 fps | Automatic acquisition of subject-specific model | Requires user interaction, initialisation not real-time
(Straka et al., 2011) | >30 fps | Single-frame estimation | Cannot identify left and right limbs, limited to upright poses where head is highest
(Huo and Hendriks, 2012) | 10–13 fps | Multiple performers | Requires initialisation pose, only upper body
(Kwolek et al., 2012) | 3–15 fps | High accuracy | Manual initialisation, requires 200 ms image processing per frame


It was assumed that only small adjustments between the voxels and the model were needed between frames to find the correct associations between the voxels and the model limbs. This assumption is not valid for fast and complex motion patterns.

The real-time tracking framework presented by (Caillette and Howard, 2004a,b; Caillette et al., 2005, 2008) used Variable Length Markov Models (VLMMs) to learn complex motion patterns. The features were clustered into elementary movements, and the VLMMs were trained to describe higher-level motion by modelling transitions between the elementary movements. Tracking was done using a particle filter with the VLMM as motion prior. Fast evaluation of the likelihood was achieved by fitting Gaussian blobs to a hierarchically reconstructed visual hull. The VLMM enabled tracking of fast and complex motions at up to 30 fps, and automatic initialisation and error recovery could be done with the blob fitting scheme; however, the predictive capabilities of the model were limited to motion patterns seen during training.

The high dimensionality of the full-body pose estimation problem has traditionally made tracking with particle filters infeasible for real-time applications. The introduction of non-linear dimensionality reduction methods has made tracking in latent space with fewer particles possible. A back-constrained Gaussian Process Latent Variable Model was used by (Hou et al., 2007) to simultaneously learn a compact low-dimensional representation of the training data and a probabilistic mapping from the subspace of motions to the parameter space. This was incorporated in the algorithm for clustering of elementary movements presented in earlier work (Caillette et al., 2008).

The manifold learning approach presented in (Tangkuampien and Suter, 2006) was capable of recovering human poses at 10 fps from two uncalibrated camera views. Synthetic data were used to learn two manifolds for silhouettes and poses, and a mapping between the two subspaces. Incoming silhouettes from the cameras were run through the manifolds, and the unknown pose reconstructed. The method made no assumptions about the subject, but required prior knowledge of the type of motion to be captured.

A virtual fashion show system was presented by (Okada et al., 2006). Subject-specific models created with a 3D laser scanner were used to build a pose dictionary. Incoming silhouettes from two cameras were compared to exemplars from the pose dictionary using a hierarchical search scheme, achieving a frame rate of 25 fps. The pose dictionary was constrained to a limited set of motions and did not generalise to other applications without increasing the number of known poses, thus increasing the time required to search the dictionary. Furthermore, the process of acquiring the subject-specific model is time consuming and limits the method to applications where the subject is known beforehand.

The approach presented by (Menier et al., 2006) used Delaunay triangulation to extract a set of skeleton points from a closed surface visual hull representation. After pruning spurious points from the data, an articulated model was fitted to the data skeleton points using expectation maximisation.


The joint configuration in the previous time step was used as the initial guess in the model fitting process. The method achieved close to real-time performance and could track fast motion sequences, but initialisation of the tracker was not real-time.

Home entertainment was the intended application for the tracking approach presented by (Michoud et al., 2007a,b,c). The method consisted of reconstructing a visual hull from silhouettes, and identifying skin-coloured surface voxels. Regions with skin colour were used to establish the positions of the head (face) and hands. A model made up of cylinders corresponding to body parts was fitted to the visual hull voxels hierarchically, starting from the head. During tracking, temporal consistency was used to fit body part cylinders to the visual hull. The method achieved a frame rate of 30 fps and could track fast and complex motions; however, the reliance on skin colour detection put restrictions on the subject's clothing and start pose, and the stochastic skin segmentation algorithm required a large training set in order to yield robust results with variations in lighting and ethnicity.

A distributed camera network consisting of wireless smart cameras was presented by (Wu and Aghajan, 2008; Wu et al., 2008). A bottom-up approach for estimating a subject's pose as a means for controlling a virtual ball game was implemented in the camera network. Body part detection was performed on an embedded processor in the cameras, producing locations for the head, hands and shoulders in each input image. Noise filtering and estimation of 3D positions of the body parts were done on a central computer. Two frontal camera views were used in the prototype system. Because of strong heuristics in the detection of body parts and inference of 3D positions, the allowed poses were fairly limited. However, poses could be estimated at 30 fps for a simple human-computer interaction application.

An exemplar-based distributed multi-camera system was presented by (Bergh et al., 2009). Six computers connected to six cameras performed background subtraction, and a central computer built visual hulls from the silhouette images using voxel carving. A classifier was trained using average neighbourhood margin maximisation (ANMM), yielding a transform from incoming voxels to a lower-dimensional space that maximised separability of the pose classes. To improve performance the transform was approximated with 3D Haarlets. Incoming visual hulls were transformed and the resulting coefficients were matched to example poses in the database using nearest neighbour. To make the classifier rotation invariant, an overhead tracker was used to determine the direction the subject was facing. The system was able to classify 50 different poses with 97.52 % accuracy, at a frame rate of 40 fps. The overhead tracker needed manual initialisation, and the classifier was limited by the types of poses used during training.

The pose regression framework presented by (Hirai et al., 2010) used Gaussian Process Dynamical Models to learn mappings between time steps in latent space, and from volume latent space to volume space. Visual hulls were reconstructed from silhouette images and converted into a rotation-invariant volume descriptor, used both during offline learning and online tracking. Tracking was implemented with a particle filter in the volume latent space. To estimate the subject's pose, a mapping from pose latent space to joint angles was learned with Principal Component Analysis. This tracking method performed at 60 fps and was robust even with loose-fitting clothing, but was limited by the volumes seen during training.

Analysis of skeletonised visual hulls can be done in real time. A method for automatic initialisation based on homeomorphic alignment of a data skeleton with a weighted model tree was presented by (Raynal et al., 2010). The alignment was done by minimising the edit distance between the data and model trees. The method was intended to be used as an initialisation step for a pose estimation or tracking framework, but could also be used as a single-frame pose estimation approach with a frame rate of 3–17 fps depending on the volume resolution used. Automatic initialisation was also the goal of the method presented by (Pan et al., 2011). The strategy used for identifying body parts in the skeleton data was similar to the one (Michoud et al., 2007c) used for visual hulls. The algorithm was limited to handling upright poses, and achieved a frame rate of 10–12 fps.

A real-time skeletal graph matching-based pose estimation approach was presented by (Straka et al., 2011). A visual hull was reconstructed from silhouette images and skeletonised using voxel scooping. The resulting skeleton graph was matched with a template graph describing the human skeletal structure. By comparing matrices of geodesic distances the extremities were identified, and an automated skinning procedure was used to fit the template skeleton to the skeleton data. The method recovered the labelled skeleton independently in each frame and initialised automatically, but it required the subject to be upright with the head as the highest point throughout the entire motion sequence, and the limbs had to be labelled as left or right manually.

The tracking framework presented by (Zhang et al., 2011a,b; Zhang and Seah, 2011) used a niching particle swarm optimisation approach, a search technique that can identify multiple optima in the configuration distribution. This was combined with a local search optimisation to refine the estimate. A dynamic articulated subject-specific body model was used for prediction. The subject-specific model was automatically acquired by fitting a generic model to the initial volume; however, user interaction was required to fine-tune the model fitting. The tracker performed at 4–9 fps with reasonably good accuracy (6–7 cm 3D error) without strong motion priors.

The framework presented by (Huo and Hendriks, 2012) could track the upper body of up to three people in natural environments. Tracking was done with multiple camera views to avoid occlusions. A hierarchical particle filter that first located the torso and then the arms was employed. Multiple image cues (silhouettes, gradient orientation, and appearance) were used in the image likelihood formulation. A special pose was required to initialise the tracker. 3D errors of 12–19 cm for the joint location estimates of the three people were achieved while tracking at 10–13 fps.

A tracking framework based on parallel PSO was presented by (Kwolek et al., 2011a, 2012). The main contribution was a parallel PSO algorithm with resampling that could handle noisy input data, and was resilient towards getting stuck in local extrema. This allowed tracking of simple walking sequences in a four-camera setup. The method achieved tracking with fairly high accuracy (3–5 cm 3D error) at up to 15 fps, but this did not include necessary image processing steps requiring 200 ms per frame, and the tracker had to be initialised manually.

3.2.1 Depth maps

A depth map is a 2D image augmented with depth information, sometimes referred to as 2.5D data. Stereo maps are depth maps generated by fusing the information from two or more cameras. Producing a stereo map is a complex process consisting of several computationally expensive steps, including radial undistortion, rectification, stereo matching, and triangulation. Finding corresponding points in the images for stereo matching is challenging and dependent on the level of surface texturing.
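For orientation, a minimal sketch of those steps with OpenCV's stereo routines is shown below; the calibration inputs and matcher parameters are assumptions for the example, not tied to any system discussed in this chapter.

```python
import cv2
import numpy as np

def depth_from_stereo(left, right, K1, D1, K2, D2, R, T):
    """left/right: an image pair; K*, D*, R, T: intrinsics, distortion,
    and the rotation/translation between the two cameras, all assumed
    to come from a prior stereo calibration."""
    size = (left.shape[1], left.shape[0])
    # Rectification: make corresponding points lie on the same image row.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    m1a, m1b = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    m2a, m2b = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    left_r = cv2.remap(left, m1a, m1b, cv2.INTER_LINEAR)
    right_r = cv2.remap(right, m2a, m2b, cv2.INTER_LINEAR)
    # Stereo matching: semi-global block matching on the rectified pair.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64,
                                 blockSize=5)
    disparity = sgbm.compute(left_r, right_r).astype(np.float32) / 16.0
    # Triangulation: reproject disparity to 3D points (Z is the depth).
    return cv2.reprojectImageTo3D(disparity, Q)
```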

Using novel sensor technology like structured light or time-of-flight (ToF) cameras greatly simplifies the process of working with depth maps. The introduction of the Microsoft Kinect sensor has spurred a flurry of activity in research on real-time pose estimation from depth maps. The publication introducing the pose estimation algorithm used by the Xbox game console (Shotton et al., 2011) has gained over 300 citations (on Google Scholar) in little more than a year (June 2011–Sep. 2012). Discriminative algorithms have been popular for processing the depth maps; (Shotton et al., 2011) trained a classifier with large amounts of motion capture data and classified pixels into body parts, while (Sun et al., 2012) and (Holt and Bowden, 2012) used regression forests. There have also been correspondence-based approaches, such as (Qiao et al., 2012), who used hierarchical ICP to fit an articulated model to the depth data.

The lack of full 3D information means that algorithms processing depth maps still have problems with self-occlusions. However, recent work presented by (Caon et al., 2012) demonstrated that it is possible to calibrate and synchronise multiple Kinect sensors, and this could pave the way for new pose estimation algorithms that do not suffer from self-occlusions. Another problematic issue with the use of depth maps is the limited range of current sensor technology.


3.3 Evaluation

To evaluate and compare human pose estimation approaches there is a need for standard benchmark datasets. While there is a fair number of datasets available for evaluation of monocular pose estimation (eight reviewed by (Andriluka et al., 2011)), the availability of multi-view datasets for evaluation is limited (only two reviewed by (Andriluka et al., 2011)).

The HumanEva (Sigal and Black, 2006; Sigal et al., 2010) datasets provide synchronised multi-view images and ground truth data from a marker-based motion capture system. The HumanEva-I dataset consists of 48 image sequences, captured with four greyscale and three colour cameras. Six actions are repeated twice by four different subjects. The HumanEva-II dataset has two image sequences with one subject each, captured with four synchronised colour cameras.

The multi-view image data in the multimodal motion capture indoor dataset (MPI08) (Pons-Moll et al., 2010; Baak et al., 2010) is augmented with ground truth data from five inertial sensors attached to the extremities (wrists, lower legs, and neck). The dataset consists of 54 motion sequences with four different subjects performing everyday activities and complex motion patterns. The images were captured with eight calibrated cameras in front of green screen backgrounds to simplify silhouette extraction. Additionally, laser-scanned meshes of all four subjects are provided.


Part II

Research results


Chapter 4
Research focus

Capturing the motion of a human performer is a highly complex task, and some assumptions about the conditions in place must be made. Furthermore, some quantitative constraints on the solution should be defined. Based on these assumptions and constraints, a high-level design for a real-time pose estimation prototype was created.

4.1 Assumptions and limitations

In their survey, (Moeslund and Granum, 2001) discussed a number of assumptions typically made in computer vision-based human motion analysis research. The assumptions can be divided into three categories: assumptions about the movement, the appearance of the environment, and the appearance of the subject. The assumptions are listed in table 4.1.

Similarly, a number of assumptions were made to limit the scope of the research effort presented in this dissertation:

Environment The motion takes place in a controlled, indoor environment. The background is static, and the lighting is kept constant.

Silhouettes Silhouette images can be extracted reliably. Background subtraction techniques are commonly used to separate foreground from background, and recent advances have made this a relatively robust process, at least for indoor scenes with static backgrounds, as demonstrated in surveys by (Benezeth et al., 2008, 2010). This issue was investigated in paper 2.

Subject There is only one subject inside the viewing volume at a time, but the subject is free to enter or leave the workspace. Otherwise there are no assumptions about the subject, nor are there any restrictions on the movements of the subject.

Table 4.1 – Common assumptions made in human motion analysis research, from (Moeslund and Granum, 2001).

Movement | Environment | Subject
Only inside workspace | Constant lighting | Known start pose
Stationary camera | Static background | Known subject
Only one subject | Uniform background | Markers placed on subject
Subject faces the camera | Known camera parameters | Clothes with special colour
Parallel to camera plane | Special hardware | Tight-fitting clothes
No occlusion | |
Slow and continuous | |
One or few limbs move | |
Known motion pattern | |
Flat ground plane | |

Multiple cameras A multi-camera system consisting of regular video cameras is used to capture the motion. Studies by (Starck et al., 2009) and (Mundermann et al., 2005) found that at least eight cameras are required for good 3D reconstruction results using shape-from-silhouette.

Synchronisation The cameras are synchronised. Unsynchronised capture in a multi-camera system is possible, as was demonstrated by (Elhayek et al., 2012), but this approach is not applicable in a real-time system since the algorithm requires multiple passes through the image sequences. Apart from the synchronisation requirement no special hardware is needed.

Calibration The cameras remain stationary and do not pan during capture. Intrinsic and extrinsic camera parameters are acquired through calibration. Paper 1 investigated available calibration options, and presented a semi-automatic calibration tool.

4.2 Quantitative constraints

This section attempts to define some quantitative constraints for the proposed pose estimation algorithm, in terms of processing time, accuracy, and robustness.


Real-time processing Estimating the pose of a subject in a motion sequence can be viewed as a signal processing task with a discrete-time input signal. In the signal processing literature (Kuo et al., 2006), a real-time system is defined as one whose bandwidth is limited by the sampling rate. The delay between the input and the output is at most one sample interval, and for a real-time system the signal processing time must be less than the sampling period. For a real-time pose estimation algorithm this means that the processing time available for each frame is the reciprocal of the frame rate of the cameras used. For instance, for a camera system capturing at 25 Hz, the processing time available for each frame is 40 ms. The cameras commonly used in computer vision systems have a frame rate between 20 and 30 Hz, so the target processing time per frame for the proposed pose estimation algorithm is 33–50 ms. If a delay of more than one frame is acceptable, and the steps of the algorithm can be pipelined, then this bound pertains to each step of the algorithm, rather than the processing time for the entire process.
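The arithmetic behind these bounds is a one-liner; the following sanity check reproduces the 33–50 ms target range.

```python
# Per-frame budget: with a camera running at rate_hz, each frame must be
# processed within one sampling period (the reciprocal of the frame rate).
def per_frame_budget_ms(rate_hz: float) -> float:
    return 1000.0 / rate_hz

for rate_hz in (20, 25, 30):
    print(f"{rate_hz} Hz -> {per_frame_budget_ms(rate_hz):.1f} ms per frame")
# Prints 50.0, 40.0 and 33.3 ms; with pipelining the same bound applies
# per stage rather than to the end-to-end processing time.
```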

Latency In analysis and surveillance applications where direct user response is not necessary, a longer delay in processing can be perfectly acceptable, but in control applications system responsiveness is important for the quality of user interaction. System responsiveness is the time elapsed from when the user performs an action until a response is displayed by the system. For selection and manipulation tasks 75–100 ms has been found to be a reasonable threshold (Starner et al., 2003).

The system responsiveness is the sum of the system latency and the response display time. For the proposed pose estimation algorithm, the focus is on system latency, as there is no specific display considered. The system latency is the time it takes for the system to capture images, process them and estimate the pose, and to communicate the result to the display. The recommendation by (Ware and Balakrishnan, 1994) is to have an input device with a latency of no more than 50 ms for manipulating objects in a virtual reality setting. Consequently, the target latency for the proposed pose estimation algorithm is 50 ms.

Accuracy and precision Accuracy and precision are quantities that are related to the error of the estimated pose. The error of an estimate is computed using the Euclidean distance between the expected and estimated positions. High accuracy entails that the average of the estimates is close to the actual value, and a small spread in the estimates results in high precision. These relations are illustrated in figure 4.1. It is quite difficult to predict how accurate and precise the pose estimation algorithm is required to be in order to be used for a particular application. As was mentioned in section 3.2, (Kwolek et al., 2012) reported an accuracy of 35–60 mm, and claimed that this was sufficient for gait recognition, but no experimental data were offered to support that claim.


Figure 4.1 – Accuracy and precision. (a) Joint estimates that are neither accurate nor precise. (b) Estimates with high precision, but low accuracy. (c) A perfectly accurate and precise estimate. This is unrealistic, and with real data there will always be some amount of noise. (d) A set of reasonably accurate and precise joint estimates.

At the extreme low end, (Corazza et al., 2010) achieved an accuracy of the order of 15 mm with biomechanical analysis as the intended application, but their approach was far from real-time, and it is probably unrealistic to expect that level of accuracy from a real-time system. Both (Shotton et al., 2011) and (Straka et al., 2011) use 100 mm as an upper bound on joint position estimates to count those estimates as true positives in real-time settings.
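In code, these two quantities reduce to the mean and spread of the per-joint Euclidean errors; a minimal sketch, assuming position arrays in millimetres:

```python
import numpy as np

def accuracy_and_precision(estimated, ground_truth):
    """Both arrays: (n_frames, n_joints, 3) positions in millimetres.
    Accuracy is the mean Euclidean error; precision is its spread."""
    errors = np.linalg.norm(estimated - ground_truth, axis=-1)
    return float(errors.mean()), float(errors.std())
```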

Robustness and reliability Robustness and reliability do not have rigorous definitions like the constraints discussed in the preceding paragraphs. In this context the terms are used to describe how well the pose estimation algorithm is able to correctly label body parts, and also how it handles erroneous estimates.

If treated as a classification problem, precision and recall (Powers, 2011) are metrics that can be used to evaluate the performance. Precision (note that this is precision in a classification sense, and not the same as the precision discussed in the paragraph above), or positive predictive value, gives the percentage of detected true positives compared to the total number of items detected by the method. Recall, or true positive rate, gives the percentage of detected true positives compared to the total number of true positives in the ground truth.

Mean average precision is an error metric used by (Shotton et al., 2011) to evaluate the performance of the Kinect pose estimation algorithm. Average precision is calculated per joint (or body part) over all frames, and the mean average precision (mAP) is the mean over all joints of the average precision.
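A simplified, thresholded sketch of these metrics, using the 100 mm true-positive bound mentioned above; the array layout is an assumption for the example:

```python
import numpy as np

def precision_recall(errors, detected, threshold_mm=100.0):
    """errors: (n_frames, n_joints) 3D errors against ground truth;
    detected: boolean mask of the joints the method actually reported.
    A detection counts as a true positive when its error is below the
    threshold; every ground truth joint not hit is a false negative."""
    hits = detected & (errors < threshold_mm)
    precision = hits.sum() / max(detected.sum(), 1)
    recall = hits.sum() / errors.size
    return precision, recall

def mean_average_precision(errors, detected, threshold_mm=100.0):
    """Per-joint precision over all frames, then the mean over joints."""
    hits = detected & (errors < threshold_mm)
    per_joint = hits.sum(axis=0) / np.maximum(detected.sum(axis=0), 1)
    return float(per_joint.mean())
```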

An important aspect of the robustness of a pose estimation algorithm is how it copes with errors. For an autonomous application, it is crucial that the algorithm can automatically recover if the correct pose is not found in a frame. Detecting erroneous estimates and failing gracefully allows the application to handle errors, for instance by reusing the previous correct estimate.

4.3 Prototype design

A conceptual real-time computer vision-based human motion capture pipeline was designed in the early suggestion phase of the project. The result is shown in figure 4.2. The tentative design has been refined during the project, but the main concepts are the same as they were in the original from November 2007.

The assumptions and constraints outlined in sections 4.1 and 4.2 served as guidelines for the design. A set of images from multiple cameras is input to the pipeline. The steps in the pipeline are: a) subtract the background from the images to create silhouettes; b) reconstruct the volume occupied by the performer from the silhouette images; c) extract skeleton curves from the volume; d) identify extremities and segment the skeletal curves into body parts; and e) fit a model to the labelled data.

The output from the pipeline is a set of three-dimensional joint positions in a global frame of reference. The design is modular and it is fairly straightforward to replace any step in the process. This facilitates comparative evaluation with other methods.
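A structural sketch of this modular chaining is shown below; every stage is a stub standing in for the methods described in chapter 5, so only the interfaces and the graceful-failure path are illustrated.

```python
# Each stage is a placeholder; the real implementations are the topic
# of chapter 5, and any stage can be swapped out independently.
def background_subtract(image):                 raise NotImplementedError
def reconstruct_visual_hull(silhouettes, cal):  raise NotImplementedError
def skeletonise(volume):                        raise NotImplementedError
def segment_body_parts(curves, model):          raise NotImplementedError
def fit_model(model, labelled_tree):            raise NotImplementedError

def estimate_pose(images, calibration, model):
    silhouettes = [background_subtract(img) for img in images]   # step a
    volume = reconstruct_visual_hull(silhouettes, calibration)   # step b
    curves = skeletonise(volume)                                 # step c
    labelled_tree = segment_body_parts(curves, model)            # step d
    if labelled_tree is None:      # invalid labelling: fail gracefully,
        return None                # the caller may reuse the last pose
    return fit_model(model, labelled_tree)                       # step e
```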

Many algorithms for processing two- and three-dimensional images (volumes) are ideal for implementation on stream processing hardware. The advent of inexpensive consumer graphics cards with stream processing capabilities has popularised GPGPU. Both background subtraction and skeletonisation are tasks characterised by large data sets, high parallelism, and minimal dependency between data elements; hence they are ideal for GPGPU implementation. GPGPU implementations of background subtraction and skeletonisation algorithms are described in sections 5.1.2 and 5.2.2, respectively.

After skeletonisation the number of remaining voxels is of the order of 10². Datasets of this size are too small to utilise the stream processing capabilities of the GPU, and the overhead of transferring the data between main and graphics memory discourages GPGPU implementations of the later stages of the pipeline. Efficient CPU implementations of the body part identification and model fitting are described in sections 5.2.3 and 5.2.4, respectively.

Once the pose of the subject has been estimated, in the form of joint positions, the result should be evaluated. The joint estimates should be compared to known ground truth data. This can be achieved both with real and synthetic data. A toolset for generating synthetic data with known ground truth is described in section 5.2.1.



Figure 4.2 – Tentative design of the pipeline, from multi-camera images through silhouettes, voxels, skeleton voxels, and a labelled tree structure to joint positions. Boxes with bold text relate to the main contributions C1 and C2 in section 1.5. Boxes with dashed strokes are not strictly part of the pipeline, but connect the minor contributions C3 and C4 to the rest of the work.


Chapter 5
Published papers

This chapter summarises the papers in part IV and relates them to the research questions. The contributions of each author are clarified.

Paper 1 Semi-automatic camera calibration

Paper 2 Background subtraction using GPGPU

Paper 3 Synthetic data for shape-from-silhouette

Paper 4 Real-time skeletonisation

Paper 5 Pose estimation using skeletonisation

Paper 6 Model fitting to skeletal data

Paper 7 View-independent gait recognition

The papers have been divided into two categories: introductory publications and main publications.

5.1 Introductory publications

The first two papers dealt with the preliminary conditions for the real-time pose estimation pipeline. Camera calibration and background subtraction are important processes that lay the foundations for the rest of the processing.

5.1.1 Semi-automatic camera calibration

This paper presented a semi-automatic camera calibration tool using a checkerboard pattern as control points.


Authors

Rune H. Bakken, Bjørn G. Eilertsen, Gustavo U. Matus, Jan H. Nilsen.

Full title

Semi-automatic Camera Calibration Using Coplanar Control Points.

Published in

Proceedings of NIK 2009.

Statement of contributions

Bakken formulated the problem, oversaw the design process, and wrote the paper. Eilertsen and Matus designed and implemented the solution. Nilsen provided feedback.

Synopsis

In order to reconstruct 3D information from multiple view image data, the calibration parameters of the cameras must be known. The extrinsic parameters are placement and orientation in the room, while the intrinsic parameters define how incoming light is projected onto the image plane, such as focal length and lens distortion. Many techniques exist for determining the camera parameters, and this paper examined and compared eleven freely available implementations. The findings were that all solutions use some form of known control points, but features like automatic discovery of the control points and live capture were scarce. Consequently, calibration with these solutions can be a cumbersome and time-consuming process.

The paper also presented a new camera calibration solution based on the calibration routines available in the OpenCV library (Bradski and Kaehler, 2008). This is a semi-automatic calibration tool using a checkerboard pattern as control points. The control points are automatically detected during live capture with the cameras, resulting in a fast and reliable calibration process. The calibration tool is platform independent and the source code is freely available on the web at:

http://code.google.com/p/calvincalibration/
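For illustration, a minimal single-camera sketch of the OpenCV checkerboard routines the tool builds on; the pattern size, square size, and file names are assumptions for the example, and the tool itself additionally detects the pattern automatically during live capture.

```python
import cv2
import numpy as np

pattern = (9, 6)    # inner corners of the checkerboard (assumed size)
square = 25.0       # square size in millimetres (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for fname in ("view0.png", "view1.png", "view2.png"):  # hypothetical files
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:                       # collect 3D-2D point correspondences
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsics (camera matrix K, distortion d) and per-view extrinsics.
rms, K, d, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```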

Relation to research questions

Camera calibration is an important step for further processing, but is beyond the main focus of this work. The paper helped establish a framework for the following research and contributed towards answering research question RQ2 on data acquisition.


5.1.2 Background subtraction using GPGPU

This paper presented a comparison of GPGPU implementations of three background subtraction algorithms.

Authors

Eirik Fauske, Lars M. Eliassen and Rune H. Bakken.

Full title

A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA.

Published in

Proceedings of NAIS 2009.

Statement of contributions

Fauske and Eliassen designed and implemented the solution, and wrote the paper. Bakken formulated the problem, oversaw the design, and provided feedback during the writing process.

Synopsis

For recovery of human motion from image data it is necessary to determine which parts of the images constitute the subject (foreground) and what is not important information (background). The term background subtraction is used about a class of techniques that maintain some model of what the background looks like, and subtract parts of incoming images that are determined as similar to the background model.

This paper examined three background subtraction algorithms with background models of varying complexity. The method presented by (Horprasert et al., 2000) uses a simple background model consisting of mean intensity for each pixel. The more complex codebook model (Kim et al., 2005) consists of several ranges of intensities for each pixel, making it more robust to changes in the background. Finally, the self-organising background subtraction (SOBS) algorithm (Maddalena and Petrosino, 2009, 2008) trains a neural network classifier for each pixel with surrounding pixels contributing to the model.
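As a point of reference for the simplest of these model families, a sketch of per-pixel mean-based subtraction is given below; the threshold and the colour handling are illustrative simplifications, and the algorithms compared in the paper are considerably more elaborate.

```python
import numpy as np

def train_background(frames):
    """Mean intensity per pixel over a stack of background-only frames."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0)

def subtract(frame, background, threshold=30.0):
    """Foreground mask: pixels deviating from the background model."""
    diff = np.abs(frame.astype(np.float32) - background)
    if diff.ndim == 3:        # colour input: largest channel deviation
        diff = diff.max(axis=2)
    return diff > threshold
```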

Furthermore, the paper presented GPGPU implementations of the three methods, and compared their processing time and subtraction performance. All three implementations achieved real-time performance. Horprasert's algorithm was the fastest (0.8 ms per frame), but had the worst subtraction performance (precision 0.76–0.89, recall 0.62–0.96); the SOBS algorithm was the slowest (9.8 ms per frame), but had the best subtraction performance (precision 0.84–0.96, recall 0.76–0.97); while the codebook method offered a compromise between processing time (4.6 ms per frame) and subtraction performance (precision 0.84–0.97, recall 0.71–0.95).

Relation to research questions

Background subtraction is a large research topic in itself, and a crucial step in the shape-from-silhouette process. This paper helped gain understanding of the state of the art in background subtraction, and the feasibility of using these techniques in a real-time framework. The paper contributed towards answering research question RQ2 on data acquisition.

5.2 Main publications

The five main papers investigated different aspects of real-time pose estimation.

5.2.1 Synthetic data for shape-from-silhouette

This paper presented a toolset for generating synthetic multi-view silhouettes of a moving avatar. The data can be used for planning, development, and comparative evaluation of shape-from-silhouette-based pose estimation algorithms.

Authors

Rune H. Bakken.

Full title

Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods.

Published in

Proceedings of ISVC 2012.

Statement of contributions

Bakken formulated the problem, designed the solution, conducted the experiments, and wrote the paper.


Synopsis

To evaluate and compare pose estimation algorithms, publicly available multi-view datasets are necessary. For quantitative analysis of accuracy the datasets must include ground truth data.

This paper examined eight publicly available multi-view datasets. The findings were that the number of actions is limited, and ground truth data are only included with the HumanEva datasets.

A toolset for generating synthetic silhouette datasets with ground truth data was presented. The toolset uses motion capture data from the CMU database (http://mocap.cs.cmu.edu), consisting of around 2600 sequences, and can render silhouette images with arbitrary camera setups. The toolset is implemented as a plugin for the Blender content creation suite, along with a stand-alone analysis tool for visualising evaluation results. The toolset facilitates comparative evaluation of shape-from-silhouette-based pose estimation algorithms. The toolset is available for download at:

http://code.google.com/p/mogens/

Relation to research questions

This paper answered research question RQ4a, by presenting a framework for generating synthetic data that facilitates comparative evaluation.

5.2.2 Skeletonisation using GPGPU

In this paper a GPGPU implementation of a fully parallel thinning algorithm based on the critical kernels framework was presented. It was selected as the best paper of the special session on High Performance Computing in Computer Vision Applications (HPC-CVA) at IPTA 2012, and an extended version will be submitted for publication in a special issue of the International Journal of High Performance Computing Applications (IJHPCA).

Authors

Rune H. Bakken and Lars M. Eliassen.

Full title

Real-time 3D Skeletonisation in Computer Vision-Based Human Pose Estimation Using GPGPU.


Published in

Proceedings of IPTA 2012.

Statement of contributions

Bakken formulated the problem, designed and implemented the solution, conducted the experiments and led the writing process. Eliassen contributed to the implementation and writing.

Synopsis

The curve-skeleton is a compact 1D line-like representation of an object with topological and geometrical information preserved. A thinning algorithm iteratively removes boundary voxels from an object to produce a topologically equivalent skeleton. The critical kernels framework (Bertrand and Couprie, 2006) is a powerful tool for analysing and designing parallel thinning algorithms.

This paper presented a parallel thinning algorithm based on critical kernel theory implemented on the GPU. The thinning algorithm processes each voxel independently, making it ideal for a GPGPU implementation. The volumes are typically sparse, so a pre-processing algorithm was proposed to reduce the number of empty voxels the thinning algorithm has to examine.
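The overall shape of such a fully parallel pass can be sketched as follows: every voxel is tested against the same snapshot of the volume, and all deletions are applied synchronously. The actual critical kernels deletability test is far too involved for a short example, so it is left as a stub here.

```python
import numpy as np

def is_deletable(snapshot, x, y, z):
    """Stub standing in for the critical-kernels deletability test."""
    raise NotImplementedError

def thin(volume):
    """volume: 3D boolean array; iteratively peel deletable boundary
    voxels until no voxel changes, yielding a curve-skeleton."""
    volume = volume.copy()
    while True:
        snapshot = volume.copy()    # every test reads the same state
        deletable = [(x, y, z) for x, y, z in zip(*np.nonzero(snapshot))
                     if is_deletable(snapshot, x, y, z)]
        if not deletable:
            return volume
        for x, y, z in deletable:   # synchronous, order-independent
            volume[x, y, z] = False
```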

Experiments with real data from the i3DPost dataset (Gkalelis et al., 2009) and synthetic data (generated with the toolset introduced in paper 3) were conducted. Two synthetic sequences with a total of 996 frames were used to perform experiments with the pose estimation algorithm introduced in paper 5. At 64×64×64 volume resolution the proposed algorithm was shown to yield improved accuracy (64.9 mm versus 129.8 mm mean 3D error), precision (35.0 mm versus 51.6 mm standard deviation), and robustness (2.5 % versus 17.1 % invalid frames) compared to another state-of-the-art thinning algorithm (Raynal and Couprie, 2011). The GPGPU implementation ensured real-time performance, with a processing time of ~4.5 ms per frame.

Relation to research questions

This paper dealt with research question RQ3a on how the underlying skeletal structure can be recovered more accurately. Real-time performance was achieved using GPGPU.

5.2.3 Pose estimation using skeletonisation

This paper presented an algorithm that constructs a tree structure labelled with body parts based on a skeletonised visual hull.


Authors

Rune H. Bakken and Adrian Hilton.

Full title

Real-time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences.

Published in

Proceedings of VISAPP 2012.

Statement of contributions

Bakken formulated the problem, designed and implemented the solution, conducted the experiments and wrote the paper. Hilton provided feedback during the design and writing processes.

Synopsis

This paper presented a real-time method for estimating human pose from multi-view video. The main algorithm consists of four steps: construct a visual hull from silhouette images, find the curve-skeleton of the visual hull, build a tree structure from the skeleton curves, and segment the tree into body parts.

A model based on anthropometric measurements is used to prune the tree and identify extremities. Each frame of the sequence can be estimated independently of previous frames, and no a priori information about the motion is required. The method can automatically initialise and recover from estimation failures. Experiments were conducted with real data from the i3DPost dataset (Gkalelis et al., 2009) and synthetic data (generated with the toolset introduced in paper 3) to demonstrate the accuracy, precision, robustness, and real-time performance of the method.
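One ingredient of that pruning step can be sketched as removing short spurious branches from the skeleton tree; the tree encoding, names, and the 150 mm threshold below are illustrative assumptions, not the paper's actual parameters.

```python
def subtree_length(tree, node):
    """Length from node to its farthest descendant leaf (millimetres).
    tree maps a node id to a list of (child id, branch length) pairs."""
    return max((length + subtree_length(tree, child)
                for child, length in tree.get(node, [])), default=0.0)

def prune(tree, root, min_branch_mm=150.0):
    """Return a copy of the tree with side branches shorter than the
    threshold removed; real limbs survive, short noise spurs do not."""
    kept = [(child, length) for child, length in tree.get(root, [])
            if length + subtree_length(tree, child) >= min_branch_mm]
    pruned = {root: kept}
    for child, _ in kept:
        pruned.update(prune(tree, child, min_branch_mm))
    return pruned

# e.g. prune({0: [(1, 400.0), (2, 20.0)], 1: [], 2: []}, root=0)
#      drops the 20 mm spur and keeps the 400 mm limb.
```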

The experiments were conducted with volume resolutions of 64×64×64 and 128×128×128 voxels. For a synthetic data sequence consisting of 451 frames the mean 3D error was 53.2 mm (standard deviation 15.9 mm) for the estimated positions of the extremities, 97.6 % of the frames were correctly labelled, and the processing time was ~15.5 ms per frame at 64×64×64. At 128×128×128 the mean 3D error was 46.8 mm (standard deviation 16.6 mm), 95.8 % of the frames were correctly labelled, and processing time was ~66.3 ms per frame.


Relation to research questions

This paper laid the foundation for the skeleton-based pose estimation method. It outlined a simple real-time shape-from-silhouette approach that partially answers RQ2 on data acquisition. Furthermore, it presented a pose estimation algorithm that can initialise automatically and recover from errors, thus answering RQ3c.

5.2.4 Model fitting to skeletal data

This paper presented an algorithm based on constraint dynamics for fitting a skeletal model to the labelled tree structure.

Authors

Rune H. Bakken and Adrian Hilton.

Full title

Real-time Pose Estimation Using Constrained Dynamics.

Published in

Proceedings of AMDO 2012.

Statement of contributions

Bakken formulated the problem, designed and implemented the solution, conducted the experiments and wrote the paper. Hilton provided feedback during the design and writing processes.

Synopsis

This paper built on the results from paper 5. A skeletal model was created using anthropometric measurements, and fitted to the segmented tree structure. SHAKE is a constrained dynamics algorithm designed for maintaining rigid bonds in molecular simulations (Ryckaert et al., 1977), but it is also applicable to other rigid structures, for instance a skeletal human model consisting of rigid bones.
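For reference, a minimal sketch of the basic distance-constraint form of SHAKE on an articulated chain (equal unit masses assumed; the line pseudo-constraints introduced in the paper are not shown):

```python
import numpy as np

def shake(joints, bones, lengths, iterations=50, tol=1e-4):
    """joints: (n, 3) joint positions; bones: list of (i, j) index pairs;
    lengths: the rigid rest length of each bone. Iteratively projects
    the positions back onto the bone-length constraints."""
    joints = np.asarray(joints, dtype=np.float64).copy()
    for _ in range(iterations):
        worst = 0.0
        for (i, j), rest in zip(bones, lengths):
            d = joints[j] - joints[i]
            dist = np.linalg.norm(d)
            err = dist - rest
            worst = max(worst, abs(err))
            # Move both endpoints symmetrically along the bone direction.
            corr = 0.5 * err * d / max(dist, 1e-9)
            joints[i] += corr
            joints[j] -= corr
        if worst < tol:             # all constraints satisfied
            break
    return joints

# e.g. shake(np.array([[0., 0, 0], [0.9, 0, 0], [2.2, 0, 0]]),
#            bones=[(0, 1), (1, 2)], lengths=[1.0, 1.0])
```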

The SHAKE algorithm has previously been used successfully for inverse kinematics, but there is no inherent way to control the internal joints of the articulation. In this paper SHAKE was extended by formulating a new pseudo-constraint that guides a joint (including internal joints) of the model toward a line fitted to the tree structure. The tree was subdivided into node groups corresponding to the joints using anthropometric measurements, and lines were fitted to the node groups. Line constraints were added between each joint and the corresponding lines, and the SHAKE algorithm was used to impose the constraints, thus fitting the model to the data. Experiments with real data from the i3DPost dataset (Gkalelis et al., 2009) and synthetic data (generated with the toolset introduced in paper 3) demonstrated the accuracy, precision, robustness, and real-time performance of the approach.

Experiments with two synthetic data sequences with a total of 996 frames at a volume resolution of 128×128×128 yielded mean 3D errors of 60.8 mm (standard deviation 26.8 mm) and 67.0 mm (standard deviation 31.0 mm). Invalid estimates were detected in 1.6 % and 2.9 % of the frames, and the processing time per frame was ~70 ms.

Relation to research questions

This paper answered research question RQ3b on how the configuration of the skeletal articulation can be estimated robustly.

5.2.5 View-independent gait recognition

This paper outlined a hybrid Hidden Markov Model (HMM)/Case-Based Reasoning (CBR) gait recognition framework where the pose estimation algorithm could be used to retrieve view-independent 3D gait features.

Authors

Rune H. Bakken and Odd Erik Gundersen.

Full title

View-Independent Human Gait Recognition Using CBR and HMM.

Published in

Proceedings of SCAI 2011.

Statement of contributions

Bakken formulated the problem, and wrote the data acquisition and HMM sections. Gundersen wrote the CBR section. Bakken and Gundersen co-wrote the discussion section.


Synopsis

Biometrics are intrinsic characteristics of a person, like fingerprints, iris patterns, or voice, that can be used for identification. The manner in which a person walks—the gait—is unique, and can be used as a behavioural biometric. This concept paper outlines how the pose estimation algorithm from paper 5 can be used in a novel gait recognition framework.

Most gait recognition approaches have relied on a single side-view camera as input. Using three-dimensional data instead makes the recognition system view-independent and improves robustness with regard to the direction the subject is walking. The outlined framework is a hybrid solution where HMMs (Dymarski, 2011) are trained using dynamic gait features, and along with static gait features are saved as cases in a CBR (Aamodt and Plaza, 1994) system.

Relation to research questions

This paper partially answered research question RQ4b on how the proposed pose estimation algorithm could be employed in a practical application, and also outlined some directions for future study.


Chapter 6
Quantitative and qualitative results

This chapter summarises the quantitative results produced in papers 4, 5, and 6. Some of the experiments were conducted on data generated with the toolset introduced in paper 3.

6.1 Motion sequences

The experiments in papers 4, 5, and 6 were conducted with real motion sequences from the i3DPost dataset, and synthetic data generated with the toolset introduced in paper 3. i3DPost (Gkalelis et al., 2009) is a publicly available dataset consisting of 96 sequences, captured with eight synchronised colour cameras in a blue-screen studio environment at a resolution of 1920×1080. The data include a wide variety of actions performed by eight different subjects. The synthetic sequences have an image resolution of 800×600, and were rendered with eight synthetic cameras (seven placed in a circle around the viewing volume, and one in the ceiling pointing down). The sequences used are listed in table 6.1. Additionally, there are some results from experiments with other datasets that were conducted after the papers were published.

6.2 Processing time

The skeletonisation algorithm (CK3) presented in paper 4 was evaluated with sequences of visual hulls reconstructed at a resolution of 64×64×64. The processing time of the GPU implementation of CK3 was compared with a sequential implementation. The results can be seen in table 6.2. Using the GPU results in a speed-up of over 8×.

The GPU implementation of CK3 was also compared to a sequential implementation of another state-of-the-art skeletonisation algorithm (D6I1D), presented by (Raynal and Couprie, 2011). The results are shown in table 6.3.


Table 6.1 – The motion sequences used to produce the qualitative and quantitative results.

Sequence | Type | Frames | Description
dance-whirl | Synthetic | 451 | Subject performs a slow whirling dance
lambada | Synthetic | 545 | Subject dances lambada
walk | Real | 57 | Subject walks across viewing volume
walk-spin | Real | 46 | Subject performs a pirouette while walking
run-crouch-jump | Real | 108 | Subject runs into viewing volume, jumps, crouches, jumps, and crouches again
ballet | Real | 125 | Subject performs several ballet pirouettes

Even though D6I1D is almost three times faster when compared separately, using either of the skeletonisation algorithms for pose estimation results in a frame rate well above the target 30 fps.

The volume resolution greatly influences the processing time. Three sequences of varying complexity were tested at both 64×64×64 and 128×128×128 resolutions for evaluating the pose estimation algorithm presented in paper 5. The results can be seen in table 6.4. The method achieves near real-time performance of ~15 fps at the highest resolution, but by halving the dimensions of the volume the frame rate is almost tripled. In both cases the tree construction is highly efficient: <1.5 ms at 64×64×64 and <4 ms at 128×128×128. At the higher resolution the skeletonisation is the main computational bottleneck. The reason for the smaller difference in processing time for the visual hull construction is that the images from the i3DPost dataset have a resolution of 1920×1080, and copying the data between main memory and the GPU is the bottleneck.

Four sequences of varying complexity were used to evaluate the extended pose estimation algorithm presented in paper 6, and the results can be seen in table 6.5. The method achieved near real-time performance of ~14 fps. The model fitting is highly efficient. In this case the data acquisition column is the sum of the previous processing steps (visual hull reconstruction, skeletonisation, and building the labelled tree structure).

Table 6.2 – Comparison of mean processing time (in milliseconds, with standard deviation in parentheses) between the sequential version and the GPU implementation of the skeletonisation algorithm from paper 4.

Sequence | Sequential | GPU | Speed-up
dance-whirl | 40.25 (2.38) | 4.51 (0.39) | 8.9×
lambada | 40.13 (3.31) | 4.78 (0.45) | 8.4×
walk | 42.71 (2.21) | 3.99 (0.35) | 10.7×
ballet | 42.67 (2.90) | 4.09 (0.37) | 10.4×


Table 6.3 – Comparison of processing times and frame rates for the two skeletonisation algorithms. The volume resolution was 64×64×64, and the same visual hulls were used as input for both skeletonisation algorithms. The visual hull reconstruction was slower with the i3DPost data because of the higher image resolution.

(a) CK3

Sequence | Vis. hull | Skel. | Pose est. | Frame rate
dance-whirl | 3.84 (0.21) | 4.51 (0.39) | 1.60 (0.09) | 100.9
lambada | 3.84 (0.21) | 4.78 (0.45) | 1.56 (0.08) | 98.5
walk | 9.96 (0.26) | 3.99 (0.35) | 1.57 (0.13) | 65.9
ballet | 9.96 (0.85) | 4.09 (0.37) | 1.67 (0.14) | 63.6

(b) D6I1D

Sequence | Vis. hull | Skel. | Pose est. | Frame rate
dance-whirl | 3.84 (0.21) | 1.65 (0.02) | 1.21 (0.07) | 154.7
lambada | 3.84 (0.21) | 1.61 (0.03) | 1.18 (0.08) | 154.4
walk | 9.96 (0.26) | 1.71 (0.86) | 1.23 (0.09) | 79.7
ballet | 9.96 (0.85) | 1.73 (0.44) | 1.29 (0.07) | 77.0


6.3 Body parts identification

The pose estimation algorithm introduced in paper 5 constructs a tree structure segmented into body parts. If a valid body segmentation cannot be found, the algorithm labels the tree as invalid.

Real and synthetic data were used to evaluate the body part segmentation performance. For the dance-whirl sequence (451 frames), 97.6 % and 95.8 % of the frames were labelled correctly at 64×64×64 and 128×128×128 resolutions, respectively. For the real data at 128×128×128 volume resolution, the walk (57 frames) and walk-spin (46 frames) sequences were labelled 100 % correctly. Only 42.6 % of the run-crouch-jump (108 frames) sequence was labelled correctly, but the subject is crouching during half the sequence. The sequences were truncated to keep the subject completely inside the viewing volume. The ballet sequence (150 frames) had 87.3 % correctly labelled frames.

If an invalid labelling of the tree structure is detected, the extended pose estimation algorithm will reuse the last valid pose in the current frame. This happened in 1.6 % of the frames of the dance-whirl sequence, and 2.9 % of the frames of the lambada sequence in the experiments conducted for paper 6.

The choice of skeletonisation algorithm influences how well the body part segmentation works. Table 6.6 shows a comparison of the results achieved with the two skeletonisation algorithms (CK3 and D6I1D).


Table 6.4 – Comparison of mean processing times for three sequences from the i3DPost dataset, using different resolutions. All times are in milliseconds, with standard deviations in parentheses.

(a) Volume resolution 64×64×64.

Sequence | Visual hull | Skeletonisation | Build tree | Sum | Frame rate
walk | 9.61 (0.26) | 4.37 (0.28) | 1.44 (0.25) | 15.41 (0.47) | 64.88
walk-spin | 9.63 (0.26) | 4.42 (0.32) | 1.41 (0.30) | 15.46 (0.54) | 64.67
run-crouch-jump | 9.69 (0.26) | 4.82 (0.59) | 1.07 (0.27) | 15.58 (0.61) | 64.18

(b) Volume resolution 128×128×128.

Sequence | Visual hull | Skeletonisation | Build tree | Sum | Frame rate
walk | 12.25 (0.34) | 49.37 (2.63) | 3.09 (0.33) | 64.71 (2.61) | 15.45
walk-spin | 12.28 (0.30) | 52.06 (4.14) | 3.13 (0.46) | 67.46 (4.03) | 14.82
run-crouch-jump | 11.94 (0.16) | 51.43 (3.73) | 3.51 (0.58) | 66.88 (3.89) | 14.95

Table 6.5 – Comparison of mean processing times for four sequences from the i3DPost dataset. All times are in milliseconds, with standard deviations in parentheses.

Sequence | Data acq. | Line fitting | SHAKE | Sum | Frame rate
walk | 65.16 (2.67) | 0.09 (0.01) | 2.93 (0.06) | 68.18 (2.68) | 14.67
walk-spin | 67.95 (4.06) | 0.10 (0.02) | 2.95 (0.08) | 70.33 (3.96) | 14.22
run-crouch-jump | 67.33 (3.97) | 0.09 (0.01) | 2.90 (0.08) | 68.54 (3.76) | 14.59
ballet | 68.53 (5.53) | 0.09 (0.01) | 2.90 (0.11) | 71.12 (5.60) | 14.06


6.4 Accuracy and precision of joint position estimates

The synthetic sequences are generated with known ground truth positions for the joints. These were used to evaluate the accuracy and precision of the estimated positions of the extremities identified by the pose estimation algorithm introduced in paper 5, and of the full-body pose estimates given by the algorithm presented in paper 6.

Mean 3D errors per frame for the estimated positions of the four extremities are plotted in figure 6.1. The mean positional error of the hands and feet for the entire sequence was 46.8 mm (standard deviation 16.6 mm) and 53.2 mm (standard deviation 15.9 mm) for 128×128×128 and 64×64×64 volume resolutions, respectively.

The model used in the extended pose estimation algorithm is shown in figures 6.2 (bones) and 6.3 (joints). Each bone in the model is scaled in relation to an estimate of the subject's stature using the ratios shown in table 6.7. For example, with an estimated stature of 1800 mm the thigh bone would be scaled to 0.224 × 1800 ≈ 403 mm. The choice of skeletonisation algorithm influences the accuracy of the stature estimate, and a comparison of the results produced by the two skeletonisation algorithms (CK3 and D6I1D) is shown in table 6.8.

Box-and-whisker plots (1.5 IQR convention) for the 3D errors per joint of the full-body pose estimates are shown in figure 6.4. The mean positional errors of the joints over the entire sequences were 60.8 mm (standard deviation 26.8 mm) and 67.0 mm (standard deviation 31.0 mm), respectively.

Table 6.6 – Comparison of percentages of incorrectly labelled frames for the two skeletonisation algorithms with a volume resolution of 64×64×64.

Sequence | CK3 | D6I1D
dance-whirl | 3.1 | 20.2
lambada | 1.8 | 13.9
walk | 22.8 | 40.4
ballet | 17.3 | 28.0

Table 6.7 – Bone ratios used to scale each bone of the articulated model in relation to the estimated stature of the subject.

Bone | Ratio | Bone | Ratio
Head | 0.055 | Upper spine | 0.065
Upper neck | 0.055 | Lower spine | 0.065
Lower neck | 0.054 | Hip | 0.079
Shoulder | 0.111 | Thigh | 0.224
Upper arm | 0.162 | Lower leg | 0.239
Lower arm | 0.158 | Foot | 0.066
Hand | 0.038 | Toe | 0.033
Thorax | 0.065 | |


Table 6.8 – Comparison of mean stature estimates (in millimetres, with standard deviation in parentheses) using skeletons from the two algorithms.

Sequence | Ground truth | CK3 | D6I1D
dance-whirl | 1834 | 1784 (53) | 1654 (57)
lambada | 1834 | 1786 (50) | 1640 (50)

Figure 6.1 – Comparison of errors of estimated positions of the hands and feet for the dance-whirl sequence, plotted as mean positional error (mm) per frame: (a) volume resolution 64×64×64; (b) volume resolution 128×128×128. The value in each frame is the mean Euclidean distance between the estimated and ground truth positions of the four extremities. In frames where the curve reaches 0, the algorithm failed gracefully.


Figure 6.2 – The 24 bones of the articulated model: Head, UpperNeck, LowerNeck, LShoulder, RShoulder, LUpperArm, RUpperArm, LLowerArm, RLowerArm, LHand, RHand, Thorax, UpperSpine, LowerSpine, LHip, RHip, LThigh, RThigh, LLowerLeg, RLowerLeg, LFoot, RFoot, LToe, RToe.


[Figure 6.3: labelled skeleton diagram. The joints are: 0 HeadTop, 1 Head, 2 Neck, 3 Chest, 4 LSho, 5 RSho, 6 LElb, 7 RElb, 8 LHand, 9 RHand, 10 LFTip, 11 RFTip, 12 UBack, 13 Back, 14 LBack, 15 LHip, 16 RHip, 17 LKne, 18 RKne, 19 LAnk, 20 RAnk, 21 LFoot, 22 RFoot, 23 LToe, 24 RToe.]

Figure 6.3 – The 25 joints of the articulated model.


[Figure 6.4: two box-and-whisker panels, (a) and (b).]

Figure 6.4 – Box-and-whisker plots of two sequences of synthetic data: (a) dance-whirl consists of 451 frames (subject 55, trial 1), (b) lambada has 545 frames (subject 55, trial 2). The joint names correspond to those in figure 6.3. The HeadTop joint was missing in the ground truth data.


6.5 Qualitative results

Figure 6.5 shows four frames from the i3DPost walk sequence. The trees are segmented into body parts, and the extremities labelled as left or right. In figure 6.6 the same four frames are shown, now with the articulated model fitted to the data.

6.6 Supplementary results

Performing further experiments with additional datasets was suggested as further work in paper 5. The pose estimation algorithm has been tested on four other datasets, and the results are presented in this section.

6.6.1 HumanEva

The HumanEva datasets were discussed earlier, in section 3.3. An experiment was conducted with the S2 sequence from HumanEva-II. The sequence consists of 1221 frames where the subject performs a combination of several motions (walking in a circle, then running in a circle, and finally balancing on each foot). The dataset does not come with pre-subtracted silhouettes, so background subtraction was performed with the Codebook algorithm introduced by (Kim et al., 2005), and also described in paper 2.

The results from the experiment with the HumanEva dataset are generally poor. The algorithm manages to identify the correct body parts in some frames, as shown in figure 6.7; however, for most of the sequence the visual hull resulting from the reconstruction is not an accurate enough approximation of the actual volume to extract a skeleton resembling a human figure. Consequently, the body part segmentation fails for most of the frames, as shown in figure 6.8. Even worse than invalid estimates are the results from some frames where the algorithm is utterly confused and labels erroneous trees as valid limb labellings, which can be seen in figure 6.9. This experiment confirms that four camera views are not sufficient to produce reliable results with the proposed algorithm, as was discussed in section 4.1. Furthermore, even if the silhouettes produced with the Codebook algorithm are reasonably good, there is some noise in them that impairs the quality of the reconstructed visual hulls.

6.6.2 IXMAS

The Inria Xmas Motion Acquisition Sequences (IXMAS) (Weinland et al., 2006) dataset consists of 33 image sequences that were captured with five colour cameras. Eleven subjects performed fourteen everyday actions that were repeated three times.


Figure 6.5 – Four frames from the i3DPost walk sequence with labelled tree structure. Notice how the left-right labelling of the extremities is consistent across the sequence. Body part segmentation takes place before the left and right sides have been determined, so the limb colours can change between consecutive frames.


Figure 6.6 – The same four frames from the i3DPost walk sequence as in figure 6.5, with the articulated model fitted to the tree data.



Figure 6.7 – A correctly labelled pose from the HumanEva-II S2 sequence. (a) Visual hull. (b) Skeleton. (c) Correctly labelled tree.


Figure 6.8 – A frame that was correctly identified as invalid from the HumanEva-II S2 sequence. (a) Visual hull. (b) Skeleton. (c) Tree marked as invalid.



Figure 6.9 – A frame that should have been labelled as invalid from the HumanEva-II S2 sequence. (a) Visual hull. (b) Skeleton. (c) Incorrectly labelled tree.

The IXMAS dataset was mainly intended for evaluation of action recognition algorithms. An experiment was conducted with the julien sequence.

There were similar problems with the IXMAS data as with the HumanEva experiment presented earlier; the number of cameras is too low. To make matters worse, the resolution of the images is also low (390×291), and the silhouettes are not accurate enough. As a result the 3D reconstructions are poor, and the extracted skeletons and pose trees unreliable. The algorithm is able to correctly identify the body parts in a few frames, as shown in figure 6.10; however, for most frames the arms degenerate and the resulting tree is invalid, an example of which is shown in figure 6.11. As with the HumanEva dataset there are also some frames where the algorithm gets confused, and labels the tree incorrectly, as can be seen in figure 6.12.

6.6.3 CVSSP-3D

The CVSSP-3D (Starck and Hilton, 2007) dataset contains seventeen image sequences, with the same studio setup as for i3DPost. Two subjects each perform a set of actions. The first subject wears different costumes and performs different types of walking. The second subject performs a set of challenging dance routines. A freestyle dance sequence was used to test the proposed pose estimation algorithm. The percentage of correctly labelled frames is lower (48.8 % of 500 frames) than for the less complex sequences used above, but the algorithm still



Figure 6.10 – A correctly labelled pose from the IXMAS julien sequence. (a) Visual hull. (b) Skeleton. (c) Correctly labelled tree.


Figure 6.11 – A frame that was correctly identified as invalid from the IXMAS julien sequence. (a) Visual hull. (b) Skeleton. (c) Tree marked as invalid.



Figure 6.12 – A frame that should have been labelled as invalid from the IXMAS julien sequence. (a) Visual hull. (b) Skeleton. (c) Incorrectly labelled tree.

manages to estimate challenging poses, such as the one shown in figure 6.13, correctly. Even though the body is almost horizontal, and the legs and torso are twisted, the tree is segmented into body parts, and left and right extremities identified. Reconstructed visual hulls are bundled with the CVSSP-3D dataset, and those were used in the experiment.

6.6.4 INRIA children

The INRIA children dataset (INRIA Perception Group) consists of seven sequences of children playing. The images were captured with 16 synchronised cameras in a green screen studio environment, and extracted silhouettes are distributed along with the image data. A sequence where a small boy is doing cartwheels was used to perform an experiment with the proposed motion capture algorithm. There is poor separation between the arms and the torso for large parts of the sequence, resulting in a low overall percentage of correctly labelled frames (28.4 % of 507 frames). However, the correctly labelled frames demonstrate that the anthropometric model used in the algorithm is capable of dynamically scaling to a child ~120 cm tall, as well as a grown man ~185 cm tall as shown with the ballet sequence earlier. Figure 6.14 shows three frames from the boy cartwheel sequence.



Figure 6.13 – Three frames from the CVSSP-3D dance-free sequence. The body is almost horizontal, and the torso and legs are twisted, but the proposed pose estimation algorithm still manages to segment the tree into body parts, and identify the left and right limbs correctly.



Figure 6.14 – Three frames from the boy cartwheel sequence. The child is ~120 cm tall, but since the anthropometric model scales dynamically the algorithm is still capable of correctly identifying the limbs.


Part III

Evaluation and conclusion


Chapter 7

Discussion

In this chapter the results presented in chapters 5 and 6 are discussed and evaluated. First, the proposed algorithms are compared to relevant state-of-the-art research examined in chapter 3. Next, the research questions introduced in chapter 1 are revisited and the manner in which they have been answered is considered. Finally, the strengths and weaknesses of the proposed algorithms are discussed.

7.1 Comparative evaluation

The approach presented by (Straka et al., 2011) is similar to the proposed pose estimation method in how it extracts skeletons from visual hulls and identifies body parts. Furthermore, (Straka et al., 2011) also used synthetic data based on the CMU motion capture database for quantitative evaluation. The results presented in paper 6 show that the accuracy and processing time of the proposed algorithms are on a par with a state-of-the-art skeletonisation-based method using similar data for evaluation. The method presented by (Straka et al., 2011), however, requires the head to be the highest point for the duration of a sequence, and labelling of left and right limbs is not automatic. These are limitations that the proposed approach does not suffer from.

Paper 4 assessed the suitability of the proposed skeletonisation algorithm (CK3) for use in real-time human motion analysis, and compared the results with another state-of-the-art real-time thinning algorithm: a parallel, directional thinning algorithm based on isthmuses (D6I1D) (Raynal and Couprie, 2011). This algorithm has been successfully applied in the domain of human motion analysis by (Raynal, 2010). To compare the generated skeletons, both skeletonisation algorithms were used in the pose estimation method from paper 5.

There was a considerable difference in accuracy achieved with the two skeletonisation methods: the 3D error of estimated extremity positions was more than doubled for D6I1D compared to CK3. The pose estimation algorithm also consistently performed more robustly with the skeletons from CK3, using both synthetic and real data sequences. Estimating stature using the labelled tree structure is another measure of the accuracy of the skeletons; using D6I1D resulted in three times higher error than CK3. The D6I1D algorithm was almost three times faster; however, using either of the skeletonisation algorithms for pose estimation resulted in frame rates well above 30 fps at a volume resolution of 64×64×64.

The pose estimation algorithm used for evaluation was developed using skeletons from CK3 as input. Therefore it could be claimed that it is a biased evaluation metric when used to compare the output of CK3 with skeletons from another thinning algorithm. It is possible that assumptions were made during the development of the pose estimation algorithm that inadvertently favoured CK3. However, it was clear from a qualitative assessment that D6I1D yielded skeletons that were further from the ground truth, with limb ends too aggressively shortened. The D6I1D algorithm is faster, but consistently yields poorer performance in terms of accuracy and robustness.

In their review of the state of the art in pose estimation, (Sigal and Black, 2010) claimed that human tracking in controlled environments was solved. As was mentioned in section 3.1.5, the tracking framework presented by (Corazza et al., 2010) achieved a 3D error of 1.5 cm per joint. However, the processing time used was far from real-time, and the dependency on having an accurate subject-specific model acquired with a laser scanner limits the use of the approach.

7.2 Research questions revisited

This section will detail how the research questions have been answered. Figure 7.1 shows how the published papers and relevant chapters of the dissertation are related to the research questions.

7.2.1 Current situation

The first research question sought to establish what the current situation in pose estimation is:

What is the state of the art in the field of computer vision-based human pose estimation?

This was answered in chapter 3. Table 3.1 summarised the findings on real-time human pose estimation. Automatic, real-time initialisation was possible in only a limited number of approaches, confirming the earlier findings of (Michoud et al., 2007a). There were strong assumptions about the type of motion handled by most of the examined approaches. Restrictions were imposed either by the training data used or by assuming that the subject would always perform upright motion patterns, for instance walking, running, or dancing.


[Figure 7.1: diagram relating the seven papers (camera calibration, background subtraction, synthetic data, skeletonisation, pose estimation, model fitting, gait recognition) and chapter 3 (state of the art in pose estimation) to research questions RQ1–RQ4, with markers for first-authored and co-authored papers and important contributions.]

Figure 7.1 – Relations between papers and research questions.

7.2.2 Data acquisition

As a foundation for further research, the second research question asked how the necessary data would be acquired:

How can three dimensional information about the subject be recovered from multi-view video in real time?

This was answered in papers 1, 2, and 5. A fairly simple shape-from-silhouette algorithm implemented on the GPU was shown to produce good results in real time when a sufficient number of camera views were available. In order to create a visual hull two types of data are required: silhouette images and camera calibration parameters. Three background subtraction algorithms implemented on the GPU were shown to produce silhouette images in real time. Finally, a semi-automatic camera calibration tool was presented that simplifies the calibration of setups with multiple cameras.


7.2.3 Estimation

The third research question consisted of three parts, and dealt with how the joint configuration should be recovered:

How can the underlying skeletal structure of the human body be recovered accurately from the three dimensional reconstructed data?

This was answered in paper 4. The critical kernel-based skeletonisation algorithm was shown to recover the underlying skeletal structure more accurately than another state-of-the-art real-time skeletonisation algorithm. Furthermore, it allowed the stature to be estimated more accurately, which in turn influences the results of fitting an articulated model to the curve-skeleton data.
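The stature estimation itself is not reproduced in this chapter; sketched below is one plausible reading, under the assumption that stature is approximated by the geodesic length of the head-to-foot branch of the labelled tree (the function name and the interpretation are illustrative, not the thesis's exact procedure):

    import numpy as np

    def path_length_mm(points):
        """Sum of Euclidean segment lengths along an ordered polyline of
        3D points, e.g. skeleton voxels from the head to a foot."""
        pts = np.asarray(points, dtype=float)
        return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())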

How can the configuration of the skeletal articulation be reliably estimated from the skeleton data with automatic recovery from errors?

This was answered in papers 5 and 6. Experiments with both real and synthetic data demonstrated that the combination of the algorithms presented in these two papers can estimate the subject's pose robustly in ~95 % of frames in longer sequences of synthetic data. The sequences used in the evaluation also included fast and challenging motions, such as ballet dancing.

Can the pose estimation algorithm be initialised automatically without prior knowledge of the motion pattern or a specific start pose?

This was answered in paper 5. The pose estimation algorithm presented in that paper recovers a tree structure labelled with body parts independently in each frame. No interaction or information from previous frames is required.

7.2.4 Performance and applicability

The two-part fourth research question sought to determine how evaluation should be done, and how the proposed algorithms could potentially be used in practice:

How should the performance of the pose estimation algorithm be evaluated?

This was answered in paper 3. The comparison of publicly available multi-view datasets showed that ground truth data is uncommon. In order to assess the accuracy of the solution a quantitative analysis with known ground truth is necessary. Furthermore, to compare the accuracy with other solutions it is important to have data available that can be used by a wide variety of algorithms.


Ideally, high quality real data with known ground truth should be used. However, the experiments in chapter 6 and other researchers' experiences (Mundermann et al., 2006a; Corazza et al., 2010) with using the best known dataset fitting this description, HumanEva, show that shape-from-silhouette-based solutions struggle with the four-camera setup used.

Alternatively, the synthetic data produced with the toolset presented in paper 3 can be used. The flexibility offered by this approach allows a wide selection of usage scenarios to be evaluated. Synthetic data produced with the toolset were used to evaluate the accuracy of the algorithms presented in papers 4, 5, and 6.

How can the proposed pose estimation approach be used in a specific application?

This question was partially answered in paper 7. Most existing gait recognition approaches rely on a single camera view, and are sensitive to deviations from a walking direction parallel to the image plane. Using a shape-from-silhouette method to reconstruct 3D information about the subject makes the gait recognition view-independent. Furthermore, it is believed that extracting skeletal information from the visual hull could improve recognition performance, but experiments with real data are required to verify that claim.

7.3 Advantages and drawbacks

In section 3.2 the advantages and shortcomings of current state-of-the-art human pose estimation approaches were examined. Similarly, the strengths and weaknesses of the proposed algorithms will be discussed.

7.3.1 Strengths

The proposed algorithms have a number of advantages that set the method apart from the state of the art.

Single-frame estimation Single-frame in this context is taken to mean image data from a multi-view camera system at a single time instant. The algorithm outlined in paper 5 can recover the subject's pose independently in each frame. The use of an anthropometric model allows a tree segmented into body parts to be extracted from skeletal data. Consequently, two features of the method are automatic initialisation and automatic recovery from errors.

Automatic initialisation No interaction is required to initialise the pose estimation algorithm. Neither is it necessary for the subject to adopt a special initialisation pose, like the T-pose. This is a vital feature in applications that rely on automatic operation, for instance gait recognition.


Automatic recovery from errors Pose estimation can fail in cases where the skeletonisation produces a result that is inconsistent with the actual skeletal structure. Because the pose estimation algorithm can automatically initialise, it is also possible to automatically recover from such errors, and once a frame with a better skeleton is reached the correct pose is reacquired.

Fast and/or complex motion Fast, complex motion can be challenging for approaches that rely on temporal information, because the larger distance between the locations of limbs from one frame to the next can make it difficult to find correspondences. Because of the single-frame pose recovery the proposed algorithm is invariant to the speed of the motion. Some frames from a ballet sequence where the extremity locations change by as much as ~35 cm between consecutive frames are shown in figure 7.2.

Processing time and latency All the steps of the processing pipeline in figure 4.2 have been demonstrated to have performance within the real-time constraint defined in section 4.2. This was achieved partly by implementing particularly computationally intensive tasks using GPGPU. The latency is equal to the processing time for one frame.

Handling of displacement A limitation of skeleton-based pose recovery methods is that the shoulders and hips tend to be displaced when the upper arms are close to the torso, or the thighs are close to each other. The model fitting algorithm outlined in paper 6 compensates for these displacements by using the anthropometric model to constrain the shoulders and hips to more anatomically correct locations, as shown in figure 7.3.

Dynamic anthropometric model The bone ratios used in papers 5 and 6 scale with the longest path in the tree and estimated stature, respectively. This makes the algorithms invariant to the subject's size, and the proposed pose estimation method can be employed with both adult and child performers, as demonstrated in chapter 6.

7.3.2 Weaknesses

The real-time constraint placed on the proposed method leads to some shortcomings. Furthermore, there are certain situations that the algorithms do not handle well.

Uneven motion Papers 5 and 6 demonstrated that the pose estimate for single frames is accurate to within ~5 cm; however, the recovered motion is uneven



Figure 7.2 – Three consecutive frames from the i3DPost ballet sequence. The displacement of the right foot is ~70 cm in 80 ms from the first to the third frame. Since the proposed algorithm is not reliant on temporal information it still manages to identify the extremities correctly.


Figure 7.3 – One frame from the i3DPost ballet sequence. There is a considerable displacement of the right shoulder in the tree caused by the upper arm's close proximity to the torso; however, the extended pose estimation algorithm still manages to find a reasonably good estimate of the shoulder's position.

or jagged across the sequences, because the error can be in opposite directions in consecutive frames.

Pose limitations Pose estimation can fail in situations where the limbs are not clearly separated from the torso or each other. Certain poses, like the crouch shown in figure 7.4, yield poor skeletonisation results, where the resulting skeleton does not correspond well to the underlying skeletal structure of the body.

Figure 7.4 – A frame from the i3DPost run-crouch-jump sequence where the subject is crouching. The limbs are not clearly separated in the visual hull, and the resulting tree does not resemble a person.


The following quote is taken from (Carranza et al., 2003):

[...] there are extreme body poses such as the foetal position that cannot be reliably tracked due to insufficient visibility. To our knowledge, no non-intrusive system has demonstrated it is able to track such extreme positions.

While that quote is almost ten years old, none of the approaches reviewed in section 3.2 were shown to deal with such poses. The methods reviewed were mainly evaluated with upright walking, running, and dancing motions.

Poses where the hands or feet are close together can result in erroneous branches remaining in the tree structure, as shown in figure 7.5. In cases like this the breadth-first tree traversal of the skeleton voxels can make the spurious branches longer than the correct branches. This effect is exacerbated further if shadows are not handled robustly by the background subtraction algorithm. If spurious branches are incorrectly identified as extremities, significant errors can occur during model fitting, in turn resulting in illegal joint configurations.

Figure 7.5 – A frame from the dance-whirl sequence generated with the toolset introduced in paper 3. The feet are close together and a spurious branch between the legs is interpreted as the right foot, resulting in an illegal pose when the articulated model is fitted to the data.

Sensitivity to noise The shape-from-silhouette algorithm used makes the method sensitive to noise in the silhouette images. While state-of-the-art background subtraction methods like the Codebook algorithm used in section 6.6.1 give reasonably clean silhouettes, some level of noise is unavoidable. The results of the experiments with the HumanEva dataset presented in section 6.6.1 illustrate the problems that can occur with noisy silhouettes.


Reliability with fewer cameras As mentioned in section 4.1, several researchers have found that at least eight cameras are necessary to achieve good reconstruction results with a shape-from-silhouette method. The results achieved with the HumanEva and IXMAS datasets in sections 6.6.1 and 6.6.2 support this assessment. Fewer viewpoints result in a higher over-estimate of the visual hull, increasing the chances of ghost volumes and poorer separation of the limbs. This in turn leads to problems with the skeletonisation, and gives less reliable pose estimates.

Skeletonisation With a volume resolution of 128×128×128 the skeletonisation algorithm is a performance bottleneck. The skeletonisation used almost 75 % of the total processing time. The D6I1D algorithm (Raynal and Couprie, 2011) was found to be considerably faster, but the degradation in accuracy and reliability was unacceptable. Alternative skeletonisation algorithms, such as the voxel scooping algorithm used by (Straka et al., 2011), should be investigated.


Chapter 8

Conclusion and future work

This chapter concludes the dissertation, and the contributions of the research effort are examined. Furthermore, some directions for future work are outlined.

The examination of the state of the art revealed that while a number of approaches for real-time pose estimation have been presented during the last decade, a number of limitations on usage scenarios and the types of motion allowed still exist. Automatic initialisation is typically dependent on limiting the poses the subject is allowed to use.

In this dissertation a set of algorithms that together form a real-time pose estimation pipeline was presented. The pose estimation pipeline is automatically initialised with few assumptions about the subject's pose, and can recover from errors. Experiments with synthetic ground truth data demonstrated that the correct pose was estimated in ~95 % of frames with ~6 cm mean 3D error.

8.1 Contributions

The contributions of this research effort were briefly outlined in chapter 1. Next, the contributions will be described in more detail. The main contributions are:

Evaluation using synthetic data To evaluate the quality and efficacy of any algorithm it is important to compare its results with the results achieved with alternative algorithms solving the same problem. However, in the case of human pose estimation the amount of evaluation data with known ground truth has been shown to be fairly limited.

As an alternative to using motion sequences recorded with real cameras for evaluation, a toolset for generating synthetic motion data was proposed. The toolset simplifies the process of generating silhouette images from an arbitrary number of cameras based on the over 2600 real sequences found in the CMU motion capture database. In addition to the silhouettes, ground truth positions for 30 joints in the human body are produced. A tool for analysing and plotting errors against the ground truth is included. The toolset with source code is distributed with a permissive licence to encourage experimentation and comparative evaluation.

Skeletonisation using GPGPU The curve-skeleton is a compact 1D line-like representation of an object that preserves both topology and geometrical information. For a volumetric representation of the human body, the corresponding curve-skeleton is a reasonable approximation of the subject's skeletal structure.

In order to fulfil the requirement of real-time performance, a skeletonisation algorithm based on critical kernels was implemented in a GPGPU framework. The fully parallel thinning algorithm can process each voxel independently, a characteristic that allows the implementation to utilise the stream processing capabilities of the GPU.
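The sketch below illustrates only this data-parallel structure: every voxel is classified from the previous iteration's volume, independently of all other voxels, so each test maps onto one GPU thread. The deletion rule shown (border voxels with more than two object neighbours) is a crude placeholder, not the critical-kernel test of CK3, which is what actually guarantees topology preservation:

    import numpy as np
    from scipy import ndimage

    def thinning_pass(volume):
        """One fully parallel pass over a binary voxel volume (bool array)."""
        counts = ndimage.convolve(volume.astype(np.uint8),
                                  np.ones((3, 3, 3), np.uint8),
                                  mode="constant") - volume.astype(np.uint8)
        border = volume & ~ndimage.binary_erosion(volume)
        # Placeholder deletion rule; CK3 applies the critical-kernel test here.
        return volume & ~(border & (counts > 2))

    def skeletonise(volume):
        """Iterate the parallel pass until no voxel changes."""
        vol = volume.astype(bool)
        while True:
            nxt = thinning_pass(vol)
            if (nxt == vol).all():
                return nxt
            vol = nxt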

The skeletonisation algorithm was compared to another state-of-the-art thinning algorithm. The other algorithm was faster, but estimates of the subject's stature were more accurate, and the overall 3D error when estimating the positions of the extremities was lower, when the skeletons produced with the proposed algorithm were used.

Pose estimation from skeleton data An algorithm for establishing the pose of a subject from skeleton data was presented. The algorithm consists of building a tree structure from the skeleton voxels, pruning spurious branches from the tree, and segmenting the remaining branches into body parts.

Producing the pose tree can be done without any temporal information, and consequently each frame of a motion sequence can be processed independently. As a result the pose estimation procedure is initialised automatically, and the algorithm can recover from erroneous estimates, thus overcoming some of the limitations of tracking.
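The tree-building step can be summarised as a breadth-first traversal over the 26-connected skeleton voxels; the leaves of the resulting tree are candidate extremities. A minimal sketch, where the voxel representation and the choice of root are assumptions:

    from collections import deque
    from itertools import product

    OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

    def build_tree(skeleton_voxels, root):
        """Breadth-first traversal of 26-connected skeleton voxels,
        returning a child -> parent map that encodes the tree."""
        voxels, parent = set(skeleton_voxels), {root: None}
        queue = deque([root])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in OFFSETS:
                n = (x + dx, y + dy, z + dz)
                if n in voxels and n not in parent:
                    parent[n] = (x, y, z)
                    queue.append(n)
        return parent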

Using a range of real and synthetic motion sequences with different degrees of complexity, the approach was shown to produce good results at real-time frame rates.

Fitting of articulated models to skeleton data Estimating the positions of the joints of a subject can be achieved by fitting an articulated model to skeleton data. A novel method based on constrained dynamics for fitting a model to the data was proposed.

An inverse kinematics algorithm typically allows control of only the end point of a kinematic chain. The SHAKE constraint algorithm was extended to allow control of the intermediate joints in the kinematic chain, thus enabling fitting of all the joints in the skeletal model to skeleton data. To reduce the computational cost of the algorithm, an intermediate step of fitting straight line segments to identified body parts in the skeleton data was employed. The method attained real-time performance on a range of motion sequences of varying complexity.
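At its core, SHAKE (Ryckaert et al., 1977) repeatedly projects pairs of constrained points back onto their fixed separations, which for a skeletal model means restoring the bone lengths. The sketch below shows only this basic distance-constraint relaxation with equal weights; the extension that controls intermediate joints, and the line-segment fitting step, are described in paper 6 and omitted here:

    import numpy as np

    def shake(positions, bones, iterations=50, tol=1e-6):
        """Iteratively enforce fixed bone lengths on joint positions.
        positions: (n_joints, 3) float array, modified in place.
        bones: list of (i, j, rest_length) joint-index pairs with lengths."""
        for _ in range(iterations):
            worst = 0.0
            for i, j, rest in bones:
                delta = positions[j] - positions[i]
                dist = np.linalg.norm(delta)
                if dist == 0.0:
                    continue
                error = dist - rest
                worst = max(worst, abs(error))
                correction = 0.5 * error * delta / dist
                positions[i] += correction   # move each endpoint half-way
                positions[j] -= correction
            if worst < tol:
                break
        return positions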

In addition to the four main contributions, the research effort resulted in two lesser contributions:

Simplified camera calibration A semi-automatic camera calibration tool that simplifies calibration using a checkerboard pattern was introduced. Both intrinsic and extrinsic calibration parameters could be estimated from sets of images captured live, with automatic detection of the calibration pattern and immediate feedback to the user about the results. The source code is freely available under a permissive licence.

View-invariant gait recognition A discussion on how the proposed pose estimation algorithm can be used in a gait recognition framework was presented. The discussion outlined a hybrid machine learning solution where dynamic gait features extracted from the estimated poses were used to train HMMs, and combined with static subject features in a CBR system for classification. This combination of HMMs and CBR for view-independent gait recognition is novel, but no experimental results have been produced to confirm or disprove its efficacy.

8.2 Future work

The following directions for further study have been identified:

Visual hull construction A simple algorithm that produces volumetric visual hulls has been employed so far. For every voxel in a regular grid, the voxel's centre is projected into each image plane and it is determined whether the projected point is inside or outside the silhouette. Voxels whose projected points are inside all the silhouettes are kept and the rest are discarded. This procedure lacks the robustness associated with more advanced techniques, but its simplicity makes it attractive for implementation on graphics hardware.
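This carving test translates almost directly into vectorised code. A sketch, assuming 3x4 camera projection matrices and voxel centres given in homogeneous world coordinates (all names are illustrative):

    import numpy as np

    def visual_hull(voxel_centres, projections, silhouettes):
        """Keep a voxel only if its centre projects inside every silhouette.
        voxel_centres: (n, 4) homogeneous world coordinates.
        projections:   list of 3x4 camera projection matrices.
        silhouettes:   list of binary (H, W) images, one per camera."""
        keep = np.ones(len(voxel_centres), dtype=bool)
        for P, sil in zip(projections, silhouettes):
            p = voxel_centres @ P.T                     # project to image plane
            u = np.round(p[:, 0] / p[:, 2]).astype(int)
            v = np.round(p[:, 1] / p[:, 2]).astype(int)
            h, w = sil.shape
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            inside[inside] = sil[v[inside], u[inside]] > 0
            keep &= inside                              # must be in all views
        return keep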

Using a more robust visual hull construction algorithm could to some extent alleviate the sensitivity to noise in the silhouette images identified in section 7.3.2. Possible alternatives are: a) safe hulls (Miller and Hilton, 2007); b) shape from inconsistent silhouette (Landabaso et al., 2008); c) shape from incomplete silhouettes (Haro and Pardas, 2010); d) joint background subtraction and visual hull construction (Gallego et al., 2011); and e) shape from silhouette consensus (Haro, 2012).


Improved skeletal model While using a model with mean anthropometric ratios is a reasonable approximation, there are certainly individual differences that cannot be captured by this approach. Extending the model to learn subject-specific features would help capture subtle differences between performers, and could result in better model fitting results. Several methods for training subject-specific models exist, for instance (Mikic et al., 2003) and (Zhang et al., 2011b).

The current model does not include joint angle limits. As discussed in section 7.3.2 this can lead to illegal joint configurations during model fitting. Using a model representation that includes joint angle limits, such as the one presented by (Guerra-Filho, 2012), would prevent the model fitting from resulting in illegal configurations and improve accuracy.

Hybrid solution with tracking The problem with uneven motion discussed in section 7.3.2 could be ameliorated by combining the proposed method with a tracking framework in a hybrid pose estimation solution. Tracking methods can produce smooth transitions between frames, but are susceptible to losing the track if no recovery mechanism is in place. Using the proposed method for automatic initialisation and recovery in a tracking framework would utilise the strengths of both techniques, and overcome some of the weaknesses they suffer from when considered separately. A similar idea was proposed by (Raynal, 2010).

Gait recognition The possibility of using the proposed algorithms in a gait recognition system is interesting and should be investigated further. The work presented in paper 7 is theoretical in nature, and no experimental results that verify its validity were given. This is partly caused by the limited availability of multi-view gait data.

An overview of datasets captured with multiple views is given by (Makihara et al., 2012). Several datasets with up to eleven views are available, but no camera calibration parameters are known. It is possible that the camera parameters could be estimated directly from the image data using a self-calibration technique, but that has not been attempted. There is a need to build up a new dataset with gait data captured with a calibrated multi-camera system.

Use of synthetic data The automatic skinning method employed in Blender (Baran and Popovic, 2007) can in some cases lead to poor results because it is dependent on the armature and mesh being aligned before skinning. For some sequences this causes the limbs of the mesh to intersect during the motion. Better skinning algorithms should be investigated, for instance the one presented by (Rau and Brunnett, 2012).

The toolset currently only has one avatar, a hairless and unclothed male mesh. Loose-fitting clothing and long hair can have a significant impact on the shape of the reconstructed visual hull, and on the results of further processing. The range of available avatars should be extended with representatives of both genders, with different types of clothing and hair, and with different body shapes.


Bibliography

[pages where reference was cited]

A. Aamodt and E. Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 7(1):39–59, 1994. [48]

J. K. Aggarwal and Q. Cai. Human Motion Analysis: A Review. In Proceedings of the IEEE Nonrigid and Articulated Motion Workshop, pages 90–102, 1997. [16]

J. K. Aggarwal and S. Park. Human motion: Modeling and recognition of actions and interactions. In Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization, and Transmission, pages 640–647. IEEE, 2004. [17]

J. K. Aggarwal, Q. Cai, W. Liao, and B. Sabata. Articulated and elastic non-rigid motion: a review. In Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects, pages 2–14. IEEE Comput. Soc. Press, 1994. [16]

J. K. Aggarwal, Q. Cai, W. Liao, and B. Sabata. Nonrigid Motion Analysis: Articulated and Elastic Motion. Computer Vision and Image Understanding, 70(2):142–156, 1998. [16]

M. Alcoverro, J. R. Casas, and M. Pardas. Skeleton and Shape Adjustment and Tracking in Multicamera Environments. In Proceedings of AMDO 2010 (LNCS), pages 88–97, 2010. [20]

M. Andriluka and L. Sigal. Human Context: Modeling Human-Human Interactions for Monocular 3D Pose Estimation. In Proceedings of the 7th International Conference on Articulated Motion and Deformable Objects, pages 260–272, 2012. [22]

M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1014–1021, 2009. [22]


M. Andriluka, L. Sigal, and M. J. Black. Benchmark Datasets for Pose Estimation and Tracking. In T. B. Moeslund, A. Hilton, V. Kruger, and L. Sigal, editors, Visual Analysis of Humans, chapter 13, pages 253–275. Springer London, 2011. [30]

Animazoo. IGS. http://www.animazoo.com (visited 2013-03-13). [13]

Ascension. Motion Star. http://www.ascension-tech.com (visited 2013-03-13). [13]

P. Azad, A. Ude, T. Asfour, G. Cheng, and R. Dillmann. Image-Based Markerless 3D Human Motion Capture using Multiple Cues. In Proceedings of the International Workshop on Vision Based Human-Robot Interaction, pages 1–16, 2006. [14]

A. Baak, T. Helten, M. Muller, G. Pons-Moll, B. Rosenhahn, and H.-P. Seidel. Analyzing and Evaluating Markerless Motion Tracking Using Inertial Sensors. In Proceedings of the 3rd Workshop on Human Motion (European Conference on Computer Vision), pages 137–150, 2010. [30]

J. Bandouch, O. C. Jenkins, and M. Beetz. A Self-Training Approach for Visual Tracking and Recognition of Complex Human Activity Patterns. International Journal of Computer Vision, 99(2):166–189, 2012. [20]

I. Baran and J. Popovic. Automatic Rigging and Animation of 3D Characters. ACM Transactions on Graphics, 26(3):72:1–8, 2007. [82]

S. Barris and C. Button. A review of vision-based motion analysis in sport. Sports Medicine, 38(12):1025–1043, 2008. [18]

Y. Benezeth, P. Jodoin, B. Emile, H. Laurent, and C. Rosenberger. Review and evaluation of commonly-implemented background subtraction algorithms. In Proceedings of the 19th International Conference on Pattern Recognition, pages 1–4, 2008. [33]

Y. Benezeth, P.-M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger. Comparative study of background subtraction algorithms. Journal of Electronic Imaging, 19(3):033003:1–12, 2010. [33]

M. V. D. Bergh, E. Koller-Meier, and R. Kehl. Real-Time 3D Body Pose Estimation. In Multi-Camera Networks, chapter 14, pages 335–361. Elsevier, 2009. [25, 27]

G. Bertrand and M. Couprie. A New 3D Parallel Thinning Scheme Based on Critical Kernels. Discrete Geometry for Computer Imagery (LNCS), 4245:580–591, 2006. [44]

G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, 2008. [40]


G. J. Brostow, I. Essa, D. Steedly, and V. Kwatra. Novel Skeletal Representation for Articulated Creatures. Computer Vision - ECCV (LNCS), 3023:66–78, 2004. [24]

H. Buxton. Learning and understanding dynamic scene activity: a review. Image and Vision Computing, 21(1):125–136, 2003. [17]

F. Caillette and T. Howard. Real-Time Markerless Human Body Tracking Using Colored Voxels and 3-D Blobs. In Proceedings of the 3rd IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 266–267, 2004a. [26]

F. Caillette and T. Howard. Real-Time Markerless Human Body Tracking with Multi-View 3-D Voxel Reconstruction. In British Machine Vision Conference, 2004b. [26]

F. Caillette, A. Galata, and T. Howard. Real-Time 3-D Human Body Tracking using Variable Length Markov Models. In Proceedings of the British Machine Vision Conference, pages 469–478, 2005. [26]

F. Caillette, A. Galata, and T. Howard. Real-time 3-D human body tracking using learnt models of behaviour. Computer Vision and Image Understanding, 109(2):112–125, 2008. [25, 26]

C. Canton-Ferrer, J. R. Casas, and M. Pardas. Human motion capture using scalable body models. Computer Vision and Image Understanding, 115(10):1363–1374, 2011. [20]

M. Caon, J. Tscherrig, Y. Yue, O. A. Khaled, and E. Mugellini. Extending the Interaction Area for View-Invariant 3D Gesture Recognition. In Proceedings of the 3rd International Conference on Image Processing Theory, Tools and Applications, pages 293–298, 2012. [29]

J. Carranza, C. Theobalt, M. A. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. ACM Transactions on Graphics, 22(3):569–577, 2003. [77]

C. Cedras and M. Shah. Motion-based recognition: a survey. Image and Vision Computing, 13(2):129–155, 1995. [17]

J. Chen and Q. Ji. Efficient 3D Upper Body Tracking with Self-Occlusions. In Proceedings of the 20th International Conference on Pattern Recognition, pages 3636–3639, 2010. [23]

K.-M. G. Cheung, T. Kanade, J.-Y. Bouguet, and M. Holler. A real time system for robust 3D voxel reconstruction of human motions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 714–720, 2000. [24, 25]


C.-W. Chu, O. C. Jenkins, and M. J. Mataric. Markerless Kinematic Model and Motion Capture from Volume Sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 475–482, 2003. [24]

A. Clapes, M. Reyes, and S. Escalera. User Identification and Object Recognition in Clutter Scenes Based on RGB-Depth Analysis. In Proceedings of the 7th International Conference on Articulated Motion and Deformable Objects, pages 1–11, 2012. [15]

S. Corazza, L. Mundermann, E. Gambaretto, G. Ferrigno, and T. P. Andriacchi. Markerless Motion Capture through Visual Hull, Articulated ICP and Subject Specific Model Generation. International Journal of Computer Vision, 87(1-2):156–169, 2010. [23, 35, 70, 73]

J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 126–133, 2000. [20]

P. Dymarski, editor. Hidden Markov Models, Theory and Applications. InTech, 2011. [48]

M. Eichner and V. Ferrari. Human Pose Co-Estimation and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2282–2288, 2012. [22]

A. Elhayek, C. Stoll, N. Hasler, K. I. Kim, H. Seidel, and C. Theobalt. Spatio-temporal Motion Tracking with Unsynchronized Cameras. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1870–1877, 2012. [23, 34]

P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial Structures for Object Recognition. International Journal of Computer Vision, 61(1):55–79, 2005. [22]

FLIR. Advanced Thermal Solutions for Research & Science. http://www.flir.com (visited 2013-03-13). [14]

R. Gade, A. Jørgensen, and T. B. Moeslund. Occupancy Analysis of Sports Arenas Using Thermal Imaging. In Proceedings of the International Conference on Computer Vision Theory and Applications, pages 277–283, 2012. [15]

J. Gall, J. Potthoff, C. Schnorr, B. Rosenhahn, and H.-P. Seidel. Interacting and Annealing Particle Filters: Mathematics and a Recipe for Applications. Journal of Mathematical Imaging and Vision, 28(1):1–18, 2007. [20]

J. Gall, B. Rosenhahn, and H. Seidel. An Introduction to Interacting Simulated Annealing. Human Motion, pages 319–345, 2008. [20]


J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1746–1753, 2009. [20]

J. Gall, B. Rosenhahn, T. Brox, and H.-P. Seidel. Optimization and Filtering for Human Motion Capture. International Journal of Computer Vision, 87(1-2):75–92, 2010. [20]

J. Gallego, J. Salvador, J. R. Casas, and M. Pardas. Joint Multi-View Foreground Segmentation and 3D Reconstruction with Tolerance Loop. In Proceedings of the 18th IEEE International Conference on Image Processing, pages 997–1000, 2011. [81]

D. Gavrila. The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding, 73(1):82–98, 1999. [16]

M. Germann, T. Popa, R. Ziegler, R. Keiser, and M. Gross. Space-time Body Pose Estimation in Uncontrolled Environments. In Proceedings of the International Conference on 3D Imaging, Modeling, Processing, Visualization, Transmission, pages 244–251, 2011. [21]

N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas. The i3DPost multi-view and 3D human action/interaction database. In Proceedings of the Conference for Visual Media Production, pages 159–168, 2009. [44, 45, 47, 49]

G. Guerra-Filho. A General Motion Representation - Exploring the Intrinsic Viewpoint of a Motion. In Proceedings of the International Conference on Computer Graphics Theory and Applications, pages 347–352, 2012. [82]

A. Gupta, A. Mittal, and L. S. Davis. Constraint integration for efficient multiview pose estimation with self-occlusions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):493–506, 2008. [23]

G. Haro. Shape from Silhouette Consensus. Pattern Recognition, 45(9):3231–3244, 2012. [81]

G. Haro and M. Pardas. Shape from incomplete silhouettes based on the reprojection error. Image and Vision Computing, 28(9):1354–1368, 2010. [81]

M. Hirai, N. Ukita, and M. Kidode. Real-Time Pose Regression with Fast Volume Descriptor Computation. In Proceedings of the 20th International Conference on Pattern Recognition, pages 1852–1855, 2010. [25, 27]

M. Hofmann and D. M. Gavrila. Single-Frame 3D Human Pose Recovery from Multiple Views. In Proceedings of the DAGM Symposium, pages 71–80, 2009. [21]


M. Hofmann and D. M. Gavrila. Multi-view 3D Human Pose Estimation in Complex Environment. International Journal of Computer Vision, 96(1):103–124, 2011. [21]

B. Holt and R. Bowden. Static Pose Estimation from Depth Images using Random Regression Forests and Hough Voting. In Proceedings of the 7th International Conference on Computer Vision Theory and Applications, pages 557–564, 2012. [29]

T. Horprasert, D. Harwood, and L. Davis. A robust background subtraction and shadow detection. In Proceedings of the Asian Conference on Computer Vision, pages 983–988, 2000. [41]

S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley. Real-time Body Tracking Using a Gaussian Process Latent Variable Model. In Proceedings of the 11th International Conference on Computer Vision, pages 1–8, 2007. [26]

W. Hu, T. Tan, L. Wang, and S. Maybank. A Survey on Visual Surveillance of Object Motion and Behaviors. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 34(3):334–352, 2004. [18]

F. Huo and E. A. Hendriks. Multiple people tracking and pose estimation with occlusion estimation. Computer Vision and Image Understanding, 116(5):634–647, 2012. [25, 28]

INRIA Perception Group. 4D-repository children dataset. http://4drepository.inrialpes.fr (visited 2013-03-13). [64]

M. Isard and A. Blake. CONDENSATION—Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision, 29(1):5–28, 1998. [20]

S. Ivekovic and E. Trucco. Human Body Pose Estimation with PSO. In Proceedings of the IEEE World Congress on Computational Intelligence, pages 1256–1263, 2006. [20]

A. Jaimes and N. Sebe. Multimodal human-computer interaction: A survey. In Proceedings of the IEEE International Workshop on Human Computer Interaction, pages 116–134, 2005. [17]

X. Ji and H. Liu. Advances in View-Invariant Human Motion Analysis: A Review. IEEE Transactions on Systems, Man, and Cybernetics, 40(1):13–24, 2010. [3, 17]

X. Ji, H. Liu, Y. Li, and D. Brown. Visual-based view-invariant human motion analysis: A Review. Knowledge-Based Intelligent Information and Engineering Systems (LNCS), 5177:741–748, 2008. [17]


V. John, S. Ivekovic, and E. Trucco. Articulated Human Motion Tracking with HPSO. In Proceedings of the International Conference on Computer Vision Theory and Applications, pages 531–538, 2009. [20]

V. John, E. Trucco, and S. Ivekovic. Markerless human articulated tracking using hierarchical particle swarm optimisation. Image and Vision Computing, 28(11):1530–1547, 2010. [20]

K. Kim, T. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3):172–185, 2005. [41, 58]

V. Kruger, D. Kragic, A. Ude, and C. Geib. The meaning of action: a review on action recognition and mapping. Advanced Robotics, 21(13):1473–1501, 2007. [17]

S. M. Kuo, B. H. Lee, and W. Tian. Introduction to Real-Time Digital Signal Processing. In Real-Time Digital Signal Processing: Implementations and Applications, chapter 1, pages 1–48. John Wiley & Sons, Ltd., 2nd edition, 2006. [35]

B. Kwolek, T. Krzeszowski, and K. Wojciechowski. Real-Time Multi-view Human Motion Tracking Using 3D Model and Latency Tolerant Parallel Particle Swarm Optimization. pages 169–180, 2011a. [29]

B. Kwolek, T. Krzeszowski, and K. Wojciechowski. Swarm Intelligence Based Searching Schemes for Articulated 3D Body Motion Tracking. Advanced Concepts for Intelligent Vision Systems (LNCS), pages 115–126, 2011b. [21]

B. Kwolek, T. Krzeszowski, A. Gagalowicz, K. Wojciechowski, and H. Josinski. Real-Time Multi-view Human Motion Tracking Using Particle Swarm Optimization with Resampling. In Proceedings of the 7th International Conference on Articulated Motion and Deformable Objects, pages 92–101, 2012. [25, 29, 35]

J. Landabaso, M. Pardas, and J. Casas. Shape from inconsistent silhouette. Computer Vision and Image Understanding, 112(2):210–224, 2008. [81]

H. Li, S. Lin, Y. Zhang, and K. Tao. Automatic Video-based Analysis of Athlete Action. In Proceedings of the 14th International Conference on Image Analysis and Processing, pages 205–210, 2007. [16]

Y. Lu, D. Xu, L. Wang, R. Hartley, and H. Li. Illumination invariant sequential filtering human tracking. In Proceedings of the International Conference on Machine Learning and Cybernetics, volume 4, pages 2133–2138, 2010. [20]

J. Luck, D. Small, and C. Q. Little. Real-Time Tracking of Articulated Human Models Using a 3D Shape-from-Silhouette Method. Robot Vision (LNCS), 1998:19–26, 2001. [24]


J. P. Luck, C. Debrunner, W. Hoff, Q. He, and D. E. Small. Development and analysis of a real-time human motion tracking system. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, pages 196–202, 2002. [24, 25]

N. Lynnerup and J. Vedel. Person Identification by Gait Analysis and Photogrammetry. Journal of Forensic Sciences, 50(1):112–118, 2005. [15]

L. Maddalena and A. Petrosino. A self-organizing approach to background subtraction for visual surveillance applications. IEEE Transactions on Image Processing, 17(7):1168–1177, 2008. [41]

L. Maddalena and A. Petrosino. A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection. Neural Computing and Applications, 19(2):179–186, 2009. [41]

Y. Makihara, H. Mannami, A. Tsuji, M. A. Hossain, K. Sugiura, A. Mori, and Y. Yagi. The OU-ISIR Gait Database Comprising the Treadmill Dataset. IPSJ Transactions on Computer Vision and Applications, 4:53–62, 2012. [82]

C. Menier, E. Boyer, and B. Raffin. 3D Skeleton-Based Body Pose Recovery. In Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, pages 389–396, 2006. [25, 26]

B. Michoud, E. Guillou, and S. Bouakaz. Real-Time and Markerless 3D Human Motion Capture Using Multiple Views. Human Motion - Understanding, Modeling, Capture and Animation (LNCS), 4814:88–103, 2007a. [4, 27, 70]

B. Michoud, E. Guillou, H. Briceno, and S. Bouakaz. Real-time and marker-free 3D motion capture for home entertainment oriented applications. In Proceedings of the 8th Asian Conference on Computer Vision - Volume Part I, pages 678–687, 2007b. [27]

B. Michoud, E. Guillou, H. Briceno, and S. Bouakaz. Real-Time Marker-free Motion Capture from multiple cameras. In Proceedings of the 11th International Conference on Computer Vision, pages 1–7, 2007c. [25, 27, 28]

I. Mikic, M. Trivedi, E. Hunter, and P. Cosman. Human Body Model Acquisition and Tracking Using Voxel Data. International Journal of Computer Vision, 53(3):199–223, 2003. [23, 82]

G. Miller and A. Hilton. Safe hulls. In Proceedings of the 4th European Conference on Visual Media Production (CVMP 2007), pages 1–8, 2007. [81]

T. B. Moeslund and E. Granum. A Survey of Computer Vision-Based Human Motion Capture. Computer Vision and Image Understanding, 81:231–268, 2001. [14, 16, 33, 34]


T. B. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104:90–126, 2006. [3, 4, 16]

D. Moschini and A. Fusiello. Tracking Stick Figures with Hierarchical Articulated ICP. In Proceedings of the First International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences, pages 61–68, 2008. [23]

D. Moschini and A. Fusiello. Tracking Human Motion with Multiple Cameras Using an Articulated Model. Computer Vision/Computer Graphics Collaboration Techniques (LNCS), 5496:1–12, 2009. [23]

Motion Analysis. Raptor. http://www.motionanalysis.com (visited 2013-03-13). [13]

L. Mundermann, S. Corazza, A. M. Chaudhari, E. J. Alexander, and T. P. Andriacchi. Most favorable camera configuration for a shape-from-silhouette markerless motion capture system for biomechanical analysis. In Proceedings of SPIE-IS&T Electronic Imaging, pages 278–287, 2005. [34]

L. Mundermann, S. Corazza, and T. P. Andriacchi. Markerless human motion capture through visual hull and articulated ICP. In Proceedings of the NIPS Workshop on Evaluation of Articulated Human Motion and Pose Estimation, pages 1–5, 2006a. [73]

L. Mundermann, S. Corazza, and T. P. Andriacchi. The evolution of methods for the capture of human movement leading to markerless motion capture for biomechanical applications. Journal of NeuroEngineering and Rehabilitation, 3:1–11, 2006b. [18]

L. Mundermann, S. Corazza, A. M. Chaudhari, T. P. Andriacchi, A. Sundaresan, and R. Chellappa. Measuring human movement for biomechanical applications using markerless motion capture. In Three-Dimensional Image Capture and Applications VII (Proc. of the SPIE), pages 246–255, 2006c. [13]

S. Obdrzalek, G. Kurillo, J. Han, T. Abresch, and R. Bajcsy. Real-time human pose detection and tracking for tele-rehabilitation in virtual reality. Studies in Health Technology and Informatics, 173:320–324, 2012. [15]

R. Okada, B. Stenger, T. Ike, and N. Kondoh. Virtual Fashion Show Using Real-time Markerless Motion Capture. In Proceedings of the Asian Conference on Computer Vision, pages 801–810, 2006. [25, 26]

Organic Motion. OpenStage. www.organicmotion.com (visited 2013-03-13). [14]


H. W. Pan, C. Ai, and C. M. Gao. A New Approach for Body Pose Recovery. In Proceedings of VRCAI, volume 1, pages 243–248, 2011. [25, 28]

S. Pierard and M. V. Droogenbroeck. Estimation of Human Orientation Based on Silhouettes and Machine Learning Principles. In Proceedings of ICPRAM 2012, pages 51–60, 2012. [21]

G. Pons-Moll, A. Baak, T. Helten, M. Muller, H.-P. Seidel, and B. Rosenhahn. Multisensor-fusion for 3D full-body human motion capture. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 663–670, 2010. [30]

R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1-2):4–18, 2007. [17]

M. Porta. Vision-based user interfaces: methods and applications. International Journal of Human-Computer Studies, 57:27–73, 2002. [17]

N. Pourdamghani, H. R. Rabiee, F. Faghri, and M. H. Rohban. Graph Based Semi-Supervised Human Pose Estimation: When The Output Space Comes to Help. Pattern Recognition Letters, pages 1529–1535, 2012. [21]

D. M. W. Powers. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011. [36]

M. Qiao, J. Cheng, and W. Zhao. Model-based Human Pose Estimation with Hierarchical ICP from Single Depth Images. Advances in Automation and Robotics, Vol. 2, 123:27–35, 2012. [29]

Qualisys. Oqus. www.qualisys.com (visited 2013-03-13). [13]

C. Rau and G. Brunnett. Anatomically Correct Adaption of Kinematic Skeletons to Virtual Humans. In P. Richard, M. Kraus, R. S. Laramee, and J. Braz, editors, Proceedings of the International Conference on Computer Graphics Theory and Applications, pages 341–346. SciTePress, 2012. [82]

B. Raynal. Applications of Digital Topology for Real-Time Markerless Motion Capture. PhD thesis, Université Paris-Est, 2010. [69, 82]

B. Raynal and M. Couprie. Isthmus-Based 6-Directional Parallel Thinning Algorithms. In Proceedings of DGCI, pages 175–186, 2011. [44, 49, 69, 78]

B. Raynal, M. Couprie, and V. Nozick. Generic Initialization for Motion Capture from 3D Shape. Image Analysis and Recognition (LNCS), 6111:306–315, 2010. [25, 28]


D. Roetenberg, H. Luinge, and P. Slycke. Xsens MVN: Full 6DOF Human Motion Tracking Using Miniature Inertial Sensors. White paper, 2009. [13]

G. Rogez, J. Rihan, C. Orrite-Urunuela, and P. H. S. Torr. Fast Human Pose Detection Using Randomized Hierarchical Cascades of Rejectors. International Journal of Computer Vision, 99(1):25–52, 2012. [22]

J. Ryckaert, G. Ciccotti, and H. Berendsen. Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. Journal of Computational Physics, 23(3):327–341, 1977. [46]

D. Scharstein and R. Szeliski. High-Accuracy Stereo Depth Maps Using Structured Light. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 195–202, 2003. [14]

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time Human Pose Recognition in Parts from Single Depth Images. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1297–1304, 2011. [29, 36]

L. Sigal. Articulated Pose Estimation and Tracking: Introduction. In T. B. Moeslund, A. Hilton, V. Kruger, and L. Sigal, editors, Visual Analysis of Humans, chapter 8, pages 131–137. Springer London, 2011. [19]

L. Sigal and M. Black. HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion. Technical report, Department of Computer Science, Brown University, 2006. [30]

L. Sigal and M. J. Black. Guest Editorial: State of the Art in Image- and Video-Based Human Pose and Motion Estimation. International Journal of Computer Vision, 87(1-3):1–3, 2010. [3, 70]

L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking Loose-limbed People. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 421–428, 2004. [23]

L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. International Journal of Computer Vision, 87(1-2):4–27, 2010. [30]

C. Sminchisescu, A. Kanaujia, and D. N. Metaxas. BM3E: Discriminative Density Propagation for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):2030–2044, 2007. [21]

SoftKinetic. DepthSense. www.softkinetic.com (visited 2013-03-13). [14]


J. Starck and A. Hilton. Surface Capture for Performance-Based Animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007. [62]

J. Starck, A. Maki, S. Nobuhara, A. Hilton, and T. Matsuyama. The Multiple-Camera 3-D Production Studio. IEEE Transactions on Circuits and Systems for Video Technology, 19(6):856–869, 2009. [34]

T. Starner, B. Leibe, D. Minnen, T. Westyn, A. Hurst, and J. Weeks. The perceptive workbench: Computer-vision-based gesture tracking, object tracking, and 3D reconstruction for augmented desks. Machine Vision and Applications, 14(1):59–71, 2003. [35]

M. Straka, S. Hauswiesner, M. Ruther, and H. Bischof. Skeletal Graph Based Human Pose Estimation in Real-Time. In Proceedings of the British Machine Vision Conference, pages 1–12, 2011. [25, 28, 36, 69, 78]

M. Sun, P. Kohli, and J. Shotton. Conditional Regression Forests for Human Pose Estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 3394–3401, 2012. [29]

A. Sundaresan and R. Chellappa. Acquisition of Articulated Human Body Models Using Multiple Cameras. pages 78–89, 2006. [23]

A. Sundaresan and R. Chellappa. Model-driven segmentation of articulating humans in Laplacian Eigenspace. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1771–1785, 2008. [23]

H. Tanaka, A. Nakazawa, and H. Takemura. Human Pose Estimation from Volume Data and Topological Graph Database. Computer Vision - ACCV 2007 (LNCS), 4843:618–627, 2007. [21]

T. Tangkuampien and D. Suter. Real-Time Human Pose Inference using Kernel Principal Component Pre-image Approximations. In Proceedings of the British Machine Vision Conference 2006, pages 1–10, 2006. [25, 26]

C. Theobalt, E. de Aguiar, M. A. Magnor, H. Theisel, and H.-P. Seidel. Marker-free kinematic skeleton estimation from sequences of volume data. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, page 57, New York, New York, USA, 2004. ACM Press. [24]

T. Tung and T. Matsuyama. Topology dictionary with Markov model for 3D video content-based skimming and description. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 469–476, 2009. [16]

V. K. Vaishnavi and W. J. Kuechler. Design Science Research Methods and Patterns. Auerbach Publications, 2007. [5, 6, 7]


M. Vasconcelos and J. Tavares. Human Motion Analysis: Methodologies and Applications. In Proceedings of the 8th International Symposium on Computer Methods in Biomechanics and Biomedical Engineering, pages 1–6, 2008. [17]

Vicon. T-Series. http://www.vicon.com (visited 2013-03-13). [13]

J. J. Wang and S. Singh. Video analysis of human dynamics—a survey. Real-Time Imaging, 9(5):321–346, 2003. [16, 17]

C. Ware and R. Balakrishnan. Reaching for Objects in VR Displays: Lag and Frame Rate. ACM Transactions on Computer-Human Interaction, 1(4):331–356, 1994. [35]

D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257, 2006. [58]

C. Wu and H. Aghajan. Real-Time Human Pose Estimation: A Case Study in Algorithm Design for Smart Camera Networks. Proceedings of the IEEE, 96(10):1715–1732, 2008. [27]

C. Wu, H. Aghajan, and R. Kleihorst. Real-Time Human Posture Reconstruction in Wireless Smart Camera Networks. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 321–331, 2008. [25, 27]

Y. Wu and T. S. Huang. Vision-Based Gesture Recognition: A Review. Gesture-Based Communication in Human-Computer Interaction (LNCS), 1739:103–115, 1999. [17]

B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu. Crowd analysis: a survey. Machine Vision and Applications, 19(5-6):345–357, 2008. [18]

X. Zhang, W. Hu, X. Wang, Y. Kong, N. Xie, H. Wang, H. Ling, and S. Maybank. A swarm intelligence based searching strategy for articulated 3D human body tracking. In Proceedings of the Conference on Computer Vision and Pattern Recognition - Workshops, pages 45–50, 2010. [21]

Z. Zhang and H. Seah. Real-time Tracking of Unconstrained Full-Body Motion using Niching Swarm Filtering Combined with Local Optimization. In Proceedings of Computer Vision and Pattern Recognition - Workshops, pages 23–28, 2011. [25, 28]

Z. Zhang, H. S. Seah, C. K. Quah, A. Ong, and K. Jabbar. A Multiple Camera System with Real-Time Volume Reconstruction for Articulated Skeleton. Advances in Multimedia Modeling (LNCS), 6523:182–192, 2011a. [28]


Z. Zhang, H. S. Seah, C. K. Quah, and J. Sun. A Markerless Motion Capture System with Automatic Subject-Specific Body Model Acquisition and Robust Pose Tracking from 3D Data. In Proceedings of the International Conference on Image Processing, pages 525–528, 2011b. [28, 82]

S. Zuffi, O. Freifeld, and M. J. Black. From Pictorial Structures to deformable structures. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 3546–3553, 2012. [22]


Glossary and List of Acronyms

AMDO The Conference on Articulated Motion and Deformable Objects. 9, 46

Anthropometry Refers to the measurement of the human individual (height, weight, volume, lengths of bones). 20, 45–47, 64, 66, 73, 74, 82

Biometrics Identification of humans by characteristics or traits, such as voice, DNA, iris, hand print, or gait. 15, 48

CBR Case-Based Reasoning. 10, 47, 81

CMU Carnegie Mellon University. 7, 43, 69, 79

CPU Central Processing Unit. 37

CUDA Compute Unified Device Architecture. 9, 41

Gait The manner in which a person walks. Can be used as a biometric. i, 3, 8, 10, 15, 24, 35, 48, 73, 81, 82

GPGPU General-Purpose computation on Graphics Processing Units. 7, 9, 37,41, 43, 44, 74, 80

GPU Graphics Processing Unit. i, 37, 49, 50, 71, 80

Graceful failure A characteristic of a system which is fault tolerant, where errors can be detected and handled when they occur. 37, 54

HMM Hidden Markov Model. 10, 47, 81

HPC-CVA The special session on High Performance Computing in Computer Vision Applications. 9, 43

IPTA The International Conference on Image Processing Theory, Tools and Applications. 9, 43


IQR Interquartile range. The difference between the upper and lower quartiles. 53

ISVC The International Symposium on Visual Computing. 9, 42

Kinematics A branch of classical mechanics that describes motion without considering the causes of the motion. 23, 24, 46, 80

Mahalanobis distance A statistical distance measure that takes correlation into account. 23

Monocular Method or algorithm that relies on a single view point. 3, 22, 30

Multi-modal user interface Human-computer interface that fuses the information from multiple input modalities such as voice and gestures to control the computer. 15, 17

NAIS The Norwegian Artificial Intelligence Symposium. 9, 41

NIK The Norwegian Informatics Conference. 8, 40

OpenCV An open-source, cross-platform library with common computer vision routines. 40

SCAI The Scandinavian Conference on Artificial Intelligence. 10, 47

Stereoscopic Method or algorithm that uses triangulated information from two view points. 3, 29

Stochastic Having a random probability distribution. 20, 27

VISAPP The International Conference on Computer Vision Theory and Applications. 9, 45


Part IV

Publications


Paper 1
Semi-automatic camera calibration

Authors: Rune H. Bakken, Bjørn G. Eilertsen, Gustavo U. Matus, Jan H. Nilsen.

Full title: Semi-automatic Camera Calibration Using Coplanar Control Points.

Published in: Proceedings of the Norwegian Informatics Conference (NIK 2009), T. Aalberg ed., 2009, Tapir Academic Press.

Copyright: © 2009 NIK and Tapir Academic Press.


Semi-automatic Camera Calibration Using Coplanar Control Points

Rune H. Bakken, Bjørn G. Eilertsen, Gustavo U. Matus, Jan H. Nilsen
Sør-Trøndelag University College

Abstract

Camera calibration is a prerequisite for computer vision and photogrammetry applications that infer knowledge about an object's shape and location from images. Many implementations of camera calibration methods are freely available on the Internet. We have reviewed eleven existing camera calibration tools within a feature oriented framework for comparison, and our findings show that achieving the required accuracy can be quite cumbersome. Automatic detection of control points and live capture of calibration images are two key features that would simplify the process. We have developed a new camera calibration tool, with automatic control point extraction and live capture. Our semi-automatic application achieves the same level of accuracy as one of the most widely used camera calibration tools in computer vision applications.

1 Introduction

In many computer vision and photogrammetry applications it is necessary to infer knowledge about an object's shape and location from image evidence. The process of capturing an image is a mapping from a three-dimensional space to a two-dimensional plane, so some information is lost. In order to recover that lost information, intimate knowledge about the mapping is needed. The mapping is represented by a model of the camera, and the process of estimating the parameters of the chosen camera model is called camera calibration.

It is common to divide the parameters into two groups, intrinsic and extrinsic. The intrinsic parameters include focal length and principal point, while the extrinsic parameters describe the camera's orientation and location in the world coordinate system. Both groups of parameters can be estimated through linear transformations. Additionally, there are non-linear distortions in the lens and from the camera manufacturing process that must be estimated to have a complete model of the camera.

Extensive research has been done [6, 11] to find accurate and robust methods for camera calibration. One class of techniques uses known control points on a 2D or 3D calibration object as input for the estimation. The term self-calibration is used for approaches that do not rely on known control points.

Many camera calibration methods have implementations freely available on the Internet. While some progress has been made in terms of ease-of-use and automation,

This paper was presented at the NIK-2009 conference; see http://www.nik.no/.


camera calibration is still a cumbersome process using existing implementations. Many solutions require that the control points have been extracted in advance; either a separate application for automatic extraction is needed, or the points must be manually identified by the user. Implementations with automatic control point extraction suffer from the problem that the algorithms fail to find points in some image configurations. This may result in poor accuracy because of a lack of input points, and the user has to go back and capture new calibration images to compensate. The solution to these problems would be a semi-automatic camera calibration application that, given a known calibration object, can capture images, detect control points and give the user immediate feedback on the quality of the images.

In this paper we present Calvin, a new camera calibration tool with automatic control point extraction and live capture of calibration images. Calvin is based on an existing calibration method using coplanar control points. We have compared Calvin with one of the most widely used calibration tools in computer vision applications, and while they yield similar accuracy, Calvin can produce results semi-automatically.

The paper is structured as follows: section 2 explores the theoretical background of camera calibration and discusses different types of calibrations, as well as proposed methods for calibrating cameras. In section 3 we discuss existing implementations that are freely available and compare them. Our new calibration implementation is presented in section 4, and in section 5 we outline some experiments that we have conducted with our implementation. In section 6 we sum up the paper by discussing the results of our experiments and point out some aspects of our implementation that need further work.

2 Background

As mentioned in section 1, camera calibration is the process of estimating the parameters of a camera model. We describe in more detail some camera models that are used and discuss some proposed methods for calibrating cameras.

Camera Models

We start by briefly describing the camera models commonly in use. A more thorough explanation can be found in Hartley and Zisserman [7].

A camera is a mapping from a three-dimensional space onto a two-dimensional image plane. Perspective, or central, projection is the process whereby a point in Euclidean 3-space is mapped to the intersection of the image plane with a line from the point to a central focal point, the camera centre. The most basic model of perspective projection is the pinhole camera model. The camera centre C is placed at the origin, with the principal axis of the camera pointing along the z axis. The point p where the principal axis intersects the image plane is called the principal point.

Given the orthogonal distance f, the focal length, from the camera centre to the image plane, the point $(X, Y, Z)^T$ is mapped to the point $(fX/Z, fY/Z)^T$. This mapping can be written as

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \mapsto \begin{bmatrix} fX/Z \\ fY/Z \end{bmatrix}. \quad (1)$$

In this mapping it is assumed that the origin of the image plane coincides with the principal point. In reality this may not be the case, and we therefore add an offset to the mapped image coordinates.


Figure 1: Camera models. (a) The linear mapping from world coordinates to the image plane of a camera. (b) Radial distortion. (c) Tangential distortion.

Using homogeneous coordinates we can express the projection in matrix form as

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \mapsto \begin{bmatrix} fX \\ fY \\ Z \end{bmatrix} = \begin{bmatrix} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. \quad (2)$$

Writing

$$K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix} \quad (3)$$

we get the compact form

$$x = K[I \,|\, 0] X_{cam}. \quad (4)$$

The matrix K is called the camera calibration matrix, and its non-zero elements are called the intrinsic camera orientation.

The object points in equation 4 are given in the camera coordinate frame with the camera centre at the origin. In general it is desirable to have both the camera and the object points placed in some world coordinate system. Given a rotation matrix R and the camera centre coordinates in the world frame C, we can express the transformation from the camera frame to the world frame as

$$X_{cam} = \begin{bmatrix} R & -RC \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} R & -RC \\ 0 & 1 \end{bmatrix} X. \quad (5)$$

We can replace the camera centre coordinates with a translation vector $t = -RC$, and combining equations 4 and 5 we get


$$x = K[R \,|\, t] X. \quad (6)$$

The rotation matrix R and the translation vector t are called the extrinsic camera orientation. The linear mapping from the world coordinate frame to image coordinates is shown in figure 1a.
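As a concrete illustration of equations 1–6, the following minimal sketch (not part of the original method description; all numeric values for f, the principal point, R and t are invented placeholders) projects a world point to pixel coordinates:

```cpp
#include <cstdio>

int main() {
    // Intrinsic parameters (equation 3): focal length f and principal point (px, py).
    const double f = 800.0, px = 320.0, py = 240.0;
    // Extrinsic orientation (equations 5-6): identity rotation, translation along z.
    const double R[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
    const double t[3] = {0.0, 0.0, 5.0};
    // A world point X.
    const double X[3] = {0.5, -0.2, 1.0};

    // Camera-frame coordinates: Xcam = R*X + t (equation 5).
    double Xc[3];
    for (int i = 0; i < 3; ++i)
        Xc[i] = R[i][0] * X[0] + R[i][1] * X[1] + R[i][2] * X[2] + t[i];

    // Perspective division and intrinsics (equations 1-4): x = K [I|0] Xcam.
    const double u = f * Xc[0] / Xc[2] + px;
    const double v = f * Xc[1] / Xc[2] + py;
    std::printf("pixel coordinates: (%.2f, %.2f)\n", u, v);
    return 0;
}
```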

In modern CCD cameras the pixels in the image sensor are usually not exactly square. This can be modelled in terms of pixels per unit distance in image coordinates, $m_x$ and $m_y$, as

$$K = \begin{bmatrix} a_x & 0 & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \quad (7)$$

where $a_x = f m_x$, $a_y = f m_y$, $x_0 = m_x p_x$ and $y_0 = m_y p_y$.

So far we have considered a linear mapping from object points to image points. This is a simplification that is sufficient for some applications; in reality, however, the mapping is non-linear. This is called distortion, and there are several ways to model it. The most commonly used model divides distortion into radial and tangential components.

When the focal length is small, the light rays passing through the lens are subjected to a radial distortion. The distortion is 0 at the (optical) image centre and increases toward the edges of the image. The effect on the image coordinates is shown in figure 1b. Undistorted image coordinates $(x_u, y_u)^T$ can be found with a Taylor series expansion, given the distorted coordinates $(x_d, y_d)^T$, shown here with three terms:

$$x_u = x_d(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad y_u = y_d(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \quad (8)$$

If the image sensor in the camera is not exactly parallel with the lens, there will be a tangential distortion of the image, as shown in figure 1c. Tangential distortion can be corrected with

$$x_u = x_d + (2 p_1 x_d y_d + p_2(r^2 + 2 x_d^2)), \qquad y_u = y_d + (p_1(r^2 + 2 y_d^2) + 2 p_2 x_d y_d) \quad (9)$$

Equations 8 and 9 are derived from Brown's models of distortion [3]. Two terms are usually sufficient for modelling radial distortion. For highly distorted fish-eye lenses the third term can be used. Tangential distortion is rarely a problem with modern CCD cameras and can usually be omitted from the camera model.
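To make the distortion model concrete, the following hedged sketch applies equations 8 (with two radial terms) and 9 to a single normalised point; the coefficient values are illustrative placeholders, not calibration results from this paper:

```cpp
#include <cstdio>

int main() {
    // Illustrative distortion coefficients (placeholders, not measured values).
    const double k1 = -0.24, k2 = 0.39;      // radial terms (two usually suffice)
    const double p1 = 3.5e-4, p2 = -2.5e-4;  // tangential terms
    const double xd = 0.31, yd = -0.12;      // normalised distorted coordinates
    const double r2 = xd * xd + yd * yd;

    // Radial component of equation 8, truncated after two terms.
    const double radial = 1.0 + k1 * r2 + k2 * r2 * r2;
    // Adding the tangential component of equation 9.
    const double xu = xd * radial + 2 * p1 * xd * yd + p2 * (r2 + 2 * xd * xd);
    const double yu = yd * radial + p1 * (r2 + 2 * yd * yd) + 2 * p2 * xd * yd;
    std::printf("undistorted: (%.5f, %.5f)\n", xu, yu);
    return 0;
}
```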

Camera Calibration Methods

The computer vision and photogrammetry communities have come up with a wide range of different methods for performing this estimation. In computer vision many approaches rely on using a calibration object with known control points. Both planar and special-purpose 3D objects have been used.

Tsai [17] presented a two-stage approach relying on n > 8 known feature points in each image. Some parameters are assumed to be constant and given by the manufacturer of the cameras. Tsai argues that tangential distortion is not required for machine vision applications and that including it would cause numerical instability; hence only radial distortion is considered, with one term. The first step of the approach is to compute the extrinsic orientation using given camera information. In step two, the intrinsic orientation


is refined using the results from step one. The method can be used both with planar grids of control points and with 3D calibration objects.

The method presented by Heikkila and Silven [8] is a four-step procedure. Firstly, a linear estimation of the camera parameters is performed, disregarding distortions. Secondly, a non-linear optimisation scheme is used to refine the initial estimates and compute radial and tangential distortions. Step three corrects asymmetric projections of shapes that cover more than one pixel. Finally, in step four, the undistorted image coordinates are computed. The method is demonstrated on a 3D calibration object with circular control points, but can also be used with planar grids.

Zhang’s [18] method is based on viewing a planar grid of control points from at leasttwo different orientations. An initial estimate of the camera parameters is obtained witha closed-form solution. Next, maximum likelihood estimation is used to refine the initialsolution. Radial distortion is modelled with two terms. Finally, non-linear optimisation isused to refine all parameters. This approach is quite similar to Trigg’s [16].

Sturm and Maybank [13] presented a method that can handle varying intrinsic parameters, but they disregard any lens distortions. Thus a set of linear equations can be solved based on one or more views of a planar calibration grid. The authors also identify a number of singularities where plane-based approaches in general will yield unreliable results.

A simple, linear approach was presented by Bakstein [1]. A combination of real calibration images and a virtual 3D calibration object was used to improve the initial estimates of the calibration parameters. Personnaz and Sturm's [10] approach calibrated a stereo vision system based on the motion of a special-purpose 3D calibration object. A non-linear optimisation scheme is used to estimate the calibration parameters. Strobl and Hirzinger [12] presented a calibration method targeted at hand-eye calibration in robotics, but the algorithms can also be used for general camera calibration. Their approach includes a parameterisation of the calibration pattern that compensates for inaccuracies in its measurement.

A comparison of four different linear algorithms for coplanar camera calibration was presented by Chatterjee and Roychowdhury [5]. They also presented a novel non-linear method specifically tailored for calibration with coplanar control points, using constrained optimisation. Results were compared with a photogrammetric calibration method, and the constrained non-linear algorithm was found to be on par with the photogrammetric method in terms of accuracy. Sun and Cooperstock [14] presented another comparison of methods, notably those of Tsai and Zhang, and studied the difference in accuracy of a casual versus an elaborate calibration setup.

A separate class of methods, termed self-calibration in the computer vision literature, does not rely on any known calibration object, but rather applies a number of constraints to infer calibration parameters from an unknown scene. Three types of constraints are used: scene constraints, camera motion constraints and constraints on the intrinsic orientation of the camera. A review of self-calibration methods is given by Hemayed [9].

Svoboda et al. [15] presented a self-calibration method designed for large camera arrays. They used a laser pointer to produce easily detected corresponding points in all images. Geometric constraints were imposed on the recovered control points. Their algorithm first produced a linear estimate of the calibration parameters, and this was used as input for a post-processing step that determined the non-linear distortion parameters.

Close-range photogrammetry is a field of research that has matured for over 150 years, and accurate camera calibration has been an important aspect of photogrammetric


applications for much of that period. Clarke and Fryer [6] give an overview of the main contributions from the last fifty years. The predominant method for photogrammetric camera calibration is the bundle adjustment approach [4]. This technique produces a simultaneous determination of all intrinsic and extrinsic parameters. Bundle adjustment can be used both with known control points and for self-calibration.

Remondino and Fraser [11] presented a comparison of different calibration methods for digital cameras, both from the computer vision and photogrammetric points of view. They compared experimental results of six implementations, with bundle adjustment methods yielding superior results.

3 Existing Implementations

Several implementations of the calibration methods presented in section 2 are freely available on the Internet. We are interested in determining how easily eleven freely available calibration tools can produce results for a typical computer vision camera calibration task, and we have compared their feature sets with this in mind. For our review of existing calibration applications we have formulated a feature oriented framework of comparison, consisting of the following categories: calibration method, calibration pattern, automatic pattern detection, rectification of distorted images, extrinsic calibration, minimum number of cameras, live capture, platform requirements, and latest update.

TsaiCode¹: This is the reference implementation of Tsai's [17] calibration method described in section 2. It is the oldest of the implementations in this comparison, the latest version being over a decade old. It has only a rudimentary command line interface, and it requires that the control points have been extracted in advance.

Microsoft Easy Camera Calibration Tool²: The reference implementation of Zhang's [18] method is also quite dated. As with TsaiCode, it requires the coplanar control points to be extracted in advance.

Matlab Camera Calibration Toolbox³: This toolbox is perhaps the most widely used of the freely available calibration solutions. The implementation is inspired by Zhang's [18] method, but uses Heikkila and Silven's [8] model of distortion. The toolbox has many useful features, such as the ability to use an earlier calibration result as the initial values for a new run of the algorithm, advanced error analysis and the possibility to undistort images; however, it lacks automatic extraction of control points from the images.

Camera calibration toolbox for Matlab (Heikkila)⁴: The reference implementation of Heikkila and Silven's [8] method is also a Matlab toolbox; however, it lacks the GUI features of the previously described method. Image coordinates of the control points must be extracted in advance.

tclcalib⁵: This implementation is a collection of small stand-alone programs that together make Heikkila's method easier to use. This includes automatic extraction of control

1. http://www.cs.cmu.edu/~rgw/TsaiCode.html (last checked 09-07-2009)
2. http://research.microsoft.com/en-us/um/people/zhang/calib/ (last checked 09-07-2009)
3. http://www.vision.caltech.edu/bouguetj/calib_doc/ (last checked 09-07-2009)
4. http://www.ee.oulu.fi/~jth/calibr/ (last checked 09-07-2009)
5. http://users.soe.ucsc.edu/~davis/projects/tclcalib/ (last checked 09-07-2009)


points, extrinsic calibration given an existing intrinsic calibration, and a GUI to tie it all together.

Camera calibration toolbox for Matlab (Bakstein)⁶: This Matlab toolbox implements Bakstein's [1] calibration method. The tool is quite simple, and control points must be extracted in advance.

BlueCCal⁷: This is a Matlab toolbox implementing Svoboda et al.'s [15] method. It does not have a GUI, but has automatic control point extraction. One drawback is that it requires at least three cameras for the calibration to work. It also does not place the origin of the world coordinate system in a known location, but rather at the centre of the extracted point cloud.

GML calibration tools⁸: This calibration tool is a stand-alone application that mimics the functionality of Bouguet's Matlab toolbox. The calibration routines used are from the OpenCV library [2], which is a reimplementation of Bouguet's Matlab code. GML sports automatic control point extraction and correction of distorted images. The program requires a .Net runtime.

Camera Calibration Tools⁹: This is another stand-alone application inspired by Bouguet's Matlab toolbox that relies on OpenCV's [2] calibration routines. It has relatively limited functionality, as it only supports intrinsic calibration, but it has one interesting feature: the ability to capture images live within the program.

Tele2¹⁰: This stand-alone application implements Personnaz and Sturm's [10] method. Tele2 requires a special-purpose calibration object that complicates its use. It is designed for stereo calibration, but also works with single cameras. One advantage of this application is that it is implemented in Java, hence it is platform independent.

CalDe and CalLab¹¹: These two applications implement Strobl and Hirzinger's [12] calibration method. This implementation supports calibration of a single camera or a stereo pair of cameras, offers several estimation methods and has advanced features for analysing the calibration results. The program requires an IDL runtime.

The findings of our review are summed up in table 1. As can be seen, some of the offerings have not been updated for quite some time, making them cumbersome to use with current versions of their respective platforms. About half of the implementations require Matlab, an expensive software package for scientific computation. While this might not necessarily be a problem, it can be an issue if funds are limited. Some of the newer tools offer automatic extraction of control points. This saves a lot of time in generating input for the calibration algorithm; however, it must be robust to have a meaningful impact on the time used. The point detection algorithms are unable to find the correct pattern in some image configurations. With fewer control points the accuracy of the calibration results is potentially reduced, and the user is forced to go back and acquire more images to compensate.

6. http://terezka.ufa.cas.cz/hynek/toolbox.html (last checked 09-07-2009)
7. http://cmp.felk.cvut.cz/~svoboda/SelfCal/index.html (last checked 09-07-2009)
8. http://research.graphicon.ru/calibration/2.html (last checked 09-07-2009)
9. http://www.doc.ic.ac.uk/~dvs/calib/main.html (last checked 09-07-2009)
10. http://perception.inrialpes.fr/Soft/calibration/index.html (last checked 09-07-2009)
11. http://www.dlr.de/rm/desktopdefault.aspx/tabid-4853/ (last checked 09-07-2009)


| Tool | Calibration method | Calibration pattern | Automatic pattern detection | Undistort images | Extrinsic calibration | Minimum number of cameras | Live capture | Platform requirements | Last updated |
|---|---|---|---|---|---|---|---|---|---|
| TsaiCode | Tsai [17] | Known control points (2D or 3D) | No | No | Yes | 1 | No | Unix/Dos | 28-10-1995 |
| Microsoft Easy Camera Calibration Tool | Zhang [18] | Coplanar control points | No | No | Yes | 1 | No | Windows | 04-06-2001 |
| Matlab Camera Calibration Toolbox (Bouguet) | Zhang [18] | Checkerboard | No | Yes | Yes | 1 | No | Matlab | 02-06-2008 |
| Camera calibration toolbox for Matlab (Heikkila) | Heikkila and Silven [8] | Grid of circular control points | No | No | Yes | 1 | No | Matlab | 17-10-2000 |
| tclcalib | Heikkila and Silven [8] | Grid of circular control points | Yes | No | Yes | 1 | No | Irix/Windows | 14-08-2002 |
| Camera calibration toolbox for Matlab (Bakstein) | Bakstein [1] | Line grid | No | No | Yes | 1 | No | Matlab | 10-06-1999 |
| BlueCCal | Svoboda et al. [15] | Laser pointer | Yes | No | Yes, but not relative to a known coordinate frame | 3 | No | Matlab | 24-05-2005 |
| GML C++ Camera Calibration Toolbox | Zhang [18] | Checkerboard | Yes | Yes | No | 1 | No | .Net 1.1 | 06-02-2006 |
| Camera Calibration Tools | Zhang [18] | Checkerboard | No | Yes | No | 1 | Yes | Windows | 16-02-2007 |
| Tele2 | Personnaz and Sturm [10] | Special-purpose calibration object | Yes | No | Stereo | 1 | Yes | Java | 20-03-2002 |
| CalDe and CalLab | Strobl and Hirzinger [12] | Checkerboard with special pattern | Yes | No | Yes | 1 | No | IDL | 30-01-2008 |

Table 1: Comparison chart of freely available camera calibration tools.


Live capture combined with automatic control point extraction gives the user immediate feedback on the quality of a given calibration image, but only two of the studied solutions have this feature. Tele2 requires a special-purpose 3D calibration object, which makes the method difficult to use, while Camera Calibration Tools has very limited features other than live capture.

4 Calvin – A Semi-automatic Camera Calibration Tool Using Coplanar Control Points

As was mentioned in section 3, there are a lot of existing implementations with many useful features freely available, but none that include all the features necessary for semi-automatic camera calibration. By semi-automatic calibration we mean an application that can capture a set of images of a calibration object and estimate the camera's intrinsic, extrinsic and distortion parameters without any further manual processing. We have developed Calvin, a new stand-alone camera calibration application that includes all the features in table 1, and is capable of semi-automatic calibration with a planar calibration object.

Calvin is built around the camera calibration routines in OpenCV [2]. Calibration in OpenCV is based on a combination of Zhang's method [18] and the distortion terms from Brown [3]. The Qt library is used to create a simple and user-friendly GUI. Both libraries are open-source and platform independent, so Calvin can be built for a multitude of operating systems.
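For illustration, a minimal sketch of the OpenCV calibration path that Calvin builds on might look as follows; the file names, board dimensions and square size are placeholder assumptions (this is not Calvin's source code), and error handling is kept minimal:

```cpp
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    const cv::Size board(8, 6);          // inner corner count of the checkerboard
    const float square = 30.f;           // square size in mm (placeholder)
    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<std::vector<cv::Point3f>> objectPoints;
    cv::Size imageSize;

    for (int i = 0; i < 10; ++i) {       // ten or more views, as recommended
        cv::Mat img = cv::imread(cv::format("calib_%02d.png", i));
        if (img.empty()) continue;
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        if (!cv::findChessboardCorners(img, board, corners)) continue;
        imagePoints.push_back(corners);
        std::vector<cv::Point3f> obj;    // coplanar control points, Z = 0
        for (int y = 0; y < board.height; ++y)
            for (int x = 0; x < board.width; ++x)
                obj.emplace_back(x * square, y * square, 0.f);
        objectPoints.push_back(obj);
    }

    cv::Mat K, dist;                     // intrinsic matrix and distortion terms
    std::vector<cv::Mat> rvecs, tvecs;   // per-view extrinsic orientation
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, dist, rvecs, tvecs);
    std::printf("reprojection RMS error: %.4f\n", rms);
    return 0;
}
```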

Intrinsic and extrinsic calibration is done in two separate steps. The main window of the application, shown in figure 2a, is organised in tabs to reflect this functional subdivision. A list of available checkerboard patterns can be maintained for use in different situations, for instance a smaller hand-held pattern for intrinsic calibration, and a larger pattern placed on the floor for extrinsic calibration.

For a given camera, the user can choose to load a set of calibration images from disk, or use live capture to acquire the images. When using live capture the automatically detected control points are highlighted, thus ensuring that the images are usable in the calibration steps. Currently, only ordinary USB cameras are supported for live capture, but support for other types of cameras will be added in the future. Figure 2b shows live capture of a checkerboard pattern with detected control points overlaid. Once an adequate set of images (ten or more) has been loaded or acquired, the intrinsic calibration can be performed.

Given the intrinsic camera parameters, extrinsic calibration can be performed using a single image loaded from disk or captured live. Cameras are handled independently by the calibration routines, so there are no lower or upper limits on the number of cameras. Provided that images of the same calibration pattern can be acquired by a set of cameras, the extrinsic parameters for those cameras can be combined to place them in the same world coordinate frame. The placement of the cameras can be reviewed in a 3D view. Figures 2c and 2d show a laboratory camera setup and the results of its calibration.
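A hedged sketch of this single-image extrinsic step, assuming the intrinsics and the detected pattern points are already available (the function and variable names are our own, not Calvin's):

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Recover one camera's extrinsic orientation from a single view of a known
// planar pattern, given the intrinsic matrix K and the distortion terms.
void extrinsicFromPattern(const std::vector<cv::Point3f>& boardPoints,
                          const std::vector<cv::Point2f>& corners,
                          const cv::Mat& K, const cv::Mat& dist) {
    cv::Mat rvec, tvec;
    cv::solvePnP(boardPoints, corners, K, dist, rvec, tvec);
    cv::Mat R;
    cv::Rodrigues(rvec, R);        // rotation vector -> 3x3 rotation matrix
    cv::Mat C = -R.t() * tvec;     // camera centre in the world frame (t = -RC)
    std::cout << "camera centre in world frame:\n" << C << std::endl;
}
```

Combining the per-camera R and t recovered this way against the same floor pattern is what places all cameras in one world coordinate frame.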

Additionally, Calvin can undistort severely distorted images. Provided that the intrinsic calibration has been done, images from, for instance, a camera with a fish-eye lens can be rectified.


Figure 2: The main functionality of the Calvin calibration tool. (a) Main window with the list of calibration images. (b) Live capture with automatic pattern detection. (c) Laboratory environment with three cameras. (d) 3D view of the calibrated cameras in the laboratory environment.

5 Experiments

We wanted to determine if Calvin's semi-automatic operation affects the accuracy of the calibration results. Since both Calvin and Bouguet's calibration toolbox are based on the same calibration method, we compared their respective output with the same input images. We captured ten image sequences, each consisting of 25 images. Each sequence contained images of the checkerboard with 90 degree rotations to avoid singularities. The camera used was an AVT Marlin with a 6.5 mm fixed focal length lens.

Bouguet's toolbox allows the checkerboard corners to be recomputed using the previous calibration result as initial values in a new run of the estimation algorithm. Hence, we tried single and five consecutive runs to see if this greatly affects accuracy. The results can be seen in table 2, and they show little difference in accuracy between the three trials. The estimates are consistent, and the standard deviations indicate that they are reproducible. It should be noted that processing the 250 images using Calvin took about ten minutes, while several hours were required to do the manual corner extraction using Bouguet's toolbox.


| Parameter | Calvin | Bouguet's toolbox (1 run) | Bouguet's toolbox (5 runs) |
|---|---|---|---|
| ax | 1424.32220 (2.73598) | 1424.35075 (2.91961) | 1425.20787 (2.04359) |
| ay | 1424.05504 (2.72395) | 1424.08673 (2.94896) | 1425.00161 (2.38663) |
| x0 | 302.71328 (3.66743) | 303.20818 (3.97749) | 303.81051 (4.22332) |
| y0 | 240.94768 (2.88067) | 240.38488 (2.86339) | 239.56113 (2.65305) |
| k1 | -0.24282 (0.01242) | -0.24323 (0.01308) | -0.24823 (0.01137) |
| k2 | 0.39128 (0.25483) | 0.40439 (0.26352) | 0.51467 (0.26808) |
| p1 | 0.00035 (0.00036) | 0.00031 (0.00036) | 0.00029 (0.00033) |
| p2 | -0.00025 (0.00056) | -0.00017 (0.00057) | -0.00021 (0.00054) |

Table 2: Comparison of intrinsic calibration with Calvin and Bouguet's Matlab Calibration Toolbox. The results are based on ten image sets, each consisting of 25 images. Mean values for all intrinsic parameters are given, with standard deviations in parentheses.

6 Conclusion

We have presented a new semi-automatic camera calibration tool based on coplanar control points in the form of a checkerboard pattern. The stand-alone application, Calvin, is platform independent and automatically extracts control points from captured images. The results of our experiments show that Calvin's accuracy is on par with one of the most widely used calibration tools in computer vision, and it should be usable in many of the same application areas.

Using live capture to acquire the calibration images improves Calvin's robustness, but extrinsic calibration with automatic detection of a checkerboard pattern can still yield poor results if the angle between the principal axis of the camera and the control point plane is too acute. We plan to complement the automatic pattern detection algorithm with the ability to manually select control points for extrinsic calibration.

Compared to some of the other calibration tools we have studied, the functionality for error analysis in Calvin is fairly rudimentary. We plan to add more advanced error analysis tools in the future.

References

[1] H. Bakstein. A complete DLT-based camera calibration with virtual 3D calibration object. Diploma thesis, 1999.

[2] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly, 2008.

[3] D. Brown. Close-range camera calibration. Photogrammetric Engineering, 1971.

[4] D. Brown. The bundle adjustment—progress and prospects. International Archives of Photogrammetry, 1976.

[5] C. Chatterjee and V. Roychowdhury. Algorithms for coplanar camera calibration. Machine Vision and Applications, 2000.


[6] T. Clarke and J. Fryer. The development of camera calibration methods and models. Photogrammetric Record, 1998.

[7] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[8] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction. IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[9] E. Hemayed. A survey of camera self-calibration. IEEE Conference on Advanced Video and Signal Based Surveillance, 2003.

[10] M. Personnaz and P. Sturm. Calibration of a stereo-vision system by the non-linear optimization of the motion of a calibration object. INRIA Technical Report, 2002.

[11] F. Remondino and C. Fraser. Digital camera calibration methods: considerations and comparisons. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 2006.

[12] K. Strobl and G. Hirzinger. More accurate camera and hand-eye calibrations with unknown grid pattern dimensions. IEEE International Conference on Robotics and Automation, 2008.

[13] P. Sturm and S. Maybank. On plane-based camera calibration: A general algorithm, singularities, applications. IEEE Conference on Computer Vision and Pattern Recognition, 1999.

[14] W. Sun and J. Cooperstock. Requirements for camera calibration: Must accuracy come with a high price? Seventh IEEE Workshop on Application of Computer Vision, 2005.

[15] T. Svoboda, D. Martinec, and T. Pajdla. A convenient multicamera self-calibration for virtual environments. Presence: Teleoperators & Virtual Environments, 2005.

[16] B. Triggs. Autocalibration from planar scenes. Lecture Notes in Computer Science, 1998.

[17] R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 1987.

[18] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. IEEE International Conference on Computer Vision, 1999.


Paper 2
Background subtraction using GPGPU

Authors: Eirik Fauske, Lars M. Eliassen and Rune H. Bakken.

Full title: A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA.

Published in: Proceedings of the First Norwegian Artificial Intelligence Symposium (NAIS 2009), A. Kofod-Petersen, H. Langseth, O. E. Gundersen eds., 2009, Tapir Academic Press.

Copyright: © 2009 NAIS and Tapir Academic Press.


A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA

Eirik Fauske, Lars Moland Eliassen, Rune Havnung Bakken
Sør-Trøndelag University College

Abstract

This paper describes three background subtraction techniques and presents our implementation of them using the CUDA technology for parallel processing. Background subtraction is only one step in developing systems for visual surveillance, visual hull or other high-level computer vision systems. In this paper, we define real-time systems as systems that can perform at 30 frames per second. In order to achieve real-time performance for such high-level systems, background subtraction needs to be performed in much less than the allotted 33 ms per frame. All CUDA implementations presented in this paper have run-times below 10 ms per frame. In fact, our implementations achieve up to 96x speed-up compared to our serial implementations of the algorithms. We also compare segmentation performance for the different approaches.

1 Introduction

Subtracting the background from a video stream is an important step in many visual systems, both simple and complex. The idea is to segment the incoming image into background and foreground components, and use the components of interest in further processing, e.g. in shape-from-silhouette [3] calculations. The formula in [6] describes the simplest of background subtraction techniques, using naive frame difference as the method of classification: if $|frame_i - background_i| > threshold$, then the pixel i is foreground. Although such techniques are very fast, the segmentation performance can be quite poor, especially with fluctuating illumination conditions. It is crucial to achieve an accurate component classification in order to secure the quality of further processing steps. In order to manage changes in illumination, more complex background subtraction algorithms have been developed. Many of these build up a statistical model of the background based on a given number of frames that depict the background, and then compare the data of incoming frames to this model. Piccardi [12] has made a survey comparing different background subtraction methods.
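A minimal CUDA sketch of this naive frame-difference classifier (not code from the compared implementations; 8-bit grayscale input and the kernel and parameter names are our own assumptions) could look like this:

```cpp
// One thread per pixel: pixel i is foreground if
// |frame_i - background_i| > threshold.
__global__ void frameDifference(const unsigned char* frame,
                                const unsigned char* background,
                                unsigned char* mask,
                                int numPixels, int threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;
    int diff = frame[i] - background[i];
    if (diff < 0) diff = -diff;              // |frame_i - background_i|
    mask[i] = (diff > threshold) ? 255 : 0;  // 255 marks foreground
}

// Host-side launch for one 640x480 grayscale frame already in GPU memory:
//   int n = 640 * 480;
//   frameDifference<<<(n + 255) / 256, 256>>>(dFrame, dBackground, dMask, n, 30);
```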

In real-time applications where background subtraction is only a single step, processing time is important. We define real-time performance as the processing of 30 frames per second. This means that the run-time of background subtraction has to be far below 33 ms per frame, in order to leave run-time for further processing and still maintain real-time performance. None of the approaches described in this paper achieve real-time processing in

This paper was presented at NAIS-2009; see http://events.idi.ntnu.no/nais2009/.


our serial implementation with 640×480 images. Of the three, our implementation of Codebook [9] performs the best with 58.6 ms per frame, running on an Intel Core i7 at 2.67 GHz.

An available solution to reduce run-time is parallelisation, making it possible to process several pixels simultaneously. Most modern personal computers have massive parallel processing power in the form of Graphics Processing Units, or GPUs. The GPUs are first and foremost made for graphically intense games, but since 2002 General Purpose computation on Graphics Processing Units (GPGPU¹) has grown and become more popular. Nvidia's CUDA² technology is an example of such an available technology, and it enables developers to easily utilise the power provided by mainstream GPUs. Nvidia's GeForce GTX 295 has 480 stream processing units spread across 2 GPUs. When used in image processing, the card can perform calculations on 480 pixels simultaneously, assuming that both GPUs are fully utilised. The advantages over CPU processing are obvious: if we can divide the process into small steps that can be executed on the GPU independently, the speed-up will be substantial.

This paper briefly summarises the three background subtraction techniques in section 2. We clarify which parts of the techniques were suitable for parallelisation and describe how we compared them in section 3. In section 4, we present and compare results, both in run-time and segmentation performance.

2 Related work

In this section we mention some existing parallel implementations of background subtraction algorithms, and summarise the three methods we have implemented in CUDA.

Parallel processing

Implementing background subtraction algorithms in parallel has been done by Peter Carr [1] with successful results; the gain in speed was ~5x from serial to parallel. The background subtraction algorithm used in his approach is Stauffer and Grimson's algorithm [13], which is a Gaussian mixture model. In Carr's approach, the framework used is Apple's Core Image, which generates commands using a subset of the OpenGL Shading Language (GLSL) to perform operations on the GPU. He uses an Intel Core 2 Duo 2.2 GHz processor and an NVIDIA 8600M GT graphics card.

Andreas Griesser et al. [7] present a "GPU-based foreground-background segmentation that processes image sequences in less than 4ms per frame." Their approach is based on a colour similarity test in a small pixel neighbourhood. Their hardware is a Nvidia GeForce 6800GT, and the image resolution used is 640×480. However, they do not state the run-time on the CPU, so we cannot compare speed-up results for their approach.

Gong and Chen's [5] approach uses an online learning method adopted from [2], and global optimisation using graph cuts. Their parallel implementation achieves 16 FPS at 320×240 resolution on an Intel Centrino T2500 CPU and an ATI Mobility Radeon X1400 GPU. They have no serial implementation for performance comparison.

Fukui et al. [4] propose a GPU based algorithm which makes use of the CIELAB colour space to handle shadows, and conversion tables to deal with sudden and gradual intensity changes. The experimental results of Fukui et al. show a remarkable speed-up. Computer specifications are an Intel Core 2 Duo E6600 at 2.4 GHz, 1 GB RAM and a Nvidia GeForce

1. http://gpgpu.org/
2. http://www.nvidia.com/object/cuda_home.html


8800GTX GPU. For 1440×1080 images the run-time on the CPU was 864 ms per frame, whilst the GPU used only 0.32 ms per frame. That is a speed-up of 2700x. However, these results assume that the image data is present in the GPU memory at the time of execution. The results we present in this paper include the transfer of data from computer memory to GPU memory.

Background subtraction algorithms

Background subtraction techniques have been a widely researched area in computer vision over the last decade because background subtraction is an integral part of many different higher-level computer vision tasks. The different techniques are often developed for specific contexts, such as indoor scenes with static backgrounds, or outdoor scenes with dynamic backgrounds and illumination changes. In this paper, we have concentrated on three algorithms which claim to be robust against shadows in indoor scenes. Below we present these approaches and their key features.

Codebook

The Codebook [9] algorithm creates a background model consisting of a codebook for each pixel. These codebooks are built from codewords, and each codeword contains some key values for a pixel. During background learning, codewords are added to the codebook for the i'th pixel if no codewords matching the i'th pixel's current values are found in the codebook. Thus, the background model becomes robust to changes in lighting conditions. The key values in a codeword are:

• R, G, B - the mean values of R, G and B

• Imin, Imax - the minimum and maximum brightness

• f - the frequency of occurrence

• l - the longest interval between two occurrences

• p, q - the first and last access times

As it learns the background over N frames, the background model is extended according to colour changes in the pixel. If the value of a given frame's pixel is outside the boundaries of the existing codewords, a new codeword is created for this pixel. This gives a background model that is robust to dynamic change, such as moving leaves, rain, illumination variations and other repetitive background changes. An incoming pixel is classified as foreground if there exists no codeword matching the value of the pixel.
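As an illustration, the following sketch shows a possible codeword record and matching test (in a CUDA port this would be a __device__ function); it follows Kim et al.'s codebook formulation only approximately, and the parameter names eps, alpha and beta are our own:

```cpp
#include <algorithm>
#include <cmath>

struct Codeword {
    float R, G, B;      // mean colour of the codeword
    float Imin, Imax;   // brightness bounds seen during training
    int   f, l, p, q;   // frequency, longest gap, first/last access times
};

bool matches(const Codeword& cw, float r, float g, float b,
             float eps, float alpha, float beta) {
    // Colour test: squared distance from the codeword's chromaticity line.
    float x2  = r * r + g * g + b * b;
    float v2  = cw.R * cw.R + cw.G * cw.G + cw.B * cw.B;
    float dot = r * cw.R + g * cw.G + b * cw.B;
    float p2  = (v2 > 0.f) ? (dot * dot) / v2 : 0.f;
    float colordist2 = x2 - p2;
    // Brightness test: brightness must fall inside a scaled interval
    // derived from [Imin, Imax].
    float I     = std::sqrt(x2);
    float Ilow  = alpha * cw.Imax;
    float Ihigh = std::min(beta * cw.Imax, cw.Imin / alpha);
    return colordist2 <= eps * eps && I >= Ilow && I <= Ihigh;
}
```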

Self-Organising Background Subtraction - SOBS

Maddalena and Petrosino [10, 11] make use of an artificial neural network organised as a 2D flat grid of neurons. Each pixel has n×n weight vectors in the neural map, as seen in figure 1. We use n = 3, so that each pixel has 9 corresponding weight vectors in the background model. They have experimented with higher values of n, observing almost constant values of accuracy but rapidly growing run-times. Each weight vector in the background model is defined as:

$c_i = [H, S, V]$ where $c_i \in \{c_1, c_2, \ldots, c_{n^2}\}$, $H \in [0, 360]$ and $V, S \in [0, 1]$


Figure 1: Neural network. (a) Pixels in a 3×3 image. (b) The 3×3 weight vectors in the neural network for the image in (a).

The idea is that each pixel runs through its 9 vectors in the model and selects the best matching vector, i.e. the one with minimum distance. If such a best match is found, all values in the n×n neighbourhood of the chosen vector in the background model are updated according to a selective weighted running average. The weights are:

$$\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

which corresponds well to the lateral inhibition activity of neurons. By making use of this neural network, a spatial relationship between pixels is obtained. The shadow detection in [10, 11] makes use of the fact that shadowed areas have similar colour, but great variations in illumination.

First, the method takes one frame and initialises all 9 values in the background model to the value of the incoming pixel. Then, N frames are used in the background learning phase in order to make the model as robust as possible. Finally, the online phase starts and the pixels of the incoming frames are classified. A pixel is foreground if there exists no best matching weight vector in the background model.
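The neighbourhood update can be sketched as follows; the model layout, function name and learning rate parameter are assumptions for illustration, not the exact SOBS implementation, and hue wrap-around is ignored for brevity:

```cpp
struct HSV { float h, s, v; };

// When a best match is found at model grid position (mx, my), it and its
// 3x3 neighbourhood move toward the incoming pixel value, scaled by the
// kernel of weights shown above.
void updateNeighbourhood(HSV* model, int mx, int my,
                         int modelWidth, int modelHeight,
                         const HSV& pixel, float learningRate) {
    const float w[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int x = mx + dx, y = my + dy;
            if (x < 0 || y < 0 || x >= modelWidth || y >= modelHeight)
                continue;                    // stay inside the model grid
            float a = learningRate * w[dy + 1][dx + 1] / 4.0f;
            HSV& c = model[y * modelWidth + x];
            c.h += a * (pixel.h - c.h);      // selective weighted running
            c.s += a * (pixel.s - c.s);      // average toward the observed
            c.v += a * (pixel.v - c.v);      // value
        }
}
```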

Horprasert et al.'s statistical approach

In the approach proposed by Horprasert et al. [8], a colour model is presented that separates the brightness component from the chromaticity component, which makes it easier to detect shadowed and highlighted areas. Their approach makes use of a statistical analysis of N background frames to find the i'th pixel's expected value Ei = [μRi, μGi, μBi]. The expected value for the i'th pixel produces the point Ei in the 3D model displayed in Figure 2. In the classification phase, the incoming pixel to be classified produces the point Ii.

Figure 2: Horprasert et al.'s proposed RGB colour model

The line OE drawn from the origin to Ei represents the chromaticity line. The brightness distortion αi is a scalar value that scales the point along OE to where the orthogonal line from Ii intersects OE. The chromaticity distortion CDi is the orthogonal distance between the observed colour and the line OE. α and CD values are calculated for each frame in the background training process, and then normalised in order to find a single threshold for all the pixels in the picture. The approach builds histograms of the normalised α̂ and ĈD values and takes a detection rate as input in order to automatically select suitable thresholds. In the segmentation phase, the incoming pixels are used to calculate α̂ and ĈD values which are compared to those of the background model. The classification of the i'th pixel is as follows:

• Original background if both α̂i and ĈDi are within the thresholds of the background model.

• Shaded background or shadow if the chromaticity ĈDi is within the threshold, but the brightness α̂i is below it.

• Highlighted background if the chromaticity ĈDi is within the threshold, but the brightness α̂i is above it.

• Moving foreground object if the chromaticity ĈDi is outside the threshold.
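The distortion computation itself is compact enough to show in full. The sketch below follows our reading of [8]: mu is the per-pixel expected colour Ei and sig the per-channel standard deviations from the N training frames; the names are ours.

    // Brightness distortion alpha and chromaticity distortion CD for one
    // pixel, following our reading of [8]. I is the observed colour.
    __device__ float2 distortions(float3 I, float3 mu, float3 sig)
    {
        // alpha minimises the sigma-normalised distance between I and
        // a * E_i, i.e. the projection of I onto the chromaticity line OE.
        float num = I.x * mu.x / (sig.x * sig.x)
                  + I.y * mu.y / (sig.y * sig.y)
                  + I.z * mu.z / (sig.z * sig.z);
        float den = mu.x * mu.x / (sig.x * sig.x)
                  + mu.y * mu.y / (sig.y * sig.y)
                  + mu.z * mu.z / (sig.z * sig.z);
        float a = num / den;
        // CD is the remaining (orthogonal) distance from I to the line OE.
        float dr = (I.x - a * mu.x) / sig.x;
        float dg = (I.y - a * mu.y) / sig.y;
        float db = (I.z - a * mu.z) / sig.z;
        return make_float2(a, sqrtf(dr * dr + dg * dg + db * db));
    }

The normalised values α̂ and ĈD are then obtained by scaling these distortions against their spread over the training frames, before the threshold tests above are applied.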

3 Method of Comparison

The open source computer vision library OpenCV³, initially developed by Intel, has proven to be a good framework for communicating with the camera, and enabled us to focus entirely on the background subtraction algorithms. In our Codebook implementation we chose to limit the parallelisation to the segmentation part. We used a serial implementation for the background learning, since it is not run-time critical. Our implementation of Horprasert et al.'s algorithm is parallelised for all tasks, except for the building of histograms and the detection of thresholds. Again, these are tasks in the background learning process and are not run-time critical. The entire SOBS algorithm was suitable for parallelisation.

Segmentation Performance

In order to compare the results of the three methods, we captured two test sequences. The sequences were captured with a Logitech Sphere AF webcam at a resolution of 640×480 pixels and 30 frames per second, using the OpenCV framework. All automatic image adjustment settings were deactivated in order to test the approaches' robustness to illumination changes. For the first sequence we used an office environment with a complex background, while the other sequence depicts a uniform background, consisting of a well lit white wall and a small table to the right. Both sequences consist of 200 frames for learning the background, and 500 frames for the background subtraction. By using pre-recorded sequences to test the methods, the comparison is as accurate as possible, because we make sure that the exact same frames are processed by all three methods.

³ http://opencv.willowgarage.com/wiki/

From each sequence we chose four images with different scene compositions, in order to test the methods' robustness. To compare the methods' segmentation performance, we needed a ground truth with which to compare the segmented images. This is the reason why we only chose four images: the process of making the ground truth was too time-consuming to repeat for all 500 frames. The images were chosen deliberately so that different scene compositions were represented, including shadows and one or more silhouettes. For each method and environment we optimised the threshold values to find those which yielded the highest average score for the four images. We chose not to optimise threshold values for each image, as none of these methods use adaptive thresholding. The three values shown in the tables are defined as in [10]:

• Recall is the detection rate, and gives the percentage of detected true positives compared to the total number of true positives in the ground truth:

  Recall = tp / (tp + fn), where tp = true positives and fn = false negatives.

• Precision is the positive prediction, and gives the percentage of detected true positives compared to the total number of items detected by the method:

  Precision = tp / (tp + fp), where tp = true positives and fp = false positives.

• Overall is the weighted harmonic mean of Precision and Recall, and acts as a single measurement for comparing different methods:

  Overall = (2 · Recall · Precision) / (Recall + Precision)

True positives are pixels that are correctly detected as foreground. False positives are pixels incorrectly detected as foreground. False negatives are pixels incorrectly detected as background. We compared the images on a pixel-by-pixel basis, and classified the results as described above.
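A sketch of this pixel-by-pixel comparison is shown below; it is our own illustration of how the three scores are computed from a result mask and a ground truth mask.

    // Host-side computation of Recall, Precision and Overall from two
    // binary masks of n pixels (non-zero = foreground). Our own code.
    #include <cstddef>
    #include <cstdint>

    struct Scores { double recall, precision, overall; };

    Scores compare(const uint8_t *result, const uint8_t *truth, size_t n)
    {
        double tp = 0, fp = 0, fn = 0;
        for (size_t i = 0; i < n; ++i) {
            if (result[i] && truth[i])       tp += 1;  // correct foreground
            else if (result[i] && !truth[i]) fp += 1;  // false foreground
            else if (!result[i] && truth[i]) fn += 1;  // missed foreground
        }
        Scores s;
        s.recall    = tp / (tp + fn);
        s.precision = tp / (tp + fp);
        s.overall   = 2.0 * s.recall * s.precision / (s.recall + s.precision);
        return s;
    }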

Processing Time

Our measurement of processing time starts from the point in time where the image is stored in computer memory, and stops after the calculations have been performed on the GPU. We chose not to time the transfer of image data from the camera to memory, or the transfer of data from the GPU back to host memory, as the aim of this paper is to produce a subtracted frame for use in further processing on the GPU. Moreover, our camera is only capable of delivering images at 30 frames per second, resulting in a lower run-time limit of 33 ms per frame. In the case of shape-from-silhouette applications, several cameras will feed the software with frames for subtraction, hence the algorithms need to perform faster than 33 ms per frame. The following is the pseudocode for our timing measurement:

Algorithm 1 Pseudocode for our timing measurement
  image ← captureImage()
  startTimer()
  for i = 0 to 1000 do
    copyToGPUDevice(image)
    performSubtractionOnDevice()
  end for
  stopTimer()
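In the CUDA implementations a measurement of this kind maps naturally onto CUDA events, as sketched below; the kernel name and buffers are placeholders, not the exact code used for the experiments.

    // Illustrative CUDA event timing for the measurement loop above.
    // subtractionKernel, d_image, grid and block are placeholders.
    float timeSubtraction(const unsigned char *h_image, unsigned char *d_image,
                          size_t bytes, dim3 grid, dim3 block)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        for (int i = 0; i < 1000; ++i) {
            cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice);
            subtractionKernel<<<grid, block>>>(d_image);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // wait for the GPU to finish
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / 1000.0f;          // average milliseconds per frame
    }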

We have used the same method of measurement for the serial versions of the approaches. It must be emphasised that the run-time for the Codebook approach depends on the variation within the background training frames. If there is some variation, more codewords will be generated for each pixel, and the run-time will increase. However, with quality recording equipment and indoor scenes with low variation, this issue is almost non-existent. Experiments in [9] show that "only 6.5 codewords per pixel (on average) are required for the background acquisition in order to model 5 min of outdoor video captured at 30 frames/s".

4 Results

In this section we present the results of our comparisons of segmentation performance and processing time.

Segmentation Performance

The tables below show the results of comparing the subtracted frames to the manually segmented ground truth. There is reason to believe that this manual segmentation is a source of error to some degree, but our experiments show that the error percentage is insignificant. The results show that our implementation of SOBS produces the best result in 5 of the 8 frames, while our implementation of Codebook produces the best results in the remaining 3.

We believe that Horprasert et al.'s segmentation performance would improve with a better image capturing device, as we experienced issues with pixels whose values were capped at [0,0,0] and [255,255,255]. These capped values seem to unbalance the normal distributions used to automatically select thresholds.


Office Background

The following table shows the results for the office background sequence:

Image  Method  Recall   Precision  Overall
1      CB      0.7885   0.8399     0.8134
       SOBS    0.8868   0.8751     0.8809
       HORT    0.6198   0.7648     0.6847
2      CB      0.9473   0.8603     0.9017
       SOBS    0.9689   0.9341     0.9512
       HORT    0.7869   0.7824     0.7847
3      CB      0.7125   0.9035     0.7967
       SOBS    0.7575   0.9406     0.8392
       HORT    0.6306   0.7856     0.6996
4      CB      0.9080   0.8998     0.9038
       SOBS    0.9294   0.8419     0.8835
       HORT    0.9442   0.8248     0.8804

Table 1: Office background

Uniform background

The following table shows the results for the uniform background sequence:

Image  Method  Recall   Precision  Overall
1      CB      0.9145   0.8728     0.8932
       SOBS    0.8991   0.8631     0.8808
       HORT    0.9262   0.7734     0.8429
2      CB      0.9418   0.9413     0.9415
       SOBS    0.9559   0.9206     0.9379
       HORT    0.9263   0.8004     0.8588
3      CB      0.8614   0.9472     0.9023
       SOBS    0.9085   0.9404     0.9200
       HORT    0.8351   0.8723     0.8533
4      CB      0.9336   0.9666     0.9498
       SOBS    0.9523   0.9611     0.9567
       HORT    0.9559   0.8894     0.9215

Table 2: Uniform background


Office Background

Figure 3: Frame to be subtracted (row 1). Ground truth (row 2). Codebook (row 3). SOBS (row 4). Horprasert et al. (row 5)

In this sequence all methods struggle with the segmentation. This is because of the complex background and similar colours in the background and foreground. As seen, the methods struggle to extract foreground objects, such as the similarly coloured pants and t-shirts, in front of the cubicle wall. SOBS is superior to the others regarding correct classification of shadows, with Horprasert et al.'s method performing quite poorly compared to the other two. All three approaches manage to detect the small and almost transparent bottle which is placed by the couch in the third and fourth frames.


Uniform background

Figure 4: Frame to be subtracted (row 1). Ground truth (row 2). Codebook (row 3). SOBS (row 4). Horprasert et al. (row 5)

In this sequence, it is clear that both Codebook [9] and SOBS [10] perform quite well. There is a shadow on the wall in the upper centre part of the third image, and neither SOBS nor Codebook classifies this shadow as foreground, nor the shadowed areas on the floor around the persons. In row 5, however, we can see that the shadow in the third image is included in the foreground, and there is generally more noise in the picture. Morphological operators would improve the results by smoothing out the silhouettes and removing noise in the surroundings for both scene compositions in this paper.


Processing time

As seen from Table 3, the speed-ups are quite drastic, especially for Horprasert et al.'s approach, which is close to 100 times faster than the serial version for 640×480 images. This speed-up will prove valuable in further image processing steps like [3] if we want to achieve real-time performance. We must emphasise that the table shows the run-time results of our implementations of the algorithms. There is probably room for improvement, but the comparison between the GPU and CPU run-times is still valid, as the code that runs on the two technologies is almost identical. The GPU implementation includes a memory copy to the device that is not present in the CPU implementation. We used an Intel Core i7 CPU at 2.67 GHz for the serial implementations and an Nvidia GTX 295 GPU for the parallel implementations.

Method              Serial      Parallel   Speed-up
Codebook            58.600 ms   4.600 ms   12x
SOBS                500.000 ms  9.800 ms   51x
Horprasert et al.   75.000 ms   0.779 ms   96x

Table 3: Performance gain, serial vs parallel

5 Improvements

The images shown have not been post-processed in any way, so implementing morphological opening and closing would further improve the results. Moreover, a more thorough optimisation of the CUDA implementations could improve the processing times somewhat, but time limitations prevented us from doing further optimisations.

We timed OpenCV's morphological opening function, which consists of an erosion followed by a dilation, and it used approximately 12.6 ms per frame. Our implementation in CUDA was timed at 0.54 ms per frame, and produced near equivalent results. This translates to a speed-up of 23x. However, there are limitations to the CUDA implementation. We organised the 640×480 image in squares of 20×20 pixels, in order to utilise shared memory to speed up memory accesses. Our experiments show that such a regionalisation has an insignificant impact on the visual result of the morphological operators.
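The erosion half of such a tiled morphological operator might look like the sketch below (our own illustration; the matching dilation replaces min with max). Since 640×480 divides evenly into 20×20 tiles, no image boundary guard is needed here.

    // Shared-memory erosion over 20x20 tiles with a 3x3 structuring
    // element. Neighbours outside the tile are ignored, which is the
    // regionalisation discussed above.
    #define TILE 20

    __global__ void erodeTiled(const unsigned char *in, unsigned char *out,
                               int width)
    {
        __shared__ unsigned char tile[TILE][TILE];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int v = 255;
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                int tx = (int)threadIdx.x + dx, ty = (int)threadIdx.y + dy;
                if (tx < 0 || tx >= TILE || ty < 0 || ty >= TILE) continue;
                v = min(v, (int)tile[ty][tx]);
            }
        }
        out[y * width + x] = (unsigned char)v;
    }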

6 Conclusion

We have described three existing methods for background subtraction, and compared the results of our CUDA implementations of them. These parallel implementations have proven to produce a significant speed-up compared to our serial implementations. In order to achieve real-time performance in higher-level computer vision tasks, where background subtraction is only a single step in the process, run-time has to be kept to a minimum. To leave processing time for the other tasks, the run-time for background segmentation has to be far below 33 ms per frame. Our definition of real-time is 30 frames per second.

As shown in Table 3, Horprasert et al.'s approach is the quickest, at 0.779 ms per frame. This translates to 1284 frames/s, or 42 cameras delivering 30 frames/s simultaneously. However, this approach performed by far the worst in segmentation quality (Tables 1 and 2).


SOBS achieved the highest score overall in our comparison, but has the highest run-time at 9.8 ms/frame, which can serve 3 cameras at 30 frames/s. Which method one should implement in a larger system will depend on the demands for run-time and quality.

References

[1] P. Carr. GPU accelerated multimodal background subtraction. Digital Image Computing: Techniques and Applications, 2008.

[2] L. Cheng, S. Wang, D. Schuurmans, T. Caelli, and S. V. N. Vishwanathan. An online discriminative approach to background subtraction. AVSS, 2006.

[3] G. K. M. Cheung, S. Baker, and T. Kanade. Visual hull alignment and refinement across time: A 3D reconstruction algorithm combining shape-from-silhouette with stereo. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.

[4] Shinji Fukui, Yuji Iwahori, and Robert J. Woodham. GPU based extraction of moving objects without shadows under intensity changes. IEEE Congress on Evolutionary Computation, 2008.

[5] Minglun Gong and Li Cheng. Real-time foreground-background segmentation on GPUs using local online learning and global graph cut optimization. Pattern Recognition, 2008.

[6] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing, pages 465-467. Addison Wesley, 1st edition, 1992.

[7] Andreas Griesser, Stefan De Roeck, Alexander Neubeck, and Luc Van Gool. GPU-based foreground-background segmentation using an extended colinearity criterion. VMV, 2005.

[8] T. Horprasert, D. Harwood, and L. S. Davis. A robust background subtraction and shadow detection. Proc. ACCV, 2000.

[9] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3):172-185, 2005.

[10] L. Maddalena and A. Petrosino. A self-organizing approach to background subtraction for visual surveillance applications. IEEE Transactions on Image Processing, 17(7), 2008.

[11] L. Maddalena and A. Petrosino. Multivalued background/foreground separation for moving object detection. Fuzzy Logic and Applications: 8th International Workshop, 2009.

[12] Massimo Piccardi. Background subtraction techniques: a review. IEEE International Conference on Systems, Man and Cybernetics, 2004.

[13] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999.


Paper 3
Synthetic data for shape-from-silhouette

Authors: Rune H. Bakken.

Full title: Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods.

Published in: Proceedings of the 8th International Symposium on Visual Computing (ISVC 2012), G. Bebis et al. eds., 2012, Springer.

Copyright: © 2012 Springer Verlag Berlin Heidelberg.


Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods

Rune Havnung Bakken

Faculty of Informatics and e-Learning, Sør-Trøndelag University College, Trondheim, Norway

[email protected]

Abstract. The shape-from-silhouette approach has been popular in computer vision-based human motion analysis. For the results to be accurate, a certain number of cameras are required and they must be properly synchronised. Several datasets containing multiview image sequences of human motion are publicly accessible, but the number of available actions is relatively limited. Furthermore, the ground truth for the location of joints is unknown for most of the datasets, making them less suitable for evaluating and comparing different methods. In this paper a toolset for generating synthetic silhouette data applicable for use in 3D reconstruction is presented. Arbitrary camera configurations are supported, and a set of 2605 motion capture sequences can be used to generate the data. The synthetic data produced by this toolset is intended as a supplement to real data in planning and development, and to facilitate comparative evaluation of shape-from-silhouette-based motion analysis methods.

1 Introduction

Computer vision-based human motion capture has been a highly active field of research during the last two decades [14]. Monocular, stereo and multiview approaches have all been explored extensively. Multiview approaches put some special demands on the image acquisition process. In order to achieve accurate results the cameras used must be calibrated and properly synchronised, and with a higher number of cameras the complexity of those tasks increases. That complexity and the relatively high cost of the necessary equipment have led many researchers to rely on publicly available datasets in order to perform experiments.

The selection of publicly available multiview datasets is limited. Some of the datasets are created with a specific application in mind, for instance evaluation or action recognition, and this can limit their applicability to other application areas because of the camera configuration used, the types of actions performed, the number of subjects, or the length of the sequences. Consequently, it can be problematic for researchers to get access to data suitable for a specific human motion analysis task.


Human motion is governed by the underlying skeletal articulation, and in the context of human motion analysis pose estimation is defined as the process of finding the configuration of that skeletal structure in one or more frames. A possible alternative to relying on datasets with real image data is to use synthetically generated images. A bone hierarchy can be used to control the motion of an avatar, and multiview images of the sequence can be rendered for use in quantitative evaluation of a pose estimation algorithm. Since the underlying skeletal structure is known for these data, it can be used as ground truth in the evaluation. The problem is, however, that evaluation with synthetic data has historically been done in an ad hoc manner, making it difficult to reproduce the results and do comparative evaluation.

Shape-from-silhouette has been a very popular approach for doing 3D reconstruction in multiview human motion analysis since it was introduced by Laurentini [12]. Algorithms based on the shape-from-silhouette paradigm use binary silhouette images as input.

Goal: to make a large number of synthetic image sequences available for use in planning, development and evaluation of shape-from-silhouette-based human motion analysis research.

Contribution: a set of tools that makes it easy to produce silhouette image sequences with arbitrary camera setups by employing the 2605 real motion captures from the CMU motion capture database [6]. Furthermore, the toolset generates a synthetic ground truth that can be used for evaluation of new shape-from-silhouette-based methods. The toolset is made freely available to facilitate comparative evaluation.

The structure of the paper is as follows: in Section 2 we review some related work on synthetic motion data. Section 3 details the proposed toolset for synthetic silhouette generation, and explores some possible usage scenarios for the generated data. In Section 4 we present the results of using the generated data for evaluating a shape-from-silhouette-based human motion analysis algorithm. Section 5 contains a discussion of the assumptions made when using synthetic data and outlines some further work, and Section 6 concludes the paper.

2 Related Work

Using synthetic data for research on human motion analysis is not a novel idea, and is commonly done to alleviate some of the limitations of physical cameras, as was demonstrated by Franco et al. [7]. They presented a distributed approach for shape-from-silhouette reconstruction and wanted to test its performance on a higher number of cameras than the four physical cameras they had available, so they used images from between 12 and 64 synthetic viewpoints instead.

Cheung et al. [5] presented a shape-from-silhouette algorithm that considers temporal correspondences when doing 3D reconstruction from image data. The algorithm was applied to both static and articulated objects, and was used for acquiring detailed kinematic models of humans and for markerless human motion tracking. Experiments with synthetic data were done to obtain a quantitative comparison of different aspects of the algorithm.

Mundermann et al. [16] used synthetic data to explore how camera setups influence the performance of shape-from-silhouette methods in biomechanical analysis. Generating silhouette images of a laser scanned human model allowed them to simulate a wide range of different camera setups to study how the accuracy of the reconstruction is affected by the number and positions of cameras, and the location of the subject within the viewing volume.

The quality of the 3D reconstruction from a shape-from-silhouette algorithm depends on the amount of noise in the input silhouette images. Landabaso et al. [11] presented a novel algorithm that exhibits greater robustness when used with inconsistent silhouettes. In order to produce quantitative results for the algorithm they used a dataset consisting of synthetic images, in order to achieve better control of noise levels and occlusion, and for easier comparison with a known ground truth.

Gkalelis et al. [9] used motion capture data from the CMU database to animate ten different avatars performing four different actions. The synthetic sequences were used as a supplement to real data in evaluating their continuous human movement recognition method. The data were rendered using a single viewpoint.

Common for all the aforementioned applications of synthetic data is that the data in question were not drawn from publicly available databases. The synthetic images were either generated in an ad hoc manner or came from closed datasets, making it difficult for other researchers to reproduce the results. An effort to standardise the generation of synthetic human silhouettes was made by Pierard and Van Droogenbroeck [17]. They presented a technique for building databases of annotated synthetic silhouettes for use in training learning-based and example-based pose estimation approaches. The generated silhouettes are automatically annotated with pose information and identification of body parts. The images are produced from randomly generated poses of an avatar made with MakeHuman [3]. Consequently, extra measures must be taken to ensure that the silhouettes are realistic, and only single poses are generated rather than sequences of actual motion patterns.

2.1 Multiview Datasets

Several datasets with multiview images of human motion are publicly available. These contain both real and synthetic data.

The HumanEVA [20] datasets are specifically aimed at evaluation of human motion capture methods. The HumanEVA-I dataset consists of 48 image sequences, captured with four greyscale and three colour cameras. Six actions are repeated twice by four different subjects. The HumanEVA-II dataset has two image sequences with one subject each, captured with four synchronised colour cameras.

i3DPost [8] is a dataset with 96 sequences, from eight synchronised colour cameras in a blue-screen studio environment. The data include a wide variety of actions performed by eight different subjects. The CVSSP-3D [22] dataset contains seventeen image sequences, with the same studio setup as for i3DPost. Two subjects each perform a set of actions. The first subject wears different costumes and performs actions related to walking. The second subject performs a set of dance routines. The IXMAS [23] dataset consists of 33 image sequences that were captured with five colour cameras. Eleven subjects perform fourteen everyday actions that are repeated three times. Clean silhouette masks are available for these datasets.

In the MuHAVi [21] dataset there are 119 image sequences. The images were captured with eight colour cameras, but the image streams were not synchronised. There are seven subjects performing seventeen different actions. For a subset of the image sequences silhouettes have been manually annotated (two cameras with five actions and two actors).

The CMU MoBo [10] dataset was originally intended for research on gait recognition, but has been used for other application areas as well. There are 100 image sequences in the dataset. 25 subjects were filmed with six synchronised colour cameras while walking on a treadmill. Four different actions were captured, and noisy silhouette masks are included.

The ViHASi [18] dataset stands out from those described above, in that it consists entirely of synthetic data. Nine virtual subjects performing twenty different actions were animated using a character animation software package. Configurations of 12-40 virtual cameras were placed in two circular formations around the subjects.

A summary of the findings on publicly available multiview datasets is shown in Table 1.

Table 1: Publicly available multiview datasets. Comparison of number of cameras, subjects, sequences, actions, synchronisation and image types.

Name          Cam.   Sub.  Seq.  Act.  Synch.  Image type
HumanEVA-I    7      4     48    6     Yes     Real
HumanEVA-II   4      2     2     4     Yes     Real
IXMAS         5      11    33    14    Yes     Real w/ silhouettes
i3DPost       8      8     96    12    Yes     Real w/ silhouettes
CVSSP-3D      8      2     17    17    Yes     Real w/ silhouettes
MuHAVi        8      7     119   17    No      Real w/ silhouettes (subset)
CMU MoBo      6      25    100   4     Yes     Real w/ silhouettes
ViHASi        12-40  9     180   20    Yes     Synthetic silhouettes

3 Toolset for Generating Synthetic Silhouettes

The proposed toolset is implemented as a plugin for the open source 3D content creation suite Blender (http://blender.org) and a standalone analysis tool.


3.1 Blender Module

This section describes the features of the Blender plugin. The toolset's main interface can be seen in Fig. 1.

Fig. 1: The main interface of the Blender module.

Motions As mentioned earlier, the CMU motion capture database contains 2605 sequences. These are divided into six categories (human interaction, interaction with environment, locomotion, physical activities and sport, situations and scenarios, and test motions). The proposed toolbox supports importing motion sequences in the ASF/AMC format. Blender has import plugins for the C3D and BVH motion capture formats, so support for these could be added later. Some of the sequences in the CMU dataset contain noise, particularly around the extremities. Consequently, it is possible to lock the outer joints when importing an action to reduce the effects of the noise.

Multiple motion sequences can be imported with separate avatars. This works both for sequences that contain multiple performers, and for combining sequences with a single performer. Different actions can be stitched together for a single avatar. This motion stitching functionality relies on Blender's interpolation of the sequences, and works best for combinations of actions where the end pose in the first sequence is similar to the start pose in the second sequence.

The standard frame rate for the motion captures in the CMU database is 120 frames per second. The proposed toolset supports downsampling the frame rate when rendering the silhouette images.

For each time step in a rendered sequence the locations of the joints defined by the input motion capture file are calculated, and exported along with the silhouette images. These joint locations can be used as ground truth when evaluating motion analysis techniques that use the silhouette images as input.


Meshes An unclothed avatar is included with the toolset. This mesh was created with MakeHuman [3], a tool for generating anatomically correct human models with simple controls for setting age, gender, body shape and ethnicity. The mesh used in the proposed toolset is of a fairly athletic, young adult male. It is relatively easy to add new avatars to the toolset. Meshes in the open OBJ format are supported, given that the avatar's pose matches the rest pose in Blender's skeletal animation system. The skeletal structure is automatically skinned with the mesh model using Blender's bone heat method, an implementation of the approach outlined by Baran and Popovic [2].

Camera Setup The toolbox supports simulating arbitrary camera setups. The automatic setup mode will generate a set of cameras arranged in a circle, facing the origin. The user can control the radius of the circle and the latitude of the cameras. It is also possible to add cameras in the ceiling. The default resolution for the virtual cameras is 800×600 pixels. If the automatically generated setup is not satisfactory the user is free to make any necessary modifications. The cameras can be moved and rotated in Blender's 3D view to create an adequate camera setup.

The calibration information for the camera setup is exported along with the rendered silhouette images. The calibration information includes focal lengths, principal points, positions, and rotations for the cameras. An exported camera setup can be loaded and reused for subsequent simulations.
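As an illustration of how the exported parameters can be consumed, the sketch below projects a world point into a camera, assuming the export provides focal lengths, a principal point, a world-to-camera rotation matrix and the camera centre (our assumption about the representation, not a documented file format).

    // Pinhole projection of a world point X into pixel coordinates (u, v)
    // using exported calibration values. The field layout is our assumption.
    struct Camera {
        float fx, fy;      // focal lengths
        float cx, cy;      // principal point
        float R[3][3];     // world-to-camera rotation
        float C[3];        // camera centre in world coordinates
    };

    void project(const Camera &cam, const float X[3], float &u, float &v)
    {
        float p[3];
        for (int i = 0; i < 3; ++i)
            p[i] = cam.R[i][0] * (X[0] - cam.C[0])
                 + cam.R[i][1] * (X[1] - cam.C[1])
                 + cam.R[i][2] * (X[2] - cam.C[2]);
        u = cam.fx * p[0] / p[2] + cam.cx;
        v = cam.fy * p[1] / p[2] + cam.cy;
    }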

Analysis Tool Bundled with the toolset is an analysis tool for evaluating human motion capture methods that use the generated silhouette images as input. Estimated joint locations can be compared to the synthetic ground truth data generated by the toolset. The analysis tool can visualise 3D error, as well as errors in the x, y, and z dimensions separately. The analysis tool calculates the mean error and standard deviation for all the joints per time step, and the mean error and standard deviation for the entire sequence. Box and whisker plots (following the 1.5 IQR convention) can be generated to examine the dispersion of the data and identify outliers.
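The per-sequence statistics reduce to a mean and standard deviation over Euclidean joint errors, as in this small sketch (our own illustration of the computation, not code from the toolset):

    // Mean and standard deviation of the 3D error between estimated and
    // ground-truth joint positions.
    #include <cmath>
    #include <vector>

    struct Vec3 { float x, y, z; };

    void errorStats(const std::vector<Vec3> &est, const std::vector<Vec3> &gt,
                    double &mean, double &stddev)
    {
        double sum = 0.0, sq = 0.0;
        for (size_t i = 0; i < est.size(); ++i) {
            double dx = est[i].x - gt[i].x;
            double dy = est[i].y - gt[i].y;
            double dz = est[i].z - gt[i].z;
            double e = std::sqrt(dx * dx + dy * dy + dz * dz);
            sum += e;
            sq += e * e;
        }
        mean = sum / est.size();
        stddev = std::sqrt(sq / est.size() - mean * mean);
    }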

3.2 Usage Scenarios

A number of possible usage scenarios for the toolset can be envisioned.

Planning The camera equipment required for multiview human motion analysis is expensive, so it can be useful to experiment with different camera setups to find the most suitable configuration for a specific application before making a purchase. The rigging and calibration of cameras is a time-consuming process. Using the proposed toolset to experiment with camera setups is faster, and once a satisfactory configuration has been found it can be replicated in the lab.


Development Not all researchers have access to a multicamera capture lab and are reliant on using data provided by others. Having a flexible system for generating data like the one presented in this paper can make human motion analysis research more accessible. Coupled with the large action corpus in the CMU database the proposed toolset provides a good starting point for development of algorithms for different applications.

Evaluation There are few datasets with known ground truth available. Of the datasets examined in Sect. 2, only the HumanEVA datasets are specifically aimed at evaluation and comparison of human motion analysis methods. As Mundermann et al. [15] demonstrated, it can be difficult to utilise all the data in the HumanEVA datasets due to problems with background subtraction, further limiting the availability of data for evaluation purposes. The proposed toolset can generate datasets with reproducible results for comparison of different algorithms. The actual data used for evaluation need not be shared, only the calibration parameters and information on which sequences were used. Along with the silhouette images a synthetic ground truth is generated. The ground truth data consists of the locations of the joints in the imported skeleton definition.

4 Case Study: Real-time Pose Estimation

The proposed toolset was used to evaluate the performance of the real-time pose estimation method described in [1]. The method consists of the following steps. First, a volumetric visual hull is reconstructed from silhouette images. Second, the visual hull is thinned to a skeleton representation. Next, a tree structure is built from the skeleton voxels, and spurious branches are removed. The extremities (hands, feet, and head) are identified, and the tree is segmented into body parts. The final step is to label hands and feet as left or right.

To evaluate the algorithm outlined above, a dance sequence from the CMU database was used (subject 55, trial 1) to generate a test set of silhouette images. The sequence consists of 1806 frames, and has the subject doing a series of pirouettes around the centre of the viewing volume. The ASF and AMC files were loaded into the Blender plugin. This automatically built an armature and skinned it with the chosen avatar mesh. A camera setup of eight cameras was chosen. Seven of the cameras were placed in a circular formation around the subject, and the eighth was placed in the ceiling, pointing down. The sequence was downsampled to 30 fps, and silhouettes were rendered. Fig. 2 shows the data at different stages of the experiment.

After the outlined algorithm had produced an estimate of the extremity positions for the entire sequence, the analysis tool was used to evaluate those results. The generated ground truth data was loaded into the analysis tool along with the estimates, and a number of graphs were generated, as shown in Fig. 3.

Fig. 2: Frame 37 from the case study. (a) Ground truth skeletal structure with joint positions. (b) Mesh model fitted to the skeletal structure. (c) Rendered silhouettes from eight viewpoints. (d) Reconstructed visual hull. (e) Labelled tree extracted from the visual hull.

Figure 3a shows the mean error for all extremities per frame of the sequence. The mean error for the entire sequence was 47.8 mm (st. dev. 27.3 mm). Fig. 3b illustrates a problem with the right finger tip: in a considerable number of frames the position estimate is around 120 mm from the ground truth. The cause is that the algorithm chooses the thumb branch as the extremity in those frames. In Fig. 3c a number of spikes can be seen in the estimate of the position of the right toe. This is caused by noise in the reconstructed visual hull between the feet, and the right foot pointing backwards in these frames. Finally, a box and whisker plot for all four extremities can be seen in Fig. 3d.

5 Discussion and Future Work

The synthetic data produced by the proposed toolset is intended as a supplement to the real datasets available. Naturally there are some differences between real and synthetic data that need to be considered when using synthetic data. There are non-trivial issues with reconstruction and image segmentation when using real data. Depending on the quality of the camera calibration, the reprojection errors in the 3D reconstruction can be problematic. Significant advances are being made on background subtraction algorithms [4], but segmenting the images into binary silhouettes is still a difficult task that invariably introduces some noise.

In comparison, the synthetic data have perfectly clean silhouettes, exactly known extrinsic calibration parameters and no lens distortion. It is important to keep in mind that this is an unrealistically ideal situation, and that results produced using synthetic data might not directly transfer to real world applications. However, if the assumption is made that calibration is accurate and background subtraction is robust, then using synthetic silhouettes is a reasonable approximation.

The toolset currently only has one avatar, a hairless and unclothed male mesh. Loose fitting clothing and long hair can have a significant impact on the shape of the reconstructed visual hull, and on the results of further processing.


Fig. 3: Four graphs used for analysing the pose estimation algorithm in the case study. (a) Mean error per frame for all extremities. (b) Error for the right finger tip. (c) Error for the right toe tip. (d) Box and whisker plot for all four extremities.

The range of available avatars should be extended with representatives of both genders, with different types of clothing and hair, and with different body shapes.

The automatic skinning method supplied by Blender can in some cases lead to poor results, because it is dependent on the armature and mesh being aligned before skinning. For some sequences this causes the limbs of the mesh to intersect during the motion. Better skinning algorithms should be investigated, for instance the one presented by Rau and Brunnett [19]. The current support for combining motions is rather primitive and relies on Blender's interpolation functionality. This should be improved, for instance by implementing a more robust motion stitching algorithm like the one presented by Li et al. [13].

6 Concluding Remarks

A toolset for generating synthetic silhouette data for use in human motion analysis has been presented. The toolset consists of a Blender plugin for rendering silhouette images of an avatar animated using motion capture sequences. These motion capture sequences are drawn from the CMU database that contains 2605 trials in six categories. The plugin has support for arbitrary camera configurations and basic motion stitching, and ground truth data is exported along with the generated silhouettes. An analysis tool that facilitates evaluation and comparison of shape-from-silhouette-based pose estimation methods is included. The analysis tool can visualise 3D error and errors in each of the x, y, and z dimensions, and generate box and whisker plots. The toolset is available under the BSD license at:

http://code.google.com/p/mogens/

Acknowledgements The author wishes to thank Geir Hauge, Dag Stuan, and Vegard Løkas for their invaluable assistance with the implementation of the software presented in this paper. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.

References

1. Bakken, R.H., Hilton, A.: Real-Time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences. In: Csurka, G., Braz, J. (eds.) Proceedings of the International Conference on Computer Vision Theory and Applications. pp. 181-190. SciTePress (2012)

2. Baran, I., Popovic, J.: Automatic Rigging and Animation of 3D Characters. ACM Transactions on Graphics 26(3), 72:1-8 (2007)

3. Bastioni, M., Re, S., Misra, S.: Ideas and methods for modeling 3D human figures. In: Proceedings of the 1st Bangalore Annual Compute Conference. pp. 10:1-6. ACM (2008)

4. Benezeth, Y., Jodoin, P.M., Emile, B., Laurent, H., Rosenberger, C.: Comparative study of background subtraction algorithms. Journal of Electronic Imaging 19(3), 033003:1-12 (2010)

5. Cheung, K.M.G., Baker, S., Kanade, T.: Shape-From-Silhouette Across Time Part II: Applications to Human Modeling and Markerless Motion Tracking. International Journal of Computer Vision 63(3), 225-245 (2005)

6. CMU: Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/

7. Franco, J.S., Menier, C., Boyer, E., Raffin, B.: A Distributed Approach for Real Time 3D Modeling. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop. pp. 31-38. IEEE (2004)

8. Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., Pitas, I.: The i3DPost multi-view and 3D human action/interaction database. In: Proceedings of the Conference for Visual Media Production. pp. 159-168. IEEE (2009)

9. Gkalelis, N., Tefas, A., Pitas, I.: Combining Fuzzy Vector Quantization With Linear Discriminant Analysis for Continuous Human Movement Recognition. IEEE Transactions on Circuits and Systems for Video Technology 18(11), 1511-1521 (2008)

10. Gross, R., Shi, J.: The CMU Motion of Body (MoBo) Database. Tech. Rep. 1, Carnegie Mellon University (2001)

11. Landabaso, J., Pardas, M., Casas, J.: Shape from inconsistent silhouette. Computer Vision and Image Understanding 112(2), 210-224 (2008)

12. Laurentini, A.: The Visual Hull Concept for Silhouette-Based Image Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150-162 (1994)

13. Li, L., McCann, J., Faloutsos, C., Pollard, N.: Laziness is a virtue: Motion stitching using effort minimization. In: Proceedings of Eurographics. pp. 87-90 (2008)

14. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 90-126 (2006)

15. Mundermann, L., Corazza, S., Andriacchi, T.P.: Markerless human motion capture through visual hull and articulated ICP. In: Proceedings of the NIPS Workshop on Evaluation of Articulated Human Motion and Pose Estimation (2006)

16. Mundermann, L., Corazza, S., Chaudhari, A.M., Alexander, E.J., Andriacchi, T.P.: Most favorable camera configuration for a shape-from-silhouette markerless motion capture system for biomechanical analysis. In: Proceedings of SPIE-IS&T Electronic Imaging. vol. 5665, pp. 278-287 (2005)

17. Pierard, S., Van Droogenbroeck, M.: A technique for building databases of annotated and realistic human silhouettes based on an avatar. In: Proceedings of Workshop on Circuits, Systems and Signal Processing. pp. 243-246. Citeseer (2009)

18. Ragheb, H., Velastin, S., Remagnino, P.: ViHASi: Virtual Human Action Silhouette Data for the Performance Evaluation of Silhouette-based Action Recognition Methods. In: Proceedings of the 1st ACM Workshop on Vision Networks for Behaviour Analysis. pp. 77-84. ACM (2008)

19. Rau, C., Brunnett, G.: Anatomically Correct Adaption of Kinematic Skeletons to Virtual Humans. In: Richard, P., Kraus, M., Laramee, R.S., Braz, J. (eds.) Proceedings of the International Conference on Computer Graphics Theory and Applications. pp. 341-346. SciTePress (2012)

20. Sigal, L., Balan, A.O., Black, M.J.: HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. International Journal of Computer Vision 87(1-2), 4-27 (2010)

21. Singh, S., Velastin, S., Ragheb, H.: MuHAVi: A Multicamera Human Action Video Dataset for the Evaluation of Action Recognition Methods. In: Proceedings of the Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance. pp. 48-55 (2010)

22. Starck, J., Hilton, A.: Surface Capture for Performance-Based Animation. IEEE Computer Graphics and Applications 27(3), 21-31 (2007)

23. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104(2-3), 249-257 (2006)


Paper 4
Real-time skeletonisation

Authors: Rune H. Bakken and Lars M. Eliassen.

Full title: Real-time 3D Skeletonisation in Computer Vision-Based Human Pose Estimation Using GPGPU.

Published in: Proceedings of the Third International Conference on Image Processing Theory, Tools and Applications (IPTA 2012) - Special session on High Performance Computing in Computer Vision Applications (HPC-CVA), 2012, IEEE.

Copyright: © 2012 IEEE.


Real-time 3D Skeletonisation in Computer Vision-Based Human Pose Estimation Using GPGPU

Rune Havnung Bakken¹ and Lars Moland Eliassen¹,²

¹ Faculty of Informatics and e-Learning, Sør-Trøndelag University College, Trondheim, Norway, e-mail: [email protected]
² Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, e-mail: [email protected]

Abstract— Human pose estimation is the process of approximating the configuration of the body's underlying skeletal articulation in one or more frames. The curve-skeleton of an object is a line-like representation that preserves topology and geometrical information. Finding the curve-skeleton of a volume corresponding to the person is a good starting point for approximating the underlying skeletal structure. In this paper a GPU implementation of a fully parallel thinning algorithm based on the critical kernels framework is presented. The algorithm is compared to another state-of-the-art thinning method, and while it is demonstrated that both achieve real-time frame rates, the proposed algorithm yields superior accuracy and robustness when used in a pose estimation context. The GPU implementation is more than 8× faster than a sequential version, and the positions of the four extremities are estimated with an RMS error of ∼6 cm and ∼98% of frames correctly labelled.

Keywords— Skeletonisation, GPGPU, Real-time, Human Motion Analysis.

I. INTRODUCTION

Human motion capture is the process of registering human movement over a period of time. Accurately capturing human motion is a complex task with many possible applications, for instance identifying people by their gait, interacting with computers using gestures, improving athlete performance, diagnosing orthopedic patients, and creating virtual characters with natural looking motions in entertainment. For some applications it is essential that motion is captured automatically with low latency and real-time performance, for instance in perceptive user interfaces and gait recognition.

Human motion is governed by an underlying articulated skeletal structure, and in the context of computer vision-based human motion analysis Moeslund et al. [1] define pose estimation as the process of finding an approximate configuration of this structure in one or more frames. Shape-from-silhouette has been a popular approach for recovering the volume of the subject in multi-camera setups. The resultant visual hull is an overestimate of the volume occupied by the subject, and is reconstructed from silhouette images covering the scene from different viewpoints.

The skeleton is a compact 1D line-like representation of an object. Skeletonisation is a homotopic process, in other words it is topology preserving. A curve-skeleton algorithm preserves both topology and geometrical information. Skeletonising a visual hull to produce a curve-skeleton yields an approximation to the underlying skeletal structure that can be used to estimate the subject's pose. Fully parallel thinning algorithms produce skeletons by considering boundary voxels for removal independently, and are ideal for implementation on stream processing hardware. The advent of inexpensive consumer graphics cards with stream processing capabilities has popularised general-purpose computing on graphics processing units (GPGPU).

Goal: the overall goal of the research behind this paper is to develop a robust, real-time pose estimation method.

Contribution: we present a GPGPU implementation of a fully parallel thinning algorithm that is well suited for use in human pose estimation, and achieves real-time performance.

This paper is organised as follows: in Section II relevant work by other researchers is examined. The proposed thinning algorithm is detailed in Section III. Results of experiments with the proposed method are presented in Section IV and compared with another state-of-the-art algorithm. The strengths and limitations of the approach are discussed in Section V, and Section VI concludes the paper.

II. RELATED WORK

Representing 3D objects by a skeletal approximation has many different applications. A comprehensive overview of the properties, applications and algorithms for finding skeletal representations of 3D objects was given in the survey by Cornea et al. [2]. There are several, sometimes conflicting, ways to define the skeletal representation of an object, but for the remainder of this paper we will adhere to the conventions used by Cornea et al.

The medial axis transform introduced by Blum [3] is used to find the medial axis of objects in 2D, but when generalised to 3D it results in a medial surface. The skeleton of an object is defined as the locus of centre points of maximal inscribed balls. The medial axis and the skeleton are closely related, but not exactly the same. The curve-skeleton introduced by Svensson et al. [4] is an alternate representation without a rigorous definition. Generally it can be said to be a line-like 1D structure consisting of curves that match the topology of the object, but also contain geometrical information. This is in contrast to the ultimate skeleton, which only preserves topology.


There is a multitude of algorithms for finding the skeleton of an object, but Cornea et al. [2] divide them into four broad categories: thinning, geometrical, distance field, and potential field. A thorough discussion of the strengths and weaknesses of each type of approach is beyond the scope of this paper, and the reader is referred to the aforementioned survey. Thinning algorithms iteratively remove points from the object without changing its topology, and offer a good compromise between accuracy and computational cost. A subset of the thinning category consists of fully parallel algorithms. These procedures evaluate each voxel of the 3D object independently, making them very well suited for implementation on modern graphics hardware.

Critical kernel theory established by Bertrand and Couprie [5] is a powerful framework for studying n-dimensional homotopic thinning algorithms. It has been used to prove existing algorithms, as well as for designing a number of new parallel thinning algorithms.

The motion of the human body is governed by the skeleton, a rigid, articulated structure. Recovering this structure from image evidence is a common approach in computer vision based motion capture. A method for automatic initialisation based on homeomorphic alignment of a skeleton from a critical kernels based algorithm with a weighted model tree was presented by Raynal et al. [6]. The alignment was done by minimising the edit distance between the data and model trees. The method was intended to be used as an initialisation step for a pose estimation or tracking framework.

Fitting a kinematic model to the skeletal data is an approach taken by several researchers. The pose estimation framework described by Moschini and Fusiello [7] used the hierarchical iterative closest point algorithm to fit a stick figure model to a set of data points on the skeleton curve. The method can recover from errors in matching, but requires manual initialisation. The approach presented by Menier et al. [8] uses Delaunay triangulation to extract a set of skeleton points from a closed surface visual hull representation. A generative skeletal model is fitted to the data skeleton points using maximum a posteriori estimation. The method is fairly robust, even for sequences with fast motion. A tracking and pose estimation framework where Laplacian eigenmaps were used to segment voxel data and extract a skeletal structure consisting of spline curves was presented by Sundaresan and Chellappa [9]. Common for these approaches is that they do not achieve real-time performance.

Straka et al. [10] recently presented a real-time skeletal graph based pose estimation approach. Voxel scooping, a method designed for neuron tracing, was used to extract a skeletal graph. Using distance matrices they identified the extremities, and used an automated rigging procedure to fit a skeletal model to the graph. The subject poses accepted by this method were limited (the head must be the highest point), and it is not capable of automatically labelling limbs as left or right.

III. PARALLEL THINNING ALGORITHM

A thinning algorithm iteratively removes boundary voxels from an object to produce a topologically equivalent skeleton. The concept of simple points was introduced by Morgenthaler [11]. A simple point is a voxel that can be safely removed without changing the object's topology. A fully parallel thinning algorithm can consider all boundary points for deletion in a single iteration. To achieve this the algorithm must rely on local characteristics of each voxel; only a small neighbourhood around each voxel should be examined.

A cubical complex is a set consisting of elements of various dimensions (cubes, squares, lines, and points) called faces. Figure 1 shows a complex consisting of one of each d-face with d from 0 to 3. In the 3×3×3 neighbourhood of a voxel there are 6 neighbours that share a 2-face with the centre voxel. This is called the 6-neighbourhood. There are 12 voxels that share a 1-face with the centre voxel, and combined with the 6-neighbours these voxels constitute the 18-neighbourhood. The 8 corner voxels share a 0-face with the centre, and form the 26-neighbourhood when combined with all other neighbours.

Fig. 1: Faces in F³. A 0-face (point P), a 1-face (edge E), a 2-face (square S), and a 3-face (cube C).

Malandain et al. [12], and Bertrand and Malandain [13] presented a topological classification based solely on local characteristics. Using this approach it is possible to classify a voxel by counting connected components in its immediate 3×3×3 neighbourhood. Two topological numbers, C* and C̄, are calculated. After removing the centre voxel, the number of 26-connected foreground components in the 26-neighbourhood is denoted by C*. The number of 6-connected background components in the 18-neighbourhood is denoted by C̄. Using these two numbers voxels can be classified into the categories shown in Table 1.

Table 1: Topological numbers for classifying voxels.

Classification   C̄    C*
Interior         0
Simple           1    1
Surface          >1

It is difficult to guarantee that topology is preserved when only a small neighbourhood is considered, but the critical kernels framework is a powerful tool for proving existing and designing new homotopic parallel thinning algorithms. Next, we will briefly describe the critical kernels framework. For further details see [5], [14].


Two faces f and g are called a free pair of a complex if f is the only face that strictly includes g. The collapse operation, defined in topology, is homotopic and consists of removing free pairs. The critical kernel of a complex is the subset of faces that cannot be collapsed any further. In Fig. 2 a complex (a set of voxels) and its critical kernel can be seen.

Fig. 2: A set of voxels, and its critical kernel, from [14].

The critical kernel is a set of faces of dimension 0 to n. A d-cell is a d-face and all its subfaces, hence a set of voxels contains only 3-cells. This is called a pure 3-complex, and we require that the result of the thinning algorithm also is a pure 3-complex. Consequently, the critical kernel is not found explicitly. Instead the voxels adjacent to the critical kernel are identified. These are called crucial voxels, and there are 0-, 1-, and 2-crucial voxels depending on what type of face they share with the critical kernel. A fully parallel curve-skeleton thinning algorithm based on these concepts is shown in Alg. 1.

Algorithm 1 Fully parallel thinning, CK3
 1: repeat
 2:   for each voxel v in parallel do
 3:     if simple(v) then
 4:       s[v] ← true.
 5:     if curve(v) then
 6:       s[v] ← false.
 7:     if s[v] ∧ crucial2(v) then
 8:       c2[v] ← true.
 9:     if s[v] ∧ crucial1(v) then
10:       c1[v] ← true.
11:     if s[v] ∧ crucial0(v) then
12:       c0[v] ← true.
13:     if s[v] ∧ ¬(c0[v] ∨ c1[v] ∨ c2[v]) then
14:       v ← 0.
15:   end
16: until stability

First, simple voxels are identified using the topological classification in Table 1. Curve end points are by definition simple, as removing them does not change the topology. However, a curve-skeleton is the desired output of the algorithm, so some exception must be made to keep curve voxels. Simple voxels with no non-simple surface or interior voxels in their neighbourhood are part of the curve, hence in the second step voxels that meet this condition are no longer marked as simple.

The next three steps involve identifying voxels as 2-crucial, 1-crucial, or 0-crucial, respectively. This is done by testing the neighbourhood of each voxel against a set of masks. See [5] for details. This is repeated until the volume no longer changes from one iteration to the next. To check if the volume has changed, the difference between the two volumes is found and parallel reduction [15] is used to find a single value.
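A sequential stand-in for this stability test might look as follows (the thesis uses a GPU parallel reduction [15]; the function name and signature are our assumptions):

    #include <cstdint>

    // Thinning has reached stability when no voxel changed in the last pass,
    // i.e. the difference between the two volumes reduces to zero.
    bool volumesEqual(const uint8_t* before, const uint8_t* after, int n) {
        int diff = 0;
        for (int i = 0; i < n; ++i)
            diff += (before[i] != after[i]);  // reduce differences to one value
        return diff == 0;
    }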

Alg. 1 works on large data sets, has high parallelism, and minimal dependency between data elements, hence it is ideal for GPGPU implementation. The algorithm is implemented using NVIDIA's CUDA. CUDA programs can access several memory spaces with different characteristics. The global memory space is accessible from all threads and has high capacity, but also high latency. The GPU has a high number of registers that are used for storing variables, but since the number of active threads at any one time is also high, the number of registers per thread is limited. If the register pressure is too high, variables are stored in local memory instead. The local memory space is physically similar to global memory, and has the same high latency. Each block of threads has access to a shared memory region with very low latency, but with limited capacity. The constant memory space is read-only and is physically similar to global and local memory, but its contents are cached so it does not suffer from the same high latency. To achieve the highest possible throughput it is desirable to store data in shared memory when processing it, and to keep all local variables in registers.

The first two steps of Alg. 1 require an efficient way of counting connected components in 3×3×3 neighbourhoods. The algorithm outlined by Malandain and Bertrand [16] is adapted for implementation on the GPU. Because of memory and register constraints, all sets are implemented as single integers. A binary 3×3×3 neighbourhood fits within a 32-bit integer as only one bit is required per voxel. Bitwise operations are used to set and access voxel values, and to compare sets against each other. Bitwise operations have very high throughput and allow us to process the entire neighbourhood in parallel. The connected component counting algorithm is shown in Alg. 2.
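For illustration, a neighbourhood can be packed into a single word as follows (a CPU sketch; the bit layout and function name are our assumptions, and the volume is assumed to carry a one-voxel background border so no bounds checks are needed):

    #include <cstdint>

    // Pack the binary 3x3x3 neighbourhood of voxel (x, y, z) into one 32-bit
    // word, one bit per voxel: bit index i = (dx+1) + 3*(dy+1) + 9*(dz+1).
    uint32_t packNeighbourhood(const uint8_t* vol, int N, int x, int y, int z) {
        uint32_t bits = 0;
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (vol[(x + dx) + N * ((y + dy) + N * (z + dz))])
                        bits |= 1u << ((dx + 1) + 3 * (dy + 1) + 9 * (dz + 1));
        return bits;
    }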

Algorithm 2 Connected component counting.
1: buffer, filled, visited ← 0
2: setbit(buffer, ffs(neighbourhood) − 1)
3: while buffer ≠ 0 do
4:   i ← ffs(buffer) − 1
5:   setbit(visited, i)
6:   filled ← filled ∨ (neighbourhood ∧ mask_i)
7:   buffer ← filled ∧ ¬visited
8: end while

We observe that in order to find the topological numbers required in Alg. 1 it is sufficient to determine whether the neighbourhood of a voxel has 0, 1 or >1 connected components. If neighbourhood = 0 there are no components, so Alg. 2 only needs to determine whether there is exactly one connected component or not.


The algorithm does this by flood filling the first component it encounters. If neighbourhood = filled there is one, and if neighbourhood ≠ filled there are >1 connected components. The seed point for the flood fill is selected using the ffs (find first set) intrinsic function. Neighbouring voxels are found using a set of masks, and added to the buffer. This is repeated until the buffer is empty. The masks consist of 54 integers (27 for each type of connectivity) stored in constant memory. A 2D example with 8-connectivity of the connected component counting is shown in Fig. 3.


Fig. 3: 2D example of the connected components algorithm with 8-connectivity. Columns: (1) the neighbourhood to process, (2) neighbour mask for selected voxel, (3) flood filled voxels, (4) the negation of visited voxels, (5) the resulting buffer. Observe that neighbourhood and filled are identical after the final iteration, so there is exactly one connected component.
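A CPU sketch of Alg. 2 in C++ (assumptions: __builtin_ffs is the GCC/Clang equivalent of the CUDA ffs intrinsic, the mask table is precomputed as described above, and the seed voxel is added to filled directly so a single-voxel component is handled):

    #include <cstdint>

    // Returns true iff the set bits of `nb` form exactly one connected
    // component; mask[i] holds the bits adjacent to bit i under the chosen
    // connectivity (26 for foreground, 6 for background).
    bool singleComponent(uint32_t nb, const uint32_t mask[27]) {
        if (nb == 0) return false;                       // no components at all
        uint32_t seed = 1u << (__builtin_ffs(nb) - 1);   // first set voxel
        uint32_t filled = seed, visited = 0, buffer = seed;
        while (buffer != 0) {
            int i = __builtin_ffs(buffer) - 1;           // take a frontier voxel
            visited |= 1u << i;
            filled |= nb & mask[i];                      // flood fill neighbours
            buffer = filled & ~visited;                  // next frontier
        }
        return filled == nb;     // every voxel reached => exactly one component
    }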

The volumes containing visual hulls are typically very sparse, with only 1–3% non-zero voxels. To reduce the number of voxels the thinning algorithm needs to visit, the following pre-processing algorithm is proposed. The volume is projected into the yz plane. Parallel reduction is used to find the coordinate pairs of voxel strips with non-zero voxels. This is illustrated in Alg. 3. The coordinate pairs are used by the thinning algorithm, and each voxel strip is copied into shared memory before processing the voxels.

Algorithm 3 Pre-processing the volume.

1: coords ← ∅
2: for each y, z do
3:   plane[y, z] ← parallel reduction along x axis
4:   if plane[y, z] ≠ 0 then
5:     coords ← coords ∪ (y, z)
6: end for
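A sequential C++ sketch of this pre-processing (the inner loop stands in for the GPU's parallel reduction along the x axis; names and volume layout are our assumptions):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Collect the (y, z) coordinates of all voxel strips containing foreground,
    // for a dense N x N x N volume stored as vol[x + N*(y + N*z)].
    std::vector<std::pair<int, int>> nonEmptyStrips(const uint8_t* vol, int N) {
        std::vector<std::pair<int, int>> coords;
        for (int z = 0; z < N; ++z)
            for (int y = 0; y < N; ++y) {
                uint8_t any = 0;
                for (int x = 0; x < N; ++x)          // OR-reduction along x
                    any |= vol[x + N * (y + N * z)];
                if (any) coords.emplace_back(y, z);
            }
        return coords;
    }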

Figure 4 shows an example volume pre-processed with Alg. 3. After pre-processing only ~14% of the voxels need to be examined during thinning.

Fig. 4: Voxels projected into the yz plane.

IV. EVALUATION AND RESULTS

A number of experiments were conducted to evaluate the skeletonisation algorithm outlined in Section III. Specifically, the suitability of the algorithm for use in real-time human motion analysis was assessed, and compared with another state-of-the-art real-time thinning approach. Raynal and Couprie [17] presented a parallel, directional thinning algorithm based on isthmuses (a generalisation of curve/surface interior points). This algorithm has been successfully applied in the domain of human motion analysis by Raynal [18], and will be referred to as D6I1D.

The method was evaluated using both real and synthetic data. Two sequences of real data were drawn from the i3DPost dataset [19]. In the walk sequence the subject walks across the viewing volume, and the ballet sequence consists of a dancer performing a number of fast pirouettes. The dataset contains images from eight HD cameras, but no ground truth data is available. The synthetic sequences were generated with the framework described in [20]. An avatar was animated using motion capture data from the CMU dataset [21]. Silhouette images of the sequences were rendered with eight virtual cameras at a resolution of 800×600, seven placed around the avatar and one in the ceiling pointing down. Two dance sequences, dance-whirl and lambada, totalling around 1000 frames (subject 55, trials 1 and 2), were used. The joint locations from the motion capture data were used as a basis for comparison. All experiments were conducted on a computer running Debian GNU/Linux with an Intel Core i7 2.66 GHz CPU, 12 GB of RAM, and an NVIDIA GTX 560 Ti graphics card with CUDA 4.0.

To evaluate and compare the suitability of the generated skeletons for human motion analysis, both skeletonisation algorithms were used in the pose estimation method described by Bakken and Hilton [22]. The method consists of the following steps. First, a volumetric visual hull is reconstructed from silhouette images with a resolution of 64×64×64.



Fig. 5: One frame from the dance-whirl sequence. (a) Ground truth skeletal structure with joint positions. (b) Mesh model fitted to skeletal structure. (c) Rendered silhouettes from eight viewpoints. (d) Reconstructed visual hull. (e) Skeleton. (f) Labelled tree extracted from skeleton.

The silhouette images are cross sections of generalised cones with apexes in the focal points of the cameras, and the visual hull is created by intersecting the silhouette cones. Second, the visual hull is thinned to a skeleton representation. Next, a tree structure is built from the skeleton by placing a node in each voxel and connecting them with edges, followed by removal of spurious branches. The extremities (hands, feet, and head) are identified, and the tree is segmented into body parts. The final step is to label hands and feet as left or right. Figure 5 illustrates the stages of the pose estimation algorithm.

Table 2 shows a comparison of errors in the estimated positions of the extremities, and percentages of invalid frames. There is a considerable difference in accuracy achieved with the two skeletonisation methods; for the dance-whirl sequence the error is more than doubled for D6I1D compared to CK3. The pose estimation algorithm also consistently performs more robustly with the skeletons from CK3, both with the synthetic and real data sequences.

Table 2: Comparison of mean Euclidean errors (st. dev. in parentheses) for the four extremities, and percentages of incorrectly labelled frames.

                     CK3                        D6I1D
Sequence        Error (mm)    Invalid (%)   Error (mm)     Invalid (%)
dance-whirl     53.9 (29.8)       3.1       121.6 (56.2)      20.2
lambada         75.9 (40.2)       1.8       137.9 (46.9)      13.9
walk                 -            22.8           -             40.4
ballet               -            17.3           -             28.0

Estimating stature using the labelled tree structure is another measure of the accuracy of the skeletons. The geodesic distance from toe tip to the top of the head would be an overestimate of the stature. Instead the tree is traversed upwards, and node groups corresponding to each bone are created. The stature is calculated by summing up the Euclidean distances between the first and last nodes in each node group. Stature estimates based on data from both skeletonisation algorithms are shown in Table 3. Using D6I1D results in three times higher error than CK3.
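A small C++ sketch of this stature estimate (the node-group representation and all names are our assumptions):

    #include <array>
    #include <cmath>
    #include <vector>

    using Point = std::array<double, 3>;

    // Sum, over the node groups along the head-to-toe chain, the straight-line
    // distance between each group's first and last node.
    double estimateStature(const std::vector<std::vector<Point>>& groups) {
        double stature = 0.0;
        for (const auto& g : groups) {
            if (g.size() < 2) continue;
            const Point& a = g.front();
            const Point& b = g.back();
            stature += std::sqrt(std::pow(a[0] - b[0], 2) +
                                 std::pow(a[1] - b[1], 2) +
                                 std::pow(a[2] - b[2], 2));
        }
        return stature;
    }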

Table 3: Comparison of mean stature estimates (in millimeters, with st. dev. in parentheses) using skeletons from the two algorithms.

Sequence       Ground truth   CK3         D6I1D
dance-whirl    1834           1784 (53)   1654 (57)
lambada        1834           1786 (50)   1640 (50)

The processing time of the GPU implementation of CK3 was compared with a sequential implementation. The results can be seen in Table 4. Using the GPU results in a speed-up of over 8×. The GPU implementation of CK3 was also compared to a sequential implementation of D6I1D, with results shown in Table 5. Even though D6I1D is almost three times faster when compared separately, using either of the skeletonisation algorithms for pose estimation results in a frame rate well above the target 30 fps.

Table 4: Comparison of mean processing time (in milliseconds, with st. dev. in parentheses) between the sequential version and the GPU implementation of CK3.

Sequence       Sequential     GPU           Speed up
dance-whirl    40.25 (2.38)   4.51 (0.39)   8.9×
lambada        40.13 (3.31)   4.78 (0.45)   8.4×
walk           42.71 (2.21)   3.99 (0.35)   10.7×
ballet         42.67 (2.90)   4.09 (0.37)   10.4×

V. DISCUSSION AND FUTURE WORK

The pose estimation algorithm outlined in Section IV was developed using skeletons from CK3 as input. Therefore it could be claimed that it is a biased evaluation metric when used to compare the output of CK3 with skeletons from another thinning algorithm. It is possible that assumptions were made during the development of the pose estimation algorithm that inadvertently favour CK3.


Table 5: Comparison of processing times and frame rates. The visual hull construction is slower with the i3DPost data because of the higher image resolution (Full HD).

(a) CK3

Sequence       Vis. hull     Skel.         Pose est.     Frame rate
dance-whirl    3.84 (0.21)   4.51 (0.39)   1.60 (0.09)   100.9
lambada        3.84 (0.21)   4.78 (0.45)   1.56 (0.08)   98.5
walk           9.96 (0.26)   3.99 (0.35)   1.57 (0.13)   65.9
ballet         9.96 (0.85)   4.09 (0.37)   1.67 (0.14)   63.6

(b) D6I1D

Sequence       Vis. hull     Skel.         Pose est.     Frame rate
dance-whirl    3.84 (0.21)   1.65 (0.02)   1.21 (0.07)   154.7
lambada        3.84 (0.21)   1.61 (0.03)   1.18 (0.08)   154.4
walk           9.96 (0.26)   1.71 (0.86)   1.23 (0.09)   79.7
ballet         9.96 (0.85)   1.73 (0.44)   1.29 (0.07)   77.0

However, it is clear from visual inspection that D6I1D yields skeletons that are indeed further from the ground truth. Two examples are given in Fig. 6. In these cases D6I1D removes more of the real skeleton curves, resulting in the complete disappearance of the left foot. When a tree structure is constructed from the skeleton voxels it is more straightforward to clean up spurious branches than to compensate for missing data. The D6I1D algorithm is slightly faster, but consistently yields poorer performance in terms of accuracy and robustness.

The proposed algorithm should be compared to other state-of-the-art skeletonisation algorithms. Straka et al. [10] presented a real-time graph-based human pose estimation approach, using skeletons to recover the subject's pose. The skeletons were produced with a technique called voxel scooping [23], and Straka et al. report real-time performance. At the time of writing there was no comparable implementation of this algorithm available, so comparison against results from voxel scooping must be deferred to future study.

There are several steps in Alg. 1 that could potentially be optimised further. The register pressure in the steps dealing with crucial cliques (lines 7–12) is currently too high, leading to some use of local memory. It is not straightforward to reduce the register usage of these functions, however, and it might require a complete rethinking of these mask operations. Raynal et al. [6] used lookup tables to achieve fast matching of the neighbourhood masks for a similar critical kernel based algorithm, but as demonstrated by Raynal [18] the required memory footprint of this approach makes it unsuitable for GPU implementation. All the lookup tables would have to be stored in global memory, seriously hampering performance.

VI. CONCLUDING REMARKS

A fully parallel GPU implementation of a critical kernel based thinning algorithm has been presented. The proposed algorithm was evaluated for use specifically in computer vision-based human motion analysis.


Fig. 6: Comparison of skeletons with CK3 on the left and D6I1D on the right. (a) shows one frame from the lambada sequence. Although there is less noise in the right skeleton, the real skeleton curves are also shortened too much, completely removing the left foot. (b) shows one frame from the walk sequence, also with considerable shortening of the limbs for the D6I1D skeleton.

Although not as fast as another state-of-the-art thinning method, the proposed algorithm performs well within the target frame rate of 30 fps on 64×64×64 volumes, and the underlying skeletal structure of the human body is recovered more accurately from volume sequences. Quantitative evaluation using synthetic data indicates a speed-up of >8× compared with a sequential implementation, an error of ~6 cm in estimated positions of the four extremities, and correct labelling of the body parts in ~98% of frames. Though not the fastest thinning algorithm available, the proposed method consistently performs better in terms of accuracy and robustness when used for human pose estimation.

REFERENCES

[1] T. B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, pp. 90–126, 2006.

[2] N. D. Cornea, D. Silver, and P. Min, "Curve-Skeleton Properties, Applications, and Algorithms," IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 3, pp. 530–548, 2007.

[3] H. Blum, "A Transformation for Extracting New Descriptors of Shape," in Models for the Perception of Speech and Visual Form, 1967, pp. 362–380.

[4] S. Svensson, I. Nyström, and G. Sanniti di Baja, "Curve skeletonization of surface-like objects in 3D images guided by voxel classification," Pattern Recognition Letters, vol. 23, pp. 1419–1426, 2002.

[5] G. Bertrand and M. Couprie, "A New 3D Parallel Thinning Scheme Based on Critical Kernels," Discrete Geometry for Computer Imagery (LNCS), vol. 4245, pp. 580–591, 2006.

[6] B. Raynal, M. Couprie, and V. Nozick, "Generic Initialization for Motion Capture from 3D Shape," Image Analysis and Recognition (LNCS), vol. 6111, pp. 306–315, 2010.

[7] D. Moschini and A. Fusiello, "Tracking Human Motion with Multiple Cameras Using an Articulated Model," Computer Vision/Computer Graphics Collaboration Techniques (LNCS), vol. 5496, pp. 1–12, 2009.

[8] C. Menier, E. Boyer, and B. Raffin, "3D Skeleton-Based Body Pose Recovery," in Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, Jun. 2006, pp. 389–396.

[9] A. Sundaresan and R. Chellappa, "Model-driven segmentation of articulating humans in Laplacian Eigenspace," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1771–1785, Oct. 2008.

[10] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "Skeletal Graph Based Human Pose Estimation in Real-Time," in Proceedings of BMVC, 2011, pp. 1–12.

[11] D. G. Morgenthaler, "Three-dimensional simple points: Serial erosion, parallel thinning, and skeletonization," Computer Science Center, University of Maryland, Tech. Rep., 1981.

[12] G. Malandain, G. Bertrand, and N. Ayache, "Topological Segmentation of Discrete Surfaces," International Journal of Computer Vision, vol. 10, no. 2, pp. 183–197, 1993.

[13] G. Bertrand and G. Malandain, "A new characterization of three-dimensional simple points," Pattern Recognition Letters, vol. 15, pp. 169–175, 1994.

[14] G. Bertrand and M. Couprie, "On Parallel Thinning Algorithms: Minimal Non-simple Sets, P-simple Points and Critical Kernels," Journal of Mathematical Imaging and Vision, vol. 35, pp. 23–35, 2009.

[15] M. Harris, "Optimizing Parallel Reduction in CUDA," 2007. [Online]. Available: developer.nvidia.com

[16] G. Malandain and G. Bertrand, "Fast Characterization of 3D Simple Points," in Proceedings of ICPR, 1992, pp. 232–235.

[17] B. Raynal and M. Couprie, "Isthmus-Based 6-Directional Parallel Thinning Algorithms," in Proceedings of DGCI, 2011, pp. 175–186.

[18] B. Raynal, "Applications of Digital Topology for Real-Time Markerless Motion Capture," PhD Thesis, Université Paris-Est, 2010.

[19] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, "The i3DPost multi-view and 3D human action/interaction database," in Proceedings of the Conference for Visual Media Production. IEEE, Nov. 2009, pp. 159–168.

[20] R. H. Bakken, "Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods," in Proceedings of ISVC, 2012.

[21] CMU, "Graphics Lab Motion Capture Database." [Online]. Available: http://mocap.cs.cmu.edu/

[22] R. H. Bakken and A. Hilton, "Real-Time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences," in Proceedings of the International Conference on Computer Vision Theory and Applications, 2012, pp. 181–190.

[23] A. Rodriguez, D. B. Ehlenberger, P. R. Hof, and S. L. Wearne, "Three-dimensional neuron tracing by voxel scooping," Journal of Neuroscience Methods, vol. 184, no. 1, pp. 169–175, Oct. 2009.


Paper 5
Pose estimation using skeletonisation

Authors: Rune H. Bakken and Adrian Hilton.

Full title: Real-time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences.

Published in: Proceedings of the Seventh International Conference on Computer Vision Theory and Applications (VISAPP 2012), Vol. 2, Gabriela Csurka and Jose Braz, eds., 2012, SciTePress.

Copyright: © 2012 SciTePress - Science and Technology Publications.


REAL-TIME POSE ESTIMATION USING TREE STRUCTURES BUILT FROM SKELETONISED VOLUME SEQUENCES

Rune Havnung Bakken1 and Adrian Hilton2

1 Faculty of Informatics and e-Learning, Sør-Trøndelag University College, Trondheim, Norway
2 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

[email protected], [email protected]

Keywords: Human motion, pose estimation, real-time

Abstract: Pose estimation in the context of human motion analysis is the process of approximating the body configuration in each frame of a motion sequence. We propose a novel pose estimation method based on constructing tree structures from skeletonised visual hulls reconstructed from multi-view video. The pose is estimated independently in each frame, so the method can recover from errors in previous frames, which overcomes the problems of tracking. Publicly available datasets were used to evaluate the method. On real data the method performs at a framerate of 15–64 fps depending on the resolution of the volume. Using synthetic data the positions of the extremities were determined with a mean error of 47–53 mm depending on the resolution.

1 INTRODUCTION

Capturing the motion of a person is a difficult task with a number of useful applications. Motion capture is used to identify people by their gait, for interacting with computers using gestures, for improving the performance of athletes, for diagnosis of orthopedic patients, and for creating virtual characters with more natural looking motions in movies and games. These are but a few of the possible applications of human motion capture.

In some of the application areas mentioned above it is important that the data acquisition is unconstrained by the markers or wearable sensors traditionally used in commercial motion capture systems. Furthermore, there is a need for low latency and real-time performance in some applications, for instance in perceptive user interfaces and gait recognition.

Computer vision based motion capture has been a highly active field of research in the last couple of decades, as surveys by for instance (Moeslund et al., 2006) and (Poppe, 2007) show. A popular approach for multi-camera setups has been the shape-from-silhouette method, which consists of doing 3D reconstruction from silhouette images that results in an over-estimate of the volume occupied by the subject called a visual hull.

Human motion is governed by an underlying articulated skeletal structure, and (Moeslund et al., 2006) define pose estimation as the process of finding an approximate configuration of this structure in one or more frames. In a tracking framework the temporal relations between body parts during the motion sequence are ascertained. Many tracking based algorithms suffer from increasing errors over time, and recovering from situations where the track is lost can be problematic.

Goal: the overall goal of the research presented in this paper is to develop a robust, real-time pose estimation method.

Contribution: we present a pose estimation method based on constructing a tree structure from a skeletonised visual hull. The configuration of the skeletal structure is independently estimated in each frame, which overcomes limitations of tracking, and facilitates automatic initialisation and recovery from erroneous estimates. The method achieves real-time performance on a variety of different motion sequences.

This paper is organised as follows: in section 2 relevant work by other researchers is examined. The proposed pose estimation method is detailed in section 3. Results of experiments with the proposed method are presented in section 4, and the strengths and limitations of the approach are discussed in section 5. Section 6 concludes the paper.


2 RELATED WORK

The motion of the human body is governed by the skeleton, a rigid, articulated structure. Recovering this structure from image evidence is a common approach in computer vision based motion capture. Representing 3D objects by a skeletal approximation has many different applications. An overview of the properties, applications and algorithms for finding skeletal representations of 3D objects was given by (Cornea et al., 2007). There are several ways to define the skeletal representation of an object. The well-known medial axis transform (Blum, 1967) is used to thin objects in 2D, but when generalised to 3D it results in a medial surface. The skeleton of an object is defined as the locus of centre points of maximal inscribed balls. The medial axis and the skeleton are closely related, but not exactly the same. The curve-skeleton (Svensson et al., 2002) is an alternate representation without a rigorous definition, but generally it can be said to be a line-like 1D structure consisting of curves that match the topology of the 3D object.

There exists a plethora of algorithms for finding the curve-skeleton of an object, but (Cornea et al., 2007) divides them into four broad categories: thinning, geometrical, distance field, and potential field. A thorough discussion of the strengths and weaknesses of each type of approach is beyond the scope of this paper, but thinning offers a good compromise between accuracy and computational cost. A subset of the thinning category consists of fully parallel algorithms. These procedures evaluate each voxel of the 3D object independently, making them very well suited for implementation on modern graphics hardware.

Both (Brostow et al., 2004) and (Theobalt et al., 2004) sought to find the skeletal articulation of arbitrary moving subjects. Both approaches enforce temporal consistency for the entire structure during the motion sequence. Neither focused on human pose estimation specifically, however, and no inferences were made about which parts of the extracted skeletons correspond to limbs in a human skeletal structure.

A method for automatic initialisation based on homeomorphic alignment of a data skeleton with a weighted model tree was presented by (Raynal et al., 2010). The alignment was done by minimising the edit distance between the data and model trees. The method was intended to be used as an initialisation step for a pose estimation or tracking framework.

In the model-free motion capture method proposed by (Chu et al., 2003) volume sequences are transformed into a pose-invariant intrinsic space, removing pose-dependent nonlinearities. Principal curves in the intrinsic space are then projected back into Euclidean space to form a skeleton representation of the subject. This approach requires three passes through a pre-recorded volume sequence, hence it was not suited for real-time applications.

Fitting a kinematic model to the skeletal data is an approach taken by several researchers. The pose estimation framework described by (Moschini and Fusiello, 2009) used the hierarchical iterative closest point algorithm to fit a stick figure model to a set of data points on the skeleton curve. The method can recover from errors in matching, but requires manual initialisation. The approach presented by (Menier et al., 2006) uses Delaunay triangulation to extract a set of skeleton points from a closed surface visual hull representation. A generative skeletal model is fitted to the data skeleton points using maximum a posteriori estimation. The method is fairly robust, even for sequences with fast motion. A tracking and pose estimation framework where Laplacian Eigenmaps were used to segment voxel data and extract a skeletal structure consisting of spline curves was presented by (Sundaresan and Chellappa, 2008). Common to the three aforementioned approaches is that they do not achieve real-time performance.

A comparison of real-time pose estimation methods was presented by (Michoud et al., 2007). Their findings were that real-time initialisation was a feature lacking from other approaches. Michoud et al.'s own approach has automatic initialisation and estimates pose with a framerate of around 30 fps. It does, however, rely on finding skin-coloured blobs in the visual hull to identify the face and hands, and this places restrictions on the clothing of the subject and the start pose, as well as requiring the camera system to be colour calibrated.

The real-time tracking framework presented by (Caillette et al., 2008) used variable length Markov models. Basic motion patterns were extracted from training sequences and used to train the classifier. The method includes automatic initialisation and some degree of recovery from errors, but as with all training based approaches it is sensitive to over-fitting to the training data, and recognition is limited by the types of motion that were used during the training phase.

3 APPROACH

In this section, we will detail the steps in the proposed pose estimation method, starting with the input data, and ending up with a set of labelled trees representing the subject's pose during the motion sequence.


Table 1: Ratios of limb lengths in relation to the longest path in the skeleton from one hand to one foot.

Bone         Ratio     Bone          Ratio
Head         0.042     Upper spine   0.050
Upper neck   0.042     Lower spine   0.050
Lower neck   0.042     Hip           0.061
Shoulder     0.085     Thigh         0.172
Upper arm    0.124     Lower leg     0.184
Lower arm    0.121     Foot          0.050
Hand         0.029     Toe           0.025
Thorax       0.050

3.1 Anthropometric Measurements

In a fashion similar to (Chen and Chai, 2009) we build a model of the human skeleton by parsing the data in the CMU motion capture database (mocap.cs.cmu.edu). The CMU database contains 2605 motion capture sequences of 144 subjects. For each sequence the performer is described by an Acclaim skeleton file that contains estimated bone lengths for 30 bones. We parse all the skeleton files in the database and calculate the mean bone lengths for a subset of the bones.

The stature was used as the reference length when calculating anthropometric ratios by (Michoud et al., 2007), but as we wish to use these ratios before the tree is labelled, it is unknown which parts of the tree constitute the stature. Hence, we need an alternative reference length. We observe that the longest path in the ideal tree structure in figure 1 is from one hand to one foot (since the tree is symmetric it does not matter which we choose). Consequently, we use the longest path in the tree as the reference length; it corresponds to the sum of the lengths of the hand, lower and upper arm, thorax, lower and upper spine, hip, thigh, lower leg, foot, and toe. We calculate the ratios for all limbs in relation to this longest path. The limb length ratios are shown in table 1.

3.2 3D Reconstruction and Skeletonisation

The first step in a shape-from-silhouette based approach is to separate foreground (subject) from background (everything else) in the image data. Both the real and synthetic data used in this paper come with silhouette images already provided, so this step is not included in the method, nor in the processing times presented in section 4. There are, however, real-time background subtraction algorithms available that could be used. Any of the three methods examined by (Fauske et al., 2009) achieve real-time performance with reasonable segmentation results.

We assume that the camera system used is calibrated and synchronised. Experiments with multi-camera systems by (Starck et al., 2009) demonstrated that using eight or more cameras ensures good results from the 3D reconstruction phase.

Once the silhouettes have been extracted the calibration information from the cameras can be used to perform a 3D reconstruction. The silhouette images are cross sections of generalised cones with apexes in the focal points of the cameras. The visual hull (Laurentini, 1994) is created by intersecting the silhouette cones. Visual hulls can be represented by surfaces or volumes. We employ a simple algorithm that produces a volumetric visual hull. For all voxels in a regular grid we project that voxel's centre into each image plane and check if the projected point is inside or outside the silhouette. Voxels that have projected points inside all the silhouettes are kept and the rest are discarded. This procedure lacks the robustness associated with more advanced techniques, but its simplicity makes it attractive for implementation on graphics hardware.
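A minimal C++ sketch of this per-voxel test (the Camera layout and all names are our assumptions; the paper's implementation runs the same test in parallel on graphics hardware):

    #include <cstdint>
    #include <vector>

    // Calibrated-camera stand-in: a 3x4 projection matrix and a binary
    // silhouette mask.
    struct Camera {
        double P[3][4];             // world -> homogeneous image coordinates
        int w, h;
        std::vector<uint8_t> sil;   // row-major silhouette, w*h entries

        bool insideSilhouette(double X, double Y, double Z) const {
            double u = P[0][0]*X + P[0][1]*Y + P[0][2]*Z + P[0][3];
            double v = P[1][0]*X + P[1][1]*Y + P[1][2]*Z + P[1][3];
            double s = P[2][0]*X + P[2][1]*Y + P[2][2]*Z + P[2][3];
            if (s <= 0.0) return false;               // behind the camera
            int px = (int)(u / s), py = (int)(v / s);
            return px >= 0 && px < w && py >= 0 && py < h && sil[py * w + px];
        }
    };

    // Keep a voxel iff its centre projects inside every silhouette.
    std::vector<uint8_t> visualHull(const std::vector<Camera>& cams, int N,
                                    double voxelSize, const double origin[3]) {
        std::vector<uint8_t> vol(N * N * N, 0);
        for (int z = 0; z < N; ++z)
            for (int y = 0; y < N; ++y)
                for (int x = 0; x < N; ++x) {
                    double cX = origin[0] + (x + 0.5) * voxelSize;
                    double cY = origin[1] + (y + 0.5) * voxelSize;
                    double cZ = origin[2] + (z + 0.5) * voxelSize;
                    bool keep = true;
                    for (const Camera& c : cams)
                        keep = keep && c.insideSilhouette(cX, cY, cZ);
                    vol[x + N * (y + N * z)] = keep;
                }
        return vol;
    }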

A parallel thinning technique (Bertrand and Couprie, 2006) is used to skeletonise the visual hull. The algorithm is implemented on graphics hardware to achieve high throughput.

3.3 Pose Tree Construction

It is natural to represent the human body using a tree structure. The head, torso and limbs form a tree-like hierarchy. If the nodes in the tree are given positions in Euclidean space the tree describes a unique pose.

3.3.1 Main Algorithm

An overview of the method can be seen in algorithm 1. The first step is to create a tree structure from the skeleton voxels. A node is created for each voxel, and neighbouring nodes are connected with edges. Next, the extremities (hands, feet, and head) are identified in the tree by first pruning away erroneous branches, and then examining branch lengths. The third step consists of using the identified extremity nodes to segment the tree into body parts (arms, legs, torso, and neck). Finally, a vector pointing forward is estimated and used to label the hands and feet as left or right. Further details about each step of the method are given in the following sections.

3.3.2 Building the Tree

The skeleton voxel data is traversed in a breadth-first manner to build the pose tree. We place the root of the tree in the top-most voxel.


Algorithm 1 Main pose estimation algorithm
1: Build the pose tree from skeleton voxels.
2: Find the extremities (feet, hands, head).
3: Segment the tree into body parts.
4: Find a vector pointing forward, and identify left and right limbs.

The nodes are placed in a queue as they are created. When the first node in the queue is removed, the neighbours of that node's corresponding voxel are checked and new child nodes are added to the tree if those neighbours have not been visited before. This is repeated until the queue is empty. At this point all voxels connected to the top-most voxel will have been processed and given corresponding nodes in the tree.

Algorithm 2 A breadth-first pose tree construction algorithm.
 1: Create a node for the top-most voxel and add it to the node queue.
 2: while node queue is not empty do
 3:   N ← first node in queue.
 4:   V ← voxel corresponding to N.
 5:   for all neighbours of V do
 6:     if neighbour has not been visited then
 7:       Create a new child node of N and add it to the queue.
 8:       Label neighbour as visited.
 9:     end if
10:   end for
11: end while
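A C++ sketch of this construction (the Node layout and all names are our assumptions):

    #include <array>
    #include <queue>
    #include <set>
    #include <vector>

    using Voxel = std::array<int, 3>;

    struct Node {
        Voxel pos;
        int parent;                 // index into the node vector; -1 at the root
        std::vector<int> children;
    };

    // Breadth-first pose tree construction: start from the top-most skeleton
    // voxel and attach each unvisited 26-neighbour as a child node.
    std::vector<Node> buildPoseTree(const std::set<Voxel>& skeleton, Voxel root) {
        std::vector<Node> tree{{root, -1, {}}};
        std::set<Voxel> visited{root};
        std::queue<int> pending;
        pending.push(0);
        while (!pending.empty()) {
            int n = pending.front();
            pending.pop();
            Voxel v = tree[n].pos;
            for (int dz = -1; dz <= 1; ++dz)        // scan the 26-neighbourhood
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        if (!dx && !dy && !dz) continue;
                        Voxel w{v[0] + dx, v[1] + dy, v[2] + dz};
                        if (skeleton.count(w) && !visited.count(w)) {
                            visited.insert(w);
                            tree.push_back({w, n, {}});
                            tree[n].children.push_back((int)tree.size() - 1);
                            pending.push((int)tree.size() - 1);
                        }
                    }
        }
        return tree;
    }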

3.3.3 Finding Extremities

The next step is to identify the extremities among the leaf nodes in the tree. The input skeleton voxels typically contain some noise, which in turn leads to spurious branches in the tree. Hence, we need to prune the tree to reduce the number of leaf nodes. This is done in two steps. First, a recursive procedure that removes branches shorter than a threshold based on the anthropometric ratios is employed. Calculating the length of an arm with the anthropometric ratios from section 3.1 and using that as the threshold has been found to produce good results. Of the remaining leaf nodes the feet should be at the end of the two longest branches, and the hands should be at the end of the next two. Hence, the list of leaf nodes is sorted by branch length and all but the longest four are removed. In order to keep all information in the original tree intact, the two pruning steps are performed on a copy.

The procedure outlined above is robust as long as the top-most voxel is at the location of the head. The head branch is shorter than the length of an arm, and if one of the other limbs is higher than the head, the head branch is likely to be removed during pruning. We solve this problem by finding what we define as the origin node.

Figure 1: An ideal pose tree. The black node is the origin with degree four, the dark grey node is the root, the light grey node is the internal node of degree three, and the white nodes are leaf nodes. All other nodes are internal nodes of degree two.

Let us consider an ideal pose tree. An ideal pose tree consists of a root node, four leaf nodes, one internal node of degree three, one internal node of degree four, and a number of internal nodes of degree two, as shown in figure 1. We define the origin node as the node of degree four where the arms, torso, and neck are joined.

All nodes in the pruned tree of degree higher than two are candidates for the origin node. In order to choose the best candidate we create copies of the pruned tree and move the root to each of the candidate locations. A rank is calculated for each of the candidate trees, and the root of the highest ranked tree is chosen as the origin node.

The leaf nodes in the candidate tree are sorted by distance to the root. The furthest two are used as feet, the next two as hands, and the last as head. Ideal lengths for the legs, arms, neck, and torso (I_l, I_a, I_n, I_t) are estimated using the anthropometric ratios and the longest path in the tree. We calculate the rank of a candidate using the following formula:

r = r_{l1} \cdot r_{l2} \cdot r_{a1} \cdot r_{a2} \cdot r_n \cdot e^{-(\deg(root)-4)^2/3}    (1)

The last term is used to penalise candidate nodes with degree ≠ 4. The ranks for each limb are given by the following formulae:

r_l = \frac{\min(I_l, d_r)}{\max(I_l, d_r)} \cdot \frac{\min(I_t, d_r - d_b)}{\max(I_t, d_r - d_b)}    (2)


Algorithm 3 Finding extremities in the pose tree.
 1: Create a copy of the tree.
 2: Prune the copy by recursively removing branches that are shorter than an arm.
 3: Examine the remaining leaf nodes, and remove all but the four belonging to the longest branches (feet and hands).
 4: Create copies of the pruned tree with candidates for the origin node position.
 5: Rank the candidate trees and move the root of the pruned copy to the location of the root of the candidate tree with the highest rank.
 6: if no valid labelling from previous frame then
 7:   Sort leaf nodes by distance to root.
 8:   Label the furthest two as feet, the next two as hands, and the final as head.
 9: else
10:   Label the leaf nodes by finding the nearest correspondences from the previous frame.
11: end if

Figure 2: Three candidates for the origin node of a tree sorted by rank from left to right. Their ranks are 8.85 × 10^-4, 1.45 × 10^-4, and 3.40 × 10^-5, respectively.

r_a = \frac{\min(I_a, d_r)}{\max(I_a, d_r)} \cdot \frac{1}{1 + (d_r - d_b)}    (3)

r_n = \frac{\min(I_n, d_r)}{\max(I_n, d_r)} \cdot \frac{1}{1 + (d_r - d_b)}    (4)

where d_r is the distance from the leaf node under consideration to the root of the candidate tree, and d_b is the distance from the leaf node to the closest branching point. The second terms of equations 3 and 4 are used to penalise candidates where one arm has been shortened. An example of three candidate trees and their ranks can be seen in figure 2.
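For concreteness, the ranking can be written as a small C++ function (symbols mirror equations 1–4; the struct and function names are ours):

    #include <algorithm>
    #include <cmath>

    // Distances for one labelled leaf: dr to the candidate root, db to the
    // closest branching point.
    struct LeafDist { double dr, db; };

    static double ratio(double ideal, double d) {
        return std::min(ideal, d) / std::max(ideal, d);
    }

    // Rank of a candidate origin tree; Il, Ia, In, It are the ideal leg, arm,
    // neck, and torso lengths from the anthropometric ratios.
    double candidateRank(LeafDist leg1, LeafDist leg2,
                         LeafDist arm1, LeafDist arm2, LeafDist neck,
                         double Il, double Ia, double In, double It,
                         int rootDegree) {
        auto rl = [&](LeafDist l) {                                    // eq. (2)
            return ratio(Il, l.dr) * ratio(It, l.dr - l.db);
        };
        auto ra = [&](LeafDist l) {                                    // eq. (3)
            return ratio(Ia, l.dr) / (1.0 + (l.dr - l.db));
        };
        double rn = ratio(In, neck.dr) / (1.0 + (neck.dr - neck.db));  // eq. (4)
        double penalty = std::exp(-std::pow(rootDegree - 4, 2) / 3.0); // eq. (1)
        return rl(leg1) * rl(leg2) * ra(arm1) * ra(arm2) * rn * penalty;
    }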

The root of the pruned copy of the pose tree is moved to the location of the highest ranked origin candidate. If this is the first frame or no valid labelling is available from the previous frame, the leaf nodes are sorted by the distance to the root. Finally, the nodes in the original pose tree corresponding to the remaining leaf nodes in the copy are labelled as extremities. The two furthest from the root are labelled as feet, the next two as hands, and the last one as the head.

If a valid labelling from the previous frame exists, we use temporal correspondences to label the extremities instead. For each of the labelled extremities in the previous frame, the Euclidean distance to each of the remaining candidates in the current frame is calculated, and the one with the shortest distance is chosen. The chosen candidate is labelled correspondingly and removed from the list of candidates.

If the head node is not the root of the tree, the root is moved to that node. A lower limit on the origin node rank is used to determine the validity of the labelling. This threshold is set empirically, and if none of the candidate trees has a rank above the threshold the labelling is not considered valid, and will not be used for finding temporal correspondences in the next frame.

3.3.4 Segmentation Into Body Parts

Next, we segment the tree into body parts. The identified extremities are used as starting points for the segmentation. Starting in the first hand node, the tree is labelled as Arm1 upwards until the root is reached. This is repeated for the other hand, and the tree is labelled as Arm2 upwards until the first node labelled Arm1 is encountered. Labelling continues upwards, and all nodes are labelled as Neck until the root is reached. A similar approach is used for the lower body. Starting in the first foot, the nodes are labelled as Leg1 upwards in the tree until the label Neck is reached. From the second foot, the nodes are labelled as Leg2 upwards in the tree until the label Leg1 is encountered. Finally, continuing upwards, all nodes are labelled as Torso until the label Neck is encountered. If no nodes are labelled as Torso, the tree is marked as invalid. This procedure is formalised in algorithm 4.

Algorithm 4 Tree segmentation (arms only; the legs, neck, and torso are identified in a similar manner).
 1: N ← the leaf node labelled Hand1.
 2: while N is not the root do
 3:   Label N as Arm1.
 4:   N ← N's parent.
 5: end while
 6: N ← the leaf node labelled Hand2.
 7: while N is not labelled Arm1 do
 8:   Label N as Arm2.
 9:   N ← N's parent.
10: end while
11: {Repeat for legs, neck and torso.}


3.3.5 Identifying left and right

The final step of the proposed method is to find the left and right sides of the body. We use the labelled extremities as the starting point. Using the anthropometric ratios from section 3.1 the algorithm creates sets of nodes corresponding to the feet, lower legs, thighs, and torso by traversing the tree upwards from the labelled leaf nodes. Total least squares line fitting is used to find vectors representing the node sets, while making sure all vectors point upwards in the tree.

Taking the cross products of the vectors, normal vectors for the ankle and knee joints are calculated. In figure 3(a) we observe that the angles a and b can never exceed 180°, so ordering the vectors in the cross product consistently ensures that the normals will be oriented towards the left. For each leg the two joint normals are combined to form a normal for the leg, and a forward pointing vector for each leg is created by taking the cross product of the normals and the torso vector. The two forward vectors are combined to form a single vector pointing forward in the torso's frame of reference. This procedure is formalised in algorithm 5. Because of noise in the data it is possible that the foot can degenerate during the skeletonisation and tree construction phases. In order to avoid problems with the node sets used for the curve fitting, a threshold is set on the difference in length between the legs. If one leg is shorter by more than two times the length of a foot, the foot vector and consequently the ankle joint is disregarded in the calculation of the normals.

To label the hands as left and right, total least squares line fitting is used to find vectors representing the shoulders, both pointing away from the torso. A vector pointing left is calculated using the cross product of the torso vector with the forward vector. The angles between the shoulder vectors and the left vector are calculated, and the hand corresponding to the smallest angle is labelled as left. If both shoulder vectors are pointing in the same or opposite direction of the left vector, the largest and smallest angle with the forward vector, respectively, are used to label the left hand. The procedure is repeated for the feet using vectors representing the hips. The left-right labelling is formalised in algorithm 6.

4 RESULTS

A number of experiments have been conducted using the proposed method. Both real and synthetic data were used.

Algorithm 5 Finding forward vector
 1: Using anthropometric ratios, create sets of nodes corresponding to the foot, lower leg, thigh, and torso.
 2: Using orthogonal distance regression, fit lines to the sets of nodes, and make sure the resulting vectors v_foot, v_lleg, v_thigh, v_torso point upwards in the tree.
 3: if angle(v_foot, v_lleg) > threshold then
 4:   Construct a normal for the ankle joint n_ankle = v_foot × v_lleg.
 5: end if
 6: if angle(v_lleg, v_thigh) > threshold then
 7:   Construct a normal for the knee joint n_knee = v_lleg × −v_thigh.
 8: end if
 9: Combine the joint normals n_1 = (n_ankle + n_knee) / |n_ankle + n_knee|.
10: Repeat steps 3 to 9 for the other leg, to get a second normal n_2.
11: Construct two forward vectors using the normals: f_1 = n_1 × v_torso.
12: Combine the two forward vectors: f = (f_1 + f_2) / |f_1 + f_2|.

Figure 3: Vectors used for left-right labelling. (a) Joint angles and normals (v_foot, v_lleg, v_thigh, v_torso and the cross products v_foot × v_lleg, v_lleg × −v_thigh). (b) Example using real data.

Reconstructions were done at resolutions of 64×64×64 and 128×128×128 voxels. The sizes of a voxel were 31.3 mm³ at 64×64×64, and 15.6 mm³ at 128×128×128. All experiments were conducted on a computer with a 2.67 GHz Intel Core 2 Quad processor, 12 GB RAM, and two graphics cards: one Nvidia GeForce GTX 295 and one Nvidia GeForce GTX 560 Ti. The GTX 295 was used for visual hull construction and the GTX 560 for skeletonisation.


Algorithm 6 Identifying left and right limbs
 1: For each arm, create a set of n nodes representing the shoulder.
 2: Construct a vector pointing left: v_left = v_torso × f.
 3: Using orthogonal distance regression, fit lines to the sets of nodes, and make sure the resulting vectors v_1 and v_2 point away from the torso.
 4: if v_left · v_1 > 0 and v_left · v_2 > 0 then
 5:   left = argmax(angle(f, v_1), angle(f, v_2))
 6:   right = argmin(angle(f, v_1), angle(f, v_2))
 7: else if v_left · v_1 < 0 and v_left · v_2 < 0 then
 8:   left = argmin(angle(f, v_1), angle(f, v_2))
 9:   right = argmax(angle(f, v_1), angle(f, v_2))
10: else
11:   left = argmin(angle(v_left, v_1), angle(v_left, v_2))
12:   right = argmax(angle(v_left, v_1), angle(v_left, v_2))
13: end if
14: For each leg, create a set of n nodes representing the hip, and repeat steps 3 to 12.

The accuracy of the method at different volume resolutions was evaluated using synthetic data. An avatar was animated using motion capture data from the CMU dataset. Silhouette images of the sequence were rendered with eight virtual cameras at a resolution of 800×600, seven placed around the avatar and one in the ceiling pointing down. The sequence that was used was a whirling dance motion (subject 55, trial 1). The known joint locations from the motion capture data were used as a basis for comparison. The results of using the proposed pose estimation method on the synthetic data can be seen in figure 4. The mean positional error of the hands and feet for the entire sequence was 46.8 mm (standard deviation 16.6 mm) and 53.2 mm (standard deviation 15.9 mm) for 128×128×128 and 64×64×64 volume resolutions, respectively. There is a noticeable increase in accuracy with a higher resolution, but not by a huge margin. An interesting observation is that the number of frames where the method fails (curve reaches 0) increases with higher resolution. The sequence consists of 451 frames, and 97.6% and 95.8% are labelled correctly at 64×64×64 and 128×128×128 resolutions, respectively.

The robustness and computational cost of the method were evaluated using real data drawn from the i3DPost dataset (Gkalelis et al., 2009). This is a publicly available multi-view video dataset of a range of different motions captured in a studio environment.

Figure 4: Comparison of mean errors (mean pos. error in mm per frame) of estimated positions of the hands and feet for a synthetic dance sequence. (a) Volume resolution 64×64×64. (b) Volume resolution 128×128×128. Frames where the algorithm has failed gracefully have been omitted.

The volume resolution greatly influences the processing time of the method. Three sequences of varying complexity were tested at both 64×64×64 and 128×128×128 resolutions, and the results can be seen in table 2. The method achieves near real-time performance of ~15 fps at the highest resolution, but by halving the dimensions of the volume the framerate is almost tripled. A framerate of ~64 fps should be sufficient for most real-time applications. In both cases the tree construction is highly efficient: <1.5 ms for 64×64×64 and <4 ms for 128×128×128. At the higher resolution the skeletonisation is the main computational bottleneck. The reason for the small difference in processing time for the visual hull construction is that the images from the i3DPost dataset have a resolution of 1920×1080 and copying the data between main memory and the GPU is the bottleneck. Figure 5 shows four frames from the three sequences at both resolutions.

Sequences with challenging motions were used to test the robustness. Poses where the subject's limbs are close to the body typically result in a visual hull that is a poorer approximation of the actual volume. A sequence where the subject is crouching during the motion illustrates this. As can be seen in figure 6 the method fails gracefully when the subject is crouching, but recovers once the limbs are spread out once more. A pirouette is a challenging motion, because of the rapid movement of the extremities. Figure 7 shows that the temporal correspondence labelling can fail in such cases, but the identification of the left and right limbs is still robust.

For the real data, the walk (57 frames) and walk-spin (46 frames) sequences are labelled 100% correctly.


Table 2: Comparison of mean processing times for three sequences from the i3DPost dataset, using different resolutions. All times are in milliseconds, with standard deviations in parentheses.

(a) Volume resolution 64×64×64.

Sequence                        Visual hull   Skeletonisation   Build tree    Sum            Framerate
Walk, sequence 013              9.61 (0.26)   4.37 (0.28)       1.44 (0.25)   15.41 (0.47)   64.88
Walk-spin, sequence 015         9.63 (0.26)   4.42 (0.32)       1.41 (0.30)   15.46 (0.54)   64.67
Run-crouch-jump, sequence 017   9.69 (0.26)   4.82 (0.59)       1.07 (0.27)   15.58 (0.61)   64.18

(b) Volume resolution 128×128×128.

Sequence                        Visual hull    Skeletonisation   Build tree    Sum            Framerate
Walk, sequence 013              12.25 (0.34)   49.37 (2.63)      3.09 (0.33)   64.71 (2.61)   15.45
Walk-spin, sequence 015         12.28 (0.30)   52.06 (4.14)      3.13 (0.46)   67.46 (4.03)   14.82
Run-crouch-jump, sequence 017   11.94 (0.16)   51.43 (3.73)      3.51 (0.58)   66.88 (3.89)   14.95

Figure 5: Four frames from each of three sequences, showing the visual hulls, skeletons and pose trees. The top rows have a volume resolution of 64×64×64, and 128×128×128 in the bottom rows. (a) Walk, sequence 013. (b) Walk-spin, sequence 015. (c) Run-crouch-jump, sequence 017.


Figure 6: Five frames from the run-crouch-jump sequence (017 from the i3DPost dataset) illustrating the problems that occur when the subject is crouching and no reliable skeleton can be extracted, but also that the algorithm recovers when the arms and legs become distinguishable again.

Figure 7: Two consecutive frames of a ballet sequence (023 of the i3DPost dataset) where the correspondence labelling has failed and the arm colours have switched sides. The left-right labelling, however, is still correct, as illustrated by the red and green extremities.

Only 42.6% of the run-crouch-jump (108 frames) sequence is labelled correctly, but the subject is crouching during half the sequence. The sequences were truncated to keep the subject completely inside the viewing volume. The ballet sequence consists of 150 frames and 87.3% of them were labelled correctly.

5 DISCUSSION AND FUTURE WORK

In the previous section we demonstrated the proposed method on several sequences where it robustly recovers a segmented skeletal structure. There are, however, some limitations to using a skeletonisation based approach, and we will discuss them here, along with how these issues can be resolved in the future.

Cases can be found that are likely to be problematic for a skeletonisation based method.

Figure 8: Two frames of a walk sequence (013 from the i3DPost dataset) demonstrating the possible displacement of the shoulders and the pelvis. The proposed method compensates for this, and the limbs are still labelled correctly.

The extracted skeleton is unreliable in frames where the limbs are not clearly separated from the rest of the body. However, it is possible to detect these cases and give an indication that the pose estimate for that particular frame is not trustworthy. A significant advantage of the proposed approach is that the skeleton in each frame can be labelled independently of previous frames. As was demonstrated in section 4, this allows the approach to recover from errors in previous frames where the skeletal reconstruction may be degenerate.

In figure 8 we see a common problem with using the curve-skeleton. In frames where the arms are close to the torso or the legs are too close to each other, the shoulders and pelvis tend to be displaced downwards. This is not a major issue for the proposed algorithm, but it is important to keep in mind if more joint locations should be extracted in the future.

The lower limit on the rank of candidate trees we use to determine a labelling's validity is heuristic, and a better alternative should be found. Though the empirically set threshold works for the sequences we have tested with, there are no guarantees that it will do so for other data.

Currently, only the locations of the hands, feet, and the head are estimated. We intend to extend the method with estimates for the positions of internal joints as well. Creating a kinematic model using the limb length ratios in section 3.1 and fitting that to the skeleton data is an approach that will be examined further.

In order to better compare the proposed approach to other methods that attempt to solve the pose estimation problem, the method should be tested on more publicly available datasets, for instance HumanEva or INRIA IXMAS.


6 CONCLUDING REMARKS

We have presented a novel pose estimation method based on constructing tree structures from skeletonised sequences of visual hulls. The trees are pruned, segmented into body parts, and the extremities are identified. This is intended to be a real-time approach for pose estimation; the results for the pose tree computation back this up and demonstrate good labellings across multiple sequences with complex motion. The approach can recover from errors or degeneracies in the initial volume/skeletal reconstruction, which overcomes inherent limitations of many tracking approaches which cannot re-initialise. Ground-truth evaluation on synthetic data indicates correct extremity labelling in ~95% of frames with RMS errors <5 cm.

ACKNOWLEDGEMENTS

The authors wish to thank Lars M. Eliassen for helping with the implementation of the skeletonisation algorithm, and Odd Erik Gundersen for his helpful comments during the writing of the paper. Some of the data used in this project was obtained from mocap.cs.cmu.edu. The CMU database was created with funding from NSF EIA-0196217.

REFERENCES

Bertrand, G. and Couprie, M. (2006). A new 3D parallel thinning scheme based on critical kernels. In Discrete Geometry for Computer Imagery, pages 580–591. Springer.

Blum, H. (1967). A transformation for extracting new descriptors of shape. Models for the perception of speech and visual form, 19(5):362–380.

Brostow, G. J., Essa, I., Steedly, D., and Kwatra, V. (2004). Novel skeletal representation for articulated creatures. Computer Vision - ECCV (LNCS), 3023:66–78.

Caillette, F., Galata, A., and Howard, T. (2008). Real-time 3-D human body tracking using learnt models of behaviour. Computer Vision and Image Understanding, 109(2):112–125.

Chen, Y.-l. and Chai, J. (2009). 3D Reconstruction of Human Motion and Skeleton from Uncalibrated Monocular Video. Computer Vision - ACCV (LNCS), 5994:71–82.

Chu, C.-W., Jenkins, O. C., and Mataric, M. J. (2003). Markerless Kinematic Model and Motion Capture from Volume Sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 475–482.

Cornea, N. D., Silver, D., and Min, P. (2007). Curve-Skeleton Properties, Applications, and Algorithms. IEEE Transactions on Visualization and Computer Graphics, 13(3):530–548.

Fauske, E., Eliassen, L. M., and Bakken, R. H. (2009). A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA. In Proceedings of the First Norwegian Artificial Intelligence Symposium, pages 181–192.

Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., and Pitas, I. (2009). The i3DPost multi-view and 3D human action/interaction database. In Proceedings of the Conference for Visual Media Production, pages 159–168.

Laurentini, A. (1994). The Visual Hull Concept for Silhouette-Based Image Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150–162.

Menier, C., Boyer, E., and Raffin, B. (2006). 3D Skeleton-Based Body Pose Recovery. In Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, pages 389–396.

Michoud, B., Guillou, E., and Bouakaz, S. (2007). Real-time and markerless 3D human motion capture using multiple views. Human Motion - Understanding, Modeling, Capture and Animation (LNCS), 4814:88–103.

Moeslund, T. B., Hilton, A., and Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104:90–126.

Moschini, D. and Fusiello, A. (2009). Tracking Human Motion with Multiple Cameras Using an Articulated Model. Computer Vision/Computer Graphics Collaboration Techniques (LNCS), 5496:1–12.

Poppe, R. (2007). Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1-2):4–18.

Raynal, B., Couprie, M., and Nozick, V. (2010). Generic Initialization for Motion Capture from 3D Shape. Image Analysis and Recognition (LNCS), 6111:306–315.

Starck, J., Maki, A., Nobuhara, S., Hilton, A., and Matsuyama, T. (2009). The Multiple-Camera 3-D Production Studio. IEEE Transactions on Circuits and Systems for Video Technology, 19(6):856–869.

Sundaresan, A. and Chellappa, R. (2008). Model-driven segmentation of articulating humans in Laplacian Eigenspace. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1771–1785.

Svensson, S., Nyström, I., and Sanniti di Baja, G. (2002). Curve skeletonization of surface-like objects in 3D images guided by voxel classification. Pattern Recognition Letters, 23:1419–1426.

Theobalt, C., de Aguiar, E., Magnor, M. A., Theisel, H., and Seidel, H.-P. (2004). Marker-free kinematic skeleton estimation from sequences of volume data. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST '04), page 57.


Paper 6
Model fitting to skeletal data

Authors: Rune H. Bakken and Adrian Hilton.

Full title: Real-time Pose Estimation Using Constrained Dynamics.

Published in: Proceedings of the VII Conference on Articulated Motion and Deformable Objects (AMDO 2012), F. J. Perales ed., 2012, Springer.

Copyright: © 2012 Springer-Verlag Berlin Heidelberg.


Real-time Pose Estimation Using Constrained Dynamics

Rune Havnung Bakken1 and Adrian Hilton2

1 Faculty of Informatics and e-Learning, Sør-Trøndelag University College, Trondheim, Norway

[email protected]

2 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

[email protected]

Abstract. Pose estimation in the context of human motion analysis is the process of approximating the body configuration in each frame of a motion sequence. We propose a novel pose estimation method based on fitting a skeletal model to tree structures built from skeletonised visual hulls reconstructed from multi-view video. The pose is estimated independently in each frame, hence the method can recover from errors in previous frames, which overcomes some problems of tracking. Publicly available datasets were used to evaluate the method. On real data the method performs at a framerate of ~14 fps. Using synthetic data the positions of the joints were determined with a mean error of ~6 cm.

Keywords: Pose estimation, real-time, model fitting

1 Introduction

Human motion capture is the process of registering human movement over a period of time. Accurately capturing human motion is a complex task with many possible applications, like automatic surveillance, input for animation in the entertainment industry and biomechanical analysis.

In some application areas it is important that the data acquisition is unconstrained by the markers or wearable sensors traditionally used in commercial motion capture systems. Furthermore, there is a need for low latency and real-time performance in some applications, for instance in perceptive user interfaces and gait recognition.

Computer vision-based motion capture is a highly active field of research, as recent surveys by Moeslund et al. [11] and Poppe [13] show. Within the computer vision community the shape-from-silhouette approach has been popular for multi-camera setups. The visual hull is an overestimate of the volume occupied by the subject, and is reconstructed from silhouette images covering the scene from different viewpoints.

Moeslund et al. [11] define pose estimation as the process of approximating the configuration of the underlying skeletal structure that governs human motion in one or more frames. The curve-skeleton is a simplified 1D representation of a 3D object, and skeletonisation yields an approximation of the curve-skeleton. By skeletonising the visual hull and fitting a kinematic model to the resulting skeleton the pose of the subject can be estimated.

In a tracking framework temporal correspondences between body parts from one frame to the next are found, and information from previous frames can be used to predict configurations in future frames. A problem with many tracking approaches is that they can get stuck in invalid configurations and not automatically recover.

Goal: The overall goal of the research presented in this paper is to develop a robust, real-time pose estimation method.

Contribution: We present a pose estimation method based on fitting a skeletal model to skeletonised voxel data. The joint configuration of the model is independently estimated in each frame, which overcomes limitations of tracking, and facilitates automatic initialisation and recovery from erroneous estimates. The method achieves real-time performance on a variety of motion sequences.

This paper is organised as follows: in Sect. 2 relevant work by other researchers is examined. The proposed pose estimation method is detailed in Sect. 3. Results of experiments with the proposed method are presented in Sect. 4, and the strengths and limitations of the approach are discussed in Sect. 5. Sect. 6 concludes the paper.

2 Related Work

The motion of the human body is governed by the skeleton, a rigid, articulated structure. Recovering this structure from image evidence is a common approach in computer vision-based motion capture, and fitting a kinematic model to skeletal data is an approach taken by several researchers.

The pose estimation framework described by Moschini and Fusiello [12] used the hierarchical iterative closest point algorithm to fit a stick figure model to a set of data points on the skeleton curve. The method can recover from errors in matching, but requires manual initialisation. The approach presented by Menier et al. [9] uses Delaunay triangulation to extract a set of skeleton points from a closed surface visual hull representation. A generative skeletal model is fitted to the data skeleton points using maximum a posteriori estimation. The method is fairly robust, even for sequences with fast motion. A tracking and pose estimation framework where Laplacian Eigenmaps were used to segment voxel data and extract a skeletal structure consisting of spline curves was presented by Sundaresan and Chellappa [15]. These approaches do not achieve real-time performance.

A comparison of real-time pose estimation methods was presented by Michoud et al. [10]. Their findings were that real-time initialisation was a feature lacking from other approaches. Michoud et al.'s own approach has automatic initialisation and pose estimation with a framerate of around 30 fps. Their approach relies on finding skin-coloured blobs in the visual hull to identify the face and hands. This places restrictions on the clothing of the subject and the start pose, as well as requiring the camera system to be colour calibrated.

The real-time tracking framework presented by Caillette et al. [4] used variable length Markov models. Basic motion patterns were extracted from training sequences and used to train the classifier. The method includes automatic initialisation and some degree of recovery from errors, but as with all training-based approaches it is sensitive to over-fitting to the training data, and recognition is limited by the types of motion that were used during the training phase.

Straka et al. [14] recently presented a real-time skeletal graph-based pose estimation approach. Using distance matrices they identified the extremities, and used an automated skinning procedure to fit a skeletal model to the graph. This method will be discussed further in Sect. 5.

3 Approach

In this section, we will detail the steps in the proposed pose estimation method, starting with the input data, and ending up with a skeletal model with a joint configuration representing the subject's pose during the motion sequence.

3.1 Data Acquisition

It is natural to represent the human body using a tree structure. The head, torso and limbs form a tree-like hierarchy. If the nodes in the tree are given positions in Euclidean space the tree describes a unique pose. We employ the data acquisition procedure described in [2]. A volumetric visual hull is skeletonised, and a tree structure is built from the skeleton voxels. Next, the extremities (hands, feet, and head) are identified, and the tree segmented into body parts. Finally, a vector pointing forward is estimated and used to label the hands and feet as left or right. An example labelled data tree is shown in Fig. 1a.

3.2 Skeletal Model

In a fashion similar to [5] we build a model of the human skeleton by parsing the data in the CMU motion capture database (mocap.cs.cmu.edu). The CMU database contains 2605 motion capture sequences of 144 subjects. All the skeleton files in the database are parsed, and the mean lengths for a subset of the bones are calculated. The skeletal model consists of a bone hierarchy where the length of each bone is calculated by multiplying the bone ratio with an estimate of the stature of the subject. The skeletal model and bone ratios are shown in Fig. 1b and 1c, respectively.

During model fitting the length of each bone is computed by multiplying the ratios with an estimate of the subject's stature. The geodesic distance from toe tip to the top of the head would be an overestimate of the stature. Instead the tree is traversed upwards, and node groups corresponding to each bone are created. The stature is calculated by summing up the Euclidean distances between the first and last nodes in each node group.
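To make this concrete, the stature and bone length computation could be sketched in Python as follows; the node_groups structure and the BONE_RATIOS subset are illustrative assumptions, not the actual implementation:

import numpy as np

def estimate_stature(node_groups):
    # node_groups: list of (n_i, 3) arrays of node positions, each ordered
    # along one bone from its first to its last node (assumed structure).
    # Stature is the sum of the straight-line extents of the groups, which
    # avoids the overestimate a geodesic toe-to-head distance would give.
    return sum(np.linalg.norm(group[-1] - group[0]) for group in node_groups)

# Bone lengths then follow from the ratios in Fig. 1c, for example:
BONE_RATIOS = {"thigh": 0.172, "lower_leg": 0.184, "foot": 0.050}  # subset

def bone_length(bone, stature):
    return BONE_RATIOS[bone] * stature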


Bone         Ratio     Bone          Ratio
Head         0.042     Upper spine   0.050
Upper neck   0.042     Lower spine   0.050
Lower neck   0.042     Hip           0.061
Shoulder     0.085     Thigh         0.172
Upper arm    0.124     Lower leg     0.184
Lower arm    0.121     Foot          0.050
Hand         0.029     Toe           0.025
Thorax       0.050

Fig. 1: (a) Labelled data tree. (b) Skeletal model. (c) Ratios between limb lengths and estimated stature.

3.3 Model Fitting

As discussed in Sect. 2, fitting a model to skeletal data is not a completely novel idea; however, it is in general not achieved in real time. An optimisation approach that minimises some cost function between the model and the data is typically too computationally expensive.

A possible alternative to the optimisation approach is to use an inverse kinematics (IK) type method to align the model with the data. The end effectors could be placed at the positions of the extremities in the data tree, and the intermediate joints placed as close to their respective branches as possible. Commonly used IK techniques [3], however, offer poor control of intermediate joints, and are difficult to adapt for this purpose. Tang et al. [16] demonstrated that constrained dynamics is a viable alternative to IK for handling intermediate joints in motion capture data.

The SHAKE constraint algorithm. Constraints on bodies that obey the Newtonian laws of motion can be satisfied using a constraint algorithm. One possible approach is to introduce explicit constraint forces and minimise them using Lagrange multipliers. A stiff bond between two bodies i and j is called a holonomic constraint and has the form:

σ(t) = |r_ij(t)|² − l²_ij = 0    (1)

where r_ij is the vector between the two bodies and l is the desired distance between them.

Lagrange-type constraint forces are assumed to act along the inter-body vectors from the previous time step, r^k_ij, and are used to update the position r″_i at time k + 1 after unconstrained motion has displaced a body to r′_i. The constraint satisfaction procedure can be seen in Alg. 1. For more details on the SHAKE algorithm, see Kastenmeier and Vesely [8] and Tang et al. [16].


Algorithm 1 SHAKE

1: while σ_c > ε_local and (1/n) Σ_{c=1}^{n} σ_c > ε_global and k < max_iter do
2:   for each constraint c do
3:     Compute σ′_c ≡ σ_c(r′_ij) ≡ |r′_ij|² − l²_ij.
4:     Compute the Lagrange factor a from the requirement σ″_c = 0:
         a = σ′_c μ / (2 r^k_ij · r′_ij), where 1/μ ≡ 1/m_i + 1/m_j.
5:     Adjust the positions of i and j using:
         r″_i = r′_i + (a/m_i) r^k_ij
         r″_j = r′_j − (a/m_j) r^k_ij
6:   end for
7: end while

Tang et al. [16] demonstrated that SHAKE requires few iterations to converge, and the computational complexity is O(n), where n is the number of constraints. We propose to extend the SHAKE method with constraints that keep the intermediate joints close to their corresponding positions in the data.
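As an illustration, one sweep of SHAKE over the distance constraints could be sketched in Python as below. The data structures are ours, the local/global convergence test is simplified to a single tolerance, and the correction signs follow the usual SHAKE convention in which a positive violation shortens the bond:

import numpy as np

def shake_sweep(pos, prev_pos, constraints, mass, eps=1e-6, max_iter=50):
    # pos        : dict joint -> r' (position after unconstrained motion)
    # prev_pos   : dict joint -> r^k (position at the previous time step)
    # constraints: list of (i, j, l) distance constraints with target length l
    # mass       : dict joint -> mass
    for _ in range(max_iter):
        worst = 0.0
        for i, j, l in constraints:
            r_new = pos[i] - pos[j]            # r'_ij
            r_old = prev_pos[i] - prev_pos[j]  # r^k_ij
            sigma = r_new.dot(r_new) - l * l   # constraint violation sigma'_c
            mu = 1.0 / (1.0 / mass[i] + 1.0 / mass[j])
            a = sigma * mu / (2.0 * r_old.dot(r_new))
            # Move the pair together/apart along r^k_ij, weighted by mass.
            pos[i] = pos[i] - (a / mass[i]) * r_old
            pos[j] = pos[j] + (a / mass[j]) * r_old
            worst = max(worst, abs(sigma))
        if worst < eps:  # simplified convergence test
            break
    return pos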

Model fitting algorithm. Distance constraints are created for each pair of neighbouring joints in the model, and their lengths are calculated using the ratios in Fig. 1c. Additionally, constraints are needed to place each joint at its approximate location in the data.

Fitting a straight line to a set of data points with orthogonal distance regression (ODR) is a computationally inexpensive operation. Constraining a joint to lie on a line fitted to the data points around the joint's location in the data is a reasonable approximation to the optimal fit of the joint.
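A common way to compute such an ODR line in 3D is via the singular value decomposition: the line passes through the centroid of the points along their first principal direction. A minimal sketch (not the paper's actual code):

import numpy as np

def fit_line_odr(points):
    # points: (n, 3) array. The ODR line minimises the sum of squared
    # orthogonal distances; it goes through the centroid along the first
    # right singular vector of the centred data.
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]  # a point on the line and its unit direction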

A line constraint is simulated by introducing a dummy rigid body with infinite mass and using a distance constraint with unit length, as seen in Fig. 2. For the line constraints the vector r′_ij is replaced with ρv (where j is the dummy rigid body, ρ is a constant, and v is the unit normal from i to the line), and the position of i is adjusted using:

σ′_c = |ρv|² − 1,    a = σ′_c / (2 v · (ρv)),    r″_i = r′_i + a v

This has the effect of gradually moving the joint i towards the line, while the normal distance constraints between joints are satisfied. The constant ρ controls the rate of change from one iteration to the next. Since the mass of the dummy rigid body is infinite, its position is not explicitly updated.
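A literal Python transcription of this update might look as follows; the projection helper and the choice ρ > 1 (so that the step a is positive and i moves towards the line) are our assumptions for illustration:

import numpy as np

def line_constraint_step(r_i, point, direction, rho=1.5):
    # (point, direction) define the fitted ODR line; r_i is the joint.
    d = direction / np.linalg.norm(direction)
    foot = point + (r_i - point).dot(d) * d   # orthogonal projection on line
    n = foot - r_i                            # normal from joint to line
    if np.linalg.norm(n) < 1e-9:
        return r_i                            # joint already lies on the line
    v = n / np.linalg.norm(n)                 # unit constraint vector
    sigma = rho * rho - 1.0                   # sigma'_c = |rho v|^2 - 1
    a = sigma / (2.0 * rho)                   # a = sigma'_c / (2 v . (rho v))
    return r_i + a * v                        # r''_i = r'_i + a v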

The data around the shoulders and hips is not reliable. Due to artefacts of skeletonisation, the limbs can be shortened if the legs are close together or the arms close to the torso. To compensate for the missing shoulder and hip data, lines are fitted to the torso instead and displaced along the mediolateral axis.


Fig. 2: Simulating a line constraint using a dummy rigid body r_d and a distance constraint. (a) The constraint vector v is a unit vector in the direction of the normal from the joint i to the line. (b) The unconstrained inter-body vector is set to ρv, essentially moving the dummy rigid body towards the line. (c) SHAKE updates the position of i, moving it towards the line.

Since the torso can be twisted, two mediolateral axes are needed. A forward vector is provided with the data tree and used to find the hip mediolateral axis. A plane is fitted to the data around the upper torso, and the plane normal is used to find the shoulder mediolateral axis.

To achieve faster convergence of SHAKE, the model is first roughly aligned with the data. The model is rotated to face the direction of the forward vector, and translated so the chest joint in the model is at the position of the top of the torso in the data. To overcome errors in the data tree labelling, the pose from the previous frame is used if the tree is marked as invalid, or the change in pose from one frame to the next exceeds a threshold value. The difference in pose is calculated by summing up the Euclidean distance between estimated positions for each joint. The complete model fitting procedure can be seen in Alg. 2.

4 Results

A number of experiments have been conducted using the proposed method. Both real and synthetic data were used, and reconstruction was done with a resolution of 128 × 128 × 128 voxels, resulting in a voxel size of 15.6 mm.

The accuracy of the method was evaluated using synthetic data generated with the toolset presented in [1]. An avatar was animated using motion capture data from the CMU dataset. Silhouette images of the sequence were rendered with eight virtual cameras at a resolution of 800 × 600, seven placed around the avatar and one in the ceiling pointing down.

Two dance sequences, dance-whirl and lambada, totalling around 1000 frames (subject 55, trials 1 and 2), were used. The joint locations from the motion capture data were used as a basis for comparison. Box and whisker plots (1.5 IQR convention) of the two sequences can be seen in Fig. 3. The distances between the upper and lower quartiles are within the size of two voxels, suggesting that the errors are caused by differences between the bone lengths of the model and the underlying skeletal structure.


Algorithm 2 Model fitting

1: for each frame n do
2:   if tree is valid then
3:     Fit shoulder plane.
4:     for each joint j do
5:       Fit line l to nodes corresponding to j using ODR.
6:       Set up line constraint between j and l.
7:     end for
8:     Roughly align model with tree using rotation and translation.
9:     Run the SHAKE algorithm to refine pose(n).
10:    if diff(pose(n), pose(n−1)) > τ then
11:      Set pose(n) ← pose(n−1).
12:    end if
13:  else
14:    Set pose(n) ← pose(n−1).
15:  end if
16: end for

The right finger tip joints are particularly inaccurate, and generally the extremities have higher errors than internal joints.

The mean positional errors of the joints over the entire sequences were 60.8 mm (st. dev. 26.8 mm) and 67.0 mm (st. dev. 31.0 mm), respectively. If an invalid labelling of the tree structure is detected the estimated pose from the previous frame will be reused. This happened in 1.6% of the frames of the dance-whirl sequence, and 2.9% of the frames of the lambada sequence.

Fig. 3: Box and whisker plots of two sequences of synthetic data: (a) dance-whirl consists of 451 frames (subject 55, trial 1); (b) lambada has 545 frames (subject 55, trial 2).

The computational cost of the method was evaluated using real data drawn from the i3DPost dataset [6]. This is a publicly available multi-view video dataset of a range of different motions captured in a studio environment. Four sequences of varying complexity were tested, and the results can be seen in Table 1. The method achieves near real-time performance of ~14 fps. The model fitting is highly efficient, and skeletonising the volume is the main bottleneck. Some frames from the ballet and dance-whirl sequences can be seen in Fig. 4.

Table 1: Comparison of mean processing times for four sequences from the i3DPost dataset. All times are in milliseconds, with standard deviations in parentheses.

Sequence              Data acq.      Line fitting   SHAKE         Sum            Framerate
Walk, seq. 013        65.16 (2.67)   0.09 (0.01)    2.93 (0.06)   68.18 (2.68)   14.67
Walk-spin, seq. 015   67.95 (4.06)   0.10 (0.02)    2.95 (0.08)   70.33 (3.96)   14.22
Run-jump, seq. 017    67.33 (3.97)   0.09 (0.01)    2.90 (0.08)   68.54 (3.76)   14.59
Ballet, seq. 023      68.53 (5.53)   0.09 (0.01)    2.90 (0.11)   71.12 (5.60)   14.06

5 Discussion and Future Work

In the previous section we demonstrated the proposed method on several sequences where it robustly recovers a segmented skeletal structure. There are, however, some limitations to using a skeletonisation-based approach, and we will discuss here how these issues can be resolved in the future.

Skeletonisation-based pose estimation methods have problems recovering the skeletal structure in frames where the limbs are close together, or close to the torso. A significant advantage of the proposed method is that frames are processed individually, and thus the method can recover after detecting invalid poses.

Straka et al. [14] recently presented a similar approach and also used synthetic data from the CMU database for quantitative evaluation. Section 4 shows that the accuracy and processing time of the proposed method is on a par with a state-of-the-art skeletonisation-based method using similar data for evaluation. The method presented in [14], however, requires the head to be the highest point for the duration of a sequence, and labelling of left and right limbs is not automatic. These are limitations that the proposed approach does not suffer from.

Similarly to Kastenmeier and Vesely [8], we experimented with combining SHAKE with simulated annealing to refine the model fitting. Several different cost functions were tested, but the improvement in accuracy was negligible at the cost of >30 ms extra processing time. It is possible, however, that a cost function could be found that significantly improves the accuracy; further experimentation is required to evaluate the benefit of adding simulated annealing.

Some of the extreme outliers in the toe positions in Fig. 3 are caused by poor 3D reconstruction resulting in a reversal of the foot direction.


Fig. 4: Some frames from the ballet and dance-whirl sequences. In (a) the steps of the model fitting are shown: the labelled tree structure with forward vector, lines fitted to the data, rough alignment, joints constrained to lines, and fitted model. An example of an invalid tree is shown in (b), where the model configuration from the previous frame is reused. In (c) two consecutive frames, where the second tree labelling is valid, but still incorrect. The difference between the two estimated poses is above a threshold, and the first pose is reused in the second frame.

The currently used model does not include joint angle limits, so in these cases the model is brought into what should be illegal configurations. Using a model representation that includes joint angles, such as the one presented by Guerra-Filho [7], would prevent the model fitting from resulting in illegal configurations and improve accuracy.

6 Concluding Remarks

We have presented a novel pose estimation method based on fitting a model to labelled tree structures that are generated from skeletonised sequences of visual hulls. The SHAKE algorithm was used to constrain the model to straight lines fitted to the data trees. This is intended to be a real-time approach for pose estimation; the results for the pose tree computation back this up and demonstrate good model fitting across multiple sequences of complex motion with framerates of around 14 fps. The approach can recover from errors or degeneracies in the initial volume/skeletal reconstruction, which overcomes inherent limitations of many tracking approaches that cannot re-initialise. Ground-truth evaluation on synthetic data indicates correct model fitting in ~97% of frames with RMS errors of ~6 cm.

References

1. Bakken, R.H.: Using Synthetic Data for Planning, Development and Evaluation of Shape-from-Silhouette Based Human Motion Capture Methods. In: Proceedings of ISVC (2012)

2. Bakken, R.H., Hilton, A.: Real-Time Pose Estimation Using Tree Structures Built from Skeletonised Volume Sequences. In: Proceedings of VISAPP. pp. 181–190 (2012)

3. Buss, S.R.: Introduction to Inverse Kinematics with Jacobian Transpose, Pseudoinverse and Damped Least Squares Methods. Unpublished manuscript (2004), http://math.ucsd.edu/~sbuss/ResearchWeb

4. Caillette, F., Galata, A., Howard, T.: Real-time 3-D human body tracking using learnt models of behaviour. Computer Vision and Image Understanding 109(2), 112–125 (2008)

5. Chen, Y.L., Chai, J.: 3D Reconstruction of Human Motion and Skeleton from Uncalibrated Monocular Video. Computer Vision - ACCV (LNCS) 5994, 71–82 (2009)

6. Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., Pitas, I.: The i3DPost multi-view and 3D human action/interaction database. In: Proceedings of the Conference for Visual Media Production. pp. 159–168 (2009)

7. Guerra-Filho, G.: A General Motion Representation - Exploring the Intrinsic Viewpoint of a Motion. In: Proceedings of GRAPP. pp. 347–352 (2012)

8. Kastenmeier, T., Vesely, F.: Numerical robot kinematics based on stochastic and molecular simulation methods. Robotica 14(3), 329–337 (1996)

9. Menier, C., Boyer, E., Raffin, B.: 3D Skeleton-Based Body Pose Recovery. In: Proceedings of 3DPVT. pp. 389–396 (2006)

10. Michoud, B., Guillou, E., Bouakaz, S.: Real-Time and Markerless 3D Human Motion Capture Using Multiple Views. Human Motion - Understanding, Modeling, Capture and Animation (LNCS) 4814, 88–103 (2007)

11. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 90–126 (2006)

12. Moschini, D., Fusiello, A.: Tracking Human Motion with Multiple Cameras Using an Articulated Model. Computer Vision/Computer Graphics Collaboration Techniques (LNCS) 5496, 1–12 (2009)

13. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108(1-2), 4–18 (2007)

14. Straka, M., Hauswiesner, S., Rüther, M., Bischof, H.: Skeletal Graph Based Human Pose Estimation in Real-Time. In: Proceedings of BMVC (2011)

15. Sundaresan, A., Chellappa, R.: Model-driven segmentation of articulating humans in Laplacian Eigenspace. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10), 1771–1785 (2008)

16. Tang, W., Cavazza, M., Mountain, D., Earnshaw, R.: A constrained inverse kinematics technique for real-time motion capture animation. The Visual Computer 15(7-8), 413–425 (1999)


Paper 7
View-independent gait recognition

Authors: Rune H. Bakken and Odd Erik Gundersen.

Full title: View-Independent Human Gait Recognition Using CBR andHMM.

Published in: Proceedings of the Eleventh Scandinavian Conference on Artificial Intelligence (SCAI 2011), Frontiers in Artificial Intelligence and Applications Vol. 227, A. Kofod-Pedersen, F. Heintz, H. Langseth eds., 2011, IOS Press.

Copyright: © 2011 The authors and IOS Press.


View-Independent Human Gait Recognition Using CBR and HMM

Rune Havnung BAKKEN a,1, and Odd Erik GUNDERSEN b,c

a Department of Informatics and e-Learning, HiST, Trondheim, Norway
b Department of Computer and Information Science, NTNU, Trondheim, Norway

c Verdande Technology, Trondheim, Norway

Abstract. This concept paper describes a novel approach to view-independent human gait analysis. Previous work in computer vision-based gait analysis typically uses silhouette images from one side-view camera as input to the recognition system. Our approach utilises multiple synchronised cameras placed around the subject. From extracted silhouette images a 3D reconstruction is done that produces a segmented, labelled pose tree. Features are extracted from the pose tree and combined into an input case to the Case-Based Reasoning system. The input case is matched against stored past cases in a case base, and the most similar cases are classified by a Hidden Markov Model. The advantages of this approach are that subjects can be captured in a completely view-invariant way and that recognition is efficient because unlikely candidates can be quickly discarded.

Keywords. Gait Recognition, Hidden Markov Model, Case-Based Reasoning

1. Introduction

Uniquely identifying people is essential in modern day society. On the one hand, people have to verify that they are themselves and not impostors. On the other hand, law enforcers need to identify people through a third party like a video, an image or a witness' description. Our research can be used for both purposes.

People can be identified in three ways: 1) using knowledge, such as a PIN code or a password, 2) using a token, for instance a passport or an identity card, or 3) using biometrics, intrinsic characteristics of a person like fingerprints, iris patterns or voice recognition. Biometrics are more reliable and more difficult to fake than the other means of identification and have received a lot of attention from the research community in recent years. The manner in which a person walks, the gait, is unique, and can be used as a behavioural biometric. Gait recognition as a biometric has certain unique advantages in that it can be used at a distance in a non-intrusive way and at low resolutions where other biometrics like face recognition are not applicable.

Gait information can be gathered using three different modalities: 1) images, 2) floor sensors, and 3) wearable sensors. Computer vision-based approaches that rely on image data can be further divided into appearance-based and model-based methods.

1Corresponding author: Rune Havnung Bakken, [email protected].


Appearance-based methods extract information directly from the images, while model-based methods attempt to fit a pre-defined model of the human body to the image data. The vast majority of previous work is based on using 2D data from a single camera. These methods are vulnerable to changes in the viewpoint. Recently, some researchers have investigated fusion of classification results from multiple viewpoints, and there is one example of 3D model-based gait recognition. However, this 3D model-based method uses a custom-built laser scanner for capturing input data, rather than using commodity hardware.

Hidden Markov Models (HMM) have been used extensively with good results for recognition in appearance-based methods. However, in order to achieve real-time framerates, we propose to combine the HMM with an example-based machine learning method called Case-Based Reasoning (CBR) [1] into a hybrid method. We argue that the hybrid method has several advantages. One advantage is that the CBR system can sort out less probable candidates using a smaller set of features and simple numeric similarity measures. This saves the execution of several HMMs with large input vectors. Another advantage is that the CBR system can easily be adapted to learn, and the cost of learning in CBR systems is far less than the cost of learning in HMMs. Furthermore, learning is an inherent part of the CBR process, and thus dynamic learning is supported intrinsically.

Goal: The overall goal of the research presented in this paper is to develop a method for achieving real-time and view-independent human gait recognition using commodity hardware.

Contribution: We present a framework of methods and a system architecture for achieving real-time and view-independent human gait analysis. Our approach differs from other approaches in that it classifies based on 3D data reconstructed from images, which makes it novel. The classification method is a hybrid machine learning system that combines HMM and CBR, and to our knowledge CBR has not been used in gait recognition before.

In section 2 we investigate related work. Section 3 details our approach to view-independent gait recognition, and in section 4 we discuss the merits of said approach and present a road-map for future work.

2. Gait Analysis and Recognition

The study of how people walk dates back to Aristotle, and there are several examples up through the centuries of researchers, including Leonardo and Galileo, who have investigated human locomotion. In the medical and biomechanical communities researchers have studied the components of gait for diagnosis and treatment of pathological abnormalities. This work indicated that gait appears to be unique, and that there are twenty distinct gait components. For more details see Nixon et al. [2].

A gait sequence can be divided into gait cycles. A gait cycle is defined as a series of poses between two consecutive foot-to-floor contacts (heel strikes) of the same foot. The portion of a cycle when a foot is in contact with the ground is called the stance phase, and the rest of the cycle is called the swing phase, during which the foot is in the air. This is illustrated in figure 1.

The uniqueness of gait has prompted interest from researchers in other disciplines to study the use of gait as a biometric in human identification.


Figure 1. The gait cycle.

Gafurov [3] categorises gait recognition into three groups based on input method: 1) computer vision, 2) floor sensor, and 3) wearable sensor. We will concentrate on computer vision-based approaches. Vision-based gait recognition has some unique advantages in that it can be used at a distance and with low resolution where other biometrics are not applicable.

Computer vision-based gait recognition methods can be divided into two categories, appearance-based and model-based. Appearance-based approaches infer useful information directly from image evidence, while model-based approaches start with some model of the human body that is fit to the data. The input images are usually segmented to form silhouette images.

The input images are commonly captured with a single camera with the subject walking parallel to the image plane. A recent survey by Wang et al. [4] shows that this fronto-parallel approach is still prevalent. The use of a single camera makes the recognition vulnerable to changes in the viewpoint. A change in the walking path away from the parallel line can have a detrimental effect on the recognition performance. Recently, some attempts have been made to alleviate this.

Zhao et al. [5] presented a model-based approach where a projected kinematic model was fit to silhouette images from multiple views. However, the tracking procedure requires manual initialisation of the model, making this method unfit for use in autonomous systems.

A method with less sensitivity to viewpoint changes using only one camera was presented by Goffredo et al. [6]. They fit a lower body model to silhouette images even for angled poses, giving some leeway in the choice of viewpoint, but it is still not completely view-independent.

Yamauchi et al.'s [7] approach makes a radical change in the input modality. A custom-built laser range finder is used to capture depth images of the subject. This yields true 3D input data, but the need for special hardware makes it less flexible than methods relying only on standard cameras.

Liu and Tan’s [8] recent work revolves around doing classification of multiple viewsseparately, and then fusing the recognition results together. This makes training andrecognition more time-consuming because at least one classifier is needed for each view.

There are many examples [9,10,11,12,13] of variants of HMMs being used successfully to model the gait cycle in appearance-based approaches.


The different phases of the gait cycle are represented as non-observable states in the HMM. It is assumed that the current state is influenced by the previous state in a first-order Markov process. Probabilities of observations and state transitions are learned from training data. The candidate with the highest posterior probability is the identified person. Variants of the standard HMM used in gait recognition include the Variable Length Markov Model, Parallel HMM, and factorial HMM.

Case-based reasoning is a methodology based on the observation that humans solve problems by reusing solutions to past problems when solving new ones. The reasoning process is typically divided into four steps: retrieve, reuse, revise and retain. First, an input case is compared to a set of past cases that are stored in a database called the case base, and a set of similar cases is retrieved. Then, the solutions of the retrieved cases are reused to solve the input case. In case the solution was not optimal, it is revised, and the input case with the revised solution is retained in the case base. For in-depth reviews of the methodology and important features, the reader is urged to read [1,14,15]. Measuring similarity is an important part of CBR and an extensive introduction is given by Richter [16].

Hybrid machine learning systems combine different machine learning methods into one reasoning system. Both CBR and HMM have been successfully included in hybrid systems. A survey of hybrid HMM and Artificial Neural Network systems for use in speech recognition is presented in [17], and a survey of hybrid recommender systems that includes CBR is given in [18].

3. Approach

In this section, we will detail the steps in our gait recognition process, which starts with the input data, a sequence of images, and ends up with the identified person.

3.1. Overview

The gait recognition process consists of two parts, pre-processing of data and recognition, and it is illustrated in figure 2.

Figure 2. Overview of the hybrid framework.

The input data is pre-processed to form an input case for the recognition step. Pre-processing consists of combining the information from multiple image streams from synchronised cameras to form a 3D representation of the movement of the subject. From the 3D representation a set of features is extracted, and these are combined into an input case. The input case is compared to a set of past cases that are stored in a case base.


The comparison retrieves the n most similar past cases from the case base, and the CBR engine forwards them to the HMM subsystem together with the input case. Past cases include an HMM as part of the case description. The dynamic features of the input case are executed on each of the HMMs belonging to the n most similar past cases. The result is the probability of each past case representing the same person as in the input case. The person represented by the case with the highest probability is returned as the identified person, while the complete list of probabilities (the rank) is input to the CBR engine.

3.2. Preliminary Assumptions

We assume that the motion of one person at a time is captured in a controlled environment. The background is assumed to be static to simplify image segmentation. A set of synchronised cameras placed around the subject is used to capture the gait sequences. Mündermann et al. [19] demonstrated that using eight or more cameras ensures good results from the 3D reconstruction phase in a shape-from-silhouette approach. Both intrinsic and extrinsic calibration information must be available for the cameras.

3.3. Pre-processing

In the following subsections we will describe the different steps in the data acquisition process. An overview of the process is given in figure 3.

Figure 3. The data acquisition process. Images are captured from a set of synchronised cameras. The images are segmented into background and foreground (silhouettes). 3D reconstruction produces a visual hull of the volume occupied by the subject. The volume is skeletonised and a pose tree is built from the skeleton data. The test data shown here are from the i3DPost dataset [20].

3.3.1. Background Subtraction

The first step in processing the data is to segment the images into background and foreground. Background subtraction methods typically learn a static background model and find foreground regions in incoming images by comparing them to the learnt model. We have previously investigated implementation of background subtraction methods on graphics hardware [21], and found that Horprasert et al.'s [22] algorithm yields excellent throughput while still giving decent segmentation performance.


Hence, a graphics hardware accelerated implementation of that algorithm is used to extract silhouettes from the image data. Morphological operators are used to reduce noise.

3.3.2. 3D Reconstruction

Once the silhouettes have been extracted, the calibration information from the cameras can be used to perform a 3D reconstruction. The silhouette images are cross sections of generalised cones with apices in the focal points of the cameras. The visual hull [23] is created by intersecting the silhouette cones. Visual hulls can be represented by surfaces or volumes. We employ a very simple algorithm that produces a volumetric visual hull. For all voxels in a regular grid we project that voxel's centre into each image plane and check if the projected point is inside or outside the silhouette. Voxels that have projected points inside all the silhouettes are kept and the rest are discarded. This procedure lacks the robustness associated with more advanced techniques, but its simplicity makes it attractive for implementation on graphics hardware, and early tests show that it performs well on gait sequences.
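A CPU sketch of this carving test could look as follows; the array layouts are assumptions for illustration, and the actual system runs the test on graphics hardware:

import numpy as np

def carve_visual_hull(silhouettes, projections, grid_points):
    # silhouettes : list of binary images (H, W), one per camera
    # projections : list of 3x4 camera projection matrices
    # grid_points : (n, 3) array of voxel centres in world coordinates
    homog = np.hstack([grid_points, np.ones((len(grid_points), 1))])
    keep = np.ones(len(grid_points), dtype=bool)
    for sil, P in zip(silhouettes, projections):
        uvw = homog @ P.T                         # project voxel centres
        u = (uvw[:, 0] / uvw[:, 2]).astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = sil.shape
        inbounds = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        inside = np.zeros(len(grid_points), dtype=bool)
        inside[inbounds] = sil[v[inbounds], u[inbounds]] > 0
        keep &= inside          # a voxel must fall inside every silhouette
    return keep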

3.3.3. Skeletonisation

In order to extract useful features from the data we would like to reduce the volume to a more compact representation while preserving its topology. The curve-skeleton [24] is a line-like or stick-like 1D representation that serves this purpose. A parallel thinning technique [25] is used to skeletonise the visual hull. The algorithm is implemented on graphics hardware to achieve high throughput.

3.3.4. Pose Tree Construction

It is natural to represent the human body using a tree data structure. The head, torso and limbs form a tree-like hierarchy. If the nodes in the tree are given positions in Euclidean space the tree describes a unique pose.

The skeleton voxel data is traversed in a breadth-first manner to build the pose tree. We place the root of the tree in the top-most voxel and assume that this is the head location. For gait sequences where the arms are kept mostly down along the torso this is a valid assumption. The nodes are placed in a queue as they are created. When the first node in the queue is removed, the neighbours of that node's corresponding voxel are checked and new child nodes are added to the tree if those neighbours have not been visited before. This is repeated until the queue is empty. At this point all voxels connected to the head voxel will have been processed and given corresponding nodes in the tree.
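A minimal sketch of this breadth-first construction, assuming a voxel set and a 26-neighbourhood helper (both hypothetical names, and with the vertical axis assumed to be the third coordinate):

from collections import deque

def build_pose_tree(skeleton_voxels, neighbours):
    # skeleton_voxels: set of (x, y, z) voxel coordinates
    # neighbours(v)  : yields the 26-connected neighbours of voxel v
    root = max(skeleton_voxels, key=lambda v: v[2])  # top-most voxel = head
    parent = {root: None}                            # tree as parent links
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for n in neighbours(v):
            if n in skeleton_voxels and n not in parent:  # not visited yet
                parent[n] = v
                queue.append(n)
    return parent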

An ideal pose tree consists of a root node, four leaf nodes, one internal node of degree three, one internal node of degree four, and a number of internal nodes of degree two. This is, however, never the case for real data. Noise in the skeleton will produce spurious branches in the pose tree that must be pruned away. Anthropometric measurements [26] are used to set a threshold for shortest branch length. Branches that are shorter than the threshold are removed. Initial testing shows that this approach is robust for gait sequences.

Next, we segment the tree into body parts. Starting from the leaf nodes the arms and legs are labelled. The neck and torso are labelled from the node where the arms and legs meet, respectively. Finally, we identify the left and right extremities in the tree.


By constructing vectors along the upper and lower arms, and checking whether the cross product between them points inwards or outwards, we can identify the left and right arms. A similar approach is used for the legs.
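The test can be sketched as below; the torso_centre reference point and the sign convention are our assumptions for illustration:

import numpy as np

def is_left_arm(shoulder, elbow, hand, torso_centre):
    # Vectors along the upper and lower arm; their cross product encodes
    # the bend direction of the elbow.
    upper = elbow - shoulder
    lower = hand - elbow
    bend = np.cross(upper, lower)
    outward = shoulder - torso_centre   # direction away from the body
    return bend.dot(outward) > 0        # sign convention is an assumption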

3.3.5. Feature Extraction and Case Representation

Similarly to Yamauchi et al. [7] we distinguish between static and dynamic features. Static features are subject specific and do not change during the course of the gait sequence. Static features include limb lengths, torso length and, to a certain extent, step length. Dynamic features change from pose to pose during the gait cycle, and include joint angles and locations of extremities.

Static features      Dynamic features
Limb lengths         Joint angles
Step length          Locations of extremities
Length of torso

Table 1. Possible features that can be extracted from the pose tree.

Typically, cases are divided into two parts: the description and the solution. The description is the collection of features that are used for comparing cases, while the solution is a class or advice. Both the static and the dynamic features are part of the case description in our system. While the input case has an empty solution part, a past case stored in the case base contains a solution. The identity of the person that is captured in the gait image sequence is a natural part of the solution. In addition, we store the complete HMM used to classify the captured person in the solution part of the past case.

3.4. Recognition

From a high-level perspective, recognition is a two step process in the hybrid system. First, the CBR system compares the input case to the past cases by comparing the static features, and the result is a set of the n most similar cases. Then, the HMM subsystem uses the dynamic features of the input case as the input vector to the HMMs belonging to the past cases. The person described by the most probable case is returned as the identified person. The interplay of the different components is illustrated in figure 4.

Figure 4. The gait recognition pipeline.


The static features are numeric values, which can be compared quickly using either exponential or linear numerical similarity measures in the CBR engine. As the static features describe human characteristics they will have clearly defined upper and lower bounds, and these upper and lower bounds should be applied in the comparison. Also, the features will be individually weighted and thus contribute unevenly to the outcome. The number of static features will be around 10.
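A sketch of such a weighted, bounded linear similarity measure; the feature names, weights and bounds are placeholders, not the system's actual configuration:

def case_similarity(query, case, weights, bounds):
    # query, case: dicts feature -> value
    # bounds     : dict feature -> (lo, hi), used to normalise differences
    total = sum(weights.values())
    score = 0.0
    for f, w in weights.items():
        lo, hi = bounds[f]
        diff = abs(query[f] - case[f]) / (hi - lo)  # normalised distance
        score += w * (1.0 - min(diff, 1.0))         # linear local similarity
    return score / total                            # weighted mean in [0, 1]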

The Hidden Markov Model can be considered the simplest dynamic Bayesian network, and the underlying system is assumed to be a Markov process with non-observable states. The HMM is a finite state machine with m states and is described by the parameters λ = (π, A, B), where π is the probability of starting in a specific state, A is the state transition matrix giving the probability of going from one state to another, and the matrix B contains the probabilities of observations belonging to specific states.

The dynamic feature vectors, which can easily count up to 40 dimensions, will be used as input to the HMM. The HMM is trained beforehand and new observations will not change the network. The gait cycle consists of k frames which in turn will yield k feature vectors.

The gait cycle is subdivided into m clusters and a characteristic pose, often called an exemplar, is identified for each cluster. These correspond to the m states in the HMM. The transitions between the exemplars correspond to the transition probability matrix, and the probabilistic dependence of poses from the m clusters constitutes the observation probability matrix.

During recognition the forward algorithm is used to compute the likelihood that the input sequence was produced by each of the n candidates. The identity of the unknown subject is given by the maximum log likelihood.
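For illustration, the scaled forward algorithm for a discrete-observation HMM can be sketched as follows; in the proposed system the observations would be exemplar indices, and the candidate whose HMM returns the highest value is the identified person:

import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    # pi : (m,) initial state probabilities
    # A  : (m, m) state transition matrix
    # B  : (m, k) observation probabilities for k discrete symbols
    # obs: sequence of observation symbol indices
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()                 # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # one forward recursion step
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik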

3.5. Learning

Both HMMs and CBR need training in order to classify optimally. HMMs must learn state probabilities, while initial feature weights need to be specified in CBR. The feature weights in CBR can be dynamically adjusted during the reasoning process, while dynamically changing state probabilities for HMMs is a more complicated process.

3.5.1. Training Hidden Markov Models

According to Sundaresan et al. [9] there are three problems that must be solved for the HMM to have practical applicability: 1) computing the observation probabilities given the model parameters using the forward algorithm, 2) finding the optimal state sequence that best explains the observations given the model parameters using Viterbi decoding, and 3) adjusting the model parameters using the Baum-Welch algorithm.

For a gait cycle we have k feature vectors; these are the observation symbols. The feature vectors must be divided into m clusters, and exemplars that best represent the clusters must be chosen.

3.5.2. Learning in CBR

The initial feature weights can be optimised using a genetic algorithm that searches for the optimal weights for classifying a training set of cases that have been classified beforehand. The optimal set of weights can be found using leave-one-out cross-validation.


The third step in the CBR reasoning process is the revise step. The solution of an input case can be revised by an expert in supervised learning, and the revised case will then be retained in the case base. In this system, the HMM provides its list of probabilities to the CBR system. Our intention is to adjust the weights of the static properties based on the difference in rank from the HMM and the similarity comparison done by the CBR system. This can be done by measuring the distance between cases in the two ranking lists and using this distance to adjust the feature weights relatively.

4. Discussion and Future Work

In order to implement our framework, several topics need to be investigated.

We have hypothesised that there will be performance gains by using the CBR engine to sort out less probable candidates early in the classification process. The assumption is that the classification of the HMMs will be impaired because of large dynamic feature vectors, which must be studied more closely. If we are able to reduce the feature vectors to the absolute minimum amount of relevant information, the performance gains might be so small that there is no use for a CBR system at all.

Even if the above assumption does not hold, there are other possible benefits of using a hybrid classifier, the most prominent being higher accuracy. The relation between the CBR system and the HMM need not be as proposed in the framework. One possibility is to let the HMM and CBR classify in parallel with the same input parameters. Then a solution can be selected based on the average or some weighted average of the results from the two classifiers. Another possibility is to let the HMM be a similarity measure for the dynamic feature set in the case comparison. The result of the comparison of the dynamic features can be combined with the static features by using weights. If feasible, more than one of the three ways of combining the classifiers in the hybrid system will be tested.

The case representation can be changed. Several combinations of static features should be tested. Whether the dynamic features should be a part of the case description that is used for CBR comparison is an open question; they might improve accuracy.

Training the HMM and finding an initial set of weights for the cases do not need much consideration as there are well-established methods for doing this. The revise step can change weights of features or change the number, n, of how many of the most similar cases should be forwarded to the HMM. Unsupervised learning in the revise step on the basis of the ranking of the HMM has not been done before to our knowledge.

References

[1] A. Aamodt and E. Plaza, "Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[2] M. S. Nixon, T. N. Tan, and R. Chellappa, Human Identification Based on Gait. Springer-Verlag, 2006.

[3] D. Gafurov, "A Survey of Biometric Gait Recognition: Approaches, Security and Challenges," in Proceedings of NIK 2007, 2007.

[4] J. Wang, M. She, S. Nahavandi, and A. Kouzani, "A Review of Vision-based Gait Recognition Methods for Human Identification," in Proceedings of the 2010 International Conference on Digital Image Computing: Techniques and Applications, pp. 320–327, 2010.

[5] G. Zhao, G. Liu, H. Li, and M. Pietikäinen, "3D Gait Recognition Using Multiple Cameras," in Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pp. 529–534, 2006.

[6] M. Goffredo, R. D. Seely, J. N. Carter, and M. S. Nixon, "Markerless View Independent Gait Analysis with Self-camera Calibration," in Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 1–6, Sept. 2008.

[7] K. Yamauchi, B. Bhanu, and H. Saito, "Recognition of Walking Humans in 3D: Initial Results," in Computer Vision and Pattern Recognition Workshops, pp. 45–52, 2009.

[8] N. Liu and Y. Tan, "View Invariant Gait Recognition," in Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 1410–1413, 2010.

[9] A. Sundaresan, A. Roy-Chowdhury, and R. Chellappa, "A Hidden Markov Model Based Framework for Recognition of Humans from Gait Sequences," in Proceedings of the 2003 International Conference on Image Processing, vol. 2, pp. 93–96, 2003.

[10] K. Iwamoto, K. Sonobe, and N. Komatsu, "A Gait Recognition Method using HMM," in Proceedings of the SICE 2003 Annual Conference, vol. 2, pp. 1936–1941, IEEE, 2003.

[11] C. Chen, J. Liang, H. Hu, L. Jiao, and X. Yang, "Factorial Hidden Markov Models for Gait Recognition," Advances in Biometrics (LNCS), vol. 4642, pp. 124–133, 2007.

[12] F. Caillette, A. Galata, and T. Howard, "Real-time 3-D human body tracking using learnt models of behaviour," Computer Vision and Image Understanding, vol. 109, no. 2, pp. 112–125, 2008.

[13] D. Zhang, Y. Wang, and B. Bhanu, "Age Classification Based on Gait Using HMM," in Proceedings of the 2010 International Conference on Pattern Recognition, pp. 3834–3837, Aug. 2010.

[14] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37–66, 1991.

[15] R. de Mantaras, "Case-based reasoning," Machine Learning and Its Applications (LNCS), vol. 2049, pp. 127–145, 2001.

[16] M. Richter, "Similarity," Case-Based Reasoning on Images and Signals (Studies in Computational Intelligence), vol. 73, pp. 25–90, 2008.

[17] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," Neurocomputing, vol. 37, no. 1-4, pp. 91–126, 2001.

[18] R. Burke, "Hybrid recommender systems: Survey and experiments," User Modeling and User-Adapted Interaction, vol. 12, pp. 331–370, 2002.

[19] L. Mündermann, S. Corazza, A. M. Chaudhari, E. J. Alexander, and T. P. Andriacchi, "Most favorable camera configuration for a shape-from-silhouette markerless motion capture system for biomechanical analysis," in Proceedings of SPIE-IS&T Electronic Imaging, vol. 5665, pp. 278–287, 2005.

[20] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, "The i3DPost multi-view and 3D human action/interaction database," in Proceedings of the 2009 Conference for Visual Media Production, pp. 159–168, IEEE, Nov. 2009.

[21] E. Fauske, L. M. Eliassen, and R. H. Bakken, "A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA," in Proceedings of the First Norwegian Artificial Intelligence Symposium, pp. 181–192, Tapir, 2009.

[22] T. Horprasert, D. Harwood, and L. Davis, "A Statistical Approach for Real-time Robust Background Subtraction and Shadow Detection," in Proceedings of the IEEE ICCV'99 FRAME-RATE Workshop, vol. 99, pp. 1–19, 1999.

[23] A. Laurentini, "The Visual Hull Concept for Silhouette-Based Image Understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150–162, 1994.

[24] N. D. Cornea, D. Silver, and P. Min, "Curve-Skeleton Properties, Applications, and Algorithms," IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 3, pp. 530–548, 2007.

[25] G. Bertrand and M. Couprie, "A New 3D Parallel Thinning Scheme Based on Critical Kernels," Discrete Geometry for Computer Imagery (LNCS), vol. 4245, pp. 580–591, 2006.

[26] R. Motmans and E. Ceriez, "DINBelg Anthropometry Table," 2005.


Part V

Appendices


Appendix A
Statements of co-authorship

Statements of co-authorship from:

1. Bjørn Gunnar Eilertsen

2. Lars Moland Eliassen

3. Eirik Fauske

4. Odd Erik Gundersen

5. Adrian Hilton

6. Gustavo Ulises Matus

7. Jan Harald Nilsen


To whom it may concern

Statement of co-authorship on joint publications to be used in the PhD dissertation of Rune Havnung Bakken

(Cf. NTNU PhD regulations §7.4, paragraph 4)

As co-author on the following joint publications in the PhD dissertation Improving real-time human pose estimation from multi-view video by Rune Havnung Bakken:

Eirik Fauske, Lars M. Eliassen and Rune H. Bakken, A Comparison of Learning Based Background Subtraction Techniques Implemented in CUDA, In Proceedings of NAIS 2009.
Fauske and Eliassen designed and implemented the solution, and wrote the paper. Bakken formulated the problem, oversaw the design, and provided feedback during the writing process,

and

Rune H. Bakken and Lars M. Eliassen, Real-time 3D Skeletonisation in Computer Vision-Based Human Pose Estimation Using GPGPU, In Proceedings of IPTA 2012.
Bakken formulated the problem, designed and implemented the solution, conducted the experiments and led the writing process. Eliassen contributed to the implementation and writing.

I declare that the contributions to the papers are correctly identified, and I consent to this work being used as part of the dissertation.

Date:

Lars Moland Eliassen


05/03/13


Appendix B
Errata

In figure 1c in paper 6 the bone ratios are scaled incorrectly. The correct values are shown in table 6.7.