From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare...

44
Shaogang Gong and Tao Xiang VISUAL ANALYSIS OF BEHAVIOUR From Pixels to Semantics February 2011 Springer

Transcript of From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare...

Page 1: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Shaogang Gong and Tao Xiang

VISUAL ANALYSIS OF BEHAVIOUR

From Pixels to Semantics

February 2011

Springer

Page 2: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour
Page 3: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

To Aleka, Philip and Alexander

Shaogang Gong

To Ning and Rachel

Tao Xiang

Page 4: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour
Page 5: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Preface

The human visual system is able to visually recognise and interpret object be-haviours under different conditions. Yet, the goal of building computer vision basedrecognition systems with comparable capabilities has proven to be very difficult toachieve. Computational modelling and analysis of object behaviours through visualobservation is inherently ill-posed. Many would argue thatour cognitive understand-ing remains unclear about why we associate certain semanticmeanings with specificobject behaviours and activities. This is because meaningful interpretation of a be-haviour is subject to the observer’sa priori knowledge, which is at times ratherambiguous. The same behaviour may have different semantic meanings dependingupon the context within which it is observed. This ambiguityis exacerbated whenmany objects are present in a scene. Can a computer based model be constructed thatis able to extract all necessary information for describinga behaviour from visualobservation alone? Do people behave differently in the presence of the others and ifso, how can a model be built to differentiate the expected normal behaviours fromthose of abnormality? Actions and activities associated with the same behaviouralinterpretation may be performed differently according to the intended meaning, anddifferent behaviours may be acted in a subtly similar way. The question arises asto whether these differences can be accurately measured visually and robustly com-puted consistently for meaningful interpretation of behaviour.

Visual analysis of behaviour requires not only to solve the problems of object de-tection, segmentation, tracking, motion trajectory analysis, but also the modelling ofcontext information and utilisation of non-sensory knowledge when available, suchas human annotation of input data or relevance feedback to output signals. Visualanalysis of behaviour faces two fundamental challenges in computational complex-ity and uncertainty. Object behaviours in general exhibit complex spatio-temporaldynamics in a highly dynamical and uncertain environment, for instance, humanactivities in a crowded public space. Segmenting and modelling human actions andactivities in a visual environment is inherently ill-posed, as information processingin visual analysis of behaviour is subject to noise, incompleteness and uncertainty insensory data. Whilst these visual phenomena are difficult to model analytically, they

vii

Page 6: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

viii Preface

can be probabilistically modelled much more effectively through statistical machinelearning.

Despite these difficulties, it is compelling that one of the most significant devel-opments in computer vision research over the last 20 years has been the rapidlygrowing interest in automatic visual analysis of behaviourin video data cap-tured from closed-circuit television (CCTV) systems installed in private and publicspaces. The study of visual analysis of behaviour has had an almost unique impacton computer vision and machine learning research at large. It raises many chal-lenges and provides a testing platform for examining some difficult problems incomputational modelling and algorithm design. Many of the issues raised are rel-evant to dynamic scene understanding in general, multivariate time series analysisand statistical learning in particular.

Much progress has been made since the early 1990s. Most noticeably, statisticalmachine learning has become central to computer vision in general, and to visualanalysis of behaviour in particular. This is strongly reflected throughout this book asone of the underlying themes. In this book, we study plausible computational modelsand tractable algorithms that are capable of automatic visual analysis of behaviourin complex and uncertain visual environments, ranging fromwell-controlled privatespaces to highly crowded public scenes. The book aims to reflect the current trends,progress and challenges on visual analysis of behaviour. Wehope this book will notonly serve as a sampling of recent progress but also highlight some of the challengesand open questions in automatic visual analysis of object behaviour.

There is a growing demand by both governments and commerce worldwide foradvanced imaging and computer vision technologies capableof automatically se-lecting and identifying behaviours of objects in imagery data captured in both publicand private spaces for crime prevention and detection, public transport management,personalised healthcare, information management and market studies, asset and fa-cility management. A key question we ask throughout this book is how to designautomatic visual learning systems and devices capable of extracting and miningsalient information from vast quantity of data. The algorithm design characteris-tics of such systems aim to provide, with minimum human intervention, machinecapabilities for extracting relevant and meaningful semantic descriptions of salientobjects and their behaviours for aiding decision-making and situation assessment.

There have been several books on human modelling and visual surveillance overthe years, includingFace Detection and Gesture Recognition for Human-ComputerInteractionby Yang and Ahuja (2001);Analyzing Video Sequences of Multiple Hu-mansby Ohya, Utsumi and Yamato (2002);A Unified Framework for Video Summa-rization, Browsing and Retrieval: with Applications to Consumer and SurveillanceVideoby Xiong, Radhakrishnan, Divakaran, Rui and Huang (2005);Human Identi-fication based on Gaitby Nixon, Tan and Chellapa (2005); andAutomated Multi-Camera Surveillanceby Javed and Shah (2008). There are also a number of booksand edited collections on behaviour studies from cognitive, social and psychologicalperspectives, includingAnalysis of Visual Behaviouredited by Ingle, Goodale andMansfield (1982);Hand and Mind: What Gestures Reveal about Thoughtby Mc-Neill (1992); Measuring Behaviourby Martin (1993);Understanding Human Be-

Page 7: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Preface ix

haviourby Mynatt and Doherty (2001); andUnderstanding Human Behaviour andthe Social Environmentby Zastrow and Kirst-Ashman (2003). However, there hasbeen no book that provides a comprehensive and unified treatment of visual analysisof behaviour from a computational modelling and algorithm design perspective.

This book has been written with an emphasis on computationally viable ap-proaches that can be readily adopted for the design and development of intelligentcomputer vision systems for automatic visual analysis of behaviour. We presentwhat is fundamentally a computational algorithmic approach, founded on recentadvances in visual representation and statistical machinelearning theories. This ap-proach should also be attractive to the researchers and system developers who wouldlike to both learn established techniques for visual analysis of object behaviour, andgain insight into up-to-date research focus and directionsfor the coming years. Wehope that this book succeeds in providing such a treatment ofthe subject useful notonly for the academic research communities, both also the commerce and industry.

Overall, the book addresses a broad range of behaviour modelling problems fromestablished areas of human facial expression, body gestureand action analysis toemerging new research topics in learning group activity models, unsupervised be-haviour profiling, hierarchical behaviour discovery, learning behavioural context,modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi-camera behaviour correlation, person re-identification, and ‘connecting-the-dots’for global abnormal behaviour detection. The book also gives in depth treatment tosome popular computer vision and statistical machine learning techniques, includ-ing Bayesian information criterion, Bayesian networks, ‘bag-of-words’ represen-tation, canonical correlation analysis, dynamic Bayesiannetworks, Gaussian mix-tures, Gibbs sampling, hidden conditional random fields, hidden Markov models,human silhouette shapes, latent Dirichlet allocation, local binary patterns, localitypreserving projection, Markov processes, probabilistic graphical models, probabilis-tic topic models, space-time interest points, spectral clustering, and support vectormachines.

The computational framework presented in this book can alsobe applied to mod-elling behaviours exhibited by many other types of spatio-temporal dynamical sys-tems, either in isolation or in interaction, and therefore can be beneficial to a widerrange of fields of studies, including internet network behaviour analysis and profil-ing, banking behaviour profiling, financial market analysisand forecasting, bioin-formatics, and human cognitive behaviour studies.

We anticipate that this book will be of special interest to researchers and aca-demics interested in computer vision, video analysis and machine learning. It shouldbe of interest to industrial research scientists and commercial developers keen to ex-ploit this emerging technology for commercial applications including visual surveil-lance for security and safety, information and asset management, public transportand traffic management, personalised healthcare in assisting elderly and disabled,video indexing and search, human computer interaction, robotics, animation andcomputer games. This book should also be of use to post-graduate students of com-puter science, mathematics, engineering, physics, behavioural science, and cogni-tive psychology. Finally, it may provide government policymakers and commercial

Page 8: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

x Preface

managers an informed guide on the potentials and limitations in deploying intelli-gent video analytics systems.

The topics in this book cover a wide range of computational modelling and al-gorithm design issues. Some knowledge of mathematics wouldbe useful for thereader. In particular, it would be convenient if one were familiar with vectors andmatrices, eigenvectors and eigenvalues, linear algebra, optimisation, multivariateanalysis, probability, statistics and calculus at the level of post-graduate mathemat-ics. However, the non-mathematically inclined reader should be able to skip overmany of the equations and still understand much of the content.

Shaogang GongTao Xiang

LondonFebruary 2011

Page 9: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Acknowledgements

We shall express our deep gratitude to the many people who have helped us inthe process of writing this book. The experiments describedherein would not havebeen possible without the work of PhD students and postdoctoral research assistantsat Queen Mary University of London. In particular, we want tothank Chen ChangeLoy, Tim Hospedales, Jian Li, Wei-Shi Zheng, Jianguo Zhang,Caifeng Shan, Yo-gesh Raja, Bryan Prosser, Matteo Bregonzio, Parthipan Siva, Lukasz Zalewski, Eng-Jon Ong, Jamie Sherrah, Jeffrey Ng, and Michael Walter for their contributions tothis work. We are indebted to Alexandra Psarrou who read a draft carefully and gaveus many helpful comments and suggestions.

We shall thank Simon Rees and Wayne Wheeler at Springer for their kind helpand patience during the preparation of this book. The book was typeset using LATEX.

We gratefully acknowledge the financial support that we havereceived over theyears from UK EPSRC, UK DSTL, UK MOD, UK Home Office, UK TSB, USArmy Labs, EU FP7, the Royal Society, BAA, and QinetiQ. Finally, we shall thankour families and friends for all their support.

xi

Page 10: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour
Page 11: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Contents

Part I INTRODUCTION

1 About Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.1 Understanding Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 4

1.1.1 Representation and Modelling . . . . . . . . . . . . . . . . . . . .. . . . . 51.1.2 Detection and Classification . . . . . . . . . . . . . . . . . . . . .. . . . . . 61.1.3 Prediction and Association . . . . . . . . . . . . . . . . . . . . . .. . . . . . 6

1.2 Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 71.2.1 Visual Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . 81.2.2 Video Indexing and Search . . . . . . . . . . . . . . . . . . . . . . . .. . . . 81.2.3 Robotics and Healthcare . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 91.2.4 Interaction, Animation and Computer Games . . . . . . . . .. . . . 9

1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 101.3.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 101.3.2 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 10

1.4 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 11

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 13

2 Behaviour in Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1 Facial Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 152.2 Body Gesture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 172.3 Human Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 192.4 Human Intent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 212.5 Group Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 232.6 Crowd Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 242.7 Distributed Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 262.8 Holistic Awareness: Connecting the Dots . . . . . . . . . . . . .. . . . . . . . . . 29

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 31

xiii

Page 12: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

xiv Contents

3 Towards Modelling Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.1 Behaviour Representation . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 43

3.1.1 Object-based Representation . . . . . . . . . . . . . . . . . . . .. . . . . . . 433.1.2 Part-based Representation . . . . . . . . . . . . . . . . . . . . . .. . . . . . . 473.1.3 Pixel-based Representation . . . . . . . . . . . . . . . . . . . . .. . . . . . . 483.1.4 Event-based Representation . . . . . . . . . . . . . . . . . . . . .. . . . . . 50

3.2 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 513.2.1 Static Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 543.2.2 Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . .. . . 553.2.3 Probabilistic Topic Models . . . . . . . . . . . . . . . . . . . . . .. . . . . . 56

3.3 Learning Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 573.3.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 583.3.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 583.3.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . .. . . . . 603.3.4 Weakly-Supervised Learning . . . . . . . . . . . . . . . . . . . . .. . . . . 613.3.5 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 61

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 63

Part II SINGLE-OBJECTBEHAVIOUR

4 Understanding Facial Expression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.1 Classification of Images . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 75

4.1.1 Local Binary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 764.1.2 Designing Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 794.1.3 Feature Selection by Boosting . . . . . . . . . . . . . . . . . . . .. . . . . 83

4.2 Manifold and Temporal Modelling . . . . . . . . . . . . . . . . . . . .. . . . . . . . 844.2.1 Locality Preserving Projections . . . . . . . . . . . . . . . . .. . . . . . . 844.2.2 Bayesian Temporal Models . . . . . . . . . . . . . . . . . . . . . . . .. . . . 92

4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 96

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 97

5 Modelling Gesture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015.1 Tracking Gesture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 102

5.1.1 Motion Moment Trajectory . . . . . . . . . . . . . . . . . . . . . . . .. . . . 1025.1.2 2D Colour-based Tracking . . . . . . . . . . . . . . . . . . . . . . . .. . . . 1035.1.3 Bayesian Association . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 1065.1.4 3D Model-based Tracking . . . . . . . . . . . . . . . . . . . . . . . . .. . . . 116

5.2 Segmentation and Atomic Action . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 1225.2.1 Temporal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 1245.2.2 Atomic Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 125

5.3 Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 1275.4 Affective State Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 131

5.4.1 Space-Time Interest Points . . . . . . . . . . . . . . . . . . . . . .. . . . . . 1325.4.2 Expression and Gesture Correlation . . . . . . . . . . . . . . .. . . . . . 133

Page 13: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Contents xv

5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 135

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 137

6 Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1416.1 Human Silhouette . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 1426.2 Hidden Conditional Random Fields . . . . . . . . . . . . . . . . . . .. . . . . . . . 143

6.2.1 HCRF Potential Function . . . . . . . . . . . . . . . . . . . . . . . . .. . . . 1466.2.2 Observable HCRF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . 146

6.3 Space-Time Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 1496.3.1 Clouds of Space-Time Interest Points . . . . . . . . . . . . . .. . . . . 1506.3.2 Joint Local and Global Feature Representation . . . . . .. . . . . . 157

6.4 Localisation and Detection . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 1586.4.1 Tracking Salient Points . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 1616.4.2 Automated Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 162

6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 166

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 167

Part III GROUPBEHAVIOUR

7 Supervised Learning of Group Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . 1737.1 Contextual Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 174

7.1.1 Seeding Event: Measuring Pixel-Change-History . . . .. . . . . . 1747.1.2 Classification of Contextual Events . . . . . . . . . . . . . . .. . . . . . 177

7.2 Activity Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 1807.2.1 Semantic Content Extraction . . . . . . . . . . . . . . . . . . . . .. . . . . . 1817.2.2 Semantic Video Segmentation . . . . . . . . . . . . . . . . . . . . .. . . . 183

7.3 Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 1917.3.1 Correlations of Temporal Processes . . . . . . . . . . . . . . .. . . . . . 1917.3.2 Behavioural Interpretation of Activities . . . . . . . . .. . . . . . . . . 196

7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 200

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 202

8 Unsupervised Behaviour Profiling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2058.1 Off-Line Behaviour Profile Discovery . . . . . . . . . . . . . . . .. . . . . . . . . 206

8.1.1 Behaviour Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 2068.1.2 Behaviour Profiling by Data Mining . . . . . . . . . . . . . . . . .. . . 2078.1.3 Behaviour Affinity Matrix . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 2088.1.4 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 2098.1.5 Model Order Selection . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 2098.1.6 Quantifying Eigenvector Relevance . . . . . . . . . . . . . . .. . . . . . 210

8.2 On-Line Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 2138.2.1 A Composite Behaviour Model . . . . . . . . . . . . . . . . . . . . . .. . 2138.2.2 Run-Time Anomaly Measure . . . . . . . . . . . . . . . . . . . . . . . .. . 216

Page 14: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

xvi Contents

8.2.3 On-Line Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . .. . . . . 2168.3 On-Line Incremental Behaviour Modelling . . . . . . . . . . . .. . . . . . . . . 218

8.3.1 Model Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 2198.3.2 Incremental Parameter Update . . . . . . . . . . . . . . . . . . . .. . . . . 2208.3.3 Model Structure Adaptation . . . . . . . . . . . . . . . . . . . . . .. . . . . 223

8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 224

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 226

9 Hierachical Behaviour Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2299.1 Local Motion Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 2309.2 Markov Clustering Topic Model . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 231

9.2.1 Off-Line Model Learning by Gibbs Sampling . . . . . . . . . .. . 2349.2.2 On-Line Video Saliency Inference . . . . . . . . . . . . . . . . .. . . . . 236

9.3 On-Line Video Screening . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 2379.4 Model Complexity Control . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 2409.5 Semi-Supervised Learning of Behavioural Saliency . . . .. . . . . . . . . . 2429.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 243

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 246

10 Learning Behavioural Context. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24710.1 Spatial Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 248

10.1.1 Behaviour-Footprint . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 25010.1.2 Semantic Scene Decomposition . . . . . . . . . . . . . . . . . . .. . . . . 250

10.2 Correlational and Temporal Context . . . . . . . . . . . . . . . .. . . . . . . . . . . 25210.2.1 Learning Regional Context . . . . . . . . . . . . . . . . . . . . . .. . . . . . 25310.2.2 Learning Global Context . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 256

10.3 Context-Aware Anomaly Detection . . . . . . . . . . . . . . . . . .. . . . . . . . . . 25810.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 261

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 263

11 Modelling Rare and Subtle Behaviours. . . . . . . . . . . . . . . . . . . . . . . . . . . 26511.1 Weakly-Supervised Joint Topic Model . . . . . . . . . . . . . . .. . . . . . . . . 267

11.1.1 Model Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 26711.1.2 Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 270

11.2 On-Line Behaviour Classification . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 27511.3 Localisation of Rare Behaviour . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 27811.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 279

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 281

Page 15: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Contents xvii

12 Man in the Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28312.1 Active Behaviour Learning Strategy . . . . . . . . . . . . . . . .. . . . . . . . . . . 28512.2 Local Block-based Behaviour . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 28712.3 Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 28912.4 Query Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 291

12.4.1 Likelihood Criterion . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 29112.4.2 Uncertainty Criterion . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 292

12.5 Adaptive Query Selection . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 29412.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 297

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 299

Part IV DISTRIBUTED BEHAVIOUR

13 Multi-Camera Behaviour Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30313.1 Multi-View Activity Representation . . . . . . . . . . . . . . .. . . . . . . . . . . . 306

13.1.1 Local Bivariate Time-Series Events . . . . . . . . . . . . . .. . . . . . . 30613.1.2 Activity-based Scene Decomposition . . . . . . . . . . . . .. . . . . . . 307

13.2 Learning Pair-Wise Correlation . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 31013.2.1 Cross Canonical Correlation Analysis . . . . . . . . . . . .. . . . . . . 31113.2.2 Time-Delayed Mutual Information Analysis . . . . . . . .. . . . . . 313

13.3 Multi-Camera Topology Inference . . . . . . . . . . . . . . . . . .. . . . . . . . . . 31413.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 316

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 317

14 Person Re-Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31914.1 Re-Identification by Ranking . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 321

14.1.1 Support Vector Ranking . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 32114.1.2 Scalability and Complexity . . . . . . . . . . . . . . . . . . . . .. . . . . . . 32314.1.3 Ensemble RankSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . 324

14.2 Context-Aware Search . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 32614.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 328

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 331

15 Connecting the Dots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33315.1 Global Behaviour Segmentation . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 33315.2 Bayesian Behaviour Graphs . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 337

15.2.1 A Time-Delayed Probabilistic Graphical Model . . . . .. . . . . . 33715.2.2 Bayesian Graph Structure Learning . . . . . . . . . . . . . . .. . . . . . 33915.2.3 Bayesian Graph Parameter Learning . . . . . . . . . . . . . . .. . . . . 34415.2.4 Cumulative Anomaly Score . . . . . . . . . . . . . . . . . . . . . . .. . . . 34515.2.5 Incremental Model Structure Learning . . . . . . . . . . . .. . . . . . 348

15.3 Global Awareness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 35315.3.1 Time-Ordered Latent Dirichlet Allocation . . . . . . . .. . . . . . . . 353

Page 16: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

xviii Contents

15.3.2 On-Line Prediction and Anomaly Detection . . . . . . . . .. . . . . 35515.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . 358

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 360

Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . 363

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 365

Page 17: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Acronyms

1D one-dimensional2D two-dimensional3D three-dimensionalBIC Bayesian information criterionCCA canonical correlation analysisCCTV closed-circuit televisionCONDENSATION conditional density propagationCRF conditional random fieldEM expectation-maximisationDBN dynamic Bayesian networkHCI human computer interactionHCRF hidden conditional random fieldHOG histogram of oriented gradientsFACS facial action coding systemFOV field of viewFPS frame per secondHMM hidden Markov modelKL Kullback-LeiblerLBP local binary patternsLDA latent Dirichlet allocationLPP locality preserving projectionMAP maximum a posterioriMCMC Markov chain Monte CarloMLE maximum likelihood estimationMRF Markov random fieldPCA principal component analysisPGM probabilistic graphical modelPTM probabilistic topic modelPTZ pan-tilt-zoomSIFT scale-invariant feature transformSLPP supervised locality preserving projection

xix

Page 18: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

xx Acronyms

SVM support vector machinexCCA cross canonical correlation analysis

Page 19: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Part IINTRODUCTION

Page 20: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour
Page 21: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Chapter 1About Behaviour

Understanding and interpreting behaviours of objects, andin particular those of hu-mans, is central to social interaction and communication. Commonly, one consid-ers that behaviours are the actions and reactions of a personor animal in responseto external or internal stimuli. There is, however, a plethora of wider considera-tions of what behaviour is, ranging from economical (Simon,1955), organisational(Rollinson, 2004), social (Sherman and Sherman, 1930), to sensory attentional suchasvisual behaviour(Ingle et al., 1982). Visual behaviour refers to the actionsorreactions of a sensory mechanism in response to a visual stimulus, for example,the navigation mechanism of nocturnal bees in dim light (Warrant, 2008), visualsearch by eye movement of infants (Gough, 1962) or drivers inresponse to theirsurrounding environment (Harbluk and Noy, 2002). If visualbehaviour as a searchmechanism is a perceptual function that scans actively a visual environment in orderto focus attention and seek an object of interest among distracters (Ltti et al., 1998),visual analysis of behaviouris a perceptual task that interprets actions and reactionsof objects, such as people, interacting or co-existing withother objects in a visualenvironment (Buxton and Gong, 1995; Gong et al., 2002; Xiangand Gong, 2006).The study of visual analysis of behaviour, and in particularly of human behaviour,is the focus of this book.

Recognising objects visually by behaviour and activity rather than shape andsize plays an important role in a primate visual system (Barbur et al., 1980; Schillerand Koerner, 1971; Weiskrantz, 1972). In a visual environment of multiple objectsco-existing and interacting, it becomes necessary to identify objects not only bytheir appearance but also by what they do. The latter provides richer informationabout objects especially when visual data is spatially ambiguous and incomplete.For instance, some animals, such as most snakes, have a very poor visual sensingsystem, which is unable to capture sufficient visual appearance of objects but verysensitive to movements for detecting preys and predators. The human visual systemis highly efficient for scanning through large quantity of low-level imagery data andselecting salient information for a high-level semantic interpretation and gainingsituational awareness.

3

Page 22: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

4 1 About Behaviour

1.1 Understanding Behaviour

Since 1970s, the computer vision community has endeavouredto bring about in-telligent perceptual capabilities to artificial visual sensors. Computer vision aimsto build artificial mechanisms and devices capable of mimicking the sensing capa-bilities of biological vision systems (Marr, 1982). This endeavour is intensified inrecent years by the need for understanding massive quantityof video data, with theaim to not only comprehend objects spatially in a snapshot but also their spatio-temporal relations over time in a sequence of images. For understanding a dynam-ically changing social environment, a computer vision system can be designed tointerpret behaviours from object actions and interactionscaptured visually in thatenvironment. A significant driver for visual analysis of behaviour is automated vi-sual surveillance, which aims to automatically interpret human activities and detectunusual events that could pose a threat to public security and safety.

If a behaviour is considered the way how an object acts, oftenin relation to otherobjects in the same visual environment, the focus of this book is on visual analysisof human behaviour and behaviours of object that are manipulated by humans, forexample, vehicles driven by people. There are many interchangeable terms used inthe literature concerning behaviour, including activities, actions, events, and move-ments. They correspond to different spatial and temporal context within which abehaviour can be defined. One may consider a behaviour hierarchy of three layers:

1. Atomic actions correspond to instantaneous atomic entities upon which an actionis formed. For example, in a running action, the atomic action could be ‘left legmoving in front of the right leg’. In a returning action in tennis, it could be ‘swingright hand’ followed by ‘rotating the upper body’.

2. Actions correspond to a sequence of atomic actions that fulfil a function or pur-pose. For instance, walking, running, or serving a tennis ball.

3. Activities are composed of sequences of actions over space and time. For exam-ple, ‘a person walking from a living room to a kitchen to fetcha cup of water’,or ‘two people playing tennis’. Whilst actions are likely associated with a singleobject in isolation, activities are almost inevitably concerned with either interac-tions between objects, or an object engaging with the surrounding environment.

In general, visual analysis of behaviour is about constructing models and devel-oping devices for automatic analysis and interpretation ofobject actions and ac-tivities captured in a visual environment. To that end, visual analysis of behaviourfocuses on three essential functions:

1. Representation and modelling: To extract and encode visual information fromimagery data in a more concise form that also captures intrinsic characteristicsof objects of interest;

2. Detection and classification: To discover and search for salient, perhaps alsounique, characteristics of certain object behaviour patterns from large quantityof visual observations, and to discriminate them against known categories of se-mantic and meaningful interpretation;

Page 23: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

1.1 Understanding Behaviour 5

3. Prediction and association: To forecast future events based on the past andcurrent interpretation of behaviour patterns, and to forgeobject identificationthrough behavioural expectation and trend.

We consider that automated visual analysis of behaviour is information process-ing of visual data, capable of not only modelling previouslyobserved object be-haviours, but also detecting, recognising and predicting unseen behavioural patternsand associations.

1.1.1 Representation and Modelling

A human observer can recognise behaviours of interest directly from visual ob-servation. This suggests that imagery data embed useful information for semanticinterpretation of object behaviour. Behaviour representation addresses the questionof what information must be extracted from images and in whatform, so that ob-ject behaviour can be understood and recognised visually. The human visual systemutilises various visual cues and contextual information for recognising objects andtheir behaviour (Humphreys and Bruce, 1989). For instance,the specific stripe pat-tern and its colour is a useful cue for human to recognise a tiger and distinguish itfrom other cats such as lions. Similarly, the movement as well as posture of a tigercan reveal its intended action: running, walking, or about to strike. It is clear thatdifferent sources and types of visual information need be utilised for modelling andunderstanding object behaviour.

A behaviour representation needs to accommodate both cumulative and tempo-ral information about an object. In order to recognise an object and its behaviour,the human visual system relates any visual stimuli falling on to the retina to a setof knowledge and expectation about the object under observation: how it shouldlook like and how it issupposedto behave (Gregory, 1970; Helmholtz, 1962). Be-haviour representation should address the need for extracting visual information thatcan facilitate the association of visual observation with semantic interpretation. Inother words, representation of visual data is part of a computational mechanism thatcontributes towards constructing and accumulating knowledge about behaviour. Forexample, modelling the action of a person walking can be considered as to learnthe prototypical and generic knowledge of walking based on limited observations ofwalking examples, so that when an unseen instance of walkingis observed, it canbe recognised by utilising the accumulateda priori knowledge. An important differ-ence between object modelling and behaviour modelling is that a behaviour modelshould benefit more from capturing temporal information about behaviour. Objectrecognition in large only considers spatial information. For instance, a behaviourmodel is built based on visual observation of a person’s daily routine in an officewhich consists of meetings, tea breaks, paper works and a lunch at certain times ofevery day. What has been done so far can then have a significant influence on thecorrect interpretation of what this person is about to do (Agre, 1989).

Page 24: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

6 1 About Behaviour

A computational model of behaviour performs both representation and match-ing. For representing object behaviour, one considers a model capable of captur-ing distinctive characteristics of an object in action and activity. A good behaviourrepresentation aims to describe an object sufficiently wellfor both generalisationand discrimination in model matching. Model matching is a computational pro-cess to either explain away new instances of observation against known object be-haviours, considered as its generalisation capacity, or discriminate one type of ob-ject behaviour from the others, regarded as its discrimination ability. For effectivemodel matching, a representation needs to separate visual observation of differentobject behaviour types or classes in a representational space, and to maintain suchseparations given noisy and incomplete visual observations.

1.1.2 Detection and Classification

Generally speaking, visual classification is a process of categorising selected visualobservations of interest into known classes. Classification is based on an assump-tion that segmentation and selection of interesting observations have already beentaken place. On the other hand, visual detection aims to discover and locate pat-terns of interest, regardless of class interpretation, from a vast quantify of visualobservations. For instance, for action recognition, a model is required to detect andsegment instances of actions from a continuous observationof a visual scene. De-tection in crowded scenes, such as detecting people fightingor falling in crowd,becomes challenging as objects of interest can be swamped bydistracters and back-ground clutters. To spot and recognise actions from a sea of background activities,the task of detection often poses a greater challenge than classification.

The problem of behaviour detection is further compounded when the behaviourto be detected is unknowna priori. A common aim of visual analysis of behaviouris to learn a model that is capable of detecting unseen abnormal behaviour pat-terns whilst recognising novel instances of known normal behaviour patterns. Tothat end, an anomaly is defined as an atypical and un-random behaviour pattern notrepresented by sufficient observations. However, in order to differentiate anomalyfrom trivial unseen instances or outright statistical outliers, one should consider thatan anomaly satisfies a specificity constraint to known normalbehaviours, i.e. trueanomalies lie in the vicinity of known normal behaviours without being recognisedas any.

1.1.3 Prediction and Association

An activity is usually formed by a series of object actions executed following certaintemporal order at certain durations. Moreover, the ordering and durations of con-stituent actions can be highly variable and complex. To model such visual observa-

Page 25: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

1.2 Opportunities 7

tions, a behaviour can be considered as a temporal process, or a time series function.An important feature of of a model of temporal processes is tomake prediction. Tothat end, behaviour prediction is concerned with detectinga future occurrence ofa known behaviour based on visual observations so far. For instance, if the dailyroutine of a person’s activities in a kitchen during breakfast time is well understoodand modelled, the model can facilitate prediction of this person’s next action whencertain actions have been observed: the person could be expected to make coffee af-ter finishing frying an egg. Behaviour prediction is particularly useful for explainingaway partial observations, for instance, in a crowded scenewhen visual observationis discontinuous and heavily polluted, or for detecting andpreventing likely harmfulevents before they take place.

Visual analysis of behaviour can assist object identification by providing contex-tual knowledge on how objects of interest should behave in addition to how theylook. For instance, human gait describes the way people walkand can be a usefulmeans to identify different individuals. Similarly, the way people perform differentgestures may also reveal their identities. Behaviour analysis can help to determinewhen and where a visual identification match is most likely tobe valid and relevant.For instance, in a crowded public place such as an airport arrival hall, it is infea-sible to consider facial imagery identification for all the people all the time. A keyto successful visual identification in such an environment is effective visual search.Behaviour analysis can assist in determining when and whereobjects of interestshould be sought and matched against. Moreover, behaviour analysis can providefocus of attention for visual identification. Detecting people acting out of norm canactivate identification with improved effectiveness and efficiency. Conversely, in or-der to derive a semantic interpretation of an object’s behaviour, knowing what andwho the object is can help. For instance, a train station staff’s behaviour can bedistinctively different from that of a normal passenger. Recognising a person as amember of staff in a public space can assist in interpreting correctly the behaviourof the person in question.

1.2 Opportunities

Automated visual analysis of behaviour provides some key building blocks towardsan artificial intelligent vision system. To experiment withcomputational models ofbehaviour by constructing automatic recognition devices may help us with betterunderstanding of how the human visual system bridges sensory mechanisms and se-mantic understanding. Behaviour analysis offers a great deal of attractive opportuni-ties for application, despite that deploying automated visual analysis of behaviour toa realistic environment is still at its infancy. Here we outline some of the emergingapplications for automated visual analysis of behaviour.

Page 26: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

8 1 About Behaviour

1.2.1 Visual Surveillance

There has been an accelerated expansion of closed-circuit television (CCTV)surveillance in recent years, largely in response to risinganxieties about crime andits threat to security and safety. Visual surveillance is tomonitor the behaviour ofpeople or other objects using visual sensors, typically CCTV cameras. Substantialamount of surveillance cameras have been deployed in publicspaces, ranging fromtransport infrastructures, such as airports and underground stations, to shopping cen-tres, sport arenas and residential streets, serving as a tool for crime reduction and riskmanagement. Conventional video surveillance systems relyheavily on human op-erators to monitor activities and determine the actions to be taken upon occurrenceof an incident, for example, tracking suspicious target from one camera to anothercamera, or alerting relevant agencies to areas of concern. Unfortunately, many ac-tionable incidents are simply miss-detected in such a manual system due to inherentlimitations from deploying solely human operators eyeballing CCTV screens. Theselimitations include: (1) excessive number of video screensto monitor, (2) boredomand tiredness due to prolonged monitoring, (3) lack ofa priori and readily accessibleknowledge on what to look for, and (4) distraction by additional operational respon-sibilities. As a result, surveillance footages are often used merely as passive recordsfor post-event investigation. Miss-detection of important events can be perilous incritical surveillance tasks such as border control or airport surveillance. It has be-come an operational burden to screen and search exhaustively colossal amount ofvideo data generated from growing number of cameras in public spaces. Automatedcomputer vision systems for visual analysis of behaviour provide the potential fordeploying never-tiring computers to perform routine videoanalysis and screeningtasks, whilst assisting human operators to focus attentionon more relevant threats,thus improving the efficiency and effectiveness of a surveillance system.

1.2.2 Video Indexing and Search

We are living in a digital age with huge amount of digital media, especially videos,being generated at every single moment in the forms of surveillance videos, on-line news footages, home videos, mobile videos, and broadcasting videos. How-ever, once generated, very rarely they are watched. For instance, most visual datacollected by surveillance systems are never watched. The only time when they areexamined is when a certain incident or crime has occurred anda law enforcementorganisation needs to perform a post-event analysis. Unless specific time of the inci-dent is known, it is extremely difficult to search for an eventsuch as someone throwsa punch in front of a nightclub. For a person with a large home video collection, itis a tedious and time consuming task to indexing the videos sothat they can besearched efficiently for footages of a certain type of actions or activities from yearsgone by. For film or TV video archive, it is also a very challenging task to searchfor a specific footage without text meta information, specific knowledge about the

Page 27: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

1.2 Opportunities 9

name of a subject, or the time of an event. What is missing and increasingly desiredis the ability to visually search archives by what has happened, that is, automatedvisual search of object behaviours with categorisation.

1.2.3 Robotics and Healthcare

A key area for robotics research in recent years is to developautonomous robotsthat can see and interact with people and objects, known as social robots (Breazeal,2002). Such a robot may provide a useful device in serving an aging society, as acompanion and tireless personal assistant to elderly people or people with a dis-ability. In order to interact with people, a robot must be able to understand the be-haviour of the person who is interacting with. This ability can be based on gesturerecognition, such as recognising waving and initialising ahand-shake, interpretingfacial expression, and inferring intent by body posture. Earlier robotics research hadfocused more on static object recognition, manipulation and navigation through astationary environment. More recently, there has been a shift towards developingrobots capable of mimicking human behaviour and interacting with people. To thatend, enabling a robot to perform automated visual analysis of human behaviour be-comes essential. Related to the development of a social robot, personalised health-care in an aging society has gained increasing prominence inrecent years. To beable to collect, disseminate and make sense of sensory information from and to anelderly person in a timely fashion is the key. To that end, automated visual analysisof human behaviour can provide quantitative and routine assessment of a person’sbehavioural status needed for personalised illness detection and incident detection,e.g. a fall. Such sensor based systems can reduce the cost of providing personalisedhealth care, enabling elderly people to lead a more healthy and socially inclusivelife style (Yang, 2006).

1.2.4 Interaction, Animation and Computer Games

Increasingly more intelligent and user friendly human computer interaction (HCI)are needed for applications such as a game console that can recognise a player’s

gesture and intention using visual sensors, and a teleconferencing system that cancontrol cameras according to the behaviour of participants. In such automated HCIsystems using sensors, effective visual analysis of human behaviour is central tomeaningful interaction and communication. Not surprisingly, animation for filmproduction and gaming industries are also relying more on automated visual analy-sis of human behaviour for creating special effects and building visual avatars thatcan interact with players. By modelling human behaviour including gesture and fa-cial expression, animations can be generated to create virtual characters, known as

Page 28: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

10 1 About Behaviour

avatars, in films and for computer games. With players’ behaviour recognised auto-matically, these avatars can also interact with players in real-time gaming.

1.3 Challenges

Understanding object behaviour from visual observation alone is challenging be-cause it is intrinsically an ill-posed problem. This is equally true for both humansand computers. Visual interpretation of behaviour can be ambiguous and is subjectto changing context. Visually identical behaviours may have different meanings de-pending on the environment in which activities are taken place. For instance, whena person is seen waving on a beach, is he greeting somebody? swatting an insect? orcalling for help as his friend is drowning? In general, visual analysis of behaviourfaces two fundamental challenges.

1.3.1 Complexity

Compared to object recognition in static images, an extra dimension of time needsbe considered in modelling and explaining object behaviour. This makes the prob-lem more complex. Let us consider human behaviour as an example. Human has anarticulated body and the same category of body behaviours can be acted in differentways largely due to temporal variations, for example, waving fast versus slowly.This results in behaviours of identical semantics look visually different, knownas large intra-class variation. On the other hand, behaviours of different semanticclasses, such as jogging versus running, can be visually similar, known as smallinter-class variation. Beyond single object behaviour, a behaviour can be of multi-ple interacting objects characterised by their temporal ordering. In a more extremecase, a behaviour is defined in the context of a crowd where many people co-existboth spatially and temporally. In general, behaviours are defined in different spatialand temporal context.

1.3.2 Uncertainty

Based on visual information alone to describe object behaviour is inherently partialand incomplete. Unlike a human observer, when a computer is asked to interpretbehaviour without other sources of information except imagery data, the problemis compounded by visual information only available in two-dimensional images ofa three-dimensional space, lack of contextual knowledge, and in the presence ofimaging noise.

Page 29: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

1.4 The Approach 11

Two-dimensional visual data give rise to visual occlusion on objects under obser-vation. This renders not all behavioural information can beobserved visually. Forinstance, for two people interacting with each other, depending on the camera angle,almost inevitably part of or the full body of a person is self-occluded. As a result,semantic interpretation of behaviour is made considerablyharder when only partialinformation is available.

Behaviour interpretation is highly context dependent. However, contextual infor-mation is not always directly observable, nor necessarily always visual. For instance,on a motorway when there is a congestion, a driver often wishes to find out the causeof the congestion in order to estimate likely time delay, whether the congestion isdue to an accident or road work ahead. However, that information is often unavail-able in the driver’s field of view, as it is likely to be locatedmiles away. Takinganother example, on a train platform, passengers start to leave the platform due toan announcement by the station staff that the train line is closed due to signal failure.This information is in audio form therefore not captured by visual observation onpassenger behaviours. To interpret behaviour by visual information alone introducesadditional uncertainty due to a lack of access to non-visualcontextual knowledge.

Visual data are noisy, either due to sensor limitations or because of operationalconstraints. This problem is particularly acute for video based behaviour analysiswhen video resolution is often very low both spatially and temporally. For instance,a typical 24 hours 7 days video surveillance system in use today generates videofootages with a frame-rate of less than three frames per second, and with heavycompression for saving storage space. Imaging noise degrades visual details avail-able for analysis. This can further cause visual information processing to introduceadditional error. For instance, if object trajectories areused for behaviour analysis,object tracking errors can increase significantly in low frame-rate and highly com-pressed video data.

1.4 The Approach

We set out the scope of this book by introducing the problem ofvisual analysisof behaviour. We have considered the core functions of behaviour analysis from acomputational perspective, and outlined the opportunities and challenges for visualanalysis of behaviour. In the remaining chapters of Part I, we first give an overviewon different domains of visual analysis of behaviour to highlight the importance andrelevance of understanding behaviour in context. This is followed by an introductionto some of the core computational and machine learning concepts used throughoutthe book. Following Part I, the book is organised into further three parts accordingto the type of behaviour and the level of complexity involved, ranging from facialexpression, human gesture, single object action, multipleobject activity, crowd be-haviour analysis, to distributed behaviour analysis.

Page 30: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

12 1 About Behaviour

Part II describes methods for modelling single-object behaviours including facialexpression, gesture, and action. Different representations and modelling tools areconsidered and their strengths and weaknesses are discussed.

Part III is dedicated to group behaviour understanding. We consider modelsfor exploring context to fulfil the task of behaviour profiling and abnormal be-haviour detection. Different learning strategies are investigated, including super-vised learning, unsupervised learning, semi-supervised learning, incremental andadaptive learning, weakly-supervised learning, and active learning. These learningstrategies are designed to address different aspects of a model learning problem indifferent observation scenarios according to the availability of visual data and hu-man feedback.

Whilst Part II and Part III consider behaviours observed froma single cam-era view, Part IV addresses the problem of understanding distributed behavioursfrom multiple observational viewpoints. An emphasis is specially placed on non-overlapping multi-camera views. In particular, we investigate the problems of be-haviour correlation across camera views for camera topology estimation and globalanomaly detection, and the association of people across non-overlapping cameraviews, known as re-identification.

Page 31: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

References 13

References

P.E. Agre.The dynamic structure of everyday life. PhD thesis, Massachusetts Insti-tute of Technology, Cambridge, MA, USA, 1989.

J.L. Barbur, K.H. Ruddock, and V.A. Waterfield. Human visualresponses in theabsence of the geniculo-calcarine projection.Brain, 103(4):905–928, 1980.

C.L. Breazeal, editor.Desiging Sociable Robots. The MIT Press, 2002.H. Buxton and S. Gong. Visual surveillance in a dynamic and uncertain world.

Artificial Intelligence, 78(1-2):431–459, 1995.S. Gong, J. Ng, and J. Sherrah. On the semantics of visual behaviour, structured

events and trajectories of human action.Image and Vision Computing, 20(12):873–888, October 2002.

D. Gough. The visual behaviour of infants in the first few weeks of life. Proceedingsof the Royal Society of Medicine, 55(4):308–310, April 1962.

R.L. Gregory.The Intelligent Eye. Weidenfeld and Nicolson, London, 1970.J.L. Harbluk and Y.I. Noy. The impact of cognitive distraction on driver visual

behaviour and vehicle control. Technical Report TP 13889 E,Road Safety Direc-torate and Motor Vehicle Regulation Directorate, CanadianMinister of Transport,2002.

H. von Helmholtz.Popular Scientific Lectures, chapter “The Recent Progress of theTheory of Vision”. Dover Publications, 1962.

G.W. Humphreys and V. Bruce.Visual Cognition: Computational, Experimentaland Neuropsychological Perspectives. Lwarence Erlbaum Associates, Hove, EastSussex, 1989.

D.J. Ingle, M.A. Goodale, and R.J.W. Mansfield, editors.Analysis of Visual Be-haviour. The MIT Press, January 1982.

L. Ltti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapidscene analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence,20(11):1254–1259, November 1998.

D. Marr. Vision: A Computational Investigation into the Human Representation andProcessing of Visual Information. W.H.Freeman & Co., 1982.

D. Rollinson. Organisational Behaviour and Analysis: An Integrated Approach.Prentice Hall, December 2004.

P.H. Schiller and F. Koerner. Discharge characteristics ofsingle units in superiorcolliculus of the alert rhesus monkey.Journal of Neurophysiology, 34(5):920–935, September 1971.

M.D. Sherman and I.C. Sherman. The process of human behaviour. Journal ofMental Science, 76:337–338, 1930.

H.A. Simon. A behavioral model of rational choice.The Quarterly Journal ofEconomics, 69(1):99–118, February 1955.

E.J. Warrant. Seeing in the dark: Vision and visual behaviour in nocturnal bees andwasps.Journal of Experimental Biology, 211:1737–1746, May 2008.

L. Weiskrantz. Review lecture: Behavioural analysis of themonkey’s visual nervoussystem.Proceedings of the Royal Society, 182:427–455, 1972.

Page 32: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

14 References

T. Xiang and S. Gong. Beyond tracking: Modelling activity and understandingbehaviour.International Journal of Computer Vision, 67(1):21–51, May 2006.

G.Z. Yang, editor.Body Sensor Networks. Springer, May 2006.

Page 33: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index

Symbols

1-norm 821D seeone-dimensional24/7 11, 189, 2242D seetwo-dimensional2D view 1163D seethree-dimensional3D model 1163D model based tracking 1163D skeleton 116

A

absolute difference vector 322accumulation factor 175action 4action cuboid 162action detection 21, 158action localisation 158action model 143, 149action recognition 19, 141action representation 143, 149

bag-of-words 48action unit 17active learning 61, 62, 279, 285

adaptive weighting 287critical example 286expected error reduction 285expected model change 285likelihood criterion 285, 291pool-based 285query criterion 285stream-based 285uncertainty criterion 285, 292

activity 4, 23, 196, 230activity classification 197

activity graph 196activity model 191activity recognition 173activity representation

bivariate time-series 306hierarchical 196

activity segmentation 180activity transition matrix 196acyclicity check 344AdaBoost 83affective body gesture 131affective state 131, 135affinity 208affinity feature vector 208affinity matrix 208, 250, 289, 307

eigenvector 208normalised 209

affinity metric 207aging society 9alignment 76, 87alternative hypothesis 216ambient space 84animation 9annotation 162

automated 159, 162, 165learning 162

anomalous region 260anomaly 6, 205, 258, 261

specificity 6anomaly detection 59, 213

cumulative factor 216on-line likelihood ratio test 216probe pattern 213sensitivity 216, 261specificity 216, 261threshold 216

appearance feature 28

365

Page 34: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

366 Index

association 5, 7atomic action 4, 122, 125atomic component seegestureatypical co-occurrence 240automated visual surveillance 4automatic model order selection 209avatar 9, 109average Kullback-Leibler divergence 293

B

background subtraction 48, 306bag-of-words 57, 133, 149, 157, 231

document size 238batch EM 221batch learning 236,see alsomodel learningbatch mode seemodel learningBaum-Welch algorithm 193, 334Bayes factor 197Bayes net seeBayesian networkBayes’ rule 128, 290Bayesian adaptation 60Bayesian classification 289Bayesian data association 106, 107Bayesian filtering 236Bayesian graph 337, 342,see alsostatic

Bayesian networkBayesian information criterion 178, 186,

192, 334, 342model under-fitting 210

Bayesian learning 344, 350conjugate prior 290

Bayesian model selection 197Bayesian network 110

belief revision 111belief update 111clique 113double count evidence 114dynamic 53explain away evidence 114explanation 111extension 111join tree 113most probable explanation 111singly connected 113static 53

Bayesian network learningbatch mode 348incremental 349mutual information 339prior structure 340scoring function 342

Bayesian Occam’s razor 277Bayesian parameter inference 233

Bayesian saliency 236Bayesian temporal model 92behaviour 3

abnormal 205computational model 6correlational context 247, 252global interpretation 303latent intent 303modelling 5normal 213profile discovery 206semantics 54spatial context 247, 248, 306temporal context 248, 252temporal segmentation 334

behaviour affinity matrix 208behaviour class distribution 283behaviour classification

on-line 275behaviour correlation 26behaviour detection 6behaviour hierarchy 4behaviour interpretation

multivariate space 51behaviour localisation 278behaviour model

abnormality 219abnormality update 222adaptive 218approximate abnormal 220bootstrapping 219complexity 229composite 213hierachical structure 229incremental learning 218incremental update 220initialisation 219normality 218normality update 221rarity 219rejection threshold 220uncertainty 229

behaviour pattern 23, 205, 206behaviour posterior 233behaviour prediction 7behaviour profile 278behaviour profiling 59, 201, 205

automatic 201, 205unsupervised 205, 207

behaviour representation 5, 43, 54, 287behaviour-footprint 250behavioural context 247, 349

abrupt change 349gradual change 349

Page 35: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index 367

behavioural interpretation 51, 196semantics 54statistical learning 51

behavioural saliency 236, 242irregularity 242known 243rarity 242unknown 243

behavioural surprise 238, 240between-class scatter matrix 81between-sets covariance matrix 312,see also

canonical correlation analysisBhattacharyya distance 319BIC seeBayesian information criterionBIC score 334, 342big picture 333, 335binary classification 205, 283

false positive 205sensitivity 205specificity 205true positive 205

binary classifier 81, 156binary decision 81biometric feature 28biometrics 319body gesture 17body language 18Boolean function 325boosting 83boosting learning 321bottom-up 51, 196bottom-up event detection 200bottom-up segmentation 188bounding box 44, 105,see alsoobject

C

camera connectivity matrix 315camera topology inference 28camera view

spatial gap 27temporal gap 27

canonical correlation 311canonical correlation analysis 134

between-sets covariance matrix 134canonical factor 134canonical variate 134, 311optimal basis vector 311principal angle 134within-set covariance matrix 134

canonical direction 231cardinal direction seecanonical directioncascaded LDA 253cascaded learning 83

cascaded topic model 248, 252Catch-22 problem 180, 327CCA seecanonical correlation analysisCCTV seeclosed-circuit televisionchain code 142Chi square statistic 80Chow-Liu tree 340class relevance measure 155class separability 284class-conditional marginal likelihood 277classification 4, 6, 53, 75classification sensitivity 205classifier 79closed-circuit television 8, 303closed-world 304cloud of interest points seecloud-of-pointscloud-of-points 149, 157

class relevance measure 155global scalar feature 155

clustering 58, 105, 126, 177, 256co-occurrence 29, 57, 230, 231co-occurring

event 230, 235topic 235

coarse labelling 262codebook 289colour space

hue-saturation 103complexity 10computational complexity 56, 342computational tractability 230, 343computer games 9, 18computer vision 4CONDENSATION algorithm 106, 119, 129conditional density propagation 106, 129conditional independence 53, 290, 339

test 340conditional probability 54, 234conditional probability distribution 51, 290,

339conditional random field 143,see also

Markov random fieldconjugate prior 344

BDeu prior 345connected component 105, 177connecting-the-dots 29, 333consensus probability seequery-by-

committeecontext 11

global 248, 252, 256local 248multiple scale 248regional 252

context dependent 11

Page 36: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

368 Index

context-aware 30contextual event 174, 249contextual information 11contextual knowledge 7contextually incoherent 258, 261continuous variable 55convolution operator 132corpus

topic model 240correlation scaling factor 308correspondence problem 28, 44cost function 80, 192

penalty term 192tractability 192

coupled hidden Markov model 56covariance matrix 117CRF seeconditional random fieldcross canonical correlation analysis 311cross-validation 162, 323crowd behaviour 24crowd flow analysis 50cumulative abnormality score 347cumulative scene vector 182

trajectory 183curse of dimensionality 85, 208curvature primal sketch 124

D

datagallery 237informative 285probe 237test 237testing 51training 18, 51, 57, 237

data association 106data mining 201, 207, 284DBN seedynamic Bayesian networkdecay factor 175, 347deceptive intent 21decision boundary 62, 284detection 4, 6, 158detection rate 347diagonal matrix 86dimensionality reduction 85, 117, 154, 208

eigen-decomposition 209feature selection 155

directed acyclic graph 110, 337,see alsoBayesian network

directed probabilistic link 192Dirichlet distribution 267, 270, 290, 292

posterior 292prior 232

Dirichlet-multinomial conjugate structure234

disagreement measureseequery-by-committee

discrete variable 55, 111, 337discrete-curve-evolution 188discrimination 6discriminative model 51displacement vector 49dissimilarity metric 307distance metric 58, 207distributed behaviour 26distributed camera network seemulti-camera

systemdistribution overlap 283document

querying 231similarity matching 231

dominant motion direction 289dynamic Bayesian network 55, 185, 191,

208, 230, 333topology 185

dynamic correlationparameter 191structure 191

dynamic programming 344dynamic scene analysis 191dynamic texture 84dynamic time warping 207dynamic topic model 230, 238dynamically-multi-linked HMM 56, 191

E

eigen-decomposition 209eigenvalue 86, 117, 210, 312eigenvector 117, 312

relevance learning 210selection 210

EM seeexpectation-maximisationEM algorithm seeexpectation-maximisation

algorithmemerging global behaviour 334, 345emotion 131ensemble learning 324ensemble RankSVM 324

learning weak ranker 324entropy ratio 177ergodic model 162, 334Euclidean 85, 188event

atomic action 51classification 177clustering 177

Page 37: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index 369

contextual 50recognition 198

event-based representation 50, 174exact learning 344expectation-maximisation 59, 270expectation-maximisation algorithm 104,

186E-step 186, 222M-step 186, 222

expression classification 75expression manifold 84extended Viteribi algorithm 199

F

face recognition 7face detection 83, 105facial action coding system 17facial expression 15, 75

classification 75recognition 15, 75representation 75

FACS seefacial action coding systemfactored sampling 129factorisation 56, 57false alarm 61, 283, 347false positive 165, 205feature extraction 152feature fusion 157feature representation 157

global 157joint 157local 157

feature selection 83, 150, 154kernel learning 155unsupervised 210variance 154

feature space 80, 81, 133correlation 133fusion 133

field of view 303non-overlapping 303

filtering threshold 352fitness value 120focus of attention 7forward-backward algorithm 192, 193,see

alsoBaum-Welch algorithmforward-backward procedureseeforward-

backward algorithmforward-backward relevance 187frame differencing 306frame rate 305fundamental pattern seeuniform patternfusion 133

expression and gesture 133

G

Gabor filter 109, 132, 151carrier 151envelope 151

Gabor wavelets 76,see alsoGabor filtergait analysis 23gallery image 92, 320Gamma distribution 292

shape parameter 292Gamma function seeGamma distributionGaussian mixture model seemixture of

GaussianGaussian noise 129generalisation 6, 53generalised eigenvalue problem 86generative model 53, 231, 266gesture 101

affective 19atomic action 122atomic component 122non-affective 19sign language 19

gesture component space 125gesture recognition 17, 101

body language 101emotional state 101expression 101hand gesture 101

gesture segmentation 122gesture tracking 102Gibbs sampler 234, 270

collapsed 270Gibbs sampling 230, 234, 270, 277Gibbs-EM algorithm 271, 273global abnormal behaviour 333, 347, 357global activity 29global activity analysis 29, 303global activity phase 334global anomaly detection 345

cumulative anomaly score 345global awareness 353global Bayesian behaviour graph 337global behaviour segmentation 333global behaviour time series 334global context LDA 256global situational awareness 29, 333global trail 29gradient descent 145gradient feature 50graph classification problem 141graph pruning 339, 351

Page 38: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

370 Index

graph theory 53graphical model 54

directed acyclic 54undirected 143, 146

graphical model complexity 192group activity 23, 173, 191

event-based 174modelling 173supervised learning 173

group association context 28

H

hand orientation 109histogram 114

hard negative mining procedure 163hard-wired 261harmonic mean 241, 277HCI seehuman computer interactionHCRF seehidden conditional random field

observable 146potential function 145, 146

healthcare 9heterogeneous sensor 27, 30, 303heuristic search 341, 344hidden conditional random field 143,see

alsoMarkov random fieldgraphical model 146initialisation 146non-convexity 146observable 146potential function 146undirected graph 146

hidden holistic pattern 333hidden Markov model 55, 129, 185, 333

HMM channelling 129HMM pathing 147on-line filtering 335

hidden state variable 55hierachical behaviour discovery 229hierarchical clustering 235hierarchical hidden Markov model 56hierarchical model 230hierarchical topic 256hinge loss function 157, 323histogram dissimilarity measure 80histogram of oriented gradients 50HMM seehidden Markov modelHOG seehistogram of oriented gradientsholistic awareness 29holistic context 29holistic interpretation 29human action 19human body

2D view ambiguity 1162D view model 1163D skeleton model 116nonlinear space 117observation distance 121observation subspace 121skeleton distance 121skeleton subspace 121view ambiguity 122

human computer interaction 9, 16, 18human feedback 61, 262, 284

query criterion 291human gait 7human gesture 17human intent 21human silhouette 116

motion moment 143spectrum feature 142

human-guided data mining 279hybrid 2D-3D representation 116hyper-parameter 233, 268, 271

Dirichlet prior 271hyperplane 80hypothesis test 216

I

identification 7ill-posed 10, 180, 326image cell 267image classification 75image descriptor 48imbalance class distribution 287importance sampling 277inactivity 181inactivity break-up 182incremental EM seeincremental

expectation-maximisationincremental expectation-maximisation 59,

218, 221off-line 221on-line 221

incremental learning 289, 348, 349incremental structure learning

Naïve method 349information processing 5inter-camera

appearance variance 303inter-camera gap 303inverse kinematics 116irregularity 237,see alsobehavioural

saliencyIsomap 85isotropic distribution 102

Page 39: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index 371

J

joined-up reasoning 305joint feature vector descriptor 143joint probability distribution 53, 314

K

k-means 207, 209, 256, 334k-nearest neighbour 82, 94, 164, 320

greedy 164K2 algorithm 341

re-formulation 341kernel function 80, 156kernel learning 155, 321kernel space 80key-frame matching 207KL divergence seeKullback-Leibler

divergenceKullback-Leibler divergence 255, 277, 287,

294distance metric 294non-symmetric 294symmetric 294

L

L1 norm 320label switching 237, 243Lagrange multiplier 80Laplace Beltrami operator 85Laplacian eigenmap 85Laplacian matrix 86largest eigenvector 209latent behaviour 235latent Dirichlet allocation 29, 231, 253, 267,

353time-ordered 353

latent structure 232latent topic 231LBP seelocal binary patternLBP histogram 78, 83LDA seelatent Dirichlet allocationlearning

model parameter 193model structure 191procedure 234rate 221strategy 57

learning strategy seelearningLevenshtein distance 207likelihood function 94likelihood score 291likelihood test 60

linear combination 311linear dimensionality reduction 85linear discriminant analysis 81, 231linear programming 82linearly non-separable 80local (greedy) searching 211local binary pattern 76local block behaviour pattern 287local motion event 230localisation 158locality preserving projection 85

supervised 88locally linear embedding 85log-likelihood 104, 145, 213, 254, 255, 345,

356normalised 213

logistic regression 163low-activity region 311LPP seelocality preserving projection

M

man-in-the-loop 279manifold 84manifold alignment 87MAP seemaximum a posteriorimarginal likelihood 236, 240, 277marginal probability distribution 314Markov chain 127, 162, 230Markov chain Monte Carlo 233, 234, 278,

344Markov clustering topic model 57, 230, 231Markov process 127, 128

first-order 128, 334gesture 127second-order 128state 128state space 128

Markov property 93, 128Markov random field

conditional random field 143hidden conditional random field 143

Markovian assumption 335,see alsoMarkovproperty

maximum a posteriori 237, 349maximum canonical correlation 312maximum likelihood 104maximum likelihood estimation 104, 145,

178, 192, 211, 344maximum margin 80maximum posterior probability 221MCMC seeMarkov chain Monte Carlomean field approximation 277mean shift 105

Page 40: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

372 Index

message-passing process 111metadata feature 231minimum description length 126, 147minimum eigenvalue solution 86mixing probability 211, 220mixture component trimming 223mixture model 126mixture of Gaussian 103, 126, 129, 147mixture of probabilistic principal components

59MLE seemaximum likelihood estimationmodel exploitation 286model exploration 286model inference 270model input

human annotation 284model learning

batch mode 325, 339, 348incremental 339, 348shared common basis 266statistical insufficiency 272structure 267

model matching 6model order 178model order selection 126, 178, 208, 334

automatic 209model output

human feedback 284model over-fitting 233, 240, 266model parameter update 220model robustness 229model sensitivity 216model specificity 216model structure 191

adaptation 223discovery 191

model topology 191model under-fitting 210, 251model-assisted search 279modelling 4

behaviour 43gesture 101

most probable explanation 192motion

descriptor 47, 162feature 177template 47

motion direction profile 289motion history volume 46motion moment 102

first-order 102second-order 102trajectory 102zeroth-order 102

motion-history-image 46multi-camera behaviour correlation 26multi-camera distributed behaviour 26multi-camera system 26, 303

blind area 27disjoint views 303surveillance 303

multi-camera topology 314connected regions 315region connectivity matrix 315

multi-channel kernel 162multi-instance learning 159multi-object behaviour 230multi-observation hidden Markov model 56,

185multi-observation HMM seemulti-

observation hidden Markov modelmulti-view activity representation 306multimodal 210multinomial distribution 231, 290, 294, 339

conjugate 345multiple kernel learning 155mutual information 314

analysis 313

N

naïve Bayesian classifier 290natural gesture 122nearest neighbour 60, 86nearest neighbour classifier 82,see also

k-nearest neighbournegative training video 163network scoring 352neutral expression 86Newton optimisation 323non-visual context 114nonverbal communication 131normal distribution 211normalised probability score 291normality score 260

O

objectappearance attribute 44bounding box 44descriptor 44tracking 44, 327tracklet 46trajectory 44

object association 27object association by detection 327object representation 47

Page 41: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index 373

bag-of-words 47constellation model 47

object segmentation 47, 48, 180object-based representation 43objective function 81, 86, 145, 156, 322observation

distributed viewpoints 303multiple viewpoints 303oversampling 303single viewpoint 303

observation probability 55observation variable 55observational space 185

factorisation 185off-line processing 183on-line anomaly detection 206, 213

likelihood ratio test 217maximum likelihood estimation 217

on-line filtered inference 236on-line likelihood ratio test 216on-line model adaptation 59on-line processing 183, 230on-line video screening 237one class learning 283one-dimensional 122, 185one-shot learning 266, 273one-versus-rest 81, 83, 156, 217open-world 310optical flow 47, 49, 231, 287

Lucas-Kanade method 287optimisation problem 156ordering constraint 344outlier 59outlier detection 266, 278, 283over-fitting seemodel over-fittingoversampling 83

P

pair-wise correlation 310, 337pan-tilt-zoom 27parallel hidden Markov model 56parameter learning 344

maximum likelihood estimation 344part-based representation 47PCA seeprincipal component analysisPearson’s correlation coefficient 307pedestrian detection 83person re-identification 24, 28, 319

absolute similarity score 321context-aware search 326contextual information 327matching criterion 320ranking 321

relative ranking 321personalised healthcare 9phoneme 122pipelined approach 229pixel-based representation 48pixel-change-history 49, 174point clouds 150point distribution model 116, 117point estimation 270Polya distribution 269polynomial 81positive importance weight 323posterior distribution 255posterior probability 93prediction 5, 131

blind search 131guided search 131

predictive likelihood 233, 237, 241Prewitt edge detector 152Prim’s algorithm 340principal component analysis 85, 117

dimensionality reduction 117hierarchical 117

prior domain knowledge 243probabilistic dynamic graph 191probabilistic graphical model 51, 53, 191,

337probabilistic latent semantic analysis 53, 258probabilistic relative distance comparison

320probabilistic topic model 53, 56, 230

hierarchical Dirichlet processes 53latent Dirichlet allocation 53

probability density function 103, 210probability theory 54probe behaviour pattern 216probe image 92, 319, 320probe pattern 213probe video 259profiling 59PTM seeprobabilistic topic modelPTZ seepan-tilt-zoom

Q

Quasi-Newton optimisation 145query criterion 291,see alsohuman

feedbackadaptive selection 294

query score 291query-by-committee 286, 292

consensus probability 293disagreement measure 293modified 286

Page 42: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

374 Index

vote entropy 292

R

radial basis function 81random process 125random variable 53, 110, 134

multivariate 54RankBoost 321ranking function 322ranking problem 321RankSVM 321

complexity 323ensemble 324scalability 323spatial complexity 324

rare behaviour 262, 266benign 284classification 265model 265

rarity seebehavioural saliencyre-identification 12,see alsoperson

re-identificationreasoning 109regional

context 253correlational context 254temporal context 254

regional activity 316regional activity affinity matrix 312regional activity pattern 310, 334regional temporal phase 256relative entropy 287relative ranking score 322relevance feedback 284relevance learning 210relevance rank 322representation 4representation space 6robust linear programming 82robustness 205, 230, 333rule-based 53, 58run-time anomaly measure 216

S

salient behaviour 230, 242salient pixel group 177salient point 161sampling bias 210scalability 333scale-invariant feature transform 48, 161scaling factor 251

spatial 308

scatter matrixbetween-class 82within-class 82

scene decomposition 289, 306activity-based 307

scene layout 305scene vector 181scoring function optimisation 340seeding event 174segmentation 47, 48, 122, 180, 250self-occlusion 116, 120semantic gap 190semantic region 247semantic scene

decomposition 250segmentation 251

semantic video indexing 190semantic video search 190semantics 54semi-supervised learning 60, 242, 285sensitivity 216, 230, 333,see alsomodel

sensitivitysequential minimisation optimization 157shape descriptor 142

chain code 143shape template 47SIFT seescale-invariant feature transformSIFT descriptor 161silhouette 20, 47, 142similarity measure 86single-camera system 303situational awareness 29, 247

holistic 303skeleton model 116skin colour 103, 107

model 103sliding-window 158, 162, 188, 189, 353SLPP seelocality preserving projectionsocial robot 9space-time cloud 149space-time cuboid 162space-time descriptor 149space-time interest point 48, 132, 149

bag-of-words 133cloud 150cuboid 132descriptor 132detection 132histogram 132prototype 132prototype library 132volume 132

space-time shape template 47sparse data 210

Page 43: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

Index 375

sparse example 262sparsity problem 272spatial topology 316spatio-temporal space 17specificity 216,see alsomodel specificityspectral clustering 208, 250, 289, 334spectrum feature 142state inference 198state space 117, 191

factorisation 191state transition 18state transition probability 55static Bayesian network 54statistical behaviour model 51, 57statistical insufficiency 266stochastic process 18, 128string distance 207strong classifier 83structure learning 339

constraint-based learning 340subspace 81

nonlinear 117subspace analysis 81subtle behaviour 266

model 265subtle unusual behaviour 283sufficient statistics 221, 350, 351supervised behaviour modelling 266supervised learning 58, 173

limitation 266support vector 80support vector machine 80, 105, 135, 156,

163, 321support vector ranking 321synchronised global behaviour 334synchronised global space 305

T

template matching 79, 128template-based representation 46temporal correlation 191temporal model 84, 92temporal offset 334temporal process 7temporal segmentation 122, 124temporal topology 316text document analysis 47, 57, 231text mining 57texture descriptor 76three-dimensional 109time delay index 311time series function 7time varying random process 128

time-delayed mutual information 313time-delayed probabilistic graphical model

337top-down 51, 196

rules 261top-down model inference 200topic

inter-regional 256profile 258regional 256

topic model 231complexity 240corpus 240document 231, 240topics 240

topic profile 256topic simplex 256, 259topological sorting 340topology inference 27tracking 44, 103

body part 103discontinuous motion 106holistic body 102salient point 161

tracklet seeobjecttractability 333trajectory seeobjecttrajectory transition descriptor 161trajectory-based representation 44transformation matrix 81two-dimensional 103

U

uncertainty 10under-fitting seemodel under-fittinguniform pattern 77unimodal 210unit association 48unit formation seesegmentationunlikely

behavioural 237dynamical 237intrinsic 237

unsupervised behaviour modelling 266unsupervised behaviour profiling 207unsupervised clustering 191unsupervised feature selection 210unsupervised incremental learning 59unsupervised learning 58, 201, 283unusual behaviour 283unusual behaviour detection 59, 61

Page 44: From Pixels to Semanticssgg/VAB/GongXiang_VAB_intro_index.pdf · 2011-03-04 · modelling rare behaviours, ‘man-in-the-loop’ active learning of behaviours, multi- camera behaviour

376 Index

V

validation set 163, 324, 352variance 154variation 75

between-class 75inter-class 10intra-class 10within-class 75

variational distribution 255variational EM 255variational inference 255, 356variational parameter 255vector descriptor 142video 173, 181

bag-of-words 48, 267clip 231, 254clip categorisation 236content 181content trajectory 187document 254, 259representation 173segmentation 173semantic interpretation 173stream 173

video analysisbehaviour-based 279

video content analysis 190video content breakpoint 185video corpus 267video descriptor 48video document 57, 232, 267video document category 232video indexing 8video mining 201video polyline 183

video saliency 236video screening 236video search 8video segmentation 180, 183, 185

on-line 187semantic content 183

video semantic content 181extraction 181

video topic 57, 232video word 48, 57, 232, 267view selection 120visual analysis of behaviour 3, 4visual appearance 303visual behaviour 3visual context 15, 24, 247visual cue 5, 107

colour 107motion 107texture 50

visual saliency 161visual search 7, 9visual surveillance 8, 59Viteribi algorithm 198vote entropy seequery-by-committee

W

weak classifier 83weakly-supervised joint topic model 267weakly-supervised learning 61, 262wearable camera 30wide-area scene 26, 303, 333wide-area space 303within-class scatter matrix 81within-set covariance matrix 312