AUTOMATIC RECOGNITION OF PRIMATE BEHAVIORS AND SOCIAL
INTERACTIONS FROM VIDEOS
A Dissertation Presented
By
Nastaran Ghadar
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
May 2015
NORTHEASTERN UNIVERSITY
Abstract
College of Engineering
Department of Electrical and Computer Engineering
Doctor of Philosophy
Automatic Recognition of Primate Behaviors and Social Interactions
from Videos
by Nastaran Ghadar
Recognizing and modeling social behaviors of animals has many applications, in-
cluding: (1) improved understanding of their behavior; (2) enhanced protection of
species; (3) hosting of specimens in enriched environments at zoos; and (4) efficient
extraction and analysis of data in important basic and applied biological research,
where animal models are used. Currently, understanding social behaviors in ani-
mals is achieved either by direct human observations or by videotaping and then
coding the behaviors. Both of these approaches have major limitations includ-
ing being heavily time consuming and requiring highly trained behavioral science
experts. Having an automated system to recognize and model social behaviors
would facilitate the scientific study of complex behaviors with less impact due to
these constraints. However, research in this area is very limited.
In this dissertation, we describe a framework that adopts current practices from
computer vision and machine learning in creating the preliminary steps towards
solving the problem of automatically recognizing behaviors of primates in a social
group (in this case, a pen hosting a group of 3 or more primates). Several chal-
lenges need to be overcome in order to achieve primate activity recognition from
videos, some of which are: the massive size of continuous video recordings from
multiple cameras over days and weeks, illumination variations throughout the day,
background changes due to moving objects in the pen and humans passing by (e.g.
for feeding or observing), highly variable shapes and poses of primates, and the
low visibility of color-coded primate collars causing difficulty in identifying the
primates.
This study is unique, to our knowledge, because it tackles automatic primate
behavior and interaction recognition in social groups hosted in a pen for the first
time. Results indicate that the activities extracted based on the detection and
tracking algorithms developed are sufficiently accurate to infer primate behaviors
and social interactions.
———————————————————-
Disclaimer: This work is supported by the National Science Foundation (NSF) under
grant BCS-1027724. Any opinions, findings and conclusions or recommendations
expressed in this material are those of the author and do not necessarily reflect
the views of the NSF. Some parts of this work were carried out jointly with my colleague,
Xikang Zhang.
Acknowledgements
I want to thank my advisor Prof. Deniz Erdogmus for his support, training and
supervision during my Ph.D. studies. I truly appreciate his guidance from the early
steps of my studies until graduation.
I would like to thank my colleague Xikang Zhang, who has been working on this
project with me and helped me with parts of this work.
I would also like to thank my colleagues from OHSU, Dr. Izhak Shafran, Dr.
Kathleen Grant, Dr. Kristine Coleman, and Alireza Bayesteh. I also want to
extend my appreciation to my committee member, Dr. Jennifer Dy. I want
to acknowledge the National Science Foundation (BCS-1027724) for their sup-
port. I give my thanks to current and previous members of our lab, Prof. Dana
Brooks, Erhan Bas, Seyhmus Guler, Jamshid Sourati, Sina Moghadamfallahi,
Sarah Brown, Sheng You, Marzieh Haghighi, Asieh Ahani, Hooman Nezamfar
and other colleagues and members of Bspiral group for their friendliness, support,
and creating a pleasant environment during my research.
I would like to thank my husband, Payam Nia, for his love and support. Finally,
I would like to thank my sisters Yasi and Shabnam, who have always been there
for me and supported me in every possible way.
Contents
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Motivation and Overview
1.2 Primate Behavior Research
1.3 Background and Related Work
1.4 Description of Framework of Dissertation
2 Data Collection and Preparation
2.1 Acknowledgement
2.2 Experimental Setup
2.3 Recording Behaviors with Multi-channel Audio and Video Data
3 Segmentation and Object Recognition
3.1 Review on Object Detection Algorithms
3.1.1 Segmentation-based Approaches
3.1.2 Background-modeling-based Object Detection
3.1.3 Supervised-learning-based Background Subtraction
3.1.4 Point Detectors
3.1.5 Feature-based Object Detection
3.1.6 Shape-based Object Detection
3.1.7 Template-based Object Detection
3.1.8 Classifier-based Object Detection
3.1.9 Deep Neural Networks and Convolutional Neural Networks
3.1.10 Comparison between Detection Algorithms and Finding a Suitable Approach for our Problem
3.2 Primate Detection in 2D
3.2.1 Background Subtraction
3.2.2 Using HOG and Color Features and Classification
3.2.3 Primate Identification
4 Object Tracking
4.1 Definition and Common Algorithms
4.1.1 Kalman Filter
4.1.2 Particle Filter
4.1.3 Multiobject Data Association and State Estimation
4.2 Primate Tracking in 2D
5 Calibration and 3D Reconstruction
5.1 Camera Calibration
5.1.1 Explicit Camera Calibration
5.2 Visual Hull
5.3 Calibration and Visual Hull Reconstruction of Primates
5.3.1 Multiview Environment and Calibration
5.3.2 3D Visual Hull Reconstruction of Primates
6 Activity Recognition Based on Spatial Relation
6.1 Activity Recognition
6.2 Primate Activity Recognition
6.2.1 Velocity Measures
7 Experimental Results
7.1 Experiments
7.2 2D Primate Detection
7.3 Multiview Environment and 3D Primate Visual Hull Results
7.4 2D Primate Tracking
7.5 Primate Activities
7.6 Fusion of Multiple Views
8 Discussion and Conclusion
8.1 Conclusion
8.2 Discussion and Future Work
List of Figures
1.1 Primate research
2.1 Primate research, group of four primates viewed from different cameras in the pen
2.2 Environment set up and lens installation.
2.3 Primate research, group of four primates viewed from different cameras in the pen with different setting than figure 2.1
2.4 A sample image in norpix software from different
3.1 This figure shows an example of static background subtraction algorithm. This image is taken from [11]
3.2 Background normalization using the static background image.
5.1 2D example of the visual hull approximation algorithm. C1, C2, C3 are different views with corresponding silhouettes S1, S2, S3. The yellow area is the approximation of the visual hull; the area enclosed by black lines is the actual visual hull; and the blue shape in the center is the object.
6.1 A sample image of locomotion activity. The primate that is shown with the red box is moving but no other primate has motivated this movement.
6.2 These series of images from top right to bottom left show the chasing and avoiding activities that are happening between the two primates that are shown with red circles.
6.3 These series of images from top right to bottom left show the avoiding activity for the primate that is specified with the red circle. Note that this activity is not a result of chasing in this case.
6.4 This figure shows the decision tree we used to evaluate our test set. The leaf nodes show the decision made based on the feature values.
7.1 Sample image from four views.
7.2 Primate detection in 2D. In column one, green boxes are the ground truth; red boxes are the detection results. Column two shows the extracted silhouettes by background subtraction over detected bounding boxes.
7.3 PR-curve of 2D detection.
7.4 Calibration process. A checkerboard of size 16.8′′ × 24′′ is used for calibration. The top figure shows the 3D locations of each camera.
7.5 3D visual hull reconstruction result sample. Column one shows the original images; Column two shows the binary images from 2D primate detection; Column three is the visual hull constructed from three views.
List of Tables
1.1 Example macaque behaviors often encoded in human observations
7.1 2D primate detection results from 4 views, video 1
7.2 2D primate detection results from 2 views, video 2
7.3 2D primate tracking results from 2 views, video 1
7.4 2D primate tracking results from 2 views, video 2
7.5 Activity recognition results on view 1, video 3.
7.6 Activity recognition results on view 1, video 2.
7.7 Activity recognition results on view 2, video 2.
To my mom;
For she has always been there for me and believed in me.
Chapter 1
Introduction
1.1 Motivation and Overview
In recent decades, biologists and scientists have shown great interest in studying
animals from images and videos. Wildlife recordings taken in the field represent
challenging real-life situations for automated visual analysis. Biologists are inter-
ested in detection and analysis of the behavior of animals from camera traps [1].
For this purpose, biologists conduct a substantial portion of their research in the
field, and they collect large amounts of video from animals, which include monitor-
ing video, videos from field trips, and personally recorded wildlife video footage [2].
The result of this data collection is massive video data, which sometimes spans
several hundred hours. While fieldwork is very demanding, videotape analysis
is truly tiring. The quantity of video footage that must be viewed is enormous,
and numerous notes and qualitative observations have to be made along the way.
For manual indexing, biologists have to browse linearly through many hours of
video to find and describe objects and events of interest. Locating and identifying animals
in each video frame is a labor-intensive and tedious task for large amounts of video
[4]. Since indexing should preferably be performed by domain experts, it quickly becomes
an expensive task; biologists are therefore in real need of computational video
analysis tools. Visual analysis methods have the ability to significantly acceler-
ate the process of video indexing and enable novel ways to efficiently access and
search large video collections. Camera trap videos, however, tend to have very low
frame rates, which cause popular tracking and supervised learning algorithms to
fail. While a lot of research has been performed on the visual analysis of human
beings and human-related events, there has been unexpectedly little work on the
problem of the automated analysis of animals despite its great importance.
An effective exploration of methods to tackle the specific aims of automatic recog-
nition of primate behaviors requires a multidisciplinary approach with expertise
from both primatology and computational science. This dissertation tackles the
computer vision aspects of a larger multi-disciplinary NSF project, whose goal is
to automatically recognize behavior of individual primates in social groups, using
audio and video recordings from multiple cameras and microphones. The animal
studies were conducted under the IACUC approval of OHSU, and were supervised
by Dr. Kristine Coleman and Dr. Kathleen Grant, who are experts in primate
models for studying disease risks and in behavioral ecology. The project was managed
by Dr. Izhak Shafran who is an expert in speech recognition, and also supervised
the instrumentation and the data collection with assistance from Alireza Bayesteh
Tashk and Dr. Guillaume Thibault. The audio recordings are being analyzed by
Alireza Bayesteh Tashk under the guidance of Dr. Izhak Shafran.
As mentioned, we have chosen to focus our work on primates, in particular rhesus
macaques. The goal of this dissertation is to present a computational approach to
tackle the problem of automated understanding of behavior of highly social ani-
mals using computer vision algorithms. We expect this to have far-reaching effects
on primatology, behavioral ecology, animal husbandry and neuroscience. It will
enable researchers to formulate new cyber-enabled strategies in behavioral ecology,
conservation biology and animal husbandry. For example, the proposed methods
will allow scientists to closely monitor critical stages of life, such as mating and
breeding in captive breeding programs for highly social animals (e.g., wolves, voles)
before releasing them into the wild. Continuous monitoring of routine behaviors
in zoos and other facilities will also help scientists to identify individuals that
need special attention. For example, this research will allow researchers to follow
non-social behaviors (e.g. eating and sleeping), which are necessary to ensure the
welfare of animals. This knowledge could help reduce injuries and mortality rate
from fighting in zoos, primate centers, and other such facilities. In neuroscience,
the proposed research could help overcome current hurdles in studying the influ-
ence of social status and environmental factors associated, for instance, with drug
and alcohol consumption [16]. The proposed methods will also enable better un-
derstanding of daily patterns in behavior, or behaviors that occur at night, which
are currently constrained by the heavy dependence on human observers. To tackle
the task of learning behaviors from observations, we make one primary assumption:
all complex behaviors consist of a sequence of elementary behaviors, so that
by recognizing the elementary behaviors, one can learn the complicated dynamic
behaviors over time.
Figure 1.1: Primate research
1.2 Primate Behavior Research
After briefly describing the primate behavior research that is relevant to this
work, along with current measurement practices, we summarize the state-of-the-art algorithms
from computer vision for tackling computational challenges in our specific aims.
Social relationships are very important to scientists who study macaque behav-
iors, both in the wild and in captivity. In the wild, rhesus macaques typically
live in large troops of approximately 50-80 individuals, including multiple
males and females. Females remain in their natal groups their whole lives,
while males leave the troop at around puberty and move into new troops. There-
fore females have strong relationships with their daughters and sisters. There are
various levels at which these social behaviors can be described [5] – social interac-
tions, behavior that occurs between individuals, e.g., aggressive display between
two monkeys; social relationships, succession of social interactions between in-
dividuals, e.g., dominant/subordinate relationship between two individuals; and
social structures, networks of social relationships, e.g., dominance hierarchy in
the troop. Since the publication of Jeanne Altmann's seminal 1974 paper on
sampling methods [6], the procedure for measuring behavior,
including social behavior, has remained more or less the same. Focal animal sampling is one
of the most commonly used observation approaches for studying animals in
groups. In this method, one individual (focal) is observed for some specific time
(observation period) and behaviors of interest are recorded. Researchers usually
define behaviors of interest on an ethogram, a quantitative description of behav-
ior, using labels such as the ones listed in Table 1.1. Some of the behaviors
that are often recorded in studies of social dominance include eyebrow threat
(mouth open, brow back, agonistic behavior), bared-teeth or fear grimace (lips
retracted, teeth bared; submissive behavior), lipsmack (facial expression involving
rapid opening and closing of mouth and lips), and affiliative behavior [7], where
certain behaviors may co-occur (e.g., animals can lipsmack and move). The main
advantage of video recording is that it replaces direct observation and so avoids
human intrusion, at the cost of having to review videotapes from multiple
perspectives. To monitor eating and drinking behaviors, there are mechanisms to
automatically log the time and quantity of intake, but so far no automatic solution
exists for evaluating the behaviors of individuals in groups. While observational
methodologies have not undergone major changes, ways to interpret the data have
evolved with statistical methods. Beyond computing statistics on the duration,
frequency and latency of behaviors, analysis now often includes context in terms
of preceding behavior. Sociograms provide another perspective of social behavior
and relationship, representing associations between individuals using lines whose
thickness depends on the strength of association [8]. In this research our interest
is in some of the specific behaviors presented in Table 1.1, where we utilize the
focal observation method to annotate the recordings and create a data set to train
and validate our models. Specifically, we focus on types of behaviors that can be
inferred from the animals' relative positions.
Label          Comment
Aggression     rough behavior or biting
Chase          pursuit
Displace       subject leaves when approached
Explore        inspects objects other than food
Fear grimace   subject bares teeth
Forage         searching presumably for food
Freeze         subject is inactive; may move eyes
Groom          with hands or mouth
Lipsmack       rapid movement of lips
Locomotion     motion of entire body
Play           grunting, wrestling, jumping, etc
Stationary     immobile, moving head or arm
Threat         scream, lunge, ground beating, etc
Vigilant       subject scans environment with eyes
Table 1.1: Example macaque behaviors often encoded in human observations.
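The summary statistics mentioned in this section (frequency, duration, and latency of behaviors, plus context in terms of the preceding behavior) can be illustrated with a minimal sketch. The record format, the function names `summarize` and `transitions`, and all numeric values below are hypothetical and chosen only for illustration; they are not taken from the annotation protocol used in this work.

```python
from collections import defaultdict

# Hypothetical focal-sampling record for one focal animal:
# (behavior label, start time, end time), in seconds from the
# start of the observation period. Values are invented.
events = [
    ("Locomotion", 5.0, 12.0),
    ("Groom", 20.0, 50.0),
    ("Locomotion", 60.0, 64.0),
    ("Chase", 70.0, 75.0),
]


def summarize(events):
    """Per-behavior frequency, total duration, and latency to first onset."""
    stats = defaultdict(lambda: {"frequency": 0, "duration": 0.0, "latency": None})
    for label, start, end in sorted(events, key=lambda e: e[1]):
        entry = stats[label]
        entry["frequency"] += 1
        entry["duration"] += end - start
        if entry["latency"] is None:  # first time this behavior is seen
            entry["latency"] = start
    return dict(stats)


def transitions(events):
    """Count (preceding behavior, behavior) pairs as a simple form of context."""
    ordered = [label for label, _, _ in sorted(events, key=lambda e: e[1])]
    counts = defaultdict(int)
    for previous, current in zip(ordered, ordered[1:]):
        counts[(previous, current)] += 1
    return dict(counts)


summary = summarize(events)
pairs = transitions(events)
```

Here `summary["Locomotion"]` holds the count and total time of locomotion bouts, and `pairs` counts how often one labeled behavior directly follows another.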
To better understand some of these behaviors, we discuss them separately.
Dominance: Rhesus macaques naturally live in social groups and they establish a
linear "dominance hierarchy" over time. This hierarchy may change over time and
depends on many factors (age, sex, aggression, perhaps intelligence), and could also
depend on the support of other primates in the group. As the term suggests,
primates that have a higher rank in the hierarchy tend to be more dominant,
i.e., they displace lower-ranked individuals from resources (mates, space, food). They
tend to have higher reproductive success (either by mating more often, or by
having more resources to invest in their offspring). Rank is established through
play and other interactions, including affiliative ones (and, rather tautologically, that is exactly
how it is measured too). The need to maintain social
position and to keep track of one's rank is one of the proposed explanations for why
humans evolved large brains.
Grooming: One of the most common activities among primates is grooming.
Grooming other primates is an important mechanism that shows their affection
for each other. There are several reasons why a primate might groom another
primate: subordinate animals tend to groom more dominant ones; males groom
females for sexual purposes; mothers groom infants for practical purposes, to keep
their fur clean. What is certain is that grooming strengthens the bonds between
animals and keeps the primate social structure together.
Communication: This includes scents, body postures, gestures, and vocaliza-
tions. Some of these appear to be autonomic responses indicating emotional states:
fear, excitement, confidence, anger. Others seem to have a more specific purpose:
loud ranging calls in indri, howler monkeys and gibbons; quiet contact calls in
lemurs to keep the group together; fear calls in lost infants, or on spotting preda-
tors. From our human perspective, we often find it easier to associate sounds with
specific meaning, but among non-human primates, gestures and actions are often
used. Presentation and mounting behavior are often used to defuse potentially
aggressive situations. Yawns exposing teeth are often threats, as is direct eye
contact. Facial expression is important too. It is very obvious in chimps: their ex-
pression often appears all too human-like; but other primates also use stereotyped
eyelid flashes or lip slaps.
Aggressive and affiliative behavior: As mentioned before, many behaviors
exist to keep the group structure running smoothly for the members of the group.
There are occasions though when these behaviors (especially aggression) are di-
rected outside the group.
Distance related behaviors: Behaviors including locomotion (running, jump-
ing, walking and climbing) and specifics of foraging behavior.
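As a rough illustration of how distance-related behaviors might be quantified, the sketch below computes per-animal speed and the distance between two animals from tracked centroid positions. It is not the method developed later in this dissertation; the track format, the function names, the sampling interval `dt`, and the coordinates are all assumptions made for the example.

```python
import math


def speeds(track, dt):
    """Speed between consecutive centroid positions (distance units per second)."""
    return [math.dist(p, q) / dt for p, q in zip(track, track[1:])]


def pairwise_distance(track_a, track_b):
    """Frame-by-frame distance between two animals' centroids."""
    return [math.dist(p, q) for p, q in zip(track_a, track_b)]


# Hypothetical centroid tracks (x, y) in pixels, one sample per second.
animal_a = [(0, 0), (3, 4), (6, 8)]      # moving
animal_b = [(10, 0), (10, 0), (10, 0)]   # stationary

speed_a = speeds(animal_a, dt=1.0)
speed_b = speeds(animal_b, dt=1.0)
gap = pairwise_distance(animal_a, animal_b)
```

A moving animal whose distance to a stationary one shrinks over time is the kind of pattern that relative-position features can expose; deciding which behavior label it corresponds to would need additional rules or a trained classifier.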
1.3 Background and Related Work
In this section, we provide a selected review of closely related work. In the com-
puter vision community, many studies employ videos of animals as standard data
sets to develop new algorithms, especially for tracking or behavior recognition.
Most of the presented methodologies on animal analysis are conducted in highly
controlled environments, for instance, with a static camera, in a well-defined loca-
tion, with static background, and with no environmental factors interfering, such
as occlusion, different illumination conditions, and interfering objects [9, 10]. One
common scenario for a controlled environment would be monitoring applications,
where there is a static background and a static camera [12]. This setting makes it
straightforward to learn the static background and obtain the foreground by
looking for deviations from the background. More sophisticated techniques have
also been introduced. Khan et al. [13] developed a system that can automatically
generate the three dimensional trajectory of primates in an outdoor environment.
Their purpose is to evaluate the navigational abilities of non-human primates.
Their system extracts primate kinematic features such as path length, speed, and
other variables impossible for an unaided observer to note. From trajectories, they
computed and validated a path length measurement and proposed a method for
automatic behavior detection. Also, their system is used to examine the gender
differences in spatial navigation of rhesus primates. They set the environment
in a way to avoid occlusion, i.e. an open environment with minimal perturba-
tions, and they did not analyze the social interactions between primates, but put
their focus on individual actions. Chaumont et al. [14] proposed a computerized
method and software called Mice Profiler that uses geometrical primitives to
model and track social interactions in mice. Their system monitors a comprehen-
sive repertoire of behavioral states and temporal evolution, which is utilized for
identifying the key events that trigger social contact. Balch et al. [15] proposed an
automated labeling system to study social insect behaviors. Their ultimate goal
is to automatically create executable models of animal behavior. An algorithm
proposed by Burghardt and Calic [17] detects animal faces using Haar features
and then tracks the animals; such algorithms would not work for animals whose faces
are not visible or are hard to track. Other approaches [18, 19] have the user mark
or extract the location of the animal by hand. This, of course, is extraordinarily
time-consuming. Khorami et al. [10] proposed an approach that is able to detect
multiple types of animals in an entirely unsupervised scenario. Walther
et al. [20] apply saliency maps to initialize multi-agent tracking of low-contrast
translucent targets in underwater footage. Haering et al. [21] use neural network
algorithms to detect high-level events, such as hunts, by classifying and tracking
moving object blobs. Tweed and Calway [22] proposed an approach that achieves
multiple object tracking by developing a periodic model of animal motion and
exploiting conditional density propagation to track flocks of birds. Ramanan and
Forsyth [23] proposed an interesting method, where they use low-level detectors
and a mean shift construct to create an appearance model for the animal and
use it to detect the animal in future frames. Their method takes into account
temporal coherency when building appearance models of animals. While they
present very good results in their paper, they only deal with three different animal
species and with cases that have no occlusion. Everingham et al. [24] proposed an
approach that combines a minimal manually labeled set with an object tracking
technique to gradually improve the detection model; however, they only deal with
human faces. Gibson et al. [25] and Hannuna et al. [26] try to address the issue
of animal behavior classification by detecting and classifying animal gait, applying
statistical analysis to sparse motion information extracted from wildlife
footage. Burghardt and Calic [17] present an algorithm that tracks animal faces in
wildlife rushes and populates a database [27] with appropriate semantics defining
their basic locomotive behavior. Their detection algorithm is an adapted version
of a human face detection method that exploits Haar-like features and the AdaBoost
classification algorithm [28]. Tracking uses the Kanade-Lucas-Tomasi method, fused
with a specific interest model applied to the detected face region. They achieved
reliable detection and temporally smooth tracking of animal faces. Furthermore,
the tracking information is exploited to classify locomotive behavior of the tracked
animal, e.g. lion walking left or trotting towards the camera. Finally, the extracted
metadata about the presence of the animal, together with its locomotive behav-
ior, creates a strong prior in the process of learning animal models as well as in
extracting the additional semantic information about the animal’s behavior and
environment. The presented algorithm is a part of a large content-based retrieval
system [29] within the ICBR project that focuses on the computer vision research
challenges in the domain of wildlife documentary production. This algorithm is
close to what we are presenting in this project.
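The static-camera, static-background scenario described earlier in this section can be sketched in a few lines: estimate the background as a per-pixel median over several frames, then mark pixels that deviate from it. This is a generic illustration in plain Python rather than the detection pipeline developed in this work; the function names, the tiny 3x3 frames, and the threshold of 30 gray levels are assumptions chosen for the example.

```python
from statistics import median


def estimate_background(frames):
    """Per-pixel median over a list of equally sized grayscale frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(cols)]
        for r in range(rows)
    ]


def foreground_mask(frame, background, threshold=30):
    """True where a pixel deviates from the background by more than threshold."""
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 3x3 grayscale frames: two empty views and one with a bright object.
empty = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
with_object = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]

background = estimate_background([empty, empty, with_object])
mask = foreground_mask(with_object, background)
```

The median keeps the background estimate stable even when an object appears in a minority of frames. Illumination changes, moving background objects, and occlusion are what make this simple scheme insufficient in less controlled settings.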
1.4 Description of Framework of Dissertation
In this dissertation, we developed a general framework for detecting, localizing,
tracking, and reconstructing images of social animals in a 3D observation environ-
ment. Finally, using these results, we were able to extract elementary behaviors
from videos.
As evident from the cited literature, the necessary components have developed
sufficiently in recent years to allow computational scientists to undertake the chal-
lenge of creating a framework for modeling and recognizing behavior of individuals
in their social groups. The structure of this dissertation is as follows:
1. Recording behaviors with multi-channel audio and video data: In
Chapter 2, I will discuss the details of data collection and how we acquired
our data for our experiments.
2. Detecting individual primates in the pen: In Chapter 3, I will start
with the definition of object detection. There are several algorithms currently
available in the literature for object detection, and each has its advantages
and disadvantages. After introducing these algorithms and discussing where
they work best, I define the framework of our detection algorithm and why
we chose the proposed methods.
3. Tracking individuals over time: In Chapter 4, I will introduce some of
the most common algorithms for object tracking and when we would expect
to get a good performance out of them. Finally I will discuss the details of
our tracking algorithm.
4. Calibration and 3D visual hull reconstruction of primates: In Chap-
ter 5, I will explain the details necessary for us to obtain a 3D silhouette of
the primates in the pen and decide whether having a 3D system is helpful
or not.
5. Recognizing individual behaviors: In Chapter 6, I will discuss the activ-
ities we are interested in. After that, I will describe an algorithm to recognize
them.
6. Experimental results: In Chapter 7, I will present the results of each
section separately and discuss the results.
7. Conclusion, discussion, and future work: In Chapter 8, I will discuss
the pros and cons of our algorithm and how one can improve it in terms of
efficiency and performance.
Chapter 2
Data Collection and Preparation
2.1 Acknowledgement
The work described in this chapter was done entirely by our collaborators at OHSU. All the data was
collected by the OHSU team, which was led by Dr. Shafran. I would like to
acknowledge Alireza Bayesteh Tashk, Guillaume Thibault, and Meysam Asgari
for the grunt work they did for two years collecting the data. I would like to
acknowledge Dr. Kristine Coleman, Nicola Robertson and Megan McClintik for
conducting the animal studies, and Dr. Kathy Grant for her input in the process.
2.2 Experimental Setup
Overall, five groups of animals were observed, and each group consisted of 4 or 6
rhesus macaques held in a pen (approximately 12 ft (length) x 7 ft (deep) x 7 ft
(high)) at the Oregon National Primate Research Center (ONPRC) using a protocol
approved by OHSU’s Institutional Review Board.
Individuals from isolated cages were put into the pen and their behavioral activities
were recorded for two days from about 7am to 7pm; there was no recording
when the lights were off. After a week, their behavior was recorded again for two
days. By this time they had established a dominance hierarchy, i.e. a stable
phase. Two more sessions of two days were recorded to observe the effect of an
escalating series of perturbations, i.e. a perturbed phase. The major perturbations
applied were: 1) Human Impostor (introducing an unfamiliar human near the cage or
pen for 15 minutes), 2) Resource Competition (modulating certain resources, for
instance preferred resting areas, toys, and treats), and 3) Social Instability (removal
of the most dominant individual for the entire last week). These perturbations
created the chance to observe interactions that establish social dominance hierarchies.
Figure 2.1: Primate research, group of four primates viewed from different cameras in the pen (panels: Camera-1, Camera-2, Camera-3, Camera-4)
2.3 Recording Behaviors with Multi-channel Audio and Video Data
Automating recognition of behaviors requires capturing all the information rele-
vant for detecting individuals in the pen, tracking their movement over time and
recognizing their vocalization.
In the video domain, to avoid occlusions and to maximize coverage of the entire
volume of the pen, we recorded behaviors using cameras from multiple perspectives:
three cameras (GC1380CH, 2/3” CCD) with wide aperture lenses (Optron 5mm
f/2.8) on three corners of the pen and a fourth camera (GC1380CH, 2/3” CCD)
with a wide-angle fisheye lens (Edmund Optics NT62-274, focal length 1.8mm,
F1.4, 185 x 185 degrees) on the top of the pen. Ideally, the pen should be uniformly
illuminated to avoid blotchy over-exposed and dark under-exposed regions in the
images, but this is very difficult to achieve. We minimized the illumination variation
by relying on several overhead incandescent tube lights, which were supplemented
by a light box mounted at the floor level. The lights were programmed to switch
off during the night hours, about 7pm to 7am. Figure 2.2 shows the camera setup
and Figures 2.3 and 2.1 show typical camera frames from the four views for two
groups of primates.
Additionally, for simplifying the task of identifying the individuals in the video
recordings, we color-coded the collars on the monkeys in a group. Collars were
powder coated with one of the six colors: purple, green, orange, blue, red and
yellow for the group of six monkeys and green, yellow, black, and red for the
group of four monkeys.
High-level synchronization of frames from the four cameras was obtained by
triggering the cameras to capture each frame with a common trigger signal (National
Instruments Pulse Generation Module). The trigger signals were controlled and
programmed by high-level software, StreamPix 5 from NorPix, on a dedicated
data collection workstation. StreamPix is NorPix’s flagship software product.
Figure 2.2: Environment set up and lens installation. View 1: Edmund Optics NT62-274 fish-eye lens; Views 2, 3, 4: Kowa WIDE MEGAPIXEL high resolution lens, model LM5JCM.
StreamPix has become the fundamental digital video camera recording software.
It offers a state-of-the-art user interface and a lot of usage flexibility for single or
multiple camera recordings.
With StreamPix, it is possible to view, control, and acquire from multiple cameras
simultaneously, all in the same user interface. StreamPix provides a complete
management console for cameras, simplifying the setup, control, and acquisition
from any number and type of cameras. The number of digital video cameras
supported is limited only by the point at which the combined data rate of the
cameras exceeds the internal bus bandwidth or processor capabilities of the computer.
Figure 2.3: Primate research, group of four primates viewed from different cameras in the pen with a different setting than Figure 2.1 (panels: Camera-1, Camera-2, Camera-3, Camera-4)
Figure 2.4: A sample image in NorPix software from different .
The software was programmed to automatically start recording during the daylight
hours in the pen, and the video from the four cameras was streamed into the
workstation’s high-speed RAID array consisting of 8 disks, each with a capacity of about
2TB. The video was recorded with a resolution of 1360 x 1024 pixels, each pixel
quantized at 8 bits for each of the three channels. We had two such workstations,
one at the pen and one in the lab. After a few sessions the disks were swapped
between the two workstations and the recorded data, totaling about 1.5TB per
session, was then offloaded to larger file servers to be archived.
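As a rough check on the data volumes involved, the raw size of one frame follows directly from the recording format given above. The sketch below computes it in Python. The frame rate passed to `aggregate_rate_mb_per_s` is a hypothetical value chosen only for illustration, since the frame rate is not stated here, and the calculation assumes uncompressed storage.

```python
# Recording format taken from the text: 1360 x 1024 pixels, three channels,
# 8 bits (1 byte) per channel, four cameras.
WIDTH, HEIGHT = 1360, 1024
CHANNELS = 3
NUM_CAMERAS = 4


def frame_bytes(width=WIDTH, height=HEIGHT, channels=CHANNELS):
    """Size in bytes of one uncompressed frame from one camera."""
    return width * height * channels


def aggregate_rate_mb_per_s(fps, cameras=NUM_CAMERAS):
    """Combined uncompressed data rate in MB/s (1 MB = 10**6 bytes).

    `fps` is an assumption supplied by the caller; it is not given in the text.
    """
    return frame_bytes() * cameras * fps / 1e6


if __name__ == "__main__":
    print("bytes per frame:", frame_bytes())
    print("MB/s at a hypothetical 10 fps:", aggregate_rate_mb_per_s(10))
```

One uncompressed frame is 4,177,920 bytes, so four cameras at a hypothetical 10 frames per second would produce about 167 MB/s. This is the kind of combined data rate that must stay below the bus bandwidth of the recording workstation.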
Since our focus is on the video domain, we do not discuss the details of the audio
setup.
Chapter 3
Segmentation and Object
Recognition
In this chapter, we start with the definition of object detection and present different
approaches to detecting objects, such as frame differencing, optical flow, point
detectors, background subtraction, temporal differencing, and classification methods,
together with the feature types these methods use, such as edge-based and patch-based
features. We compare the accuracy of these methods and identify the advantages and
disadvantages of each to find out in which situations these algorithms work best, and
to find a direction to pursue for our detection algorithm.
3.1 Review on Object Detection Algorithms
Object detection is the process of finding areas of interest in images and videos and
clustering the pixels related to these areas into objects [30]. Object detection is one
of the main challenges in computer vision and image processing, and it
is an essential part of any tracking or activity recognition algorithm. The most
common techniques for object detection use the information in a single frame;
however, several algorithms use temporal information computed from
a sequence of frames in order to reduce the number of false detections
and increase accuracy [31]. The main and most common categories of object
detection algorithms are briefly discussed below.
3.1.1 Segmentation-based Approaches
Segmentation-based algorithms partition the image frame into segments
in order to find the objects of interest. An image can be segmented according to
several different principles, and the choice of principle plays an important part in
finding objects of interest. Once segmentation is done, the resulting segments are
examined to detect the desired object.
1. Graph Cut:
When a graph cut algorithm is used in computer vision, the input image is
considered as a graph and image segmentation becomes a graph partitioning
problem. In this representation the image is the graph G, the pixels of the
image are the vertices, V = {u, v, ...}, of the graph, and the graph is partitioned
into N disjoint sub-graphs (regions), A_i, by pruning the weighted edges
of the graph. The weights between the nodes are computed based on the
similarity of color, brightness and texture. Wu and Leahy [104] proposed
using color similarity as the minimum cut criterion for dividing an image into
regions, but their method suffers from over-segmentation. Yi and Moon [104]
considered graph cut image segmentation as a pixel labeling problem. The
label of the foreground object (s-node) is set to 1 and that of the background
(t-node) is set to 0. The pixel labeling is obtained by minimizing an energy
function with the help of a minimum graph cut. Shi and
Malik [33] proposed the normalized cut to overcome the over-segmentation
problem. The ‘cut’ in their method depends both on the sum of the weights of the
edges in the cut and on the ratio of the total connection weights of the nodes
in each partition to all nodes of the graph. For image-based segmentation,
the product of the spatial proximity and color similarity defines the weights
between the nodes. Graph cuts find a single global optimum for a binary
foreground/background labeling; therefore, this formulation does not work for
multiple objects. In addition, the memory usage of graph cuts increases quickly
as the image size increases.
2. Mean-shift Clustering:
Mean-shift clustering is a segmentation algorithm that groups the
pixels of an image frame into clusters. For an input image, the algorithm
is initialized by randomly choosing a large number of cluster centers from
the data. In the next step, each cluster center is moved to the mean
of the data lying inside a multi-dimensional ellipsoid centered on that
cluster center. The vector defined by the old and the new cluster
centers is called the mean-shift vector, and the move is repeated until the
cluster centers no longer change. Comaniciu and Meer [34] used mean-shift
clustering for the image
segmentation problem to find clusters in the joint spatial and color space,
[l, u, v, x, y], where [l, u, v] denotes the color and [x, y] is the spatial location.
Saravanakumar et al. [35] represented the objects using properties of the
HSV color space. The weakness of this algorithm is tracking drift (or
tracking failure), especially when the color distributions of the target object and
the background clutter (or other objects) become similar.
3. Active Contours:
An active contour model, or snake, is a segmentation algorithm for finding the
outline of an object in a possibly noisy 2D image. A snake is an energy-minimizing,
deformable spline influenced by constraint and image forces that
pull it towards object contours and internal forces that resist deformation.
Snakes may be understood as a special case of the general technique of
matching a deformable model to an image by means of energy minimization
[105]. Snakes do not solve the entire problem of finding contours in images,
since the method requires knowledge of the desired contour shape
beforehand. Rather, they depend on other mechanisms such as interaction with
a user, interaction with some higher-level image understanding process, or
information from image data adjacent in time or space.
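To make the normalized cut criterion from the graph cut discussion concrete, the following sketch evaluates the measure of Shi and Malik, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), for a given two-way partition of a small weighted graph. The function name `ncut_value` and the toy weight matrix are illustrative assumptions. A real segmentation would search for the partition that minimizes this value rather than evaluate a partition supplied by hand.

```python
import numpy as np


def ncut_value(W, mask):
    """Normalized-cut value of the bipartition (A = mask, B = not mask).

    W    : symmetric (n, n) matrix of non-negative edge weights.
    mask : boolean vector of length n, True for nodes in partition A.
    """
    W = np.asarray(W, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    cut = W[np.ix_(mask, ~mask)].sum()   # weight of edges between A and B
    assoc_a = W[mask, :].sum()           # weight from A to all nodes
    assoc_b = W[~mask, :].sum()          # weight from B to all nodes
    return cut / assoc_a + cut / assoc_b


# Toy graph (hypothetical): nodes 0-1 and nodes 2-3 are strongly connected
# pairs, joined to each other only by weak edges of weight 0.1.
W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])

balanced = ncut_value(W, [True, True, False, False])
isolated = ncut_value(W, [True, False, False, False])
print(balanced, isolated)
```

In this toy graph the balanced split between the two strongly connected pairs gives a lower value than cutting off a single node, which illustrates why normalizing by the association of each partition discourages the small, isolated regions that a plain minimum cut tends to produce.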
3.1.2 Background-modeling-based Object Detection
1. Static Background Subtraction: A very common object detection approach is
to create a representation of the part of the image that is the background
and to find differences from this model in each consecutive frame of the video.
In simple background subtraction, an absolute difference is taken
between each current image frame I_t(x, y) and the reference background
image B(x, y) to obtain the motion detection mask D(x, y). There are two main
approaches to choosing the reference background image,
which we discuss later. In this approach a threshold, T, is determined and,
for each pixel, one decides whether the pixel belongs to the background
or the foreground by this rule:
|I_t(x, y) - B(x, y)| > T
If the absolute difference is greater than the threshold, the pixel
is classified as foreground; otherwise the pixel is classified as background.
Any significant change in an image region from the background model is
marked as a moving object. The pixels in the regions undergoing
change are marked as moving objects and reserved for further processing.
This process is referred to as background subtraction. Figure 3.1 shows
an example of background subtraction. There are various methods of background
subtraction, as discussed in the survey [36], such as frame differencing,
region-based methods, spatial information, Hidden Markov Models (HMM) and
eigen-space decomposition [37].
As mentioned earlier, there are mainly two approaches for choosing the back-
ground reference image.
(a) Recursive Algorithm: Recursive techniques for background subtraction
[38, 39] do not maintain a buffer for background estimation. Instead, they
recursively update a single background model based on each
input frame. As a result, input frames from the distant past can have
an effect on the current background model. Recursive
techniques require less storage than non-recursive techniques,
but any error in the background model can have a considerable effect for
a much longer period of time. This category includes various methods
such as approximate median, adaptive background, and mixture of Gaussians
[30].
Figure 3.1: This figure shows an example of the static background subtraction algorithm. This image is taken from [11]
(b) Non-Recursive Algorithm: Non-recursive techniques [38, 39] use
a sliding-window approach for estimating changes in the background.
The process includes storing a buffer of the previous n video frames
and estimating the background image based on the temporal variation
of each pixel within the buffer. Non-recursive techniques are highly
adaptive because, unlike recursive algorithms, they do not depend on the
history beyond the frames stored in the buffer. On the other hand,
the storage requirement can be very large if a large buffer is needed to
cope with slow-moving objects [30].
The problem with background subtraction [40, 41] is that it automatically
updates the background from the incoming video frame and it is not able to
cope with motion in the background, illumination changes, and shadows.
2. Gaussian Mixture Model: Knowing the moving object distribution in the first
frame of the video sequence, one can localize the object in the next frames
by tracking its distribution. The Gaussian Mixture Model (GMM) is a popular technique
for modeling a dynamic background, as it can represent the complex distribution
of each pixel. The common steps in this method are as follows [43]: The
values of a pixel are modeled as a mixture of Gaussians. At each iteration,
the Gaussians are evaluated using a simple heuristic to determine which are likely to
correspond to the background. Pixels that do not match the “background
Gaussians” are classified as foreground. Foreground pixels are grouped using
2D connected component analysis. Bodor et al. [42] aimed to develop
automated intelligent vision-based monitoring systems. They detect objects
appearing in a digitized video sequence using a mixture of Gaussians
for background/foreground segmentation. Stauffer and Grimson [44] use a
mixture of Gaussians to model the pixel color. In this method, every pixel
value of the current frame is checked against the existing Gaussian distributions
of the background model until a matching Gaussian is found.
When a match is found, the mean and variance of the
matched Gaussian are updated. If the pixel value does
not fit any of the Gaussian distributions, the distribution with the
least weight is replaced by a new distribution with the current pixel value as its
mean, a high initial variance, and a low weight. Pixels are classified
based on whether the matched distribution represents the background
process.
GMM suffers from slow convergence at the initial stage of background
detection. It also sometimes leads to false motion detection in complex backgrounds.
For example, rapid changes in the lighting of the scene,
such as those caused by the sun suddenly going behind a cloud or the lights going
off, can introduce major errors.
3. Eigen-space Decomposition of Background: Another approach to
background-modeling-based object detection is eigen-space decomposition, which
is less sensitive to illumination changes. In this method, foreground objects are
detected by projecting the current image onto the eigen-space and calculating the
difference between the reconstructed and actual images.
4. Hidden Markov Model: Hidden Markov Models (HMMs) have recently been widely
used for background subtraction. An HMM represents the intensity variations of a
pixel in an image sequence as discrete states that correspond to events in the
environment. Rittscher et al. [45] used HMMs to classify
small blocks of an image into three states. Stenger et al. [46] used HMMs for
background subtraction in the context of detecting light on/off events
in a room.
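The thresholding rule for static background subtraction and a recursive background update can be sketched in a few lines of Python. This is a minimal illustration, assuming grayscale frames stored as NumPy arrays. The running-average update with learning rate `alpha` is one simple recursive scheme chosen for illustration, and the function names are hypothetical rather than taken from the methods cited above.

```python
import numpy as np


def foreground_mask(frame, background, threshold):
    """Mark pixels where |I_t(x, y) - B(x, y)| > T as foreground (True)."""
    diff = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    return diff > threshold


def update_background(background, frame, alpha=0.05):
    """Recursive running-average update of a single background model.

    No frame buffer is kept: each new frame is blended into the model,
    so old frames fade out gradually instead of being dropped.
    """
    return (1.0 - alpha) * background + alpha * frame


# Toy example (hypothetical values): a flat background of intensity 100
# and a frame in which a 2 x 2 patch has brightened to 200.
background = np.full((4, 4), 100.0)
frame = background.copy()
frame[1:3, 1:3] = 200.0

mask = foreground_mask(frame, background, threshold=50)
new_background = update_background(background, frame, alpha=0.05)
print(mask.sum(), new_background[1, 1])
```

A small `alpha` makes the model slow to absorb changes, which illustrates the trade-off noted for recursive techniques: low storage cost, but errors in the model persist for many frames.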
3.1.3 Supervised-learning-based Background Subtraction
Supervised-learning-based background subtraction methods can also be used for
object detection. A supervised learning mechanism learns different objects
automatically from a set of examples. Supervised learning methods generate
a function that maps inputs to desired outputs for a given set of learning examples.
In the standard formulation of supervised learning,
the learner approximates the behavior of a function by
generating an output in the form of either a continuous value (regression) or
a class label (classification). Learning approaches include boosting (Viola
et al. [47]) and support vector machines [48].
1. Adaptive Boosting: Boosting combines many base classifiers to
obtain accurate results. The first step of the training phase of the AdaBoost algorithm
is to construct an initial distribution of weights over the training set. In the next
step, the boosting mechanism selects the base
classifier with the least error, where the error of a classifier is proportional to the
weights of the data it misclassifies. Then, the weights of the data misclassified
by the selected base classifier are increased. Finally, in the next iteration the
algorithm selects another classifier that performs better on the misclassified
data.
2. Support Vector Machines: For linearly separable data, the samples can be
separated into two classes by finding the maximum-margin hyperplane
that separates one class from the other; the data points closest to the hyperplane
define the margin. The data points that
lie on the margin boundary are called the support vectors. For
object detection, the samples fall into two classes: the object
class (positive samples) and the non-object class (negative samples). To
apply an SVM classifier to data that are not linearly separable, a kernel trick is applied
to the input feature vector, which is extracted from the input.
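The adaptive boosting steps described above (initial weights, selection of the base classifier with the least weighted error, and re-weighting of misclassified samples) can be sketched as follows. This is a minimal version of discrete AdaBoost that assumes the base classifiers are given as a fixed pool of precomputed +1/-1 predictions. The function names and the toy data are illustrative assumptions, not the detectors of [47].

```python
import numpy as np


def adaboost(preds, y, rounds):
    """Discrete AdaBoost over a fixed pool of base classifiers.

    preds : (n_classifiers, n_samples) array of +1/-1 predictions.
    y     : (n_samples,) array of +1/-1 labels.
    Returns the indices of the selected classifiers and their vote weights.
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.full(y.size, 1.0 / y.size)          # initial uniform weights
    chosen, alphas = [], []
    for _ in range(rounds):
        miss = (preds != y).astype(float)      # 1 where a classifier is wrong
        errors = miss @ w                      # weighted error per classifier
        k = int(np.argmin(errors))             # base classifier with least error
        err = np.clip(errors[k], 1e-12, 1.0 - 1e-12)
        alpha = 0.5 * np.log((1.0 - err) / err)
        # Increase weights of samples the chosen classifier got wrong,
        # decrease the others, then renormalize to a distribution.
        w = w * np.exp(-alpha * y * preds[k])
        w = w / w.sum()
        chosen.append(k)
        alphas.append(alpha)
    return chosen, alphas


def predict(preds, chosen, alphas):
    """Sign of the weighted vote of the selected base classifiers."""
    preds = np.asarray(preds, dtype=float)
    score = sum(a * preds[k] for k, a in zip(chosen, alphas))
    return np.sign(score)


# Toy pool (hypothetical): each base classifier is wrong on one sample.
y = np.array([1, 1, 1, -1, -1, -1])
pool = np.array([
    [-1, 1, 1, -1, -1, -1],   # wrong on sample 0
    [1, 1, 1, 1, -1, -1],     # wrong on sample 3
    [1, 1, 1, -1, -1, 1],     # wrong on sample 5
])
chosen, alphas = adaboost(pool, y, rounds=3)
print(chosen, predict(pool, chosen, alphas))
```

In this toy pool no single base classifier labels every sample correctly, but after three rounds the weighted vote does, because each round shifts weight toward the sample that the previously selected classifier got wrong.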
3.1.4 Point Detectors
Point detectors are used to find useful points in images that have an expressive
texture in their respective localities [37]. A useful interest point is one that is
invariant to changes in illumination and camera viewpoint. Commonly used
interest point detectors include Moravec’s detector, the Harris detector, the KLT
detector, the SIFT detector, and SURF [49]. The most important part of these algorithms
is matching the point descriptors in consecutive frames, and noise
weakens this type of algorithm. Another problem with point detector algorithms
is that they are computationally very expensive and do not work well under
illumination changes.
3.1.5 Feature-based Object Detection
In feature-based object detection, standardization of image features is important.
One or more features are extracted from the image and objects of interest are
modeled in terms of these features. Features may be the shape, size or color of
objects. In object detection, selecting the right features plays an important
role. To clearly distinguish the objects in the feature space, we need to find
visual features that are unique to the object.
1. Color Features: Unlike many other image features (e.g. shape), color is
relatively constant under viewpoint changes and is easy to acquire.
Although color is not always appropriate as the sole means of detecting and
tracking objects, the low computational cost of the algorithms proposed
makes color a desirable feature to exploit when appropriate. Color feature
descriptors are used to increase the discriminative power of intensity-based
descriptors [50]. The RGB color space is usually used to describe the color
information of an object. However, RGB is not a perceptually uniform color
space. Another color space used is HSV (Hue, Saturation and Value), which
is an approximately uniform color space. With respect to the intensity of light,
the HSV color model is scale-invariant as well as shift-invariant. Two physical
factors primarily influence the apparent color of an object: 1) the spectral
power distribution of the illuminant and 2) the object’s surface reflectance
properties. Therefore, there is no single color space that can define the
features of an object perfectly. Color descriptors in recent studies can be
classified into novel histogram-based color descriptors and Scale Invariant
Feature Transform (SIFT) based [51] color descriptors. In the SIFT descriptor,
the intensity channel is a combination of the R, G and B channels. Therefore,
the SIFT descriptor is not invariant to changes in light color. Sebastien et al. [52]
dealt with object learning using color information. They developed the GHOSP
(Genetic Hybrid Optimization Search of Parameters) algorithm, which uses
multidimensional observations taken from RGB color images that
contain the object to be learnt. Zhenjun et al. [53] used a combined feature set
built from color histogram (HC) bins and gradient orientation histogram
(HOG) bins, considering both the color and contour representation of an
object for object detection. The combined feature set is an evolution of
color histograms, edge orientation histograms and SIFT descriptors.
2. Edge Features: Changes in image intensity are strongly related
to object boundaries, because the intensity
changes abruptly at an object boundary. Edge detection techniques
are used to identify these abrupt changes. Compared to color features, edge
features are less sensitive to illumination changes. The Canny edge detector is the
one most commonly used to find the edges of an object. The Roberts, Sobel and
Prewitt operators are also used for finding edges.
3. Texture Features: In comparison to color features and edge features,
texture features require a processing step to generate the descriptors.
Local Binary Patterns (LBP) are known as an efficient texture feature.
The LBP is defined as a gray-scale invariant
texture measure, derived from a general definition of texture in a local
neighborhood. The most important property of the LBP operator is its tolerance
to illumination changes.
4. Optical Flow: Optical flow is one of the most widely used methods in
motion-based object segmentation and tracking applications. It
is also used for tracking objects in video with a moving background or in a
scene captured by a moving camera. Optical flow is a dense field of displacement
vectors that defines the translation of each pixel in a region. Optical
flow methods [58] involve calculating the optical flow field of the image and
clustering according to the optical flow distribution characteristics
of the image. Optical flow is computed under the brightness constancy constraint,
i.e. the assumption that the brightness of corresponding pixels is constant in
consecutive frames. This method is very attractive for detection, and Dalal
et al. [59] developed a detector that could be used to analyze film and TV
content, or to detect pedestrians from moving cars, applications in which the
camera and the background often move as much as the people in the scene. Their work
studies oriented histograms of various kinds of local differences or differentials
of optical flow as motion features, evaluating these both independently and
in combination with the Histogram of Oriented Gradient (HOG) appearance
descriptors.
The downsides of optical flow are the large quantity of calculations and the
sensitivity to noise, which make it unsuitable for
real-time object detection and tracking.
5. Spatio-temporal Features: Local spatio-temporal features have recently
become very popular. These features provide a visual representation
for recognition of actions and visual object detection [65]. Local spatio-temporal
features capture salient and motion pattern characteristics in videos.
They provide a representation of events that is relatively independent of
the spatio-temporal shifts and scales of the events, background
clutter and multiple motions in the scene. Space-time contours are used
for the low-level representation of an object such as a pedestrian.
6. Gradient Features: Gradient features are important in object detection
in video sequences. Gradient-based methods use the shape/contour of
an object, such as the human body, to represent it.
HOG Features
Since we used HOG features in our algorithm, this approach is explained in
more detail. Histograms of Oriented Gradients (HOG) are feature descriptors
used for object detection. HOG features have become very popular in
computer vision algorithms for object detection and have been heavily used in recent
years. The technique counts occurrences of gradient orientation in localized
portions of an image. The fundamental assumption behind the
HOG descriptors is that local object appearance and shape within an image
can be described by edge directions. The descriptors can
be obtained by dividing the image into small connected regions called cells,
and for each cell compiling a histogram of gradient directions or edge orientations
for the pixels within the cell. The combination of these histograms
represents the descriptor. For improved accuracy, the local histograms can be
contrast-normalized by calculating a measure of the intensity across a larger
region of the image called a block, and then using this value to normalize
all cells within the block. This normalization results in better invariance to
changes in illumination or shadowing. Since the HOG descriptor operates on
localized cells, the method upholds invariance to geometric and photometric
transformations, except for object orientation [56].
There are four major steps in using HOG features for object detection. These
steps are as follows:
Gradient Computation:
The first step in finding HOG features is computing the gradient values.
Often a 1-D centered, point discrete derivative mask is applied in the horizontal
direction, the vertical direction, or both. For this purpose, the color or intensity
data of the image is filtered with the following masks:
[−1, 0, 1] and [−1, 0, 1]^T
Orientation Binning:
The second step of calculation involves creating the cell histograms. Each
pixel within the cell casts a weighted vote for an orientation-based histogram
channel based on the values found in the gradient computation. The cells
themselves can either be rectangular or radial in shape, and the histogram
channels are evenly spread over 0 to 180 degrees or 0 to 360 degrees, depend-
ing on whether the gradient is “unsigned” or “signed”. Dalal and Triggs
found that unsigned gradients used in conjunction with 9 histogram chan-
nels performed best in their human detection experiments. As for the vote
weight, pixel contribution can either be the gradient magnitude itself, or
some function of the magnitude. In tests the gradient magnitude itself gen-
erally produces the best results. Other options for the vote weight could
include the square root or square of the gradient magnitude, or some clipped
version of the magnitude [56].
Descriptor Blocks:
In order to account for changes in illumination and contrast, the gradient
strengths must be locally normalized, which requires grouping the cells to-
gether into larger, spatially connected blocks. The HOG descriptor is then
the vector of the components of the normalized cell histograms from all of
the block regions. These blocks typically overlap, meaning that each cell con-
tributes more than once to the final descriptor. Two main block geometries
exist: rectangular R-HOG blocks and circular C-HOG blocks. R-HOG blocks
are generally square grids, represented by three parameters: the number of
cells per block, the number of pixels per cell, and the number of channels
per cell histogram. In the Dalal and Triggs human detection experiment, the
optimal parameters were found to be 3x3 cell-blocks of 6x6 pixel cells with 9
histogram channels. Moreover, they found that some minor improvement in
performance could be gained by applying a Gaussian spatial window within
each block before tabulating histogram votes in order to weight pixels around
the edge of the blocks less. The R-HOG blocks appear quite similar to the
SIFT descriptors; however, despite their similar formation, R-HOG blocks
are computed in dense grids at some single scale without orientation align-
ment, whereas SIFT descriptors are computed at sparse, scale-invariant key
image points and are rotated to align orientation. In addition, the R-HOG
blocks are used in conjunction to encode spatial form information, while
SIFT descriptors are used singly.
C-HOG blocks can be found in two variants: those with a single, central cell
and those with an angularly divided central cell. In addition, these C-HOG
blocks can be described with four parameters: the number of angular and ra-
dial bins, the radius of the center bin, and the expansion factor for the radius
of additional radial bins. Dalal and Triggs found that the two main variants
provided equal performance, and that two radial bins with four angular bins,
a center radius of 4 pixels, and an expansion factor of 2 provided the best
performance in their experimentation. Also, Gaussian weighting provided no
benefit when used in conjunction with the C-HOG blocks. C-HOG blocks
appear similar to Shape Contexts, but differ strongly in that C-HOG blocks
contain cells with several orientation channels, while Shape Contexts only
make use of a single edge presence count in their formulation [56].
Block normalization:
Dalal and Triggs [56] explore different methods for block normalization. Overall,
the performance improves significantly compared to non-normalized
data.
SVM classifier:
The final step in object recognition using HOG descriptors is to feed the
descriptors into a supervised pattern recognition system, such as an SVM classifier
[57]. Once trained on images containing some particular object, the
SVM classifier can make decisions regarding the presence of an object, such
as a human or an animal, in additional test images.
7. Multiple Features Fusion: Multi-feature fusion schemes have achieved
a strong boost in performance and robustness in the fields of computer vision,
multimedia, and audio–visual speech processing [65].
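The HOG steps described under Gradient Features (gradient computation with the [−1, 0, 1] masks, orientation binning with magnitude-weighted votes, grouping of cells into overlapping blocks, and block normalization) can be sketched with NumPy as follows. This is a simplified illustration and not the implementation used in this work: it assumes a grayscale image, assigns each vote to a single bin without interpolation, uses L2 block normalization, and omits the Gaussian window. The defaults follow the 6x6 pixel cells, 3x3 cell blocks and 9 unsigned channels quoted above, and the function name `hog_descriptor` is hypothetical.

```python
import numpy as np


def hog_descriptor(image, cell=6, block=3, bins=9, eps=1e-6):
    """Simplified R-HOG descriptor of a 2-D grayscale image."""
    img = np.asarray(image, dtype=np.float64)

    # 1. Gradient computation with centered [-1, 0, 1] masks
    #    (border pixels are left at zero).
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    magnitude = np.hypot(gx, gy)

    # 2. Orientation binning: unsigned gradients over 0 to 180 degrees,
    #    each pixel votes with its gradient magnitude.
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            rows = slice(cy * cell, (cy + 1) * cell)
            cols = slice(cx * cell, (cx + 1) * cell)
            hist[cy, cx] = np.bincount(
                bin_index[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=bins,
            )

    # 3. Descriptor blocks with L2 normalization; blocks overlap, so each
    #    cell contributes to several blocks.
    blocks = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = hist[by:by + block, bx:bx + block].ravel()
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(blocks)


# Toy input (hypothetical): a 24 x 24 image with one vertical step edge.
image = np.zeros((24, 24))
image[:, 12:] = 1.0
descriptor = hog_descriptor(image)
print(descriptor.shape)
```

For the toy image, the 4x4 grid of cells yields 2x2 overlapping blocks of 81 values each, and all of the energy falls in the first orientation channel because the only gradients are horizontal. The resulting vector is what would be passed to a classifier such as an SVM in the final step.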
3.1.6 Shape-based Object Detection
Shape-based object detection is a complex problem because of the difficulty
of segmenting the objects of interest in the images. Detecting the objects and
characterizing their shape become more difficult in complex scenes that contain
many objects with occlusions and shading.
3.1.7 Template-based Object Detection
If a template describing a specific object is available, object detection becomes a
process of matching features between the template and the image sequence under
analysis. There are two types of object template matching, fixed and deformable
template matching. Fixed template matching is useful when object shapes do not
change with respect to the viewing angle of the camera. The main technique
used in fixed template matching is correlation, which is generally
robust to noise and illumination effects in the images, but suffers from high
computational complexity caused by summations over the entire template. Jiyan
et al. [67] proposed the Content-Adaptive Progressive Occlusion Analysis (CAPOA)
algorithm, which analyzes the occlusion situation within a given region of interest
and generates a corresponding template mask. Detection of a reappearing target is
somewhat difficult with this method. Deformable template matching approaches
are more suitable for cases where objects vary due to rigid and non-rigid
deformations. Because of the deformable nature of objects in most videos, deformable
models are more appealing in tracking tasks. Zhong et al. [68] proposed a
method for object detection using prototype-based deformable template models.
A deformed template is obtained by applying a parameterized deformation
transform to the prototype. The prototype-based template combines both the global
structure information and local image cues to derive an interpretation. Xiaobai
Liu et al. [69] proposed hybrid online templates for object detection, in which
the template consists of multiple types of features, including edges, texture
regions, and flatness regions. The limitation of this method is that the
discriminative power of the features changes as the object moves. This means
that the hybrid template should be
adaptively updated by either adjusting the feature confidence or substituting the
old features with newly discovered ones from the currently observed frames.
3.1.8 Classifier-based Object Detection
In this approach the detection problem becomes a classification problem between
two classes of background (negative) and foreground (positive). Different
features, such as color and texture, may be used for classification. Liu et al. [66]
presented a semiautomatic segmentation method for single video object extraction.
The proposed method formulates the separation of the video objects from the
background as a classification problem. Each frame is divided into small blocks of
uniform size, which are called object blocks if their center pixels belong to the
object, and background blocks otherwise. After a manual segmentation of the first
frame, the blocks of this frame are used as the training samples for the
object-background classifier. Yuhua et al. [70] presented a face detection method for
video sequences using classification. First, a classifier with a set of parameters
is built based on knowledge of the object of interest. Then both positive
and negative sample data are fed into the classifier to adjust those parameters.
There is a mapping between the object and the classifier. For complex objects,
multiple classifiers need to be integrated. The limitation of this method is
that more object features need to be embedded to train the object model under
different environment and lighting conditions.
3.1.9 Deep Neural Networks and Convolutional Neural Networks
Deep Neural Networks (DNNs) have recently emerged as a powerful machine learning
model and have shown outstanding performance on image classification tasks [107].
DNNs exhibit major differences from traditional
approaches for classification. Most notably, they are deep architectures, which have the
capacity to learn more complex models than shallow ones. This expressivity,
together with robust training algorithms, allows them to learn powerful object
representations without the need to hand-design features. Convolutional Neural
Networks (CNNs) were heavily used in the 1990s [106], but then fell out of fashion
with the rise of support vector machines. In 2012, Krizhevsky et al. [107] rekindled
interest in CNNs by showing substantially higher image classification accuracy on
the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [108, 109]. Their
success resulted from training a large CNN on 1.2 million labeled images, together with a
few twists on LeCun's CNN (e.g., the max(x, 0) rectifying non-linearity and "dropout"
regularization).
However, an important consideration in using DNNs and CNNs is that, although
they can provide very good detection results, they are computationally very
expensive. Krizhevsky et al. [107] used over a million training images and multiple
GPUs to speed up training, and did not use pre-training, which would likely
help but takes more time.
3.1.10 Comparison between Detection Algorithms and Finding a Suitable Approach for our Problem
While there are many algorithms in the literature for object recognition and
localization, finding a suitable approach for our problem required understanding
the pros and cons of each. We did not exhaust
all available approaches in the literature, but we tried many algorithms to reach
a reasonable solution. Following the categories presented in the previous sections, we
discuss the reasoning behind our algorithm and the methods we tried while
developing it.
First, segmentation approaches do not need training, but they usually do not work well
when there are multiple objects in the image and the background and foreground
are similar. We tried mean shift and CamShift on our dataset, but after a
few frames the detection box loses the primates, as expected, because the color
distributions of the primates and the background clutter can become similar. Second,
background subtraction algorithms are commonly used as a pre-processing step
in most detection approaches, but they are not sufficient for complex scenarios.
In our experiments we tried GMMs and simple background subtraction;
neither provides good accuracy on its own, and both
produce silhouettes of the primates. Therefore, we used a simple background
subtraction algorithm only as a pre-processing step, to save computation and eliminate the
obvious background portions of the image frame. Third, for supervised-learning-based
background subtraction, we trained an SVM classifier using color information,
but it failed to classify the background accurately because there are more than
two classes of color distribution in our images. Fourth, point detectors are common
algorithms for object detection. We tried SIFT, SURF, and KLT. All of these
algorithms are computationally very expensive, but their most important problem
is matching the point descriptors across consecutive frames;
noise and too many interest points make these algorithms weak
for our purpose. Fifth, for feature-based object detection, we used both color
and HOG features. Texture, contour, and edge features do not work well in
our scenario: the primates' shape is difficult to characterize, so contour, edge, and
texture are not distinctive features. We also did not pursue optical flow
approaches, as illumination changes and noise make these algorithms
weak in our situation. For classification, we used a binary classifier
operating on the color and HOG features extracted previously. We could have
used nonlinear classifiers, but to avoid extensive computation time we settled for
a linear classifier. We did not use template-based or shape-based algorithms, as the
primate shape is highly variable and these algorithms are not suitable for
objects with so much shape variability. Although CNNs and DNNs have produced
very compelling results in the past couple of years, we did not use these approaches.
The main reason is that they are computationally
very expensive and are usually run on GPUs to achieve reasonable computation
time.
3.2 Primate Detection in 2D
For our experiments, based on the algorithms discussed above, we used a three-step
detection algorithm. First, we created a background model for background
subtraction. Then we created a large training set for the desired objects (primates).
Finally, we used HOG and color features to train an SVM classifier to distinguish
foreground (primates) from background (everything else).
3.2.1 Background Subtraction
In our project we used a simple background subtraction method as a pre-processing
step. This step helps both computation speed and accuracy. Using several
frames from different times of the day, we created a static
background image as a reference. The criterion that we used for background subtraction
is as follows:
For each pixel $(x, y)$, let $I_{(x,y)} = \{t_1(x,y), \ldots, t_N(x,y)\}$ and
$$B(x, y) = \frac{1}{N} \sum_{t \in I_{(x,y)}} I_t(x, y)$$
For each pixel of a test image $T$, if $|T(x, y) - B(x, y)| < \text{Threshold}$, then ignore
$T(x, y)$.
Here $I_{(x,y)}$ is the set of intensity values at pixel $(x, y)$, $t_i(x,y)$ is the intensity value
of pixel $(x, y)$ at time stamp $t_i$, and $B$ is the reference background image. Finally,
to detect primates, each camera view was processed separately: a static reference
background image was created and all frames were equalized to match the reference
background in terms of illumination distribution. Figure 3.2 shows an example of
a test image after normalization. Using background subtraction as a
pre-processing step, the detection accuracy rises significantly and the
number of false positives decreases markedly.
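The criterion above can be sketched in a few lines. This is a minimal illustration on tiny grayscale frames stored as nested lists; the function names and the threshold value of 25 are assumptions made for the example and are not the settings used in our experiments.

```python
def mean_background(frames):
    """Per-pixel mean B(x, y) over N grayscale frames (each a list of rows)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(frame[y][x] for frame in frames) / n for x in range(width)]
            for y in range(height)]


def foreground_mask(test, background, threshold):
    """True where |T(x, y) - B(x, y)| >= threshold, i.e. the pixel is kept."""
    return [[abs(t - b) >= threshold for t, b in zip(t_row, b_row)]
            for t_row, b_row in zip(test, background)]


# Two 2x2 reference frames taken at different times (toy values).
frames = [[[10, 12], [200, 50]],
          [[14, 8], [204, 54]]]
background = mean_background(frames)

# Pixels close to the reference are ignored; the rest are passed to the detector.
test_image = [[13, 90], [201, 120]]
mask = foreground_mask(test_image, background, threshold=25)
```

Only the pixels where the mask is true need to be examined by the later, more expensive detection stages.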
3.2.2 Using HOG and Color Features and Classification
In this part: (1) a training set of primate bounding boxes was manually generated
for various positions and poses for each view; (2) background subtraction was
employed to eliminate obvious non-primate portions of (test) frames; (3) image
features including HOG and Average RGB (aRGB) were extracted (the basic
detection sliding window sizes are set to [60, 60], [60, 100], [100, 60], [100, 100]); (4)
an SVM classifier was trained as a primate detector, using 10-level image pyramids
to detect instances at multiple scales, with a scale step size of 1.05. The same HOG
and aRGB features are extracted over the windows at all locations, and a linear SVM
classifier is run to decide whether an instance is a primate or not; (5) finally, multiple
detections are fused with non-maximum suppression.
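The final fusion step can be illustrated with a common greedy form of non-maximum suppression. The text does not state which variant or overlap threshold was used, so the intersection-over-union criterion and the 0.5 threshold below are assumptions for the sake of the sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap a kept one.

    detections: list of (box, score) pairs, score being the classifier output.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two overlapping windows on the same primate and one window elsewhere.
detections = [((10, 10, 70, 70), 0.9),
              ((15, 12, 75, 72), 0.8),
              ((200, 50, 300, 150), 0.7)]
kept = non_max_suppression(detections)
```

In this toy case the second window is suppressed by the first, and the third survives because it does not overlap either.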
3.2.3 Primate Identification
Each of the primates in the cage has a collar with a certain color. These colors
differ for the two groups of primates. For the group of six, the colors are red,
yellow, brown, purple, green, and blue. For the group of four, the colors
are black, green, red, and blue. We developed an algorithm to find the likelihood of
these colors appearing in the detected primate bounding boxes. To find these
colors, we used the RGB values of each color with a tolerance of 50 gray
levels. For example, the RGB value of red is [255,0,0], and we assumed
a pixel is red if its red component is at least 205 and its
green and blue components are less than 50. After counting the
votes for each of the desired colors in a detected bounding box, based on the
number of pixels found for each color, we decided whether we could determine the color
of the primate's collar in that bounding box. The vote count should be higher
than a minimum threshold (at least 20 pixels) and lower than a maximum threshold
(at most 50 pixels) for the collar to be considered known. Since the collar colors
are not visible at all times, identifying the primates in all detected bounding boxes
is not possible. However, we can intermittently identify some of the detected
bounding boxes. Combining the results of detection and identification, we obtain
color labels for primates. This intermittent identification is then used for tracking.

Figure 3.2: Background normalization using the static background image (panels: Camera-1, Camera-2, Camera-3, Camera-4).
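The voting rule can be sketched as follows. Only the reference value for red is given in the text; the pure-primary values for green and blue, the helper names, and the rule for choosing among several qualifying colors are assumptions made for this illustration.

```python
# Reference collar colors. Only red = (255, 0, 0) is stated in the text;
# the other two entries are assumed pure primaries.
COLLAR_COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}
TOLERANCE = 50
MIN_VOTES, MAX_VOTES = 20, 50


def matches(pixel, reference, tolerance=TOLERANCE):
    """Per-channel test following the red example: channels whose reference is
    255 must be at least 255 - tolerance, channels whose reference is 0 must
    be below tolerance. Only handles references built from 0 and 255."""
    for value, ref in zip(pixel, reference):
        if ref == 255 and value < 255 - tolerance:
            return False
        if ref == 0 and value >= tolerance:
            return False
    return True


def identify_collar(pixels):
    """Return the collar color whose vote count lies in [MIN_VOTES, MAX_VOTES],
    or None when no color qualifies. Ties are broken by the larger count
    (an assumption; the text does not discuss this case)."""
    votes = {name: sum(matches(p, ref) for p in pixels)
             for name, ref in COLLAR_COLORS.items()}
    candidates = {n: v for n, v in votes.items() if MIN_VOTES <= v <= MAX_VOTES}
    return max(candidates, key=candidates.get) if candidates else None


# A bounding box with 30 reddish collar pixels among 500 gray fur pixels.
box_pixels = [(230, 20, 10)] * 30 + [(90, 90, 90)] * 500
label = identify_collar(box_pixels)
```

Boxes with too few or too many matching pixels return no label, which corresponds to the intermittent identification described above.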
Chapter 4
Object Tracking
4.1 Definition and Common Algorithms
Object tracking algorithms process a sequence of consecutive video frames and
obtain the movement of objects between the frames. Usually the final goal in video
analysis is to recognize the activities and behaviors of objects of interest, and tracking
is an intermediate step toward this goal. Some of the main application areas that require
accurate object tracking are as follows:
1. Motion-based recognition: human identification based on gait, automatic
object detection
2. Automated surveillance: monitoring a scene to detect suspicious activities
or unlikely events
3. Video indexing: automatic annotation and retrieval of the videos in multi-
media databases
4. Human-computer interaction: gesture recognition, eye gaze tracking for data
input to computers
5. Traffic monitoring: real-time gathering of traffic statistics to direct traffic
flow
6. Vehicle navigation: video-based path planning and obstacle avoidance capa-
bilities
7. Human and animal behavior analysis: recognize and understand human/an-
imal activities
In its simplest form, tracking can be defined as the problem of estimating the
trajectory of an object in the image plane as it moves around a scene. In other words,
a tracker assigns consistent labels to the tracked objects in different frames of a
video. Object detection and the establishment of correspondence between
instances of the object across frames can be done separately or jointly. In the first
case, possible object regions in every frame are obtained by an object detection
algorithm, and the tracker then establishes object correspondence across frames.
In the latter case, information obtained from previous
frames helps in finding the object region, and the region and the correspondence
are estimated jointly by iteratively updating the object's region and location.
Additionally, depending on the tracking domain, a tracker can also provide object-centric
information, such as orientation, area, or shape of an object. Some of the major
challenges in object tracking are:
1. Loss of information caused by projection of the 3D world on a 2D image.
2. Noise in images.
3. Complex object motion.
4. Non-rigid or articulated nature of objects.
5. Partial and full object occlusions.
6. Complex object shapes.
7. Illumination changes.
8. Real-time processing requirements of objects.
Several approaches for object tracking have been proposed. Almost all tracking
algorithms assume that the object motion is smooth with no abrupt changes. One
can further constrain the object motion to be of constant velocity or constant
acceleration based on a priori information. Prior knowledge about the number
and the size of objects, or the object appearance and shape, can also be used to
simplify the problem. Based on our detection algorithm, we introduce the Kalman
filter, one of the most common algorithms used for object tracking,
which works well with the dynamics of our system. After introducing the Kalman
filter and its main characteristics, we discuss the details of this approach for our
case.
4.1.1 Kalman Filter
The Kalman filter is an optimal single-object state estimator, i.e., it infers parameters
of interest from indirect, inaccurate, and uncertain observations to estimate the
current state of a variable or object. It is used as an estimator to
predict and correct the system state, and it helps in studying system dynamics, estimation,
analysis, control, and processing. It is not only powerful in practice but also
well founded theoretically. The Kalman filter can efficiently estimate the past, present, and
future states of an object or variable. It is recursive, so new measurements
can be processed as they arrive. For a linear system with white Gaussian noise, the
Kalman filter yields the optimal estimate: if all the system noise is Gaussian, the
Kalman filter minimizes the mean square error of the estimated parameters,
and if the noise is not Gaussian, given only the mean
and standard deviation of the noise, the Kalman filter is the best linear estimator. In
image analysis, the Kalman filter can be used to estimate the location of an object in
consecutive frames, given the approximate location of the object from detection
and some prior information on its type of movement. For example,
to track a walking pedestrian who is occasionally occluded
behind trees or other people, we can use a detection algorithm to find his/her
location in some frames, assume that he/she walks with a constant velocity,
and then estimate the position of the pedestrian in every frame using the Kalman
filter.
The Kalman filter has become widely used in the past decades because:
1. It gives good results in practice due to its optimality and structure
2. Its form is convenient for online real-time processing
3. It is easy to formulate and implement given a basic understanding
4. Its measurement equations need not be inverted
The Kalman filter uses a system's dynamic model (e.g., physical
laws of motion), known control inputs to that system, and multiple sequential
measurements (from sensors) to form an estimate of the system's time-varying
state that is better than an estimate obtained from any single measurement alone
[72].
Dynamic System Model:
To formulate a Kalman filter problem, we require a discrete-time linear dynamic
system described by a vector difference equation with additive white noise that
models unpredictable disturbances.
The state of a deterministic dynamic system is the smallest vector that summarizes
the past of the system in full. Knowledge of the state theoretically allows prediction
of the future (and prior) dynamics and outputs of the deterministic system in the
absence of noise. The state of the filter is represented by two variables:
$x_{n|m}$, the a posteriori state estimate at time $n$ given observations up to and including
time $m$, with $m \le n$; and $P_{n|m}$, the a posteriori error covariance matrix (a measure
of the estimated accuracy of the state estimate).
The Kalman filter model assumes that the true state at time $k$ evolves from the
state at $(k-1)$ according to:
$$X_k = F_k X_{k-1} + B_k u_k + w_k \qquad (4.1)$$
where $F_k$ is the state transition model, which is applied to the previous state $X_{k-1}$;
$B_k$ is the control-input model, which is applied to the control vector $u_k$; and
$w_k$ is the process noise, which is assumed to be drawn from a zero-mean
multivariate normal distribution with covariance $Q_k$.
There are two distinct phases in a Kalman filter: predict and update. The
prediction phase uses the state estimate from the previous time step to produce an estimate
of the state at the current time step. This predicted state is also known as the "a
priori state estimate" since it does not include information from the latest observation.
In the update (correction) phase, the a priori state estimate is combined with the current
observation to refine the state estimate. This improved estimate is termed the "a
posteriori state estimate" [73].
Predict:
Predicted (a priori) state estimate: $x_{k|k-1} = F_k x_{k-1|k-1} + B_k u_k$
Predicted (a priori) estimate covariance: $P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k$
Update:
Innovation or measurement residual: $y_k = z_k - H_k x_{k|k-1}$
Innovation (or residual) covariance: $S_k = H_k P_{k|k-1} H_k^T + R_k$
Optimal Kalman gain: $K_k = P_{k|k-1} H_k^T S_k^{-1}$
Updated (a posteriori) state estimate: $x_{k|k} = x_{k|k-1} + K_k y_k$
Updated (a posteriori) estimate covariance: $P_{k|k} = (I - K_k H_k) P_{k|k-1}$
where $H_k$ is the observation matrix, $z_k$ is the observation at time $k$, and $R_k$ is the
covariance of the observation noise. The formulas for the updated
estimate and covariance above are valid for the optimal Kalman gain.
The Extended Kalman filter (EKF) is a nonlinear version of the Kalman filter.
Extended Kalman filtering shows faster convergence in terms of
iterations than traditional methods, though the cost of each iteration is higher.
There may also be cases where the EKF finds better or more robust solutions
than the Kalman filter.
4.1.2 Particle Filter
A limitation of the Kalman filter is that it assumes the state variables are normally
distributed (Gaussian). It gives poor estimates for state variables that do
not follow a Gaussian distribution. This limitation can be overcome
with the help of particle filtering [74]. Since their introduction in 1993, particle
filtering methods have become a very popular class of algorithms for solving such
estimation problems numerically in an online manner, i.e., recursively as
observations become available, and are now routinely used in fields as diverse as computer
vision, econometrics, robotics, and navigation [110].
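To make the recursive, sample-based nature of particle filtering concrete, the following is a minimal sketch of the bootstrap (sampling, weighting, resampling) variant for a one-dimensional state. The random-walk motion model, the Gaussian measurement likelihood, and all parameter values are assumptions chosen for illustration; any likelihood function could be substituted in the weighting step.

```python
import math
import random


def bootstrap_particle_filter(measurements, n_particles=500,
                              process_std=0.5, meas_std=1.0, seed=0):
    """Estimate a 1D state from a sequence of noisy measurements."""
    rng = random.Random(seed)
    # Initial particles spread uniformly over an assumed range of states.
    particles = [rng.uniform(0.0, 10.0) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # Predict: propagate each particle through a random-walk motion model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # Weight: likelihood of the measurement given each particle.
        weights = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Estimate: weighted mean of the particle set.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Twenty measurements of a state that stays near 5.
estimates = bootstrap_particle_filter([5.0] * 20)
```

Because the posterior is represented by samples rather than by a mean and covariance, nothing in the loop requires the state distribution to be Gaussian.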
4.1.3 Multiobject Data Association and State Estimation
The Kalman filter, extended Kalman filter, and particle filter give very good results
when the objects are not close to each other. To track multiple objects in a
video sequence using Kalman or particle filters, the most likely measurement for
a particular moving object needs to be associated with that object's state. This
is called the correspondence problem. For multiple-object tracking, the most
important step is therefore to solve the correspondence problem before the Kalman or
particle filters are applied. The nearest neighbor approach is the simplest method for
solving the correspondence problem. However, the correspondence problem is hard to
solve when the moving objects are close to each other, and the resulting correspondences
may be incorrect. These filters fail to converge when measurements are incorrectly
associated. Several statistical data association techniques exist to tackle
this problem. Two commonly used techniques for data association in such complex
scenarios are the Joint Probabilistic Data Association Filter (JPDAF) and Multiple
Hypothesis Tracking (MHT) [110].
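The nearest neighbor approach mentioned above can be sketched as a greedy matching between the last known track positions and the detections in the current frame. The function name, the dictionary-based track representation, and the gating distance of 50 pixels are assumptions for this illustration.

```python
import math


def nearest_neighbor_association(tracks, detections, max_distance=50.0):
    """Greedily match each track to its closest unused detection.

    tracks: dict mapping a track label to its last (x, y) position.
    detections: list of (x, y) positions in the current frame.
    Returns a dict mapping track label to detection index.
    """
    # All track/detection pairs, closest first.
    pairs = sorted((math.dist(position, detection), label, index)
                   for label, position in tracks.items()
                   for index, detection in enumerate(detections))
    assignment, used = {}, set()
    for distance, label, index in pairs:
        if distance > max_distance:
            break  # remaining pairs are even farther apart
        if label in assignment or index in used:
            continue
        assignment[label] = index
        used.add(index)
    return assignment


tracks = {"red": (100.0, 100.0), "blue": (300.0, 120.0)}
detections = [(305.0, 118.0), (104.0, 97.0), (600.0, 400.0)]
assignment = nearest_neighbor_association(tracks, detections)
```

When two objects come close together, the closest detection to each track may belong to the other object, which is exactly the failure case that motivates JPDAF and MHT.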
4.2 Primate Tracking in 2D
To track primates, our algorithm has three major steps.
1. Initialization:
The first step in the tracking algorithm is to initialize the tracker. To do so,
the first frame of the desired sequence is shown to the user, who is asked
to specify the location of each primate by placing a marker on the
center of the primate. Each primate is identified by the color of its collar,
and the user is also asked to give a confidence number between 0 and 1 when
marking the primate center. For example, if a primate with a certain
collar is fully occluded, the confidence number for that primate is
0; if a primate is partially occluded, based on the judgement of the user, a
confidence number less than 0.5 is given; and if a primate is totally visible, the
confidence is 1.
2. Nearest Neighbor and Data Association:
After initialization, the next step is to apply the nearest neighbor
algorithm to solve the correspondence problem. Since there is more than one
primate in the pen, the correspondence problem must be solved before applying the
Kalman filter. To do so, we used the occasional identification of
collars as benchmarks between frames and used the nearest neighbor algorithm
to create trajectories for each primate. To track primates
correctly, we applied some geometric constraints:
1) If a primate leaves the scene, it can only reappear at the margins
of the frame and not anywhere else. 2) The number of detected primates
cannot be more than 4, since there are at most 4 primates in the scene. 3) If the
number of detected bounding boxes changes from one frame to the next, a
newly added bounding box is not considered an outlier only if it is close
to one detected in the previous frame. Using these constraints and
the identification method, we were able to interpolate the trajectories using the
nearest neighbor algorithm.
3. Kalman filter:
The final step in our tracking algorithm is to apply a Kalman filter to the
trajectory of each primate to obtain a better trajectory estimate. The model
that we used for the Kalman filter is as follows:
initial acceleration magnitude = 0.005
Gaussian noise standard deviation of acceleration: $\text{noise}_{mag} = 0.1$
measurement noise in the x direction: $\text{noise}_x = 1$
measurement noise in the y direction: $\text{noise}_y = 1$
initial velocity magnitude in the x and y directions =
(position at frame 2 − position at frame 1)/t
$Q$ = estimate of the primate's state, which is what we are updating:
$[\text{position}_X, \text{position}_Y, \text{velocity}_X, \text{velocity}_Y]$
Measurement noise matrix:
$$E_z = \begin{bmatrix} \text{noise}_x & 0 \\ 0 & \text{noise}_y \end{bmatrix}$$
Estimate of initial primate position variance (covariance matrix): $E_x \times \text{noise}_{mag}^2$, where
$$E_x = \begin{bmatrix} \frac{t^4}{4} & 0 & \frac{t^3}{2} & 0 \\ 0 & \frac{t^4}{4} & 0 & \frac{t^3}{2} \\ \frac{t^3}{2} & 0 & t^2 & 0 \\ 0 & \frac{t^3}{2} & 0 & t^2 \end{bmatrix}$$
Coefficient matrices (state transition and input control):
State update matrix:
$$A = \begin{bmatrix} 1 & 0 & t & 0 \\ 0 & 1 & 0 & t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad B = \left[\frac{t^2}{2}, \frac{t^2}{2}, t, t\right]$$
Measurement function to apply to the state estimate $Q$ to get the expected new
measurement:
$$C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
Predict the next state of the primate from the last state and the predicted motion:
$Q_{estimate} = A \times Q_{estimate} + B \times u$
Predict next covariance: $P = A \times P \times A' + E_x$
Predicted primate measurement covariance: $P$
Kalman gain: $K = P \times C' \times \mathrm{inv}(C \times P \times C' + E_z)$
Update the state estimate:
$Q_{estimate} = Q_{estimate} + K \times (Q_{measurement} - C \times Q_{estimate})$
Using this method, the tracking accuracy improves. The advantage of this
algorithm is that when a primate leaves the scene and comes back, it can be tracked
using the color information from the primate's collar. Without this
information, using only NN correspondence or the Kalman filter, it is very hard
to accurately track an object that leaves the scene and comes back at a different
location with a different direction of movement.
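The constant-velocity Kalman model listed in step 3 can be sketched in plain Python as follows. Several choices here are assumptions for illustration: the time step t is taken as one frame, u is taken to be the acceleration magnitude 0.005, the process noise and the initial covariance are both set to E_x scaled by the squared acceleration noise, and the covariance update P = (I − KC)P is taken from the general update equations of Section 4.1.1. The synthetic straight-line track stands in for the nearest-neighbor trajectories.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(m, s):
    return [[x * s for x in row] for row in m]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

t = 1.0            # time between frames (assumed: one frame)
u = 0.005          # acceleration magnitude (assumed to be the control input)
noise_mag = 0.1    # standard deviation of the acceleration noise
A = [[1, 0, t, 0], [0, 1, 0, t], [0, 0, 1, 0], [0, 0, 0, 1]]
B = [[t**2 / 2], [t**2 / 2], [t], [t]]   # written as a column vector
C = [[1, 0, 0, 0], [0, 1, 0, 0]]
Ez = [[1.0, 0.0], [0.0, 1.0]]            # noise_x = noise_y = 1
Ex = scale([[t**4 / 4, 0, t**3 / 2, 0],
            [0, t**4 / 4, 0, t**3 / 2],
            [t**3 / 2, 0, t**2, 0],
            [0, t**3 / 2, 0, t**2]], noise_mag**2)
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def kalman_step(q, P, z):
    """One predict/update cycle; q is 4x1, P is 4x4, z is a 2x1 measurement."""
    q = add(matmul(A, q), scale(B, u))                        # predict state
    P = add(matmul(matmul(A, P), transpose(A)), Ex)           # predict covariance
    S = add(matmul(matmul(C, P), transpose(C)), Ez)
    K = matmul(matmul(P, transpose(C)), inv2(S))              # Kalman gain
    q = add(q, matmul(K, sub(z, matmul(C, q))))               # update state
    P = matmul(sub(I4, matmul(K, C)), P)                      # update covariance
    return q, P

# Synthetic trajectory: a primate moving 2 px/frame in x and 1 px/frame in y.
track = [(2.0 * k, 1.0 * k) for k in range(21)]
(x1, y1), (x2, y2) = track[0], track[1]
q = [[x1], [y1], [(x2 - x1) / t], [(y2 - y1) / t]]   # initial velocity from frames 1 and 2
P = Ex
for x, y in track[1:]:
    q, P = kalman_step(q, P, [[x], [y]])
```

In practice the measurement for each frame would come from the detector, and frames with no associated detection would run only the predict lines of the step.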
Chapter 5
Calibration and 3D Reconstruction
5.1 Camera Calibration
For more than a decade, researchers in computer vision have been
interested in digitizing time-varying events, recorded by video cameras
from multiple viewpoints, into 3D scenes. Usually the events in the videos are
human activities, and the ultimate goal is to let the observer view the event from
any arbitrary viewpoint. This is called free-viewpoint video. Some of the
applications of converting a scene into 3D models are: 1) 3D tele-immersion, 2) digitizing
rare cultural performances, 3) sports action, and 4) generating content for 3D
video-based realistic training and demos for surgery, medicine, and other technical
fields.
Currently, in all multi-camera systems [84–90], calibration and synchronization
must be done during an offline calibration stage before the actual video is
captured. A person has to enter the scene with a calibration object, such as a planar
calibration grid or a point LED, and shots of the person holding the calibration
object are taken from different angles. This offline step makes the calibration
process laborious: if the cameras move frequently and calibration is needed
more than once, the task has to be repeated every time.
5.1.1 Explicit Camera Calibration
Physical camera parameters are commonly divided into extrinsic and intrinsic
parameters. Extrinsic parameters are needed to transform object coordinates to a
camera-centered coordinate frame. In multi-camera systems, the extrinsic
parameters also describe the relationship between the cameras. The pinhole camera
model is based on the principle of collinearity, where each point in the object space
is projected by a straight line through the projection center into the image plane.
The intrinsic camera parameters include the effective focal length, the scale factor,
and the image center. This information is usually provided by the camera
manufacturer.
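As a minimal sketch of how these parameters are used, the function below projects a 3D point into pixel coordinates with an ideal pinhole model. It assumes the focal length is expressed in pixel units, applies the scale factor to the horizontal axis only, and ignores lens distortion; the function and parameter names are illustrative.

```python
def project_point(point_world, R, t, focal_length, scale_factor, center):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R (3x3 rotation, list of rows) and t (length-3 translation) are the
    extrinsic parameters; focal_length, scale_factor and center (u0, v0)
    are the intrinsic parameters.
    """
    # Extrinsics: world coordinates -> camera-centered coordinates.
    xc, yc, zc = (sum(r * p for r, p in zip(row, point_world)) + ti
                  for row, ti in zip(R, t))
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division, then scaling and shift to pixels.
    u = scale_factor * focal_length * xc / zc + center[0]
    v = focal_length * yc / zc + center[1]
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((0.5, -0.25, 2.0), identity, (0.0, 0.0, 0.0),
                      focal_length=800.0, scale_factor=1.0, center=(320.0, 240.0))
```

With several calibrated cameras, the same point projects to a different pixel in each view, and these correspondences are what the reconstruction in the next section relies on.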
5.2 Visual Hull
The earliest attempts at reconstructing 3D models from images used the
silhouettes of objects as sources of shape information. A 2D silhouette is the set
of closed contours that outline the projection of the object onto the image plane.
Segmenting the silhouettes from the rest of the image and combining them with
silhouettes taken from different views yields Shape-From-Silhouette (SFS). The
result of the SFS construction is an upper bound on the real object's shape, in
contrast to a lower bound, which is a big advantage for obstacle avoidance in
robotics or visibility analysis in navigation. One of the advantages of the
SFS technique is that silhouettes are easy to compute in simple
situations, such as an indoor environment with static illumination and static
cameras (without these assumptions it can be difficult to compute an accurate
silhouette from the images, because of shadows or moving backgrounds). Another
application of SFS estimation is motion capture [94]. On the other
hand, these techniques also have disadvantages. Usually these algorithms
are slow, which is an issue for real-time applications. The silhouette calculations
are relatively sensitive to errors such as bad camera calibration, which make the
resulting 3D shapes inaccurate. Furthermore, the result of any SFS algorithm is
just an approximation of the actual object's shape, especially if there are only a
limited number of cameras, and therefore this approach is not practical for
applications like detailed shape recognition or realistic shape reconstruction of objects
[94].
Laurentini introduced the term Visual Hull in 1991 [92]. If the camera
intrinsic and extrinsic parameters are known from calibration, then the visual
hull of objects [100, 101, 103] can be computed by intersecting the visual cones
corresponding to silhouettes captured from multiple views. The visual hull of a
3D object S is the maximal volume consistent with the silhouettes of S. A formal
definition of the Visual Hull (VH) was first introduced by Laurentini [100] as follows:
“The visual hull $VH(S,R)$ of an object $S$ relative to a viewing region
$R$ is a region of $E^3$ such that, for each point $P \in VH(S,R)$ and each
viewpoint $V \in R$, the half-line starting at $V$ and passing through $P$
contains at least a point of $S$.”[100]
From this definition it is easy to see that $S \subseteq VH(S,R)$. Directly
building visual hulls by intersecting the visual cones is very difficult in practice
due to the curved and irregular surfaces of objects, which result in a complex
geometrical representation of the cones. Therefore, approximation methods are
preferred. The polyhedral shape based approach [101] and the volume based approach
[102] are normally used for this purpose. We adopt the latter approach for its
efficiency. Algorithm 5.2 shows pseudocode for the approach.
1. Divide the 3D space of interest into N × N × N discrete voxels v_n,
   n = 1, ..., N^3.
2. Initialize all N^3 voxels as object voxels.
3. For n = 1 to N^3 {
     For k = 1 to K {
       Project v_n into the k-th image plane using the projection function P^k;
       If the projected area P^k(v_n) lies completely outside S^k, then
       classify v_n as a non-object voxel;
     }
   }
4. The visual hull VH is approximated by the union of all the object voxels.
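The carving loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `carve_visual_hull`, `projections`, and `silhouettes` are hypothetical, each projection is assumed to be a callable that maps a 3D point to pixel coordinates (u, v), and only the voxel center is tested instead of the full projected voxel area.

```python
def carve_visual_hull(voxel_centers, projections, silhouettes):
    """Keep the voxels whose centers project inside every silhouette.

    voxel_centers: iterable of (x, y, z) tuples.
    projections:   list of callables, one per view, mapping (x, y, z) -> (u, v).
    silhouettes:   list of 2D boolean grids (rows of columns), one per view.
    """
    object_voxels = []
    for center in voxel_centers:
        keep = True
        for project, mask in zip(projections, silhouettes):
            u, v = project(center)
            row, col = int(round(v)), int(round(u))
            inside_image = 0 <= row < len(mask) and 0 <= col < len(mask[0])
            # Simplification: test only the voxel center, not its projected area.
            if not (inside_image and mask[row][col]):
                keep = False
                break
        if keep:
            object_voxels.append(center)
    return object_voxels


# Toy scene: a 4x4x4 grid seen by two orthographic views (an assumption made
# only to keep the example small; real views use calibrated projections).
N = 4
grid = [(x, y, z) for x in range(N) for y in range(N) for z in range(N)]
square = [[1 <= r <= 2 and 1 <= c <= 2 for c in range(N)] for r in range(N)]
views = [lambda p: (p[0], p[1]),   # looking along the z axis
         lambda p: (p[1], p[2])]   # looking along the x axis
hull = carve_visual_hull(grid, views, [square, square])
print(len(hull))  # 8 voxels survive the carving
```

With only two views, the surviving 2 × 2 × 2 block is the intersection of the two silhouette cones; adding views can only remove further voxels, which is why the result always contains the true object.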
Another, more efficient way to compute an approximation of the visual hull is
the volume-based approach [96–99]. Even though this technique is simple and
fast, it has a significant disadvantage: the resulting shape is considerably
larger than the true object shape, which makes it suitable only for
applications in which an approximation suffices [94]. More recent approaches
use surface-based representations of the scene instead of a volumetric
representation, which allows regularization within an energy minimization
framework. These techniques are more robust to outliers and to erroneous
camera calibration. Furthermore, these approaches try to overcome the inability
to reconstruct concavities, which arises because concavities do not affect the
silhouettes, by additionally using stereo-based methods. These methods
repeatedly discard inconsistent voxels and therefore produce a smoother
reconstruction; the additional aim is to achieve photo-consistency [95].
Figure 5.1: 2D example of the visual hull approximation algorithm. C1, C2, C3
are different views with corresponding silhouettes S1, S2, S3. The yellow area
is the approximation of the visual hull; the area enclosed by black lines is
the actual visual hull; and the blue shape in the center is the object.
5.3 Calibration and Visual Hull Reconstruction
of Primates
5.3.1 Multiview Environment and Calibration
In order to determine the visual hull corresponding to a set of primate
silhouettes, the cameras that produced the images must be calibrated. This
means that the intrinsic camera parameters (such as focal length and principal
point) and the camera pose must be known, at least approximately. Camera
calibration is therefore a necessary step in building our 3D vision-assisted
observation environment. We use four cameras with different views as
quantitative sensors to recover 3D measurements of the observed scene from 2D
images. In our study, for example, a calibrated camera lets us measure how far
a primate is from the camera or how tall the primate is. Here we briefly
introduce the calibration algorithm we applied in our system and some
specifications of the environment.
The calibration algorithm we used is very similar to [? ] which estimates the
intrinsic parameters, including focal length, principal point, skew coefficient, and
distortions, and extrinsic parameters including rotations and translations.
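To make the role of the estimated parameters concrete, the following sketch projects a 3D point into an image with a pinhole model. It is an illustration under stated assumptions rather than the calibration code used here: the function name `project_point` and all numeric values are hypothetical, `skew` is taken to be the off-diagonal entry of the intrinsic matrix, and lens distortion is ignored.

```python
def project_point(point, fx, fy, cx, cy, rotation, translation, skew=0.0):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    # Extrinsics: world frame -> camera frame, X_cam = R * X_world + t.
    cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
           for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division onto the normalized image plane.
    xn, yn = cam[0] / cam[2], cam[1] / cam[2]
    # Intrinsics: focal lengths, skew, and principal point.
    u = fx * xn + skew * yn + cx
    v = fy * yn + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((2.0, 4.0, 10.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                     rotation=identity, translation=(0.0, 0.0, 0.0))
print(u, v)  # approximately 420 and 440
```

Inverting this mapping for a pixel requires additional information (a second view or a known plane), which is why several calibrated cameras are needed to recover distances and heights.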
5.3.2 3D Visual Hull Reconstruction of Primates
After calibration, we used the primate detection results to reconstruct the 3D
visual hulls of the primates in the pen. For each view, we have a detection log
that gives the bounding boxes around the primates; by combining the detection
results with the foreground obtained from background subtraction, we get a
better estimate of the location and shape of the primates in 2D. For each frame
and each view, we created a binary image with the primates as foreground and
the rest as background. Finally, we used these images to create the approximate
3D visual hull of the primates. Because we only have four cameras, obtaining an
accurate 3D visual hull of the primates was not feasible. We therefore decided
to process the videos in 2D and to fuse the information obtained from each view
separately at the end.
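One simple way to combine a detection log with a background-subtraction mask is sketched below. The text does not specify the exact combination rule, so the logical AND used here, the function name `silhouette_mask`, and the (x, y, width, height) box format are all assumptions made for illustration.

```python
def silhouette_mask(foreground, boxes):
    """Keep foreground pixels that fall inside at least one detection box.

    foreground: 2D grid of booleans from background subtraction.
    boxes:      list of (x, y, width, height) boxes in pixel coordinates.
    """
    height, width = len(foreground), len(foreground[0])
    mask = [[False] * width for _ in range(height)]
    for x, y, w, h in boxes:
        # Clip the box to the image before copying foreground pixels.
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                mask[row][col] = bool(foreground[row][col])
    return mask


foreground = [[False, True, True, False],
              [False, True, True, True],
              [True, False, False, True]]
mask = silhouette_mask(foreground, [(1, 0, 2, 2)])
print(sum(sum(row) for row in mask))  # 4 pixels kept
```

Foreground pixels outside every box (for example a moving swing) are dropped, and background pixels inside a box are not promoted to foreground.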
Chapter 6
Activity Recognition Based on
Spatial Relation
6.1 Activity Recognition
Initial work on activity recognition involved extracting one large description
from a video sequence. This could be a table of motion scale, rate, and
position within a segmented figure [76] or a table of the presence of motion at
each location [77]. Both techniques were able to distinguish some range of
activities, but because each was a full description of a video sequence rather
than a set of features extracted from it, it was difficult to use them as the
building blocks of a more complicated system.
Another approach to activity recognition is the use of explicit models of the
activities to be recognized. Domains where this approach has been applied
include face and facial expression recognition [78] and human pose modeling
[79]. These techniques can be very effective, but by their nature they cannot
offer general models of the information in video in the way that less
domain-specific features can. Furthermore, these techniques require
high-quality shots of the face with few occlusions, which are not available in
many scenarios such as ours: the faces of the primates are dark, which makes it
very hard to distinguish their facial expressions, and they are occluded in
many frames.
Recent work in activity recognition has been largely based on local spatio-temporal
features. Many of these features seem to be inspired by the success of statistical
models of local features in object recognition. In both domains, features are first
detected by some interest point detector running over all locations at multiple
scales. Local maxima of the detector are taken to be the center of a local spatial
or spatio-temporal patch, which is extracted and summarized by some descrip-
tor. Most of the time, these features are then clustered and assigned to words
in a codebook, allowing the use of bag-of-words models from statistical natural
language processing. Since at this point our system relies on finding the identi-
ties and locations of primates at consecutive frames, recognizing spatio-temporal
activities would be the best direction to pursue.
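The codebook assignment step described above can be illustrated with a short sketch. It is a generic bag-of-words quantization, not part of the system described in this dissertation; the function name `bag_of_words` and the toy two-dimensional descriptors are hypothetical, and the codebook is assumed to have been obtained beforehand by clustering.

```python
def bag_of_words(descriptors, codebook):
    """Histogram of nearest codewords for a set of local descriptors."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance from the descriptor to every codeword.
        distances = [sum((a - b) ** 2 for a, b in zip(d, word))
                     for word in codebook]
        histogram[distances.index(min(distances))] += 1
    return histogram


codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.2), (0.9, 1.2), (4.0, 6.0), (1.1, 0.8)]
histogram = bag_of_words(descriptors, codebook)
print(histogram)  # [1, 2, 1]
```

The histogram discards where and when each feature occurred, which is the property that makes bag-of-words models easy to feed to a standard classifier.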
6.2 Primate Activity Recognition
The task of primate activity recognition is to use the primate locations and
identities given by the tracking output to detect interesting activities that
we may want to explore or monitor. Some of these activities were listed in
Table 1 in the first chapter. However, not all of these activities can be
detected or classified reliably, even by human observers. For example, it is
very hard for a camera to capture activities related to small features such as
the lips or teeth of primates, because these features are tiny and easily
occluded. Other activities are too complex to classify correctly because a
single label covers many different actions. For example, “play” can include
moving, jumping, wrestling, and grunting, which makes it hard to classify.
Therefore, we focus on activities that are not subject to interpretation and
that we can label ourselves, without needing experts to validate the labels of
our training data.
6.2.1 Velocity Measures
Fortunately, there are several interesting activities that are important and
technically easy to detect and interpret. These activities include stationary,
locomotion, chasing, and avoiding. All of them can be defined using only the
position trajectories of the centers of the primates, which are available from
the tracking outputs. Specifically, we assume there are two basic activities:
stationary and moving. Moving includes self-moving and pairwise moving. We
define self-moving as “locomotion”, which involves only one primate. Pairwise
moving covers activities in which two primates move simultaneously and there is
a causal relationship between their movements. Although the pairwise moving
class contains many interesting activities, we only consider chasing and
avoiding as examples in this work. Each activity is defined by a few heuristics
that we developed from our observation of the sample videos. In the following,
we describe these heuristic features in detail.
1. Stationary: the velocity of a primate is smaller than a predefined threshold
Th1 at all times.
2. Moving: the velocity of a primate is greater than the predefined threshold
Th1 for a predefined number of frames.
3. Locomotion: the primate is “moving” but the movement does not belong to any
known pairwise activity. Figure 6.1 shows an example of locomotion.
Figure 6.1: A sample image of the locomotion activity. The primate shown with
the red box is moving, but no other primate has motivated this movement.
4. Chasing: Suppose there are two primates M1 and M2 with position trajectories
$\vec{p}_1$ and $\vec{p}_2$. We can compute their first derivatives $\vec{v}_1$
and $\vec{v}_2$. Without loss of generality, we assume M1 is chasing M2; then
we have the following necessary conditions:
\[
\begin{aligned}
F_1 &: |\vec{v}_1| > Th_1 \\
F_2 &: |\vec{v}_2| > Th_1 \\
F_3 &: \arccos\left(\frac{\vec{v}_1 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_1| \cdot |\vec{p}_2 - \vec{p}_1|}\right) < Th_2 \\
F_4 &: |\vec{p}_2 - \vec{p}_1| < Th_3
\end{aligned}
\]
where $F_i$ denotes the i-th heuristic feature computed to determine the
chasing activity. The intuition behind these constraints is straightforward.
The first two conditions ensure that both primates are moving. The third
condition indicates that the chasing primate is trying to get close to the
chased one. Finally, the last condition requires that the two primates are not
too far from each other, so that the distance between them does not grow much
while one follows the other. Figure 6.2 shows an example of chasing.
Figure 6.2: This series of images, from top right to bottom left, shows the
chasing and avoiding activities that are happening between the two primates
that are shown with red circles.
5. Avoiding: the avoiding heuristics are defined similarly to chasing. Again,
if we assume that primate M1 is chasing primate M2, then primate M2 is avoiding
primate M1. Avoiding can be described by these conditions:
\[
\begin{aligned}
F_1 &: |\vec{v}_1| > Th_1 \\
F_2 &: |\vec{v}_2| > Th_1 \\
F_3 &: \arccos\left(\frac{\vec{v}_2 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_2| \cdot |\vec{p}_2 - \vec{p}_1|}\right) > Th_2
\end{aligned}
\]
As with chasing, the first two conditions ensure that both primates are moving.
The third condition indicates that the avoiding primate is trying to get
farther from the chasing one. Figure 6.3 shows an example of avoiding.
Figure 6.3: This series of images, from top right to bottom left, shows the
avoiding activity for the primate that is specified with the red circle. Note
that this activity is not a result of chasing in this case.
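The heuristic features can be computed from two tracked trajectories as sketched below. This is an illustrative sketch, not the code used in this work: the helper names `velocity` and `pairwise_features` are hypothetical, velocities are approximated by a one-frame finite difference, and the features are returned as plain values taken from the point of view of M1 (speed of M1, speed of M2, angle between the velocity of M1 and the direction from M1 to M2, and distance), to be compared against the thresholds afterwards.

```python
import math


def velocity(track, frame):
    """Finite-difference velocity (pixels per frame) at a frame index >= 1."""
    (x0, y0), (x1, y1) = track[frame - 1], track[frame]
    return (x1 - x0, y1 - y0)


def pairwise_features(p1, v1, p2, v2):
    """Return (F1, F2, F3, F4) for primate M1 with respect to primate M2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    f1 = math.hypot(*v1)          # speed of M1
    f2 = math.hypot(*v2)          # speed of M2
    f4 = math.hypot(dx, dy)       # distance between M1 and M2
    if f1 == 0.0 or f4 == 0.0:
        f3 = float("nan")         # heading angle is undefined
    else:
        cosine = (v1[0] * dx + v1[1] * dy) / (f1 * f4)
        f3 = math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding errors
    return f1, f2, f3, f4


# Hypothetical tracks: M1 moves right toward M2, which also moves right.
track1 = [(0.0, 0.0), (10.0, 0.0)]
track2 = [(100.0, 0.0), (112.0, 0.0)]
f1, f2, f3, f4 = pairwise_features(track1[1], velocity(track1, 1),
                                   track2[1], velocity(track2, 1))
print(f1, f2, f3, f4)  # 10.0 12.0 0.0 102.0
```

An angle near zero means M1 is heading straight at M2, while an angle near pi means M1 is heading directly away from it.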
With the heuristic features above, the primate activity recognition problem
reduces to a binary decision tree. For each primate, we calculated the values
of all the heuristic features in the training set and labeled them with the
corresponding activities. We then used this information to train a binary
decision tree in MATLAB, which finds the optimal cut point for each threshold.
To avoid over-fitting, we used k-fold cross-validation with k = 10. Figure 6.4
shows the decision tree we built for our activity classification.
If F1 < Th1 then primate M1 is stationary.
  else if F2 < Th1 then primate M1 is in locomotion.
    else if F3 > Th2 then primate M1 is avoiding primate M2.
      else if F4 < Th3 then primate M1 is chasing primate M2.
        else primate M1 is in locomotion.
Figure 6.4: This figure shows the decision tree we used to evaluate our test
set. The tree splits on F1 against Th1, F2 against Th1, F3 against Th2, and F4
against Th3, with Th1 = 9.3, Th2 = 0.86, and Th3 = 318. The leaf nodes
(stationary, chasing, avoiding, locomotion) show the decision made based on the
feature values.
The algorithm above shows the decision process for primate M1. It answers the
question: “What is the activity of a given primate M1, given a set of features
Fi?”
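The decision process can be written as a short cascade. The sketch below follows the threshold assignment shown in Figure 6.4 (Th1 for both speeds, Th2 for the angle, Th3 for the distance) and uses the threshold values reported there. The function name `classify_activity` is hypothetical, and the units (pixels per frame, radians, pixels) are an assumption, since the text does not state them.

```python
TH1, TH2, TH3 = 9.3, 0.86, 318.0  # thresholds reported with Figure 6.4


def classify_activity(f1, f2, f3, f4, th1=TH1, th2=TH2, th3=TH3):
    """Label the activity of primate M1 from its heuristic features."""
    if f1 < th1:
        return "stationary"
    if f2 < th1:
        return "locomotion"       # M1 moves but M2 does not
    if f3 > th2:
        return "avoiding"         # M1 is heading away from M2
    if f4 < th3:
        return "chasing"          # M1 is heading toward a nearby M2
    return "locomotion"           # M2 is too far away to be the target


print(classify_activity(2.0, 15.0, 0.1, 50.0))    # stationary
print(classify_activity(10.0, 12.0, 0.0, 102.0))  # chasing
print(classify_activity(10.0, 12.0, 2.5, 102.0))  # avoiding
print(classify_activity(10.0, 12.0, 0.1, 500.0))  # locomotion
```

Because every test is a single comparison against a learned cut point, the classifier is cheap to evaluate on each 50-frame segment and easy to inspect by hand.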
Chapter 7
Experimental Results
7.1 Experiments
For our experiments we used a 2.39 GHz (2 processors) CPU with 48 GB of RAM.
We ran all of our experiments in MATLAB. The detection algorithm was
implemented in C++ with OpenCV and called from MATLAB through MEX files.
Several hours of data were recorded by four cameras in the primates' pen.
However, these data are not annotated, so we had to label them manually for our
experiments. We created training sets for each view, as well as test sets on
which we ran our algorithm and compared its results with the manual labels.
Since labeling primates is very time-consuming and we are not experts in
recognizing all activities, we used two different data sets for our test set:
one with the first group of primates and the other with the second group. In
each of these test sets, we focused on activities that are related to the
relative positions of the primates, as explained in the previous chapter.
The two data sets are named 20121026 (video 1) and 20130619 (video 2). Data set
20121026 is a video of 400 frames in which six primates are observed. Data set
20130619 is a video of 700 frames in which four primates are observed. The
second group of primates (the group of four) was generally much less hostile
than the first group (the group of six) and spent most of the time sitting
around. We looked for portions of video that contained the full number of
primates and chose portions in which the primates were moving and showed
interesting activities. At first, we annotated the primates in each of the four
views and tested our detection algorithm on all four views. However, as we will
see in the detection section (Section 7.2), views 3 and 4 do not carry much
information beyond the combination of views 1 and 2. Furthermore, because of
the structure of the pen and the benches, the primates occluded each other in
many frames, and we had to discard those frames. Therefore, for our tracking
and activity recognition algorithms we focused on view 1 and view 2. Figure 7.1
shows a sample image frame from the four views.
Figure 7.1: Sample image from four views (Camera-1 to Camera-4), with the
visible primates numbered in each view.
7.2 2D Primate Detection
The challenge of detection comes from multiple factors. First, because of the
setup of the environment, the illumination varies across locations and also
changes over time, so we cannot simply rely on background subtraction or
illumination-sensitive features. Second, although the primates wear collars of
different colors, the collars are easily occluded when the primates move and
become indistinguishable when the illumination is low. The main challenge in
detecting primates with HOG features is the variable shape of the primate body.
HOG can successfully detect pedestrians, for instance, because the contours of
all standing human beings look similar and the ratio between width and height
is almost constant. However, the contour of a crouching monkey is quite
different from that of a jumping one.
For each view we trained a separate detector, using about 5000 positive
training samples (primates) and 2000 negative samples (non-primates) per view.
We used the two test videos mentioned before to evaluate the detectors'
performance. The results are shown in Table 7.1 and Table 7.2, where TP stands
for true positive, FP for false positive, and FN for false negative. The PR
curve in Figure 7.3 shows the relation between precision and recall as the SVM
threshold is varied. From Figure 7.3, we can see that view 2 and view 4 perform
better than view 3 and view 1. This is reasonable because in view 2 and view 4
the background is simpler and the primates are usually separated. In view 1,
the background is strongly cluttered, so there are many false positives. In
view 3, the primates on the benches often occlude each other and the
illumination is low on the floor area, so it is difficult to locate the
primates and many false negatives occur. Figure 7.2 illustrates these points.
Figure 7.2: Primate detection in 2D for views 1 to 4. In column one, green
boxes are the ground truth; red boxes are the detection results. Column two
shows the silhouettes extracted by background subtraction over the detected
bounding boxes.
Table 7.1: 2D primate detection results from 4 views, video 1

Cameras   TP    FP    FN    Precision   Recall
View 1    180   79    45    0.70        0.80
View 2    98    29    37    0.77        0.73
View 3    93    9     9     0.91        0.91
View 4    129   11    77    0.92        0.63
Overall   500   128   168   0.80        0.75
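The precision and recall columns follow directly from the counts. The short check below recomputes them from the TP, FP, and FN values of Table 7.1; the function name `precision_recall` is ours, and the recomputed values agree with the table to within 0.01.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts taken from Table 7.1 (video 1).
counts = {"View 1": (180, 79, 45), "View 2": (98, 29, 37),
          "View 3": (93, 9, 9), "View 4": (129, 11, 77)}
for view, (tp, fp, fn) in counts.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{view}: precision={p:.3f} recall={r:.3f}")

# The "Overall" row pools the counts before taking the ratios.
overall = [sum(c[i] for c in counts.values()) for i in range(3)]
print(overall)  # [500, 128, 168]
```

Pooling the counts weights each view by its number of detections, so the overall figures are not the plain average of the per-view ratios.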
Figure 7.3: PR-curve of 2D detection over all four views (cam1 to cam4). The
horizontal axis is recall, TP/(TP+FN), and the vertical axis is precision,
TP/(TP+FP).
Table 7.2: 2D primate detection results from 2 views, video 2

Cameras   TP    FP    FN    Precision   Recall
View 1    295   18    107   0.73        0.94
View 2    141   12    33    0.81        0.92
Overall   430   28    143   0.75        0.93
7.3 Multiview Environment and 3D Primate Visual Hull Results
Camera calibration is a necessary step in building the 3D vision-assisted
observation environment. We used four cameras with different views as
quantitative sensors to recover 3D measurements of the observed scene from 2D
images.
We used the factory specifications of each camera lens installed in our pen:
three Kowa lenses (LM5JC1M, 2/3", focal length 5 mm, f/2.8) for the wall
cameras and an Edmund Optics fisheye lens (NT62-274, focal length 1.8 mm,
f/1.4) for the ceiling camera. Using the MATLAB camera calibration toolbox, we
estimated the intrinsic parameters, including focal length, principal point,
skew coefficient, and distortions, and the extrinsic parameters, including
rotations and translations. Figure 7.4 illustrates our calibration process.
After calibration, we used the detection results to reconstruct the 3D visual
hulls of the primates in the pen. For each view, we had a detection log that
gave us the bounding boxes around the primates. For each frame and each view,
we created a binarized image with the primates as foreground and the rest as
background. Finally, we used these images to create 3D visual hulls of the
primates. However, since the number of cameras is limited and the detection
algorithm only gives a bounding box in each view, a detailed shape cannot be
recovered by the 3D reconstruction, and the final shape of the reconstructed
primate is not accurate. Figure 7.5 illustrates this process for one primate in
a frame.
Figure 7.4: Calibration process for views 1 to 4. A checkerboard of size
16.8′′ × 24′′ is used for calibration. The top figure shows the 3D locations of
each camera.
Figure 7.5: Sample 3D visual hull reconstruction result for views 2, 3, and 4.
Column one shows the original images; column two shows the binary images from
2D primate detection; column three is the visual hull constructed from three
views.
Table 7.3: 2D primate tracking results from 2 views, video 1

Primates            R      Br     B      Y      G      Overall
Accuracy (view 1)   0.89   0.83   0.92   0.62   0.59   0.77
Accuracy (view 2)   0.94   0.89   0.86   0.71   0.75   0.83

Table 7.4: 2D primate tracking results from 2 views, video 2

Primates            Y      R      G      Bl     Overall
Accuracy (view 1)   0.91   0.95   0.63   0.85   0.83
Accuracy (view 2)   0.90   0.93   0.58   0.81   0.81
7.4 2D Primate Tracking
After initializing the tracker, we used the results of our detection algorithm
to obtain the trajectory of each primate. Table 7.3 and Table 7.4 show the
results of the tracking algorithm for the two videos. The abbreviation for each
primate indicates its collar color: R stands for red, Br for brown, B for blue,
G for green, and Bl for black. We can see that the accuracy improves
significantly over the detection stage.
7.5 Primate Activities
To evaluate the performance of the activity recognition algorithm, we used a
video from camera 1: video 20121011 (video 3), view 1, containing 1500 frames.
This video contains frequent interesting activities and is taken from the top
view. The advantage of using the fish-eye camera for activity recognition is
that it lets us see all the primates all the time, which makes it easy to
observe their relations. Video 2 does not contain many interesting activities.
Primates may sometimes disappear due to occlusion and the limited view area,
but this is closer to what happens in real life, and since the primates appear
larger, we can get better results. In both data sets, we divided the videos
temporally into 50-frame segments that serve as the activity samples. For each
segment, the position of each primate was manually labeled frame by frame, and
each primate was given an activity label such as stationary or locomotion.
Table 7.5 shows the results of our algorithm. “GT” is the number of ground
truth labels and “DI” is the number of detected incidents; “TP”, “FP”, and “FN”
stand for true positive, false positive, and false negative. We can see that
the algorithm detects the activities in most cases, with true positive rates
between 0.66 (chasing) and 0.99 (stationary). After evaluating our activity
recognition algorithm on the manual labels, we tested it on the tracking
results obtained from the previous stage.
Activity     GT   DI    TP (TPR = TP/GT)   FP   FN
Chasing      15   10    10 (0.66)          0    4
Avoiding     19   16    14 (0.74)          2    6
Locomotion   50   53    46 (0.92)          7    7
Stationary   96   100   95 (0.99)          5    1
Occlude      0    0     0 (NA)             0    0

Table 7.5: Activity recognition results on view 1, video 3.
The activity detection results on “view 1, video 2” and “view 2, video 2” are
shown in Table 7.6 and Table 7.7.
Activity     GT   DI   TP (TPR = TP/GT)   FP   FN
Chasing      0    0    0 (NA)             0    0
Avoiding     3    1    1 (0.33)           0    3
Locomotion   11   11   9 (0.81)           2    2
Stationary   42   44   42 (1)             2    0
Occlude      0    0    0 (NA)             0    0

Table 7.6: Activity recognition results on view 1, video 2.

Activity     GT   DI   TP (TPR = TP/GT)   FP   FN
Chasing      0    0    0 (NA)             0    0
Avoiding     0    0    0 (NA)             0    0
Locomotion   9    8    7 (0.77)           1    2
Stationary   35   35   34 (0.97)          1    1
Occlude      16   16   15 (0.93)          1    1

Table 7.7: Activity recognition results on view 2, video 2.
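From the tables we can see that the activity recognition algorithm has good
overall accuracy: locomotion and stationary are recognized with true positive
rates of 0.77 or higher, although only 1 of the 3 avoiding events in view 1 of
video 2 is detected.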
7.6 Fusion of Multiple Views
From the training set for data set 2, we see that there are no occlusions in
view 1, whereas in view 2 there are 16 occlusion events. Intuitively this makes
sense, since the top view is the most informative view and generally has the
least occlusion. To fuse the results obtained from different views, we create a
binary decision tree with weighted coefficients. At the first level, if an
event is occluded in one view, we rely on the other views that are not
occluded. If an event is not occluded, the decision is made based on all four
views. The coefficient is highest for view 1, followed by view 2, view 3, and
finally view 4:
\[
\text{Decision} = w_1 v_1 + w_2 v_2 + w_3 v_3 + w_4 v_4
\]
where $w_i$ is the weight of view i and $v_i$ is a 4 × 1 vector of zeros with a
single one that encodes the decision for view i (chasing, avoiding, locomotion,
or stationary). The decision is a 4 × 1 vector whose largest entry gives the
final decision. Since here we are dealing with only two views and the weight of
view 1 is higher than the weight of view 2, the final decision is equal to the
recognition result from view 1.
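The weighted fusion rule can be sketched as a weighted vote over one-hot decision vectors. The weight values below are hypothetical (the text only states that they decrease from view 1 to view 4), the function name `fuse_views` is ours, and an occluded view is represented by `None` so that it contributes nothing to the sum.

```python
ACTIVITIES = ["chasing", "avoiding", "locomotion", "stationary"]


def fuse_views(decisions, weights):
    """Weighted vote over per-view labels; occluded views are None."""
    scores = [0.0] * len(ACTIVITIES)
    for label, weight in zip(decisions, weights):
        if label is None:          # occluded view: rely on the other views
            continue
        scores[ACTIVITIES.index(label)] += weight   # w_i times one-hot v_i
    if max(scores) == 0.0:
        return None                # every view was occluded
    return ACTIVITIES[scores.index(max(scores))]


weights = [0.4, 0.3, 0.2, 0.1]     # hypothetical, decreasing from view 1 to 4
print(fuse_views(["chasing", "locomotion", "locomotion", None], weights))
# locomotion: views 2 and 3 together outweigh view 1
print(fuse_views(["chasing", "locomotion"], [0.6, 0.4]))
# chasing: with two views, the higher-weighted view 1 always wins
```

The second call reproduces the two-view case discussed above, where the fused decision coincides with the result from view 1 unless view 1 is occluded.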
Chapter 8
Discussion and Conclusion
8.1 Conclusion
Recognizing and modeling social behaviors of animals has many applications.
Limited research exists in the area of automatic primate behavior analysis from
videos in open, non-occluded environments where primates are observed. In this
dissertation, we presented a complete framework to recognize some of the
activities of social primates in groups. This framework contains several
modules: experimental setup, data collection, primate detection, primate
tracking, and primate activity recognition. In the experimental setup, a group
of primates wearing collars of different colors around their necks was put in a
pen and observed over a couple of weeks. In the data collection module, we used
four cameras around the pen (three side cameras and one fish-eye camera on the
ceiling) to record the activities of the primates. Using the StreamPix
software, data from the different cameras was stored and synchronized. In the
detection module, we first applied static background subtraction to our test
frames and then equalized their histograms so that they become more similar in
terms of color distribution. We then used color features and HOG features to
detect primates. In the tracking module, we combined the color information of
the collars around the primates' necks with NN and a Kalman filter to get
smooth trajectories for each primate. In the final module, activity
recognition, we used heuristics to define temporal activities such as chasing
or stationary. For each of these modules, we validated our algorithm on test
videos created from our original data. The primate detection accuracy varied
from 60 to 85 percent depending on the view, the activity level of the
primates, and the rate of occlusion events. With the help of tracking, the
accuracy increased by an average of 10 percent across video clips. We were able
to recognize some preliminary behaviors of primates, such as stationary,
avoiding, chasing, and locomotion. Our results are promising and sufficiently
accurate to analyze primate behaviors and social interactions. To the best of
our knowledge, this study is the first to examine primate groups in a closed
and controlled research environment and to propose a complete framework that
tackles the problem of automatically recognizing primate behaviors. Each module
was designed separately from the other modules, which made it possible to
evaluate the performance of each module separately. This property makes our
framework convenient for others who would like to explore this area: one can
focus on any module and use alternative algorithms to improve its performance
without changing the entire framework.
The major challenges we encountered include the massive size of the data, the
lack of labeled data and annotated activities, environmental difficulties such
as illumination variations throughout the day, background changes due to
perturbations (moving objects in the pen such as swings or toys, and humans
passing by), the highly variable shapes and poses of primates, and the low
visibility of the collars, which made it very difficult to identify the
primates.
8.2 Discussion and Future Work
Several directions would be suitable for future work. First, one of our major
challenges was that we did not have any training sets, so we had to spend a lot
of time creating them by labeling primates for both the detection phase and the
activity phase. One approach that has received a lot of attention in the past
couple of years is the use of deep neural networks (DNNs) for detection and
classification, which can work in a semi-supervised or unsupervised manner.
Taking advantage of this property, one might be able to detect primates without
the need for massive labeling. To improve our current algorithm, adding more
training data will help the accuracy of detection. For the experimental setup
module, changing the positions of the cameras can help. Apart from the top view
and one of the side views, the other two side views do not add much
information, since the primates often sit on the benches and occlude each other
from those views. For the activity recognition module, a more accurate shape of
the primates could help us extract more features and recognize more activities.
Facial features are hard to distinguish now, but if the illumination improves
it might be useful to extract some facial features and classify some activities
based on them.
Bibliography
[1] J. M. Rowcliffe and C. Carbone. “Surveys using camera traps: are we looking
to a brighter future?” Animal Conservation, 11(3):185–186, 2008.
[2] A. Herler and A. Stoger, “Vocalizations and associated behaviour of Asian
elephant (Elephas maximus) calves,” Behaviour, 149, 575–599, 2012.
[3] T. Burghardt and J. Calic, “Analysing animal behaviour in wildlife videos
using face detection and tracking,” Vision, Image and Signal Processing, IEE
Proceedings, vol.153, no.3, pp.305,312, 2006.
[4] J. Morrow-Tesch, J. Dailey, and H. Jiang, “A video data base system for
studying animal behavior,” J. Anim. Science, 76(10), 2605–2608, 1998.
[5] D. Reby, R. Andre-Obrecht, A. Galinier, J. Farinas, and B. Cargnelutti,
“Cepstral coefficients and hidden Markov models reveal idiosyncratic voice
characteristics in red deer (Cervus elaphus) stags,” J Acoust Soc Am, 120 (6),
4080-9, 2006.
[6] J. Altmann, “Observational study of behavior: sampling methods,” Behaviour,
49 (3), 227-267, 1974.
[7] D. Maestripieri and K. Wallen, “Affiliative and submissive communication in
rhesus macaques,” Primates, 38 (2), 127-138, 1997.
[8] P. Martin and P. Bateson, “Measuring Behavior: An introductory guide,”
Cambridge University Press, 2007.
[9] M. Dunn, J. Billingsley, and N. Finch, “Future Trends Machine vision
classification of animals,” Proceedings of the 10th Annual Conference on
Mechatronics and Machine Vision in Practice, Mechatronics and Machine Vision,
ed. by Billingsley J (Perth, Australia), 2003.
[10] P. Khorrami, J. Wang, and T. Huang, “Multiple animal species detec-
tion using robust principal component analysis and large displacement optical
flow,” Proceedings of the 21st International Conference on Pattern Recognition
(ICPR), Workshop on Visual Observation and Analysis of Animal and Insect
Behavior, 2012.
[11] http://docs.opencv.org/trunk/doc/tutorials/video/background-subtraction/background-subtraction.html
[12] D. Walther, D. Edgington, and C. Koch, “Detection and tracking of objects
in underwater video,” IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 544–549, 2004.
[13] Z. Khan, R. A. Herman, K. Wallen, and T. Balch, “An outdoor 3-d visual
tracking system for the study of spatial navigation and memory in rhesus mon-
keys,” Behavior research methods, vol. 37, no. 3, pp. 453–463, 2005.
[14] F. de Chaumont, R. D. Coura, P. Serreau, A. Cressant, J. Chabout, S. Gra-
non, and J.-C. Olivo-Marin, “Computerized video analysis of social interactions
in mice,” Nature Methods, vol. 9, no. 4, pp. 410–417, 2012.
[15] T. Balch, F. Dellaert, A. Feldman, A. Guillory, C. L. Isbell, Z. Khan, S. C.
Pratt, A. N. Stein, and H. Wilde, “How multirobot systems research will accel-
erate our understanding of social animal behavior,” Proceedings of the IEEE,
vol. 94, no. 7, pp. 1445–1463, 2006.
[16] D. K. Mellinger, C. W. Clark, “Recognizing transient low-frequency whale
sounds by spectrogram correlation,” J Acoust Soc Am, 107 (6), 3518-29, 2000.
[17] T. Burghardt and J. Calic, “Analysing animal behaviour in wildlife videos
using face detection and tracking,” Vision, Image and Signal Processing,
153(3):305–312, 2006.
[18] L. Gamble, S. Ravela, and K. McGarigal, “Multi-scale features for identifying
individuals in large biological databases: an application of pattern recognition
technology to the marbled salamander Ambystoma opacum,” Journal of Applied
Ecology, 45(1):170–180, 2008.
[19] M. Lahiri, C. Tantipathananandh, R. Warungu, D. I. Rubenstein, and T. Y.
Berger-Wolf, “Biometric animal databases from field photographs: identification
of individual zebra in the wild,” In Proceedings of the 1st ACM International
Conference on Multimedia Retrieval, pages 6:1–6:8, 2011.
[20] D. Walther, D. R. Edgington, and C. Koch, “Detection and tracking of objects
in underwater video,” IEEE International Conference on Computer Vision and
Pattern Recognition, 1:544–549, 2004.
[21] N. Haering, R.J. Qian, and M.I. Sezan, “A semantic event-detection approach
and its application to detecting hunts in wildlife video,” IEEE Transactions on
Circuits and Systems for Video Technology, 10:857–868, 2000.
[22] D. Tweed and A. Calway, “Tracking multiple animals in wildlife footage,”
16th International Conference on Pattern Recognition, 2:24–27, 2002.
[23] D. Ramanan and D. A. Forsyth, “Using temporal coherence to build models of
animals,” 9th International Conference on Computer Vision, 1:338–345, 2003.
[24] M. R. Everingham and A. Zisserman, “Automated person identification in
video,” 3rd International Conference on Image and Video Retrieval, 1:289–298,
2004.
[25] D. Gibson, N. Campbell, and B. Thomas, “Quadruped gait analysis using
sparse motion information,” In International Conference on Image Processing.
IEEE Computer Society, 2003.
[26] S. L. Hannuna, N. W. Campbell, and D. P. Gibson, “Segmenting quadruped
gait patterns from wildlife video,” IEE Visual Information Engineering Confer-
ence, 2005.
[27] J. Calic, N. Campbell, M. Mirmehdi, B. Thomas, R. Laborde, S. Porter, and
N. Canagarajah, “Multimedia management system for intelligent content based
retrieval,” In International Conference on Image and Video Retrieval, pages
601–609. Springer LNCS 3115, 2004.
[28] P. Viola and M. Jones, “Robust real-time object detection,” Second Interna-
tional Workshop on Statistical and Computational Theories of Vision, 2001.
[29] J. Calic, N. Campbell, A. Calway, M. Mirmehdi, T. Burghardt, S. Hannuna,
C. Kong, S. Porter, N. Canagarajah, and D. Bull, “Towards intelligent con-
tent based retrieval of wildlife videos,” 6th International Workshop on Image
Analysis for Multimedia Interactive Services, 2005.
[30] H. S. Parekh, D. G. Thakore, and U. K. Jaliya, “A Survey on Object Detection and
Tracking Methods,” International Journal of Innovative Research in Computer
and Communication Engineering, Vol. 2, Issue 2, 2014.
[31] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, “Background
and foreground modeling using nonparametric kernel density estimation for vi-
sual surveillance,” Proceedings of the IEEE, 90(7):1151–1163, 2002.
[32] F. Yi and I. Moon, “Image Segmentation: A Survey of Graph-cut Methods,”
International Conference on Systems and Informatics, 2012.
[33] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans.
Patt. Analy. Mach. Intell, 22(8), pp.888–905, 2000.
[34] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward fea-
ture space analysis,” IEEE Trans. Patt. Analy. Mach. Intell, 24(5), pp.603–619,
2002.
[35] S. Saravanakumar, A. Vadivel and C.G. Saneem Ahmed, “Human object
tracking in video sequences,” Journal on Image and Video Processing, 2(1),
2011.
[36] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Com-
puting Surveys (CSUR), 38(4):13, 2006.
[37] R. K. Rout, “A Survey on Object Detection and Tracking Algorithms,” De-
partment of Computer Science and Engineering, National Institute of Technology
Rourkela, Rourkela, 2008.
[38] S. Cheung and C. Kamath, “Robust techniques for background subtraction
in urban traffic video,” Proc. SPIE 5308, Visual Communications and Image
Processing, 881, 2004.
[39] K. Srinivasan, K. Porkumaran, and G. Sainarayanan, “Improved background
subtraction techniques for security in video applications,” Anti-counterfeiting,
Security, and Identification in Communication, pp. 114–117, 2009.
[40] C. Kim and J.-N. Hwang, “Fast and automatic video object segmentation
and tracking for content-based applications,” Circuits and Systems for Video
Technology, IEEE Transactions on, 12(2):122–129, 2002.
[41] Z. Chaohui, D. Xiaohui, X. Shuoyu, S. Zheng, and L. Min, “An improved
moving object detection algorithm based on frame difference and edge detection,”
Fourth International Conference on Image and Graphics, pages 519–523, 2007.
[42] C. Stauffer and W. Grimson, “Learning patterns of activity using real time
tracking,” IEEE Trans. Patt. Analy. Mach. Intell, 22(8), pp.747–767, 2000.
[43] R. Bodor, B. Jackson, and N. Papanikolopoulos, “Vision-Based Human Track-
ing and Activity Recognition,” Proc. of the 11th Mediterranean Conf. on Control
and Automation, 1, 2003.
[44] C. Stauffer and W. Grimson, “Learning patterns of activity using real time
tracking,” IEEE Trans. Patt. Analy. Mach. Intell, 22(8), pp.747–767, 2000.
[45] J. Rittscher, J. Kato, S. Joga, and A. Blake, “A probabilistic background
model for tracking,” European Conference on Computer Vision, 2, pp.336–350,
2000.
[46] B. Stenger, V. Ramesh, N. Paragios, F. Coetzee, and J. Buhmann, “Topology
free hidden Markov models: Application to background modeling,” In IEEE
International Conference on Computer Vision, pp. 294–301, 2001.
[47] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns
of motion and appearance,” In IEEE International Conference on Computer
Vision, pp.734–741, 2003.
[48] C. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object
detection,” In IEEE International Conference on Computer Vision, pp.555–562,
1998.
[49] W. T. Lee and H. T. Chen, “Histogram-based interest point detectors,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 1590-1596, 2009.
[50] A. Yilmaz, O. Javed and M. Shah, “Object Tracking: A Survey,” ACM Com-
puting Surveys, 38(4), 2006.
[51] W.-B. Yang, B. Fang, Y.-Y. Tang, Z.-W. Shang, and D.-H. Li, “SIFT features based
object tracking with discrete wavelet transform,” International Conference on
Wavelet Analysis and Pattern Recognition, pp. 380-385, 2009.
[52] S. Lefèvre, E. Bouton, T. Brouard, and N. Vincent, “A new way
to use hidden Markov models for object tracking in video sequences,” Image
Processing, 2003.
[53] Z. Han, Q. Ye, and J. Jiao, “Online feature evaluation for object tracking using
Kalman filter,” Pattern Recognition, 2008.
[54] J. Zhu, Y. Lao, and Y. F. Zheng, “Object Tracking in Structured Environ-
ments for Video Surveillance Applications,” IEEE Transactions On Circuits
And Systems For Video Technology, Vol. 20, No. 2, 2010.
[55] Z. H. Khan, I. Y. Gu, and A.G. Backhouse, “Robust Visual Object Tracking
Using Multi-Mode Anisotropic Mean Shift and Particle Filters,” IEEE Trans-
actions On Circuits And Systems For Video Technology, Vol. 21, No. 1, January
2011.
[56] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human De-
tection,” CVPR, pp. 886–893, 2005.
[57] C.C. Chang and C.J. Lin, “LIBSVM: a library for support vector machines,”
ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no.
3, pp. 27, 2011.
[58] A.K. Chauhan and P. Krishan, “Moving Object Tracking Using Gaussian Mix-
ture Model And Optical Flow,” International Journal of Advanced Research in
Computer Science and Software Engineering, April 2013.
[59] N. Dalal, B. Triggs, and C. Schmid, “Human Detection using Oriented His-
tograms of Flow and Appearance,” 2011.
[60] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, “Recent advances
and trends in visual tracking: A review,” Elsevier Neurocomputing, vol. 74, pp.
3823–3831, 2011.
[61] L. Wu, “Multiview Hockey Tracking with Trajectory Smoothing and Camera
Selection,” 2005.
[62] S. Khire and J. Teizer, “Object Detection and Tracking,” 2008.
[63] L. Vibha, C. Hegde, P. D. Shenoy, K. R. Venugopal, and L. M. Pat-
naik, “Dynamic Object Detection, Tracking and Counting in Video Streams for
Multimedia Mining,” IAENG International Journal of Computer Science, Au-
gust 2008.
[64] S. Johnsen and A. Tews, “Real-Time Object Tracking and Classification Using
a Static Camera,” Proceedings of the IEEE ICRA 2009 Workshop on People
Detection and Tracking, Kobe, Japan, May 2009.
[65] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, “Recent advances
and trends in visual tracking: A review,” Elsevier Neurocomputing, vol. 74, pp.
3823–3831, 2011.
[66] W.L. Lu and J.J. Little, “Simultaneous Tracking and Action Recognition
using the PCA-HOG Descriptor,” Proceedings of the 3rd Canadian Conference
on Computer and Robot Vision (CRV), 2006.
[67] J. Pan, B. Hu, and J.Q. Zhang, “Robust and Accurate Object Tracking Under
Various Types of Occlusions,” IEEE Transactions on Circuits and Systems for
Video Technology, Vol. 18, No. 2, February 2008.
[68] Y. Zhong, A.K. Jain, and M.P. Dubuisson-Jolly, “Object Tracking Using De-
formable Templates,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 22, no. 5, May 2000.
[69] X. Liu, L. Lin, S. Yan, H. Jin, and W. Jiang, “Adaptive Object Tracking by
Learning Hybrid Template Online,” IEEE Transactions On Circuits And Sys-
tems For Video Technology, Vol. 21, no. 11, November 2011.
[70] Y. Zheng and Y. Meng, “Object Detection And Tracking Using Bayes-
Constrained Particle Swarm Optimization,” ISBN, pp. 978-992, 2007.
[71] P. Viola, M. Jones, and D. Snow, “Detecting Pedestrians Using Patterns of Motion
and Appearance,” Proceedings of the International Conference on Computer
Vision (ICCV), Nice, France, October 2003.
[72] I. Strid and K. Walentin, “Block Kalman Filtering for Large-Scale DSGE
Models,” Computational Economics (Springer), vol. 33, no. 3, pp. 277–304, April
2009.
[73] A. Kelly, “A 3D state space formulation of a navigation Kalman filter for
autonomous vehicles,” 1994.
[74] H. Tanizaki, “Non-Gaussian state-space modeling of nonstationary time se-
ries,” J. Amer. Statist. Assoc., 82, pp. 1032–1063, 1987.
[75] M. Isard and A. Blake, “Condensation - conditional density propagation for
visual tracking,” Int. J. Comput. Vision, vol. 29, no. 1, pp. 5–28, 1998.
[76] R. Polana and R. C. Nelson, “Detecting activities,” CVPR, 1993.
[77] A. F. Bobick and J. W. Davis, “The recognition of human movement using
temporal templates,” IEEE PAMI, 23, pp. 257–267, 2001.
[78] M. Black and Y. Yacoob, “Tracking and recognizing rigid and non-rigid facial
motions using local parametric models of image motion,” ICCV, pp. 374–381,
1995.
[79] D. Ramanan and D. A. Forsyth, “Automatic annotation of everyday move-
ments,” NIPS, 2003.
[80] I. Junejo, E. Dexter, I. Laptev, and P. Perez, “Cross-view action recognition
from temporal self-similarities,” ECCV, vol. 2, pp. 293–306, Marseille, France,
2008.
[81] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic
human actions from movies,” CVPR, pp. 1–8, Anchorage, Alaska, 2008.
[82] J. C. Niebles and L. Fei-Fei, “A hierarchical model of shape and ap-
pearance for human action classification,” CVPR, 2007.
[83] S. Savarese, A. D. Pozo, J. C. Niebles, and L. Fei-Fei, “Spatial-temporal cor-
relations for unsupervised action classification,” Motion and Video Computing,
2008.
[84] C. Buehler, S. J. Gortler, M.F. Cohen, and L. McMillan, “Minimal surfaces
for stereo,” ECCV (3), pp. 885–899, 2002.
[85] J. Carranza, C. Theobalt, M.A. Magnor, and H. P. Seidel, “Free-viewpoint
video of human actors,” ACM SIGGRAPH 2003, pp. 569–577, New York, USA,
2003.
[86] G.K.M. Cheung, S. Baker, and T. Kanade, “Shape-from-silhouette of artic-
ulated objects and its use for human body kinematics estimation and motion
capture,” CVPR, pp. 77–84, 2003.
[87] J.S. Franco, M. Lapierre, and E. Boyer, “Exact polyhedral visual hulls,” In
Proceedings of the Fourteenth British Machine Vision Conference, pp. 329–338,
Norwich, UK, 2003.
[88] J.S. Franco, M. Lapierre, and E. Boyer, “Visual shapes of silhouette sets,” In
Proceedings of the 3rd International Symposium on 3D Data Processing, Visu-
alization and Transmission, Chapel Hill, USA, 2006.
[89] W. Matusik, C. Buehler, R. Raskar, S.J. Gortler, and L. McMillan, “Image-
based visual hulls,” Computer Graphics Proceedings ACM SIGGRAPH, pp.
369–374, ed. by K. Akeley, 2000.
[90] P. Sand, L. McMillan, and J. Popović, “Continuous capture of skin deforma-
tion,” ACM SIGGRAPH, ACM Press, no. 3, pp. 578–586, New York, USA,
2003.
[91] K. Man and G. Cheung, “Visual Hull Construction, Alignment and Refine-
ment for Human Kinematic Modeling, Motion Tracking and Rendering,” PhD
thesis, Carnegie Mellon University, 2003.
[92] A. Laurentini, “The visual hull: A new tool for contour based image under-
standing,” Proc. 7th Scandinavian Conf. Image Analysis, pp. 993-1002, 1991.
[93] A. Laurentini, “The Visual Hull Concept for Silhouette-Based Image Under-
standing,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 150–162, 1994.
[94] K. Man and G. Cheung, “Visual Hull Alignment and Refinement Across
Time: A 3D Reconstruction Algorithm Combining Shape-From-Silhouette with
Stereo,” CVPR, no. 2, pp. 375–382, 2003.
[95] K. Kolev, M. Klodt, T. Brox, S. Esedoglu, and D. Cremers, “Continuous
global optimization in multiview 3D reconstruction,” International Conference
on Energy Minimization Methods in Computer Vision and Pattern Recognition,
2007.
[96] R. Szeliski, “Rapid octree construction from image sequences,” Vision, Graph-
ics and Image Processing: Image Understanding, vol. 58, pp. 23-32, 1993.
[97] H. Noborio, S. Fukuda, and S. Arimoto, “Construction of the Octree Approx-
imating a Three-Dimensional Object by Using Multiple Views,” IEEE Trans.
Pattern Anal. Mach. Intell., pp. 769–782, 1988.
[98] M. Potmesil, “Generating octree models of 3D objects from their silhouettes
in a sequence of images,” Comput. Vision Graph. Image Process., pp. 1–29, 1987.
[99] N. Ahuja and J. Veenstra, “Generating Octrees from Object Silhouettes in
Orthographic Views,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 137–149,
1989.
[100] A. Laurentini, “The visual hull concept for silhouette based image under-
standing,” Pattern Analysis and Machine Intelligence, IEEE Transactions, vol.
16, no. 2, pp. 150–162, 1994.
[101] J.S. Franco and E. Boyer, “Exact polyhedral visual hulls,” British Machine
Vision Conference (BMVC’03), vol. 1, pp. 329–338, 2003.
[102] H. Noborio, S. Fukuda, and S. Arimoto, “Construction of the octree approx-
imating a three-dimensional object by using multiple views,” Pattern Analysis
and Machine Intelligence, IEEE Transactions, vol. 10, no. 6, pp. 769–782, 1988.
[103] K. Forbes, A. Voigt, and N. Bodika, “Visual hulls from single uncalibrated
snapshots using two planar mirrors,” Proc. 15th Annual Symposium of the
Pattern Recognition Association of South Africa, 2004.
[104] F. Yi and I. Moon, “Image Segmentation: A Survey of Graph-cut Methods,”
International Conference on Systems and Informatics (ICSAI), 2012.
[105] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models,”
International Journal of Computer Vision, vol. 1, no. 4, p. 321, 1988.
[106] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proc. of the IEEE, 1998.
[107] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with
deep convolutional neural networks,” NIPS, 2012.
[108] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “ImageNet
Large Scale Visual Recognition Competition,” 2012.
[109] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A large-scale hierarchical image database,” CVPR, 2009.
[110] N.J. Gordon, D.J. Salmond, and A.F.M. Smith, “Novel approach to
nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F, vol.
140, pp. 107–113, 1993.