A homography-based multiple-camera person-tracking algorithm
Rochester Institute of Technology
RIT Scholar Works
Theses
6-1-2008
A homography-based multiple-camera person-tracking algorithm
Matthew Robert Turk
Follow this and additional works at: https://scholarworks.rit.edu/theses
Recommended Citation
Turk, Matthew Robert, "A homography-based multiple-camera person-tracking algorithm" (2008). Thesis. Rochester Institute of Technology. Accessed from
This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].
A Homography-Based Multiple-Camera Person-Tracking Algorithm
by
Matthew Robert Turk
B.Eng. (Mech.) Royal Military College of Canada, 2002
A thesis submitted in partial fulfillment of the
requirements for the degree of Master of Science
in the Chester F. Carlson Center for Imaging Science
Rochester Institute of Technology
12 June 2008
Signature of the Author
Accepted by Coordinator, M.S. Degree Program Date
CHESTER F. CARLSON CENTER FOR IMAGING SCIENCE
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK, UNITED STATES OF AMERICA
CERTIFICATE OF APPROVAL
M.S. DEGREE THESIS
The M.S. Degree Thesis of Matthew Robert Turk has been examined and approved by the thesis committee as satisfactory for the
thesis required for the M.S. degree in Imaging Science
Dr. Eli Saber, Thesis Advisor
Dr. Harvey Rhody
Dr. Sohail Dianat
Date
THESIS RELEASE PERMISSION
ROCHESTER INSTITUTE OF TECHNOLOGY
CHESTER F. CARLSON CENTER FOR IMAGING SCIENCE
Title of Thesis:
A Homography-Based Multiple-Camera Person-Tracking Algorithm
I, Matthew Robert Turk, hereby grant permission to the Wallace
Memorial Library of RIT to reproduce my thesis in whole or in part.
Any reproduction shall not be for commercial use or profit.
Signature Date
A Homography-Based Multiple-Camera Person-Tracking Algorithm
by
Matthew Robert Turk
Submitted to the Chester F. Carlson Center for Imaging Science
in partial fulfillment of the requirements for the Master of Science Degree
at the Rochester Institute of Technology
Abstract
It is easy to install multiple inexpensive video surveillance cameras around an area. However, multiple-camera tracking is still a developing field. Surveillance products that can be produced with multiple video cameras include camera cueing, wide-area traffic analysis, tracking in the presence of occlusions, and tracking with in-scene entrances.
All of these products require solving the consistent labelling problem. This means giving the same meta-target tracking label to all projections of a real-world target in the various cameras.
This thesis covers the implementation and testing of a multiple-camera people-tracking algorithm. First, a shape-matching single-camera tracking algorithm was partially re-implemented so that it worked on test videos. The outputs of the single-camera trackers are the inputs of the multiple-camera tracker. The algorithm finds the feet feature of each target: a pixel corresponding to a point on a ground plane directly below the target. Field of view lines are found and used to create initial meta-target associations. Meta-targets then drop a series of markers as they move, and from these a homography is calculated. The homography-based tracker then refines the list of meta-targets and creates new meta-targets as required.
Testing shows that the algorithm solves the consistent labelling problem and requires few edge events as part of the learning process. The homography-based matcher was shown to completely overcome partial and full target occlusions in one of a pair of cameras.
Acknowledgements
• The Canadian Air Force made this work possible through the Sponsored Post-Graduate Training Program.
• Professor Warren Carithers suggested the use of a function used in the Generator program, which was used for testing the algorithm.
• Mr. Sreenath Rao Vantaram supervised the segmentation of all real-world video sequences.
• Finally, Ms. Jacqueline Speir helped me to clarify and expand many of the concepts discussed herein.
Contents
1 Introduction
1.1 Motivating example
1.2 Scope – goals
1.3 Scope – limitations
1.4 Contributions to field
1.4.1 Specific contributions
2 Background
2.1 Single camera tracking
2.2 Multiple camera tracking
2.2.1 Disjoint cameras
2.2.2 Pure feature matching
2.2.3 Calibrated and stereo cameras
2.2.4 Un-calibrated overlapping cameras
3 Proposed method
3.1 Overview
3.1.1 Notation
3.2 Algorithms
3.2.1 Background subtraction
3.2.2 Single-camera tracking
3.2.3 Field of view line determination
3.2.4 Determining feet locations
3.2.5 Dropping markers
3.2.6 Calculation of a homography
3.2.7 Multiple-camera tracking with a homography
3.3 Testing and validation
3.3.1 Testing the feet feature finder
3.3.2 Testing the homography-based tracker
3.4 Alternative methods
3.4.1 Improving this method
3.4.2 The fundamental matrix
4 Implementation details
4.1 The Generator
4.2 Background subtraction
4.3 Single-camera tracking
4.4 Finding FOV lines
4.5 Dropping markers
4.6 Calculation of a homography
4.7 Homography-based multi-camera tracking
4.7.1 Thresholds
4.7.2 Speed
5 Results and discussion
5.1 Feet feature finder
5.1.1 Comparing to hand-found points
5.1.2 Comparing meta-target creation distances
5.2 Homography
5.2.1 Markers
5.2.2 Numerical tests with truth points
5.2.3 Visual tests
5.3 Occlusions
6 Conclusions and future work
6.1 Conclusions
6.2 Future work
6.2.1 Specific implementation ideas
6.2.2 Computer vision
Chapter 1
Introduction
Video surveillance is a difficult task. Based on the field of computer
vision, itself only a few decades old, the automatic processing of video
feeds often requires specialized encoding and decoding hardware, fast
digital signal processors, and large amounts of storage media.
The need to process multiple video streams is becoming more
important. Video camera prices continue to drop, with decent “webcams”
available for less than twenty dollars. Installation is similarly inexpensive
and easy. Furthermore, social factors are assisting the spread of
surveillance cameras. City police forces, such as those in London and Boston,
and private businesses, such as shopping malls and airports, are using
recent terrorism to justify increasing video surveillance. In most major
cities it is now easy to spot video cameras. Some installations even boast
low-light capabilities using cameras sensitive to near- or thermal-infrared
wavelengths.
Despite the increasing prevalence of multiple camera surveillance
installations, few algorithms extract additional, meaningful multiple-camera
tracking information. Chapter 2 will cover a few of the algorithms that
track moving objects in a single video stream. Solutions to the
single-camera tracking problem are fairly well developed. However,
multiple-camera surveillance systems demand algorithms that can process
multiple video streams.
1.1 Motivating example
As a motivating example, consider the overhead view of a surveilled area
as seen in Figure 1.1. Cameras A and B are disjoint – they look at
different areas of the world and do not overlap. However, cameras A and C
partially overlap, as do cameras B and C. An object in either of the darker
overlapping areas will be visible to two cameras simultaneously.
Now examine the output of the three cameras. There are two people in
the world. However, between the three cameras they have been given four
different labels: A-8, B-2, C-4, and C-5. Given these object labels, the most
important piece of information that we could find is which labels refer to
the same real-world objects. This is the consistent labelling problem.
Figure 1.1: Three cameras look at the same general area in this overhead view. Across the three cameras, two targets are given four tracking labels.
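One way to picture a solution to the consistent labelling problem is as a grouping of per-camera labels into meta-targets. The following is a minimal sketch under stated assumptions: the function name `group_labels` is illustrative, and the pairwise matches are hypothetical, since the text does not say which of the four labels belong to the same person. In a working system the matches would come from the multiple-camera tracker.

```python
from collections import defaultdict

def group_labels(labels, matches):
    """Group per-camera labels into meta-targets with a small union-find."""
    parent = {label: label for label in labels}

    def find(label):
        # Follow parent links up to the representative of the group.
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for label in labels:
        groups[find(label)].add(label)
    return sorted(sorted(group) for group in groups.values())

labels = ["A-8", "B-2", "C-4", "C-5"]
# Hypothetical pairwise matches, chosen only for illustration.
matches = [("A-8", "C-4"), ("B-2", "C-5")]
meta_targets = group_labels(labels, matches)
print(meta_targets)
```

With these assumed matches the four labels collapse into two groups, one per real-world person.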
Humans are fairly good at solving the consistent labelling problem,
up to a certain point. Human surveillance operators can keep a mental
model of the locations of the cameras in the world, and can often match
features from camera to camera even if different camera modalities are
used (e.g. one RGB camera and one thermal infrared camera).
Additionally, humans are much better than computers at matching objects, even
if the objects are observed from widely disparate viewpoints and
consequently have different appearances. However, using humans to analyse
multiple video streams does not scale well since a person can only look
at one screen at a time, even though there may be many relevant views
of a scene. If multiple surveillance operators are used, each one
responsible for a particular area, then the system would require the development
of procedures for control, target tracking, target handoff, and possible
operator inattention.
Figure 1.2: Algorithms that effectively use multiple cameras can produce extra useful information. (a) Recording the output of single-camera trackers is common, but does not take advantage of having multiple cameras. (b) Multiple cameras can give additional information and increase surveillance potential.
An important task for a surveillance system is being able to track a
target of interest as it moves through a surveilled area. Many cameras
might have the target in view at any given time, but the conscious effort
required of a human to determine this set of cameras is not trivial, even
if only a handful of cameras are used. Additionally, the set of cameras
viewing the target changes constantly as the target moves. If the
consistent labelling problem is solved, and the computer knows whether a
target should appear in each camera’s field of view, then the computer
can automatically cue the correct set of cameras that show the target.
Figure 1.2 illustrates the difference between algorithms that treat
multiple cameras as a group of single cameras, and those that treat the
cameras as something more. The algorithms that fall into Figure 1.2(a) do not
care that multiple cameras might view the same part of the world. The
second class of algorithms, shown in Figure 1.2(b), takes the outputs of
single camera trackers and combines them. New surveillance capabilities
are created. Some examples of the capabilities created by these
multi-camera-aware algorithms are described in Section 1.4.
1.2 Scope – goals
This thesis covers the development, implementation, and testing of a
multiple camera surveillance algorithm. The algorithm shall have the
following characteristics:
1. Independent of camera extrinsic parameters, i.e. location and
orientation. The algorithm should smoothly handle widely disparate
views of the world.
2. Independent of camera intrinsic parameters, i.e. focal length, pixel
skew, and location of principal point. Different cameras are
available on the market – the algorithm should be able to handle multiple
focal lengths, differences in resolution, and the like.
3. Independent of camera modality. The algorithm should be able
to handle the output of any single-camera tracker. The algorithm
should not depend on whether the underlying camera hardware is
RGB, near-infrared, thermal infrared, or some other image-forming
technology.
4. Solve the consistent labelling problem. One real-world target should
be linked to one object label in each of the cameras in which that
target is visible.
5. Robust to target occlusions and in-scene entrances. If a target enters
the surveilled area in the middle of a scene, say, through a door,
then the algorithm should correctly solve the consistent labelling
problem. Similarly, if one target splits into two, as when two close
people take different paths, the algorithm should identify and
correctly label the two targets.
6. Simple to set up. No camera calibration should be required.
Training, if needed, should take as little time as possible and should be
done with normal in-scene traffic. Training should be automatic and
should not require operator intervention.
7. Capable of camera cueing. The algorithm should be able to
determine which cameras should be able to see a given target.
1.3 Scope – limitations
The scope of the algorithm shall be limited as follows:
1. The algorithm shall be used for tracking walking people. Vehicles,
animals, and other classes of moving objects are not included in the
scope of this thesis.
2. Pairs of cameras to be processed will have at least partially
overlapping fields of view. This requires the operator to make an initial
judgement when installing the hardware and initializing the
algorithm: to decide which cameras see the same parts of the world.
3. The cameras shall be static. Once installed, both the intrinsic and
extrinsic parameters of the camera shall be fixed. This means that a
camera cannot be mounted on a pan-tilt turret, or if it is, the turret
must not move.
4. The output images of the video cameras will be of a practical size.
The algorithm will not include single-pixel detectors (e.g. infrared
motion detectors, beam-breaking light detectors). This limitation is
necessary to ensure that single-camera tracking is possible without
major changes to the chosen algorithm.
5. Frame rates will be sufficient to allow the single-camera tracking
algorithm to work properly.
6. The cameras shall approximate regular central-projection cameras
with basic pinhole optics. Cameras with extremely wide fields of
view – fisheye lenses – or significant un-corrected distortions will
not be used.
7. Most importantly, the targets shall be walking on a ground plane.
The overlapping area between any two cameras shall not have
significant deviations from a planar surface. Code to deal with hilly
areas or steps shall not be included in this algorithm.
8. The cameras shall not be located on the ground plane. This prevents
a degenerate condition in the scene geometry, as shall be shown
later.
1.4 Contributions to field
As mentioned above, multiple-camera video processing is a relatively
new domain. Algorithms are constantly under development, and there
are many problems yet to solve. If the algorithm developed in this
document satisfies the goals and limitations described in Sections 1.2 and 1.3,
then the following scenarios will be made possible:
• Automatic cueing: A target of interest walks into a surveilled area.
The operator marks the target in one camera. As the target moves
throughout the area, the computer, driven by the algorithm,
automatically shows all the video feeds in which the target is visible.
The target could be marked by a consistently-coloured “halo” or
a bounding box. This lets the operator concentrate on the target’s
actions, rather than on its position in the world relative to every
camera.
• Path analysis: An area is placed under surveillance. Rather than
trying to manually match the paths of people from camera to
camera, the algorithm automatically links the paths taken by the people
moving through the area. This enables flow analysis to be carried
out more quickly and effectively.
• Tracking with occlusion recovery: To fool many current tracking
algorithms, move behind an occlusion (e.g. a building support
pillar or a tall accomplice), change your speed, and then move out of
the occlusion. Occlusion breaks many current tracking algorithms,
and most others break if the speed change is significant. So long
as the target remains visible in at least one camera, the algorithm
discussed in the forthcoming chapters shall recover from occlusions
and shall re-establish the consistent tracking label.
• In-scene entrances: The algorithm shall be able to create consistent
tracking labels in scenes where people can enter in the middle of a
frame, such as through an elevator or a door.
1.4.1 Specific contributions
This thesis provides these specific contributions to the field of video
processing:
• A method to find the feet feature of a target even when the camera
is significantly tilted,
• A method to use target motion to find a plane-induced homography
even when entrances and exits are spatially limited, and
• A method with specific rules describing how to use a plane-induced
homography to create and maintain target associations across
multiple cameras.
The underlying theory is discussed in Chapter 3, with implementation
details in Chapter 4. Test results are shown in Chapter 5.
Chapter 2
Background
2.1 Single camera tracking
Multiple-camera tracking can be done in a variety of ways, but most
methods rely on the single camera tracking problem being previously
solved. Indeed, except for those multiple-camera tracking papers that
aim to simultaneously solve the single- and multiple-camera tracking
problems, most papers include a statement to the effect of “we assume
that the single-camera tracking problem has been solved.” However, the
present work requires the problem to be actually solved, not just assumed
solved. This is because a working system is to be implemented for this
thesis, not just conceived.
A recent survey of single-camera tracking algorithms is found in [1].
Plainly, a significant amount of thought has been applied to tracking
objects in a single camera’s video stream. When successful, the output is
a series of frames with regions identified by unique identifiers. Two
image regions in frames t and t + 1 will have the same identifier i if the
algorithm has determined that the regions are the projections of the same
real-world target object. The single-camera tracking problem is how to
determine which regions in frame t + 1 to mark with which identifiers.
The survey shall not be replicated here. Instead, we describe a few
characteristic approaches to single camera tracking.
One main type of single-camera tracker calculates a feature vector
based on the target, and looks for regions in the next frame that
produce a similar feature vector. The idea of kernel-based object tracking is
introduced in [2]. In this algorithm a feature vector is calculated from
the colour probability density function of an elliptical target region. It
is also possible to use a feature vector based on the target’s texture
information, edge map, or some combination of those and other features,
although those were not tested. In the next frame a number of pdfs are
calculated for candidate regions around the target’s original location. The
candidate regions might have a different scale from the original target, to
take into account looming or shrinking targets. A distance metric is
calculated between the original pdf and each of the candidate pdfs. In this
case the metric is based on the Bhattacharyya coefficient. The candidate
target with the smallest distance to the original target is declared to be the
projection of the same real-world target, and the appropriate identifier is
assigned. It is noted in [2] that other prediction elements can be
incorporated into the algorithm, such as a Kalman filter to better predict the
target’s new location. Using other prediction elements means that fewer
candidate pdfs must be calculated.
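The candidate-selection step described above can be sketched as follows. This is a simplified illustration rather than the algorithm of [2]: plain normalised histograms stand in for the kernel-weighted colour pdfs, the histogram values are invented, and the distance is taken in the common form $d = \sqrt{1 - \rho}$ with Bhattacharyya coefficient $\rho = \sum_u \sqrt{p_u q_u}$. The function names are assumptions made for this example.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Return sqrt(1 - rho), where rho = sum(sqrt(p_u * q_u))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalise so each histogram sums to one
    q = q / q.sum()
    rho = np.sum(np.sqrt(p * q))
    # Clamp to guard against rho exceeding 1 by rounding error.
    return float(np.sqrt(max(0.0, 1.0 - rho)))

def best_candidate(target_hist, candidate_hists):
    """Return the index of the closest candidate and all distances."""
    distances = [bhattacharyya_distance(target_hist, c) for c in candidate_hists]
    return int(np.argmin(distances)), distances

# Invented four-bin histograms, for illustration only.
target = [4, 3, 2, 1]
candidates = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 2, 3, 3]]
index, distances = best_candidate(target, candidates)
print(index, distances)
```

A predictor such as a Kalman filter would shorten the `candidates` list by restricting where candidate regions are drawn from.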
Another type of tracker looks for regions in frame t + 1 that have a
similar shape to the target in frame t. A triangular mesh is laid over a
target in [3]. The texture of the object (i.e. its colour appearance) is then
warped based on deformations of the underlying mesh. The warped
texture is then compared to regions in frame t + 1 with a metric that
incorporates both appearance and mesh deformation energy, and the closest
match is called the same object.
Moving farther into shape matching methods, [4] discusses an active
shape model for object tracking. Their algorithm uses a principal
components analysis to find important points on the boundary of an object in
either grayscale or a multi-colour space. The points are then transformed
using an affine projection into the coordinate space of frame t + 1, and
compared with the local textural information. If the transformed points
are close to edges then the model is valid. If points are projected far
from edges then the model is less likely to be correct. Eventually the best
model is selected, yielding the new shape of the object in frame t + 1.
Another method of shape matching, used in this thesis, is introduced
in [5]. The method first uses background subtraction to identify
foreground blobs. For basic motion the algorithm assumes that the
projections of each target do not move very far from frame to frame, so they
contain overlapping points. If a target blob in frame t and a candidate
blob in frame t + 1 overlap each other but not other blobs, then they are
identified as the same target. If two target blobs merge into one blob in
frame t + 1 due to a partial occlusion then the algorithm uses a shape
matching technique introduced in [6]. B-splines are fitted to each
original target blob in frame t and to groups of regions in frame t + 1. The
regions are found using a segmentation algorithm shown in [7]. The
distances between the B-spline control points of the target blob and each set
of candidate regions are compared, then the closest-matching regions are
given the same target identifier.
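The basic-motion rule described above (a target blob and a candidate blob that overlap each other but no other blobs are the same target) can be sketched with boolean masks. This is an illustrative sketch only, not the implementation of [5]: the function name, the mask representation, and the toy blob positions are assumptions. Ambiguous cases, such as two targets merging into one blob, are left unlabelled here, since the text hands those to the B-spline shape matcher.

```python
import numpy as np

def associate_by_overlap(prev_blobs, next_blobs):
    """Return {prev_id: next_index} for unambiguous one-to-one overlaps.

    prev_blobs maps target identifiers to boolean masks from frame t;
    next_blobs is a list of boolean masks from frame t + 1.
    """
    prev_ids = list(prev_blobs)
    # overlaps[i, j] is True when previous blob i shares a pixel with next blob j.
    overlaps = np.array([[bool(np.any(prev_blobs[pid] & nb)) for nb in next_blobs]
                         for pid in prev_ids])
    result = {}
    for row, pid in enumerate(prev_ids):
        cols = np.flatnonzero(overlaps[row])
        # Accept only if this target touches one candidate and that
        # candidate touches no other target.
        if len(cols) == 1 and overlaps[:, cols[0]].sum() == 1:
            result[pid] = int(cols[0])
    return result

def mask(shape, rows, cols):
    m = np.zeros(shape, dtype=bool)
    m[rows, cols] = True
    return m

shape = (6, 12)
prev_blobs = {
    7: mask(shape, slice(1, 4), slice(0, 3)),
    8: mask(shape, slice(1, 4), slice(6, 8)),
    9: mask(shape, slice(1, 4), slice(9, 11)),
}
next_blobs = [
    mask(shape, slice(1, 4), slice(1, 4)),   # overlaps only target 7
    mask(shape, slice(1, 4), slice(7, 10)),  # overlaps targets 8 and 9 (a merge)
]
labels = associate_by_overlap(prev_blobs, next_blobs)
print(labels)
```

In this toy frame pair only target 7 is carried forward by overlap; targets 8 and 9 have merged and would need the shape-matching step.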
2.2 Multiple camera tracking
When dealing with more than one camera, it is possible that any given
pair of cameras may observe the same area or different areas in the world.
The latter case is discussed first.
2.2.1 Disjoint cameras
Algorithms that attempt to associate targets across sets of disjoint
cameras use a variety of methods. Some, such as [8], use the entry and exit
times from each camera to learn the structure of the camera network.
From this structure they are able to determine which pairs of cameras
view the same regions of the world, and which are disjoint. The system
can match targets between disjoint cameras so long as people take a
consistent amount of time to pass between cameras. A similar approach to
pass target tracks between cameras is used in [9], even if the targets are
invisible some of the time (e.g. between cameras or occluded).
In [10], a set of randomly sampled target point correspondences is
used to create a homography between pairs of cameras. As training data
is accumulated they are able to improve the homography in cases where
the cameras actually overlap, and discard relationships between disjoint
cameras. Their system is thus capable of projecting the location of a
target in one camera to its location in another camera, assuming that
both cameras observe the target.
Three papers by Javed et al. take a somewhat more appearance-based
approach to disjoint cameras [11,12,13]. The algorithm finds a brightness
transfer function that projects a target feature vector from one camera
to another. Simultaneously, a Parzen window technique is used to
estimate the spatio-temporal relationship between entry and exit events in
the various cameras. The projected appearance of a target is then matched
from camera to camera using a Bhattacharyya-distance based metric, and
the results combined with the spatio-temporal matcher to create target
matches.
The present research does not deal with pairs of disjoint cameras.
Rather, we assume that the system has been set to only find matches
between cameras that have overlapping fields of view. If this level of
installation knowledge is unavailable, one of the aforementioned
techniques could be used to determine which cameras are disjoint and which
overlap.
2.2.2 Pure feature matching
Multi-camera feature matching algorithms assume that some features of
a target will remain invariant from camera to camera. A very basic colour
matching method is used to identify projections of the same real-world
target in different cameras in [14]. A matching algorithm on the colour
histogram taken along the vertical axis of an object is used in [15].
The method of [16] attempts to reconstruct 3d scene information by
creating associations between posed surface elements (surfels) using
texture matching. This method has not seen much attention because of the
relative complexity of computing scene flow – the 3d equivalent of optical
flow.
2.2.3 Calibrated and stereo cameras
If calibration data is available then it is possible to calculate target
features and use those features to match targets between cameras, even if
they are disjoint. Depending on the algorithm, calibration usually means
finding the camera’s internal and external parameters (focal length, field
of view, location, pose, etc.). For instance, [17] uses the height of targets
as a feature that will not change significantly as a target moves from one
camera to another. Both the target’s height and the colours of selected
parts of a target are used to match objects between cameras in [18].
The principal axis of a target is the main feature used to match
between cameras in [19]. The principal axis is transformed from camera to
camera using a homography found by manually identifying
corresponding ground point pairs in both cameras.
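To make concrete how a set of corresponding ground point pairs yields a homography, the following sketch uses the basic direct linear transform (DLT). This is a swapped-in textbook technique, not necessarily the estimator used in [19] or elsewhere in this thesis. It omits coordinate normalisation and outlier rejection, and the matrix `h_true` and the point coordinates are hypothetical values used only to synthesise test data.

```python
import numpy as np

def project(h, points):
    """Map N x 2 image points through the 3 x 3 homography h."""
    pts = np.asarray(points, dtype=float)
    homog = np.column_stack([pts, np.ones(len(pts))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

def estimate_homography(points_a, points_b):
    """Estimate H such that points_b ~ H @ points_a (basic DLT)."""
    points_a = np.asarray(points_a, dtype=float)
    points_b = np.asarray(points_b, dtype=float)
    if len(points_a) < 4 or len(points_a) != len(points_b):
        raise ValueError("need at least four matching point pairs")
    rows = []
    for (x, y), (u, v) in zip(points_a, points_b):
        # Two linear constraints on the nine entries of H per point pair.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the scale; assumes h[2, 2] is not zero

# Hypothetical ground-plane homography, used only to generate matching points.
h_true = np.array([[1.2, 0.1, 5.0],
                   [-0.2, 0.9, 3.0],
                   [0.001, 0.002, 1.0]])
points_a = [(0, 0), (100, 0), (100, 50), (0, 50), (30, 20)]
points_b = project(h_true, points_a)
h_est = estimate_homography(points_a, points_b)
print(np.round(h_est, 4))
```

At least four point pairs are needed, with no three of the chosen four collinear; additional pairs are absorbed in a least-squares sense by the SVD.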
In [20], a system is described that is used to track people and vehicles
conducting servicing tasks on aircraft. The system combines
appearance-based methods and the epipolar geometry of calibrated cameras to find
the 3d locations of targets in the scene, and match them accordingly.
Figure 2.1 shows the epipolar geometry relation between two views.
Essentially, a 2d point in one camera can be projected to a one-dimensional line
in the other camera, upon which the image of world-point X will lie.
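The point-to-line relation can be checked numerically. In the sketch below, the epipolar line in camera B for an image point $\mathbf{x}_A$ is $\mathbf{l}_B = F\mathbf{x}_A$, and the matching point satisfies $\mathbf{x}_B^T F \mathbf{x}_A = 0$. The geometry is an assumed toy case chosen so that $F$ is easy to write down: both cameras have identity intrinsics and differ only by a translation $\mathbf{t}$, in which case $F$ reduces to the cross-product matrix of $\mathbf{t}$. None of the numbers come from [20].

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed toy geometry: identity intrinsics, no rotation, and camera B
# coordinates given by X_b = X_a + t. Under these assumptions F = [t]_x.
t = np.array([1.0, 0.0, 0.0])
F = skew(t)

X_a = np.array([0.5, -0.2, 4.0])  # a world point in camera A coordinates
x_a = X_a / X_a[2]                # its homogeneous image point in camera A
X_b = X_a + t
x_b = X_b / X_b[2]                # its homogeneous image point in camera B

line_b = F @ x_a                  # epipolar line (a, b, c): a*u + b*v + c = 0
residual = float(x_b @ line_b)    # zero when x_b lies on the line
print(line_b, residual)
```

The image of the world point in camera B can lie anywhere along `line_b`, depending on its depth, which is why a single view constrains the match to a line rather than a point.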
Figure 2.1: Epipolar geometry: the fundamental matrix $F$ links the image points of 3d world point $\mathbf{X}$ in two cameras by $\mathbf{x}_B^T F \mathbf{x}_A = 0$. $\mathbf{e}_A$ and $\mathbf{e}_B$ are the epipoles – the images of the other camera’s centre.
Rather than using epipolar geometry, [21] uses sets of stereo cameras.
Each pair of stereo cameras generates depth maps that can be combined
with appearance-based matching to create maps of likely target locations.
The map for each camera can be transformed until the cameras line up,
at which point full multi-camera tracking can be conducted.
Two papers, [22] and [23], describe methods that require the operator
to specify corresponding ground point pairs in cameras. Those point
pairs lead to a homography between the cameras that can be used to
match targets based on their projected locations in each camera, although
no rules are given on how this is done.
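Since [22] and [23] give no matching rules, the following sketch shows only one simple possibility, and should not be read as the method of those papers or of this thesis: project each target's ground point from camera A into camera B, then pair it with the nearest camera-B target within a distance threshold. The homography, the labels, the coordinates, and the threshold are all hypothetical.

```python
import numpy as np

def project_point(h, point):
    """Map one image point through the homography h."""
    x, y, w = h @ np.array([point[0], point[1], 1.0])
    return np.array([x / w, y / w])

def match_by_projection(h_ab, feet_a, feet_b, max_distance):
    """Pair each camera-A target with the nearest camera-B target, if close enough."""
    matches = {}
    for label_a, point_a in feet_a.items():
        predicted = project_point(h_ab, point_a)
        best_label, best_dist = None, max_distance
        for label_b, point_b in feet_b.items():
            dist = float(np.linalg.norm(predicted - np.asarray(point_b, dtype=float)))
            if dist <= best_dist:
                best_label, best_dist = label_b, dist
        if best_label is not None:
            matches[label_a] = best_label
    return matches

# Hypothetical homography (a pure image shift) and ground points in pixels.
h_ab = np.array([[1.0, 0.0, 10.0],
                 [0.0, 1.0, -5.0],
                 [0.0, 0.0, 1.0]])
feet_a = {"a1": (100, 200), "a2": (0, 0)}
feet_b = {"b1": (112, 196), "b2": (300, 50)}
matches = match_by_projection(h_ab, feet_a, feet_b, max_distance=15.0)
print(matches)
```

This greedy rule does not enforce one-to-one matches and has no memory between frames, which is part of why explicit rules for creating and maintaining associations are needed.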
2.2.4 Un-calibrated overlapping cameras
One of the more important recent papers in multi-camera tracking is [24],
by Khan and Shah. This thesis builds on concepts and methods from that
paper. Essentially, the algorithm, discussed at length in Chapter 3, records
entry and exit events in one camera, eventually finding the lines created
by projecting the edges of the camera’s field of view onto a ground plane.
Figure 2.2 illustrates this geometry. Once those field of view (fov) lines
are known, target matches can be created between cameras whenever an
object crosses one of those lines. An improvement to the original work is
made in [25] by re-defining when an entry or exit event is detected – this
improvement is discussed in Chapter 3.
Figure 2.2: This thesis and [24] find field of view lines on ground plane π.
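As a rough illustration of how a field of view line might be recovered from accumulated events, the sketch below fits a line to a set of image points by total least squares. It assumes that each recorded event contributes one image point lying near the line; the exact event definition is the subject of Chapter 3 and is not reproduced here. The function name and the sample points are invented for the example.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line a*x + b*y + c = 0 with a**2 + b**2 == 1."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The line normal is the direction of least variance of the centred points.
    _, _, vt = np.linalg.svd(pts - centroid)
    a, b = vt[-1]
    c = -(a * centroid[0] + b * centroid[1])
    return float(a), float(b), float(c)

# Invented event locations lying exactly on y = 2x + 1.
events = [(0, 1), (1, 3), (2, 5), (3, 7)]
a, b, c = fit_line(events)
residuals = [abs(a * x + b * y + c) for x, y in events]
print(a, b, c, residuals)
```

With real, noisy events the residuals would not vanish, and more events would be needed before the line could be trusted.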
There are other approaches that effectively recover a geometric
relationship between cameras. Starting from an appearance-based matching
method, [26] uses a homography-based method to recover from
occlusions during which the appearance model cannot be updated. The user
effectively specifies the homography. Point intersections are used in [27]
to create a 3d map of the world. Correct point intersections reinforce each
other, leading to higher probability of a match, while incorrect matches
are more scattered around the world, and so lead to a lower probability
of matching.
Chapter 3
Proposed method
3.1 Overview
As described in Sections 1.2 and 1.3, an algorithm is desired that receives
two video camera feeds from camera A and camera B. The cameras have
at least partially overlapping fields of view. By watching people move
through the scene the algorithm should eventually be able to predict the
location of a person in camera B based solely on their position in camera
A, and vice versa. This thesis will implement three functional objectives:
• Single-camera tracking,
• Learning the relationship between two cameras, and
• Using the relationship to predict target locations.
In Section 3.2, various algorithms used in this thesis will be explained.
First, the two parts of single-camera tracking – background subtraction
and object tracking – will be discussed. This will be followed by the
method used to learn the relationship between camera A and camera B,
which consists of two main parts: finding field of view lines and then
accumulating enough corresponding points to calculate a plane-induced
homography. Finally, once that homography is known, the novel algo-
rithm that creates, updates, and refines relationships between targets will
be introduced in Section 3.2.7.
Following the algorithm development, Section 3.3 will cover how the
various components of the system will be tested.
3.1.1 Notation
In general, the following conventions will be used except when we adopt
the notation from specific papers:
• Scalars are represented with lower-case letters. y = 3.
• Vectors are represented with bold lower-case letters, and are in col-
umn form. Transposing a vector gives us a list made up of a single
row. xT = [x1 x2 x3].
• Matrices are represented with upper-case letters. In x′ = Hx, x′ and
x are column vectors, and H is an appropriately-sized matrix.
3.2 Algorithms
3.2.1 Background subtraction
The first step in processing each single-camera input frame is to deter-
mine which parts of the image are background, and hence un-interesting,
and which are foreground. In this work, foreground objects are moving
people. The algorithm first described in [28] and [29], and elaborated
on in [30], describes a system that accomplishes these goals. Although
the system is designed to work on traffic scenes, where the targets (also
known as Moving Visual Objects, or mvos) are cars, the algorithm has the
following desirable attributes:
• The constantly-updated background model can adapt to changing
lighting conditions, objects becoming part of the scene background
(e.g. parking cars), and formerly-background objects leaving the
scene (e.g. departing cars);
• Shadow-suppression; and
• Parameters to tune for various object speeds and sizes.
The general flow of the algorithm can be seen in Figure 3.1.
The background model is made up of a number of the previous frames,
held in a FIFO queue. A number of methods can be used to initialize the
buffer. For instance, the first frame read can be replicated to fill each
Figure 3.1: The background subtraction algorithm from [30]. [Flowchart boxes: Initialize background model; Read in frame; Identify shadows; Identify MVOs and ghosts; Identify MVO shadows; Update background model. Output: MVO map.]
slot in the queue, although if the frame has foreground elements then the
MVO map will be incorrect at the start of processing. Alternatively, the
algorithm can be set to run in a training mode, reading in, say, 100 frames.
The median of those frames can then be calculated and the result used to
fill the background model queue. The number of frames used to generate
this model is a trade-off between the amount of memory available and
the amount of traffic in any particular pixel. The method to calculate the
background image from the buffer is described below.
In the main loop, there is a background image B. Any particular pixel,
p, in the background has a hue BH(p), saturation BS(p), and value BV(p),
which are easily calculated from the pixel’s red, green, and blue values.
An image, I, is read from the incoming video stream. Pixel p in the
incoming image has hue, saturation, and value IH(p), IS(p), and IV(p).
Once an image has been read, the first step is to identify shadow
pixels. Pixel p in the shadow map, SM, is classified as shadow if:
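\[
SM(p) =
\begin{cases}
1 & \text{if } \left\{
\begin{aligned}
& \alpha \le \frac{I_V(p)}{B_V(p)} \le \beta \\
& \left| I_S(p) - B_S(p) \right| \le \tau_S \\
& \min\left( \left| I_H(p) - B_H(p) \right|,\ 2\pi - \left| I_H(p) - B_H(p) \right| \right) \le \tau_H
\end{aligned}
\right. \\
0 & \text{otherwise}
\end{cases}
\tag{3.1}
\]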
The first condition lets pixels be shadow if they have a value ratio
between two thresholds. In [29] a preliminary sensitivity analysis was
carried out, with values of α ∈ [0.3, 0.5] and β ∈ [0.5, 0.9].
The next two conditions essentially require that the image pixel have
the same colour as the background. Thresholds τS and τH are set by the
user. In the analysis mentioned above, τS ∈ [0.1, 0.9] and τH ∈ [0, 0.1].
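As an illustration only, the three conditions of Equation 3.1 can be evaluated for whole images at once. The sketch below assumes NumPy arrays with hue in radians and saturation and value scaled to [0, 1]; the function name and the default thresholds are hypothetical picks inside the ranges quoted above, not the settings used in this thesis.

```python
import numpy as np

def shadow_map(I_h, I_s, I_v, B_h, B_s, B_v,
               alpha=0.4, beta=0.8, tau_s=0.2, tau_h=0.1):
    """Boolean shadow map SM following Equation 3.1.

    Hue arrays are assumed to be in radians, saturation and value in [0, 1].
    The default thresholds are illustrative, not values from the thesis.
    """
    # Condition 1: value ratio between alpha and beta
    # (guard against division by zero for very dark background pixels).
    ratio = I_v / np.maximum(B_v, 1e-6)
    value_ok = (alpha <= ratio) & (ratio <= beta)
    # Condition 2: similar saturation.
    sat_ok = np.abs(I_s - B_s) <= tau_s
    # Condition 3: similar hue, measured the short way around the hue circle.
    dh = np.abs(I_h - B_h)
    hue_ok = np.minimum(dh, 2 * np.pi - dh) <= tau_h
    return value_ok & sat_ok & hue_ok
```

The hue test takes the smaller of the two arc lengths, so hues of 0.0 and 6.25 radians count as close even though their plain difference is large.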
Once the shadow map has been created, mvos are identified. First,
every image pixel’s distance from the background is calculated. The dis-
tance measure used is the largest absolute difference from the background
of each of the pixel’s hue, saturation, and value.
\[
D_B(p) = \max_{c \in \{H, S, V\}} \left| I_c(p) - B_c(p) \right|
\tag{3.2}
\]
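A minimal sketch of Equation 3.2, assuming the image and the background are stored as (rows, cols, 3) NumPy arrays whose three channels are hue, saturation, and value on comparable scales. The function name is hypothetical, and the plain difference used here does not wrap the hue channel.

```python
import numpy as np

def background_distance(image_hsv, background_hsv):
    """Per-pixel D_B of Equation 3.2 for (rows, cols, 3) HSV arrays."""
    # Largest absolute channel difference at every pixel.
    return np.max(np.abs(image_hsv - background_hsv), axis=2)
```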
Pixels with DB(p) lower than a threshold TL are not given further con-
sideration as foreground. A morphological opening is performed on the
blobs of pixels with DB(p) > TL. Following the opening, any pixels that
were previously found to be shadows are turned off, not to be considered
as potential foreground pixels. Next, each remaining blob is checked for
three conditions. If the blob satisfies all three conditions, then its pixels
are labelled as part of an mvo.
• The blob has an area larger than threshold Tarea.
• The blob contains at least one pixel with DB(p) > TH. Note that
TH > TL.
• The blob’s average optical flow must be above threshold TAOF.
Objects that do not meet all three criteria are called ghosts. No guid-
ance is given in [30] on how to select the thresholds TH, TL, Tarea, or TAOF.
Specific implementation details used in this thesis are discussed in
Chapter 4.
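The three blob conditions can be sketched as follows. The sketch assumes that the morphological opening, the shadow removal, and a connected-component labelling have already been done, so that `labels` holds one positive integer per blob and zero elsewhere, and that a per-pixel optical-flow magnitude `flow_mag` is available. These names, and the function itself, are hypothetical.

```python
import numpy as np

def classify_blobs(labels, d_b, flow_mag, t_area, t_high, t_aof):
    """Split labelled blobs into an MVO mask and a ghost mask.

    labels:   integer array, 0 = no blob, k > 0 = pixels of blob k
    d_b:      per-pixel distance from the background (Equation 3.2)
    flow_mag: per-pixel optical-flow magnitude (assumed precomputed)
    """
    mvo = np.zeros(labels.shape, dtype=bool)
    ghost = np.zeros(labels.shape, dtype=bool)
    for blob_id in np.unique(labels):
        if blob_id == 0:
            continue
        pixels = labels == blob_id
        big_enough = pixels.sum() > t_area             # area above T_area
        strong_pixel = (d_b[pixels] > t_high).any()    # one pixel above T_H
        moving = flow_mag[pixels].mean() > t_aof       # average flow above T_AOF
        if big_enough and strong_pixel and moving:
            mvo |= pixels
        else:
            ghost |= pixels
    return mvo, ghost
```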
At this point, the MVO map has been created, and can be passed to
the single-camera tracking algorithm. However, the background model
must still be updated for the next frame. To do this, the shadow map is
partitioned into foreground shadows and ghost shadows. Shadow pixel
blobs that touch any of the foreground blobs are, naturally, labelled as
foreground shadows. Ghost shadows are pixels in shadow blobs that do
not touch an MVO.
There are now five possible classifications for any given pixel: fore-
ground mvo, foreground shadow, background, ghost, and ghost shadow.
The background model update proceeds as follows:
1. Pixels in the current frame that have been marked as foreground or
foreground shadow are replaced with the pixels from the current
background image.
2. The modified current frame is pushed onto the background FIFO
queue. The oldest image, at the end of the queue, is discarded.
3. n images are selected from the buffer, equally spread by m inter-
vening frames. For instance, every fifth image might be selected
(m = 5), with eight images in total (n = 8). The queue is sized
n×m.
4. The median of the selected images is calculated. The result is the updated
background image for the next incoming frame.
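The four update steps can be sketched with a bounded double-ended queue. This is a sketch under assumptions: frames are single-channel NumPy arrays, the queue is a `collections.deque` created with `maxlen = n * m` and the newest frame at the left, and the function name is hypothetical.

```python
from collections import deque
import numpy as np

def update_background(queue, frame, background, foreground_mask, n, m):
    """One background-model update following steps 1 to 4.

    queue:           deque of past frames, maxlen n * m, newest at the left
    foreground_mask: True on foreground MVO and foreground-shadow pixels
    """
    # Step 1: replace foreground pixels with the current background.
    patched = np.where(foreground_mask, background, frame)
    # Step 2: push the patched frame; a full deque discards its oldest entry.
    queue.appendleft(patched)
    # Step 3: select every m-th image, at most n of them.
    selected = list(queue)[::m][:n]
    # Step 4: the per-pixel median is the background for the next frame.
    return np.median(np.stack(selected), axis=0)
```

Using a deque with `maxlen` makes step 2 a single call, because appending to a full deque drops the entry at the opposite end.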
3.2.2 Single-camera tracking
In order to associate targets using multiple cameras, we first need to track
moving objects in a single camera. The input to the single camera tracker
is the output of the background subtraction algorithm, namely, a series
of blob masks that is non-zero on foreground pixels, and zero elsewhere.
The output of the single-camera tracker can be represented many ways.
In the present work the output is a frame-sized array, with each pixel’s
value an integer corresponding to a blob label. As targets move around
the scene the shape and position of their corresponding blob changes with
the output of the background subtraction algorithm. After single-camera
tracking, the value of the pixels in each target’s blob should be constant.
We add the additional requirement that target labels should be histor-
ically unique. This means that once a target disappears from the scene,
whether by leaving the field of view of the camera or by being fully-
occluded, its tracking label is not used again.
Depending on the machine architecture, the language used, and the
expected run times it is possible that the target label variable may even-
tually overflow. If this event is expected to occur inside the temporal
window where the oldest target and the newest target are active in some
part of the surveilled area, then there might be a problem. Since the
scenes used in this research were fairly short and had low traffic rates,
we did not encounter this problem when using 16-bit unsigned integer
target labels (65,535 possible labels).
The single-camera tracking algorithm used in this thesis can be found
in [5]. For basic motion it uses a simple mask-overlap method of track-
ing blobs from frame to frame. In more complex motion cases, such
as partial occlusion, merging, and splitting, the algorithm uses a shape-
matching technique developed in [6]. The following is a brief overview
of the single-camera tracking algorithm; for more detail the reader is en-
couraged to examine the original papers.
Overlap tracking
In an un-crowded scene, targets move around without being occluded by
other moving targets. In this case tracking is a fairly simple process if
one key assumption is made: slow motion. This means that targets are
required to be imaged in such a position that they partially overlap their
previous location. For most surveillance applications with reasonably-
sized targets and normal video frame rates, this is a reasonable assump-
tion.
Given a map of foreground pixels identified by the background sub-
traction algorithm for frame t, and the output of the single-camera tracker
from the previous cycle (frame t− 1), we perform the following tests:
• Does each blob in frame t overlap any previously-identified targets
in frame t− 1? If so, how many?
• How many blobs in frame t overlap each target from frame t− 1?
If the answer to the first question is zero then the blob is a new target.
It is given a new target label, and the new target counter is incremented.
If the answer to both questions is exactly one then the slow mo-
tion assumption means that the blob is the same target as the one that it
overlaps. Thus, the blob in frame t inherits the target label from frame
t− 1, and for that blob, tracking is complete.
Complex motion – splitting
If two or more unconnected blobs in the current frame t overlap one
contiguous target in frame t− 1, then this indicates a splitting event. This
could happen when two people who originally walked side-by-side take
different paths.
In this case, [5] says that the largest blob in frame t should inherit the
tracking label from frame t− 1, and the smaller blob(s) should receive a
new, unique tracking label(s).
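The overlap and splitting rules above can be sketched as a single pass over the blobs of frame t. The sketch assumes integer label maps in which zero is background; `prev_labels` holds the target labels from frame t − 1 and `blobs` holds an arbitrary connected-component numbering for frame t. The function name and the returned list of blobs that need shape matching are hypothetical, and merging itself is not handled here.

```python
import numpy as np

def track_by_overlap(prev_labels, blobs, next_label):
    """Propagate target labels from frame t-1 to frame t by mask overlap."""
    out = np.zeros_like(prev_labels)
    claims = {}   # old target label -> list of (area, blob id) overlapping it
    merged = []   # blobs overlapping several old targets (need shape matching)
    for b in np.unique(blobs):
        if b == 0:
            continue
        region = blobs == b
        olds = np.unique(prev_labels[region])
        olds = olds[olds != 0]
        if len(olds) == 0:                  # no overlap: a new target
            out[region] = next_label
            next_label += 1
        elif len(olds) == 1:                # candidate to inherit one old label
            claims.setdefault(int(olds[0]), []).append((int(region.sum()), int(b)))
        else:
            merged.append(int(b))
    for old, candidates in claims.items():
        candidates.sort(reverse=True)       # largest blob first
        out[blobs == candidates[0][1]] = old
        for _, b in candidates[1:]:         # split: smaller blobs get new labels
            out[blobs == b] = next_label
            next_label += 1
    return out, next_label, merged
```

Because `next_label` only ever increases, labels stay historically unique, as required above.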
Complex motion – merging
If a contiguous blob in frame t overlaps more than one previously identi-
fied target region then we must determine which pixels in the blob belong
to each of the old targets. This situation occurs whenever two targets walk
towards each other – there is a period of time when one target partially
occludes the other. Matching is done using the shape matching technique
introduced in [6]. Essentially, the algorithm proceeds as follows:
1. The current frame is segmented into regions based on colour and
texture information using the algorithm found in [7].
2. B-splines are fitted around various segments, starting with the seg-
ment that has the largest area. The splines are fitted according to
algorithms found in [31].
3. The B-spline control points of the targets in frame t − 1 are com-
pared to the control points of shapes made up of the frame t image
segments using a metric based on the Hausdorff distance. The met-
ric is invariant to affine transformations [6].
4. The segments that make up the best match for each of the previous
target shapes inherit the old tracking labels.
Note that if one target fully occludes the other then the occluded tar-
get’s tracking label is deprecated and is not used again. When the oc-
cluded target emerges from full occlusion back into partial occlusion the
tracker essentially believes that the occluding target is growing. Eventu-
ally the targets split into two non-contiguous blobs, and are treated as
described in the splitting section, above. Thus, when a target passes
through full occlusion, it will emerge with a new target label not related
to its old label.
3.2.3 Field of view line determination
Finding the relationship between the objects in camera A and objects in
camera B can be broken into two parts. The first part consists of finding
Field of View (fov) lines and creating target associations using a modified
implementation of [24]. That algorithm shall be described in this subsec-
tion. The second part consists of using the meta-target information to
create a list of corresponding points (Section 3.2.5), and then using those
points to find the plane-induced homography (Section 3.2.6).
As discussed in the previous chapter, the aim of [24] is similar to the
aim of this work. In short, we want to find the relationship between mov-
ing people as seen by cameras A and B. Whereas this thesis’s approach
wishes to find the correspondence between targets at any point in the
frame, the fov approach is designed to create meta-targets based on a
single correspondence event between tracks of moving objects. This sin-
gle correspondence event occurs when the target passes into or out of
the field of view of camera B. That instant is known as an “fov event”.
In [25] an fov event is slightly redefined to be the time at which an object
has completely entered a scene, and is not touching any frame edge. fov
events are also triggered at the instant when an in-scene target begins to
touch a frame edge. In the code implemented for this thesis we use the
modified definition of an fov event. Figure 3.2 shows a target just as it is
about to trigger an edge event.
In the following paragraphs we use notation from [24].
Figure 3.2: In camera B a target is almost completely in the field of view. When no part of the target mask touches the frame’s edge then an fov event is triggered.
Once an fov event has been triggered by object k as seen in camera B,
$O^B_k$, the fov constraint allows us to choose the best matching object
$O^A_i$ from the set of all moving objects in camera A, $O^A$. We quote from [24]:
If a new view of an object is seen in [camera B] such that it has
entered the image along side s, then the corresponding view
of the same object will be visible on the line $L^{B,s}_A$ in [camera A]
The line $L^{B,s}_A$ is the field of view line of side s of camera B, as seen by
camera A. This is illustrated in Figure 3.3.
The algorithm essentially has two steps: learn the fov lines of camera
B as seen by camera A, and then create associations whenever a target
Figure 3.3: The top side of camera B projects to a line on ground plane π, shown as $L^{B,s}_\pi$. That line is then imaged by camera A as $L^{B,s}_A$.
steps into camera B. Automatic recovery of the fov lines is done by fol-
lowing these steps:
1. Whenever a target enters or leaves camera B on side s, trigger an
fov event.
2. When an fov event has been triggered, record the feet feature loca-
tions of all targets in camera A. Note that each side of camera B will
have its own map of accumulated feet locations.
3. After a minimum number of fov events have been recorded, per-
form a Hough transform on the map of feet points. The result is the
fov line in camera A, $L^{B,s}_A$.
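Step 3 can be illustrated with a basic Hough accumulator over the recorded feet points. This is a sketch only: it works on a list of (x, y) points rather than a map image, the angular and radial resolutions are arbitrary choices, and the function name is hypothetical. It is not the implementation used in this thesis.

```python
import numpy as np

def hough_line(points, n_theta=180, rho_step=1.0):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1 for the
    line supported by the most points, using a basic Hough accumulator."""
    pts = np.asarray(points, dtype=float)
    thetas = np.arange(n_theta) * np.pi / n_theta          # angles in [0, pi)
    # rho = x cos(theta) + y sin(theta) for every point and every angle.
    rhos = pts[:, :1] * np.cos(thetas) + pts[:, 1:2] * np.sin(thetas)
    bins = np.round(rhos / rho_step).astype(int)
    offset = bins.min()
    acc = np.zeros((bins.max() - offset + 1, n_theta), dtype=int)
    cols = np.broadcast_to(np.arange(n_theta), bins.shape)
    np.add.at(acc, (bins - offset, cols), 1)               # cast one vote per cell
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    rho = (r + offset) * rho_step
    return float(np.cos(thetas[t])), float(np.sin(thetas[t])), float(-rho)
```

The returned coefficients are already scaled so that a² + b² = 1, which is the form needed when measuring the distance from a feet point to the line.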
Thereafter, whenever a target crosses side s of camera B, triggering an
fov event, it can be associated with the target in camera A that is closest
to $L^{B,s}_A$. To find the target with the feet closest to the fov line, we can use
the homogeneous representation of $L^{B,s}_A$: $\mathbf{l} = [a\ b\ c]^T$ with $ax + by + c = 0$,
scaled such that $a^2 + b^2 = 1$. x and y can be taken as positions on the row
and column axes of camera A. If the homogeneous representation of the
target’s feet point is used, $\mathbf{x} = [x\ y\ 1]^T$, then the signed perpendicular distance
from the line to the feet point is given by the dot product $d = \mathbf{l}^T\mathbf{x}$, and the
closest target is the one with the smallest $|d|$.
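A short sketch of this distance test, assuming the feet points of all camera A targets are available as (x, y) pairs. The function name is hypothetical; the line is rescaled inside the function so that a² + b² = 1 holds even if the caller passes an unscaled line.

```python
import numpy as np

def closest_to_line(line, feet_points):
    """Index of the feet point nearest to the line, plus all signed distances.

    line:        (a, b, c) with a*x + b*y + c = 0
    feet_points: list of (x, y) feet locations in camera A
    """
    l = np.asarray(line, dtype=float)
    l = l / np.hypot(l[0], l[1])                   # enforce a^2 + b^2 = 1
    x = np.column_stack([np.asarray(feet_points, dtype=float),
                         np.ones(len(feet_points))])
    d = x @ l                                      # signed distances l^T x
    return int(np.argmin(np.abs(d))), d
```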
The initial association of targets as they cross fov lines essentially cre-
ates a point correspondence between the two feet feature locations. Khan
and Shah say that they can use those point correspondences to find a
homography between the two cameras [24]. Once the homography is
known, they can use it to predict target locations, thereby creating tar-
get associations for targets that are not close to an fov line. Although a
projective transformation was noted to be the general solution, [24] noted
that an affine transform was sufficient for their purposes. No details are
given on how to implement this homography-based matching system.
If only one side is used by targets to enter and exit the frame then the
homography will be degenerate, and the matching system will not work.
The method described in [25] also attempts to find a homography us-
ing point correspondences. However, instead of relying on target-created
point correspondences, they use the four points created by intersecting
fov lines. This method requires that all four fov lines be known. If one
or more edges have insufficient traffic to find the fov line, then the ho-
mography is not computable with this method.
The method of [25] also introduces a potential source of error: because
fov events are triggered when the target is fully inside the frame, the
camera B fov lines are displaced towards the centre of the frame. Yet
the corner points, used for the calculation of the homography, are left at
the corners of the frame. Because the homography is computed from exactly
four points, it fits them exactly (i.e. it is not a least-squares error-minimizing
solution over many input points), so any error in those points is passed
directly into the homography. The exact nature
of the error will depend on the angle of the cameras, the average height
of segmented targets, and which fov lines are inaccurate.
In relying solely on the fov line point correspondences, [24] effectively
makes the following assumptions:
1. At the moment they cross an fov line, all targets are properly seg-
mented in both viewpoints.
2. The location of the feet feature is correctly determined for all targets.
3. A homography based on points near fov lines will predict feet loca-
tions for targets in the middle of the frame with sufficient accuracy
to associate targets from camera to camera.
Ideally, these restrictive assumptions will be valid. However, if the seg-
mentation of targets is incorrect in either camera, either through addition
(i.e. background pixels wrongly flagged as part of an mvo) or subtraction
(i.e. mvo pixels not included in the target’s mask), then the feet locations
will be incorrectly identified. This will have many knock-on effects: the
fov lines will be skewed; a target may be incorrectly judged as being clos-
est to the fov line; the list of corresponding points will be incorrect; and
thus the homography used for matching might contain significant errors.
We overcome these potential sources of error using methods intro-
duced below in Sections 3.2.5 and 3.2.7. First, we discuss two ancillary
algorithms: how to find the feet feature, and how to calculate a homog-
raphy given a list of corresponding points.
3.2.4 Determining feet locations
Finding the feet of a moving object is essentially a feature detection prob-
lem. In the present case, the input to the feet locator function will be a
mask of a moving object – a standing or walking person. The challenge is
to find a single point that represents the “feet” of that person. In general,
a person will have two feet touching or nearly touching the ground. The
single point then can be thought of as representing the location of the
person’s centre of gravity, projected onto the ground plane.
As mentioned above, the requirement for a single point stems from
our desire to project that point into another camera’s coordinate system.
The projection of a person’s centre of gravity should be onto the same
point on the world plane regardless of the location of the cameras.
(a) The feet are said to be at the centre of the bottom edge of the bounding box.
(b) If a target is tilted relative to the camera, then the method will fail to correctly identify the feet location.
Figure 3.4: The method of finding the feet feature location from [24].
In [24], Khan and Shah drew a vertical rectangular bounding box
around a moving object. The target’s feet were then deemed to be po-
sitioned at the centre of the bottom edge of the bounding box. This can
be seen in Figure 3.4(a). This method is widely used in the computer
vision community.
The bounding-box method of finding feet described in [24] has a sig-
nificant potential failure mode. It effectively requires the person to have
their height axis aligned with the vertical axis of the camera. If the camera
is tilted relative to the person, which could occur if the camera is installed
on a tilted platform or if the person has a consistent tilt in their gait, then
the bounding box will enclose a large amount of non-target area. As a
result, the bottom centre of the bounding box will not necessarily be close
to the actual feet of the target. Figure 3.4(b) shows this error mode.
The second failure mode of the bounding-box feet-finding method is
that the feet will always be outside of the convex hull formed by the target
mask. This means that the feet will always appear below the actual feet
location. When considered in three dimensions for targets on a ground
plane, this means that the feet will always be marked closer to the camera
than they should be. Given a wide disparity in camera views, the pro-
jected locations of the feet in each view will not be at the same point –
the feet will be marked somewhere between the best feet location and the
cameras’ locations.
Instead of relying on the camera and the target’s height axis to be
aligned, if the target is rotated to a vertical position then the problem of
tilted targets is eliminated. One of many methods to find the rotation
angle is to perform a principal component analysis (pca) on the target
mask. On a two-dimensional mask, the output of the pca can be inter-
preted as a rotation that overlays the direction of maximum variability
with the first axis, i.e. the row axis. For most camera angles, people will
appear to be much taller than they are wide, so the direction of maximum
variability will be the height axis.
Once the pca has rotated the object to eliminate the effect of a tilted
target, the actual position of the feet needs to be found. Instead of simply
using the centre of the bottom of the rotated bounding box, which may
lead to an inaccurate feature location as described above, the following
method can be employed:
1. Trace the outline of the target mask.
2. For each point on the outline of the mask, find the distance to the
two bottom corners of the bounding box.
3. For each of the two corners, find the outline point with the smallest distance to it.
4. The feet are located halfway between those two points.
The method is illustrated in Figure 3.5.
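The rotation and the four steps above can be sketched in plain Python. Two simplifications are assumed: the pca of a two-dimensional mask is computed in closed form from the 2 × 2 covariance of the pixel coordinates, and all mask pixels are searched instead of a traced outline (the mask pixel nearest to a bounding-box corner always lies on the outline, so the result is the same). The function name is hypothetical, and a target lying exactly horizontal in the image is not handled.

```python
import math

def find_feet(mask_points):
    """mask_points: list of (row, col) pixels in the target mask.
    Returns the feet location as (row, col) in the original image frame."""
    n = len(mask_points)
    mr = sum(r for r, _ in mask_points) / n
    mc = sum(c for _, c in mask_points) / n
    # 2x2 covariance of the pixel coordinates.
    srr = sum((r - mr) ** 2 for r, _ in mask_points) / n
    scc = sum((c - mc) ** 2 for _, c in mask_points) / n
    src = sum((r - mr) * (c - mc) for r, c in mask_points) / n
    # Angle of the principal axis, measured from the row axis.
    phi = 0.5 * math.atan2(2 * src, srr - scc)
    cs, sn = math.cos(phi), math.sin(phi)
    # Rotate so that the principal axis lies along the row axis.
    rot = [((r - mr) * cs + (c - mc) * sn, -(r - mr) * sn + (c - mc) * cs)
           for r, c in mask_points]
    bottom = max(u for u, _ in rot)          # rows grow downwards
    left = min(v for _, v in rot)
    right = max(v for _, v in rot)

    def nearest(corner):
        return min(rot, key=lambda p: math.dist(p, corner))

    (u1, v1), (u2, v2) = nearest((bottom, left)), nearest((bottom, right))
    u, v = (u1 + u2) / 2, (v1 + v2) / 2      # halfway between the two points
    # Rotate back into image coordinates.
    return (mr + u * cs - v * sn, mc + u * sn + v * cs)
```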
Height determination
Section 3.2.5, below, requires the measurement of the height of a target.
This is a solved problem for calibrated cameras: [32] showed a method
to find the actual in-world height of a target. However, this thesis does
not use calibrated cameras. Luckily, the algorithm does not need the real-
world height – only the target’s apparent height in pixels is needed.
Figure 3.5: After using pca to align the height axis with the bounding box, the two points on the target’s outline closest to the bounding box’s corners are averaged. The result is the location of the feet feature.
The method to determine the height begins similarly to that described
when finding the feet above. Again, since a target may be rotated with
respect to the camera’s vertical axis, we can not simply take the height of
the bounding box as the height of the object. Rather, the pca algorithm is
applied to the pixels in the target mask, and the target is rotated upright.
Now the vertical axis is parallel with the target’s height axis. At this stage
there are two options: simply take the height of the bounding box as the
height of the target, or find the feet feature and take the vertical distance
from that point to the top of the bounding box as the height. The latter
method, which can be seen in Figure 3.6, is used in this thesis.
Figure 3.6: The height of a target is the vertical distance between the feet and the top of the correctly-oriented bounding box.
In contrast to the feet, the head of a target does not split into two
elements, so the top of the bounding box is nearly always at the correct
location on the target’s head. Furthermore, the shoulders and arms are
often closer to the bounding box corners than the closest head pixel. This
justifies not using the same feet-finding method on the head, and instead
simply using the top of the bounding box.
3.2.5 Dropping markers
In Section 3.2.3 we noted that [24] used only corresponding point pairs
that were found during fov events to calculate the plane-induced homog-
raphy Hπ. These points naturally line the edges of the frame of camera B.
However, since only one point is created for each target (unless the target
surfs the fov line), the correspondence points will be sparse. Depending
on where targets enter and exit the frame, there may be many collinear
correspondences on one fov line and only a few on the other lines. If
only one edge serves as an entry and exit point, then all the accumulated
feet locations will be collinear, and the homography will be degenerate
for our purposes.
Another method was introduced in [25]: using exactly the four points
that are formed by intersecting the fov lines and the corners of the frame
of camera B. This method has a large potential pitfall: if one or more fov
lines are unknown, then the method can not work. It is trivial to find a
situation where an fov line can not be found – anytime camera B has one
edge above the vanishing line of the plane formed by the targets’ heads
(i.e. slightly above the horizon), no targets will trigger edge events on that
edge. Therefore, that fov line will never be found, and the homography
will never be computable.
An improved method to generate the homography is desired. Ideally,
the method will have the following attributes:
• Many corresponding points,
• Non-sparse corresponding points, and
• Able to create corresponding points even if all targets enter and
leave the scene through the same edge.
In [33], Hansel and Gretel drop a trail of bread crumbs behind them
as they travel through the woods. We adapt this idea of a trail of bread
crumbs to the present algorithm. In this case, the bread crumbs, also
called markers, are located at the feet feature of each target, which is
assumed to be a point on the ground plane π. This can be seen in Figure
3.7.
Figure 3.7: As targets move through the scene, they leave trails of markers behind them. These markers are pairs of point correspondences.
After a meta-target is created by associating a target from camera A
with a target from camera B, at an fov event or using the homography-
based method described in Section 3.2.7, the marker-dropping algorithm
begins. The logical flow through the algorithm is shown in Figure 3.8.
Every frame, a number of tests are performed to decide whether to drop a
marker. If the tests are passed then a marker is dropped, thereby creating
a corresponding point pair.
Figure 3.8: The flow of the corresponding-point generation algorithm. After creation, this algorithm is run on every meta-target. [Flowchart boxes: Are both MVOs completely inside frame?; Find height of both MVOs; Are both heights close to historical median?; Is at least one feet location significantly different from last marker?; Add marker to list; Wait N frames; Wait 1 frame.]
The first test simply detects whether both target masks are completely
inside their respective frames. If one target is even partially out of the
frame then it is impossible to say with certainty where the feet feature is
located. Therefore, it is a bad idea to create a corresponding point pair.
The second test is of the height of the target. This test is designed
to prevent a marker from being dropped if in that particular frame the
target is grossly mis-segmented or partially-occluded. The method used
to determine the target height was outlined in Section 3.2.4. The height
of an object is measured in pixels, and compared to the median of the
heights in the past few frames. If the height is similar, then we pass on
to the next test. If it is significantly different, we abort processing of that
meta-target until the next frame. This test should work because the height
of a target should not change radically over the course of a few frames.
So long as the threshold is well-chosen, this test should only throw away
badly-segmented or partially-occluded targets.
The third test is to determine whether there has been significant mo-
tion in the past few frames. The distances of the meta-target’s feet features
from the last marker location are calculated. One of the following three
cases must be true:
• Both feet features are close to the previous marker. This means that
the target has not moved much since dropping the last marker. Cre-
ating a new marker will essentially duplicate the previous marker,
so it is not useful. No marker should be dropped in this case.
• Exactly one of the targets has moved away from the previous marker.
This could have two explanations.
The first possibility is that the target has moved towards or away
from one camera centre. In that camera the target will appear in
nearly the same location. However, in the other camera the target
might be seen to move across-frame. This motion is important in
the creation of the homography, so the marker should be dropped.
The second explanation is that the target was segmented as fore-
ground significantly differently in the current frame than when the
last marker was dropped, but it still passed the height test. In this
case adding the marker to the list will likely increase the accu-
racy of the homography. This is because on average we expect the
feet feature to be correctly identified, so any incorrect feet will be
drowned out in the least-squares nature of the dlt algorithm.
• Both targets have moved away from the previous marker. This prob-
ably indicates real target motion. In this case a marker should be
dropped.
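The three tests can be collected into one decision function, sketched below. Every name and threshold here is hypothetical: `height_tol` is a fractional tolerance around the median height, `min_move` is a pixel distance, and dropping a marker when none exists yet is our assumption rather than something stated above.

```python
import math
from statistics import median

def should_drop_marker(inside_a, inside_b, height_a, height_b,
                       past_heights_a, past_heights_b,
                       feet_a, feet_b, last_marker,
                       height_tol=0.2, min_move=10.0):
    """Apply the three marker tests to one meta-target in one frame.

    last_marker is the previous (feet_a, feet_b) pair, or None if no marker
    has been dropped yet. The thresholds are illustrative only.
    """
    # Test 1: both masks must lie completely inside their frames.
    if not (inside_a and inside_b):
        return False
    # Test 2: each height must be close to its recent median.
    for h, past in ((height_a, past_heights_a), (height_b, past_heights_b)):
        med = median(past)
        if abs(h - med) > height_tol * med:
            return False
    # Test 3: at least one feet point must have moved away from the last marker.
    if last_marker is None:
        return True                      # assumption: the first marker is always dropped
    moved_a = math.dist(feet_a, last_marker[0]) > min_move
    moved_b = math.dist(feet_b, last_marker[1]) > min_move
    return moved_a or moved_b
```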
One of the requirements stated in Section 1.3 was that neither camera
can be on the ground plane. If the camera is in the ground plane then all
of the markers in that camera will appear on a line – the line formed by
the intersection of the camera’s focal plane and the ground plane. This
geometry forces the markers to be collinear, which leads to a degenerate
homography.
The end result is a list of corresponding points. If plotted on the
frame, the points will show the historical tracks of the meta-targets.
Depending on meta-target movement – how people walk through the
environment – the points may initially be roughly collinear. A homog-
raphy calculated based on corresponding points when one set of points
is collinear will be degenerate. Therefore, before calculating the homog-
raphy, we must test for collinearity.
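One possible collinearity test, offered only as an illustration and not necessarily the test used in this thesis, compares the two singular values of the centred point cloud: if the smaller one is tiny relative to the larger, the points lie close to a single line. The function name and the threshold `ratio_tol` are hypothetical, and at least two points are assumed.

```python
import numpy as np

def is_nearly_collinear(points, ratio_tol=0.05):
    """True if the 2-D points lie close to a single line.

    Compares the two singular values of the centred point cloud;
    ratio_tol is an illustrative threshold. Needs at least two points.
    """
    pts = np.asarray(points, dtype=float)
    centred = pts - pts.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)   # s[0] >= s[1] >= 0
    return bool(s[0] == 0 or s[1] / s[0] < ratio_tol)
```

The test would be run on the marker locations of each camera separately, since the homography degenerates when either set is collinear.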
3.2.6 Calculation of a homography
Given a point that lies on a world plane π, $\mathbf{X} = [X_1\ X_2\ X_3\ 1]^T$, we wish to
find a predictive relation between the images of X in two cameras. That
is, given image points $\mathbf{x} = [x_1\ x_2\ 1]^T$ and $\mathbf{x}' = [x'_1\ x'_2\ 1]^T$, we wish to find
$H_\pi$ such that
\[
\mathbf{x}' = H_\pi \mathbf{x}
\tag{3.3}
\]
Hπ is said to be the homography induced by the world plane π. It
can be thought of as two projectivities chained together: one that takes
a two-dimensional point from the image plane of the first camera, x, to
a two-dimensional point on plane π, and a second that takes a point
from plane π to a point on the second camera’s focal plane, x′. (Note
that a point that lies on a plane in the three dimensional world only has
two degrees of freedom – those that are required to move on the two
dimensional plane.) The geometry of such a scene can be seen in Figure
3.9.
Figure 3.9: A world plane $\pi$ induces a projective homography between the two image planes: $\mathbf{x}_B = H_\pi \mathbf{x}_A$. After [34], Fig. 13.1.

Calculation of $H_\pi$ is fairly simple using the Direct Linear Transformation
(dlt) algorithm. We follow the notation used in [34]. First, given
$n \ge 4$ corresponding points of the form $\mathbf{x}_i = [x_i\ y_i\ w_i]^T$ and $\mathbf{x}'_i = [x'_i\ y'_i\ w'_i]^T$,
we can take the cross product of each side of Equation 3.3 with $\mathbf{x}'_i$:

$$\mathbf{x}'_i \times H_\pi \mathbf{x}_i = \mathbf{x}'_i \times \mathbf{x}'_i = \mathbf{0} \tag{3.4}$$
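If expanded, the cross product can be written explicitly as a matrix
multiplication:

$$\begin{bmatrix} \mathbf{0}^T & -w'_i \mathbf{x}_i^T & y'_i \mathbf{x}_i^T \\ w'_i \mathbf{x}_i^T & \mathbf{0}^T & -x'_i \mathbf{x}_i^T \\ -y'_i \mathbf{x}_i^T & x'_i \mathbf{x}_i^T & \mathbf{0}^T \end{bmatrix} \begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \\ \mathbf{h}_3 \end{bmatrix} = A_i \mathbf{h} = \mathbf{0} \tag{3.5}$$

where $H$ has been rearranged into a column vector $\mathbf{h}$ using a lexicographic
ordering:

$$H = \begin{bmatrix} \mathbf{h}_1 & \mathbf{h}_2 & \mathbf{h}_3 \end{bmatrix}^T \tag{3.6}$$

so that each 3-vector $\mathbf{h}_j$, transposed, is the $j$-th row of $H$.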
In Equation 3.5, it can be seen that the third row is actually a linear
combination of the first two rows. Therefore, there are only two linearly
independent rows in $A_i$. This makes intuitive sense: each point correspondence
provides two constraints on the homography. The third elements of the
homogeneous points, $w_i$ and $w'_i$, are simply scale factors that do not
constrain $H$.
Therefore, if we only write the first two rows of $A_i$ for each of the
$n \ge 4$ point correspondences, then we can stack each $A_i$ into a $2n \times 9$
matrix $A$. Assuming that we have some measurement noise in our pixel
locations, then the solution to $A\mathbf{h} = \mathbf{0}$ will not be exact. We therefore
attempt to minimize $\|A\mathbf{h}\|$ subject to $\|\mathbf{h}\| = 1$, a constraint that excludes
the trivial solution $\mathbf{h} = \mathbf{0}$. As shown in [34], this is equivalent to
minimizing $\|A\mathbf{h}\| / \|\mathbf{h}\|$.
The solution is the unit singular vector corresponding to the smallest
singular value of $A$. By taking the Singular Value Decomposition (svd)
of $A$, such that $A = UDV^T$, $\mathbf{h}$ is the column of $V$ corresponding to the
smallest singular value in $D$. $H$ is obtained by simply rearranging the
values of $\mathbf{h}$ into a $3 \times 3$ matrix.
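The procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Matlab code written for this thesis: the function names are hypothetical, every point is assumed to have $w_i = w'_i = 1$, and, to stay free of external libraries, the svd is replaced by a cyclic Jacobi eigen-decomposition of $A^T A$. The eigenvector of $A^T A$ with the smallest eigenvalue is the same vector as the right singular vector of $A$ with the smallest singular value, so the result is the one described above; a library svd routine would normally be the better choice.

```python
import math

def design_matrix(src, dst):
    """Stack the first two rows of A_i (Equation 3.5) for each correspondence.

    src and dst are lists of (x, y) pixels; w_i = w'_i = 1 is assumed.
    """
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
    return rows

def smallest_eigenvector(S, sweeps=60):
    """Cyclic Jacobi iteration on a symmetric matrix (stand-in for the svd).

    Returns the unit eigenvector belonging to the smallest eigenvalue.
    """
    n = len(S)
    S = [[float(v) for v in row] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(S[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        diag = sum(S[i][i] ** 2 for i in range(n))
        if off <= 1e-30 * diag:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if S[p][q] == 0.0:
                    continue
                theta = (S[q][q] - S[p][p]) / (2.0 * S[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                for M in (S, V):      # right-multiply S and V by the rotation
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):    # left-multiply S by the transposed rotation
                    sp, sq = S[p][k], S[q][k]
                    S[p][k], S[q][k] = c * sp - s * sq, s * sp + c * sq
    j = min(range(n), key=lambda i: S[i][i])
    return [V[k][j] for k in range(n)]

def dlt_homography(src, dst):
    """Estimate H (3x3 nested list, unit norm) such that x' ~ H x."""
    A = design_matrix(src, dst)
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(9)] for i in range(9)]
    h = smallest_eigenvector(AtA)
    # Lexicographic ordering: consecutive triples of h are the rows of H.
    return [h[0:3], h[3:6], h[6:9]]
```

The returned matrix is only defined up to scale and sign, so two homographies should be compared after dividing each by one of its own entries (for example the bottom-right one, when it is not close to zero).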
Data normalization
In [35], Hartley noted that many users of the 8-point algorithm did not
pay close attention to numerical considerations when performing their
calculations. The 8-point algorithm is used to find the essential matrix,
but the first steps of the algorithm are very similar to the dlt described
above.
The arguments in the 1995 paper, re-explained in [34], boil down to
this: because xi and yi are typically measured in the hundreds, whereas
wi is usually about 1, some entries in Ai will have vastly different magnitudes
than others. When the svd is performed and h is found, the
effects of the low-magnitude entries of A will be drowned out by the
high-magnitude entries, and the result will be inaccurate.
The solution is to normalize or pre-condition the input data xi and x′i.
The normalization of each data set is performed by calculating two 3× 3
transformations, T and T′, that
1. Translate the points such that their centroids are at the origin, and
2. Scale the points such that the mean absolute distance from the origin
along the x and y axes is 1.
The second step is roughly equivalent to scaling the points such that the
mean absolute distance from the origin is $\sqrt{2}$.
The two transformations are applied to the two data sets before the
dlt algorithm is performed. After transformation, the average absolute
values of xi, yi, and wi will all be 1, and the solution will be numerically
well-conditioned.
Thus, instead of using x and x′, the inputs to the dlt are Tx and T′x′.
The output of the dlt will not operate directly on the image coordinates
– HDLT will only be valid for normalized coordinates. To directly use
image coordinates, we observe that:
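$$\begin{align}
T'\mathbf{x}' &= H_{DLT}\,T\mathbf{x} \notag \\
\mathbf{x}' &= T'^{-1} H_{DLT}\,T\mathbf{x} \notag \\
\mathbf{x}' &= \left( T'^{-1} H_{DLT}\,T \right) \mathbf{x} = H_\pi \mathbf{x} \tag{3.7}
\end{align}$$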
In other words, Hπ is the product of three matrices:
• A normalization transformation that works on points in the coordi-
nate system of the first image,
• HDLT, the output of the dlt algorithm fed with normalized data,
and
• The de-normalization transform that takes normalized coordinates
into the coordinate system of the second image.
It should be noted that the normalization step is very important. Although
the algorithm will appear to work correctly without normalization,
the predicted locations x′ can be badly in error when the input points are
noisy. Not including the data normalization step is an insidious error.
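The normalization and de-normalization steps can be sketched as below. This is a minimal illustration with hypothetical function names. It assumes a single isotropic scale factor, chosen so that the mean of $|x|$ and $|y|$ taken together is 1 after translation; the text above does not say whether the two axes are scaled separately, so that choice is an assumption. The sketch also assumes the points do not all coincide, since the scale would then be undefined.

```python
def normalizing_transform(points):
    """Return a 3x3 similarity T: centroid to origin, mean |x|, |y| about 1."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Assumption: one isotropic scale for both axes.
    spread = sum(abs(x - cx) + abs(y - cy) for x, y in points) / (2 * n)
    s = 1.0 / spread
    return [[s, 0.0, -s * cx], [0.0, s, -s * cy], [0.0, 0.0, 1.0]]

def invert_similarity(T):
    """Invert a transform of the form [[s, 0, tx], [0, s, ty], [0, 0, 1]]."""
    s, tx, ty = T[0][0], T[0][2], T[1][2]
    return [[1.0 / s, 0.0, -tx / s], [0.0, 1.0 / s, -ty / s], [0.0, 0.0, 1.0]]

def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def denormalize(H_dlt, T, T_prime):
    """Equation 3.7: H_pi = T'^-1 H_DLT T."""
    return matmul3(invert_similarity(T_prime), matmul3(H_dlt, T))
```

In use, `T` and `T_prime` would be computed from the two marker lists, the dlt would be run on the transformed points, and `denormalize` would convert its output back to pixel coordinates.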
3.2.7 Multiple-camera tracking with a homography
Let us recapitulate the algorithm to this point. Two cameras’ video feeds
were taken, background subtraction was performed, and unique tracking
labels were assigned to each target in each camera. By watching for Field
of View (fov) events, the lines marking the edges of the field of view of
camera B, as seen by camera A, were found. As a target crosses those
lines, it is associated with a target in the other camera. These associa-
tions are called meta-targets. As the meta-targets move around the world
they leave a trail of markers behind them, creating a list of correspond-
ing points on the world plane. Once the list of points in both cameras
is sufficiently non-collinear, the Direct Linear Transformation (dlt) algorithm
calculates the homography, Hπ, induced between the two cameras by the
world ground plane.
The task now is to take Hπ and the list of meta-targets found using
fov lines, refine the list, and correctly identify any meta-targets in the
scene. This process continues ad infinitum.
We first use the algorithm of Section 3.2.4 to find the feet feature location
for each active target in both cameras. The projected location of each
camera A target in the camera B coordinate system is given by

$$\hat{\mathbf{x}}_A = H_\pi \mathbf{x}_A \tag{3.8}$$

The inverse projection can also be carried out. The feet location of
each target in camera B projects to a location in the camera A coordinate
system by:

$$\hat{\mathbf{x}}_B = H_\pi^{-1} \mathbf{x}_B \tag{3.9}$$
In both of the previous equations, the hat (circumflex accent) signifies
that the location has been projected into the other camera’s coordinate
system and that it is not directly measured.
If the projected location of a target is within the bounds of the frame,
then the distance is calculated from that projected location to the feet of all
targets in that frame. The distance measure used is the scaled symmetric
transfer distance (sstd). We use the two terms interchangeably. For a
given homography Hπ, a point xA in camera A, a point xB in camera B,
and the diagonal size in pixels of the two frames, dA and dB, we define
the scaled symmetric transfer distance ε as
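$$\begin{align}
\varepsilon &= \frac{\|H_\pi \mathbf{x}_A - \mathbf{x}_B\|}{d_B} + \frac{\|H_\pi^{-1} \mathbf{x}_B - \mathbf{x}_A\|}{d_A} \tag{3.10} \\
&= \frac{\|\hat{\mathbf{x}}_A - \mathbf{x}_B\|}{d_B} + \frac{\|\hat{\mathbf{x}}_B - \mathbf{x}_A\|}{d_A} \tag{3.11}
\end{align}$$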
Figure 3.10: The scaled symmetric transfer distance is the sum of the distances between the points and their projected cousins, each divided by the diagonal frame size.
The terms are shown graphically in Figure 3.10. The sstd represents a
frame-size-independent method of determining which targets are closest
to each other. In the very worst case, with the homography projecting
the points to opposite corners from the real points, the sstd will have a
maximum value of 2. A flowchart depicting the first part of the multiple-
camera tracking algorithm is shown in Figure 3.11.
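As an illustration, Equation 3.10 can be written as a short Python function. The helper names are hypothetical, points are plain (x, y) pixel pairs, and frame sizes are given as (width, height). The check that a projected location falls inside the frame is left out of this sketch and would have to be made by the caller.

```python
import math

def apply_h(H, p):
    """Project pixel p = (x, y) through the 3x3 homography H."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def inverse3(M):
    """Inverse of a 3x3 matrix by the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def sstd(H, x_a, x_b, frame_a, frame_b):
    """Scaled symmetric transfer distance of Equation 3.10."""
    d_a = math.hypot(*frame_a)            # diagonal of camera A's frame
    d_b = math.hypot(*frame_b)            # diagonal of camera B's frame
    xa_hat = apply_h(H, x_a)              # camera A point projected into B
    xb_hat = apply_h(inverse3(H), x_b)    # camera B point projected into A
    return math.dist(xa_hat, x_b) / d_b + math.dist(xb_hat, x_a) / d_a
```

With an identity homography and two equal frames, a pair of points at opposite corners gives a value of 2, which matches the worst case described above.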
Following calculation of each of the applicable distances, the pair of
targets corresponding to the smallest distance is found. If that smallest
distance is above some maximum meta-target creation distance threshold
τε then no more good matches can be created and so the algorithm ends.
Figure 3.11: The first part of the algorithm consists of finding the sstd for each valid pair of targets. (Flowchart: for each object, find the feet; calculate $\hat{\mathbf{x}}_A$ or $\hat{\mathbf{x}}_B$; if the projected location is in the frame, calculate the scaled symmetric transfer distance to all other targets; otherwise, move on to the next object.)
If the smallest sstd is smaller than τε then any of the following cases
could be true of the pair of targets:
1. Neither of the targets is a member of a pre-existing meta-target. In
this case a new meta-target is created to associate the two targets.
This case occurs most frequently when a target appears in the mid-
dle of a frame, either through an entrance or after a full occlusion.
2. Both of the targets are members of the same pre-existing meta-
target. This means that the current method agrees with the method
used to create or maintain the meta-target, and that no changes need
to be made to the meta-target.
3. Exactly one of the targets already belongs to a pre-existing meta-
target. Should the pre-existing meta-target be replaced by the new
pairing? First, the distance measure of the pre-existing meta-target
is retrieved. In theory, if the pre-existing meta-target’s distance is
larger than the current pair’s distance, then we should delete the old
meta-target and replace it with the new pair. However, this might
introduce a flip-flopping behaviour where close real-world targets
lead to meta-targets continuously being created and destroyed.
To combat this undesirable behaviour we introduce a change reticence
threshold $\tau_\Delta$. In order to replace a pre-existing meta-target,
the current pair of targets' sstd must satisfy $\varepsilon_{\text{current}} < \tau_\Delta\,\varepsilon_{\text{old}}$. With
$\tau_\Delta < 1$, this means that a potential target pair will only replace a
pre-existing meta-target if the relevant distance measurement is
significantly better than the old distance.
4. Both of the targets belong to different pre-existing meta-targets. In
this case the same logic is applied as if there were only one pre-existing
meta-target. The pre-existing meta-target with the smallest
sstd is used for the comparison, i.e. $\varepsilon_{\text{current}} < \tau_\Delta \min(\varepsilon_{\text{old},1}, \varepsilon_{\text{old},2})$.
After the pair of targets has been dealt with the next smallest distance
is found. The algorithm then repeats for the pair of targets corresponding
to that distance, and so on until the smallest remaining distance is above
the maximum threshold τε. A flowchart depicting the second part of the
multi-camera tracking algorithm is shown in Figure 3.12.
Figure 3.12: The pair of targets corresponding to the smallest distance is selected. After the steps shown here are complete, the pair with the next best distance is selected, and the flow repeats. This continues until the next smallest distance is above a threshold $\tau_\varepsilon$. (Flowchart: count how many of the two targets are already in meta-targets. If 0, create a new meta-target. If 1, determine the pre-existing meta-target's distance. If 2 and both are in the same meta-target, take no action; otherwise find the smallest distance of the 2 meta-targets. Multiply that distance by the change reticence factor and keep whichever is smaller, the pre-existing meta-target or the new target pair.)

Any time a new meta-target is created, the row and column in the table
of distances corresponding to the two targets are set to a large value. This
prevents them from being considered further.
In the third and fourth cases listed above, there is a potential for mem-
bers of a pre-existing meta-target to be orphaned by a newer and bet-
ter meta-target. If this occurs, the orphan will necessarily have a larger
distance to any other target. However, the next largest distance for the
orphan may still be below the maximum meta-target creation distance
threshold τε. If this is the case then the orphan will be considered in one
of the next passes through the distance matrix for re-association with a
different target. Because the orphan will be considered in a later pass, no
additional logic is required to deal specifically with orphans.
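The second part of the algorithm, including the four cases listed above, can be sketched as a greedy loop. This is an illustrative sketch and not the thesis implementation. The function name and the dictionary layout are hypothetical, the distance stored with each pre-existing meta-target is assumed to be supplied by the caller, and entries are deleted from a working copy of the distance table instead of being set to a large value, which has the same effect.

```python
def associate(distances, meta_targets, tau_eps, tau_delta):
    """Greedy meta-target update following the four cases in the text.

    distances:    {(a, b): sstd} for every valid camera-A / camera-B pair
    meta_targets: {(a, b): sstd} for pre-existing meta-targets; updated in place
    """
    remaining = dict(distances)
    while remaining:
        (a, b), eps = min(remaining.items(), key=lambda item: item[1])
        if eps > tau_eps:
            break                                  # no good matches are left
        # Pre-existing meta-targets that involve either target of the pair.
        old = [pair for pair in meta_targets if pair[0] == a or pair[1] == b]
        create = False
        if not old:
            create = True                          # case 1
        elif old != [(a, b)]:                      # cases 3 and 4
            eps_old = min(meta_targets[pair] for pair in old)
            create = eps < tau_delta * eps_old
        # Case 2 (old == [(a, b)]): the pairing is confirmed, nothing changes.
        if create:
            for pair in old:
                del meta_targets[pair]             # may orphan the other member
            meta_targets[(a, b)] = eps
            # Remove the row of a and the column of b from further passes.
            remaining = {k: v for k, v in remaining.items()
                         if k[0] != a and k[1] != b}
        else:
            del remaining[(a, b)]
    return meta_targets
```

An orphaned target keeps its other entries in the working table, so it is picked up again on a later pass without any special handling, as described above.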
3.3 Testing and validation
The goal of the multiple-camera tracking system, stated simply, is to cor-
rectly associate target labels in one camera with target labels in another
camera. Testing this might seem simple – look at each pair of targets
when it is created, and divide the number of correct meta-target associ-
ations by the total number of meta-targets to get some measure of the
accuracy of the algorithm.
Unfortunately, this simple approach only works for meta-targets cre-
ated after the plane-induced homography Hπ is known. In the short
video sequences available to us, the time taken to learn Hπ is variable
and non-trivial. Indeed, since each new meta-target drops new markers
as it moves, Hπ is constantly changing. Rather than trying to measure
the performance of the whole algorithm at once, we try to measure the
performance of parts of the algorithm.
Our attention then turns to measuring the accuracy of the homog-
raphy. The homography learned by the system must be able to accurately
project points from one camera to the other. If this is not done well then
invalid meta-target associations could be created, and valid meta-targets
might not be found.
The system discussed above has had a couple of its components tested
in other publications. Background subtraction was covered in [30]. The
single-camera tracking system was developed and tested in [5]. Testing
of those components is not repeated in this thesis.
The critical parts of the algorithm left to be tested are the feet fea-
ture finder and the method to determine the homography, as well as the
performance of the homography-based matcher.
3.3.1 Testing the feet feature finder
The feet feature finding algorithm looks at a mask of a target and returns
a pixel location that best represents the centre of gravity of the person
projected onto the ground plane. This task, tricky to do automatically by
a computer, is simply and accurately performed by a human.
The algorithm shall be tested as follows:
1. At a prescribed interval in a video sequence, a frame shall be presented
to a user with the target person indicated but the precise
target mask not shown.
2. The user shall mark the location that best represents the feet feature.
Because the target mask is not shown, it will not colour the user’s
perception of where the feet should be.
3. The algorithm shall mark the feet feature.
4. The Euclidean distance between the two locations shall be calcu-
lated.
5. The distance shall then be divided by the diagonal of the frame size,
producing a scaled distance d as shown in Equation 3.12 below. If
this step was not performed, then distances measured in a video
camera with a large frame size would probably be larger than dis-
tances measured in a camera with a small frame size. Normalizing
the distance in this fashion enables comparison of the performance
of the algorithm when run on cameras with different frame sizes.
$$d = \frac{\|\mathbf{x}_{\text{human}} - \mathbf{x}_{\text{computer}}\|}{d_{\text{frame}}} \tag{3.12}$$
If the feet feature finder works perfectly, d will be zero. The maximum
possible value of d is 1, which would occur in the unlikely case that the
user and the algorithm pick locations on opposite corners of the image.
This is true regardless of the resolution of the camera, since the measure
d is scaled by the frame’s diagonal size.
For example, consider Figure 3.13. In this contrived example, the hu-
man selected the feet at xhuman = [164 149] and the algorithm found the
feet at xcomputer = [153 171]. The output of the camera is 240× 360. There-
fore,
$$d = \frac{\|[164\ 149] - [153\ 171]\|}{\sqrt{240^2 + 360^2}} = 0.057 \tag{3.13}$$

In this case, d = 0.057 indicates a difference of about 24.6 pixels in a
frame with a diagonal size of $d_{\text{frame}} \approx 433$ pixels.

Figure 3.13: In this contrived example, the green marker indicates the human-found location of the feet feature, and the blue marker indicates the location of the feet given by the algorithm. Note that this image is a blown-up section of the full frame.
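The worked example can be checked with a few lines of Python. The function name is hypothetical; the numbers are those of the contrived example above.

```python
import math

def scaled_feet_error(x_human, x_computer, frame_size):
    """Equation 3.12: Euclidean error divided by the frame diagonal."""
    return math.dist(x_human, x_computer) / math.hypot(*frame_size)

d = scaled_feet_error((164, 149), (153, 171), (240, 360))
print(round(d, 3))  # 0.057
```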
This method can be used to compare the performance of the bounding
box algorithm used in [24] to the performance of the rotated algorithm
introduced in Section 3.2.4. The better algorithm will have a lower average
value of d. It is hoped that the rotated method will be generally better,
and clearly better when the camera is tilted with respect to the height-axis
of the targets.
3.3.2 Testing the homography-based tracker
Numerical tests
In order for the correct meta-target associations to be created based on
the plane-induced homography Hπ, the scaled symmetric transfer distance
(sstd) for a given target pair must be both smaller than that of any other
pair involving either of those targets and smaller than the threshold τε.
The first step to testing the homography is to find a set of correspond-
ing points on the ground plane. At least four points must be chosen;
more is better. This is done by a person, not automatically by a computer.
Since the cameras are not moving, the point locations will be constant for
all frames in each video sequence.
Then, using the dlt, a truth homography can be found. Using this
homography, the points can be checked for consistency using the sstd –
large errors indicate that the point matches are outliers. This part of the
test is done to ensure that the points have been precisely and correctly
identified.
Once a set of corresponding truth points has been determined, the
algorithm is run. Eventually, the ground plane-induced homography will
be found, and after that time it will be constantly updated as more and
more markers are added to the corresponding point list (see Section 3.2.5).
The homography is recorded at certain times:
• The first time it is declared valid, i.e. just after enough of the mark-
ers have been declared non-collinear in both frames.
• At intervals after that time.
• At the end of the sequence.
The last homography should be the most accurate, since it is based
on the most corresponding points (recall that the dlt algorithm gives a
least-squares fit to the input points).
Once the homographies have been recorded, the sstds are calculated
for each of the truth points found by the operator. To be judged a suc-
cess, each homography must have scaled symmetric transfer distances
less than the threshold τε. This means that if the corresponding points
were target feet features, then they would be associated into meta-targets.
Assuming that it meets that standard, the best homography will have
sstds of zero, or at most on the same order as that of the truth homog-
raphy.
For a single-number comparison, the mean of the scaled symmetric
transfer distances can be used to compare homographies.
Visual test
By using standard image manipulation techniques, the image of one cam-
era’s ground plane can be transformed and re-sampled into the coordi-
nate system of the other camera. The two images can be overlaid using
alpha-blending. If the homography is perfect then the two ground planes
will overlap perfectly, and corresponding points will lie on top of each
other. Errors in the homography will show up as a disparity in the loca-
tions of features on the ground plane.
This method, a simple visual comparison, should be a quick way to
determine whether a homography accurately represents the geometry of
the scene.
Note that the visual comparison can only be made for points on the
ground plane π. Points off of the ground plane are not represented by the
two projective transformations that make up the homography found by
this algorithm (those two transformations are HA→π and Hπ→B). There-
fore, points off of the ground plane will not overlap in the melded image.
Artificial occlusion testing
If a sequence does not include natural occlusions, we can create artificial
occlusions by interrupting the program and deleting meta-target associa-
tions. When the algorithm is re-started, it will see targets in each frame
that appear completely new, as if they had magically appeared. If the
targets are correctly re-associated then the algorithm works properly.
3.4 Alternative methods
The method described above has some significant advantages over many
other multiple-camera tracking algorithms: it requires very little user intervention
and, aside from the ground plane, does not require particular
in-scene content. So long as the ground-plane requirement is satisfied,
after the operator selects a pair of partially overlapping cameras there re-
mains nothing to do but hit the “go” button and let the system produce
multiple-camera tracking results. As discussed in Section 1.2, the system
was designed for near-fully automatic operation and unsophisticated operators.
However, it is possible to build trackers that have other sets of re-
strictions. If additional information is available then alternative tracking
methods might be used, or the current tracker could be made more ro-
bust. In this section various methods of obtaining and using additional
pieces of information are discussed.
3.4.1 Improving this method
How might the present algorithm be made better? Can parts of the al-
gorithm be removed entirely if additional information is available? The
answer to both questions is yes, provided we can ask the operators to
perform additional duties.
Closed training
The present method for finding fov lines relies on targets naturally cross-
ing the lines during a training period. However, these crossings could be
triggered by a “planted” target. For instance, a person with knowledge of
the camera setup could enter the surveilled area and walk across the fov
lines at various points along the line. This would guarantee that every
recoverable fov line would be correctly identified, and so could be used
for meta-target association shortly after the system is activated.
Replacing FOV lines
Consider the fov part of the algorithm, discussed in Section 3.2.3. The
purpose of this system is to create an initial set of meta-target corre-
spondences, after which markers are dropped and the homography is
calculated. This system could be replaced by an operator. Given two
video streams, the operator could be asked to indicate corresponding tar-
get projections. This could be done with, for example, two touch screens
displaying single-camera tracked targets with a glowing aura. When the
operator touches one target in each camera at the same time, the system
acknowledges the input, perhaps by changing the colour of the targets to
be the same, and then lets that new meta-target start dropping markers.
Even more simply, the operator could be asked to directly draw the
fov lines as seen by camera A. This would jump-start the meta-target
creation process, since the system would not need to wait for fov events
before starting to create associations.
Directly specifying the homography
The system of dropping markers could be rendered irrelevant by an oper-
ator. The system essentially creates a list of corresponding point pairs. An
operator could create this list fairly easily with any number of tie-point
selection tools, such as Matlab’s cpselect() tool.
Naturally, the operator would have to be instructed to only pick points
on the ground plane, so as to prevent calculation of an incorrect homog-
raphy. This is almost exactly the method used in [23].
This method could be undermined by a featureless ground plane. If
fewer than four non-collinear corresponding point pairs are visible then the
homography will not be findable. In this case the operator could use the
truth-point selection tool used in this thesis, marking only points on a
target that are clearly on the ground plane such as one of the target’s feet
or the estimated location of the feet feature.
After the initial homography was specified by the operator and used
by the multiple-camera tracking algorithm, the marker system could be
activated. This would increase the number of corresponding points and
should therefore increase the accuracy of the homography.
At this point it should be noted that the foregoing methods all require
accurate operator input. This is significant, since this sort of information
might not always be available. The system discussed in this thesis does
not need this sort of input.
3.4.2 The fundamental matrix
The fundamental matrix $F$ links the images of a 3d world point in two
cameras. Any point in camera A leads to an epipolar line $\boldsymbol{\ell}_B = F\mathbf{x}_A$ in
camera B. The projection of the world point in camera B, $\mathbf{x}_B$, will lie on the
epipolar line and so $\mathbf{x}_B^T \boldsymbol{\ell}_B = \mathbf{x}_B^T F \mathbf{x}_A = 0$. This geometry was previously
shown in Figure 2.1 on page 18.
Although the fundamental matrix is a $3 \times 3$ matrix, it is singular, of
rank 2, and only has 7 degrees of freedom. The null space of $F$ is spanned
by the camera B epipole $\mathbf{e}_B$, the image of camera B as seen by camera A,
so that $F\mathbf{e}_B = \mathbf{0}$. The null space of $F^T$ is spanned by the camera A
epipole $\mathbf{e}_A$, the pixel location in camera B that is the image of camera A,
so that $F^T\mathbf{e}_A = \mathbf{0}$.
The fundamental matrix therefore links two cameras, but it is not
restricted to linking points on a ground plane like the ground plane-
induced homography used in this thesis. If the fundamental matrix is
known, it should be possible to create an algorithm better than the one
discussed in Section 3.2.7. Rather than the sstd measure, the new algo-
rithm might use a distance measure based on each point’s distance to the
epipolar line.
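Such a distance measure might look like the following sketch. It is not part of the algorithm developed in this thesis. The function names are hypothetical, and the example matrix in the usage note is the textbook fundamental matrix for two identical cameras displaced along the x axis, chosen only because its epipolar lines are easy to verify by hand.

```python
import math

def epipolar_line(F, x_a):
    """l_B = F x_A for a pixel x_a = (x, y) in camera A; returns (a, b, c)."""
    x, y = x_a
    return tuple(F[r][0] * x + F[r][1] * y + F[r][2] for r in range(3))

def point_line_distance(line, x_b):
    """Perpendicular pixel distance from x_b = (x, y) to a x + b y + c = 0."""
    a, b, c = line
    return abs(a * x_b[0] + b * x_b[1] + c) / math.hypot(a, b)

# Illustrative matrix only: pure sideways translation between the cameras,
# so the epipolar line of (x, y) is the horizontal line at the same y.
F_example = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
line_b = epipolar_line(F_example, (10.0, 40.0))
print(point_line_distance(line_b, (50.0, 43.0)))  # 3.0
```

A symmetric version would add the distance from the camera A point to the epipolar line generated by the camera B point, in the same spirit as the sstd.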
To track objects in an airport apron environment, [20] uses an epipolar
tracking system that was developed in [22].
Calculation of the fundamental matrix can be done in a variety of
ways:
• A corner-finding algorithm could be applied to images from two
cameras. Using a robust method such as ransac or the Least Me-
dian of Squares (LMedS), the fundamental matrix can be calculated
using a number of minimal samples from the list of corners. Corners
on moving targets or on non-overlapping regions should be rejected
by a robust method.
• If an operator is available, they could be asked to select correspond-
ing points between two cameras. The simplest method of calculat-
ing the fundamental matrix is the normalized eight point algorithm
introduced in [35]. It requires at least eight corresponding point
pairs; more than eight points will give a least-squares solution for
F, similar to the dlt algorithm discussed above in Section 3.2.6.
Once the fundamental matrix is found a 3d model of the scene can be
calculated using basic triangulation methods for known correspondence
points. A fundamental-matrix based tracker might allow corresponding
points to be found using appearance-based methods, provided that the
cameras were of the same modality. This includes both feet points –
which could be used to create a topographic map of the ground – and
other in-scene points such as corners on structures not on the ground
plane. These 3d points could then be displayed with image-based tex-
tures from arbitrary viewpoints. Once a basic model is created using,
say, the background images, the live targets could be injected into the
“world”. These targets could then be watched from an arbitrary view-
point by surveillance operators using game-like interface controls or full-
immersion stereo 3d interfaces. It would even be possible to view the
model with a binocular interface, since the retrieved world model would
have full 3d information for major points in the scene: visual corners and
moving targets.
There are clear advantages to using the fundamental matrix. Aside
from 3d world construction, targets could be tracked when a common
ground plane is not present. However, using the fundamental matrix
has downsides. Depending on the algorithm used, high-quality operator
input may be required to set up the system. If a more automatic method
is used to find the fundamental matrix (e.g. a ransac or LMedS-based
method), a significant number of feature points must be both detected
and shared between the two views. Also, the fundamental matrix does
not directly match points to points, it matches points to lines. This may
lead to problems in crowded scenes.
Calibrated cameras
A camera projects 3d world points onto points on a 2d image plane ac-
cording to x = PX. Camera calibration is the act of finding the 3 × 4
camera projection matrix $P$. For a basic finite projective camera,
$P = KR\,[I \mid -\mathbf{C}]$. In this equation, $\mathbf{C}$ is the location of the camera in the world
coordinate frame, $R$ is the rotation matrix to correctly point the camera in
the world, and $K$ is a matrix

$$K = \begin{bmatrix} \alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{3.14}$$

where the two $\alpha$ parameters are the scale parameters in each direction
(these are usually related to the focal length), $s$ is the skew factor, and $x_0$
and $y_0$ are the coordinates of the principal point.
Cameras can be calibrated using many methods. Some of the methods
are covered in [34]. Calibration methods include using automatically-
recognizable in-scene calibration targets, or manually selecting the image
locations of surveyed points.
Depending on how the cameras are calibrated, a variety of methods
may be used to determine a fundamental matrix or a plane-induced homography.
For instance, for two general cameras $P$ and $P'$, $F = [\mathbf{e}']_\times P' P^+$,
where $P^+$ is the pseudo-inverse of $P$ and $[\mathbf{e}']_\times$ is the skew-symmetric
matrix form of the epipole of the first camera (so $\mathbf{e}' = P'\mathbf{C}$ with $P\mathbf{C} = \mathbf{0}$) [34].
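To make the finite projective camera model concrete, the projection $\mathbf{x} = KR\,[I \mid -\mathbf{C}]\,\mathbf{X}$ can be sketched as below. The function name and all numerical values in the usage note are hypothetical; they are chosen only so that the result can be checked by hand.

```python
def project(K, R, C, X):
    """Project world point X = (X1, X2, X3) with P = K R [I | -C]."""
    d = [X[i] - C[i] for i in range(3)]                        # [I | -C] X
    cam = [sum(R[i][j] * d[j] for j in range(3)) for i in range(3)]
    img = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return img[0] / img[2], img[1] / img[2]                    # dehomogenize

# Hypothetical camera: no skew, principal point (180, 120), no rotation,
# centre 10 units behind the world origin along the third axis.
K = [[500.0, 0.0, 180.0], [0.0, 500.0, 120.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
C = (0.0, 0.0, -10.0)
print(project(K, R, C, (1.0, 2.0, 0.0)))  # (230.0, 220.0)
```

A point on the optical axis, such as the world origin in this example, projects to the principal point.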
More cameras
The algorithm developed for this thesis uses pairs of cameras. Also, it
only looks for fov events in one camera of the pair. The algorithm could
be enhanced to operate both ways, looking for fov events in both cameras
simultaneously in an effort to find meta-target correspondences quicker.
What happens if a third camera (camera C) is added to a currently-
existing pair of cameras (A and B)? In this case camera C could be paired
with overlapping camera B, and the inter-camera homography HBC could
be found with the usual method. At this stage two homographies are
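known: $H_{AB}$ and $H_{BC}$. However, since the cameras all share a common
ground plane, the missing homography could be easily calculated with
$H_{AC} = H_{BC}H_{AB}$, where, following the convention of Equation 3.3, $H_{AB}$ is
applied first to take a point from camera A to camera B.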
If cameras A and C do not actually overlap then the resulting homog-
raphy will not project feet points to locations in the frame of camera C;
this can be detected and the A-C pairing automatically rejected. However,
if cameras A and C do overlap then the multiple-camera tracker could use
HAC for tracking.
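A sketch of the chaining and of the overlap check follows. The function names are hypothetical. With the column-vector convention of Equation 3.3, a camera A point reaches camera C as $\mathbf{x}_C = H_{BC}(H_{AB}\mathbf{x}_A)$, so the matrix for the second step is the left factor of the product.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_h(H, p):
    """Project pixel p = (x, y) through the 3x3 homography H."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def chain(H_ab, H_bc):
    """Return H_AC such that x_C ~ H_BC (H_AB x_A)."""
    return matmul3(H_bc, H_ab)

def in_frame(p, frame_size):
    """True if pixel p lies inside a frame of size (width, height)."""
    return 0 <= p[0] < frame_size[0] and 0 <= p[1] < frame_size[1]
```

If no feet point of camera A lands inside the frame of camera C when projected through the chained homography, the A-C pairing can be rejected as described above.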
Many of the advanced chapters in [34] are devoted to three-view and
n-view geometry. The trifocal tensor and other geometric constructs can
be used similarly to the fundamental matrix, and are recoverable using
similar methods to those discussed above. In general, using these re-
lationships requires finding a number of corresponding points between
the cameras, calculating a geometric relationship (possibly using a robust
method), and then using that relationship to project where a target feature
will show up in another camera. The projected location can be compared
to the actual locations of the targets in the frame using a distance metric of
some sort, and meta-target relationships created. The method described
in this thesis is an implementation of this generalized algorithm: the rela-
tionship is the plane-induced homography between two cameras and the
distance metric is the scaled symmetric transfer distance.
In this thesis a priority was placed on automatic functioning with min-
imal operator input. As described above, other methods could be imple-
mented that use any amount of operator input. The common downside
to gathering more input is that the person installing or operating the sys-
tem needs to periodically spend time performing activities that do not
directly relate to tracking targets. The upside to the present algorithm
is that this is not required – the operator can concentrate on watching
targets instead of operating the surveillance system.
Chapter 4
Implementation details
This chapter explains exactly how the various algorithms presented in
Chapter 3 were implemented. The goal of this chapter is to give the
reader details on how to create a working computer program.
Reading Section 4.1 is recommended, since that section covers the
Generator program, written to simulate multiple cameras in a simple
world. The output of the Generator is used to test the algorithm in
Chapter 5.
The remainder of the chapter can be safely skipped by those not look-
ing for implementation details. The sections following Section 4.1 cover:
• Threshold selection. Since many of the thresholds were set using a
heuristic process, typical values are given with explanations of how
those values were determined.
• Specific tests. Some algorithms require specific tests to be passed
before they should be used. Those tests are outlined and where
necessary, their implementation explained.
The Generator program was written in C++. All other code was writ-
ten in Matlab using student version 2007a. All code should be avail-
able from the same source as this document. If not, email the author at
4.1 The Generator
As the multiple-camera tracking software was being developed, a need
for test data was identified. Early attempts to use real-world video
sequences added potential sources of error into the system. It was unclear
whether any given error was caused by errors in one of the
multiple-camera tracking functions, or in the background subtraction or
single-camera tracking code.
To satisfy this need, a program was written using C++ and the OpenGL
graphics library to simulate people moving around a small world, seen
from two vantage points. The program is known as the Generator. Some
of the features of the Generator are:
Independent extrinsic camera parameters The world location and orien-
tation of the two cameras are independent of each other. Both are
set in world coordinate terms directly in the code. These parameters
were set with a call to gluLookAt().
Independent intrinsic camera parameters Similarly, the intrinsic
properties of the cameras are independent. As a result, it was possible
to set the cameras to different resolutions (i.e. frame sizes) and focal
lengths. Although it is possible to do so using hand-coded matrices,
this is complicated. Instead, the parameters were set with calls to
gluPerspective().
Random target behaviours The people walking through the scene ap-
pear at semi-random locations. They initially are set to walk to-
wards a goal area around the centre of the world coordinate sys-
tem, which is approximately where the cameras are pointed. How-
ever, each target’s velocity vector is randomly disturbed by a small
amount each frame. This means that targets do not walk in boring,
straight lines, but instead speed up and slow down, changing
direction in unpredictable ways.
Simple target appearance - shape Targets are rigid. Each target is made
up of four ellipsoids: a head, a torso, and two legs. All targets are
the same size. The orientation of the legs in the world coordinate
system is dependent on the target’s velocity. Targets appear to move
in the direction indicated by their legs, not a direction that would
require side-stepping or unnaturally angled walking.
Simple target appearance - colour Targets are red, green, blue, cyan, yel-
low, or magenta. There is only one light in the scene, and it is
white. No significant lighting effects are used. Because OpenGL
is a raster-based graphics library, no shadows exist. As a result, in
the output images the targets have colour values of 255 in the ap-
propriate channels. This makes the targets trivial to segment. Fur-
thermore, it makes background subtraction easy. This leads to high-
quality single-camera tracking results. Having high-quality input to
the multiple-camera tracking code reduces the amount of code that
must be searched when looking for bugs.
Image output The Generator uses the ImageMagick magick++ library to
write frames to disk. The output of the program is two folders con-
taining thousands of numbered Portable Network Graphics (png)
images. Using images enabled the use of Matlab’s imread() func-
tion.
Synchronized output Frames from different cameras with the same number
are taken at the same instant. There is no target motion between
writing one image to disk and writing the other camera’s
image to disk. As a result, it is guaranteed that the targets will be
in geometrically-consistent positions from frame to frame.
(a) Camera A – resolution 600 × 450 (b) Camera B – resolution 400 × 300
Figure 4.1: A typical scene from the Generator. From the point of view of camera A, camera B is located somewhere in the top-left of the frame.
Alignment cues
Figure 4.1 shows a typical pair of frames from the Generator. The world
is structured with the following characteristics:
• The red, green, and blue lines near the centre of the frames indicate
the world X, Y, and Z axes, respectively. Each line is one unit long.
The lines join at the world origin.
• The white square is on the Z = 0 plane. It covers X = [−5 . . . 5] and
Y = [−5 . . . 5].
• Targets can move in the range X = [−10 . . . 10] and Y = [−10 . . . 10].
If they exceed this range then they fall off of the edge of the world
(the world is flat), and so are no longer displayed.
90% of the targets start at a random position on the edges of a square
with boundaries at X = ±8 and Y = ±8. The other 10% start in random
locations inside that square.
Each target’s initial velocity vector points to a random location inside
a square with boundaries at X = ±4 and Y = ±4. This means that
they start out walking towards somewhere near the centre of the world.
This helps to improve the chances that a target will trigger fov events in
camera B.
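The Generator itself was written in C++, and its source is not reproduced here. The following Python sketch only illustrates the spawning and random-walk rules described above. The walking speed, the size of the per-frame velocity disturbance, and all function names are assumptions chosen for this example.

```python
import math
import random

def spawn_target(rng):
    """Return (position, velocity) for a new target on the Z = 0 plane."""
    if rng.random() < 0.9:
        # 90% of targets: a random point on the perimeter of the +/-8 square.
        t = rng.uniform(-8.0, 8.0)
        side = rng.choice(["left", "right", "bottom", "top"])
        pos = {"left": (-8.0, t), "right": (8.0, t),
               "bottom": (t, -8.0), "top": (t, 8.0)}[side]
    else:
        # Remaining 10%: anywhere inside that square.
        pos = (rng.uniform(-8.0, 8.0), rng.uniform(-8.0, 8.0))
    # Initial heading: towards a random goal inside the +/-4 square.
    goal = (rng.uniform(-4.0, 4.0), rng.uniform(-4.0, 4.0))
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    norm = math.hypot(dx, dy) or 1.0
    speed = 0.05  # hypothetical walking speed, world units per frame
    return pos, (speed * dx / norm, speed * dy / norm)

def step(pos, vel, rng, jitter=0.005):
    """Advance one frame, disturbing the velocity by a small random amount."""
    vel = (vel[0] + rng.uniform(-jitter, jitter),
           vel[1] + rng.uniform(-jitter, jitter))
    pos = (pos[0] + vel[0], pos[1] + vel[1])
    # Targets beyond +/-10 have fallen off the world and are not displayed.
    visible = abs(pos[0]) <= 10.0 and abs(pos[1]) <= 10.0
    return pos, vel, visible

rng = random.Random(0)
pos, vel = spawn_target(rng)
```

Calling `step()` once per frame and discarding targets for which `visible` is false reproduces the behaviour described in the text, up to the assumed constants.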
The number of targets in the world at any given time can be lim-
ited. Limiting the number of active targets limits the number of distractor
points in the maps used to find the fov lines. The maximum number of
targets was typically set to three.
Section 3.3.2 called for a number of corresponding test points to be
found on the ground plane. Because the Generator world is somewhat
barren, a grid was overlaid on the white square. The resulting world has
many intersections on the ground plane that can be easily picked out by
hand. Figure 4.2 shows the gridded plane from Run 17, where camera A
was tilted by about twenty degrees.
In this thesis three runs are discussed. The runs are numbered 17, 18,
and 23. The numbers are based on the dates they were created, so, for
instance, there is no run 19 because the Generator was not run on the
19th.
(a) Camera A (b) Camera B
Figure 4.2: To be able to find corresponding points, a grid was overlaid on the world ground plane. The grid has unit spacing, and is aligned on the half units of the world X and Y axes.
4.2 Background subtraction
Of all the algorithms implemented in support of multiple-camera track-
ing, the background subtraction algorithm contains the largest number of
critical threshold values. These values are critical because if they are not
set properly then targets will not be detected, shadows will be included
in target masks, parts of targets will be called shadow, and background
pixels can be flagged as targets. The thresholds in this algorithm are:
• τH, the hue difference threshold for shadow detection
• τS, the saturation difference threshold for shadow detection
• α and β, the range of value ratios for shadow detection
• Tlow, a low threshold to determine which pixels are sufficiently dif-
ferent from the background to warrant further inspection
• Thigh, each blob must contain a pixel that exceeds this threshold to
be considered for further inspection
• TArea, the minimum size, in square pixels, of an mvo blob
• TAOF, the minimum average optical flow of a target blob
No automatic methods to select the thresholds were presented in any
of [28,29,30]. In [29] a preliminary sensitivity analysis was carried out on
the shadow detection thresholds. Ten combinations of the four variables
were tested.
Further qualitative testing of the shadow detection module was per-
formed through visual inspection of the various sub-results. Unfortu-
nately, we were unable to come up with any specific recommendations.
The ten combinations of values tested in [29] do not appear to have been
chosen with any sort of plan, and do not lead to particularly good results
on the test sequences. Currently, the following settings are used:
\begin{equation}
\begin{aligned}
\tau_H &= 0.4 \\
\tau_S &= [0.08 \ldots 0.3] \\
\alpha &= 0.2 \\
\beta &= 0.8
\end{aligned}
\tag{4.1}
\end{equation}
Note the range given for τS. This parameter needs to be tuned de-
pending on the strength of the shadows.
The remaining parameters govern how to determine which pixels are
part of an interesting blob, and which blobs are actually part of the back-
ground. We discuss each parameter in turn.
No guidance is given in any of the original papers on how to set
Tlow or Thigh. Tlow was set to 0.15 times the maximum value found in
DB (originally defined in Equation 3.2 on page 25). Similarly, Thigh was
set to 0.42 times the maximum value. These values were found by a
qualitative visual inspection of the results, seeking to get enough of the
target identified in the blob without making the blob too large.
The area threshold TArea was set to a fixed number of pixels in the
original papers. However, in this research we dealt with varying frame
sizes. As a result, a fixed threshold would not work. We therefore set the
area threshold to a multiple of the area of the frame:
\begin{equation}
T_{Area} = 10^{-3} \times (\mathrm{width} \times \mathrm{height}) \tag{4.2}
\end{equation}
Because the targets in the Generator were much smaller than targets in
the real-world scenes, the threshold was reduced by a factor of 10 when
processing simulated scenes.
The average optical flow threshold TAOF was set to a fixed number in
the original papers. Again, due to the different frame sizes, this method
would not work here. Because optical flow is measured in units of pixels,
we set it to a multiple of the diagonal size of the frame.
\begin{equation}
T_{AOF} = 0.8 \times 10^{-3} \sqrt{\mathrm{width}^2 + \mathrm{height}^2} \tag{4.3}
\end{equation}
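The frame-size-dependent thresholds can be collected in one small helper. This is a sketch in Python rather than the Matlab used for the thesis code; the function name, the dictionary keys, and the `max_difference` argument (standing in for the maximum value of the background difference image) are assumptions made for the example.

```python
import math

def blob_thresholds(width, height, max_difference, simulated=False):
    """Thresholds from Equations 4.2 and 4.3 plus the Tlow and Thigh rules."""
    diag = math.hypot(width, height)
    t_area = 1e-3 * width * height
    if simulated:
        # Generator targets are small, so the area threshold is cut by 10.
        t_area /= 10.0
    return {
        "T_low": 0.15 * max_difference,
        "T_high": 0.42 * max_difference,
        "T_area": t_area,          # square pixels
        "T_aof": 0.8e-3 * diag,    # pixels
    }

# Camera A of the Generator: 600 x 450 pixels, diagonal 750 pixels.
t = blob_thresholds(600, 450, max_difference=1.0, simulated=True)
```

For the 600 × 450 Generator frame this gives an area threshold of 27 square pixels and an average optical flow threshold of 0.6 pixels.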
Other background subtraction notes
The algorithm takes the median of a number of recent images to update
the background model. The suggested number of images from [30] is
n = 7, with 10-frame spacing and a single duplicate of the current frame.
We found that n = 9 with 7-frame spacing and a single duplicate of the
current frame worked well.
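One possible reading of this update rule is sketched below in Python. It assumes that n counts every image entering the median, so that n − 1 distinct frames spaced 7 frames apart are combined with one extra copy of the current frame; the text does not state this explicitly, and the opposite reading (n distinct frames plus the duplicate) is equally plausible. Frames are represented as flat lists of pixel values, and the function names are hypothetical.

```python
from statistics import median

def background_sample_indices(t, n=9, spacing=7):
    """Indices of the frames entering the median at time t.

    Assumed reading: n - 1 distinct frames ending at t, `spacing` frames
    apart, plus one duplicate of the current frame (n images in total).
    """
    distinct = [t - k * spacing for k in range(n - 1) if t - k * spacing >= 0]
    return sorted(distinct) + [t]

def median_background(frames, t, n=9, spacing=7):
    """Per-pixel median over the sampled frames."""
    sample = [frames[i] for i in background_sample_indices(t, n, spacing)]
    return [median(values) for values in zip(*sample)]

# Toy sequence: pixel 0 brightens over time, pixel 1 is static background.
frames = [[i, 100] for i in range(60)]
background = median_background(frames, t=49)
```

Duplicating the current frame gives recent appearance slightly more weight in the median without letting a single frame dominate the model.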
When profiling the background subtraction code, it was found that
calls to Matlab’s median() function were taking up 50% of the run time.
The next longest function was rgb2hsv(), which transforms RGB images
into the hue, saturation, and value (HSV) colour space. Both of these
functions could be re-written to be significantly faster than their Matlab
implementations, but at the cost of reduced accessibility and flexibility.
4.3 Single-camera tracking
None of the functions that make up the single-camera tracking algorithm
described in [5] remained unmodified during this research.
At the beginning of the research the system worked correctly on the test
sequences used in [5], each less than 200 frames long and each designed
to show a particular case (e.g. occlusion, crossing targets). However,
the algorithm did not work on any of the longer test sequences used for
this research, including the movies available in the PETS database [36].
Failures were typified by “hard” crashes such as reading beyond the end
of a matrix, rather than qualitative errors or “soft” crashes stemming from
faults inherent in the logic of the algorithm.
A number of changes were made in order to fix the single-camera
tracking algorithm. In general, the overall logical structure and algorith-
mic flow of the algorithm was kept the same. The changes made to the
algorithm fall into three main categories:
1. Replacement of the background subtraction components,
2. Improvements or creation of functionality, including bug fixes, and
3. Improvements in speed.
Background subtraction
The original algorithm used a static, pre-computed background image to
detect foreground targets. This led to significant problems with scenes
longer than a few hundred frames – eventually large chunks of the frame
would be identified as foreground. Furthermore, objects that exhibited
slow motion from frame t − 1 to frame t were discarded because the
algorithm’s inter-frame motion detection was naïve.
In this research, the background subtraction part of the single-camera
tracking algorithm was replaced with the algorithm described in Section
3.2.1, whose implementation details are found above in Section 4.2. This
significantly improved the detection of foreground objects. With the pre-
vious algorithm, targets would often disappear for one frame if their mo-
tion was insufficiently salient, but then re-appear in the next frame. With
the current algorithm this problem disappeared.
Functionality fixes
As previously mentioned, the single-camera tracker was effectively non-
functional due to frequent crashes. Many fixes were made, too numerous
to be fully described here. Some of the more important fixes include:
• A monotonically-increasing target counter. A new target now re-
ceives a unique label instead of receiving a previously-used label.
• The shape matching code was originally written to fit segments with
a B-spline using 60 control points. The number of control points was
a fixed constant. Some segments in the long test sequences
had perimeter lengths shorter than 60 pixels. As a result, the shape
matching code failed – it is impossible to have more unique B-spline
control points than there are pixels in the perimeter of an object. In
the current code, the number of control points was set to be 3/4
of the circumference of a circle with the same area as the segment.
Since a circle has the minimum circumference for a given area, this
guarantees that the number of B-spline control points will be less
than the number of perimeter pixels.
• Many morphological operations were changed to be dependent on
the frame size. The sizes of the various structure elements used to
enlarge targets and detect overlaps were made dependent on the
diagonal size of the frame instead of being hard-coded.
• At one point the algorithm needs to find points of high curvature
on the boundary of an object. The existing algorithm did not do this
effectively, and gave incorrect results. That algorithm was re-written
to provide correct results. As a happy side effect, the new code was
found to be much faster.
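The control-point rule from the shape-matching fix above can be written out directly. A circle of area A has radius √(A/π) and therefore circumference 2√(πA). The Python sketch below applies the 3/4 factor to that value; rounding down and enforcing a minimum of one control point are assumptions added for the example.

```python
import math

def control_point_count(area_pixels):
    """Number of B-spline control points for a segment of the given area."""
    # Circumference of the circle whose area equals the segment's area.
    circumference = 2.0 * math.sqrt(math.pi * area_pixels)
    # 3/4 of that circumference, rounded down (rounding is an assumption).
    return max(1, math.floor(0.75 * circumference))

small = control_point_count(100)   # e.g. a 10 x 10 pixel blob
large = control_point_count(400)
```

For a blob of 100 square pixels the equal-area circle has a circumference of about 35.4 pixels, so 26 control points are used, far fewer than the fixed 60 of the original code.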
Speed fixes
Many functions or individual stanzas were re-written to increase execu-
tion speed. The final speed increase was on the order of 50× – from
one frame every thirty seconds to about two frames per second. This
speed increase is especially evident during object merges, when the shape
matching code is called.
Some of the speed increases were possible by re-writing algorithms
to use matrix operations. In Matlab, matrix operations are almost always
significantly faster than numerically equivalent operations using
for loops and similar structures. This is a result of history: Matlab
was originally built as a front-end to the eispack and linpack linear al-
gebra libraries. Other improvements were made by changing a function’s
logic to reduce or eliminate the number of calls to expensive functions
such as circshift().
4.4 Finding FOV lines
Four significant changes were made to the algorithm discussed in Section
3.2.3, originally found in [24].
First, as in [25], an fov event was defined to occur when a target
completely entered the frame, or, when exiting, at the first instant that it
begins to touch an edge of the frame. This has the effect of moving the
fov lines inwards from their actual locations. However, it also increases
the accuracy of the lines, since partially-visible objects are not used to find
the lines. Although it is possible to run the feet feature finding algorithm
on a partially-visible target’s mask, it is far from clear that the output
will be the location of the feet of the target. Indeed it is possible that the
target’s feet are not even in the frame, although the feature finder will
give an output regardless. Making this change ensures that the feet are
visible, and so the feet feature finder has a good chance of finding the
correct point.
The other three changes have to do with the maps of points used
to calculate each fov line. When an fov event has been triggered from
camera B, the original algorithm adds the feet locations of all targets in
camera A to a map. Eventually, the algorithm uses a Hough transform to
find the fov line. However, in some sequences the background subtrac-
tion algorithm created foreground targets that lasted for a very small time
– usually one frame. The false targets were in areas such as tree-lines
on the horizon, where high-frequency wind-driven leaf motion appeared
as target motion. These targets were correctly given labels by the single-
camera tracker, but disappeared in the next frame. Naively including the
“feet” of these spurious targets effectively added distraction points to the
map of feet locations, potentially confusing the Hough transform.
To combat these spurious targets, a minimum age requirement was
added. In order to be included in the target map, a target must have
been visible for at least one frame before the current frame. This simple
requirement prevented inclusion of many of the distracting tree-line and
cloud targets.
The third change was to increase the minimum number of fov events
before the fov line is computed. In theory, two fov events should be
enough to define a line. However, if multiple targets are present in cam-
era A then their feet will distract the Hough transform, and the resulting
line will not necessarily be correct. No guidance is given in [24] on how
many fov events should be counted before computation of the fov line.
We raised the minimum number to six events. Changing this threshold ef-
fectively changes the training period before meta-target associations can
be created. A higher number means more training is required, but the
algorithm will be more resistant to high-traffic scenes with multiple dis-
tracting targets. If we had had scenes with more distracting targets (i.e.
higher-traffic scenes), then the minimum number of fov events could be
raised even higher.
The final change has to do with the actual method used to compute
the fov line. The original paper uses a Hough transform on the list of
points. Instead of this method, we decided to use the Random Sample
Consensus (ransac) method. Originally introduced in [37], ransac uses
minimal samples of the data to fit a model. Points are then classified into
inliers and outliers depending on how well they fit the model. The model
from the minimal sample that yields the maximum number of inliers is
the winner – it fits the data the best.
In this case the model is the slope and intercept of the line, and the
minimal set needed to find these parameters is two data points. We set
the inlier threshold at 1% of the diagonal size of the frame. The original
ransac implementation was found at [38], although we sped up the
code by replacing some stanzas with matrix operations. At the end of
the ransac algorithm the final output fov line is calculated by taking a
least-squares fit of the inliers.
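The procedure can be sketched as follows. This is a minimal Python illustration, not the implementation from [38] or the modified Matlab code: the iteration count, the fixed random seed, the handling of vertical sample pairs, and the function name are all assumptions.

```python
import math
import random

def fit_line_ransac(points, width, height, iterations=200, seed=0):
    """Fit y = m*x + b to (x, y) points; returns (m, b, inliers) or None."""
    rng = random.Random(seed)
    tol = 0.01 * math.hypot(width, height)  # inlier threshold: 1% of diagonal
    best = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # minimal sample: 2 points
        if x1 == x2:
            continue  # slope/intercept form cannot represent a vertical line
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        # Keep points whose perpendicular distance to the line is within tol.
        inliers = [(x, y) for x, y in points
                   if abs(m * x - y + b) / math.hypot(m, 1.0) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        return None
    # Final estimate: ordinary least-squares fit over the winning inlier set.
    n = len(best)
    sx = sum(x for x, _ in best)
    sy = sum(y for _, y in best)
    sxx = sum(x * x for x, _ in best)
    sxy = sum(x * y for x, y in best)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b, best

# Twenty feet points on y = 0.5x + 10 plus three distractors, 400 x 300 frame.
points = [(float(x), 0.5 * x + 10.0) for x in range(0, 400, 20)]
points += [(50.0, 250.0), (300.0, 20.0), (120.0, 200.0)]
m, b, inliers = fit_line_ransac(points, 400, 300)
```

Because the model is expressed as a slope and an intercept, a perfectly vertical fov line cannot be represented; a practical implementation would need a different line parameterisation for that case.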
In this implementation, the ransac method took approximately 2-4
times longer than the Hough transform to fit the same line. When more markers were used, ransac
slowed down. However, the actual time taken was very small – fitting
1000 points with ransac takes about 0.01 seconds on the machine used
for this research (a 1.83GHz Apple Macbook), and is far from the slowest
step in the system.
4.5 Dropping markers
There are two thresholds and a timer used to determine whether a marker
should be dropped in any given frame.
1. In order to drop a marker, both targets must be fairly close to the
median of their heights in the past few frames. This prevents drop-
ping a marker just as a target enters occlusion. The number of his-
torical frames to be considered was set to 10, and the threshold to
±25%.
2. Markers should not be dropped too close to the previous marker.
Therefore, at least one of the targets in the pair has to move more
than a minimal distance. This distance was set to 3% of the diagonal
frame size.
3. Assuming that the previous conditions were satisfied, there is no
reason why we should not drop a marker every frame. In the func-
tion there is a timer that can be set to ensure that markers are only
dropped after at least n frames have passed since the previous drop.
After experimentation, it was found that there was not much benefit
from reducing the number of markers in this manner, so the timer
was set to 1 – so long as the height and distance conditions are met,
a marker will be dropped every frame.
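The three conditions above can be combined into a single predicate. The Python sketch below is illustrative only: the per-target dictionaries, their keys, and the choice to include the current frame in the height history are assumptions, not details taken from the thesis code.

```python
import math
from statistics import median

def height_is_typical(heights, history=10, tol=0.25):
    """True if the newest height is within +/-tol of the recent median."""
    recent = heights[-history:]
    ref = median(recent)
    return abs(heights[-1] - ref) <= tol * ref

def moved_enough(feet, last_marker, diag, min_frac=0.03):
    """True if the feet are more than min_frac * diag from the last marker."""
    if last_marker is None:
        return True
    return math.dist(feet, last_marker) > min_frac * diag

def should_drop_marker(a, b, frames_since_drop, timer=1):
    """Decide whether the target pair (a, b) drops a marker in this frame."""
    heights_ok = (height_is_typical(a["heights"]) and
                  height_is_typical(b["heights"]))
    # At least one target of the pair must have moved far enough.
    moved = (moved_enough(a["feet"], a["last_marker"], a["diag"]) or
             moved_enough(b["feet"], b["last_marker"], b["diag"]))
    return heights_ok and moved and frames_since_drop >= timer

a = {"heights": [100.0] * 10, "feet": (200.0, 300.0),
     "last_marker": (190.0, 300.0), "diag": 750.0}
b = {"heights": [80.0] * 10, "feet": (100.0, 100.0),
     "last_marker": (70.0, 100.0), "diag": 500.0}
drop_now = should_drop_marker(a, b, frames_since_drop=1)
```

In this example the target in camera A has moved only 10 pixels, less than 3% of its 750-pixel diagonal, but the target in camera B has moved 30 pixels, more than 3% of 500, so the distance condition is met.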
4.6 Calculation of a homography
If either of the sets of points (xA and xB) used to calculate a homography
is collinear, then the homography will be degenerate. Such a condition
might occur near the start of dropping markers. If one target walks in
an approximately straight line, they might drop four or more markers
(enough to calculate a homography), but the points will be degenerate
for the purpose of predicting target location off of the line.
To detect this condition, the ransac line-fitting function that was used
to find fov lines was re-purposed. After fitting a line to the markers, the
number of inliers as a fraction of the number of input points is calculated.
There are two thresholds at work here:
1. The inlier threshold, measured in pixels, determines whether a given
point is close enough to the model line to be called an inlier. This
threshold was set to 2.5% of the diagonal frame size for all sequences
except the arena sequence, where it was set to 0.8%. Reducing this
threshold means that more points will be classified as outliers, so
the data is more likely to be called non-collinear and the
homography is more likely to be declared valid. In other words, a
lower threshold makes it more likely that nearly-collinear data will
be accepted.
2. The maximum number of inliers, as a percentage of the total number
of points, was set to 90%. This means that at most 90% of the points
can be inliers – if there are more inliers than this then the data is
called collinear and the homography is not calculated. Raising the
threshold above 90% means that fewer outliers (i.e. non-collinear
points) are required before the homography is calculated and declared
valid.
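The degeneracy test built from these two thresholds can be illustrated as follows. For brevity this Python sketch tries every pair of markers instead of drawing random samples as ransac does, which is only practical for short marker lists; that substitution, the function names, and the example points are assumptions made for the illustration.

```python
import math
from itertools import combinations

def max_inlier_fraction(points, tol):
    """Largest fraction of points within tol of a line through two of them."""
    best = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue
        # Perpendicular distance of every point to the line through the pair.
        count = sum(
            1 for x, y in points
            if abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1)) / length <= tol)
        best = max(best, count)
    return best / len(points)

def markers_are_collinear(points, width, height,
                          inlier_frac=0.025, max_inliers=0.90):
    """True if the markers are too close to a single line to trust."""
    tol = inlier_frac * math.hypot(width, height)
    return max_inlier_fraction(points, tol) > max_inliers

# One target walking in a nearly straight line across a 400 x 300 frame.
straight_walk = [(20.0 * i, 100.0 + (i % 2)) for i in range(10)]
# Markers spread over the ground plane.
spread = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0), (50.0, 200.0)]
degenerate = markers_are_collinear(straight_walk, 400, 300)
usable = not markers_are_collinear(spread, 400, 300)
```

Only when `markers_are_collinear()` returns false for the marker sets would the homography be computed.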
4.7 Homography-based multi-camera tracking
4.7.1 Thresholds
There are two thresholds in this function. The first, τε, determines the
maximum symmetric transfer distance for a pair of targets to be associ-
ated. The other is the change reticence threshold, which determines how
much better a pair has to be to replace a pre-existing meta-target.
Recall the first part of Equation 3.10, which defines the scaled sym-
metric transfer error ε:
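\begin{equation*}
\varepsilon = \frac{\lVert H_\pi x_A - x_B \rVert}{d_B} + \frac{\lVert H_\pi^{-1} x_B - x_A \rVert}{d_A}
\end{equation*}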
The threshold was set to $\tau_\varepsilon = 0.2\sqrt{\mathrm{width}^2 + \mathrm{height}^2}$, i.e. 20% of the
diagonal frame size. Setting a higher threshold means that matches could
be created between targets that are farther apart. This means that bad
matches might accidentally get created.
The second threshold is τ∆, the change reticence threshold. This is
multiplied with the sstd of the best of the previously-created meta-targets
threatened by the prospective pair of targets. If the distance of the prospec-
tive pair is lower than the reticence distance then the old meta-target(s)
are deleted and a new meta-target association is created. This threshold
was set to τ∆ = 0.75.
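The two thresholds can be seen working together in the following Python sketch. It assumes that, because ε is already divided by the diagonal frame sizes d_A and d_B, it can be compared directly with the value 0.2 (as is done in Table 5.2), and that a prospective pair must pass both the absolute threshold and the reticence test before replacing an existing meta-target. The function names and the toy homography are hypothetical.

```python
import math

def inverse3(H):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def project(H, p):
    """Apply homography H to an (x, y) point."""
    x, y = p
    u, v, w = (H[r][0] * x + H[r][1] * y + H[r][2] for r in range(3))
    return (u / w, v / w)

def sstd(H, x_a, x_b, d_a, d_b):
    """Scaled symmetric transfer distance between feet points x_a and x_b."""
    forward = math.dist(project(H, x_a), x_b) / d_b
    backward = math.dist(project(inverse3(H), x_b), x_a) / d_a
    return forward + backward

def should_replace(eps_new, eps_existing_best, tau_eps=0.2, tau_delta=0.75):
    """New pair must beat the absolute threshold and the reticence distance."""
    return eps_new < tau_eps and eps_new < tau_delta * eps_existing_best

# Toy homography: camera B sees camera A's ground plane shifted by 100 pixels.
H = [[1.0, 0.0, 100.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eps = sstd(H, x_a=(50.0, 60.0), x_b=(153.0, 64.0), d_a=750.0, d_b=500.0)
```

Here both transfer errors are 5 pixels, so ε = 5/500 + 5/750 ≈ 0.017. Such a pair would replace an existing meta-target whose distance is 0.05 (since 0.017 < 0.75 × 0.05), whereas a prospective pair with ε = 0.04 would not.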
4.7.2 Speed
The multiple-camera tracking function was found to be fairly fast com-
pared to the single-camera tracker and the background subtraction al-
gorithm. The actual tracking is a very quick process, with most of the
time being taken by the feet feature finder (which is itself speedy). When
many targets are in the scene, some form of caching locations of the feet
feature and removing redundant calls to the function would speed up the
system, but was not implemented in order to keep the code simple.
When the list of markers becomes large (> 2,500) the dlt algorithm,
and specifically the SVD contained therein, starts to become the slowest
part of the multi-camera tracker. However, even when the dlt is slow,
it is still an order of magnitude faster than the single-camera tracker.
An experimental change was made after most testing was completed to
sample the marker list so that only the last 1,000 markers were used. This
did ensure that the time taken was both consistent and small. The effect
of this change on the whole system was not tested. All results reported
in this thesis were made using all of the markers available.
Chapter 5
Results and discussion
5.1 Feet feature finder
Recall that the feet feature is the single point that represents the target’s
location on the ground plane. It is represented by a pixel location in the
image’s coordinate system.
5.1.1 Comparing to hand-found points
The data in Table 5.1 was found by calculating the distance between the
manually identified feet feature and the point identified by each algo-
rithm. Two algorithms were tested. The first was the widely-used method
of finding the centre of the bottom edge of the bounding box. The second
is the method described in Section 3.2.4, which rotates the target mask
upright, then finds the point between the two edge points closest to the
bottom corners of the bounding box.
The raw mean distance between the point pairs was measured. It
was then divided by the diagonal frame size for a resolution-independent
measurement. The ratio between the bounding box and rotated measurements
was taken: ratio = $d_{BB}/d_{rot}$. Ratios greater than 1 indicate that
the bounding box method performed worse, on average, than the rotated
method.
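As a worked example of this normalisation, the figures reported in Table 5.1 for Generator run 23, camera A, can be recomputed from the raw mean distances. The ratio is taken as it is tabulated there, i.e. the bounding box distance divided by the rotated distance. The Python below is only a sketch; the function name is hypothetical, and the 600 × 450 frame size is the camera A resolution given in Figure 4.1.

```python
import math

def scaled_mean_distance(raw_mean_pixels, diag):
    """Mean distance divided by the frame diagonal, reported x 10^3."""
    return 1e3 * raw_mean_pixels / diag

diag_a = math.hypot(600, 450)                  # 750 pixels for camera A
bb = scaled_mean_distance(2.66, diag_a)        # bounding box method
rot = scaled_mean_distance(1.35, diag_a)       # rotated method
ratio = bb / rot
```

Rounded to the precision used in the table, these values are 3.55, 1.80 and 1.97 respectively.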
The data shows that in all sequences the rotated method described in
Section 3.2.4 performs better than the widely used bounding box method.
In the un-tilted real-world sequences the bounding box method produces
results 28% more distant from the hand-identified location than the ro-
tated method. In the two sequences where the camera was tilted about
twenty degrees, arena A and Generator 17A, the rotated method performs
much better than the bounding box method.
However, despite the improvement, it should be noted that the differ-
ence between the two methods in units of pixels is actually fairly small.
Table 5.1: Data on the performance of the bounding box (B.B.) and rotated feet feature finders, from both Generator and real-world sequences.

Sequence            Diag. frame   Number of     Algorithm   Raw mean     Scaled mean    Ratio
                    size          test points               dist (pix)   dist ×10³
Gen. 23A            750           203           B.B.        2.66         3.55           1.97
                                                Rotated     1.35         1.80
Gen. 23B            500           106           B.B.        3.40         6.80           1.60
                                                Rotated     2.12         4.24
Gen. 17A (tilted)   750           55            B.B.        3.28         4.37           2.56
                                                Rotated     1.28         1.71
Gen. 17B            500           101           B.B.        2.77         5.54           2.16
                                                Rotated     1.28         2.56
Gym A               433           108           B.B.        4.60         10.6           1.39
                                                Rotated     3.30         7.63
Gym B               433           108           B.B.        5.93         13.7           1.27
                                                Rotated     4.68         10.8
Arena A (tilted)    433           56            B.B.        5.48         12.7           2.85
                                                Rotated     1.92         4.44
Arena B             433           42            B.B.        6.00         13.9           1.37
                                                Rotated     4.39         10.2
Field A             433           101           B.B.        4.70         10.9           1.12
                                                Rotated     4.20         9.72

5.1.2 Comparing meta-target creation distances
Knowing how well the feet finders perform with respect to a human observer
is good, but how well do the feet feature finding algorithms perform
in practice? The scaled symmetric transfer distance, ε, was recorded
every time a meta-target association was created. At the same instant, the
sstd was re-computed with the feet features found using the bounding
box method. The mean values of ε are shown in Table 5.2. The actual values
for each meta-target are plotted in Figures 5.1, 5.2, 5.3, 5.4, and 5.5.
Table 5.2: Mean scaled symmetric transfer distances at meta-target creation. The BB misses column is the number of meta-targets that would not have been created had the feet been found with the bounding box method, given a threshold of τε = 0.20.

Sequence   Number of       ε_actual ×10³   ε_BB ×10³   Ratio             BB misses
           meta-targets                                ε_BB / ε_actual
Gen. 18    143             8.40            9.70        1.15              0
Gen. 23    127             18.7            18.0        0.96              0
Gen. 17    261             27.8            29.0        1.04              1
Gym        108             66.7            90.8        1.36              5
Arena      140             47.9            90.6        1.89              7
Figures 5.1 and 5.2 show data from essentially identical sequences.
The only difference from Generator run 18 to run 23 was that the peo-
ple had different paths. Neither sequence had meta-targets created that
would have been missed had the bounding box method been used. Inter-
estingly, when we compare the two figures it is clear that the homography
used in run 18 started out much more accurate than the homography used
in run 23. This result is discussed below.
As described in Section 4.7, the threshold for meta-target creation was
set to τε = 0.2. Given that threshold, either feet feature finder would
perform quite well.
Figure 5.1: sstd upon meta-target creation for Generator sequence 18.
The results of the un-tilted real-world gym sequence, shown in Figure
5.3, have a similar shape to that of Generator run 18. Distances for the
first few meta-targets are near to the threshold. As time passes and more
markers are dropped, the accuracy of the homography is increased and
the sstds are reduced. In the gym sequence, five of the 108 meta-targets
would not have been created had the bounding-box method been used to
find the feet.
The tilted Generator sequence, run 17, shows similar behaviour to the
gym sequence and Generator run 18. That is, the homography starts out
with relatively large errors, but increases in accuracy as more markers
are dropped. With one camera tilted, the bounding box distance would
Figure 5.2: sstd upon meta-target creation for Generator sequence 23.
Figure 5.3: sstd upon meta-target creation for the real-world gym sequence. Note the five peaks where the bounding box line exceeds 0.2. Those meta-targets would not have been created if those distances were used.
Figure 5.4: sstd upon meta-target creation for Generator sequence 17. Camera A was tilted by about twenty degrees.
not have been below the threshold τε for one of the meta-targets; that
association would not have been created until the distance dropped below
the threshold.
The arena sequence, in which camera A is tilted, clearly shows the
benefit of using the rotated method to find the feet feature point. Fig-
ure 5.5 shows that even after a significant number of markers have been
dropped and the homography has settled, the bounding box method
still produces worse scaled symmetric transfer distances than the rotated
method. Although Table 5.2 indicates that only seven meta-targets would
not have been created with the bounding-box method based on a thresh-
old of τε = 0.2, Figure 5.5 shows that if a slightly lower threshold had
been used then many of the meta-targets would not have been correctly
associated at all.
Figure 5.5: sstd upon meta-target creation for the real-world Arena sequence. Camera A was tilted by about twenty degrees.
5.2 Homography
5.2.1 Markers
When testing the algorithm, it was often useful to examine the list of
markers in graphical form. While the algorithm was running, this allowed
an observer to watch markers being created.
Figure 5.6 shows the marker images from the Generator run 23. In
addition to the marker images, the fov lines have been overlaid on the
camera A image. It is clear that the fov lines are in the correct locations.
Observe the blank space near the top of the camera B image, Figure 5.6(b).
The height of this space roughly corresponds to the apparent height of a
target when it reaches the top edge of the field of view. Since markers are
not dropped for targets that touch any of the fov lines, this area does not
get any markers.
(a) Camera A (b) Camera B
Figure 5.6: Plotting the list of markers from the Generator 23 sequence shows the extent of the overlap between the cameras. Each marker corresponds to exactly one point in the other image. The fov lines have been overlaid in camera A.
Figure 5.7 shows the background model images for the gym sequence.
The recovered fov lines of camera B are superimposed on the background
image from camera A in Figure 5.7(a). The final set of markers for the
gym sequence, Figure 5.8, shows two interesting artifacts:
• The bottom fov line is incorrectly placed. This is because there were
very few edge events on the bottom edge of camera B, and they were
all close together.
• The same border effect mentioned above for the Generator sequence
is present here. In addition, there are clear borders on the left and
right edges. These are all caused by the same effect – typical targets
have a non-trivial height and width relative to the frame size, so by
the time that they stop touching all the edges their feet are already
significantly inside the frame.
(a) Camera A, with superimposed fov lines (b) Camera B
Figure 5.7: The background models from the gym sequence.
Note that neither artifact affects homography-based meta-target
creation. Since meta-target associations are only created when targets are
wholly inside the frame, the feet will be in the area of support of the
homography before the homography will be used to calculate a scaled
symmetric transfer distance.
(a) Camera A (b) Camera B
Figure 5.8: These are the markers from the gym sequence. The fov lines have been drawn on the image from camera A. The bad line is from the bottom of camera B.
5.2.2 Numerical tests with truth points
For Generator runs 18 and 23, 95 corresponding point pairs were manually
identified and used as ground truth. During the algorithm’s run the
sstd for each of those truth points was recorded at a regular interval
as the homography was updated. The mean distance at each interval is
shown in Figure 5.9(a). The distance for each of the 95 points using the fi-
nal homography for each of the two runs is shown in Figure 5.9(b), along
with the distances based on the homography that best fit the manually-
identified points, i.e. the truth homography.
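The measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scaling used here (each transfer distance divided by the diagonal of the frame it is measured in, then summed) stands in for the scaling defined earlier in this thesis, which may differ, and the function names, frame sizes and example homography are hypothetical.

```python
import math

def apply_h(H, pt):
    """Map an (x, y) pixel through a 3x3 homography stored as nested lists."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def adjugate(H):
    """Adjugate of a 3x3 matrix. It is proportional to the inverse, which is
    sufficient because a homography is only defined up to scale."""
    (a, b, c), (d, e, f), (g, h, i) = H
    return [[e * i - f * h, c * h - b * i, b * f - c * e],
            [f * g - d * i, a * i - c * g, c * d - a * f],
            [d * h - e * g, b * g - a * h, a * e - b * d]]

def scaled_transfer_distance(H_ab, pt_a, pt_b, size_a, size_b):
    """Symmetric transfer distance, each term scaled by its frame diagonal
    (ASSUMED scaling, chosen for illustration only)."""
    proj_b = apply_h(H_ab, pt_a)             # camera A point mapped into camera B
    proj_a = apply_h(adjugate(H_ab), pt_b)   # camera B point mapped into camera A
    err_b = math.hypot(proj_b[0] - pt_b[0], proj_b[1] - pt_b[1]) / math.hypot(*size_b)
    err_a = math.hypot(proj_a[0] - pt_a[0], proj_a[1] - pt_a[1]) / math.hypot(*size_a)
    return err_a + err_b

def mean_sstd(H_ab, pairs, size_a, size_b):
    """Mean distance over a list of ((xa, ya), (xb, yb)) truth-point pairs."""
    total = sum(scaled_transfer_distance(H_ab, a, b, size_a, size_b) for a, b in pairs)
    return total / len(pairs)

# Hypothetical data: a pure 10-pixel horizontal shift between the two views.
H_example = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
truth_pairs = [((100.0, 100.0), (110.0, 100.0)), ((200.0, 50.0), (214.0, 50.0))]
mean_distance = mean_sstd(H_example, truth_pairs, (640, 480), (320, 240))
below_threshold = mean_distance < 0.2        # 0.2 is the threshold used in this thesis
```

Recording `mean_distance` each time the homography is updated would give a curve of the kind plotted in Figure 5.9(a).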
In Run 23 some of the first markers were dropped by a meta-target that
was incorrectly created by the fov line method. This resulted in relatively
large sstds. However, the homography-based tracker quickly identified
the error. As more markers were dropped by other, correctly-associated
meta-targets the system quickly corrected itself.
(a) Mean sstd of truth points (b) sstd using final homography
Figure 5.9: Scaled symmetric transfer distances for 95 truth points in Generator sequences 18 and 23.
Table 5.3 shows the mean scaled symmetric transfer error for the first
and final homographies, as well as for the homography created using the
dlt algorithm on the manually-identified points (i.e. the best-fit homog-
raphy). It should be noted that the final homography yields distances
well below the threshold τε = 0.2 required to create new meta-target
associations.
Table 5.3: Mean scaled symmetric transfer distances for the first and final homography
Sequence      Truth points   Homography
                             First    Final    Manual
Gen. run 18   95             0.0167   0.0169   0.0029
Gen. run 23   95             0.2369   0.0167   0.0029
Gym           84             0.0818   0.0317   0.0124
Arena         48             0.0425   0.0416   0.0238
(a) Gym (b) Arena
Figure 5.10: The mean sstd of the truth points for the two real-world sequences over time.
The gym sequence also started with a meta-target that was incorrectly
associated by the fov method. As more meta-targets travelled around,
the homography increased in accuracy and the sstds levelled off. The
mean sstd values for the 84 truth points in that sequence are shown in
Figure 5.11(a).
The arena sequence was fairly calm and low-traffic, so it started and
finished with all-around good performance. The sstd values of the 48
truth points used in that sequence are shown in Figure 5.11(b).
(a) Gym – 84 points (b) Arena – 48 points
Figure 5.11: sstd of truth points using both the final algorithm and the truth homographies for the real-world sequences.
5.2.3 Visual tests
Generator sequences
Figure 5.12 shows two frames from different Generator sequences that
have been melded together using standard image processing techniques.
The homographies found using the multi-camera tracking algorithm were
used to create the melded outputs.
In both images it is obvious that the ground-planes are not perfectly
lined up. This is because the corresponding point pairs used to create
the homography are not the same as points that would be matched by a
human. The reason for this disparity is discussed below, in the context
of the gym sequence.
Note also that the pixels of the targets in the camera B image have
been projected into the camera A coordinate system as if they were on
the ground plane. If the pixels in the targets’ heads were thought of as
being on the ground plane in camera B, then because they are above the
feet pixels they will be projected to coordinates farther from the location
of camera B than the feet. This manifests itself in the shadow-like appearance
of the projected targets.
(a) Generator run 23 (b) Tilted Generator run 17
Figure 5.12: Frames from two Generator sequences melded using the homography found with the multi-camera tracking algorithm.
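The shadow-like stretching follows from simple ray geometry: treating a pixel as if it lay on the ground plane amounts to extending the viewing ray through that pixel until it meets the ground. The short Python sketch below works this out by similar triangles for a simplified side-on view. The camera height, target distance and target height are illustrative values rather than measurements from any sequence, and the function name is hypothetical.

```python
def ground_hit_distance(camera_height, point_distance, point_height):
    """Horizontal distance from the camera at which the viewing ray through a
    point (at the given horizontal distance and height) meets the ground plane.

    By similar triangles: d_hit = d * Hc / (Hc - h).
    """
    if not 0 <= point_height < camera_height:
        raise ValueError("the point must lie below the camera centre")
    return point_distance * camera_height / (camera_height - point_height)

# Illustrative values in metres: a 1.8 m tall target standing 6 m from a
# camera mounted 3 m above the ground.
feet_projection = ground_hit_distance(3.0, 6.0, 0.0)   # feet are on the ground plane
head_projection = ground_hit_distance(3.0, 6.0, 1.8)   # head is above the ground plane
```

With these numbers the feet stay at 6 m while the head lands about 15 m from the camera, so the projected target is smeared away from camera B, as seen in Figure 5.12.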
Gym sequence
One way to test the validity of a homography between two planes is to
transform one image using standard image processing techniques and
then meld it with the other image. If features in the world line up in the
melded image then the homography is valid. The two background im-
ages for the gym sequence were melded using the ground plane-induced
homography found with the multiple-camera tracking algorithm. The
result is shown in Figure 5.13.
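A minimal Python sketch of such a meld is given below. It is not the code used to produce the figures: it assumes grey-scale images stored as nested lists, nearest-neighbour sampling, and a simple average where the two views overlap, all of which are illustrative choices. The function names are hypothetical, and the homography is taken to map camera A pixel coordinates (column, row) to camera B coordinates.

```python
def apply_h(H, x, y):
    """Map pixel (x, y) through a 3x3 homography stored as nested lists."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def meld(img_a, img_b, H_ab):
    """Average img_a with img_b resampled onto camera A's pixel grid.

    Each camera A pixel is projected into camera B (inverse mapping) and the
    nearest camera B pixel is sampled. Pixels whose projection falls outside
    img_b keep their camera A value.
    """
    rows_b, cols_b = len(img_b), len(img_b[0])
    melded = []
    for r, row in enumerate(img_a):
        out_row = []
        for c, value_a in enumerate(row):
            cb, rb = apply_h(H_ab, c, r)
            cb, rb = round(cb), round(rb)
            if 0 <= rb < rows_b and 0 <= cb < cols_b:
                out_row.append((value_a + img_b[rb][cb]) / 2)
            else:
                out_row.append(value_a)
        melded.append(out_row)
    return melded

# Tiny hypothetical example: camera B sees the scene shifted by one column.
image_a = [[10, 20, 30], [40, 50, 60]]
image_b = [[1, 2, 3], [4, 5, 6]]
shift_one_column = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
melded_image = meld(image_a, image_b, shift_one_column)
```

If the homography is valid, world features in the two source images coincide in the melded output; a poor homography shows up as doubled edges.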
In order to more closely examine the results, the main section of the
melded image was isolated and cropped. Figure 5.14 shows two versions
of the melded image. Figure 5.14(a) was created using a homography
found with 84 manually-identified corresponding point pairs – it is a
truth image. Figure 5.14(b) was created using the algorithmically-found
homography.
Figure 5.13: The gym background models overlaid into one image. The algorithmically-found homography was used.
(a) Manual truth homography (b) Algorithm homography
Figure 5.14: A cropped section of the two melded background images, using different homographies.
Figure 5.14(b) clearly shows that the algorithmically-found homog-
raphy comes close, but does not line up the images correctly. Figure 5.15
shows the difference between the projected and actual truth point loca-
tions in camera B when the two different homographies are used.
In Figure 5.15(a), with the truth homography, the errors are all fairly
benign. The row is accurate, but the column is very sensitive. This is
because of the relatively small range of vertical inputs in camera A –
small errors in the row position of a point in camera A lead to large
differences in the projected column position in camera B. However, the
average error is very close to zero.
(a) Hand-found homography (b) Algorithm homography
Figure 5.15: Projection errors in camera B, in units of pixels, when two different homographies are used to project the points. Red is the column error, blue is the row error. Negative errors mean that the projected location is right of or below the actual location.
In Figure 5.15(b) the column error (red) again shows a similarly large
variance, but is centred well below zero. The row error (blue) is slightly
below zero. This means that points from camera A are being projected
significantly to the right and slightly below where the human observer
placed them in camera B. This is because of the differences in the feature
points selected by the human observer and by the algorithm.
The human observer selected pairs of corresponding points with di-
rect regard to the location on the ground plane. The size of the target
mask was not seen. On the other hand, the computer only saw the mask
when it selected the feet feature location in both cameras. The feet fea-
ture became the corresponding point pair. The target mask goes through
a series of morphological processes in the background subtraction and
single-camera tracking algorithms. The result of these operations is that
the mask is dilated when compared to the mask that a person would se-
lect. Thus, in both cameras the feet are identified to be in lower rows than
a human might select.
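The effect of dilation on the feet location can be seen on a toy mask. The Python sketch below is illustrative only: it assumes a single dilation with a 3x3 square structuring element, whereas the actual sequence of morphological operations in the tracker is more involved, and it assumes that row indices increase downwards in the image. The function names are hypothetical.

```python
def dilate(mask):
    """Binary dilation of a nested-list mask with a 3x3 square structuring element."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                        out[r][c] = 1
    return out

def lowest_row(mask):
    """Index of the bottom-most row that contains foreground pixels."""
    return max(r for r, row in enumerate(mask) if any(row))

# A thin vertical "target" occupying rows 1 to 3.
target_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
feet_row_before = lowest_row(target_mask)
feet_row_after = lowest_row(dilate(target_mask))
```

The bottom of the mask moves from row 3 to row 4, so any feet feature taken from the bottom of a dilated mask sits lower in the image than the point a human would mark.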
Therefore, a human-identified point in camera A is seen by the com-
puter to represent a target whose feet, if identified by the person, would
be at a higher row. The homography found by the algorithm takes this
into account, and projects that point to where the feet feature would be
if it were found by the algorithm. Because we are comparing that pro-
jected location to a hand-found location, the projected feet location will
be displaced to the right and slightly below the hand-found location.
The main consequence of this explanation is that plots such as that
in Figure 5.15(b) can be trusted, but only so long as the geometry of the
scene is taken into account when interpreting the data. This also implies
that the images created by overlaying the background images should not
be expected to be perfect matches.
(a) Camera A – with fov lines (b) Camera B
Figure 5.16: Background model images for the arena sequence with overlaid fov lines.
Arena sequence
Figure 5.16 shows the background images used in the arena sequence.
To give a sense of scale, the building in the scene is about three stories
tall. The left and right fov lines have been overlaid on the camera A
image. The top and bottom fov lines were not calculated, since no targets
transited those edges in this sequence.
Figure 5.17 shows the final set of markers used to calculate the ho-
mography. Clearly, based on the two background images, camera B only
covers a small slice of the total field of view of camera A. Therefore, given
the large area covered by camera A, it is un-surprising that the targets
did not cover much of the image. This is seen in the tight grouping of the
markers.
Because the markers were tightly grouped in camera A, they were declared
collinear based on the original threshold, and the homography was
not calculated. This was unsatisfactory. Therefore, for this sequence the
inlier threshold used when determining if the marker data was collinear
had to be reduced from 2.5% down to 0.8% (see Section 4.6).
(a) Camera A – with fov lines (b) Camera B
Figure 5.17: Markers were dropped in a very small area on the image. This necessitated a lower collinearity threshold when calculating the homography.
As with the gym sequence, we can use the ground plane-induced ho-
mography to resample one of the background images into the other’s co-
ordinate system. Because both of the images contain the ground plane’s
line at infinity, the melded images are large. A lightly cropped meld of
the background images is shown in Figure 5.18. Note the fairly straight
line formed at the base of the building – that this line overlaps inside
the melded region and continues straight on outside the melded region
indicates that the homography is correct.
The large melded image is further cropped down in Figure 5.19. In
that smaller cropped image it appears that there is a line in the grass.
This line is an artifact of the melding algorithm, and essentially marks
the left-hand edge of the field of view of camera A, as seen by camera
B. Also note the location of the sewer (the brown patch in the centre of
the grass). It overlaps quite well because the homography is correct and
because it is near the centre of the projection.
Figure 5.18: A lightly cropped section of the arena background images melded using the final algorithmically-found homography.
Figure 5.19: A further cropped section of Figure 5.18.
The same errors in the projected locations that were found in the other
sequences are present in this sequence too. Those errors, due to the differ-
ence between where the algorithm found the feet and where the human
marked the feet, were discussed above as they related to the gym sequence.
They do not affect meta-target creation, since they are errors of
visual interpretation rather than signs of errors in the algorithm.
5.3 Occlusions
Occlusions were simulated in all sequences. This was done by
interrupting the algorithm at various times after the first homography
had been found, deleting the list of meta-target associations, and
then re-starting the algorithm. This simulated a complete loss of track
information, since all target history was deleted. Also, in some cases the
masks of targets were adjusted to simulate partial occlusions.
In all cases the homography-based multiple-camera tracking system
immediately recovered the correct meta-target associations.
This testing indicates that the algorithm is highly resistant to both partial
and full target occlusions.
Chapter 6
Conclusions and future work
6.1 Conclusions
The goal of this research, broadly stated, was to solve the consistent la-
belling problem for pairs of arbitrary-modality static overlapping cam-
eras sharing a common ground plane, with no calibration and automatic
learning. The algorithms that were implemented for this thesis have the
following characteristics:
• The single-camera tracker works for long RGB video sequences. The
background subtraction model allows it to handle objects stopping
and becoming part of the background as well as background objects
beginning to move as foreground objects.
• To change modalities, the only component that needs to be changed
is the background subtraction function. So long as that function can
tell what is different between the current frame and the background
model, the whole system will work with non-RGB modalities.
• Field of view lines are found without using any operator input. No
calibration is required. The only element required in the scene is
the existence of a common ground plane, but neither that plane nor
any other part of the scene requires any special calibration pattern.
• The homography-based tracker quickly learns the geometry of the
scene, and was shown to create meta-target associations with scaled
symmetric transfer distances well below the maximum threshold
value used in this thesis.
• The only operator input needed is to decide whether the pair of
cameras contains an overlapping field of view.
In addition to the implementation of the full tracking system, this
thesis contributed the following elements to the field of computer vision
and target tracking:
• A method of finding the pixel location corresponding to the feet of
a target was introduced and tested. The method was shown to be
slightly better than the bounding box method for normal upright
cameras. For tilted cameras the new method was shown to be sig-
nificantly better, allowing the multiple-camera tracking module to
create additional meta-target associations that would not have been
found with the old bounding-box method.
• A method to use trails of markers to calculate a plane-induced ho-
mography was introduced. This method reduces the ability of a
few badly-segmented targets to ruin the homography, thereby in-
creasing the capabilities of the multiple-camera tracking function.
In addition, it enables the multiple-camera tracking function to op-
erate across the whole frame even if only one edge of the field of
view is used as an entry and exit point.
• A homography-based multiple-camera person-tracking algorithm
was introduced. The rules and methods used in the operation of
the tracker were fully specified. The tracker solves the consistent
labelling problem for all targets visible in the scene, including ones
that travel through in-scene entrances and exits. The tracker uses a
scaled version of the symmetric transfer distance, which is usable
when the cameras have differently-sized frames.
The algorithms that were discussed and implemented for this research
therefore satisfy the goals stated in Chapter 1.
6.2 Future work
6.2.1 Specific implementation ideas
The algorithms that were implemented for this thesis could be improved
in a variety of ways. The segmentation algorithm used in the single-
camera tracking algorithm is slow, even when the interpreted nature of
Matlab code is taken into account. Even tightly-coded implementations of
the segmentation algorithm might be too slow for live video processing.
Therefore, another segmentation algorithm should be investigated.
To prove that the system works with other modalities, video should
be acquired and processed using a different type of camera. For instance,
to use a thermal camera, the two changes required are to re-implement
the background-subtraction algorithm and the image segmentation algo-
rithm.
When a target’s feet are partially occluded the target might not get properly
associated into a meta-target, since its “feet” will be detected at an
incorrect location. This might also cause meta-target stealing, where a
target appears close to another target’s knees, but not their feet. It may be
possible to use the historical median apparent height of the target to find a
putative location for the feet feature, and use that location for meta-target
association purposes. This might prevent meta-target stealing problems
that could occur when targets are moving in a region where their lower
bodies are sometimes partially occluded, such as a food court or a cafe.
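One possible form of this idea is sketched below in Python. It has not been implemented or tested as part of this thesis. The sketch assumes an upright camera with row indices increasing downwards, assumes that the target's apparent height changes little over the stored history, and uses an arbitrary cut-off of 80% of the median height to decide that the lower body is probably occluded. The function name and all values are hypothetical.

```python
from statistics import median

def putative_feet_row(head_row, visible_feet_row, height_history, cutoff=0.8):
    """Return the row to use as the feet feature.

    If the visible mask is much shorter than the target's historical median
    apparent height, the lower body is assumed to be occluded and the feet
    are placed one median height below the head instead.
    """
    typical_height = median(height_history)
    visible_height = visible_feet_row - head_row
    if visible_height < cutoff * typical_height:   # cutoff is an illustrative value
        return head_row + typical_height
    return visible_feet_row

# Hypothetical apparent heights (in pixels) from earlier frames.
history = [100, 104, 98, 102, 101]
occluded_estimate = putative_feet_row(head_row=50, visible_feet_row=110, height_history=history)
unoccluded_estimate = putative_feet_row(head_row=50, visible_feet_row=150, height_history=history)
```

In the first call the mask is cut off at roughly knee height, so the feet are placed at row 151 rather than row 110. In the second call the visible mask is trusted as it is.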
In its present implementation, the algorithm works discretely in this
sequence: acquisition, segmentation, single-camera tracking, then multiple-
camera tracking. Ideally this would be a continuous process without
interruption. The code, as written, is amenable to being integrated into a
continuous process; however, speed would be an issue and the segmentation
code is proprietary. In the future, we would like to see the entire
algorithm running on live data on a single machine. To do this it is likely
that the whole system would have to be re-implemented in a language
other than Matlab. Luckily, nearly all of the code uses fairly generic
functionality, not proprietary Mathworks tool-box functions. The only
exception is the segmentation algorithm. The most complicated functions
outside the segmentation algorithm are a singular value decomposition
and a few morphological operations (dilation, opening).
6.2.2 Computer vision
As discussed in Chapter 2, work is being done by other research groups to
automatically discover the relationship between cameras. The algorithms
in this thesis require that this be specified. It should be possible to in-
tegrate the methods developed in this thesis with a network-discovery
function. This would improve the in-scene performance of the meta-target
creation algorithms, since they would use a better homography-based
tracker.
If a meta-target has been properly segmented, then the head could be
considered a corresponding feature point that does not lie on the ground
plane. With two or more such corresponding points the epipoles and the
fundamental matrix could be calculated, thereby defining the epipolar
geometry. If the fundamental matrix is known then the epipolar line upon
which the head feature should appear can be calculated. The distance of
the target’s head to this epipolar line could be incorporated into the scaled
symmetric transfer distance metric, where it would therefore be used for
meta-target matching. This would enable better matching and selection
of close targets.
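The construction outlined above can be sketched in Python as follows. It uses the standard plane-plus-parallax result from multiple-view geometry: given the ground-plane homography H and two off-plane correspondences, the epipole in the second image is the intersection of the lines joining each transferred point Hx to its match x', and the fundamental matrix is F = [e']× H. The sketch has not been tested on the sequences in this thesis; the function names are hypothetical, points are homogeneous (x, y, 1) tuples, and the example data describe an artificial sideways camera shift.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mat_vec(M, v):
    """Product of a 3x3 nested-list matrix and a 3-vector."""
    return tuple(sum(M[r][k] * v[k] for k in range(3)) for r in range(3))

def fundamental_from_plane_and_heads(H, heads_a, heads_b):
    """F = [e']x H, with the epipole e' found from two off-plane point pairs.

    Degenerate if a 'head' actually lies on the ground plane (H x equals x'),
    or if the two joining lines coincide.
    """
    line_1 = cross(mat_vec(H, heads_a[0]), heads_b[0])   # line through H*x1 and x1'
    line_2 = cross(mat_vec(H, heads_a[1]), heads_b[1])   # line through H*x2 and x2'
    e = cross(line_1, line_2)                            # epipole in camera B
    skew = [[0, -e[2], e[1]],
            [e[2], 0, -e[0]],
            [-e[1], e[0], 0]]
    return [[sum(skew[r][k] * H[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def epipolar_distance(F, x_a, x_b):
    """Perpendicular pixel distance from x_b to the epipolar line F * x_a."""
    a, b, c = mat_vec(F, x_a)
    return abs(a * x_b[0] + b * x_b[1] + c * x_b[2]) / math.hypot(a, b)

# Artificial example: identity homography and purely horizontal parallax.
H_ground = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
F_example = fundamental_from_plane_and_heads(
    H_ground,
    heads_a=[(10, 20, 1), (30, 40, 1)],
    heads_b=[(15, 20, 1), (32, 40, 1)],
)
head_distance = epipolar_distance(F_example, (10, 20, 1), (50, 23, 1))
```

In this example the epipolar lines are horizontal, so a candidate head three rows away from the expected row scores a distance of 3 pixels. A value of this kind could then be scaled and added to the transfer distance when ranking candidate matches.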
Additional ideas for systems that could be implemented were dis-
cussed in Section 3.4. Those ideas included systems that directly improve
the method discussed in this thesis and methods that use the fundamen-
tal matrix or other geometric constructs.
Bibliography
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Com-puting Surveys (CSUR), vol. 38, no. 4, 2006.
[2] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5,pp. 564–577, May 2003.
[3] S. Sclaroff and J. Isidoro, “Active blobs: region-based, deformable appear-ance models,” Computer Vision and Image Understanding, vol. 89, no. 2-3, pp.197–225, 2003.
[4] A. Koschan, S. Kang, J. Paik, B. Abidi, and M. Abidi, “Color active shapemodels for tracking non-rigid objects,” Pattern Recognition Letters, vol. 24,pp. 1751–1765, 2003.
[5] A. Martin and E. Saber, “An improved method for dynamic object trackingusing partial shape matching and color image segmentation,” January 2008,submitted to IEEE International Conference on Image Processing 2008.
[6] E. Saber, Y. Xu, and A. M. Tekalp, “Partial shape recognition by sub-matrixmatching for partial matching guided image labeling,” Pattern Recognition,vol. 38, pp. 1560–1573, 2005.
[7] L. Garcia, E. Saber, V. Amuso, and R. Bhaskar, “Automatic color imagesegmentation by dynamic region growth and multimodal merging of colorand texture information,” in International Conference on Acoustics, Speech andSignal Processing, March 2008.
[8] D. Makris, T. Ellis, and J. Black, “Bridging the gaps between cameras,” inProceedings of the IEEE Computer Society Conference on Computer Vision andPattern Recognition, vol. 2, July 2004, pp. 205–210.
124
BIBLIOGRAPHY 125
[9] T. Yang, F. Chen, D. Kimber, and J. Vaughan, “Robust people detectionand tracking in a multi-camera indoor visual surveillance system,” in IEEEInternational Conference on Multimedia and Expo, July 2007, pp. 675–678.
[10] C. Stauffer and K. Tieu, “Automated multi-camera planar tracking corre-spondence modeling,” in IEEE Computer Society Conference on Computer Vi-sion and Pattern Recognition, vol. 1. IEEE Computer Society, 2003, p. 259.
[11] O. Javed, Z. Rasheed, K. Shafique, and M. Shah, “Tracking across multiplecameras with disjoint views,” in IEEE International Conference on ComputerVision, vol. 2, October 2003, pp. 952–957.
[12] O. Javed, K. Shafique, and M. Shah, “Appearance modeling for trackingin multiple non-overlapping cameras,” in IEEE International Conference onImage Processing, vol. 2. IEEE Computer Society, 2005, pp. 26–33.
[13] O. Javed, K. Shafique, Z. Rasheed, and M. Shah, “Mod-eling inter-camera space-time and appearance relationships fortracking across non-overlapping views,” Computer Vision and Im-age Understanding, vol. 109, no. 2, pp. 146–162, 2008.[Online]. Available: http://www.sciencedirect.com/science/article/B6WCX-4N4YMR5-1/1/14477a55fa6ebb45f4713fe74f71be28
[14] J. Orwell, P. Remagnino, and G. Jones, “Multi-camera color tracking,” inIEEE Workshop on Visual Surveillance. Los Alamitos, CA, USA: IEEE Com-puter Society, 1999, p. 14.
[15] J. Li, C. S. Chua, and Y. K. Ho, “Color based multiple people tracking,”in International Conference on Control, Automation, Robotics and Vision, vol. 1,December 2002, pp. 309–314.
[16] F. Devernay, D. Mateus, and M. Guilbert, “Multi-camera scene flow bytracking 3D points and surfels,” IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, vol. 2, pp. 2203–2212, 2006.
[17] C. Madden and M. Piccardi, “Height measurement as a session-based bio-metric for people matching across disjoint camera views,” in Proceedings ofthe Conference of Image and Vision Computing New Zealand, November 2005.
[18] ——, “A framework for track matching across disjoint cameras using robustshape and appearance features,” in IEEE Conference on Advanced Video andSignal Based Surveillance, September 2007, pp. 188–193.
BIBLIOGRAPHY 126
[19] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank, “Principal axis-based correspondence between multiple cameras for people tracking,” IEEETransactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp.663–671, April 2006.
[20] D. Thirde, M. Borg, J. Ferryman, J. Aguilera, M. Kampel, and G. Fernandez,“Multi-camera tracking for visual surveillance applications,” in ComputerVision Winter Workshop, O. Chum and V. Franc, Eds. Czech Pattern Recog-nition Society, February 2006.
[21] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney, “Real-time wide areamulti-camera stereo tracking,” in IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, vol. 1. IEEE Computer Society, 2005,pp. 976–983.
[22] J. Black and T. Ellis, “Multi camera image tracking,” Image and Vision Com-puting, vol. 24, no. 11, pp. 1256–1267, 2006.
[23] S. Velipasalar and W. Wolf, “Recovering field of view lines by using projec-tive invariants,” in International Conference on Image Processing, vol. 5, Octo-ber 2004, pp. 3069–3072.
[24] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiplecameras with overlapping fields of view,” IEEE Transactions on Pattern Anal-ysis and Machine Intelligence, vol. 25, no. 10, pp. 1355–1360, 2003.
[25] S. Calderara, A. Prati, R. Vezzani, and R. Cucchiara, “Consistent labeling formulti-camera object tracking,” Image Analysis and Processing, pp. 1206–1214,2005.
[26] Z. Yue, S. Zhou, and R. Chellappa, “Robust two-camera tracking using ho-mography,” IEEE International Conference on Acoustics, Speech, and Signal Pro-cessing, vol. 3, pp. 1–4, May 2004.
[27] A. Mittal and L. Davis, “Unified multi-camera detection and tracking usingregion-matching,” in Proceedings of IEEE Workshop on Multi-Object Tracking.IEEE Computer Society, 2001, pp. 3–10.
[28] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Statistic and knowledge-based moving object detection in traffic scenes,” in IEEE Intelligent Trans-portation Systems Conference Proceedings, October 2000, pp. 27–32.
BIBLIOGRAPHY 127
[29] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, “Improvingshadow suppression in moving object detection with hsv color informa-tion,” in IEEE Intelligent Transportation Systems Conference Proceedings, 2001,pp. 334–339.
[30] R. Cucchiara, M. Piccardi, and A. Prati, “Detecting moving object, ghosts,and shadows in video streams,” IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 25, no. 10, pp. 1337–1342, October 2003.
[31] R. H. Bartels, J. C. Beatty, and B. R. Barsky, An introduction to splines for usein Computer Graphics and Geometric Modeling, 2nd ed. Morgan KaufmannPublishers Inc., 1987.
[32] A. Criminisi, I. Reid, and A. Zisserman, “Single view metrology,” in FifthInternational Conference on Computer Vision, vol. 1. IEEE Computer Society,1999, p. 434.
[33] J. Grimm and W. Grimm, The Complete Grimm’s Fairy Tales, M. Hunt, Ed.Pantheon Books, 1944.
[34] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision,2nd ed. Cambridge University Press, ISBN: 0521540518, 2004.
[35] R. Hartley, “In defence of the 8-point algorithm,” in International Conferenceon Computer Vision, vol. 0. Los Alamitos, CA, USA: IEEE Computer Society,1995, p. 1064.
[36] (2008, May) Performance evaluation of tracking and surveillance. [On-line]. Available: http://www.cvg.cs.rdg.ac.uk/cgi-bin/PETSMETRICS/page.cgi?dataset
[37] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigmfor model fitting with applications to image analysis and automated cartog-raphy,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[38] P. D. Kovesi, “MATLAB and Octave functions for computer vision andimage processing,” School of Computer Science & Software Engineer-ing, The University of Western Australia, February 2008, available from:<http://www.csse.uwa.edu.au/∼pk/research/matlabfns/>.