
Towards Intelligent Camera Networks: A Virtual Vision Approach

Faisal Z. Qureshi 1 and Demetri Terzopoulos 2,1

1 Department of Computer Science, University of Toronto, Toronto ON M5S 3G4, Canada
2 Courant Institute of Mathematical Sciences, New York University, New York, NY 10003, USA

Abstract

The goals of this paper are two-fold: (i) to present our initial efforts towards the realization of a fully autonomous sensor network of dynamic video cameras capable of providing perceptive coverage of a large public space, and (ii) to further the cause of exploiting visually and behaviorally realistic virtual environments in the development and testing of machine vision systems. In particular, our proposed sensor network employs techniques that enable a collection of active (pan-tilt-zoom) cameras to collaborate in performing various visual surveillance tasks, such as keeping one or more pedestrians within view, with minimal reliance on a human operator. The network features local and global autonomy and lacks any central controller, which entails robustness and scalability. Its functionality is the result of local decision-making capabilities at each camera node and communication between the nodes. We demonstrate our surveillance system in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. Our readily reconfigurable virtual cameras generate synthetic video feeds that emulate those generated by real surveillance cameras monitoring public spaces. This type of research would be difficult in the real world given the costs of deploying and experimenting with an appropriately complex camera network in a large public space the size of a train station.

1. Introduction

Recent advances in camera and video technologies have made it possible to network numerous video cameras together in order to provide visual coverage of large public spaces such as airports or train stations. As the size of the camera network grows and the level of activity in the public space increases, it becomes infeasible for human operators to monitor the multiple video streams and identify all events of possible interest, or even to control individual cameras in performing advanced surveillance tasks, such as zooming in on a particular subject of interest to acquire one or more facial snapshots. Consequently, a timely challenge for computer vision researchers is to design camera sensor networks capable of performing visual surveillance tasks autonomously, or at least with minimal human intervention.

Even if there were no legal obstacles to monitoring people in public spaces for experimental purposes, the cost of deploying a large-scale camera network in the real world and experimenting with it can easily be prohibitive for computer vision researchers. As was argued in [1], however, computer graphics and virtual reality technologies are rapidly presenting viable alternatives to the real world for developing vision systems. Legal impediments and cost considerations aside, the use of a virtual environment can also offer greater flexibility during the system design and evaluation process. Terzopoulos [2] proposed a Virtual Vision approach to designing surveillance systems using a virtual train station environment populated by fully autonomous, lifelike virtual pedestrians that perform various activities (Figure 1). Within this environment, virtual cameras generate synthetic video feeds (Figure 2). The video streams emulate those generated by real surveillance cameras, and low-level image processing mimics the performance characteristics of a state-of-the-art surveillance video system.

Within the virtual vision paradigm, we propose a sensor network architecture capable of performing common visual surveillance tasks with minimal operator assistance. Once an operator monitoring surveillance video feeds spots a pedestrian involved in some suspicious activity, or a visual behavior analysis routine selects such a pedestrian automatically, the cameras decide amongst themselves how best to observe the subject. For example, a subset of cameras can collaboratively track the pedestrian as he weaves through the crowd. The problem of assigning cameras to follow pedestrians becomes challenging when multiple pedestrians are involved. To deal with the myriad of possibilities, the cameras must be able to reason about the dynamic situation. To this end, we propose a sensor network communication model whose nodes are capable of task-dependent self-organization through local and global decision making. Each node is aware of its neighbors.¹

Our proposed sensor network is novel insofar as it does not require camera calibration, a detailed world model, or a central controller.

¹ The neighborhood of a node A can be defined automatically as the set of nodes that are, e.g., within nominal radio communications distance of A [4]. References [5, 6] present schemes to learn sensor network topologies.


Figure 1: A large-scale virtual train station populated by self-animating virtual humans (from [3]). Panels: waiting room; concourses and platforms; arcade.

Figure 2: Virtual Vision. Synthetic video feeds from multiple virtual surveillance cameras situated in the (empty) Penn Station environment.

The overall behavior of the network is the result of local decision making at each node and internode communication. Visual surveillance tasks are performed by groups of one or more camera nodes. These groups, which are created on the fly, define the information sharing parameters and the extent of collaboration between nodes. A group keeps evolving—i.e., old nodes leave and new nodes join the group—during the lifetime of the surveillance task. One node in each group acts as the group supervisor and is responsible for group-level decision making.

Our sensor network is deployed and tested within the virtual train station simulator that was developed in [3]. The simulator incorporates a large-scale environmental model (of the original Pennsylvania Station in New York City) with a sophisticated pedestrian animation system that combines behavioral, perceptual, and cognitive human simulation algorithms. The simulator can efficiently synthesize well over 1000 self-animating pedestrians performing a rich variety of activities in the large-scale indoor urban environment. Like real humans, the synthetic pedestrians are fully autonomous. They perceive the virtual environment around them, analyze environmental situations, make decisions, and behave naturally within the train station. They can enter the station, avoiding collisions when proceeding through portals and congested areas, queue in lines as necessary, purchase train tickets at the ticket booths in the main waiting room, sit on benches when they are tired, purchase food/drinks from vending machines when they are hungry/thirsty, etc., and eventually proceed downstairs in the concourse area to the train tracks. Standard computer graphics techniques enable a photorealistic rendering of the busy urban scene with considerable geometric and photometric detail (Figure 1).

In this paper, we first develop new image-based fixate and zoom algorithms for active cameras. Next, we propose a sensor network framework particularly suitable for designing camera networks for surveillance applications. Finally, we demonstrate the advantages of developing and evaluating this sensor network framework within our virtual world environment. The remainder of the paper is organized as follows: Section 2 covers relevant prior work on camera networks. We explain the low-level vision emulation in Section 3. In Section 4, we develop the behavior models for camera nodes. Section 5 introduces the sensor network communication model. In Section 6, we demonstrate the application of this model in the context of visual surveillance. We present our initial results in Section 7 and our conclusions and future research directions in Section 8.

2. Related Work

Previous work on camera networks has dealt with issues related to low- and medium-level computer vision, namely identification, recognition, and tracking of moving objects [7]. The emphasis has been on model transference from one camera to another, which is required for object identification across multiple cameras [8].

Figure 3: Pedestrian segmentation and tracking. (1) Multiple pedestrians are grouped together due to poor segmentation. (2) Noisy pedestrian segmentation results in a tracking failure. (3) Pedestrian segmentation and tracking failure due to occlusion.

Many researchers have proposed camera network calibration to achieve robust object identification and classification from multiple viewpoints, and automatic camera network calibration strategies have been proposed for both stationary and actively controlled camera nodes [9, 10]. Our scheme does not require calibration; however, we assume that the cameras can uniquely identify a pedestrian with reasonable accuracy. To this end we employ color-based pedestrian appearance models.

Multiple cameras have also been employed either to increase the reliability of the tracking algorithm [11] (by overcoming the effects of occlusion or by using 3D information for tracking) or to track an object as it meanders through the fields of view (FOVs) of different cameras. In most cases, object tracking is accomplished by combining some sort of background subtraction strategy and an object appearance/motion model [12].

Little attention has been paid to the problem of controlling/scheduling active cameras to provide visual coverage of a large public area, such as a train station or an airport. Collins et al. [13] use a stationary wide-FOV camera to control an active tilt-zoom camera. The cameras are assumed to be calibrated, and the total coverage of the cameras is restricted to the FOV of the stationary camera. Costello et al. [14] present a scheme for scheduling available cameras in a task-dependent fashion. Here, the tasks are defined within a world model that consists of the ground plane, traffic pathways, and detailed building layouts. The scheduling problem is cast as a temporal logic problem that requires access to a central database consisting of current camera schedules and viewing parameters. This scheme is not scalable due to the central decision-making bottleneck.

3. Local Vision Routines

Each camera has its own suite of visual routines for pedestrian recognition, identification, and tracking, which we dub “Local Vision Routines” (LVRs). The LVRs are computer vision algorithms that directly operate upon the synthetic video generated by virtual cameras and the information readily available from the 3D virtual world. They mimic the performance of a state-of-the-art pedestrian recognition and tracking module. The virtual world affords us the benefit of fine tuning the performance of the recognition and tracking modules by taking into consideration the readily available ground truth. Our imaging model emulates camera jitter and imperfect color response; however, it does not yet account for such imaging artifacts as depth-of-field and image vignetting. More sophisticated rendering schemes would address this limitation.
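To illustrate the kind of degradation such an imaging model introduces, the following sketch perturbs a rendered frame with random jitter and a noisy per-channel color response (our own illustration in Python/NumPy, not the authors' code; the jitter and noise parameters are assumptions rather than the paper's settings):

```python
import numpy as np

def degrade_frame(frame, jitter_px=2, gain_sigma=0.03, noise_sigma=3.0, rng=None):
    """Emulate camera jitter and imperfect color response on a synthetic frame.

    frame: HxWx3 uint8 image rendered by the virtual camera.
    jitter_px: maximum translational jitter (pixels) applied per frame.
    gain_sigma: std. dev. of the per-channel multiplicative gain error.
    noise_sigma: std. dev. of additive pixel noise (intensity levels).
    """
    if rng is None:
        rng = np.random.default_rng()
    # A small random shift stands in for pan/tilt jitter of the camera head.
    dx, dy = rng.integers(-jitter_px, jitter_px + 1, size=2)
    jittered = np.roll(frame, shift=(dy, dx), axis=(0, 1))

    # Per-channel gain error plus additive Gaussian noise emulates an
    # imperfect color response; values are clipped back to the valid range.
    gain = 1.0 + rng.normal(0.0, gain_sigma, size=3)
    noisy = jittered.astype(np.float32) * gain + rng.normal(0.0, noise_sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example: degrade a dummy 480x640 frame.
if __name__ == "__main__":
    clean = np.full((480, 640, 3), 128, dtype=np.uint8)
    degraded = degrade_frame(clean)
```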

We employ appearance-based models to track pedestrians. Pedestrians are segmented to construct unique and robust color-based pedestrian signatures (appearance models), which are then matched across the subsequent frames. Pedestrian segmentation is carried out using 3D geometric information and background modeling/subtraction. The quality of segmentation depends upon the amount of noise introduced into the process, and the noise is drawn from Gaussian distributions with appropriate means and variances. Color-based signatures, in particular, have found widespread use in tracking applications [12]. Color-based signatures are sensitive to illumination changes; however, this shortcoming can be mitigated by operating in HSV space instead of RGB space.

Zooming can drastically change the appearance of a pedestrian, thereby confounding conventional appearance-based schemes, such as color histogram signatures. We tackle this problem by maintaining HSV color histograms for several camera zoom settings for each pedestrian. Thus, a distinctive characteristic of our pedestrian tracking routine is its ability to operate over a range of camera zoom settings. Appendix A provides additional algorithmic details regarding our pedestrian tracking approach.
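A minimal sketch of a zoom-indexed HSV signature might look like the following (our own illustration using OpenCV; the bin counts, the zoom levels, and the Bhattacharyya comparison are assumptions rather than the paper's exact choices):

```python
import cv2
import numpy as np

class ZoomIndexedSignature:
    """Maintain one HSV histogram per camera zoom setting for a pedestrian."""

    def __init__(self, zoom_levels=(1.0, 2.0, 4.0)):
        self.zoom_levels = zoom_levels
        self.histograms = {}  # zoom level -> normalized HSV histogram

    @staticmethod
    def _histogram(patch_bgr):
        # 16x8x8 HSV histogram, L1-normalized so patch size does not matter.
        hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 8, 8],
                            [0, 180, 0, 256, 0, 256]).flatten()
        return hist / (hist.sum() + 1e-6)

    def update(self, zoom, patch_bgr):
        # Store the signature under the nearest maintained zoom setting.
        key = min(self.zoom_levels, key=lambda z: abs(z - zoom))
        self.histograms[key] = self._histogram(patch_bgr)

    def match(self, zoom, patch_bgr):
        """Return a similarity score in [0, 1]; higher means a better match."""
        key = min(self.zoom_levels, key=lambda z: abs(z - zoom))
        stored = self.histograms.get(key)
        if stored is None:
            return 0.0
        candidate = self._histogram(patch_bgr)
        # Bhattacharyya distance: 0 = identical, 1 = no overlap.
        d = cv2.compareHist(stored.astype(np.float32),
                            candidate.astype(np.float32),
                            cv2.HISTCMP_BHATTACHARYYA)
        return 1.0 - d
```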

The tracking module mimics the performance of a state-of-the-art tracking system (Figure 3). For example, it loses track due to occlusions, poor segmentation, or bad lighting. Tracking sometimes locks onto the wrong pedestrian, especially if the scene contains multiple pedestrians with similar visual appearance, i.e., wearing similar clothes. Tracking also fails in group settings when the pedestrian cannot be segmented properly.

4. Camera Behaviors

We treat each camera as a behavior-based autonomous agent. The overall behavior of the camera is determined by the LVR (bottom-up) and the current task (top-down). The camera controller is modeled as an augmented finite state machine. At the highest level, the camera can be in one of the following states: free, tracking, searching, and lost (Figure 4).
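The following Python sketch illustrates one plausible reading of this top-level controller (our own illustration; the timeout value and the exact transition conditions are assumptions based on Figure 4 and the behavior described in Sections 4 and 7):

```python
from enum import Enum, auto
import time

class CameraState(Enum):
    FREE = auto()
    TRACKING = auto()
    SEARCHING = auto()
    LOST = auto()

class CameraController:
    """Minimal top-level controller mirroring the free/tracking/searching/lost states."""

    def __init__(self, search_timeout=5.0):
        self.state = CameraState.FREE
        self.search_timeout = search_timeout
        self._search_started = None

    def assign_target(self):
        # A region of interest (a pedestrian) has been assigned to this camera.
        self.state = CameraState.TRACKING

    def step(self, target_visible):
        """Advance the state machine by one frame given the LVR's tracking result."""
        if self.state is CameraState.TRACKING and not target_visible:
            self.state = CameraState.SEARCHING
            self._search_started = time.monotonic()
        elif self.state is CameraState.SEARCHING:
            if target_visible:
                self.state = CameraState.TRACKING          # re-acquired
            elif time.monotonic() - self._search_started > self.search_timeout:
                self.state = CameraState.LOST              # search timer expired
        elif self.state is CameraState.LOST:
            # Signal the track failure to the group, reset, and become available.
            self.state = CameraState.FREE
        return self.state
```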


Figure 4: Top-level camera controller, with states Free, Tracking, Searching, and Lost, and transitions labeled Region of Interest, Tracking Failure, Re-acquired, Search Timer, Time Up, and Signal Track Failure.

Figure 5: Dual-state controller for fixation and zooming. The controller switches between an act state, driven by the error signal, and a maintain state, and emits the corresponding control signal over time.

Each camera can fixate and zoom in on an object of interest. Fixation and zooming routines are image driven and do not require any 3D information, such as camera calibration or a global frame of reference. We discovered that traditional Proportional Derivative (PD) controllers generate unsteady control signals resulting in jittery camera motion. The noisy nature of tracking forces the PD controller to try continuously to minimize the error metric without ever succeeding, so the camera keeps servoing. Hence, we model the fixation and zooming routines as dual-state controllers. The states are used to activate/deactivate the PD controllers. In the act state, the PD controller tries to minimize the error signal; whereas, in the maintain state, the PD controller ignores the error signal altogether and does nothing (Figure 5).
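A minimal sketch of such a dual-state PD controller follows (our own illustration; the gains and the hysteresis thresholds are assumed values, not the paper's):

```python
class DualStatePD:
    """PD controller gated by an act/maintain hysteresis band to avoid jittery servoing.

    The controller outputs corrections only in the 'act' state; inside the
    inner deadband it switches to 'maintain' and outputs zero, so tracking
    noise does not keep the camera servoing.
    """

    def __init__(self, kp, kd, act_threshold, maintain_threshold):
        assert maintain_threshold < act_threshold
        self.kp, self.kd = kp, kd
        self.act_threshold = act_threshold            # error that re-activates control
        self.maintain_threshold = maintain_threshold  # error below which we hold still
        self.state = "maintain"
        self._prev_error = 0.0

    def update(self, error, dt):
        # Hysteresis between the two thresholds prevents rapid state toggling.
        if abs(error) < self.maintain_threshold:
            self.state = "maintain"
        elif abs(error) > self.act_threshold:
            self.state = "act"

        if self.state == "maintain":
            output = 0.0
        else:
            derivative = (error - self._prev_error) / dt
            output = self.kp * error + self.kd * derivative
        self._prev_error = error
        return output

# Example: center a pedestrian horizontally; error is the pixel offset from the image center.
pan_controller = DualStatePD(kp=0.01, kd=0.002, act_threshold=40, maintain_threshold=10)
```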

The fixate routine brings the region of interest—e.g., the bounding box of a pedestrian—into the center of the image by tilting the camera about its local x and y axes (Figure 6, Row 1). The zoom routine controls the FOV of the camera such that the region of interest occupies the desired percentage of the image. This is useful in situations where, for example, the operator desires a closer look at a suspicious pedestrian (Figure 6, Row 2). The details of the fixate and zoom routines are given in Appendix A.

Figure 6: Row 1: A fixate sequence. Row 2: A zoom sequence. Row 3: The camera returns to its default settings upon losing the pedestrian; it is now ready for another task.

5. Sensor Network Model

We now explain the sensor network communication scheme that allows task-specific node organization. The idea is as follows: A human operator presents a particular sensing request to one of the nodes. In response to this request, relevant nodes self-organize into a group with the aim of fulfilling the sensing task. The group, which formalizes the collaboration between member nodes, is a dynamic arrangement that keeps evolving throughout the lifetime of the task. At any given time, multiple groups might be active, each performing its respective task.

The following procedure sets up the task-specific node groups.

Task-specific group formation & evolution (Figure 7)

1: Let N be the set of all nodes
2: Node ni receives a query
3: Node ni sets up a named task and broadcasts it to the other nodes N′ = N − {ni}²
4: A subset R of the nodes N′ respond by sending their relevance for the task {Local decision making}
5: Node ni ranks the nodes R using the information returned by these nodes, chooses a subset G of R, and forms the group G ∪ {ni}, which is responsible for this task
6: Node ni is the designated supervisor of the group
7: while the task is active do

² In the context of computer vision, it is sufficient to broadcast the task to the neighboring nodes.

8: ni monitors the group and decides when to add/remove a node and when to designate another node to act as the supervisor {Group-level or global decision making}
9: end while

Figure 7: (a) Group 1: A and B; possible candidate: C. (b) Group 1: A, B, and C; possible candidates: E, F, and D. Group 2: J and K; possible candidate: I. (c) Group 1: E and C; possible candidates: H, F, D, and B. Group 2: J and I; possible candidate: F. (d) Group 1: E and E; possible candidate: C. Group 2: C and F; possible candidates: B, D, G, I, and E. (e) Groups 1 and 2 require the same resources, so Group 1 vanishes; task failure. (f) A unique situation where both groups successfully use the same nodes, e.g., imagine two groups tracking two pedestrians that started walking together.

Group formation is determined by the local computation at each node and the communication between the nodes. We require each node to compute its relevance to a task in the same currency. Our approach draws inspiration from behavior-based autonomous agents, where the popularly held belief is that, rather than being the result of some powerful central processing facility, the overall intelligent behavior is a consequence of the interaction between many simple processes, called behaviors. We leverage the interaction between the individual nodes to generate global task-directed behavior.
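As a concrete illustration of this supervisor-side logic, here is a minimal Python sketch (our own, not the authors' implementation); the node.relevance(task) interface and the group dictionary layout are assumptions made for illustration:

```python
def form_group(supervisor, neighbors, task, group_size):
    """Sketch of task-specific group formation from the supervisor's point of view.

    'neighbors' is an iterable of node objects assumed to expose a
    relevance(task) method returning a scalar in [0, 1] (local decision
    making); the supervisor ranks the responses and recruits the best
    nodes (group-level decision making).
    """
    # 1. Broadcast the named task and collect relevance responses.
    responses = [(node, node.relevance(task)) for node in neighbors]

    # 2. Rank the responding nodes by relevance and choose the top candidates.
    responses.sort(key=lambda pair: pair[1], reverse=True)
    recruits = [node for node, score in responses[:group_size] if score > 0.0]

    # 3. The supervisor is itself a member of the group it leads.
    return {"task": task, "supervisor": supervisor, "members": [supervisor] + recruits}
```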

Node failures are handled by simply dropping the nodes that have not communicated with the group in the previous x seconds. We have not yet implemented a supervisor node failure recovery strategy; however, we propose to replicate the supervisor's mental state in every other node of the group, which is straightforward when the nodes are identical. That would allow another node to take over when the supervisor fails to communicate with the group in the previous x seconds.

6. Video Surveillance

We now consider how a sensor network of dynamic video cameras might be used in the context of surveillance. A human operator spots a suspicious pedestrian (or pedestrians) in one of the video feeds and, for example, requests the network to “track this pedestrian,” “zoom in on this pedestrian,” or “track the entire group.” The successful execution and completion of these tasks requires intelligent allocation and scheduling of the available cameras; in particular, the network must decide which cameras should track the pedestrian and for how long.

A detailed world model that includes the location of cameras, their FOVs, pedestrian motion prediction models, occlusion models, and pedestrian movement pathways might allow (in some sense) optimal allocation and scheduling of cameras; however, it is cumbersome and in most cases infeasible to acquire such a world model. Our approach does not require such a knowledge base. We only assume that a pedestrian can be identified by different cameras with reasonable accuracy and that the camera network topology is known a priori. A direct consequence of this approach is that the network can be easily modified through removal, addition, or replacement of camera nodes.

6.1. Computing Camera Node Relevance

The accuracy with which individual camera nodes are able to compute their relevance to the task at hand determines the overall performance of the network. Our scheme for computing the relevance of a camera to a video surveillance task encodes the intuitive observations that 1) a camera that is currently free should be chosen for the task, 2) a camera with better tracking performance with respect to the task should be chosen, 3) the turn and zoom limits of cameras should be taken into account when assigning a camera to a task; i.e., a camera that has more leeway in terms of turning and zooming might be able to follow a pedestrian for a longer time, and 4) it is better to avoid unnecessary reassignments of cameras to different tasks, as that might degrade the performance of the underlying computer vision routines.

Upon receiving a task request, a camera node returns a relevance metric—a list of attribute-value pairs describing its relevance to the current task across multiple dimensions (Table 1)—to the supervisor node, which converts this metric into a scalar relevance value r using the following equation:

r = exp( −(c − 1)² / (2σc²) − (θ − θ̂)² / (2σθ²) − (α − α̂)² / (2σα²) − (β − β̂)² / (2σβ²) )   when s = free
r = t   when s = busy        (1)

where θ̂ = (θmin + θmax)/2, α̂ = (αmin + αmax)/2, and β̂ = (βmin + βmax)/2.


status = s ∈ {busy, free}
quality = c ∈ [0, 1]
fov = θ ∈ [θmin, θmax] degrees
x turn = α ∈ [αmin, αmax] degrees
y turn = β ∈ [βmin, βmax] degrees
time = t ∈ [0, ∞) seconds
task = a ⊂ {ai | i = 1, 2, ...}

Table 1: The relevance metric returned by a camera node relative to a new task request. The supervisor node converts the metric into a scalar value representing the relevance of the node for the particular surveillance task.

Here, θmin and θmax are the extremal field-of-view settings, αmin and αmax are the extremal rotation angles about the x-axis (up-down), and βmin and βmax are the extremal rotation angles about the y-axis (left-right). The values of the variances σc, σθ, σα, and σβ associated with each attribute are chosen empirically (for our experiments, we assigned σθ = σα = σβ = 5.0 and σc = 0.2).
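For concreteness, the conversion of the relevance metric into the scalar r of Eq. (1) can be sketched as follows (a transcription of the equation in Python using the paper's variance values; the example limit values at the bottom are invented for illustration):

```python
import math

def relevance(status, c, theta, alpha, beta, t,
              theta_lim, alpha_lim, beta_lim,
              sigma_c=0.2, sigma_theta=5.0, sigma_alpha=5.0, sigma_beta=5.0):
    """Scalar relevance r of a camera node for a task, following Eq. (1).

    status: "free" or "busy"; c: tracking quality in [0, 1];
    theta, alpha, beta: current FOV and rotation angles (degrees);
    t: time (seconds) reported by a busy node;
    *_lim: (min, max) limits used to compute the preferred mid-range values.
    """
    if status == "busy":
        return t

    def mid(lim):
        # Preferred value is the midpoint of the allowed range.
        return 0.5 * (lim[0] + lim[1])

    exponent = -(
        (c - 1.0) ** 2 / (2.0 * sigma_c ** 2)
        + (theta - mid(theta_lim)) ** 2 / (2.0 * sigma_theta ** 2)
        + (alpha - mid(alpha_lim)) ** 2 / (2.0 * sigma_alpha ** 2)
        + (beta - mid(beta_lim)) ** 2 / (2.0 * sigma_beta ** 2)
    )
    return math.exp(exponent)

# A free camera with good tracking quality sitting near the middle of its
# angular ranges receives a high relevance value.
r = relevance("free", c=0.9, theta=45, alpha=0, beta=0,
              theta_lim=(30, 60), alpha_lim=(-30, 30), beta_lim=(-45, 45))
```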

The computed relevance values are used by a greedy algorithm to assign cameras to various tasks. The supervisor node gives preference to the nodes that are currently free, so the nodes that are part of another group are selected only when the required number of free nodes is unavailable for the current task. We are investigating more formal schemes for reasoning about camera assignment. Also, we have not yet fully resolved the implications of group-group interactions (Figure 7).
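A minimal sketch of such a greedy, free-nodes-first assignment might look like this (our own illustration; the dictionary-based camera records are an assumed data layout, not the paper's):

```python
def assign_cameras(cameras, needed):
    """Greedy camera-to-task assignment sketch: prefer free nodes, then the rest.

    'cameras' is assumed to be a list of dicts such as
    {"id": ..., "status": "free" or "busy", "relevance": float}; free nodes
    are considered first, and busy nodes are drawn upon only if too few
    free nodes are available.
    """
    free = sorted((c for c in cameras if c["status"] == "free"),
                  key=lambda c: c["relevance"], reverse=True)
    busy = sorted((c for c in cameras if c["status"] == "busy"),
                  key=lambda c: c["relevance"], reverse=True)

    chosen = free[:needed]
    if len(chosen) < needed:
        # Not enough free cameras: reassign the most relevant busy ones.
        chosen += busy[:needed - len(chosen)]
    return [c["id"] for c in chosen]
```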

6.2. Surveillance Tasks

We have implemented an interface that presents the operator with a display of the synthetic video feeds from multiple virtual surveillance cameras (cf. Figure 2). The operator can select a person in any video feed and instruct the camera network to perform one of the following tasks: 1) follow the person, 2) capture a high-resolution snapshot, or 3) zoom in and follow the person. The network then automatically assigns cameras to fulfill the task requirements. The operator can also initiate multiple tasks, in which case either cameras that are not currently occupied are chosen for the new task or some cameras are reassigned to the new task. The task outline is as follows:

Task Outline

1: The operator selects a pedestrian in camera c and specifies one of the following tasks: follow, capture snapshot, follow and zoom
2: c computes a color-based pedestrian signature
3: c creates an appropriate task and broadcasts it to the adjacent cameras Nc
4: Each node in Nc computes its relevance to the task proposed by c
5: c ranks the nodes in Nc according to their relevance
6: c utilizes the ranking to assign a subset Gc of Nc to the task at hand; Gt = Gc ∪ {c} is the task-specific group
7: c sends the pedestrian signature to every camera in Gc along with any information that might be helpful in positively identifying the pedestrian in question
8: Each node in Gc uses its visual and behavior routines to locate and track the pedestrian {Each node has a suite of search routines to help re-acquire a pedestrian after short-duration tracking losses.}
9: Each node in Gt also polls its neighborhood for possible additions to the group
10: Each node in Gt continuously updates its relevance for the purpose of supervisor selection and node removal

7. Results

To date, we have tested our visual sensor network system with up to 4 pan-tilt-zoom cameras, one at each corner of the main waiting room of the virtual train station. Our initial results are promising. The sensor network correctly assigned cameras in most cases. Some of the problems that we encountered are related to pedestrian identification and tracking. As we increase the number of virtual pedestrians in the train station, the identification and tracking module has difficulty following the correct pedestrian, so the surveillance task fails (and the cameras just return to their default settings).

Figure 8 illustrates a “follow” task sequence. An operator selects the pedestrian with the green shirt in Camera 1 (top row). Camera 1 forms a group with Camera 2 (bottom row) to follow and zoom in on the pedestrian. At some point, Camera 2 loses the pedestrian (due to occlusion), and it invokes a search routine, but it fails to reacquire the pedestrian. Camera 1, however, is still tracking the pedestrian. Camera 2 leaves the group and returns to its default settings.

8. Conclusions and Future Work

We envision future surveillance systems to be networks of stationary and active cameras capable of providing perceptive coverage of extended environments with minimal reliance on a human operator. Such systems will require not only robust, low-level vision routines, but also novel sensor network methodologies. The work presented in this paper is a step toward the realization of new sensor networks.

The overall behavior of our network is governed by local decision making at each node and communication between the nodes. Our approach is new insofar as it does not require camera calibration, a detailed world model, or a central controller. We have intentionally avoided multi-camera tracking schemes that assume prior camera network calibration, which, we believe, is an unrealistic goal for a large-scale camera network consisting of heterogeneous cameras. Similarly, our approach does not expect a detailed world model, which, in general, is hard to acquire. Since it lacks any central controller, we expect the proposed approach to be robust and scalable.


Figure 8: “Follow” sequence. (a) The operator selects a pedestrian in Camera 1 (upper row). (b) and (c) Camera 1 and Camera 2 (lower row) are tracking the pedestrian. (d) Camera 2 loses track. (e) Camera 1 is still tracking; Camera 2 has returned to its default settings.


We have developed and demonstrated our surveillance system in a virtual train station environment populated by autonomous, lifelike pedestrians, and our initial results appear promising. Our simulator should continue to facilitate our ability to design such large-scale networks and experiment with them on commodity personal computers.

We are currently pursuing a Cognitive Modeling [15, 16] approach to node organization and camera scheduling [17]. We are also investigating scalability and node failure issues. Moreover, we are constructing more elaborate scenarios involving multiple cameras situated in different locations within the train station, with which we would like to study the performance of the network when it is required to follow multiple pedestrians during their entire stay in the train station.

A. Algorithmic Details

Pedestrian Tracking

Require: A color-based signature h (histogram) of the pedestrian to be tracked
1: Render image I
2: Perform pedestrian segmentation on this frame and compute 2D bounding boxes for all visible pedestrians
3: for all bounding boxes do
4: Compute the pedestrian's color histogram hi using I masked with the respective bounding box
5: Compare hi with h and store the result
6: end for
7: Pick the pedestrian with the highest match score
8: Update the stored signature h when the conditions differ sufficiently from those when h was last computed and stored.³

³ E.g., zooming can drastically change the appearance of a pedestrian, thereby confounding traditional appearance-based schemes such as color histogram signatures.

Fixate Algorithm

1: Let m and n be the dimensions of the video image in pixels
2: Let (l, b) and (r, t) be the left-bottom and right-top corners of the rectangle defining the ROI
3: Let fx and fy be the camera's FOV settings (in degrees) along the image's x and y axes, respectively
4: Center of the image: cI = (m/2, n/2)
5: Center of the ROI: cROI = ((l + r)/2, (b + t)/2)
6: Define rectangle r1 with corners cI ∓ 0.025 (m, n)
7: Define rectangle r2 with corners cI ∓ 0.225 (m, n)
8: if the ROI is enclosed within r1 then
9: Set state = Maintain
10: end if
11: if the ROI is not enclosed within r2 then
12: Set state = Turn
13: end if
14: Error: (ex, ey) = cI − cROI
15: if state is Maintain then
16: Control signal = (0, 0) {rotation about the local x and y axes}
17: else
18: Control signal = k (⟨ex⟩ fx / m, ⟨ey⟩ fy / n)
19: end if
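The listing above can be approximated in code as follows (our own simplified sketch: it gates the maintain/turn states on the ROI center rather than on full enclosure within r1 and r2, and the gain k is an assumed value):

```python
class FixateController:
    """Image-driven fixate routine (a sketch of the Appendix A listing, not the authors' code)."""

    def __init__(self, image_size, k=0.5):
        self.m, self.n = image_size   # image width and height in pixels
        self.k = k                    # proportional gain (an assumed value)
        self.state = "maintain"

    def step(self, roi, fov):
        """roi = (l, b, r, t) in pixels; fov = (fx, fy) in degrees.
        Returns (pan, tilt) corrections in degrees about the local axes."""
        l, b, r, t = roi
        cx, cy = self.m / 2.0, self.n / 2.0          # image center
        rx, ry = (l + r) / 2.0, (b + t) / 2.0        # ROI center
        ex, ey = cx - rx, cy - ry                    # pixel error

        # Inner band (2.5% of the image) and outer band (22.5%) provide the
        # maintain/turn hysteresis; between the bands the state is unchanged.
        if abs(ex) <= 0.025 * self.m and abs(ey) <= 0.025 * self.n:
            self.state = "maintain"
        elif abs(ex) > 0.225 * self.m or abs(ey) > 0.225 * self.n:
            self.state = "turn"

        if self.state == "maintain":
            return 0.0, 0.0
        fx, fy = fov
        # Convert the pixel error to degrees using the degrees-per-pixel ratio.
        return self.k * ex * fx / self.m, self.k * ey * fy / self.n
```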

Zoom Algorithm

1: Let m and n be the dimensions of the video image in pixels
2: Let (l, b) and (r, t) be the left-bottom and right-top corners of the rectangle defining the ROI
3: Let fx and fy be the camera's FOV settings (in degrees) along the image's x and y axes, respectively
4: Desired coverage: dr ∈ [0, 1], where 1 means that the ROI should cover the entire image
5: Let r be the smallest rectangle that encloses the ROI and whose aspect ratio is n/m, and let area_r = (area of r)/(mn)
6: Error: e = area_r − dr
7: if ⟨e⟩ < 0.01 then
8: Set state = Maintain
9: end if
10: if ⟨e⟩ > 0.03 then
11: Set state = Act
12: end if
13: if state is Maintain then
14: Control signal = 0
15: else
16: if e > 0 then
17: Control signal = min(1.0, k⟨e⟩)
18: else
19: Define rectangle r_valid with corners (m/2, n/2) ∓ 0.45 (m, n)
20: if r is not enclosed within r_valid then
21: Control signal = 0
22: end if
23: Let (vx, vy) be the corner of r that is farthest from the point (m/2, n/2)
24: Let s = 2 vx / m if vx ≥ vy, and s = 2 vy / n if vx < vy
25: Let s′ = 1 − s² / (0.4 + s²)
26: Control signal = −min(1.0, k s′ ⟨e⟩)
27: end if
28: end if
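Similarly, a simplified code rendering of the zoom listing is given below (our own sketch: coverage is computed from the raw bounding box rather than the aspect-ratio-preserving enclosing rectangle, the r_valid check is folded into the border attenuation term, and the gain k as well as the mapping of the sign to zoom-in/zoom-out are assumptions):

```python
def zoom_control(roi, image_size, desired_coverage, state, k=0.5):
    """One step of the image-driven zoom routine (a simplified sketch of the listing above).

    roi: (l, b, r, t) bounding box in pixels; image_size: (m, n);
    desired_coverage: fraction of the image the ROI should occupy, in [0, 1];
    state: "act" or "maintain" carried over from the previous step.
    Returns (control, state); the sign convention follows the listing.
    """
    m, n = image_size
    l, b, r, t = roi
    coverage = abs(r - l) * abs(t - b) / float(m * n)   # fraction of the image covered by the ROI
    e = coverage - desired_coverage

    # Hysteresis band on |e| reproduces the maintain/act switching.
    if abs(e) < 0.01:
        state = "maintain"
    elif abs(e) > 0.03:
        state = "act"

    if state == "maintain":
        return 0.0, state
    if e > 0:
        # ROI covers more than desired: issue a positive control proportional to the error.
        return min(1.0, k * abs(e)), state

    # ROI too small: negative control, attenuated as the ROI approaches the
    # image border (the s and s' terms) so the target is not pushed out of frame.
    vx = max(abs(l - m / 2.0), abs(r - m / 2.0))
    vy = max(abs(b - n / 2.0), abs(t - n / 2.0))
    s = 2.0 * vx / m if vx >= vy else 2.0 * vy / n
    attenuation = 1.0 - s * s / (0.4 + s * s)
    return -min(1.0, k * attenuation * abs(e)), state
```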

Acknowledgements

The research reported herein was supported in part by a grant from the Defense Advanced Research Projects Agency (DARPA) of the Department of Defense. We thank Dr. Tom Strat of DARPA for his generous support and encouragement. We also thank Wei Shao and Mauricio Plaza-Villegas for their invaluable contributions to the implementation of the Penn Station simulator. Mauricio Plaza-Villegas implemented a prototype virtual vision infrastructure within the simulator.

References

[1] D. Terzopoulos and T. Rabie, “Animat vision: Active vision in artificial animals,” Videre: Journal of Computer Vision Research, vol. 1, pp. 2–19, September 1997.

[2] D. Terzopoulos, “Perceptive agents and systems in virtual reality,” in Proc. 10th ACM Symposium on Virtual Reality Software and Technology, (Osaka, Japan), pp. 1–3, October 2003.

[3] W. Shao and D. Terzopoulos, “Autonomous pedestrians,” in Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation, (Los Angeles, CA), pp. 19–28, July 2005.

[4] C. Intanagonwiwat, R. Govindan, D. Estrin, J. Heidemann, and F. Silva, “Directed diffusion for wireless sensor networking,” IEEE/ACM Transactions on Networking, vol. 11, pp. 2–16, February 2003.

[5] A. Ihler, J. Fisher, R. Moses, and A. Willsky, “Nonparametric belief propagation for self-calibration in sensor networks,” in Proc. Third International Symposium on Information Processing in Sensor Networks, (Berkeley, CA), pp. 225–233, April 2004.

[6] D. Marinakis, G. Dudek, and D. Fleet, “Learning sensor network topology through Monte Carlo expectation maximization,” in Proc. IEEE Intl. Conf. on Robotics and Automation, (Barcelona, Spain), April 2005.

[7] R. Collins, O. Amidi, and T. Kanade, “An active camera system for acquiring multi-view video,” in Proc. International Conference on Image Processing, (Rochester, NY), pp. 517–520, September 2002.

[8] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1355–1360, October 2003.

[9] F. Pedersini, A. Sarti, and S. Tubaro, “Accurate and simple geometric calibration of multi-camera systems,” Signal Processing, vol. 77, no. 3, pp. 309–334, 1999.

[10] T. Gandhi and M. M. Trivedi, “Calibration of a reconfigurable array of omnidirectional cameras using a moving person,” in Proc. 2nd ACM International Workshop on Video Surveillance and Sensor Networks, (New York, NY), pp. 12–19, ACM Press, 2004.

[11] J. Kang, I. Cohen, and G. Medioni, “Multi-views tracking within and across uncalibrated camera streams,” in Proc. First ACM SIGMM International Workshop on Video Surveillance, (New York, NY), pp. 21–33, ACM Press, 2003.

[12] N. T. Siebel, Designing and Implementing People Tracking Applications for Automated Visual Surveillance. PhD thesis, Dept. of Computer Science, The University of Reading, UK, March 2003.

[13] R. Collins, A. Lipton, H. Fujiyoshi, and T. Kanade, “Algorithms for cooperative multisensor surveillance,” Proceedings of the IEEE, vol. 89, pp. 1456–1477, October 2001.

[14] C. J. Costello, C. P. Diehl, A. Banerjee, and H. Fisher, “Scheduling an active camera to observe people,” in Proc. 2nd ACM International Workshop on Video Surveillance and Sensor Networks, (New York, NY), pp. 39–45, ACM Press, 2004.

[15] J. Funge, X. Tu, and D. Terzopoulos, “Cognitive modeling: Knowledge, reasoning and planning for intelligent characters,” in Proc. ACM SIGGRAPH 99 (A. Rockwood, ed.), pp. 29–38, August 1999.

[16] F. Qureshi, D. Terzopoulos, and P. Jasiobedzki, “Cognitive vision for autonomous satellite rendezvous and docking,” in Proc. IAPR Conference on Machine Vision Applications, (Tsukuba Science City, Japan), pp. 314–319, May 2005.

[17] F. Qureshi and D. Terzopoulos, “Surveillance camera scheduling: A virtual vision approach,” in Proc. Third ACM International Workshop on Video Surveillance and Sensor Networks, (Singapore), November 2005. In press.
