
Machine Vision and ApplicationsDOI 10.1007/s00138-006-0055-x

SPECIAL ISSUE

Multi-person interaction and activity analysis:a synergistic track- and body-level analysis framework

Sangho Park · Mohan M. Trivedi

Received: 18 November 2005 / Revised: 31 May 2006 / Accepted: 6 August 2006© Springer-Verlag 2007

Abstract This paper presents a synergistic track- and body-level analysis framework for multi-person interaction and activity analysis in the context of video surveillance. The proposed two-level analysis framework covers human activities both in wide and narrow fields of view with distributed camera sensors. The track-level analysis deals with the gross-level activity patterns of multiple tracks in various wide-area surveillance situations. The body-level analysis focuses on detailed-level activity patterns of individuals in isolation or in groups. ‘Spatio-temporal personal space’ is introduced to model various patterns of grouping behavior between persons. ‘Adaptive context switching’ is proposed to mediate the track-level and body-level analysis depending on the interpersonal configuration and imaging fidelity. Our approach is based on a hierarchy of action concepts: static pose, dynamic gesture, body-part action, single-person activity, and group interaction. An event ontology with a human activity hierarchy combines the multi-level analysis results to form a semantically meaningful event description. Experimental results with real-world data show the effectiveness of the proposed framework.

Keywords Activity analysis · Video surveillance · Two-level framework · Personal space · Context switching · Event hierarchy

S. Park · M. M. Trivedi (B)
Computer Vision and Robotics Research Laboratory,
University of California at San Diego,
La Jolla, CA 92093-0434, USA
e-mail: [email protected]

S. Park
e-mail: [email protected]

1 Introduction and motivation

The recognition of people’s activities from video sequences is a challenging task, especially in unconstrained environments, due to environmental noise and the ambiguities involved in people’s activities. Most outdoor human monitoring systems have been targeted at specific environmental situations: i.e., specific time, place, and activity scenarios [7,12,24,29]. We address more general human movement analysis systems that should be able to handle multiple heterogeneous situations caused by environmental variations, which requires an adaptive and robust framework. System adaptability has mostly been pursued only in the detection or tracking processes; making high-level activity analysis adaptive at multiple levels of analysis has not been addressed. Handling multiple heterogeneous situations may be modeled by just adding more and more event-specific models, but the situations can be handled more efficiently by introducing a spatio-temporal structure that is common to various situations [17].

The spatio-temporal structure of the activities of people can be analyzed at different levels of detail. At the gross level, a person’s activity is analyzed in terms of the tracks of moving bounding boxes; we call this ‘track-level analysis’. Track-based surveillance systems are useful for various situations, including different time zones, since the moving bounding box can be extracted relatively reliably from various scenes. However, track data by itself does not provide detailed information about the monitored scene. Detailed analysis of human activity requires the incorporation of situational context information and knowledge about human ecology. For example, similar track patterns may have very different



connotations depending on the given situation and body configuration.

At the detailed level, a person’s activity is analyzed in terms of the coordination of individual body parts; we call this ‘body-level analysis’. In indoor surveillance situations, body-level analysis has been actively studied by virtue of stable environmental factors such as regulated illumination, a stable background, etc. In outdoor situations, system performance depends on reliable low-level vision processes such as robust background modeling, reliable segmentation, etc.

In this paper, we present a synergistic two-stage framework for the analysis of multi-person interactions and activities in heterogeneous situations. An adaptive context switching mechanism is proposed to mediate between the two stages. We also define the concept of spatio-temporal personal space to model the aspect of human ecology in interpersonal interactions.

The rest of the paper is organized as follows: Sect. 2 reviews previous studies in video surveillance. Section 3 describes the general idea of ‘spatio-temporal personal space’. Section 4 explains the two-stage framework for person activity analysis. Section 5 describes the human activity model. Section 6 shows the experimental evaluation. Concluding remarks follow in Sect. 7.

2 Previous study

Reviews of general research on human motion understanding can be found in [1,5,13]. Most outdoor human monitoring systems have been targeting specific environmental situations: i.e., the specific time, place, and activity scenarios involved [7,12,24,29].

Exemplary surveillance systems have been based on either track analysis [15,21,27] or body analysis [7]. Track-level analysis is usually applied to wide-area surveillance of multiple moving vehicles/pedestrians in an open space such as a parking lot or a pedestrian plaza. In some wide-area surveillance situations, a coarse representation of the human body as a moving bounding box or an ellipse may be enough for tracking [15]. Other researchers have applied a more detailed representation of the human body, such as a moving region or a blob [21,27]. Velastin et al. [27] estimated optical flow to compute the motion direction of pedestrians in subway environments. Optical flow effectively distinguished persons moving in different directions. Body-level analysis usually focuses on more detailed activity analysis of individual persons. Haritaoglu et al. [7] analyzed silhouette contours to detect body parts such as the head, hands, torso, and legs. Body posture can be estimated from the configuration of the body parts.

Another important categorization of exemplary surveillance systems concerns the indoor versus outdoor setup. Indoor environments usually have more stable lighting conditions and background clutter, but human bodies may easily be occluded by other objects. Indoor surveillance systems have a narrow field of view (FOV) and can provide relatively high-resolution images of people. Park and Aggarwal [16] analyzed two-person interactions in an indoor setting using a detailed multi-blob representation of the human body. Oliver et al. [14] showed a system that recognizes human–object interactions in a living room by analyzing track patterns. The structural advantages of indoor environments make it possible to turn the space into an intelligent environment embedded with multiple sensors and processors. Trivedi et al. [23] presented an activity monitoring and summarization system using multiple arrays of multimodal (i.e., video and audio) inputs. Their system tracks multiple human motions in a meeting room setup with multiple calibrated cameras and microphones, selects the current speaker, and archives event-annotated data for future retrieval.

Outdoor environments exhibit many environmental variations such as changing weather, the time shifting from morning to evening, moving backgrounds, etc. Outdoor surveillance systems have to deal with those variations, so robustness is a central issue in outdoor surveillance. Park and Trivedi [19] presented a robust system for track-based surveillance and privacy protection in outdoor environments. The system was tested with continuous 10-h video and demonstrated very robust tracking performance. Most outdoor surveillance systems apply track analysis due to the limited input resolution, because the wide FOV for outdoor surveillance usually limits a person’s appearance to relatively low-resolution images. Some recent works, however, have attempted more detailed analysis. Zhao et al. [29] used both track and lower-body models to analyze pedestrian behaviors such as walking, running, and staying.

One of the recent developments in video surveillance is the use of distributed systems to cover multiple monitoring scenes with various FOVs. Distributed surveillance systems have many advantages. For example, distributed systems with multiple cameras can provide a more accurate model of the monitored scenes by using multiple-view geometry to recover the world coordinates for perspective compensation. Communication between multiple processing modules with redundant video inputs also enhances robustness and reliability. Remagnino et al. [21] presented a modular multi-agent based surveillance system with decentralized intelligence. A review of distributed surveillance systems can



Table 1 A comparison of exemplary surveillance systems

Authors          Year  Target site  Processing stages  Scene model  Person model               Cameras      Analysis level            Target event
Haritaoglu [7]   2000  O            1                  N/A          2D silhouette              Single       Body level                Action
Oliver [15]      2000  O            1                  N/A          2D bounding box            Single       Track level               Interaction
Trivedi [23]     2000  I            1                  PC           2D bounding box            Distributed  Track level               Action
Zhao [29]        2004  O            2                  PC           2D ellipse + leg           Single       Body level                Action
Remagnino [21]   2004  O            1                  PC           2D single blob             Distributed  Track level               Action
Makris [11]      2004  O            1                  PC           2D single blob             Distributed  Track level               Action
Velastin [27]    2005  I            1                  PC           2D single blob             Distributed  Track level               Action
Trivedi [25]     2005  I            1                  PC           3D full body               Distributed  Body level                Action
Huang [8]        2005  I            2                  PC           3D full body               Distributed  Track level + body level  Action
Proposed method        I + O        2                  PC           2D ellipse + 2D full body  Distributed  Track level + body level  Action + interaction

O Outdoor, I indoor, PC perspective compensation

be found in [26]. Heterogeneous cameras with different FOVs in distributed surveillance systems may be associated with different levels of human body/motion analysis. Integrating analyses at multiple degrees of detail (i.e., track-level vs. body-level analyses) from the heterogeneous cameras is an open research issue, as is the integration of multiple analysis levels of human activity. One of the goals of the current paper is to address this issue. A comparison of the exemplary surveillance systems related to the current paper is shown in Table 1.

3 Spatio-temporal personal space

We introduce the concept of spatio-temporal personal space to explain the grouping behavior of people. The personal space, borrowed from a theory of social psychology [3,22], is defined as the region surrounding each person, or the area that a person considers his or her domain or territory. Another person’s unexpected intrusion into this region makes people feel uncomfortable and move away to increase the distance between them. The personal space is adaptive in that it may enlarge or shrink depending on environmental and socio-cultural contexts [3]. We extend the classical concept of personal space toward the spatio-temporal personal space as shown in Fig. 1. A person of height H and shoulder width S located at position Pw = [xw, yw, zw]^T in the world coordinate system occupies a physical space defined by the cylinder with radius Ra = S centered at Pw (Fig. 1a). The bounding box of the person in a 2D image is regarded as an approximate projection of the cylinder onto the image plane. Notice that the

Fig. 1 Personal space modeling. a A surrounding cylinder of the 3D body at location Pw, b the top view of the person (i.e., the innermost ellipse), the cylinder with radius Ra, and the stationary personal space with radius Rb, c spatio-temporal personal space at low velocity v1, d spatio-temporal personal space at high velocity v2



apparent extension of the bounding box on the image plane varies according to body posture and camera perspective. Therefore, we anchor the cylinder stably by locating its axis along the torso axis of the person. The stationary personal space is then defined by the outer cylinder with radius Rb, as shown from the top view of the stationary person in Fig. 1b:

Rb = α · H (1)

A law of dynamics shows that the displacement x reachable in time t is proportional to velocity v (i.e., x = v × t). Therefore, the higher the velocity, the farther the range of impact of interaction in a given time period. Humans are subconsciously aware of this fact and anticipate the consequence of speed with respect to their own safety, resulting in an enlarged personal space in motion. From the viewpoint of computer vision, the direction of body motion at low speed may be ambiguous due to the possibility of agile body motion, whereas the direction of body motion at high speed is more deterministic because inertia makes agility difficult. Therefore, the effective range of body motion becomes diffused at low speed and more focused at high speed; we model this effect using an adaptive fan-shaped boundary as shown in Fig. 1c, d. As the person moves forward, the spatio-temporal personal space is defined with extended radius Rp and attentional angle θ from the direction of motion (Fig. 1c, d). The outermost envelopes in Fig. 1c, d show the spatio-temporal personal space, Ψ, at low and high speed, respectively.

We define Rp to be proportional to the speed |v| and θ to be inversely proportional to the speed |v|. This leads us to the following formulation:

Rp = Rb + β · |v| · t, 0 ≤ |v| ≤ |vMAX| (2)

θ = π · e−γ ·|v|, 0 ≤ |v| ≤ |vMAX| (3)

Hypothetical example plots of Eqs. 2 and 3 are shown in Fig. 2 with different parameter values for β and γ. (We have set α = 1, H = 100, t = 1, vMAX = 100 for the illustration.) The parameters α, β, γ denote the adaptive nature of the personal space, and vMAX denotes the effective maximum velocity of human body translation. The parameters need to be tuned based on the imaging configuration and human dynamics. Even if the current formulation of Eqs. 2 and 3 is ad hoc, it represents the basic relationship between |v|, Rp, and θ.
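As a minimal numerical sketch of Eqs. 1–3 (the parameter values mirror the illustration above; the function name and clamping of |v| to [0, vMAX] are our own choices):

```python
import math

def personal_space(height, speed, alpha=1.0, beta=2.0, gamma=0.05,
                   t=1.0, v_max=100.0):
    """Spatio-temporal personal space per Eqs. 1-3.

    Returns (Rb, Rp, theta): the stationary radius, the extended
    radius in the direction of motion, and the attentional angle.
    """
    speed = min(max(speed, 0.0), v_max)         # clamp |v| to [0, vMAX]
    r_b = alpha * height                        # Eq. 1: Rb = alpha * H
    r_p = r_b + beta * speed * t                # Eq. 2: Rp grows with speed
    theta = math.pi * math.exp(-gamma * speed)  # Eq. 3: theta shrinks with speed
    return r_b, r_p, theta

# A stationary person keeps a circular space (theta = pi);
# a fast walker projects a long, narrow fan ahead of the body.
print(personal_space(height=100, speed=0))
print(personal_space(height=100, speed=80))
```

With these settings a stationary person gets (Rb, Rp, θ) = (100, 100, π), while a person moving at speed 80 gets an extended radius of 260 and a much narrower attentional angle.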

By using the spatio-temporal personal space, Ψ, we can effectively model different patterns of human interaction in terms of the overlap of multiple Ψ’s. We distinguish the interaction potential, Φ_P, from the interaction region, Φ_R; the interaction potential among K persons is defined as the union of overlapping Ψ_k, which is denoted by the outermost envelope in Fig. 1c, d. The interaction region is defined as the union of the actually touching tight bounds, B_k′, specified by the innermost circle with radius Ra in Fig. 1. This leads us to the following definitions of Φ_P and Φ_R:

Φ_P = ∪_k Ψ_k, k ∈ [1, …, K1] such that ∩_k Ψ_k ≠ ∅ (4)

Φ_R = ∪_k′ B_k′, k′ ∈ [1, …, K2] such that ∩_k′ B_k′ ≠ ∅ (5)

The intersection in Eqs. 4 and 5 denotes the requirement of overlap or touching between interacting persons. Note that B_k′ is embedded in Ψ_k′ for the k′-th person, as shown in Fig. 1.
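A toy pairwise check of Eqs. 4 and 5 can be sketched as follows. Here each fan-shaped Ψ is crudely approximated by a disk of radius Rp and each tight bound B by a disk of radius Ra; the function names and the disk approximation are our own simplifications, not the paper's method:

```python
import math

def disks_overlap(p1, r1, p2, r2):
    """True if two disks (a crude stand-in for the regions in Fig. 1) touch."""
    return math.dist(p1, p2) <= r1 + r2

def interaction_state(p1, p2, ra, rp):
    """Classify a pair of persons: 'region' if the tight bounds (radius Ra)
    touch, 'potential' if only the personal spaces (radius Rp) overlap,
    else 'none'."""
    if disks_overlap(p1, ra, p2, ra):
        return "region"      # contributes to Phi_R (Eq. 5)
    if disks_overlap(p1, rp, p2, rp):
        return "potential"   # contributes to Phi_P (Eq. 4)
    return "none"

print(interaction_state((0, 0), (50, 0), ra=30, rp=120))   # region
print(interaction_state((0, 0), (200, 0), ra=30, rp=120))  # potential
print(interaction_state((0, 0), (500, 0), ra=30, rp=120))  # none
```

Because B_k′ ⊂ Ψ_k′, any pair classified as 'region' is automatically also in the interaction potential, mirroring the subset relation noted above.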

Φ_P and Φ_R are augmented by adding a superscript, τ ∈ N+, to denote the period of the touch. Figure 3 shows the diagram of the personal space of two interacting persons. The adaptive factor τ may vary depending on the context and can be learned from actual surveillance data in each context. The interaction region, Φ_R^τ, is usually a subset of the interaction potential, Φ_P^τ, because B_k′ is a subset of Ψ_k′ as explained above. The traditional indicator of human interaction in computer vision is based on the interaction region, but we propose that the interaction potential is more

Fig. 2 The hypothetical plots of speed versus spatio-temporal personal space in Fig. 1: extended radius Rp versus hypothetical speed for β = 1, 2, 3 (left), and attentional angle θ versus hypothetical speed for γ = 0.1, 0.05, 0.01 (right)



Fig. 3 The illustration of spatio-temporal personal space. Left: the spatial personal space in colored regions, and bounding boxes in solid rectangles. Right: the duration, τ, of the interaction potential between the two persons, denoted by the brown area along the time frames

Table 2 Context-dependent variations in spatio-temporal personal space

Site dependency
Site type   Crowded zone  Passage zone  Comfort zone
Rb          Narrow        Wide          Wide
τ           Short/long    Short         Long
Examples    Elevator      Corridor      Lounge

Activity dependency
Activity type  Pass by   Meet     Wait
Rb             Narrow    Narrow   Narrow/wide
τ              Short     Long     Long
Examples       Walkway   Lounge   Bus stop

Rb and τ denote spatial boundary and temporal duration, respectively

useful, because proximal human interaction does not necessarily involve physical body contact. Table 2 shows some examples of a hypothetical categorization of the spatio-temporal personal space for various situations.

4 A two-stage analysis of person activity

The current system’s analysis of human activity starts from foreground segmentation. Various methods of background modeling have been developed [4,6]. We adopt a modified version of the codebook-based background model [9] to segment foreground regions of outdoor scenes under varying environmental conditions. The background subtraction is followed by an ‘attribute relational graph’ based multitarget-multiassociation tracking (ARG-MMT) [18] to segment and track multiple body parts. Figure 4 shows the output of this process.

4.1 Multi-level representation of body motion

The multi-body tracking in ARG-MMT uses bounding boxes and 2D Gaussian ellipses to track the foreground bodies. As the people translate, the Gaussian parameters are updated along the sequence in a frame-by-frame manner. Updating these Gaussian parameters amounts to tracking the whole-body translation of each person in the 2D image domain [19]. In each bounding box, multiple body parts are simultaneously segmented.

This framework represents body motion at multiple levels: bounding box, 2D ellipse, and segmented body parts. The bounding box and ellipse represent the overall body motion at the track level (i.e., the 2D translation of the body), whereas the segmented body parts represent individual body-part motion at a detailed body-part level. However, occlusion during human interaction degrades the body-part segmentation, while the bounding box is well maintained, as shown in Fig. 4 (right). Proper handling of body parts during occlusion from a single perspective is still an open question in computer vision. Track-level analysis can survive the occlusion but is not detailed, whereas body-level analysis provides rich information but fails under occlusion. This motivates us to develop a synergistic two-level analysis framework and an adaptive mechanism to switch between the track-level and body-level analysis.

4.2 Switching between two analysis levels

Track-level analysis represents human activity in terms of the movement of the body centroid. Body-level analysis represents more detailed activity in terms of skeletal joint angles or limb tip positions. The sensitivity of body-level analysis is affected by many sources of uncertainty, including occlusion, articulation, camera perspective, imaging noise, and algorithmic uncertainty. Track-level analysis is more robust across these conditions, and it is regarded as the surveillance system’s baseline analysis, which is always available.

The proposed algorithm switches to body-level analysis whenever possible, and switches back to track-level analysis whenever the body-part appearance quality degrades. This feedback-based iterative process is illustrated in Fig. 5. The body-appearance quality is evaluated by comparing the body-appearance fidelity feature F_j^k for person j at frame k with the features learned from previous frames. The individual features in Fig. 6 are obtained from the ARG-MMT output in Fig. 4. The pseudo-code for the context switching process is shown in Fig. 7.

Fig. 4 Bounding box of each person and the detailed segmentation of multiple body parts. Notice the difference from Fig. 3. Left: before occlusion. Right: during the occlusion

Fig. 5 Two-stage processes for activity analysis (initialize system; update track analysis; initialize/deactivate the body-level analyzer; extract body-level features). ‘Switching to body level’ (SB) occurs when reliable body information is available. ‘Switching to track level’ (ST) occurs when the body information becomes unreliable

Fig. 6 Body-appearance fidelity feature F_j^k for person j at frame k
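The switching rule can be sketched as follows; note this is a hedged simplification of the pseudo-code in Fig. 7, in which the scalar fidelity score and the fixed threshold are our own placeholders for the learned feature comparison:

```python
def switch_level(current_level, fidelity, threshold=0.7):
    """Adaptive context switching between track- and body-level analysis.

    fidelity: a body-appearance fidelity score in [0, 1] standing in for
    F_j^k compared against learned previous frames (placeholder here).
    Track level is the always-available baseline; body level is entered
    only when reliable body information is available.
    """
    if current_level == "track" and fidelity >= threshold:
        return "body"    # SB: reliable body information -> body level
    if current_level == "body" and fidelity < threshold:
        return "track"   # ST: body appearance degraded -> track level
    return current_level

level = "track"
for f in [0.9, 0.8, 0.4, 0.6, 0.95]:  # fidelity over frames; 0.4 mimics occlusion
    level = switch_level(level, f)
    print(level)
```

The system thus enters body-level analysis when fidelity is high, falls back to track level during the simulated occlusion, and recovers afterward.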

5 Person activity modeling

The size of a pedestrian’s appearance in wide-view surveillance video changes systematically according to the distance from the camera to the person. The aspect ratio of the pedestrian’s appearance under a camera configuration with a large inclination angle also changes systematically with the degree of camera inclination. In such situations, the track-level analysis will always be available, whereas the body-level analysis may be difficult due to degraded image appearance.

Fig. 7 Pseudo code for the context switching algorithm between the two stages

5.1 Track-level activity modeling

We represent the individual track pattern Γ_i^k of the i-th person at time k in terms of the features shown in Fig. 8. These features are extracted from a least-mean-square based polynomial regression [28] curve of the track points computed along a moving window of size ρ seconds. The current application considers the past 1-s period. The track points can be perspective-compensated by available methods, such as camera calibration or planar homography, to unwarp the imaging artifact. The adjacency term d_ij^k in the formulation is a predicate that represents whether the distance is within a certain proximity, such as the interaction potential in Eq. 4. The main interests in the track-level analysis include the estimation of a moving person’s speed, perimeter sentry for cautious or secured areas, the estimation of proximity between persons, etc. As long as the tracks are well maintained along the sequence, the track-level analysis is reliable.

Fig. 8 Track feature vector
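A sketch of the track-feature step under stated assumptions: a degree-1 least-squares fit per coordinate over the 1-s window (the paper's polynomial degree is not stated here), a speed estimate from the fitted slopes, and the adjacency predicate as a simple distance threshold. All function names are ours:

```python
import math

def fit_line(ts, xs):
    """Least-mean-square linear fit x(t) = a + b*t; returns (a, b)."""
    n = len(ts)
    mt, mx = sum(ts) / n, sum(xs) / n
    b = (sum((t - mt) * (x - mx) for t, x in zip(ts, xs))
         / sum((t - mt) ** 2 for t in ts))
    return mx - b * mt, b

def track_speed(track, fps=30.0, window_s=1.0):
    """Speed of one track from a regression over the last window_s seconds."""
    pts = track[-int(fps * window_s):]
    ts = [i / fps for i in range(len(pts))]
    _, vx = fit_line(ts, [p[0] for p in pts])   # slope = x-velocity
    _, vy = fit_line(ts, [p[1] for p in pts])   # slope = y-velocity
    return math.hypot(vx, vy)                    # pixels per second

def adjacent(pos_i, pos_j, radius):
    """Adjacency predicate d_ij^k: within the interaction proximity?"""
    return math.dist(pos_i, pos_j) <= radius

# A track moving 2 px/frame along x at 30 fps -> ~60 px/s.
track = [(2.0 * k, 0.0) for k in range(30)]
print(round(track_speed(track)))   # 60
```

In practice the track points would first be perspective-compensated (e.g., by homography) so that the fitted speed is in ground-plane units rather than pixels.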

5.2 Body-level activity modeling

Many kinds of human activity and interaction are performed while people stay in the same position. The track-level analysis cannot handle detailed human activity patterns performed by stationary people: e.g., shaking hands, dancing, pushing, kicking, etc.

We formulate the body-level person activity in terms of a stochastic estimation of poses and gestures using hidden Markov models (HMMs). The pose estimation starts by extracting the occupancy map (OM) of the person: a 9×10 grid is overlaid on the foreground silhouette, and the normalized histogram of the foreground pixels is counted within each cell of the grid. Each cell of the OM represents the ratio of foreground pixels in the cell, with range [0, 1]. The occupancy map is bisected into upper body and lower body, with the head included in the upper body for simplicity. The cells of the OM are concatenated row by row to form a 45-dimensional feature vector that represents the individual person’s upper-body silhouette. A similar procedure is applied to the lower-body silhouette. Dimensionality reduction is achieved by vector quantization of the feature space using K-means clustering. The K codewords of the clusters are trained with training data that spans various types of single-person activity. A human gesture is represented by a sequence of the codewords and recognized by HMMs.
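A sketch of the occupancy-map step, assuming the grid is 10 rows × 9 columns so that the upper half (5 rows × 9 columns) yields the stated 45-dimensional vector; the K-means quantization stage is omitted, and the grid orientation is our assumption:

```python
def occupancy_map(mask, rows=10, cols=9):
    """Occupancy map: fraction of foreground pixels in each cell of a
    rows x cols grid laid over the silhouette (mask: list of rows of 0/1)."""
    h, w = len(mask), len(mask[0])
    om = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            om[r][c] = sum(cell) / len(cell)   # foreground ratio in [0, 1]
    return om

def upper_body_vector(om):
    """Concatenate the top half (5 rows x 9 cols = 45-D) row by row."""
    return [v for row in om[:5] for v in row]

mask = [[1] * 9 for _ in range(10)]          # toy all-foreground silhouette
vec = upper_body_vector(occupancy_map(mask))
print(len(vec), vec[0])                      # 45 1.0
```

Each resulting 45-D vector would then be mapped to its nearest of the K trained codewords before HMM decoding.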

HMM-based approaches to activity recognition have been presented in [8,14] with different feature sets. We use independent sets of HMMs to represent the upper-body gestures, lower-body gestures, and torso translations, respectively, under the assumption that each body part’s gesture evolution is independent of the others. The assumption of independence between the individual HMMs dramatically reduces the size of the overall state space and the number of relevant joint probability distributions. We use the standard Baum–Welch algorithm and the Viterbi algorithm [20] to train and decode the following HMM sets. The number of hidden nodes of the HMMs is 3 or 4, depending on the gesture complexity.

Q_1, the set of HMMs for the lower body, represents the gestures: Q_1 = {“stay”, “walk”, “kick”}.

Q_2 is the set of HMMs for the torso: Q_2 = {“stay”, “moving left”, “moving right”, “moving up”, “moving down”}.

Q_3 is the set of HMMs for the arms: Q_3 = {“stay”, “stretch out”, “withdraw”}.
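A minimal Viterbi decoder illustrating the decoding step [20]; the toy two-state HMM and its parameters are invented for illustration only (the actual models above are trained with Baum–Welch):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: most likely hidden-state path for an
    observed codeword sequence under one HMM."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[-2][p] * trans_p[p][s])
            V[-1][s] = V[-2][prev] * trans_p[prev][s] * emit_p[s][o]
            back[-1][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back[1:]):        # follow back-pointers
        path.append(bp[path[-1]])
    return path[::-1]

# Toy 2-state HMM over codewords "lo"/"hi" (parameters invented).
states = ("stay", "walk")
start = {"stay": 0.6, "walk": 0.4}
trans = {"stay": {"stay": 0.7, "walk": 0.3},
         "walk": {"stay": 0.3, "walk": 0.7}}
emit = {"stay": {"lo": 0.8, "hi": 0.2},
        "walk": {"lo": 0.2, "hi": 0.8}}
print(viterbi(["lo", "lo", "hi", "hi"], states, start, trans, emit))
# ['stay', 'stay', 'walk', 'walk']
```

In the system, one such decode runs per body part (Q_1, Q_2, Q_3) over its codeword sequence, and the per-part labels are combined by the event hierarchy described next.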

5.3 Interaction modeling

A gap exists between the geometric information obtained from images and the semantic information contained in conceptual terms [10]. It is necessary to associate visual features with concepts and symbols to build the event semantics of a person’s activity.

Our representation of multi-person activity is based on an event hierarchy [17]; a human interaction is a combination of single-person actions, and a single-person action is composed of multiple body-part gestures such as torso motion and arm/leg motion. Each body-part gesture is an elementary event of motion and is composed of a sequence of instantaneous poses at each frame.

We adopt Allen’s interval temporal logic [2] to describe the causal relations between two events, E1 and E2, in the temporal domain. We distinguish only causal and coincident relations (Fig. 9); the causal relations encompass (1) before, (2) meet, and (3) overlap, whereas the coincident relations encompass (4) start, (5) during, and (6) finish in the interval temporal logic. The overlap or lag between the two events is allowed within a tolerance of δ frames. A single event can be a track-level or body-level activity, depending on the application purposes.
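A possible classifier for this two-way grouping might look as follows; the mapping of the six Allen relations onto 'causal'/'coincident' and the handling of the δ tolerance are our own reading of the text, not the paper's exact rule:

```python
def temporal_relation(e1, e2, delta=2):
    """Classify two events, given as (start, end) frame pairs, as
    'causal' (before/meet/overlap) or 'coincident' (start/during/finish),
    with a tolerance of delta frames for overlap or lag."""
    s1, f1 = e1
    s2, f2 = e2
    if f1 <= s2 + delta:                 # E1 ends (roughly) before E2 starts
        return "causal"
    if (abs(s1 - s2) <= delta            # shared start
            or abs(f1 - f2) <= delta     # shared finish
            or (s1 >= s2 and f1 <= f2)): # E1 during E2
        return "coincident"
    return "other"

print(temporal_relation((0, 10), (11, 20)))   # causal (E1 before E2)
print(temporal_relation((0, 20), (1, 20)))    # coincident (near-shared start)
```

A real implementation would also need to handle the symmetric cases (E2 before E1, E2 during E1), omitted here for brevity.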

6 Experimental results

NTSC videos of pedestrians were captured at 30 frames per second in various outdoor environments such as a building entrance, walkway, bus station, etc. The video data were captured from various perspectives, including camera inclinations of 0, 25, 35, and 55° from the horizontal level. Some videos were captured with a Genwac GW-202D CCD camera, and others with a Sony HandyCam DCR-TRV950.

Fig. 9 Causal and coincident relations between two events E1 and E2 in interval temporal logic

Fig. 10 The testbed of the current system. 1st row: actual site image, corresponding 3D site model, and camera placements (C1–C4) with viewing directions and the corresponding viewing areas (A1–A4). 2nd row: C2’s view of A1, C3’s view of A3, and C4’s view of A4, respectively, overlayed with ROIs and detected entry/exit zones

6.1 Testbed site with distributed sensors

We have built a smart space that includes a campus building and its surrounding roads such as walkways and a driveway (Fig. 10). A manually generated 3D CAD model was used to incorporate information about the building structure and floor plans. Four cameras were mounted at specific locations on the building to view the surrounding roads, as indicated by (C1–C4) with viewing directions and the corresponding view areas (A1–A4). The walkway and driveway surrounding the building are registered by planar homography as in [11]. The locations of the main entry/exit zones were learned by accumulating the frequency of appearance of pedestrians around the border of each camera view, as denoted by the ellipses in Fig. 10. The ROIs of the walkway (in green) and driveway (in red) zones were manually specified. The information about the various zones was used as site context for activity analysis (see Table 2).
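The entry/exit-zone learning step can be sketched as a simple frequency accumulation over a coarse spatial grid. The bin size and count threshold below are illustrative choices, not values from the paper:

```python
import numpy as np

def learn_entry_exit_zones(first_positions, img_shape, bin_size=16, min_count=5):
    """Learn entry/exit zones by accumulating where tracks first (or last)
    appear near the border of a camera view. `first_positions` is a list of
    (x, y) pixel coordinates; cells whose appearance frequency exceeds the
    threshold are returned as zone corners (top-left pixel of each cell)."""
    h, w = img_shape
    acc = np.zeros((h // bin_size + 1, w // bin_size + 1), dtype=int)
    for x, y in first_positions:
        acc[int(y) // bin_size, int(x) // bin_size] += 1
    ys, xs = np.nonzero(acc >= min_count)
    return [(int(x) * bin_size, int(y) * bin_size) for x, y in zip(xs, ys)]
```

In practice these high-frequency cells would be grouped and fitted with ellipses, as shown overlaid in Fig. 10.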

6.2 Experiment on person detection

We evaluated the robustness of the surveillance system with a very long sequence of outdoor video data. We captured a walkway scene at a frame rate of 30 frames per second from 9 AM to 7 PM, resulting in a 10-h video composed of 1,080,000 compressed frames (16.7 GB of disk space). The all-day outdoor video sequence contains various kinds of environmental changes. Figure 13 shows some examples of the varying illumination conditions of the same site in the morning, noon, afternoon, and evening.

The long video sequence involves dramatic variations in average illumination level, moving shadows from wind-blown branches, drastic changes in the intensity histogram profile, etc. Figure 11 shows a 3D view of the concatenated histogram profiles for the day, and Fig. 12 shows some instances of the histogram profiles and the variation of the average illumination level of each frame.
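A minimal sketch of how such a concatenated histogram profile and the per-frame mean-intensity curve can be computed from grayscale frames (the bin count is an assumption):

```python
import numpy as np

def histogram_profile(frames_gray, bins=64):
    """Build a concatenated intensity-histogram profile (as in Fig. 11) and
    the per-frame mean intensity curve (as in Fig. 12) from a sequence of
    8-bit grayscale frames."""
    profiles, means = [], []
    for f in frames_gray:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        profiles.append(h / h.sum())      # normalized intensity histogram
        means.append(float(f.mean()))     # average illumination level
    return np.stack(profiles), np.array(means)
```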

We evaluated the foreground detection performance by computing the false rejection rate FRR and false acceptance rate FAR, defined in Eqs. 6 and 7, respectively.

Fig. 11 Histogram profile of the walkway for a day (from 9 AM to 7 PM)


Fig. 12 Example histogram profiles of morning, noon, and evening time, and the mean intensity variation over the day at the walkway

FRR = (falsely rejected pixels) / (number of foreground pixels)    (6)

FAR = (falsely accepted pixels) / (number of background pixels)    (7)

To calculate the FRR and FAR, a set of frames containing foreground objects (people) was sampled from the whole time span. The frames were sampled approximately every 10,000 frames, which gave 93 frames. For each of the sampled frames, the foreground region was marked by hand (based on the original input frame) and used as the ground truth.
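Given a hand-labeled ground-truth mask and a detected foreground mask, Eqs. 6 and 7 can be evaluated as follows (a sketch; the function and variable names are our own):

```python
import numpy as np

def frr_far(gt_mask, det_mask):
    """Compute the false rejection rate (Eq. 6) and false acceptance rate
    (Eq. 7) of foreground detection, in percent, against a hand-labeled
    ground-truth mask. Both inputs are boolean arrays of the same shape."""
    fg = gt_mask.sum()                      # ground-truth foreground pixels
    bg = gt_mask.size - fg                  # ground-truth background pixels
    falsely_rejected = np.logical_and(gt_mask, ~det_mask).sum()
    falsely_accepted = np.logical_and(~gt_mask, det_mask).sum()
    frr = 100.0 * falsely_rejected / fg if fg else 0.0
    far = 100.0 * falsely_accepted / bg if bg else 0.0
    return frr, far
```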

FRR is 8.45 (±13.43)%, and the least mean square (LMS) based regression line shows a slightly increasing tendency in FRR, from 5 to 12%, during the 10 h. FAR is 0.14 (±0.33)%, and the LMS-based regression line shows an almost flat tendency in FAR during the 10 h. The low values of FRR and FAR indicate good background subtraction performance over the 10 h and show that the background model adapts to the changes.
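The LMS regression line summarizing the tendency of FRR or FAR over time amounts to an ordinary least-squares line fit, which can be sketched as (function name assumed):

```python
import numpy as np

def lms_trend(times_h, values):
    """Fit the least-mean-square regression line used to summarize the
    FRR/FAR tendency over the recording period. Returns (slope, intercept)
    such that value ~= slope * time + intercept."""
    t = np.asarray(times_h, dtype=float)
    v = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(t, v, 1)   # degree-1 least-squares fit
    return slope, intercept
```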

6.3 Experiment on track-level analysis

We evaluated the tracking performance of the system using the all-day video sequence. The long video sequence contains a natural scene, and the persons captured in the scene are anonymous pedestrians; no artificial treatment was applied to the pedestrians. The majority of the video frames are background scenes without any pedestrians; therefore, we subsampled the video clips that contain pedestrians. Table 3 shows the input data summary and the analysis of detection and tracking errors for the subsampled clips. Single or multiple pedestrians were tracked along the walkway all day. Most of the false alarms originate from background subtraction noise. Some example frames of multi-person and single-person tracking results are


Table 3 Input data and tracking error results for the all-day track analysis

Input data summary
  Number of persons              15
  Number of frames               21,600

Event                           Count
  Single person                  9
  2-person passing-by            3
  3-person passing-by            0
  Entry as a group               1

Detection errors
  False alarm                    11
  Missed detection (partial)     1
  Missed detection (complete)    0

Tracking errors
  Track swap (temporary)         0
  Track swap (permanent)         1
  Track lost                     0
  Track drift onto other person  2

shown in Fig. 13, including noon, afternoon, and evening times.

We also tested our system on various outdoor scenes with different camera angles and viewing distances. Example images are shown in Fig. 14. Table 4 shows the input data summary and the analysis of detection and tracking errors for the tested video sequences. We subsampled the video sequences that included pedestrians; the subsampled videos correspond to a 30-min-long video sequence in total. False alarm denotes cases in which the tracker detects noise foreground blobs as pedestrians. Missed detection (partial) represents cases in which some body regions are not detected. Missed detection (complete) represents cases in which a pedestrian is not detected. Redundant detection denotes cases in which a single person is detected as multiple persons. Track swap (temporary) means that some tracks are incorrectly switched with each other but recovered. Track swap (permanent) means that two tracks are switched and a person's identity remains confused. Track lost means that a track is lost during tracking.

Our surveillance system is robust to variable conditions including illumination changes, different camera viewing angles, viewing distances, etc. The track-based analysis results are shown in Fig. 14. Various views of different sites are included, and the track results are overlaid on the images. The different paths of the tracks are maintained, and the interaction moments are denoted by the overlaid color rectangles.

6.4 Experiment on body-level analysis

Purely vision-based analysis of human activity involves inherent ambiguity in determining whether or not nearby people actually interact. Socio-cultural context may be associated with their behavior. The proposed two-level analysis framework, with the formulation of the spatio-temporal personal space, can reduce this ambiguity. In this section, we present examples of detailed body-level analysis in the proposed two-stage framework. Body-level analysis requires reliable feature extraction that is robust to the outdoor environment. We observed that body feature extraction was not always reliable for

Fig. 13 Example frames of the all-day tracking results at 9 AM, 11 AM, 3 PM, and 7 PM, respectively


Fig. 14 Example frames of tracking results at various sites with different camera configurations

Table 4 Tracking error results for the various-view track analysis

Input data summary
  Number of persons             207
  Number of frames              56,603

Event                           Count
  Single person                 23
  2-person passing-by           53
  3-person passing-by           16
  Entry as a group              11

Detection errors                Count   %
  False alarm                   5
  Missed detection (partial)    10      5
  Missed detection (complete)   1       0.5
  Redundant detection           6       3

Tracking errors                 Count   %
  Track swap (temporary)        13      19
  Track swap (permanent)        21      30
  Track lost                    1       0.5

some camera configurations, such as the in-depth view and the large-inclination (55°) view in Fig. 14 (the 1st and last images, respectively). An example sequence of passing interaction in Fig. 15 shows that person ID-5 in the middle goes upstairs to knock on the door, person ID-6 passes by, and person ID-5 goes downstairs to exit. The second row shows the corresponding raw frames. The third row shows the track-level analysis results in the image domain (i.e., the XY plot) and in the more detailed spatio-temporal domain (i.e., the XYT plot), respectively. The track-level analysis shows each person's body translation; person ID-5 enters from the right, goes upstairs, then exits to the right side, and person ID-6 traverses from the left to the right side. Figure 16 shows the body-level analysis of the two-person interaction. The XT-/YT-plots (1st row) show the overlapping period of the two persons' appearance in the video, denoted by the gray (yellow) regions. The following analysis is based on this period. The two persons' Euclidean distance D is normalized by their average height (i.e., 'height-normalized') in order to be incorporated with the spatio-temporal personal space, and is plotted along the timeline in the abscissa (i.e., the 'TD plot') in the 2nd row (1st image). The horizontal bar in the TD plot denotes the interaction potential manually specified by Rb = 0.5H and Rp = 0.7H in Eqs. 1 and 2. (Notice that the specification of β and γ requires the estimation of the true velocity v, which is not available in



Fig. 15 Analysis of passing interaction in horizontal view. 1st row: tracking results, 2nd row: raw input frames, 3rd row: spatio-temporal plots (XY and XYT) of the tracking results

this sequence due to the lack of calibration.) The period of the interaction potential (1.2 s) is obtained by the two vertical projection lines onto the time dimension (i.e., the frame number in the abscissa). The last two plots in Fig. 16 show the height-normalized lower-body width of person ID-6 and ID-5, respectively. The overlaid ellipses denote the patterns correctly classified as 'walking' by the leg HMMs. The context switching mechanism identified the erratic distortion of the body-appearance fidelity feature F (in Fig. 6) due to the occlusion during the interaction potential, and nullified the HMM output as invalid.
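A simplified sketch of how the period of interaction potential can be recovered from the height-normalized distance curve: here only the Rp = 0.7H threshold is applied, and the velocity-dependent terms β and γ of Eqs. 1 and 2 are omitted, as in the uncalibrated sequence just discussed (the function name and the longest-run heuristic are assumptions):

```python
import numpy as np

def interaction_potential_period(d_norm, frames, r_p=0.7):
    """Find the period of interaction potential from the height-normalized
    interpersonal distance D/H. Returns the (start_frame, end_frame) of the
    longest contiguous run with D/H below the Rp threshold, or None if the
    persons never come that close."""
    inside = np.asarray(d_norm) < r_p
    best, run_start = None, None
    for i, flag in enumerate(inside):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i - 1)
            run_start = None
    if run_start is not None:                 # run extends to the last frame
        cand = (run_start, len(inside) - 1)
        if best is None or cand[1] - cand[0] > best[1] - best[0]:
            best = cand
    if best is None:
        return None
    return frames[best[0]], frames[best[1]]
```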

An example of a kicking interaction is shown in Fig. 17. Person ID-1 on the left side kicks person ID-0 on the right, and then they rapidly depart. The 3rd row shows the XYT plot and the height-normalized Euclidean distance between the persons. The 4th row shows the height-normalized lower-body (dotted line) and upper-body (solid line) widths for person ID-1 and ID-0, respectively. The overlaid rectangle and ellipse indicate the correctly recognized 'kicking' and 'walking' gestures, respectively. The period of interaction potential (less than 1 s) was specified by Rb = 0.5H and Rp = 0.7H. This means that the 'kicking' interaction happens quickly and lasts a shorter time than the 'passing-by' interaction in Fig. 16.

Fig. 16 Body-level analysis of the passing interaction in Fig. 15. 1st row: XT- and YT-plots of the tracks, 2nd row: the height-normalized Euclidean distance between the two tracks and the height-normalized lower-body width of person ID-6, 3rd row: the height-normalized lower-body width of person ID-5, respectively. See the text for details

The track-level and body-level analysis results are integrated into the semantic description, with the detected activities aligned along a common timeline. Figure 18 shows the passing and kicking interactions of Figs. 15, 16 and 17, respectively. Notice that track-level descriptions are located below the timeline, and body-level descriptions are above the timeline. The period of interaction potential specified by Rb = 0.5H and Rp = 0.7H is marked with the green (gray) vertical bars on the timeline. The semantic-level description in Fig. 18 provides users with an intuitive and concise summary of events; the multi-person interaction is obtained by focusing on the period of interaction potential, while the description of individual activities is available along the entire timeline. Notice that the gross-level description in terms of torso tracks is shown below the timelines, while the more detailed description in terms of body-part-level activity is shown above the timelines.

7 Concluding remarks

In this paper, we have presented a synergistic two-level analysis framework for multi-person interaction and activity in outdoor environments, which include varying illumination, changing weather conditions, moving cast shadows, various camera perspectives, and site variation depending on location. We have introduced the spatio-temporal personal space to address the different behaviors of people. We have proposed an adaptive context switching that bridges the track- and body-level analysis of human activity. Synchronized semantic



Fig. 17 Analysis of kicking interaction in oblique view (35° camera inclination). 1st row: tracking results, 2nd row: raw input frames, 3rd row: XYT plot and the height-normalized Euclidean distance between the two tracks, and 4th row: the height-normalized body widths of person ID-1 and person ID-0, respectively. See the text for details

descriptions with event hierarchies provide users with concise summaries of events. The track-level analysis is robust to environmental fluctuations, while the appearance-based body-level analysis sometimes fails in certain conditions with low illumination and similar appearances of people. The experimental evaluations show that the proposed framework efficiently mediates the robust track-level analysis and the less robust body-level analysis of interpersonal activities. Future plans include model-based analysis of joint angles and perspective-independent estimation of the spatio-temporal personal space using multi-perspective approaches.

Fig. 18 Semantic description of the passing (Figs. 15, 16) and kicking (Fig. 17) interactions

Acknowledgements This research was supported in part by the US DoD Technical Support Working Group (TSWG) and by the NSF RESCUE ITR Project. We thank the visiting students from Aalborg University in Denmark, Preben Andersen and Rasmus Corlin, for their enthusiastic participation and contribution to the study. We are also thankful to our colleagues at the UCSD Computer Vision and Robotics Research Laboratory for their valuable support. Finally, we thank the reviewers for their insightful comments, which helped us improve the quality of the paper.

References

1. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73(3), 295–304 (1999)

2. Allen, J.F., Ferguson, G.: Actions and events in interval temporal logic. J. Logic Comput. 4(5), 531–579 (1994)

3. Altman, I.: The environment and social behavior: privacy, personal space, territory, crowding. Irving Publishers, New York (1981)

4. Chalidabhongse, T., Kim, K., Harwood, D., Davis, L.: A perturbation method for evaluating background subtraction algorithms. In: Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. Nice, France (2003)

5. Gavrila, D.: The visual analysis of human movement: a survey. Comput. Vis. Image Underst. 73(1), 82–98 (1999)

6. Hall, D., Nascimento, J., Ribeiro, P., Andrade, E., Moreno, P., Pesnel, S., List, T., Emonet, R., Fisher, R., Victor, J.S., Crowley, J.: Comparison of target detection algorithms using adaptive background models. In: IEEE VS-PETS. Beijing, China (2005)

7. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 797–808 (2000)

8. Huang, K.S., Trivedi, M.M.: 3D shape context based gesture analysis integrated with tracking using omni video array. In: Proceedings of the IEEE Workshop on Vision for Human-Computer Interaction (V4HCI). San Diego, USA (2005)

9. Kim, K., Chalidabhongse, T., Harwood, D., Davis, L.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11 (2005)

10. Kojima, A., Tamura, T., Fukunaga, K.: Textual description of human activities by tracking head and hand motions. In: International Conference on Pattern Recognition, vol. 2, pp. 1073–1077 (2002)

11. Makris, D., Ellis, T., Black, J.: Learning scene semantics. In: ECOVISION 2004 Early Cognitive Vision Workshop. Isle of Skye, Scotland, UK (2004)

12. McKenna, S.J., Jabri, S., Duric, Z., Wechsler, H.: Tracking interacting people. In: 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2000), pp. 348–353 (2000)

13. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Comput. Vis. Image Underst. 81(3), 231–268 (2001)

14. Oliver, N., Horvitz, E., Garg, A.: Layered representations for human activity recognition. In: Proceedings of the IEEE International Conference on Multimodal Interfaces, pp. 3–8 (2002)

15. Oliver, N.M., Rosario, B., Pentland, A.P.: A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)

16. Park, S., Aggarwal, J.K.: A hierarchical Bayesian network for event recognition of human actions and interactions. Multimedia Systems: Special Issue on Video Surveillance, pp. 164–179 (2004)

17. Park, S., Aggarwal, J.K.: Semantic-level understanding of human actions and interactions using event hierarchy. In: IEEE Workshop on Articulated and Nonrigid Motion. Washington, DC, USA (2004)

18. Park, S., Aggarwal, J.K.: Simultaneous tracking of multiple body parts of interacting persons. Comput. Vis. Image Underst. 102(1), 1–21 (2006)

19. Park, S., Trivedi, M.M.: A track-based human movement analysis and privacy protection system adaptive to environmental contexts. In: IEEE International Conference on Advanced Video and Signal based Surveillance. Como, Italy (2005)

20. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

21. Remagnino, P., Shihab, A., Jones, G.: Distributed intelligence for multi-camera visual surveillance. Pattern Recognit.: Special Issue on Agent-Based Computer Vision 37(4), 675–689 (2004)

22. Sommer, R.: Personal Space: The Behavioral Basis of Design. Prentice Hall, Englewood Cliffs (1969)

23. Trivedi, M., Mikic, I., Bhonsle, S.: Active camera networks and semantic event databases for intelligent environments. In: IEEE Workshop on Human Modeling, Analysis and Synthesis. Hilton Head, South Carolina (2000)

24. Trivedi, M.M., Gandhi, T., Huang, K.: Distributed interactive video arrays for event capture and enhanced situational awareness. IEEE Intelligent Systems, Special Issue on Artificial Intelligence for Homeland Security (2005)

25. Trivedi, M.M., Huang, K.S., Mikic, I.: Dynamic context capture and distributed video arrays for intelligent spaces. IEEE Trans. Syst. Man Cybern. Part A 35(1), 145–163 (2005)

26. Valera, M., Velastin, S.: Intelligent distributed surveillance systems: a review. IEE Proc. Vis. Image Signal Process. 152(2), 192–204 (2005)

27. Velastin, S., Boghossian, B., Lo, B., Sun, J., Vicencio-Silva, M.: PRISMATICA: toward ambient intelligence in public transport environments. IEEE Trans. Syst. Man Cybern. Part A 35(1), 164–182 (2005)

28. Williams, E.: Regression Analysis. Wiley, New York (1959)

29. Zhao, T., Nevatia, R.: Tracking multiple humans in complex situations. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1208–1221 (2004)

Author Biographies

Sangho Park received his bachelor of science degree in electronics and computer engineering at Yonsei University, Seoul, Korea. He earned his M.A. in perceptual psychology and Ph.D. in electrical and computer engineering from the University of Texas at Austin, specializing in computer vision. Currently, he is a postdoctoral research scientist at the Computer Vision and Robotics Research Laboratory at the University of California at San Diego. His interests include computer vision, human activity analysis, image processing, sensor networks, and pattern recognition. He is working on projects involving video surveillance, activity analysis in intelligent systems, and sensor-based enhancement of responses in unexpected crises.

Mohan Manubhai Trivedi is a professor of electrical and computer engineering at the University of California at San Diego. Trivedi has a broad range of research interests in the intelligent systems, computer vision, intelligent ("smart") environments, intelligent vehicles and transportation systems, and human-machine interface areas. He established the Computer Vision and Robotics Research Laboratory at UCSD. Currently, Trivedi and his team are pursuing systems-oriented research in distributed video arrays and active vision, omnidirectional vision, human body modeling and movement analysis, face and affect analysis, and intelligent vehicles and interactive public spaces. He serves on the executive committee of the California Institute for Telecommunications and Information Technology [Cal-(IT)2] as the leader of the Intelligent Transportation and Telematics Layer at UCSD. He also serves as a charter member of the executive committee of the University of California systemwide Digital Media Innovation Program (DiMI). He serves regularly as a consultant to industry and government agencies in the USA and abroad. Trivedi was editor-in-chief of the Machine Vision and Applications journal during 1997–2003. He has served on the editorial boards of journals and program committees of several major conferences. He served as chairman of the Robotics Technical Committee of the IEEE Computer Society. He was elected to serve on the administrative committee (BoG) of the IEEE Systems, Man and Cybernetics Society. Trivedi has received the Distinguished Alumnus award from Utah State University and the Pioneer (Technical Activities) and Meritorious Service awards from the IEEE Computer Society. He is a fellow of the International Society for Optical Engineering (SPIE).