
View invariant action recognition using weighted fundamental ratios

Nazim Ashraf a,*, Yuping Shen a,b, Xiaochun Cao a,c, Hasan Foroosh a,d,1

a College of Engineering and Computer Science, Computational Imaging Lab, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA
b Advanced Micro Devices, Quadrangle Blvd., Orlando, FL 32817, USA
c State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
d Department of EECS, Computational Imaging Laboratory, University of Central Florida, Orlando, FL 32816, USA

Computer Vision and Image Understanding 117 (2013) 587-602. http://dx.doi.org/10.1016/j.cviu.2013.01.006
© 2013 Elsevier Inc. All rights reserved.

Article history: Received 10 February 2012; Accepted 19 January 2013; Available online 5 February 2013.

Keywords: View invariance; Pose transition; Action recognition; Action alignment; Fundamental ratios

Abstract

In this paper, we fully investigate the concept of fundamental ratios, demonstrate their application and significance in view-invariant action recognition, and explore the importance of different body parts in action recognition. A moving plane observed by a fixed camera induces a fundamental matrix F between two frames, where the ratios among the elements in the upper left 2×2 submatrix are herein referred to as the fundamental ratios. We show that fundamental ratios are invariant to camera internal parameters and orientation, and hence can be used to identify similar motions of line segments from varying viewpoints. By representing the human body as a set of points, we decompose a body posture into a set of line segments. The similarity between two actions is therefore measured by the motion of line segments and hence by their associated fundamental ratios. We further investigate to what extent a body part plays a role in the recognition of different actions and propose a generic method of assigning weights to different body points. Experiments are performed on three categories of data: the controlled CMU MoCap dataset, the partially controlled IXMAS data, and the more challenging uncontrolled UCF-CIL dataset collected on the internet. Extensive experiments are reported on testing (i) view-invariance, (ii) robustness to noisy localization of body points, (iii) the effect of assigning different weights to different body points, (iv) the effect of partial occlusion on recognition accuracy, and (v) determining how soon our method recognizes an action correctly from the starting point of the query video.

This paper has been recommended for acceptance by J.K. Aggarwal.
* Corresponding author. E-mail address: [email protected] (N. Ashraf).
1 Assistant Professor, Department of Computer Science, FC College (A Chartered University).

1. Introduction

The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, human-computer interaction (HCI), ergonomics, etc. In this paper, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed in action recognition include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies [1]. The literature in human action recognition has been extremely active in the past two decades and significant progress has been made in this area [2-5]. Action can be regarded as a collection of 4D space-time data observed by a perspective video camera. Due to image projection, the 3D Euclidean information is lost and projectively distorted, which makes action recognition rather challenging, especially for varying viewpoints and different camera parameters. Another source of challenge is the irregularities of human actions due to a variety of factors such as age, gender, circumstances, etc. The timeline of an action is another important issue in action recognition. The execution rates of the same action in different videos may vary for different actors or due to different camera frame rates. Therefore, the mapping between same actions in different videos is usually highly non-linear.

To tackle these issues, simplifying assumptions are often made by researchers on one or more of the following aspects: (1) camera model, such as scaled orthographic [6] or calibrated perspective camera [7]; (2) camera pose, i.e. little or no viewpoint variations; (3) anatomy, such as isometry [8], coplanarity of a subset of body points [8], etc. Human action recognition methods start by assuming a model of the human body, e.g. silhouette, body points, stick model, etc., and build algorithms that use the adopted model to recognize body pose and its motion over time. Space-time features are essentially the primitives that are used for recognizing actions, e.g. photometric features such as the optical flow [9-11] and the local space-time features [12,13]. These photometric features can be affected by luminance variations due to, for instance, camera zoom or pose changes, and often work better when the motion is small or incremental. On the other hand, salient geometric features such as silhouettes [14-18] and point sets [8,19] are less sensitive to photometric variations, but require reliable tracking.

Silhouettes are usually stacked in time as 2D [16] or 3D objects [14,18], while point sets are tracked in time to form space-time curves. Some existing approaches are also more holistic and rely on machine learning techniques, e.g. HMM [20], SVM [12], etc. As in most exemplar-based methods, they rely on the completeness of the training data, and to achieve view-invariance, are usually expensive, as it would be required to learn a model from a large dataset.

1.1. Previous work on view-invariance

Most action recognition methods adopt simplified camera models and assume a fixed viewpoint, or simply ignore the effect of viewpoint changes. However, in practical applications such as HCI and surveillance, actions may be viewed from different angles by different perspective cameras. Therefore, a reliable action recognition system has to be invariant to the camera parameters or viewpoint changes. View-invariance is, thus, of great importance in action recognition, and has started receiving more attention in recent literature.

One approach to tackle view-invariant action recognition has been based on using multiple cameras [20-22,7]. Campbell et al. [23] use stereo images to recover a 3D Euclidean model of the human subject, and extract view invariance for 3D gesture recognition. Weinland et al. [7] use multiple calibrated and background-subtracted cameras, obtain a visual hull for each pose from multi-view silhouettes, and stack them as a motion history volume, based on which Fourier descriptors are computed to represent actions. Ahmad et al. [20] build HMMs on optical flow and human body shape features from multiple views, and feed a test video sequence to all learned HMMs. These methods require the setup of multiple cameras, which is quite expensive and restricted in many situations such as online video broadcast, HCI, or monocular surveillance.

A second line of research is based on a single camera and is motivated by the idea of exploiting the invariants associated with a given camera model, e.g. affine or projective. For instance, Rao et al. [24] assume an affine camera model, and use dynamic instants, i.e. the maxima in the space-time curvature of the hand trajectory, to characterize hand actions. The limit of this representation is that dynamic instants may not always exist or may not always be preserved from 3D to 2D due to perspective effects. Moreover, the affine camera model is restrictive in most practical scenarios. A more recent work reported by Parameswaran et al. [8] relaxes the restrictions on the camera model. They propose a quasi-view-invariant 2D approach for human action representation and recognition, which relies on the number of invariants in a given configuration of body points. Thus a set of projective invariants is extracted from the frames and used as the action representation. However, in order to make the problem tractable under variable dynamics of actions they introduced heuristics, and made simplifying assumptions such as isometry of human body parts. Moreover, they require that at least five body points form a 3D plane, or that the limbs trace a planar area during the course of an action. [25] described a method to improve discrimination by inferring and then using latent discriminative aspect parameters. Another interesting approach to tackle unknown views has been suggested by [26], who use virtual views, connecting the action descriptors extracted from the source view to those extracted from the target view. Another interesting approach is [27], who used a bag of visual words to model an action and present promising results.

Another promising approach is based on exploiting multi-view geometry. Two subjects in the same exact body posture viewed by two different cameras at different viewing angles can be regarded as related by the epipolar geometry. Therefore, corresponding poses in two videos of actions are constrained by the associated fundamental matrices, thus providing a way to match poses and actions in different views. The use of the fundamental matrix in view-invariant action recognition was first reported by Syeda-Mahmood et al. [28] and later by Yilmaz et al. [18,19]. They stack silhouettes of input videos into space-time objects, and extract features in different ways, which are then used to compute a matching score based on the fundamental matrices. A similar work is also presented in [29], which is based on body points instead of silhouettes. A recent method [30] uses a probabilistic 3D exemplar model that can generate 2D view observations for recognition.

1.2. Our approach

This work is an extension of [31], which introduced the concept of fundamental ratios that are invariant to rigid transformations of the camera, and applied them to action recognition. We make the following main extensions: (i) Instead of looking at fundamental ratios induced by triplets of points, we look at fundamental ratios induced by line segments. This, as we will later see, introduces more redundancy and results in better accuracy. (ii) It has long been argued in the applied perception community [32] that humans focus only on the most significant aspects of an event or action for recognition, and do not give equal importance to every observed data point. We propose a new generic method of learning how to assign different weights to different body points in order to improve the recognition accuracy by using a similar focusing strategy as humans. (iii) We study how this focusing strategy can be used in practice when there is partial but significant occlusion. (iv) We investigate how soon after the query video starts our method is capable of recognizing the action, an important issue never investigated by others in the literature. (v) Our experiments in this paper are more extensive than [31] and include a larger set of data with various levels of difficulty.

The rest of the paper is organized as follows. In Section 2, we introduce the concept of fundamental ratios, which are invariant to rigid transformations of the camera, and describe how they may be used for action recognition in Section 3. Then in Section 4, we focus on how we can weigh different body parts for better recognition. We present our extensive experimental evaluation in Section 5, followed by discussions and conclusion in Section 6.

2. Fundamental ratios

In this section, we establish specific relations for the epipolar geometry induced by line segments. We derive a set of feature ratios that are invariant to camera intrinsic parameters for a natural perspective camera model of zero skew and unit aspect ratio. We then show that these feature ratios are projectively invariant to similarity transformations of the line segment in 3D space, or equivalently invariant to rigid transformations of the camera.

Proposition 1. Given two cameras P_i ≃ K_i[R_i | t_i], P_j ≃ K_j[R_j | t_j] with zero skew and unit aspect ratio, denote the relative translation and rotation from P_i to P_j as t and R, respectively; then the upper 2×2 submatrix of the fundamental matrix between the two views is of the form

F^{2\times 2} \cong \begin{bmatrix} \epsilon_{1st}\, t^{s} r^{t}_{1} & \epsilon_{1st}\, t^{s} r^{t}_{2} \\ \epsilon_{2st}\, t^{s} r^{t}_{1} & \epsilon_{2st}\, t^{s} r^{t}_{2} \end{bmatrix},    (1)

where r_k is the kth column of R, the superscripts s, t = 1, ..., 3 indicate the element in the vector, and ε_{rst}, r = 1, 2, is a permutation tensor.1

Remark 1. The ratios among the elements of F^{2×2} are invariant to the camera calibration matrices K_i and K_j.

1 The use of tensor notation is explained in detail in [33, p. 563].
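For illustration, a minimal numerical check of Proposition 1 and Remark 1 can be written as follows (Python with NumPy; the camera parameters below are arbitrary example values, and normalizing by the Frobenius norm is simply one way of comparing the blocks up to scale):

    import numpy as np

    def skew(t):
        # Cross-product matrix [t]_x such that skew(t) @ a == np.cross(t, a)
        return np.array([[0, -t[2], t[1]],
                         [t[2], 0, -t[0]],
                         [-t[1], t[0], 0]])

    def rotation(axis, angle):
        # Rodrigues' formula for a rotation about a unit axis
        axis = np.asarray(axis, float) / np.linalg.norm(axis)
        K = skew(axis)
        return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

    def K_matrix(f, u, v):
        # Zero skew and unit aspect ratio, as assumed in Proposition 1
        return np.array([[f, 0, u], [0, f, v], [0, 0, 1]], float)

    R = rotation([0.2, 1.0, 0.1], 0.4)      # relative rotation from P_i to P_j
    t = np.array([1.0, 0.3, -0.5])          # relative translation
    E = skew(t) @ R                         # [t]_x R, independent of the calibrations

    Ki, Kj = K_matrix(1000, 320, 240), K_matrix(1400, 300, 260)
    F = np.linalg.inv(Kj).T @ E @ np.linalg.inv(Ki)   # fundamental matrix between the views

    def upper_block_ratios(F):
        # Upper-left 2x2 block, scaled to unit Frobenius norm so only the ratios matter
        B = F[:2, :2]
        return B / np.linalg.norm(B)

    print(upper_block_ratios(F))
    print(upper_block_ratios(E))   # identical: the ratios do not depend on Ki or Kj

Running the two print statements produces the same 2×2 block, illustrating that changing focal lengths and principal points leaves the fundamental ratios unchanged.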

The upper 2×2 sub-matrices F^{2×2} of two moving cameras can be used to measure the similarity of camera motions. That is, if two cameras perform the same motion (same relative translation and rotation during the motion), and F_1 and F_2 are the fundamental matrices between any pair of corresponding frames, then F_1^{2×2} ≅ F_2^{2×2}. This also holds for the dual problem in which the two cameras are fixed, but the scene objects in both cameras perform the same motion. A special case of this problem is when the scene objects are planar surfaces, which is discussed below.

Proposition 2. Suppose two fixed cameras are looking at two moving planar surfaces, respectively. Let F_1 and F_2 be the two fundamental matrices induced by the two moving planar surfaces. If the motion of the two planar surfaces is similar (differs at most by a similarity transformation), then

F_1^{2\times 2} \cong F_2^{2\times 2},    (2)

where the projective equality, denoted by ≅, is invariant to camera orientation.

Here similar motion implies that the plane normals undergo the same motion up to a similarity transformation. The projective nature of the view-invariant equation in (2) implies that the elements in the sub-matrices on both sides of (2) are equal up to an arbitrary non-zero scale factor, and hence only the ratios among them matter. We call these ratios the fundamental ratios, and as Propositions 1 and 2 state, these fundamental ratios are invariant to camera intrinsic parameters and viewpoints. To eliminate the scale factor, we can normalize both sides using

\hat{F}_i = |F_i^{2\times 2}| \,/\, \|F_i^{2\times 2}\|_F, \quad i = 1, 2,

where |·| refers to the absolute value operator and ‖·‖_F stands for the Frobenius norm. We then have

\hat{F}_1 \cong \hat{F}_2.    (3)

In practice, \hat{F}_1 and \hat{F}_2 may not be exactly equal due to noise, computational errors, or subjects' different ways of performing the same actions. We therefore define the following function to measure the residual error:

E(\hat{F}_1, \hat{F}_2) = \|\hat{F}_1 - \hat{F}_2\|_F.    (4)

3. Action recognition using fundamental ratios

3.1. Representation of pose

Using a set of body points for representing human pose has been used frequently in action recognition, primarily because a human body can be modeled as an articulated object, and secondly because body points capture sufficient information to achieve the task of action recognition [29,34,8,19]. Other representations of pose include subject silhouette [14,16,28], optical flow [9,11,10], and local space-time features [13,12].

We have used the body point representation. Therefore, an action consists of a sequence of point sets. These points can be obtained by using articulated object tracking techniques such as [35-39]. Further discussion on articulated object tracking can be found in [40,2,3], and is beyond the scope of this paper. We shall, henceforth, assume that tracking has already been performed on the data, and that we are given a set of labeled points for each image.

3.2. Pose transitions

We are given a video sequence {I_t} and a database of reference sequences corresponding to K different known actions, DB = { {J^1_t}, {J^2_t}, ..., {J^K_t} }, where I_t and J^k_t are labeled body points in frame t. Our goal is to identify the sequence {J^k_t} from DB such that the subject in {I_t} performs the closest action to that observed in {J^k_t}.

Existing methods for action recognition such as [16,18] consider an action as a whole, which usually requires known start and end frames and is limited when the action execution rate varies. Some other approaches such as [29] regard an action as a sequence of individual poses, and rely on pose-to-pose similarity measures. Since an action consists of spatio-temporal data, the temporal information plays a crucial role in recognizing an action, which is ignored in a pose-to-pose approach. We thus propose using pose transitions: one can compare actions by comparing their pose transitions.

3.3. Matching pose transitions

The structure of a human can be divided into lines of body points, each defined by 2 body points. The problem of comparing articulated motions of the human body thus transforms to comparing rigid motions of body line segments. According to Proposition 2, the motion of a plane induces a fundamental matrix, which can be identified by its associated fundamental ratios. If two pose transitions are identical, their corresponding body point segments would induce the same fundamental ratios, which provide a measure for matching two pose transitions.

3.3.1. Fundamental matrix induced by a moving line segment

Assume that we are given an observed pose transition I_i → I_j from sequence {I_t}, and J^k_m → J^k_n from sequence {J^k_t} from an action dataset containing k actions, with N examples of each action. When I_i → I_j corresponds to J^1_m → J^1_n and J^2_m → J^2_n, one can regard them as observations of the same 3D pose transition by three different cameras P_1, P_2, and P_3, respectively. There are two instances of epipolar geometry associated with this scenario:

1. The mapping between the image pair ⟨I_i, I_j⟩ and the image pairs ⟨J^1_m, J^1_n⟩, ⟨J^2_m, J^2_n⟩ is determined by the fundamental matrices F_12 and F_13 [33] related to P_1, P_2, and P_3. Also, the mapping between the image pairs ⟨J^1_m, J^1_n⟩ and ⟨J^2_m, J^2_n⟩ is determined by the fundamental matrix F_23. The projection of the camera center of P_2 in I_i or I_j is given by the epipole e_21, which is found as the right null vector of F_12. The image of the camera center of P_1 in J^1_m or J^1_n is the epipole e_12, given by the right null vector of F_12^T. The projection of the camera center of P_3 in I_i or I_j is given by the epipole e_31, which is found as the right null vector of F_13. Similarly, the image of the camera center of P_1 in J^2_m or J^2_n is the epipole e_13, given by the right null vector of F_13^T. The image of the camera center of P_3 in J^1_m or J^1_n is the epipole e_32, given by the right null vector of F_23^T. Note that e_31 and e_32 are corresponding points in I_i or I_j and J^1_m or J^1_n, respectively. This fact will be used later on.

2. The other instance of epipolar geometry is between the transitioned poses of a line segment of body points in two frames of the same camera, i.e. the fundamental matrix induced by a moving body line segment, which we denote as F. We call this fundamental matrix the inter-pose fundamental matrix, as it is induced by the transition of body point poses viewed by a stationary camera.

Let ℓ be a line of 3D points, whose motion leads to different image projections on I_i, I_j, J^1_m, J^1_n, J^2_m and J^2_n as ℓ_i, ℓ_j, ℓ^1_m, ℓ^1_n, ℓ^2_m and ℓ^2_n, respectively:

ℓ_i = ⟨x_1, x_2⟩,  ℓ_j = ⟨x'_1, x'_2⟩,  ℓ^1_m = ⟨y_1, y_2⟩,  ℓ^1_n = ⟨y'_1, y'_2⟩,  ℓ^2_m = ⟨z_1, z_2⟩,  ℓ^2_n = ⟨z'_1, z'_2⟩.

ℓ_i and ℓ_j can be regarded as projections of a stationary 3D line ⟨X_1, X_2⟩ on two virtual cameras P'_i and P'_j. Assume that the epipoles in P'_i and P'_j are known, and let us denote these as e'_i = (α_1, β_1, 1)^T, e'_j = (α'_1, β'_1, 1)^T, e'_m = (α_2, β_2, 1)^T, and e'_n = (α'_2, β'_2, 1)^T.

We can use the epipoles as parameters for the fundamental matrices induced by ℓ_i, ℓ_j and ℓ^1_m, ℓ^1_n [41]:

F_1 \cong \begin{bmatrix} a_1 & b_1 & -\alpha_1 a_1 - \beta_1 b_1 \\ c_1 & d_1 & -\alpha_1 c_1 - \beta_1 d_1 \\ -\alpha'_1 a_1 - \beta'_1 c_1 & -\alpha'_1 b_1 - \beta'_1 d_1 & \alpha'_1(\alpha_1 a_1 + \beta_1 b_1) + \beta'_1(\alpha_1 c_1 + \beta_1 d_1) \end{bmatrix},    (5)

F_2 \cong \begin{bmatrix} a_2 & b_2 & -\alpha_2 a_2 - \beta_2 b_2 \\ c_2 & d_2 & -\alpha_2 c_2 - \beta_2 d_2 \\ -\alpha'_2 a_2 - \beta'_2 c_2 & -\alpha'_2 b_2 - \beta'_2 d_2 & \alpha'_2(\alpha_2 a_2 + \beta_2 b_2) + \beta'_2(\alpha_2 c_2 + \beta_2 d_2) \end{bmatrix}.    (6)

To solve for the four parameters, we have the following equations:

x'^T_1 F_1 x_1 = 0,    (7)
x'^T_2 F_1 x_2 = 0.    (8)

Similarly, F_2 induced by ℓ^1_m and ℓ^1_n can be computed from:

y'^T_1 F_2 y_1 = 0,    (9)
y'^T_2 F_2 y_2 = 0.    (10)

Given N − 1 other examples of the same action in the dataset, we have:

e^T_{11} F_1 e_{11} = 0,    (11)
e^T_{12} F_2 e_{12} = 0,    (12)
e^T_{21} F_1 e_{21} = 0,    (13)
e^T_{22} F_2 e_{22} = 0,    (14)
...
e^T_{(N-1)1} F_1 e_{(N-1)1} = 0,    (15)
e^T_{(N-1)2} F_2 e_{(N-1)2} = 0,    (16)

where e_{i1} and e_{i2} are the projections of the ith sequence's camera center in I_i or I_j and in J^i_m or J^i_n, respectively.

With N > 2, we have an overdetermined system, which can be easily solved by re-arranging the above equations in the form Ax = 0 and solving for the right null space of A to obtain the ratios.

In fact, given N examples in the dataset, we can have as many as N(N-1)\binom{11}{2} ratios per frame, where \binom{11}{2} is the total number of different line-segment combinations given 11 body points. Compared to this, using triplets we would have N\binom{11}{3} ratios per frame. Given N > 4, this is a huge advantage over using triplets, as we have more redundancy leading to more accuracy.

The difficulty with Eqs. (5) and (6) is that the epipoles e'_i, e'_j, e'_m and e'_n are unknown. Fortunately, however, the epipoles can be closely approximated as described below.

Proposition 3. If the exterior orientation of P_1 is related to that of P_2 by a translation, or by a rotation around an axis that lies on the axis planes of P_1, then under the assumption

e'_i = e'_j = e_1, \qquad e'_m = e'_n = e_2,    (17)

we have

E(\hat{F}_1, \hat{F}_2) = 0.    (18)

Under more general motion, the equalities in (17) become only approximate. However, we shall see in Section 5.1.1 that this approximation is inconsequential in action recognition for a wide range of practical rotation angles. As described shortly, using Eq. (4) and the fundamental matrices F_1 and F_2 computed for every non-degenerate line segment, we can define a similarity measure for matching pose transitions I_i → I_j and J^k_m → J^k_n.

Degenerate configurations: If the other example's camera projection is collinear with the 2 points of the line segment, the problem becomes ill-conditioned. We can either ignore this camera center in favor of the other camera centers, or simply ignore the line segment altogether. This does not produce any difficulty in practice, since with the 11 body point representation used in this paper, we obtain 55 possible line segments, the vast majority of which are in practice non-degenerate.

A special case is when the epipole is close to or at infinity, for which all line segments would degenerate. We solve this problem by transforming the image points in projective space in a manner similar to Zhang et al. [42]. The idea is to find a pair of projective transformations Q and Q', such that after transformation the epipoles and transformed points are not at infinity. Note that these transformations do not affect the projective equality in Proposition 2.

3.3.2. Algorithm for matching pose transitions

The algorithm for matching two pose transitions I_i → I_j and J^k_m → J^k_n is as follows:

1. Compute F, e_1, e_2 between the image pair ⟨I_i, I_j⟩ and ⟨J^k_m, J^k_n⟩ using the method proposed in [43].
2. For each non-degenerate 3D line segment ℓ that projects onto ℓ_i, ℓ_j, ℓ^k_m and ℓ^k_n in I_i, I_j, J^k_m and J^k_n, respectively, compute \hat{F}_1, \hat{F}_2 as described above, and compute e_ℓ = E(\hat{F}_1, \hat{F}_2) from Eq. (4).
3. Compute the average error over all non-degenerate line segments using

E(I_i \to I_j, J^k_m \to J^k_n) = \frac{1}{L} \sum_{\ell = 1}^{L} e_\ell,    (19)

where L is the total number of non-degenerate line segments.
4. If E(I_i → I_j, J^k_m → J^k_n) < E_0, where E_0 is some threshold, then the two pose transitions are matched. Otherwise, the two pose transitions are classified as mismatched.

3.4. Sequence alignment

Before we give our solution to action recognition, we first describe the algorithm for matching two action sequences. We represent an action A = {I_{1,...,n}} as a sequence of pose transitions, P(A; r) = {I_{1→r}, ..., I_{(r-1)→r}, I_{r→(r+1)}, ..., I_{r→n}}, where I_r is an arbitrarily selected reference pose. (For brevity of notation, we denote the pose transition I_i → I_j as I_{i→j}.) If two sequences A = {I_{1...n}} and B = {J_{1...m}} contain the same action, then there exists an alignment between P(A; r_1) and P(B; r_2), where I_{r_1} and J_{r_2} are two corresponding poses. To align the two sequences of pose transitions, we use dynamic programming. Therefore, our method to match two action sequences A and B can be described as follows:

1. Initialization: select a pose transition I_{i_0} → I_{i_1} from A such that the two poses are distinguishable. Then find its best matched pose transition J_{j_0} → J_{j_1} in B, by checking all pose transitions in the sequence as described in Section 3.3.2.

2. For all i = 1, ..., n, j = 1, ..., m, compute

S_{i,j} = \begin{cases} \tau - E(I_{i_0 \to i}, J_{j_0 \to j}) & i \neq i_0,\ j \neq j_0 \\ \tau - E(I_{i_0 \to i_1}, J_{j_0 \to j_1}) & i = i_0,\ j = j_0 \\ 0 & \text{otherwise,} \end{cases}

where τ is a threshold, e.g. τ = 0.3. S is the matching score matrix of {I_{1,...,n}} and {J_{1,...,m}}.

3. Initialize the n × m accumulated score matrix M as

M_{i,j} = \begin{cases} S_{i,j} & i = 1 \ \text{or}\ j = 1 \\ 0 & \text{otherwise.} \end{cases}

4. Update matrix M from top to bottom, left to right (i, j ≥ 2), using

M_{i,j} = S_{i,j} + \max\{ M_{i,j-1},\, M_{i-1,j},\, M_{i-1,j-1} \}.

5. Find (i*, j*) such that

(i^*, j^*) = \arg\max_{i,j} M_{i,j}.

Then backtrace M from (i*, j*), and record the path P until it reaches a non-positive element.

The matching score of sequences A and B is then defined as S(A, B) = M_{i*,j*}. The back-traced path P provides an alignment between the two video sequences. Note that this may not be a one-to-one mapping, since there may exist horizontal or vertical lines in the path, which means that a frame may have multiple candidate matches in the other video. In addition, due to noise and computational error, different selections of I_{i_0} → I_{i_1} may lead to different valid alignment results.
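A minimal sketch of this dynamic-programming alignment is given below (Python with NumPy). It assumes a user-supplied function transition_error(i, j) that returns the pose-transition error E(I_{i_0→i}, J_{j_0→j}) of Eq. (19); the reference transition indices i_0, i_1, j_0, j_1 and the threshold are treated as inputs chosen elsewhere.

    import numpy as np

    def align_sequences(n, m, i0, i1, j0, j1, transition_error, tau=0.3):
        # Step 2: matching score matrix S (1-based indexing; row/column 0 unused)
        S = np.zeros((n + 1, m + 1))
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if i != i0 and j != j0:
                    S[i, j] = tau - transition_error(i, j)
                elif i == i0 and j == j0:
                    S[i, j] = tau - transition_error(i1, j1)
                # otherwise S[i, j] stays 0

        # Steps 3-4: accumulated score matrix M
        M = np.zeros((n + 1, m + 1))
        M[1, :], M[:, 1] = S[1, :], S[:, 1]
        for i in range(2, n + 1):
            for j in range(2, m + 1):
                M[i, j] = S[i, j] + max(M[i, j - 1], M[i - 1, j], M[i - 1, j - 1])

        # Step 5: start from the maximum entry and backtrace while entries stay positive
        i, j = np.unravel_index(np.argmax(M[1:, 1:]), M[1:, 1:].shape)
        i, j = i + 1, j + 1
        score, path = M[i, j], []
        while M[i, j] > 0:
            path.append((i, j))
            # move to the best predecessor among the diagonal, up, and left neighbors
            _, i, j = max((M[i - 1, j - 1], i - 1, j - 1),
                          (M[i - 1, j], i - 1, j),
                          (M[i, j - 1], i, j - 1))
        return score, list(reversed(path))

The returned path plays the role of the back-traced alignment P, and the returned score corresponds to S(A, B).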

3.5. Action recognition

To solve the action recognition problem, we need a reference sequence (a sequence of 2D poses) for each known action, and maintain an action database of K actions, DB = { {J^1_t}, {J^2_t}, ..., {J^K_t} }. To classify a given test sequence {I_t}, we match {I_t} against each reference sequence in DB, and classify {I_t} as the action of the best match, say {J^k_t}, if S({I_t}, {J^k_t}) is above a threshold T. Due to the use of the view-invariant fundamental ratios, our solution is invariant to camera intrinsic parameters and viewpoint changes when the approximation of the epipoles is valid. As discussed in Section 5.1.1, this can be achieved by using reference sequences from more viewpoints for each action. One major feature of the proposed method is that there is no training involved if line segments are used without the weighting (discussed in Section 4), and we can recognize an action from a single example. This is experimentally verified in Section 5.

4. Weighting-based human action recognition

In his classic experiments, Johansson [34] demonstrated that humans can identify motion when presented with only a small set of moving dots attached to various body parts. This seems to suggest that people are quite naturally adept at ignoring trivial variations of some body part motions, and at paying attention to those that capture the features that are essential for action recognition.

In the previous section, we saw how fundamental ratios can be used for action recognition. But we implicitly made the assumption that all body joints have an equal contribution to matching pose transitions and action recognition, which goes against the evidence in the applied perception literature [32]. For instance, in a sport such as boxing, the motion of the upper body parts of the boxer is more important than the motion of the lower body.

With the line segment representation of human body pose, a similar assertion can be made on body line segments: some body point line segments have a greater contribution to pose and action recognition than others. Therefore, it would be reasonable to assume that by assigning appropriate weights to the similarity errors of body point line segments, the performance of pose and action recognition could be improved. To study the significance of different body-point line segments in action recognition, we selected two different sequences of walking action WA = {I_{1...l}} and WB = {J_{1...m}}, and a sequence of running action R = {K_{1...n}}. We then aligned sequences WB and R to WA, using the alignment method described in Section 3.4, and obtained the corresponding alignments/mappings w: WA → WB and w': WA → R. As discussed in Section 3.3, the similarity of two poses is computed based on the error scores of the motion of all body-point line segments. For each pair of matched poses ⟨I_i, J_{w(i)}⟩, we stacked the error scores of all line segments as a vector V_e(i):

V_e(i) = [E_1, E_2, \ldots, E_T]^T.    (20)

We then built an error score matrix M_e for the alignment w: WA → WB:

M_e = [\, V_e(1)\ \ V_e(2)\ \ \ldots\ \ V_e(l)\, ].    (21)

Each row i of M_e indicates the dissimilarity scores of line segment i across the sequence, and the expected value of each column j of M_e is the dissimilarity score of poses I_j and J_{w(j)}. Similarly, we built an error score matrix M'_e for the alignment w': WA → R.

To study the role of a line segment i in distinguishing walking and running, we can compare the ith row of M_e and M'_e, as plotted in Fig. 1a-f. We found that some line segments, such as line segments 1, 2 and 11, have similar error scores in both cases, which means the motion of these line segments is similar in walking and running. On the other hand, line segments 19, 46 and 49 have high error scores in M'_e and low error scores in M_e; that is, the motion of these line segments in a running sequence is different from their motion in a walking sequence. Line segments 55, 94 and 116 reflect the variation between the actions of walking and running, and thus are more informative than line segments 1, 21 and 90 for the task of distinguishing walking and running actions.

We compared sequences of different individuals performing the same action in order to analyze the importance of line segments in recognizing them as the same action. For instance, we selected four sequences G0, G1, G2, and G3 of golf-swing action, aligned G1, G2, and G3 to G0 using the alignment method described in Section 3.4, and then built error score matrices M^1_e, M^2_e, M^3_e as described above. From the illustrations of M^1_e, M^2_e, M^3_e in Fig. 2a-c, the dissimilarity scores of some line segments, such as line segment 53 (see Fig. 2f), are very consistent across individuals. Some other line segments, such as line segments 6 (Fig. 2d) and 50 (Fig. 2e), have varying error score patterns across individuals; that is, these line segments represent the variations of individuals performing the same action.

Definition 1. If a line segment reflects the essential differences between an action A and other actions, we call it a significant line segment of action A. All other line segments are referred to as trivial line segments of action A.

A typical significant line segment should (i) convey the variations between actions and/or (ii) tolerate the variations of the same action performed by different individuals. For example, line segments 19, 46 and 49 are significant line segments for the walking action, and line segment 53 is a significant line segment for the golf-swing action.

Fig. 1. Roles of line segments in action recognition: (a)-(f) are plots of the dissimilarity scores of some line segments across frames in the walk-walk and walk-run alignments. Panels: (a) line segment 1, (b) line segment 2, (c) line segment 11 (examples of insignificant line segments, which are similar in both walking and running); (d) line segment 19, (e) line segment 46, (f) line segment 49 (examples of significant line segments for distinguishing between walking and running). As can be observed, line segments 1, 21 and 90 have similar error scores in both cases, which essentially means the motion of these line segments is similar in both walking and running. But line segments 55, 94 and 116 have high error scores in M'_e and low error scores in M_e, which means that the motion of these line segments in a running sequence is different from their motion in a walking sequence. Therefore, these line segments reflect the variation between the actions of walking and running and are much more useful for distinguishing between walking and running actions.

Intuitively, in action recognition we should place more emphasis on the significant line segments while reducing the negative impact of trivial line segments, that is, assign an appropriate influence factor to each body-point line segment. In our approach to action recognition, this can be achieved by assigning appropriate weights to the similarity errors of the body point line segments in Eq. (19). That is, Eq. (19) can be rewritten as:

E(I_i \to I_j, J^k_m \to J^k_n) = \sum_{\ell = 1}^{L} \omega_\ell\, e_\ell,    (22)

where L is the total number of non-degenerate line segments and ω_1 + ω_2 + ... + ω_L = 1.
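A small sketch of the weighted error in Eq. (22), assuming the per-segment residuals e_ℓ of Eq. (4) have already been computed for the L non-degenerate line segments of a pose transition (Python with NumPy):

    import numpy as np

    def weighted_transition_error(segment_residuals, weights):
        # Eq. (22): weighted sum of per-line-segment residuals, with weights summing to 1
        w = np.asarray(weights, float)
        w = w / w.sum()                  # enforce the normalization constraint
        return float(np.dot(w, segment_residuals))

Setting all weights equal recovers the unweighted average of Eq. (19).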

Fig. 2. Roles of different line segments in action recognition. We selected four sequences G0, G1, G2, and G3 of golf-swing action, aligned G1, G2, and G3 to G0 using the alignment method described in Section 3.4, and then built the error score matrices M^1_e, M^2_e, M^3_e correspondingly, as in the above experiments. Panels: (a) line segment 6, (b) line segment 50, (c) line segment 53. As can be observed, the dissimilarity scores of some line segments, such as line segment 53, are very consistent across individuals. Some other line segments, such as line segments 6 and 50, have varying error score patterns across individuals; that is, these line segments represent the variations of individuals performing the same action.

The next question is how to determine the optimal set of weights ω_ℓ for different actions. Manual assignment of weights could be biased and difficult for a large database of actions, and is inefficient when new actions are added. Therefore, automatic assignment of weight values is desired for a robust and efficient action recognition system. To achieve this goal, we propose to use a fixed-size dataset of training sequences to learn the weight values. Suppose we are given a training dataset T which consists of K × J action sequences for J different actions, performed by K different individuals. Let ω_ℓ be the weight value of the body joint with label ℓ (ℓ = 1 ... L) for a given action. Our goal is to find the optimal weights ω_ℓ that maximize the similarity error between sequences of different actions and minimize those of the same actions.

Since the size of the dataset and the alignments of the sequences are fixed, this turns out to be an optimization problem over ω_ℓ. Our task is to define a good objective function f(ω_1, ..., ω_L) for this purpose, and to apply optimization to solve the problem.

4.1. Weights on line segments versus weights on body points

Given a human body model of n points, we could obtain at most \binom{n}{2} line segments, and would need to solve a \binom{n}{2}-dimensional optimization problem for weight assignment. Even with a simplified human body model of 11 points, this yields an extremely high dimensional (\binom{11}{2} = 55 dimensions) problem. On the other hand, the body point line segments are not independent of each other. In fact, adjacent line segments are correlated by their common body point, and the importance of a line segment is also determined by the importance of its two body points. Therefore, instead of using \binom{n}{2} variables for the weights of \binom{n}{2} line segments, we assign n weights ω_{1...n} to the body points P_{1...n}, where:

\omega_1 + \omega_2 + \cdots + \omega_n = 1.    (23)

The weight of a line segment ℓ = ⟨P_i, P_j⟩ is then computed as:

\lambda_\ell = \frac{\omega_i + \omega_j}{n}.    (24)

Note that the definition of λ_ℓ in (24) ensures that λ_1 + λ_2 + ... + λ_T = 1. Using (24), Eq. (22) is rewritten as:

E(I_1 \to I_2, J_i \to J_j) = \frac{1}{n}\, \mathrm{Median}_{1 \le i \ldots

times to test on the 48 sequences. IXMAS is a large dataset, and we tested each sequence by randomly generating a reference dataset of 2 × 5 × 10 = 100 sequences for 10 actions performed by two people observed from five different viewpoints, and tested on the remaining sequences.

5.1. Analysis based on motion capture data

We generated our data based on the CMU Motion Capture Database, which consists of 3D motion data for a large number of human actions. We generated the semi-synthetic data by projecting the 3D points onto images through synthesized cameras. In other words, our test data consist of video sequences of true persons, but the cameras are synthetic, resulting in semi-synthetic data to which various levels of noise were added. Instead of using all the body points provided in CMU's database, we employed a body model that consists of only eleven points, including head, shoulders, elbows, hands, knees and feet (see Fig. 3).

Fig. 3. Left: Our body model. Right: Experiment on view-invariance. Two different pose transitions P1 → P2 and P3 → P4 from a golf swing action are used.

5.1.1. Testing view invariance

We selected four different poses P1, P2, P3, P4 from a golf swing sequence (see Fig. 3). We then generated two cameras as shown in Fig. 4a: camera 1 was placed at an arbitrary viewpoint (marked by red color), with focal length f1 = 1000; camera 2 was obtained by rotating camera 1 around an axis on the xz plane of camera 1 (colored as green), and around a second axis on the yz plane of camera 1 (colored as blue), and changing the focal length to f2 = 1200. Let I1 and I2 be the images of poses P1 and P2 on camera 1, and I3, I4, I5 and I6 the images of poses P1, P2, P3 and P4 on camera 2, respectively. Two sets of pose similarity errors were computed at all camera positions shown in Fig. 4a: E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6). The results are plotted in Fig. 4b and c, which show that, when the two cameras are observing the same pose transitions, the error is zero regardless of their different viewpoints, confirming Proposition 3.

Fig. 4. Analysis of view invariance: (a) Camera 1 is marked in red, and all positions of camera 2 are marked in blue and green. (b) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored green in (a). (c) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored blue in (a). (d) General camera motion: camera 1 is marked in red, and camera 2 is distributed on a sphere. (e) Error surface of same pose transitions for all distributions of camera 2 in (d). (f) Error surface of different pose transitions for all distributions of camera 2 in (d). (g) The regions of confusion for (d) marked in black (see text). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Similarly, we fixed camera 1 and moved camera 2 on a sphere as shown in Fig. 4d. The errors E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6) are shown in Fig. 4e and f. Under this more general camera motion, the pose similarity score of corresponding poses is not always zero, since the epipoles in Eqs. (5) and (6) are approximated. However, this approximation is inconsequential in most situations, because the error surface of different pose transitions is in general above that of corresponding pose transitions. Fig. 4h shows the regions (black colored) where the approximation is invalid. These regions correspond to the situation where the angle between the camera orientations is around 90 degrees, which usually implies severe self-occlusion and a lack of corresponding points in practice. The experiments on real data in Section 5.2 also show the validity of this approximation under practical camera viewing angles.

5.1.2. Testing robustness to noise

Without loss of generality, we used the four poses in Fig. 3 to analyze the robustness of our method to noise. Two cameras with different focal lengths and viewpoints were examined. As shown in Fig. 5, I1 and I2 are the images of poses P1 and P2 on camera 1, and I3, I4, I5 and I6 are the images of P1, P2, P3 and P4 on camera 2. We then added Gaussian noise to the image points, with σ increasing from 0 to 8 pixels. The errors E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6) were computed. For each noise level, the experiment was repeated for 100 independent trials, and the mean and standard deviation of both errors were calculated (see Fig. 5). As shown in the results, the two cases are distinguished unambiguously until σ increases to 4.0, i.e., up to possibly 12 pixels. Note that the image sizes of the subject were about 200 × 300, which implies that our method performs remarkably well under high noise.

Fig. 5. Robustness to noise: I1 and I2 are the images in camera 1, and I3, I4, I5 and I6 are the images in camera 2. Same and different actions are distinguished unambiguously for σ < 4.

Fig. 6. The distribution of cameras used to evaluate view-invariance and camera parameter changes.

Fig. 7. A pose observed from 17 viewpoints. Note that only the 11 body points in red color are used. The stick shapes are shown here for better illustration of the pose configurations and the extreme variability handled by our method. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.1.3. Performance in action recognition

We selected five classes of actions from CMU's MoCap dataset: walk, jump, golf swing, run, and climb. Each action class is performed by 3 actors, and each instance of 3D action is observed by 17 cameras, as shown in Fig. 6. The focal lengths were changed randomly in the range of 1000 ± 300. Fig. 7 shows an example of a 3D pose observed from 17 viewpoints.

Our dataset consists of a total of 255 video sequences, from which we generated a reference action database (DB) of 5 video sequences, i.e. one video sequence for each action class. The rest of the dataset was used as test data, and each sequence was matched against all actions in the DB and classified as the one with the highest score. For each sequence matching, 10 random initializations were tested and the best score was used. Classification results without weighting are summarized in Table 1. The overall recognition rate is 89.2%.

For weighting, we built a MoCap training dataset which consists of a total of 2 × 17 × 5 = 170 sequences for 5 actions (walk, jump, golf swing, run, and climb): each action is performed by 2 subjects, and each instance of an action is observed by 17 cameras at different random locations. We use the same set of reference sequences for the 5 actions as in the unweighted case, and align the sequences in the training set against the reference sequences. To obtain the optimal weighting for each action j, we first aligned all sequences against the reference sequence R_j, and stored the similarity scores of the line segments for each pair of matched poses. The objective function f_j(ω_1, ω_2, ..., ω_10) is then built based on Eq. (27) and the computed similarity scores of the line segments in the alignments. f_j(·) is a 10-dimensional function, and the weights ω_i are constrained by

0 \le \omega_i \le 1, \quad i = 1 \ldots 10; \qquad \sum_{i=1}^{10} \omega_i \le 1.    (32)

The optimal weights ⟨ω_1, ω_2, ..., ω_10⟩ are then searched for to maximize f_j(·), with the initialization at (1/11, 1/11, ..., 1/11). The conjugate gradient method is then applied to solve this optimization problem. After performing the above steps for all the actions, we obtained a set of weights W_j for each action j in our database. Classification results are summarized in Table 2. The overall recognition rate is 92.4%, which is an improvement of 3.2% compared to the unweighted case (see Tables 3 and 4).

    5.2.1. UCF-CIL datasetThe UCF-CIL dataset consists of video sequences of eight classes

    of actions collected on the internet (see Fig. 9): ballet fouette, balletspin, push-up exercise, golf swing, one-handed tennis backhandstroke, two-handed tennis backhand stroke, tennis forehandstroke, and tennis serve. Each action is performed by different sub-

    jects, and the videos are taken by different unknown cameras fromvarious viewpoints. In addition, videos in the same class of action

    Table 2Confusion matrix after applying weighting: large values on the diagonal entriesindicate accuracy. The overall recognition rate is 92.40%, which is an improvement of3.2% compared to the non-weighted case.

    Ground-truth Recognized as

    Walk Jump Golf swing Run Climb

    Walk 45 1 1 2 1Jump 2 47 1Golf swing 1 47 1 1Run 2 1 46 1Climb 1 1 2 46may have different starting and ending points, thus may be onlypartially overlapped. The execution speeds also vary in the se-quences of each action. Self-occlusion also exists in many of the se-quences, e.g., golf, tennis, etc.

    Fig. 8a shows an example of matching action sequences. Theframe rates and viewpoints of two sequences are different, andtwo players perform golf-swing action at different speeds. Theaccumulated score matrix and back-tracked path in dynamic pro-gramming are shown in Fig. 8c. Another result on tennis-serve se-quences is shown in Fig. 8b and d (see Fig. 10).

    We built an action database DB by selecting one sequence foreach action; the rest were used as test data, and were matchedagainst all actions in the DB. The action was recognized as theone with the highest matching score for each sequence. The confu-sion matrix is shown in Table 5, which indicates an overall 95.83%classication accuracy for real data. As shown by these results, our

    Walk Jump Golf swing Run Climb

    Walk 39 3 1 5 2Jump 4 44 1 1Golf swing 1 1 45 2 1Run 4 3 41 2Climb 8 3 1 3 35Table 3Confusion matrix for [45]. The overall recognition rate is 91.6%.

    Ground-truth Recognized as

    Walk Jump Golf swing Run Climb

    Walk 45 1 2 2Jump 2 47 1Golf swing 1 48 1Run 3 47Climb 6 2 42

    Table 4Confusion matrix for [31]. The overall recognition rate is 81.6%.

    e Understanding 117 (2013) 5876025 different views of 13 different actions, each performed threetimes by 11 different actors. We tested on actions, {1, 2, 3, 4, 5,8, 9, 10, 11, 12}. Similar to [7], we applied our method on all actorsexcept for Pao and Srikumar, and used andreas 1 undercam1 as the reference for all actions similar to [45]. The rest ofthe sequences were used to test our method. The recognition re-sults are shown in Table 10 for non-weighted case. The averagerecognition rate is 90.5%. For weighting, we tested each sequenceby randomly generating a reference dataset of 2 5 10 = 100 se-quences for 10 actions performed by two people observed fromve different viewpoints. The results are shown in Table 11. Theaverage recognition rate is 92.6%, which boosts 2.1% over thenon-weighted case. In addition, we compare our method to othersin Table 9. As can be seen, our method improves on each cameraview (see Tables 12 and 13).

Fig. 8. Examples of matching action sequences: (a) and (b) are two examples of golf-swing and tennis-serve actions. (c) and (d) show the accumulated score matrices and backtracked paths, resulting in the alignments shown in (a) and (b), respectively.

5.2.3. Testing occlusion

As discussed earlier, we handle occlusions by ignoring the line segments involving the occluded points. Since there are a total of 11 points in our body model, there are a total of 55 line segments. If, say, three points are occluded, there are still 28 line segments. While the non-weighted method would be expected to degrade when fewer line segments are used, weighting the line segments would still be able to differentiate between actions that depend on the non-occluded points. While our previous experiments implicitly involve self-occlusion, in this section we want to rigorously test our method when occlusion is present. In particular, we test the following scenarios: (i) The upper body is occluded, including the head and shoulder points. (ii) The right side of the body is occluded, including the shoulder, arm, hand, and knee points. (iii) The left side of the body is occluded, including the shoulder, arm, hand, and knee points. (iv) The lower body is occluded, including the knee and feet points. Therefore (i) has 3 occluded points and the rest of the test cases have 4 occluded points. The results are shown in Tables 14-17.

As can be seen from these results, our method is able to recognize actions even when such drastic occlusions are present. The few low percentages in the tables correspond to actions that are largely dependent on the occluded part. For instance, the kick action has a recognition rate of only 5.5% when the lower body is occluded. But this action is based solely on the lower part of the body, so it is not surprising that the recognition rate is low. In general, the recognition rates are lower because we are using a smaller number of line segments, and, more importantly, fewer points to compute the fundamental matrix (when 4 points are occluded, we are forced to use the 7-point algorithm [41]).

5.3. How soon can we recognize the action?

We also experimented with how soon our method is able to distinguish between different actions. This is helpful to gauge whether our method would be able to perform in real time, and has received attention from researchers such as [52,53]. To do this, we looked at all the correctly classified sequences; the results are summarized in Table 18. So, for instance, for action 1, on average we can detect the action after 60% of the sequence. The best case and the worst case are also provided.

Fig. 9. Examples from the UCF-CIL dataset consisting of 8 categories (actions) used to test the proposed method. Ballet fouettes: (1)-(4); ballet spin: (5)-(16); push-up: (17)-(22); golf swing: (23)-(30); one-handed tennis backhand stroke: (31)-(34); two-handed tennis backhand stroke: (35)-(42); tennis forehand stroke: (43)-(46); tennis serve: (47)-(56).

6. Discussions and conclusion

Table 19 gives a summary of the existing methods for view-invariant action recognition. In terms of the number of required cameras, the existing methods fall into two categories: multiple-view methods ([8,20], etc.) and monocular methods ([29,8,24,6,28,18] and ours). Multiple-view methods are more expensive, and less practical in real-life problems when only one camera is available, e.g. monocular surveillance. Many of these methods also make additional assumptions such as an affine camera model ([24,6,18]), which can be readily violated in many practical situations, or impose anthropometric constraints, such as isometry.

  • mage Understanding 117 (2013) 587602 599N. Ashraf et al. / Computer Vision and Iical poses are predened, or certain limbs trace planar areas duringactions; Sheikh et al. [6] assume that each action is spanned bysome action bases, estimated directly using training sequences.This implicitly requires that the start and the end of a test sequencebe restricted to those used during training. Moreover, the trainingset needs to be large enough to accommodate for inter-subjectirregularities of human actions.

In summary, the major contributions of this paper are: (i) We generalize the concept of fundamental ratios and demonstrate its important role in action recognition. The advantage of using line segments as opposed to triplets is that it introduces more redundancy and leads to better results. (ii) We compare transitions of two poses, which encodes the temporal information of human motion while keeping the problem at its atomic level. (iii) We decompose a human pose into a set of line segments and represent a human action by the motion of the 3D lines defined by these segments. This converts the study of non-rigid human motion into that of multiple rigid planar motions, thus making it possible to apply well-studied rigid-motion concepts and providing a novel direction for studying articulated motion. Our results confirm that using line segments considerably improves the accuracy of [31]. Of course, this does not preclude applying our ideas of line segments and weighting to other methods such as [45], where they may also improve accuracy in the same manner as for [31] (as studied in this paper). (iv) We propose a generic method for weighting body-point line segments, in an attempt to emulate the human foveated approach to pattern recognition. Results after applying this scheme indicate a significant improvement. This idea can likewise be applied to [45], and probably to a host of other methods, whose performance may improve in the same manner as that of [31], as shown in this paper. (v) We study how this weighting strategy can be useful when there is partial but significant occlusion. (vi) We also investigate how soon our method is able to recognize the action. (vii) We provide extensive experiments to rigorously test our method on three different datasets.

Fig. 10. Examples from the IXMAS dataset.

Table 5. Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 97.92%. The actions are denoted by numbers: 1 = ballet fouette, 2 = ballet spin, 3 = push-up, 4 = golf swing, 5 = one-handed tennis backhand, 6 = two-handed tennis backhand, 7 = tennis forehand, 8 = tennis serve. Correctly recognized sequences (diagonal entries): #1: 3, #2: 10, #3: 5, #4: 7, #5: 3, #6: 7, #7: 3, #8: 9; one #2 sequence is misclassified as #1.

Table 6. Confusion matrix after applying weighting: the overall recognition rate is 100%, an improvement of 2% compared to the non-weighted case. All sequences fall on the diagonal: #1: 3, #2: 11, #3: 5, #4: 7, #5: 3, #6: 7, #7: 3, #8: 9.

Table 7. Confusion matrix for [45]: all sequences fall on the diagonal (#1: 3, #2: 11, #3: 5, #4: 7, #5: 3, #6: 7, #7: 3, #8: 9).

Table 8. Confusion matrix for [31]: diagonal entries #1: 3, #2: 10, #3: 5, #4: 7, #5: 3, #6: 6, #7: 3, #8: 9; one #2 sequence is misclassified as #1 and one #6 sequence is misclassified.

Table 9. Recognition rates in % on the IXMAS dataset. Shen [45] and Shen [31] use the same set of body points as our method.

    Method                                 All    Cam1   Cam2   Cam3   Cam4   Cam5
    Fundamental ratios without weighting   90.5   92.0   89.6   86.6   82.0   78.0
    Fundamental ratios with weighting      92.6   94.2   93.5   94.4   92.6   82.2
    Weinland [46]                          83.5   87.0   88.3   85.6   87.0   69.7
    Weinland [30]                          57.9   65.4   70.0   54.3   66.0   33.6
    Junejo [49]                            72.7   74.8   74.5   74.8   70.6   61.2
    Liu [27]                               82.8   86.6   81.1   80.1   83.6   82.8

Only overall rates are reported for Shen [31] (85.6), Shen [45] (90.2), Tran [48] (80.2) and Farhadi [51] (58.1); Reddy [47] reports 72.6, 69.6, 69.2, 62.0, 65.1 and Liu [50] reports 76.7, 73.3, 72.0, 73.0, with the remaining entries not reported.

Table 10. Confusion matrix for the IXMAS dataset before applying weighting. Average recognition rate is 90.5%. The actions are denoted by numbers: 1 = check watch, 2 = cross arms, 3 = scratch head, 4 = sit down, 5 = get up, 8 = wave, 9 = punch, 10 = kick, 11 = point, and 12 = pick up.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   92.6   91.1   85.2   91.1   89.6   92.6   92.6   88.1   91.1   87.3

Table 11. Confusion matrix for the IXMAS dataset after applying weighting: the overall recognition rate is 92.6%, an improvement of 2.1% compared to the non-weighted case.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   94.8   91.1   87.2   92.6   92.6   92.6   92.6   91.1   92.6   89.6

Table 12. Confusion matrix for [45]. Average recognition rate is 90.23%.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   89.6   94.8   85.2   91.1   91.1   85.2   92.6   91.1   90.4   89.6

Table 13. Confusion matrix for [31]. Average recognition rate is 85.6%.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   85.2   89.6   82.1   78.4   89.6   90.4   89.6   82.1   91.1   82.1

Table 14. Confusion matrix when the head and the two shoulder points are occluded. The actions are the same as in Table 10.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   85.5   91.1   83.3   81.1   91.1   92.3   90.3   83.3   90.4   83.3

Table 15. Confusion matrix when the right side of the body is occluded, including the right shoulder, arm, hand, and knee points.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   83.3   54.5   5.5    58.8   61.3   3.3    10.3   79.1   5.6    16.1

Table 16. Confusion matrix when the left side of the body is occluded, including the left shoulder, arm, hand, and knee points.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   3.3    47.5   75.5   57.7   66.7   83.3   73.3   76.7   77.1   66.7

Table 17. Confusion matrix when the lower body is occluded, including the two knee points and the feet points.

    Action                 1      2      3      4      5      8      9      10     11     12
    Recognition rate (%)   86.6   83.3   78.1   45.2   54.8   81.1   79.3   5.5    78.1   36.6

Table 18. How soon we can recognize an action on the IXMAS dataset; each entry gives the best case, worst case and average case percentage of the sequence used.

    Action 1:  30, 88, 60      Action 2:  33, 77, 50      Action 3:  56, 91, 77      Action 4:  35, 67, 56      Action 5:  40, 77, 66
    Action 8:  56, 88, 69      Action 9:  48, 81, 63      Action 10: 45, 89, 77      Action 11: 60, 92, 78      Action 12: 37, 79, 55

Table 19. Comparison of different methods.

    Method   # of views   Camera model        Input                                                Other assumptions
    Ours     1            Persp. projection   Body points
    [45]     1            Persp. projection   Body points
    [46]     >=1          Persp. projection   3D HoG
    [30]     >=1          Persp. projection   Silhouettes
    [47]     All          Persp. projection   Interest points
    [51]     >1           Persp. projection   Histogram of the silhouette and of the optic flow
    [48]     All          Persp. projection   Histogram of the silhouette and of the optic flow
    [50]     1            Persp. projection   3D interest points
    [8]      >1           Persp. projection   Body points                                          Five pre-selected coplanar points, or limbs trace planar areas
    [7]      All          Persp. projection   Visual hulls
    [20]     >1           Persp. projection   Optical flow, silhouettes
    [24]     1            Affine              Body points
    [18]     1            Affine              Silhouettes
    [6]      1            Affine              Body points                                          Same start and end of sequences
    [29]     1            Persp. projection   Body points
    [49]     >=1          Persp. projection   Body points / optical flow / HoG

References

[1] V. Zatsiorsky, Kinematics of Human Motion, Human Kinetics, 2002.
[2] D. Gavrila, Visual analysis of human movement: a survey, CVIU 73 (1) (1999) 82–98.
[3] T. Moeslund, E. Granum, A survey of computer vision-based human motion capture, CVIU 81 (3) (2001) 231–268.
[4] T. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, CVIU 104 (2–3) (2006) 90–126.
[5] L. Wang, W. Hu, T. Tan, Recent developments in human motion analysis, Pattern Recognition 36 (3) (2003) 585–601.
[6] Y. Sheikh, M. Shah, Exploring the space of a human action, ICCV 1 (2005) 144–149.
[7] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, CVIU 104 (2–3) (2006) 249–257.
[8] V. Parameswaran, R. Chellappa, View invariants for human action recognition, CVPR 2 (2003) 613–619.
[9] A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, ICCV (2003) 726–733.
[10] G. Zhu, C. Xu, W. Gao, Q. Huang, Action recognition in broadcast tennis video using optical flow and support vector machine, LNCS 3979 (2006) 89–98.
[11] L. Wang, Abnormal walking gait analysis using silhouette-masked flow histograms, ICPR 3 (2006) 473–476.
[12] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, ICPR 3 (2004) 32–36.
[13] I. Laptev, S. Belongie, P. Perez, J. Wills, Periodic motion detection and segmentation via approximate sequence alignment, ICCV 1 (2005) 816–823.
[14] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proc. ICCV, vol. 2, 2005, pp. 1395–1402.
[15] L. Wang, D. Suter, Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model, in: CVPR, 2007, pp. 1–8.
[16] A. Bobick, J. Davis, The recognition of human movement using temporal templates, IEEE Transactions on PAMI 23 (3) (2001) 257–267.
[17] L. Wang, T. Tan, H. Ning, W. Hu, Silhouette analysis-based gait recognition for human identification, IEEE Transactions on PAMI 25 (12) (2003) 1505–1518.
[18] A. Yilmaz, M. Shah, Actions sketch: a novel action representation, CVPR 1 (2005) 984–989.
[19] A. Yilmaz, M. Shah, Matching actions in presence of camera motion, CVIU 104 (2–3) (2006) 221–231.
[20] M. Ahmad, S. Lee, HMM-based human action recognition using multiview image sequences, ICPR 1 (2006) 263–266.
[21] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, Proceedings of CVPR (2006) 1701–1708.
[22] F. Lv, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, Proceedings of CVPR (2007) 1–8.
[23] L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, A. Pentland, Invariant features for 3-d gesture recognition, FG (1996) 157–162.
[24] C. Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, IJCV 50 (2) (2002) 203–226.
[25] A. Farhadi, M.K. Tabrizi, I. Endres, D.A. Forsyth, A latent model of discriminative aspect, in: ICCV, 2009, pp. 948–955.
[26] R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, in: CVPR, 2012.
[27] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, in: CVPR, 2011, pp. 3209–3216.
[28] T. Syeda-Mahmood, A. Vasilescu, S. Sethi, Recognizing action events from multiple viewpoints, in: Proceedings of IEEE Workshop on Detection and Recognition of Events in Video, 2001, pp. 64–72.
[29] A. Gritai, Y. Sheikh, M. Shah, On the use of anthropometry in the invariant analysis of human actions, ICPR 2 (2004) 923–926.
[30] D. Weinland, E. Boyer, R. Ronfard, Action recognition from arbitrary views using 3d exemplars, in: ICCV, 2007, pp. 1–7.
[31] Y. Shen, H. Foroosh, View-invariant action recognition using fundamental ratios, in: Proc. of CVPR, 2008, pp. 1–6.
[32] A. Schütz, D. Braun, K. Gegenfurtner, Object recognition during foveating eye movements, Vision Research 49 (18) (2009) 2241–2253.
[33] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, 2004. ISBN 0521540518.
[34] G. Johansson, Visual perception of biological motion and a model for its analysis, Perception and Psychophysics 14 (1973) 201–211.
[35] V. Pavlovic, J. Rehg, T. Cham, K. Murphy, A dynamic Bayesian network approach to figure tracking using learned dynamic models, ICCV (1) (1999) 94–101.
[36] D. Ramanan, D.A. Forsyth, A. Zisserman, Strike a pose: tracking people by finding stylized poses, Proceedings of CVPR 1 (2005) 271–278.
[37] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people and recognizing their activities, Proceedings of CVPR 2 (2005) 1194.
[38] J. Rehg, T. Kanade, Model-based tracking of self-occluding articulated objects, ICCV (1995) 612–617.
[39] J. Sullivan, S. Carlsson, Recognizing and tracking human action, in: ECCV, Springer-Verlag, London, UK, 2002, pp. 629–644.
[40] J. Aggarwal, Q. Cai, Human motion analysis: a review, CVIU 73 (3) (1999) 428–440.
[41] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[42] Z. Zhang, C. Loop, Estimating the fundamental matrix by transforming image points in projective space, CVIU 82 (2) (2001) 174–180.
[43] R.I. Hartley, In defense of the eight-point algorithm, IEEE Transactions on PAMI 19 (6) (1997) 580–593.
[44] Y. Shen, H. Foroosh, View-invariant recognition of body pose from space-time templates, in: Proc. of CVPR, 2008, pp. 1–6.
[45] Y. Shen, H. Foroosh, View-invariant action recognition from point triplets, IEEE Transactions on PAMI 31 (10) (2009) 1898–1905.
[46] D. Weinland, M. Özuysal, P. Fua, Making action recognition robust to occlusions and viewpoint changes, in: ECCV, 2010, pp. 635–648.
[47] K.K. Reddy, J. Liu, M. Shah, Incremental action recognition using feature-tree, in: ICCV, 2009, pp. 1010–1017.
[48] D. Tran, A. Sorokin, Human activity recognition with metric learning, in: D. Forsyth, P. Torr, A. Zisserman (Eds.), ECCV, Lecture Notes in Computer Science, vol. 5302, Springer Berlin/Heidelberg, 2008, pp. 548–561.
[49] I. Junejo, E. Dexter, I. Laptev, P. Perez, View-independent action recognition from temporal self-similarities, IEEE Transactions on PAMI, preprints.
[50] J. Liu, M. Shah, Learning human actions via information maximization, in: CVPR, 2008, pp. 1–8.
[51] A. Farhadi, M.K. Tabrizi, Learning to recognize activities from the wrong view point, in: ECCV, 2008, pp. 154–166.
[52] S. Masood, C. Ellis, A. Nagaraja, M. Tappen, J. LaViola Jr., R. Sukthankar, Measuring and reducing observational latency when recognizing actions, in: The 6th IEEE Workshop on Human Computer Interaction: Real-Time Vision Aspects of Natural User Interfaces (HCI2011), ICCV Workshops, 2011.
[53] M. Hoai, F. De la Torre, Max-margin early event detectors, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
