
Mid-level Features Improve Recognition of Interactive Activities

Kate Saenko, Ben Packer, C.-Y. Chen, S. Bandla, Y. Lee, Yangqing Jia, J.-C. Niebles, D. Koller, L. Fei-Fei, K. Grauman, Trevor Darrell

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2012-209

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-209.html

November 14, 2012


Copyright © 2012, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

This research was supported in part by DARPA's Mind's Eye Program.


Mid-level Features Improve Recognition of Interactive Activities

K. Saenko1, B. Packer2, C.-Y. Chen3, S. Bandla3, Y. Lee3, Y. Jia1, J. Niebles2, D. Koller2, L. Fei-Fei2, K. Grauman3, and T. Darrell1

1UC Berkeley, 2Stanford, 3UT Austin

Abstract

We argue that mid-level representations can bridge the gap between existing low-level models, which are incapable of capturing the structure of interactive verbs, and contemporary high-level schemes, which rely on the output of potentially brittle intermediate detectors and trackers. We develop a novel descriptor based on generic object foreground segments; the representation is a histogram of gradients grounded in the frame of detected key-segments. Importantly, our method does not require objects to be identified reliably in order to compute a robust representation. We evaluate an integrated system including novel key-segment activity descriptors on a large-scale video dataset containing 48 common verbs, for which we present a comprehensive evaluation protocol. Our results confirm that a descriptor defined on mid-level primitives, operating at a higher level than local spatio-temporal features but at a lower level than trajectories of detected objects, can provide a substantial improvement relative to either alone or to their combination.

1 Introduction

Broadly speaking, competing lines of research on activity recognition have focused on either "AI" based approaches, which exploit high-level models that involve explicit detection of objects, people, and pose as an intermediate representation, or "learning" based models, which exploit low-level methods such as point trajectories and local bag-of-features models. At the same time, the empirical challenge problems that define the field have been progressing from relatively simple activities (e.g., run, jump, walk) to those that involve complex, structured events and the interaction of multiple people and/or multiple objects (e.g., exchange, hand-over, lead). These latter "interactive" activities are the most valuable for many real-world applications, but have previously been the subject of relatively limited evaluation efforts.

Performance using low-level features and learning-based methods has been outstanding in many cases in earlier evaluations, but these new datasets provide


[Figure 1 diagram: an input video is processed into low-level features (motion), mid-level descriptors (key-segments), and high-level predicates (object/person tracks, e.g., car, person); each level feeds an SVM, and the SVM outputs are combined by an ensemble.]

Figure 1: Mid-level descriptors based on generic object key-segment regions offer a trade-off between the robustness of low-level models and the structure captured by high-level models. We develop a new system which combines low-level motion features, mid-level key-segment descriptors, and high-level predicates on objects/people and their relationships, on a large-scale dataset containing 48 categories of interactive activities such as, in this example, approach.

a challenge: can low-level models really suffice when considering structured interactive activities? It would seem that rich intermediate representations must be essential for recognizing interactive verbs such as open, for example, given the broad range of instantiations of a semantic category (e.g., compare opening a door to opening a bottle).

Several authors have recently attempted to address this question in the case of static action/activity recognition [43, 25], motivated by the PASCAL activity recognition challenge [9]; however, our focus is on dynamic activities as revealed in video sequences, as in the classic activity recognition benchmarks of KTH, UCF Sports, Olympics, etc. [18, 36, 28]. Dynamic interactive activity datasets (video corpora that include multiple people interacting in various roles) are rarer, and we focus our development on the recently introduced, publicly downloadable "Visual Intelligence" (VISINT) activity dataset [1].

Recent results suggest that high-level models (those that operate on representations formed over tracked objects, object attributes, and/or interaction predicates between objects) are quite powerful for recognizing dynamic interactive activities [38, 5, 41]. A recent model for recognition based on interaction primitives [30] was shown to offer strong performance on the VISINT corpus; this model exploited appearance and STIP primitives defined on person and object trajectories obtained using a deformable part model detector [11]. Such models are powerful, but they still fail to track all types of objects or to remain robust across many observation conditions.

To improve robustness on complex activities that are difficult to track comprehensively using existing detectors, we introduce a novel mid-level activity descriptor based on generic object segments. We leverage the key-segments method of [22], and compute a descriptor based on static and dynamic properties of a detected key segment.

We combine this novel representation with two existing methods: a low-level latent state temporal sequence model [28], and a high-level model based on


a sequence of structured features including primitive representations of object state and person-object interaction [30]. We provide a comprehensive evaluation of these representations, separately and in combination, on the VISINT corpus. We show that the mid-level representation offers a clear improvement when combined with the previous techniques, and the best performance is obtained with the fusion of all three methods. Intuitively, we believe that the key-segments method helps in the cases where higher-level models are required (activities with structured interaction) but where the primitive trackers have failed due to object occlusion or appearance variation.

The dataset we use for evaluation is large and complex, and a further contribution of this paper is the protocols we designed for evaluating action recognition. In particular, humans do not always agree on verb presence when the verbs are defined most broadly; we propose a new metric for evaluating system agreement with noisy human labels. We hope this dataset, along with the protocol we developed, will be used by others to advance the state of the art in interactive activity recognition.

2 Background

Many researchers have adopted activity representations using low-level tracked points in a video as features [27, 29, 40, 39, 7, 20, 42, 24, 12]. Such a representation is prone to errors in tracking, which is especially true in the presence of background clutter, but it avoids the difficult task of object and person detection. A number of current approaches entail the use of local space-time interest points [39, 7, 29, 20, 4, 24, 42, 6, 26, 15]; several build representations using visual vocabularies computed with gradient-based descriptors extracted at detected interest points [7, 20, 6, 39, 42], while others build descriptors from the point positions themselves [4, 12]. The advantages of combining both static and dynamic descriptors have also been demonstrated [29, 24, 26, 15]. The strategy of generating compound neighborhood-based features, explored initially for static images and object recognition [44, 33, 21, 23, 32], has since been extended to video [12, 6, 42, 20]. Various approaches either subdivide the space-time volume globally using a coarse grid of histogram bins [20, 6, 42, 15], or place grids around the raw interest points and compute a new representation using the positions of the interest points that fall within the grid cells surrounding that central point [12].

Another line of work attempts to describe activities using intermediate predicates. At the mid-level, several approaches represent activities in terms of spatio-temporal shapes or segments [17, 3]. At the higher level, methods represent actions by the positions and velocities of an entire object using either a bounding-box detector [14] or a parts-based model [34, 35, 31, 13, 37]. Although more intuitive, these representations suffer from the inherent inaccuracy of bounding-box detection and from the fact that they do not model the entire scene, and hence cannot exploit contextual and geometric cues. On the other hand, the high-level approach adopted in our paper [30] can model activity



Figure 2: The key-segments output (top row) automatically generates space-time object segments that appear most central to the activity in the entire video clip. Unlike simple background subtraction (bottom row), they can distinguish the shapes of adjacent foreground objects, and extract object-like regions that do not move.

using both semantic and relational modules.

3 A Mid-level Activity Representation based on Key-Segment Descriptors

A key challenge in realistic scenarios is that the system cannot know entirely in advance what objects may appear during an activity. While it is natural to search for people and an a priori bank of other very common objects, the system must be flexible enough to discover novel objects that appear and include them in the activity description. Thus, our approach to obtaining a mid-level representation is to exploit multiple foreground segmentations, each corresponding to a unique human or object, but without having to detect or identify that object. To this end, we adapt a recent approach for key-segment discovery [22]. The method takes an unannotated video as input and returns, as output, a ranked list of hypothesized space-time segmentations of the salient "object-like" regions, i.e., the regions that appear most central to the activity [22].

Briefly, the key-segment extraction method we use works as follows. Given an unannotated video, we compute an initial pool of bottom-up regions. Then, we rank that pool according to how "human-like" and "object-like" each region appears. The former is based on a region's overlap with high-scoring person detections (we use [11]). The latter is based on the extent to which the region exhibits (1) appearance cues typical of objects in general (e.g., boundary strengths, probability of belonging to a vertical surface [8]), and (2) differences in motion patterns relative to its surroundings [22]. We cluster the top-ranked regions across all frames to form multiple key-segment hypotheses.


Figure 3: Overview of the mid-level segmentation descriptor. (a) Key-segments at frames t and t+1, where x's denote centroids. (b) Our descriptor encodes the segment's appearance (using quantized pHOG), its displacement, and its displacement angle in the next frame. We compute descriptors for each segment in all frames in the video; each increments a single bin in our final 3D appearance-motion histogram of the video. Best viewed in color.

Each hypothesis defines a foreground color likelihood model within a space-time Markov Random Field (MRF), where each node is a pixel and each edge connects adjacent pixels in space and time. We partition this graph with graph-cuts to obtain a pixel-wise segmentation for the discovered object/human as it moves over time. We compute the segmentations in order of hypothesis rank, and enforce non-overlap between the selected key-segments, such that each hypothesis corresponds to a unique human/object in the video. See Figure 2, and [22] for complete details.

With the key-segmentations in hand, we now want to describe each discovered object. This descriptor should capture the appearance and motion patterns, and ideally exploit the shape-based nature of the extracted segments (which contrasts with the cues a bounding-box detector would provide). To this end, we design a novel mid-level descriptor that summarizes the shape of the object-like regions as well as their frame-to-frame motion trajectories over the entire video clip. We process each space-time segmentation hypothesis separately, and then combine their features to create a single representation for the video.

To capture appearance, we compute a series of 2D pyramid of HOG (pHOG) descriptors, one for each frame where the segment appears. Each descriptor is computed on a window that tightly fits the foreground segment in that frame, with the background pixels zeroed out beforehand in order to capture the outer shape in addition to the internal contours. We then quantize the descriptors into 50 pHOG-words using k-means. To capture motion, we compute the difference in position and angle between the foreground segments in adjacent frames.


We quantize the positions into 10 bins, in 2-pixel increments from 0 pixels (i.e., [0, 2), [2, 4), ..., [18, ∞)); the bins range from small displacements to very large displacements. We quantize the angles into 4 bins, in π/4 increments from π/4; the bins correspond to up, down, left, or right (see Figure 3).

Using these measurements, we create a 3D histogram whose dimensions correspond to appearance, distance, and angle, respectively. The size of the histogram is 50 × 10 × 4. Each segment increments a single bin in the histogram. We aggregate the contributions of all segments in all space-time segmentation hypotheses to create a single histogram representation for the video. Finally, since some videos may generate no key-segment hypotheses due to missed person detections or high overlap in color distribution between the foreground and background models (which can lead to the foreground being "smoothed out"), we augment the mid-level descriptor with a histogram on the clip's space-time interest points and HoG/HoF features [20]. We train binary SVM classifiers using the resulting histograms to distinguish each verb against the rest.
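As a concrete illustration of this binning, below is a minimal numpy sketch of the 50 × 10 × 4 appearance-motion histogram. It is a sketch under stated assumptions, not the original implementation: the function names, the exact convention for which angle bin counts as up/down/left/right, and the handling of clips with no observations are our own choices.

```python
import numpy as np

N_WORDS, N_DIST, N_ANG = 50, 10, 4   # appearance x displacement x angle bins


def descriptor_bin(phog_word, dx, dy):
    """Map one segment observation (frame t -> t+1) to a (word, dist, angle) bin.

    phog_word: index of the quantized pHOG codeword (0..49).
    dx, dy:    centroid displacement in pixels between adjacent frames.
    Distance bins are 2 px wide ([0,2), [2,4), ..., [18,inf)); the four angle
    bins correspond to the four principal directions.
    """
    d_bin = min(int(np.hypot(dx, dy) // 2), N_DIST - 1)
    ang = np.arctan2(dy, dx) % (2 * np.pi)            # angle in [0, 2*pi)
    a_bin = int(((ang + np.pi / 4) % (2 * np.pi)) // (np.pi / 2))
    return phog_word, d_bin, a_bin


def video_histogram(observations):
    """observations: list of (phog_word, dx, dy) tuples pooled over all
    key-segment hypotheses and frames of one clip.  Returns the flattened,
    L1-normalized 50x10x4 histogram for that clip."""
    hist = np.zeros((N_WORDS, N_DIST, N_ANG))
    for word, dx, dy in observations:
        hist[descriptor_bin(word, dx, dy)] += 1       # each observation hits exactly one bin
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)                # guard against empty clips
```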

4 An Activity Recognition System using Key-Segment Descriptors

We propose a multi-tier system design incorporating several levels of representation of increasing semantic richness. Our architecture is comprised of the novel mid-level description scheme defined in the previous section, as well as low- and high-level models based on previously reported methods. Our low-level representation employs a discriminative statistical sequence model built on top of sets of low-level spatio-temporal interest point (STIP) descriptors. Our high-level tier is a generative probabilistic sequence model incorporating high-level structural representations including person and object relationships. We combine the outputs of these components using a max-margin fusion scheme.

4.1 Low-level model

The algorithm implemented by our low-level tier is based on the method for activity classification described in [28]. This model is based on a framework for modeling motion by exploiting the temporal structure of human activities. The model represents activities as temporal compositions of motion segments, and a discriminative model is trained that encodes a temporal decomposition of video sequences, with STIP-based appearance models [18] for each motion segment. In recognition, a query video is matched to the model according to the learned appearances and motion segment decomposition. Classification is performed using a latent template max-margin model, based on the quality of matching between the motion segment classifiers and the temporal segments in the query sequence. The model is comprised of a set of motion segment classifiers, each operating over a histogram of quantized interest points extracted from a temporal segment whose length is defined by the classifier's temporal scale.


In addition to the temporal scale, each motion segment classifier also specifies a temporal location centered at its preferred anchor point. Lastly, the motion segment classifier is enriched with a flexible displacement model that captures the variability in the exact placement of the motion segment within the sequence. The associated learning and inference procedures for this model are described in [28].
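To make the structure of this model more tangible, here is a heavily simplified scoring sketch in the spirit of the description above. It is not the inference procedure of [28]: it ignores latent learning, any ordering constraints between segments, and the exact form of the displacement model, and all parameter names are assumptions.

```python
import numpy as np


def match_score(frame_bows, segment_classifiers):
    """Score a query video against a set of motion-segment classifiers.

    frame_bows:          (T, V) array; per-frame histograms of quantized STIPs.
    segment_classifiers: list of (w, anchor, scale, disp_penalty) tuples, where
                         w is a (V,) weight vector, anchor is the preferred
                         temporal location (fraction of the clip), scale is the
                         segment length in frames, and disp_penalty weights a
                         quadratic cost on deviating from the anchor.
    """
    T = frame_bows.shape[0]
    total = 0.0
    for w, anchor, scale, disp_penalty in segment_classifiers:
        best = -np.inf
        for t in range(max(T - scale, 0) + 1):
            bow = frame_bows[t:t + scale].sum(axis=0)   # histogram over the temporal segment
            disp = t / max(T - 1, 1) - anchor           # displacement from the anchor point
            best = max(best, float(w @ bow) - disp_penalty * disp ** 2)
        total += best                                   # keep the best placement per segment
    return total
```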

4.2 High-level model

The high-level model evaluated here is based on the joint activity recognition and object tracking method of [30], which presents a model for understanding the interactions between humans and objects while performing an action over time. The model uses an "in the hand" interaction primitive and represents a variety of actions in which an object is manipulated. This representation not only allows for recognition of actions in sequences, but is also able to provide improved localization of the object of interest. The outputs of monocular object and person detectors are used as input to the Pose-Transition-Feature model of [30]. This model captures the spatial trajectory of the person and surrounding visual features, and is run on the person trajectory of a sequence to produce a score for each verb.

We then use the person and object trajectories and define three additional interaction primitives with respect to the object: "moving toward," "moving away from," and "touching." The primitives are observed in each frame, and we collect the counts of the primitives and of transitions between primitives in adjacent frames as additional features. For a given sequence, we concatenate the Pose-Transition-Feature scores with the primitive counts and the following object-person interaction features: the percentage of frames each object-person primitive was active, the spatial variance of the object's trajectory, the spatial variance of the offset between the person and the object, an indicator variable for the identity of the object, and the average object detection score in the sequence (to help the classifier ignore the object if it was not strongly detected). Using this feature vector, an SVM with a quadratic kernel is trained for each action separately.
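For clarity, the following is a rough sketch of how the per-sequence feature vector just described could be assembled. The variable names, array shapes, and the one-hot encoding of the object identity are our own assumptions; the actual primitive detections and Pose-Transition-Feature scores come from the model and trackers of [30].

```python
import numpy as np

PRIMITIVES = ["moving_toward", "moving_away_from", "touching"]


def interaction_feature_vector(prim_seq, person_traj, obj_traj, obj_scores,
                               obj_id, n_object_classes, ptf_scores):
    """prim_seq:    per-frame primitive name (one of PRIMITIVES).
    person_traj: (T, 2) person centroids; obj_traj: (T, 2) object centroids.
    obj_scores:  per-frame detection score of the chosen object.
    obj_id:      index of the chosen object class; ptf_scores: per-verb
                 Pose-Transition-Feature scores from the high-level model."""
    T = len(prim_seq)
    idx = [PRIMITIVES.index(p) for p in prim_seq]
    counts = np.bincount(idx, minlength=len(PRIMITIVES)).astype(float)
    # counts of transitions between primitives in adjacent frames
    trans = np.zeros((len(PRIMITIVES), len(PRIMITIVES)))
    for a, b in zip(idx[:-1], idx[1:]):
        trans[a, b] += 1
    active_frac = counts / T                                   # % of frames each primitive is active
    obj_var = np.var(obj_traj, axis=0).sum()                   # spatial variance of object trajectory
    offset_var = np.var(obj_traj - person_traj, axis=0).sum()  # variance of person-object offset
    obj_onehot = np.eye(n_object_classes)[obj_id]              # indicator of the object identity
    return np.concatenate([ptf_scores, counts, trans.ravel(), active_frac,
                           [obj_var, offset_var, float(np.mean(obj_scores))],
                           obj_onehot])
```

An SVM with a quadratic (degree-2 polynomial) kernel would then be trained on such vectors for each verb, e.g. scikit-learn's SVC(kernel="poly", degree=2).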

4.3 Max-margin fusion

While the three individual models above have very different internal representations, they share a common max-margin learning framework, and thus are suitable to be fed into an ensemble pipeline. Specifically, we adopt a late fusion scheme by training a one-vs-all SVM with each of the low-, mid-, and high-level features separately, and then combining their outputs via a soft voting scheme. To combine the outputs, we first convert the prediction of each SVM to a pseudo-likelihood with a softmax function [2]:

p^c_i(x) = exp(m^c_i(x)) / Σ_j exp(m^c_j(x)),

where m^c_i is the output for class i from classifier c (c ∈ {L, M, H}). Our fused prediction is then based on a mixture of experts model [16]:

p*_i(x) = Σ_c w_c p^c_i(x),

where the mixing coefficients w_c (with Σ_c w_c = 1) are found via cross-validation on the training data.
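A minimal sketch of this late-fusion step, assuming the raw one-vs-all SVM outputs are stored as one matrix per tier (rows = videos, columns = verbs); the weight search is left out and the names are illustrative.

```python
import numpy as np


def softmax_rows(scores):
    """Convert raw SVM outputs m_i^c(x) into pseudo-likelihoods p_i^c(x)."""
    z = scores - scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fuse(low, mid, high, weights):
    """Soft-voting fusion p*_i(x) = sum_c w_c p_i^c(x), with the mixing weights
    (w_L, w_M, w_H) summing to one; in the paper they are chosen by
    cross-validation on the training set."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[0] * softmax_rows(low)
            + w[1] * softmax_rows(mid)
            + w[2] * softmax_rows(high))
```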

5 Data and Evaluation Procedure

5.1 Interaction Dataset

We based our evaluation on the publicly available VISINT dataset [1]; this dataset was recently collected to aid the development of action recognition methods, and represents a variety of commonplace interactions between humans that far exceeds that of previous datasets. Hired actors performed 10 exemplars of each action in outdoor scenes such as parks and streets, each 14 sec. long on average. Each exemplar action was shot 16 times, varying the following dimensions: urban/park, daylight/evening, close/far field, and center/side location within the frame. The full dataset was split into training and testing portions; a smaller subset containing fewer verbs was also used in some of our experiments. For the full set, we split the data into 3480 training and 1294 testing videos, and for the subset, we used 314 training and 86 testing videos. The dataset is provided with ground-truth labels indicating, per video, whether each verb in the corpus is present or absent; these labels represent non-expert human responses from Amazon Mechanical Turk (AMT) to questions of the form "Did you see X?", asked separately for each verb X.

5.2 Computing Annotator Agreement

While care was taken to control present/absent verb label quality, this was mostly aimed at removing malicious workers and not at enforcing agreement. In fact, annotating verbs in videos, without specific annotator training beyond the broad verb definition, is a highly subjective task. Some annotators have an over-detection bias, answering "yes" to the question "Did you see X?" even if the action appeared briefly and was accompanied by many other actions. Others answer "yes" only if the action was central to the video sequence. Therefore, the resulting binary labels are noisy and not reliable sources of training data for traditional binary discriminative classifiers.

Fortunately, the dataset contains 16 variations per exemplar of the same action (see Section 5.1). This allows us to combine the human responses for the 16 unique videos that represent the variants of an exemplar action, effectively resulting in a score from 0 to 16 for that action for each of the 48 verbs.¹

¹ A few videos that did not have all 16 variants were regarded as not having a label and were removed from evaluation.


Agreement   Positive           Negative           Use in Dataset
50%/75%     ≥8/16 said yes     ≤4/16 said yes     subset, full
62%/87%     ≥10/16 said yes    ≤2/16 said yes     subset, full
93%/93%     ≥15/16 said yes    ≤1/16 said yes     full

Table 1: Levels of agreement used in our experiments to map Turker votes into binary labels.

verb        BOUNCE  HAVE  HIT  HOLD  KICK  LEAVE  TOUCH  WALK  Total
yes (≥10)   12      16    12   16    16    16     33     13    134
no (≤2)     37      12    29   28    41    20     16     36    219

Table 2: The number of labels at the 62%/87% level of agreement in the training portion of the subset.

To obtain binary present/absent labels from the tallied votes, we tried several agreement thresholds, in order of increasing conservatism: (1) treat as positive videos for which 8 or more of the 16 annotators said the verb was present, and as negative those where 4 or fewer said the verb was present; (2) treat as positive those with 10 or more "yes" votes, and as negative those with 2 or fewer; and (3) treat as positive those with 15 or 16 votes, and as negative those with at most 1 vote. These are summarized in Table 1. The most stringent level did not produce enough labels in the subset (10 or more positive and negative per verb). A list of the verbs in the subset and their number of yes and no labels at the second level of agreement is shown in Table 2.
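The mapping from vote tallies to binary labels can be summarized by the short helper below (our own illustrative code; the thresholds follow Table 1, and videos falling between the two thresholds receive no label).

```python
# (min yes-votes for a positive label, max yes-votes for a negative label)
AGREEMENT_LEVELS = {
    "50%/75%": (8, 4),
    "62%/87%": (10, 2),
    "93%/93%": (15, 1),
}


def votes_to_label(yes_votes, level="62%/87%"):
    """Map a 0-16 vote tally to 1 (present), 0 (absent), or None (ambiguous)."""
    pos_min, neg_max = AGREEMENT_LEVELS[level]
    if yes_votes >= pos_min:
        return 1
    if yes_votes <= neg_max:
        return 0
    return None   # ambiguous videos are excluded from training and evaluation
```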

5.3 Evaluation Metrics

We use both a traditional detection metric (mean average precision) to evaluate w.r.t. the binary labels, and propose a divergence-based metric to capture how well the predicted likelihood of an action agrees with the distribution of human judgments.

Mean Average Precision (mAP): mAP is traditionally used to evaluate detection of binary labels, and is better suited to unbalanced data than accuracy. The AP for a binary detection problem is defined as the average precision obtained by varying a threshold (sensitivity) of the classifier, where precision is the number of true positives divided by the total number of assigned positive labels at a particular threshold. The mAP is then defined as the mean AP across all binary problems (across all verbs in our case).
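A sketch of how this metric could be computed, using scikit-learn's average_precision_score as a stand-in for the threshold-sweeping definition above; the rule for skipping verbs without both label types is our assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def mean_average_precision(labels_per_verb, scores_per_verb):
    """labels_per_verb[v]: binary present/absent labels for verb v;
    scores_per_verb[v]: the corresponding classifier scores.  AP is computed
    per verb and averaged; verbs lacking either positives or negatives are skipped."""
    aps = []
    for y, s in zip(labels_per_verb, scores_per_verb):
        y = np.asarray(y)
        if 0 < y.sum() < len(y):
            aps.append(average_precision_score(y, s))
    return float(np.mean(aps))
```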

Mean JS Divergence (mJSD): mAP is limited to hard binary labels and cannot measure the distance between soft labels (probabilities) of each verb being present, as given by the human votes. It is important to note that, unlike in most object recognition tasks, the verb labels for action recognition are not mutually exclusive; e.g., a set of responses for a video may have p("go") = 0.9 and p("walk") = 0.9. Thus computing a distance between the overall distribution of responses for all verbs is incorrect, as it does not constitute a probability distribution of a single random variable, but rather a set of distributions of several random variables corresponding to the presence or absence of each verb. Thus, we propose to compute the distance separately for each label, using the Jensen-Shannon divergence (JS-divergence). Let Q denote the Bernoulli distribution of a verb being present given the human response data, and P denote the system response distribution, both of which have been normalized. Then the KL-divergence is given by

D_KL(P || Q) = Σ_i P(i) log ( P(i) / Q(i) ).

JS-divergence compares the two distributions P and Q to their mean M = (P + Q) / 2 as follows:

D_JS(P || Q) = (1/2) D_KL(P || M) + (1/2) D_KL(Q || M).

JS-divergence has a number of advantages over the KL-divergence: (a) it can handle zero probabilities, and is less sensitive to small numerical values; (b) it is symmetric and thus a true distance; (c) it is bounded in [0, 1], whereas KL-divergence is unbounded. Finally, the mean JS divergence (mJSD) metric for a test video is calculated by averaging over verbs.
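A small sketch of the mJSD computation for one test video, treating each verb as a Bernoulli variable as described above. Using a base-2 logarithm keeps the divergence in [0, 1]; whether the original evaluation used base 2 or the natural log is not stated, so that choice is an assumption, as are the function and argument names.

```python
import numpy as np


def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (base-2 log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))


def js(p, q):
    """Jensen-Shannon divergence: compare P and Q to their mean M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mjsd(human_yes_frac, system_prob):
    """Mean JS divergence over verbs for one test video.
    human_yes_frac[v]: fraction of annotators who said verb v was present;
    system_prob[v]:    normalized system confidence that verb v is present."""
    return float(np.mean([js([h, 1.0 - h], [s, 1.0 - s])
                          for h, s in zip(human_yes_frac, system_prob)]))
```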

5.4 Procedure

The following sections describe the details of the experimental setup specific to each system component, such as the processing applied to the videos, which inputs are given to the system at each level, and the model parameters used in the experiments.

Low-level model settings: We subsample all videos to a size of 640 × 360. We use spatio-temporal interest points detected with the 3D Harris corner method [18] and described with HOG/HOF descriptors, using the binaries available at [19]. We run the detector on each video with the parameters: -res 2. If no detections are found (e.g., the resolution is too low to capture the motion in the video), we run the detector again with the parameters: -res 4. We then uniformly sample 200 videos from the training set and use all their local feature descriptors to form a codebook. The codebook is computed using k-means with k = 500. Finally, we train binary classifiers for each verb separately using the agreement labels, with a fixed number of K = 3 motion segments per model.
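The bag-of-words part of these settings could look roughly like the following scikit-learn sketch; the function names and the way training clips are sampled are illustrative, and the actual STIP detection uses the binaries of [19].

```python
import numpy as np
from sklearn.cluster import KMeans


def build_codebook(sampled_descriptor_sets, k=500, seed=0):
    """Pool the HOG/HOF descriptors of a sample of training videos (200 in our
    setup) and cluster them into a k-word codebook."""
    X = np.vstack(sampled_descriptor_sets)
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)


def bow_histogram(codebook, clip_descriptors):
    """Quantize one clip's STIP descriptors against the codebook and return an
    L1-normalized word histogram, ready for a per-verb binary classifier."""
    words = codebook.predict(clip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```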

Mid-level model settings: For efficiency, we generate the initial region pool on 40 frames uniformly sampled from the video. We compute up to four segmentations per video (two human, two other generic objects). For the color-likelihood models we use 5 fg and 10 bg GMMs and the RGB color space. We normalize the histograms to sum to 1, and use χ² kernels for the SVMs, with C = 100.
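These classifier settings correspond roughly to the following scikit-learn sketch; the χ² kernel bandwidth gamma is not specified in the text, so its value here is an assumption.

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC


def train_chi2_svm(train_hists, labels, C=100.0, gamma=1.0):
    """Binary SVM with a chi-squared kernel on the L1-normalized histograms."""
    K_train = chi2_kernel(train_hists, gamma=gamma)
    return SVC(kernel="precomputed", C=C).fit(K_train, labels)


def chi2_svm_scores(clf, train_hists, test_hists, gamma=1.0):
    """Decision scores for test clips (kernel computed against the training clips)."""
    K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)
    return clf.decision_function(K_test)
```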

High-level model settings: Object and person detections for the model were computed using the DPM model [10]; a Viterbi tracker was run on top of the person detections to provide a trajectory for the primary person of interest in the sequence. We selected which single object was present in the sequence according to which object had the highest maximum detection score in each frame, averaged across frames. We included a second person as a possible object, only using detections that were outside of the trajectory found for the primary person. A Viterbi tracker was similarly run on the chosen object to produce an object trajectory.


Codewords of STIP features from the low-level model were used as local appearance features near the person track.
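The object-selection rule in these settings amounts to the small helper below (our own illustrative code; det_scores would come from the per-class DPM detections).

```python
import numpy as np


def select_object(det_scores):
    """det_scores[c] is a (T,) array holding, for each frame, the highest
    detection score of object class c.  The class whose per-frame maxima
    average highest over the clip is kept as the single object of interest."""
    classes = list(det_scores)
    mean_scores = [float(np.mean(det_scores[c])) for c in classes]
    return classes[int(np.argmax(mean_scores))]
```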

5.5 Results

In addition to the results obtained by the different levels of representation, we also show several baselines: assigning random probabilities to the verbs (Random), assigning the prior probability as measured on the training set (Prior), and a baseline k-nearest-neighbor classifier with k = 1 (Baseline). The latter retrieves the training examples with the nearest STIP feature vectors to the input video and returns the averaged human verb distributions.

Effect of Agreement: First, we investigate the effect of annotator agreement on classifier performance. Figure 4 shows the mAP score obtained by Baseline and LowLevel, as well as the random baselines, using labels obtained by requiring increasing levels of agreement (Table 1). The trend is that stricter agreement produces more accurate results; however, the strictest level (15/16, or 93% for both yes and no, i.e., all but one annotator must agree) results in a smaller set of available training labels, producing lower accuracy. The second level (62% for yes and 87% for no) is therefore optimal, and we report results only with these labels for the rest of the paper. Note that, because each agreement level produces different sets of binary labels, the number of verbs that have sufficient labeled positive and negative examples changes: e.g., at the 50%/75% level, 47 verbs had sufficient (10 or more) positive and negative examples in the training set.

Activity Recognition: We first compare the performance of each level of representation on the subset, which at the 62%/87% agreement level contains sufficient labels for bounce, have, hit, hold, kick, leave, touch, and walk. These verbs were chosen as a representative set, and the high-level relational primitives were designed with these verbs in mind. Table 3 (columns labeled "subset") shows the mAP and mJSD obtained by the models. Overall, the results are encouraging: all models achieved scores well above chance performance and significantly higher than those of Baseline. The highest single-component mAP is obtained by the high-level model (0.75). We believe its good performance is explained by its use of interaction features. Figure 5 compares the performance by verb in terms of the mAP score. Here we see the complementarity of the "pixel" vs. "predicate" approaches: the high-level model does significantly better on the verbs bounce, have, and hit, and worse on touch and walk. In particular, its poor performance on walk can be explained by the fact that walk does not involve interaction between humans and objects.

Finally, we evaluate performance on the full dataset. With the 62%/87% agreement labels, this amounts to 46 verbs (all but move and bury). The results are shown in Table 3 (columns labeled "full"). Here the low-level model outperforms the others; the lower performance of the high-level model may be because the primitives used are not sufficient to capture actions beyond the eight verbs they were designed for (extending the high-level primitives to more verbs is part of future work). Finally, we combine all three representation levels, obtaining the best overall results: mAP = .81, mJSD = .0397 on the


[Figure 4 plot: verb recognition mAP of Random, Prior, Baseline, and LowLevel at the 50%/75%, 62%/87%, and 93%/93% agreement levels; the mAP axis ranges from 0 to 0.7.]

Figure 4: Comparison of verb recognition mAP at increasing levels of annotator agreement.

[Figure 5 plot: per-verb mAP (bounce, have, hit, hold, kick, leave, touch, walk) for Baseline, LowLevel, MidLevel, and HighLevel; the mAP axis ranges from 0 to 1.0.]

Figure 5: Breakdown by verb of the mAP obtained on the subset, using annotator agreement 62%/87%.

subset, and mAP = .71, mJSD = .0198 on the full dataset. The fact that the combined model performs significantly better than any single level alone indicates that there is additional information in the low-level features that is not being exploited by the intermediate mid- and high-level models.

Our evaluation in terms of the divergence between the predicted verb probability and the human judgements reveals that the lowest divergence is also obtained by the combined three-level scheme. However, the behaviour of this metric is different from that of mAP. Notice that using the verb prior to predict labels (Prior) only improves mAP marginally, but cuts the mJSD to a third on the full dataset. This suggests that the prior distribution of certain verbs (e.g., go is likely to be present in most videos) may play a larger role in cases where humans do not agree.

6 Conclusion

Based on our results, we argue that intermediate representations should be used in addition to low-level features to obtain the best performance; one reason high-level models fail is that they throw away useful information available in the sequence by committing to the (possibly erroneous) object track. We presented a novel mid-level representation based on generic object key-segments found in video sequences; our approach combines elements of low- and high-level representations. While the key-segments approach alone was not the strongest model, in concert with the other paths it significantly improved performance.


model            mAP (subset)   mAP (full)   mJSD (subset)   mJSD (full)
Random               .50            .20          .1099           .1086
Prior                .59            .25          .0639           .0301
Baseline             .63            .40          .0623           .0270
Low                  .70            .63          .0489           .0286
Mid                  .70            .59          .0600           .0259
High                 .75            .49          .0594           .0361
Low+High             .75            .69          .0544           .02541
Low+Mid+High         .81            .71          .0397           .0198

Table 3: Results obtained by the different representation levels on the interaction dataset, using annotator agreement for binary labels of 62%/87%. The table shows mAP (higher is better) and mJSD (lower is better) scores.

References

[1] http://www.visint.org/.

[2] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.

[4] M. Bregonzio, S. Gong, and T. Xiang. Recognizing action as clouds of space-time interest points. In CVPR, 2009.

[5] W. Brendel, A. Fern, and S. Todorovic. Probabilistic event logic for interval-based event recognition. In CVPR, 2011.

[6] J. Choi, W. Jeon, and S.-C. Lee. Spatio-temporal pyramid matching for sports videos. In ACM Multimedia, 2008.

[7] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[8] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.

[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.

[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.

[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.

[12] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV, 2009.

[13] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. PAMI, 29(12):2247–2253, 2007.

[14] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009.


[15] D. Han, L. Bo, and C. Sminchisescu. Selection and context for action recognition. In ICCV, 2009.

[16] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[17] Y. Ke, R. Sukthankar, and M. Hebert. Spatio-temporal shape and flow correlation for action recognition. In CVPR, 2007.

[18] I. Laptev. On space-time interest points. IJCV, 64(2):107–123, 2005.

[19] I. Laptev. Space-Time Interest Points, 2010. Software available at http://www.irisa.fr/vista/Equipe/People/Laptev/download.html.

[20] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

[21] Y. J. Lee and K. Grauman. Foreground Focus: Unsupervised Learning from Partially Matching Images. International Journal of Computer Vision (IJCV), 85(2), May 2009.

[22] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, 2011.

[23] H. Ling and S. Soatto. Proximity distribution kernels for geometric context in category recognition. In ICCV, 2007.

[24] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.

[25] S. Maji, L. Bourdev, and J. Malik. Action recognition using a distributed representation of pose and appearance. In CVPR, 2011.

[26] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.

[27] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, 2009.

[28] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

[29] J. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In CVPR, 2007.

[30] B. Packer, K. Saenko, and D. Koller. A combined pose, object, and feature model for action understanding. In CVPR, 2012.

[31] V. Parameswaran and R. Chellappa. Human action-recognition using mutual invariants. CVIU, 1998.

[32] D. Parikh, L. Zitnick, and T. Chen. Unsupervised learning of hierarchical spatial structures in images. In CVPR, 2009.

[33] T. Quack, V. Ferrari, B. Leibe, and L. V. Gool. Efficient mining of frequent and distinctive feature configurations. In ICCV, 2007.

[34] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In NIPS, 2003.

[35] C. Rao and M. Shah. View-invariance in action recognition. In CVPR, 2001.

[36] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.

[37] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.

[38] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In CVPR, 2009.

[39] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.


[40] E. Shechtman and M. Irani. Space-time behavior based correlation or how to tell if two underlying motion fields are similar without computing them? PAMI, 29(11):2045–2056, 2007.

[41] M. Sridhar, A. Cohn, and D. Hogg. Unsupervised learning of event classes from video. In AAAI, 2010.

[42] J. Sun, X. Wu, S. Yan, L.-F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, 2009.

[43] B. Yao, X. Jiang, A. Khosla, A. Lin, L. Guibas, and L. Fei-Fei. Action recognition by learning bases of action attributes and parts. In ICCV, 2011.

[44] J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: from visual words to visual phrases. In CVPR, 2007.
