
Analyzing Diving: A Dataset for Judging Action Quality

Kamil Wnuk and Stefano Soatto

University of California, Los Angeles, CA 90095
{kwnuk,soatto}@cs.ucla.edu

Abstract. This work presents a unique new dataset and objectives for action analysis. The data presents 3 key challenges: tracking, classification, and judging action quality. The last of these, to our knowledge, has not yet been attempted in the vision literature as applied to sports where technique is scored. This work performs an initial analysis of the dataset with classification experiments, confirming that temporal information is more useful than holistic bag-of-features style analysis in distinguishing dives. Our investigation lays a groundwork of effective tools for working with this type of sports data for future investigations into judging the quality of actions.

1 Introduction

For sports as well as rehabilitation, quality control, security, and interfaces, the ability to make a critical judgment about how a particular action is performed can be imperative. With a significant portion of recent work focused on classification and detection of action categories, we instead propose focusing on analysis in domains where small details of temporal phenomena are the key informative elements. To this end, we have collected footage from a diving competition in which each dive is scored for technique irrespective of dive type or difficulty.

The convenience of working with a sport like diving is that it naturally avoids some of the early action dataset perils such as forced or exaggerated actions or loosely defined categories like “dance”. All dive types are strictly defined and recorded in competition. Although the dive types vary, their quality is evaluated independently of type, meaning there are subtle universal details that are important and must be discovered.

In this work we evaluate representations for the diving data on the more familiar classification task to build our intuitions about the dataset. In particular, if a representation is not useful in discriminating between different dives, we can hardly expect it to capture the minutiae that might have a significant influence on a score for technique.

In the process we make incremental modifications to background subtraction and set up a feature extraction pipeline suited for individual performance sports captured at relatively high frame rates. Our methods are able to overcome noticeably compressed video with significant motion blur, a highly deforming subject, and changing illumination.


Fig. 1. Mean raw scores for all dives, shown with one standard deviation. The overall average of mean raw scores is shown in green. It is interesting to note that the most variance in scores took place during the middle of the competition: this is typically when divers take the most risks to try to get ahead.

1.1 Related Work

The most notable recent foray into sports footage in the literature was a broadcast sports dataset collected by [1]. However, the dataset was a mixture of many different sports captured at highly variable angles, and the task was limited to a categorization exercise. People have also demonstrated detection of particular ice skating or ballet moves, but to our knowledge none of the sports datasets available to date have provided a corresponding set of performance scores describing the quality with which the actors executed their moves.

In this preliminary work, our investigation focuses on classification experiments to test our intuition about suitable features and inference techniques. Notably, we draw upon the success of descriptors based on gradient orientation histograms [12, 2, 3] in person detection, and adopt a variant thereof to encode pose.

We explore classification, drawing from both the holistic philosophy of treating the entire video as a bag of features [4–6] (which in our case encode poses), and from the temporal analysis perspective [7–9] of comparing multidimensional time series or their features.

2 The FINA09 diving dataset

We introduce a new dataset gathered from the finals of the men's 3 meter springboard diving event at the 13th FINA World Championships, which took place in Rome, Italy on July 23, 2009. The dataset is gathered with the intent of motivating a new action analysis application of judging action quality; however, it presents a challenge for tracking and classification as well.


The dataset includes 12 different divers, each performing 6 unique dives. Each dive is recorded with two synchronized high-speed cameras from orthogonal viewpoints. The first camera captures the dive head-on while the second records video from the side. Unfortunately, 4 videos were partially omitted in the available footage, so we have only 68 total samples. Example frames are shown in Fig. 2. The dive types and their distribution of scores are also shown in Table 1 and Fig. 1.

2.1 How diving is scored

In the diving competition, each dive was given a raw score for technique by 7 independent judges. The scores are in the range from 1 to 10 and must be expressed in increments of 0.5 (e.g., 6.0 and 6.5 are acceptable scores; 6.2 is not). The 2 highest and 2 lowest scores are discarded. The total raw score, which reflects how well the dive was executed, is obtained by summing the remaining 3 scores. To obtain the final dive score, this sum is then multiplied by the difficulty of the dive. Dive difficulties are constants assigned to each dive type, agreed upon by FINA (the organization overseeing the competition).
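As a quick worked example of this scoring rule (the judge scores and difficulty constant below are made up for illustration, and the helper function is ours, not part of the dataset tooling):

```python
def final_dive_score(judge_scores, difficulty):
    """Combine 7 raw judge scores into a final dive score.

    The 2 highest and 2 lowest scores are discarded, the remaining 3 are
    summed (the raw execution score), and the sum is multiplied by the
    dive's difficulty constant.
    """
    assert len(judge_scores) == 7
    kept = sorted(judge_scores)[2:-2]        # drop 2 lowest and 2 highest
    return sum(kept) * difficulty

# Hypothetical example: scores in 0.5 increments, difficulty from the FINA tables.
# The kept scores are 8.0, 8.0, 8.0, so the result is 24.0 * 3.8 = 91.2.
print(final_dive_score([8.0, 7.5, 8.5, 8.0, 7.0, 9.0, 8.0], difficulty=3.8))
```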

As an end goal we are interested in evaluating the quality of a dive, irrespective of its difficulty class. However, in this work we explore the issue of representation in the context of the more familiar task of classification.

3 Background subtraction and tracking

When considering representations, our intuition steers us towards a person-centric frame of reference, since factors such as symmetry and pose typically have high significance in perception of technique. We do, however, take the time to compare classification performance with a representation requiring minimal preprocessing, which represents a video as a collection of spatial-temporal interest points with histogram of oriented gradients (HoG) and histogram of optical flow (HoF) descriptors [10] (see Table 2).

Table 1. Distribution of Dives by Type

Dive Type ID   Num Samples   Description
107B           11            forward 3.5 somersaults
205B           11            back 2.5 somersaults
307C           11            reverse 3.5 somersaults
405B            3            inward 2.5 somersaults
407C            9            inward 3.5 somersaults
5154B           9            forward 2.5 somersaults, 2 twists
5156B           1            forward 2.5 somersaults, 3 twists
5253B           2            back 2.5 somersaults, 1.5 twists
5353B          10            reverse 2.5 somersaults, 1.5 twists
5355B           1            reverse 2.5 somersaults, 2.5 twists


Fig. 2. Sample synchronized frame pairs from various dives, exemplifying the type of color and shape irregularity, as well as effects of motion blur, compression, and regions of background with similar statistics to the foreground object.

A person-centric representation implies tracking over the entire duration of the video. Throughout each video a diver undergoes significant deformations and passes through zones with varying shadows and lighting, while the background motions at high frame rate remain relatively small. Therefore we choose to model the background instead of the appearance of the diver for tracking purposes. In this section we explain our pipeline and underscore any adaptations made to existing works due to the unique challenges and properties of our data.

3.1 Robust registration

In order to fix the small variation in scale and viewpoint between frames, we register sequential frames with an affine transform. We have found that simply applying RANSAC [11] to matching SIFT keypoints [12] works satisfactorily in our sequences if we add correspondences with the frame at time t − 5 to the constraints in the least-squares estimation. Sample results are shown in Fig. 3.
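A minimal sketch of this kind of frame-to-frame registration, assuming grayscale input frames and using OpenCV's SIFT matching and RANSAC affine estimation; the ratio test, thresholds, and the omission of the additional t − 5 constraint are our own simplifications, not the authors' exact procedure:

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def affine_to_previous(frame_t, frame_prev, ratio=0.75):
    """Estimate a 2x3 affine transform mapping frame_t into frame_prev's coordinates.

    SIFT keypoints are matched with Lowe's ratio test and the affine model is
    fit robustly with RANSAC. (The paper additionally adds correspondences with
    the frame at time t - 5 to the least-squares constraints; omitted here.)
    """
    kp_t, des_t = sift.detectAndCompute(frame_t, None)
    kp_p, des_p = sift.detectAndCompute(frame_prev, None)
    src, dst = [], []
    for m, n in matcher.knnMatch(des_t, des_p, k=2):
        if m.distance < ratio * n.distance:          # ratio test to prune ambiguous matches
            src.append(kp_t[m.queryIdx].pt)
            dst.append(kp_p[m.trainIdx].pt)
    A, inliers = cv2.estimateAffine2D(np.float32(src), np.float32(dst),
                                      method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return A  # 2x3 affine matrix (None if estimation failed)
```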

3.2 Background subtraction

In addition to stabilizing the video and allowing us to express events in a canonical reference frame, registration enables the construction of an initial background model (from mosaics, Fig. 3). Since the videos in our dataset are recorded at “Internet streaming quality”, they are plagued with compression artifacts that create constant jitter. These quantization artifacts are an additional nuisance on top of the already existing small background motions in the audience, and motion blur due to fast motion of the camera or actor. To maximize our robustness to such nuisances we select the background subtraction method of [13], which models the background with a color histogram at each pixel, taking into account a spatial and temporal neighborhood.


Fig. 3. Typical median mosaic results from the sequential affine registration.

We use a neighborhood radius of 4 (a 9 × 9 window), with color quantized via k-means to 32 bins. An update rate of α = 0.1 is used to keep the background current. Local pixel histograms are compared with the Bhattacharyya Coefficient measure of similarity.
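The sketch below illustrates the flavor of this model (our own simplification of [13]): per-pixel histograms of pre-quantized colors over a spatial window, a Bhattacharyya comparison against the background model, and a running update with α = 0.1. The temporal part of the neighborhood and the exact update rule of [13] are omitted, and all function names are illustrative.

```python
import numpy as np

def neighborhood_histograms(color_labels, n_bins=32, radius=4):
    """Per-pixel histogram of quantized colors over a (2*radius+1)^2 window.

    color_labels: H x W integer map of k-means color bin indices (32 bins, as
    in the text). Edge handling via np.roll (wrap-around) is a simplification.
    """
    H, W = color_labels.shape
    rows, cols = np.indices((H, W))
    hist = np.zeros((H, W, n_bins), dtype=np.float32)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(np.roll(color_labels, dy, axis=0), dx, axis=1)
            hist[rows, cols, shifted] += 1.0      # count the neighbor's color bin
    return hist / hist.sum(axis=2, keepdims=True)

def bhattacharyya(p, q):
    """Per-pixel Bhattacharyya coefficient between two histogram maps."""
    return np.sqrt(p * q).sum(axis=2)

def process_frame(bg_hist, color_labels, alpha=0.1, threshold=0.76):
    """One background-subtraction step: compare, threshold, update (alpha = 0.1)."""
    frame_hist = neighborhood_histograms(color_labels)
    bc = bhattacharyya(bg_hist, frame_hist)       # high value = looks like background
    foreground = bc < threshold                   # 0.76, the threshold used in the text
    bg_hist = (1.0 - alpha) * bg_hist + alpha * frame_hist
    return foreground, bc, bg_hist
```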

Improving foreground coherence In the above background subtraction approach a threshold is typically selected on the Bhattacharyya Coefficient to separate foreground pixels from background pixels (0.76 in our case, determined empirically). An artifact of thresholding is that regions belonging to a solid object often get fragmented. This is most evident when the foreground object passes over an area where the color distribution in the background model is very similar to a region on the object. To encourage better label consistency we augment the above approach by constructing a Random Field on the pixel lattice (V, E):

c^* = \arg\min_{c} \sum_{x_i \in V} \Psi(c_i \mid x_i) \;+\; \lambda \sum_{(x_i, x_j) \in E} \Phi(c_i, c_j \mid x_i, x_j), \qquad (1)

where c_i is the binary label assigned to a given pixel, and each pixel x_i ∈ V is connected to its 4 vertical and horizontal neighbors via the edges, E, represented as tuples (x_i, x_j).


The unary term is simply the Bhattacharyya Coefficient resulting from the background subtraction approach, which ranges between 0 and 1. For the pairwise potential we use a function of the Euclidean distance between pixel color values in LUV space. This way the pairwise cost is highest when neighboring pixels have different labels but are similar in color, and low if different labels are assigned to neighboring pixels with a large difference in color:

\Psi(c_i \mid x_i) = BC_i, \qquad \Phi(c_i, c_j \mid x_i, x_j) = \frac{1}{1 + \lVert x_i - x_j \rVert}\,[c_i \neq c_j]. \qquad (2)
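To make the energy of Eqs. (1)–(2) concrete, the sketch below builds the unary and pairwise costs on the pixel lattice; the split of the unary term into foreground/background costs is our own reading of Eq. (2), and the minimizer itself (e.g., a graph cut) is left to an off-the-shelf library.

```python
import numpy as np

def unary_costs(bc_map):
    """Unary term from the Bhattacharyya coefficient map.

    The paper writes Psi(c_i | x_i) = BC_i; one natural binary reading (our
    assumption) is: cost BC_i for labeling a pixel foreground, 1 - BC_i for
    labeling it background. Returned shape: H x W x 2 (background, foreground).
    """
    return np.stack([1.0 - bc_map, bc_map], axis=-1)

def pairwise_weights(luv_img):
    """Contrast-sensitive edge weights 1 / (1 + ||x_i - x_j||) of Eq. (2).

    Returns weights for the 4-connected lattice: vertical edges (pixel and its
    neighbor below) and horizontal edges (pixel and its neighbor to the right).
    The pairwise cost is this weight whenever the two labels disagree.
    """
    dv = np.linalg.norm(luv_img[1:, :] - luv_img[:-1, :], axis=2)   # vertical color differences
    dh = np.linalg.norm(luv_img[:, 1:] - luv_img[:, :-1], axis=2)   # horizontal color differences
    return 1.0 / (1.0 + dv), 1.0 / (1.0 + dh)
```

Since the labels are binary and the pairwise term only penalizes disagreement between neighbors, this energy is submodular and can be minimized exactly with a single graph cut.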

3.3 Foreground Object Tracking

Once we have established a background subtraction method to apply sequentially to each frame, we then have to find correspondences between foreground regions to obtain time series of region properties.

Naive Correspondence We first try the naive correspondence approach by matching the nearest region with similar area in consecutive frames. However, this approach breaks down when background subtraction fails to produce a single coherent region. Once a mismatch occurs in a noisy background subtraction result, the correspondence may continue to drift and produce nonsensical results.

Kalman Filtering in 3D We leverage the orthogonal viewpoints in our dataset to implement a Kalman filter in three dimensions to track the center of mass of the diver. The state includes the location, velocity, and acceleration of the center of mass and uses a random walk as the motion model. The observations are the naive correspondences described above.
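One way such a filter could be set up is sketched below, assuming the 3D center of mass has already been triangulated from the two views. The state layout (position, velocity, acceleration) follows the text, while the identity transition used to realize the random-walk model and the noise magnitudes are placeholder choices of ours.

```python
import numpy as np

class Kalman3D:
    """Random-walk Kalman filter over [position, velocity, acceleration] in 3D.

    Observations are 3D center-of-mass estimates derived from the naive
    per-view correspondences. Noise covariances are placeholder values.
    """
    def __init__(self, q=1.0, r=10.0):
        self.x = np.zeros(9)                 # state: px,py,pz, vx,vy,vz, ax,ay,az
        self.P = np.eye(9) * 1e3             # initial state uncertainty
        self.F = np.eye(9)                   # identity transition = random walk
        self.Q = np.eye(9) * q               # process noise
        self.H = np.zeros((3, 9))
        self.H[:3, :3] = np.eye(3)           # we observe position only
        self.R = np.eye(3) * r               # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        """z: observed 3D center of mass for the current frame."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(9) - K @ self.H) @ self.P
        return self.x[:3]
```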

We have manually annotated the time of the frames at which the diver begins receiving the final lift from the springboard and the time of impact with the water. In our experiments we do not use information before the diver starts being lifted by the springboard and stop tracking at the time of impact with the water to capture the angle of entry and splash. In classification experiments, the post-entry information is also discarded, and only the time spent in the air is considered.

4 Representation

4.1 Describing pose with gradient orientation histograms

The works of [12, 2] and many others have shown variations on gradient orientation histograms to be highly successful in object recognition as well as person detection. Taking a cue from the above works, we encode diver poses by computing a SIFT descriptor [12, 14] at a fixed scale and orientation centered at the diver's tracked location. However, instead of smoothing the image to the scale of the descriptor before computing gradients as typically done in SIFT, we take the suggestion of [2] and skip the smoothing to capture finer level gradient information.


Fig. 4. Our diver-centered unsmoothed SIFT feature in various frames, computed without and with background subtraction for comparison. Notice in the descriptors using background subtraction that even frames with discontinuities and false foreground patches produce seemingly little noise in responses.

The scale is selected so that it spans a square window of 250 × 250 pixels. We have found empirically that the default 4 × 4 spatial bin configuration with 8 bins for orientation works sufficiently well for this dataset.

4.2 Leveraging background subtraction

We choose to incorporate background information into the descriptor by masking the gradient responses before descriptor computation. Since the descriptor encoding is designed with insensitivity to noise in mind, it overcomes frames where the background subtraction result was less than ideal. Notice in the examples in Fig. 4 that even frames with discontinuities and false foreground patches produce seemingly little noise in responses. We compare the performance gained by incorporating background subtraction in Table 2 by running classification experiments both with descriptors computed before (1-SIFT) and after (1-SIFT-fg) applying the foreground mask.
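In the paper the descriptor is a SIFT computed with VLFeat [14]; the sketch below re-implements the basic idea directly in NumPy (a 250 × 250 window split into 4 × 4 spatial cells with 8 orientation bins, no pre-smoothing, and gradients optionally zeroed outside the foreground mask for the 1-SIFT-fg variant), so details such as Gaussian weighting and interpolation differ from the authors' exact code.

```python
import numpy as np

def pose_descriptor(gray, center, window=250, cells=4, n_ori=8, fg_mask=None):
    """Diver-centered gradient-orientation histogram (4 x 4 x 8 = 128-D per view).

    gray:    grayscale frame as a float array (window must fit inside the frame)
    center:  (row, col) of the tracked diver location
    fg_mask: optional boolean foreground mask; gradients outside it are zeroed
             (this corresponds to the 1-SIFT-fg variant)
    """
    half = window // 2
    r, c = int(center[0]), int(center[1])
    patch = gray[r - half:r + half, c - half:c + half]
    gy, gx = np.gradient(patch)                       # gradients on the unsmoothed patch
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    if fg_mask is not None:
        mag = mag * fg_mask[r - half:r + half, c - half:c + half]
    # accumulate an orientation histogram in each of the 4x4 spatial cells
    desc = np.zeros((cells, cells, n_ori))
    cell = window // cells
    bins = np.minimum((ori / (2 * np.pi) * n_ori).astype(int), n_ori - 1)
    for i in range(cells):
        for j in range(cells):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bins[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            desc[i, j] = np.bincount(b.ravel(), weights=m.ravel(), minlength=n_ori)
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-8)       # L2 normalize
```

Concatenating the descriptors from the two synchronized views gives the 256-dimensional per-frame representation used in the next section.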


5 Classifying Time Series

By extracting a diver-centered descriptor, as described above, at each frame in both viewpoints, we obtain a 256-dimensional time series encoding the sequence of the diver's poses. We approach the classification task in 2 ways: without regard for temporal ordering of poses, using a bag method, and with a dynamic time warping based kernel that keeps the sequence intact during comparison.

5.1 Bag of Poses

The first approach, disregarding temporal ordering, uses the familiar bag-of-features pipeline. Since our representation has one feature per frame centered around the diver, this method essentially represents each dive as an unordered collection of poses.

To construct the pose dictionary we randomly sample features from dives in the training set (15 per dive sample) and compute a k-means clustering of a predetermined size. For each dive, we then project the descriptor in every frame to a given dictionary and represent the full dive sequence as a single histogram of pose occurrences. To discriminate between histograms we train a χ2-SVM. We also experiment with the RBF-χ2 kernel, but find that the former performs better in our case. Table 2 shows our results over a range of parameters.
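A compact sketch of this bag-of-poses pipeline using scikit-learn; the dictionary sizes follow Table 2, while the SVM regularization constant and other details are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import additive_chi2_kernel
from sklearn.svm import SVC

def build_dictionary(sampled_descriptors, k=500):
    """k-means pose dictionary from descriptors sampled off the training dives."""
    return KMeans(n_clusters=k, n_init=4).fit(sampled_descriptors)

def bag_of_poses(dive_descriptors, dictionary):
    """Histogram of pose-word occurrences over all frames of one dive."""
    words = dictionary.predict(dive_descriptors)
    hist = np.bincount(words, minlength=dictionary.n_clusters).astype(float)
    return hist / hist.sum()

def train_chi2_svm(train_hists, train_labels, C=10.0):
    """Chi-squared SVM via a precomputed additive chi2 kernel (C is our choice)."""
    K_train = additive_chi2_kernel(train_hists)
    return SVC(kernel="precomputed", C=C).fit(K_train, train_labels)

def predict(clf, test_hists, train_hists):
    """Classify held-out dive histograms against the training set."""
    return clf.predict(additive_chi2_kernel(test_hists, train_hists))
```

The RBF-χ2 variant mentioned above corresponds to exponentiating a χ2 distance (e.g., scikit-learn's chi2_kernel) instead of using the additive kernel directly.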

5.2 Dynamic Time Alignment Kernel (DTAK)

To preserve the sequential nature of the dives while comparing time series, and still get the benefit of using SVMs, we can use the Dynamic Time Alignment Kernel (DTAK) proposed by Shimodaira et al. [15, 16]. The DTAK kernel between two time series, X = (x1, . . . , xn) and Y = (y1, . . . , ym), is defined as:

K_{DTAK}(X, Y) = \frac{p_{n,m}}{n + m}, \qquad (3)

p_{i,j} = \max \begin{cases} p_{i-1,j} + k(x_i, y_j) \\ p_{i-1,j-1} + 2\,k(x_i, y_j) \\ p_{i,j-1} + k(x_i, y_j) \end{cases} \qquad (4)

where k(x_i, y_j) = \exp\!\left(-\tfrac{1}{2\sigma^2}\lVert x_i - y_j \rVert^2\right).

When using the DTAK kernel, we bypass the dictionary construction stage used in the bag of poses framework and compare descriptors directly. That is, X and Y are sequences of descriptors taken from each frame of two different dives.

To add a level of robustness when comparing pose descriptors, similar to that offered by the dictionary creation stage in the bag pipeline, we use a reduced dimensionality approximation of the descriptors in each frame via PCA. Descriptors for PCA analysis were selected randomly from the training data in the same way as for k-means clustering in the bag approach. In our experiments we chose to represent each frame with 25 principal components, which capture approximately 43% of the variance of the data.

We apply the DTAK kernel on sequences of these reduced dimensionality descriptors and feed the computed kernel into an SVM.
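A direct implementation of Eqs. (3)–(4), together with the PCA reduction and the precomputed-kernel SVM, is sketched below; the boundary handling of the recursion, fitting PCA on all training frames rather than on a random sample, and the SVM's C value are our own choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def dtak(X, Y, sigma=500.0):
    """Dynamic Time Alignment Kernel of Eqs. (3)-(4) between two sequences.

    X: (n, d) and Y: (m, d) arrays of per-frame descriptors (here, 25-D PCA
    projections). The alignment path is forced to start at (1, 1), one common
    boundary convention.
    """
    n, m = len(X), len(Y)
    # local similarities k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    p = np.full((n + 1, m + 1), -np.inf)
    p[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p[i, j] = max(p[i - 1, j] + k[i - 1, j - 1],
                          p[i - 1, j - 1] + 2.0 * k[i - 1, j - 1],
                          p[i, j - 1] + k[i - 1, j - 1])
    return p[n, m] / (n + m)

def train_dtak_svm(sequences, labels, npc=25, sigma=500.0, C=10.0):
    """Reduce each frame's descriptor to `npc` principal components, then build
    the precomputed DTAK kernel over all training dives and fit an SVM.
    sequences: list of (T_i, 256) arrays; labels: dive-type ids."""
    pca = PCA(n_components=npc).fit(np.vstack(sequences))
    reduced = [pca.transform(S) for S in sequences]
    K = np.array([[dtak(A, B, sigma) for B in reduced] for A in reduced])
    clf = SVC(kernel="precomputed", C=C).fit(K, labels)
    return clf, pca, reduced
```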


6 Classification Experiments and Results

For our classification experiments we select the set of dive types to consist of those for which we have at least 6 samples. Referring back to Table 1, this leaves us with 6 unique dive types. We perform our tests by leave-one-out cross-validation, where 1 dive sample is left out each round as the test sample, while all the other dives serve as training data. Dictionaries and principal components are always recomputed at each iteration to avoid training and testing on the same data.

Table 2 summarizes our results using various descriptors, parameters, and classification approaches. The top section of Table 2 demonstrates that without incorporating background subtraction, a person-centric representation which requires tracking provides little gain if the classification is performed without regard for temporal order. We do notice a boost in performance when we classify the person-centric representation as a sequence. For this comparison we used the binaries provided by [10] to extract features from our videos.

Comparing results computed without incorporating background subtraction into the descriptor (1-SIFT) to those where the descriptor was masked to keep foreground (1-SIFT-fg), we see the benefit of this procedure. It boosts performance by almost 30% for the best performing classifier. Also, notice that DTAK kernel based classification shows significantly better performance and appears fairly stable with respect to changes in its σ parameter. The bag performance could be improved with temporal binning to include some sequential information, but the point was to compare both methods at their basic level. We also noticed that in our case there was no benefit to be had from using the RBF-χ2 kernel.

Table 2. Mean Classification Accuracies for All Experiments

Tracking   Representation               Classifier            Mean Accuracy
none       [10]: HoG+HoF bag (k=500)    χ2-SVM                58.67
3d-KF      1-SIFT bag (k=200)           χ2-SVM                47.73
3d-KF      1-SIFT bag (k=500)           χ2-SVM                58.13
3d-KF      1-SIFT PCA (npc=25)          dtak-SVM (σ = 500)    62.19

3d-KF      1-SIFT-fg bag (k=10)         χ2-SVM                51.06
3d-KF      1-SIFT-fg bag (k=50)         χ2-SVM                64.01
3d-KF      1-SIFT-fg bag (k=100)        χ2-SVM                75.10
3d-KF      1-SIFT-fg bag (k=200)        χ2-SVM                74.26
3d-KF      1-SIFT-fg bag (k=500)        χ2-SVM                80.81
3d-KF      1-SIFT-fg bag (k=100)        RBF-χ2-SVM            73.59
3d-KF      1-SIFT-fg bag (k=500)        RBF-χ2-SVM            75.77
3d-KF      1-SIFT-fg PCA (npc=25)       dtak-SVM (σ = 500)    91.75
3d-KF      1-SIFT-fg PCA (npc=25)       dtak-SVM (σ = 750)    91.75
3d-KF      1-SIFT-fg PCA (npc=25)       dtak-SVM (σ = 250)    81.84


7 Conclusion

We have presented a preliminary analysis of a new diving dataset and constructed an effective pipeline of tools for processing video of individual technique-based sports. Our results confirm that temporally constrained analysis is preferable for distinguishing dives, and that a gradient orientation histogram based pose representation is effective in the classification task. The question left open for our future investigations is whether these techniques will also hold true for dive quality score estimation. The diving dataset will be made publicly available on the website of the authors.

Acknowledgements. We would like to acknowledge the support of ONR 67F-1080868/N00014-08-1-0414, ARO 56765-CI, and AFOSR FA9550-09-1-0427.

References

1. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In: Proc. CVPR. (2008)

2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. CVPR. (2005)

3. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI (2010)

4. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. ICCV. (2003)

5. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: Proc. ICPR. (2004)

6. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: Proc. ICCV. (2009)

7. Wilson, A., Bobick, A.: Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999)

8. Raptis, M., Wnuk, K., Soatto, S.: Flexible dictionaries for action recognition. In: Proceedings of the 1st International Workshop on Machine Learning for Vision-based Motion Analysis, in conjunction with ECCV. (2008)

9. Ali, S., Basharat, A., Shah, M.: Chaotic invariants for human action recognition. In: IEEE 11th International Conference on Computer Vision. (2007) 1–8

10. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proc. CVPR. (2008)

11. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24 (1981) 381–395

12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2) (2004) 91–110

13. Ko, T., Soatto, S., Estrin, D.: Background subtraction with distributions. In: Proc. ECCV. (2008)

14. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/ (2008)

15. Shimodaira, H., Noma, K.I., Nakai, M., Sagayama, S.: Dynamic time alignment kernel in support vector machine. In: Proc. NIPS. (2002)

16. Zhou, F., De la Torre, F., Hodgins, J.K.: Aligned cluster analysis for temporal segmentation of human motion. In: IEEE Conference on Automatic Face and Gesture Recognition. (2008)