Human Action Recognition across Datasets by Foreground-weighted Histogram Decomposition

Waqas Sultani, Imran Saleemi. CVPR 2014


Motivation

Dense STIP for cross-dataset recognition:

Training          Testing           Accuracy (avg)
UCF50             UCF50             70 %
UCF50             HMDB51            55.7 %
Olympic Sports    Olympic Sports    71.8 %
UCF50             Olympic Sports    16.67 %

Training and testing are done on similar actions across the datasets.
Recognition accuracy drops across datasets!

(Example frames: UCF50, HMDB51)
Is the background responsible for this drop in accuracy?
Do action classifiers learn background?

Experiment

Two recent challenging datasets:
UCF YouTube: 1100 videos, 11 actions
UCF Sports: 150 videos, 10 actions

Extract STIP (HOG, HOF) features with 50% spatio-temporal overlap at a single scale.
Actor bounding boxes are available for these datasets.
Foreground features: more than 50% overlap with the bounding box
Background features: less than 50% overlap with the bounding box
Dense features: all features

Experiment

Experimental setup:
Leave one group out for UCF YouTube
Leave one actor out for UCF Sports
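The foreground/background split by bounding-box overlap can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the box format `(x_min, y_min, x_max, y_max)`, measuring overlap as a fraction of the feature patch's own area, and treating exactly 50% as foreground are all assumptions.

```python
def overlap_fraction(feature_box, actor_box):
    """Fraction of the feature patch area that lies inside the actor box.

    Boxes are (x_min, y_min, x_max, y_max). Measuring overlap relative to
    the feature patch is an assumption; the slides do not define it.
    """
    fx0, fy0, fx1, fy1 = feature_box
    ax0, ay0, ax1, ay1 = actor_box
    inter_w = max(0.0, min(fx1, ax1) - max(fx0, ax0))
    inter_h = max(0.0, min(fy1, ay1) - max(fy0, ay0))
    area = (fx1 - fx0) * (fy1 - fy0)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def split_features(feature_boxes, actor_box, threshold=0.5):
    """Return (foreground_indices, background_indices) of the feature patches."""
    foreground, background = [], []
    for idx, box in enumerate(feature_boxes):
        # Assumption: a patch with exactly `threshold` overlap counts as foreground.
        if overlap_fraction(box, actor_box) >= threshold:
            foreground.append(idx)
        else:
            background.append(idx)
    return foreground, background


# Hypothetical actor box and three feature patches in one frame.
actor = (40, 20, 80, 100)
patches = [(50, 30, 70, 50), (30, 30, 50, 50), (0, 0, 20, 20)]
fg, bg = split_features(patches, actor)
```

The dense feature set is simply the union of both index lists.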

(Example frames: UCF YouTube biking video and UCF Sports running video, shown with foreground features, background features and dense features.)

STIP features    UCF YouTube    UCF Sports
Dense            60.6 %         75.34 %
Foreground       59.80 %        71.92 %
Background       55.27 %        73.97 %

Background features alone give comparable performance, without even observing the actor.
Background should be diverse but not discriminative!
As action datasets become larger and more complex, their backgrounds may become more discriminative!

Action class discriminability using GIST: Experiment 1

Compute the GIST descriptor for the KTH, UCF50 and HMDB51 datasets

Cluster the GIST descriptors into k clusters using k-means clustering

Estimate point-wise mutual information between each cluster and action class
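As a rough illustration of the last step, point-wise mutual information between a cluster c and an action class a is log(p(c, a) / (p(c) p(a))), estimated here from co-occurrence counts. The function name and the toy labels are hypothetical, and how the slides turn these values into the PMI distance matrices is not preserved in the transcript.

```python
import math
from collections import Counter


def pmi_table(cluster_ids, action_labels):
    """PMI for every (cluster, action) pair that occurs at least once.

    cluster_ids[i] is the k-means cluster of descriptor i and
    action_labels[i] is the action class of the video it came from.
    """
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, action_labels))
    cluster_counts = Counter(cluster_ids)
    action_counts = Counter(action_labels)
    table = {}
    for (c, a), n_ca in joint.items():
        p_joint = n_ca / n
        p_cluster = cluster_counts[c] / n
        p_action = action_counts[a] / n
        table[(c, a)] = math.log(p_joint / (p_cluster * p_action))
    return table


# Toy data: clusters that perfectly predict the action give a positive PMI,
# clusters that are independent of the action give a PMI of zero.
informative = pmi_table([0, 0, 1, 1], ["walk", "walk", "run", "run"])
uninformative = pmi_table([0, 1, 0, 1], ["walk", "walk", "run", "run"])
```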

Experiment 1 (continued): PMI distance matrices (figure panels: KTH, UCF50, HMDB51)

Clusters    100     200      300
KTH         5.12    8.25     10.4
HMDB51      7.15    11.07    14.06
UCF50       7.97    11.79    14.38

A small value means more inter-class confusion.
Based on scene information alone, KTH is harder to classify than HMDB51 and UCF50.

Action class discriminability using GIST: Experiment 2

Compute the GIST descriptor for every 50th frame in the KTH, UCF50 and HMDB51 datasets; [symbol missing in source] is the set of descriptors.
Graph connected component analysis is performed by threshold E.
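A connected-component analysis over a thresholded descriptor graph might look like the sketch below. The transcript does not say how edges are defined, so linking two frames when their Euclidean descriptor distance is at most the threshold is an assumption, as are the function name and the toy 2-D "descriptors".

```python
import math


def connected_components(descriptors, threshold):
    """Group descriptors whose pairwise distance chains stay within threshold.

    Assumption: two descriptors share an edge when their Euclidean distance
    is <= threshold. Uses a small union-find structure.
    """
    n = len(descriptors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(descriptors[i], descriptors[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy 2-D stand-ins for GIST descriptors of sampled frames.
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 0.0)]
components = connected_components(frames, threshold=0.5)
```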

Our Approach

Foreground Focused Representation

Action localization
Binary foreground/background segmentation
Both are very challenging, akin to introducing a new problem to solve the first.
Instead, estimate the confidence that each pixel is part of the foreground, and use it to obtain the video representation.

Foreground Focused Representation

Motion Gradients

Color Gradients

Visual Saliency

A fully connected graph is built over the pixels, where the edge between two pixels is given as: [equation missing in source]

By computing the stationary distribution of a Markov chain on this graph, a new graph is built.

The equilibrium distribution of the chain is used as the per-pixel saliency.
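The Markov-chain step can be illustrated with power iteration on a tiny graph. This is only a sketch of the general idea: the edge weights below are made up (the slide's edge formula is not preserved), and the function name and iteration count are assumptions.

```python
def stationary_distribution(weights, iterations=200):
    """Equilibrium distribution of the Markov chain defined by edge weights.

    weights[i][j] is the (non-negative) edge weight from node i to node j.
    Rows are normalised into transition probabilities, then the distribution
    is propagated until it settles.
    """
    n = len(weights)
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total for w in row])

    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * transition[i][j] for i in range(n))
                for j in range(n)]
    return dist


# Hypothetical symmetric weights between three pixels.
weights = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
saliency = stationary_distribution(weights)
```

For symmetric weights the equilibrium mass of a node is proportional to the sum of its edge weights, so the two strongly connected pixels in this toy graph receive the most mass.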

Due to camera motion, video saliency is noisy.
Graph-Based Image Saliency (NIPS 2006)

Coherence of Foreground Confidence

Initial aggregate of the confidence maps: [equation missing in source]
The score is max-normalized for each frame of a video.
The quality of a labeling is given by: [equation missing in source]

Coherence of Foreground Confidence (continued)

Inference
The message that node p sends to node q is given by: [equation missing in source]
The belief vector of node q is given by: [equation missing in source]

Obtain the probability of each pixel being foreground using:
Motion gradients
Color gradients
Saliency
Spatio-temporal coherence using a 3D MRF

(Figure: video and its final weights.)
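The per-frame max-normalization of the aggregated confidence can be sketched as below. The aggregation formula is not preserved in the transcript, so the plain average of the three cue maps is an assumption, as are the function name and the toy 2x2 maps; the 3D-MRF smoothing step is not shown.

```python
def aggregate_confidence(motion, color, saliency):
    """Combine three per-pixel cue maps for one frame and max-normalise.

    Assumption: the cues are averaged with equal weight. Each map is a
    list of rows with values >= 0.
    """
    rows, cols = len(motion), len(motion[0])
    combined = [
        [(motion[r][c] + color[r][c] + saliency[r][c]) / 3.0
         for c in range(cols)]
        for r in range(rows)
    ]
    peak = max(max(row) for row in combined)
    if peak == 0:
        return combined  # nothing to normalise in an empty frame
    return [[value / peak for value in row] for row in combined]


# Toy 2x2 cue maps for a single frame.
motion = [[0.2, 0.8], [0.0, 0.4]]
color = [[0.1, 0.9], [0.0, 0.2]]
saliency_map = [[0.3, 0.7], [0.0, 0.6]]
frame_confidence = aggregate_confidence(motion, color, saliency_map)
```

After normalization the most confident pixel of each frame has value 1, so confidences are comparable across frames of the same video.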

Examples: final foreground weights (panels: UCF50 pull up, HMDB51 ride-horse, HMDB51 ride-bike, Olympic Sports pole vault, Olympic Sports diving, UCF50 basketball, UCF50 golf swing)

Traditional bag-of-words histogram (figure: UCF YouTube biking video, with foreground words and background words)

Weighted k-means

To make the codebook biased towards foreground features, use the weights of the features during clustering.
The confidence of each descriptor being on the foreground is given by: [equation missing in source]

The goal of clustering is to minimize the following energy function: [equation missing in source]

Weighted histogram

To reduce the contribution of background features, use each feature's foreground weight during quantization.
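Both steps can be sketched together. The slide's energy function is not preserved, so this uses the common weighted k-means objective (each squared point-to-center distance multiplied by the point's weight), which makes every center a weighted mean; that choice, the function names, the fixed initial centers, the L1 normalization and the toy data are all assumptions.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def weighted_kmeans(points, weights, centers, iterations=10):
    """Lloyd iterations where each center is the weighted mean of its points."""
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            k = nearest(p, centers)
            mass[k] += w
            for d in range(dim):
                sums[k][d] += w * p[d]
        # Keep the old center if a cluster received no weight.
        centers = [
            tuple(s / mass[k] for s in sums[k]) if mass[k] > 0 else centers[k]
            for k in range(len(centers))
        ]
    return centers


def weighted_histogram(points, weights, centers):
    """Each feature votes for its nearest word with its foreground weight."""
    hist = [0.0] * len(centers)
    for p, w in zip(points, weights):
        hist[nearest(p, centers)] += w
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Toy descriptors: two confident foreground features, two background features.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
weights = [0.9, 0.7, 0.1, 0.1]
codebook = weighted_kmeans(points, weights, centers=[(0.0, 0.0), (10.0, 0.0)])
hist = weighted_histogram(points, weights, codebook)
```

An unweighted histogram of the same toy features would put half the mass in each bin; the weighted version shifts most of it to the word supported by high-confidence features.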

Weighted bag of words (figure: UCF YouTube biking video; foreground words and background words; weighted k-means followed by weighted histogram)

Problem
No separate foreground and background words or vocabulary.
The large number of background features can sum up to be significant.

Foreground confidence based histogram decomposition

Compute histograms for each region separately.
Regions of two videos that have the same foreground confidence are compared only with each other.
The kernel function becomes: [equation missing in source]

(Figure: the features of a UCF50 video and an HMDB51 video are partitioned by weight on a scale from 0 to 1; a weighted histogram is computed for each partition; the partition-wise histogram intersections are combined by a weighted summation into the final similarity.)
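The pipeline in the figure can be sketched as follows. The kernel formula itself is not preserved in the transcript, so the equal-width weight bins, the joint L1 normalization, the per-partition weights and all function names are assumptions rather than the authors' definition.

```python
def partition_histograms(word_ids, weights, vocab_size, num_partitions):
    """One weighted histogram per foreground-confidence partition.

    word_ids[i] is the visual word of feature i and weights[i] its
    foreground confidence in [0, 1]. Bins are equal-width (assumption).
    """
    hists = [[0.0] * vocab_size for _ in range(num_partitions)]
    for word, w in zip(word_ids, weights):
        part = min(int(w * num_partitions), num_partitions - 1)
        hists[part][word] += w
    total = sum(sum(h) for h in hists)
    if total > 0:
        hists = [[v / total for v in h] for h in hists]
    return hists


def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))


def decomposed_similarity(hists_a, hists_b, partition_weights):
    """Compare only matching partitions, then take a weighted sum."""
    return sum(
        pw * histogram_intersection(ha, hb)
        for pw, ha, hb in zip(partition_weights, hists_a, hists_b)
    )


# Toy videos: 3 visual words, 2 partitions (low and high confidence).
video_a = partition_histograms([0, 1, 2, 0], [0.9, 0.8, 0.1, 0.2], 3, 2)
video_b = partition_histograms([0, 1, 2, 2], [0.9, 0.7, 0.2, 0.1], 3, 2)
# Hypothetical partition weights that favour the high-confidence region.
similarity = decomposed_similarity(video_a, video_b, [0.25, 0.75])
```

Because low-confidence features can only match other low-confidence features, a large shared background contributes to the similarity only through its own down-weighted partition.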

Experimental Results

Datasets used: UCF50, HMDB51, Olympic Sports
Features used: STIP

UCF50 vs. HMDB51: 10 common actions
We choose actions which are visually similar: Biking, Golf Swing, Pull Ups, Horse Riding, Basketball

UCF50 vs. Olympic Sports: 6 common actions
Basketball, Pole Vault, Tennis Serve, Diving, Clean and Jerk, Throw Discus

Qualitative Results (HMDB51 vs. UCF50 video pairs)

Action        Histogram Intersection    Weighted Histogram Intersection    Weighted Histogram Decomposition
Biking        0.1035                    0.1142                             0.1295
Golf Swing    0.1684                    0.2740                             0.3089
Pull Ups      0.2744                    0.5454                             0.5586

Quantitative Results (accuracy, %)

Training          Testing           Unweighted    Weighted    Histogram Decomposition
UCF50             UCF50             70.00         74.20       77.85
UCF50             HMDB51            55.70         60.00       68.70
HMDB51            HMDB51            65.30         69.30       68.00
HMDB51            UCF50             63.3          64.00       68.67
Olympic Sports    Olympic Sports    71.80         73.95       69.79
UCF50             Olympic Sports    31.25         31.25       33.33
Olympic Sports    UCF50             16.67         32.29       47.91
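To read the cross-dataset rows more easily, the gain of histogram decomposition over the unweighted baseline can be computed directly from the table. The snippet only re-uses the numbers above; the variable and function names are illustrative.

```python
# (training, testing, unweighted, weighted, histogram decomposition), accuracy in %
results = [
    ("UCF50", "UCF50", 70.00, 74.20, 77.85),
    ("UCF50", "HMDB51", 55.70, 60.00, 68.70),
    ("HMDB51", "HMDB51", 65.30, 69.30, 68.00),
    ("HMDB51", "UCF50", 63.3, 64.00, 68.67),
    ("Olympic Sports", "Olympic Sports", 71.80, 73.95, 69.79),
    ("UCF50", "Olympic Sports", 31.25, 31.25, 33.33),
    ("Olympic Sports", "UCF50", 16.67, 32.29, 47.91),
]


def cross_dataset_gains(rows):
    """Accuracy gain (percentage points) for rows where train != test."""
    gains = {}
    for train, test, unweighted, _weighted, decomposed in rows:
        if train != test:
            gains[(train, test)] = decomposed - unweighted
    return gains


gains = cross_dataset_gains(results)
mean_gain = sum(gains.values()) / len(gains)
```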

Quantitative Results: confusion matrices for UCF50 classifiers tested on HMDB51 (unweighted vs. histogram decomposition)

Thank you