IEAAIE LNAI 2012 Presentation

Transcript of IEAAIE LNAI 2012 Presentation

  Slide 1/18

    Time Invariant Gesture Recognition by Modelling Body Posture Space

    Binu M Nair, Vijayan K Asari

    06/10/2012

    IEA-AIE 2012

  Slide 2/18

    Contents of Presentation

    Introduction

    Proposed Methodology

    Shape Representation Using HOG

    Computation of Reduced Posture Space using PCA

    Modeling of Mapping using GRNN

    Experiments and Results

    Weizmann Action Dataset

    Cambridge Hand Gesture Dataset

    Conclusions and Future Work

  Slide 3/18

    INTRODUCTION

  Slide 4/18

    Introduction: Action Recognition

    Widely researched area with potential applications in security and surveillance.

    Literature Survey

    Some of the research work was on space-time shapes with different kinds of representation:

    Gorelick et al. represent the space-time shape using Poisson's equation and extract stick/plate-like structures.

    Nair et al. represent the space-time shape using a 3D distance transform along with the R-Transform at multiple levels.

    Niebles et al. represent the space-time shape as a collection of spatio-temporal words in a bag-of-words model, using a probabilistic Latent Semantic Analysis model.

    Batra et al. characterize a space-time shape as a histogram over a dictionary of space-time shapelets, which are local motion patterns.

    Scovanner et al. represent a spatio-temporal word by a 3D SIFT region descriptor in a bag-of-words model.

  Slide 5/18

    Introduction: Literature Survey

    Some of the research work treated recognition as a tracking problem, i.e. tracking suitable points on the human body:

    Ali et al. use concepts from chaos theory to reconstruct the phase space from each trajectory and compute dynamic and metric invariants.

    Some of the latest work characterizes human action sequences as multi-dimensional arrays called tensors and uses these as the basis for feature extraction:

    Kim et al. present a framework called Tensor Canonical Correlation Analysis, where descriptive similarity features between two video volumes (tensors) are used.

    Lui et al. study the underlying geometry of the tensor space and perform a factorization on this space to obtain product manifolds, comparing them using a geodesic distance measure.

    Proposed Work

    Models the variation of features extracted from each frame with respect to time; in other words, finds an underlying manifold in the feature space which captures the temporal variance needed for discriminating between action sequences.

    Classifies a set of contiguous frames irrespective of the speed of the action or the time instant of the body posture.

  Slide 6/18

    PROPOSED METHODOLOGY

  Slide 7/18

    Proposed Methodology

    We focus on 3 main aspects:

    1. Feature Extraction: a shape descriptor computed for the region of interest in a frame.

    2. Computation of an appropriate reduced-dimensional space which spans the change of shape occurring across time.

    3. Suitable modeling of the mapping from the feature space to the reduced space.

    The histogram of gradients (HOG) is used as the shape descriptor. It provides a more local representation of the shape and is partially invariant to illumination.

    Principal Component Analysis (PCA) is used to obtain the reduced feature space. The inter-frame variation of the shape descriptors is maximized in the Eigen space. These variations indirectly correspond to the variations occurring in the silhouette (body posture changes), which differ between action sequences.

    To model the mapping from the shape descriptor (feature) space to the Eigen space for each action class, we use a regression-based network, the generalized regression neural network (GRNN).

    Back-propagation neural networks take a long time to train and often may not converge, while the GRNN is based on radial basis functions, uses a one-pass training algorithm, and converges to a stable state.

    [Block diagram: Feature Extraction → Reduced Space Computation → Modeling]

  Slide 8/18

    Proposed Methodology

    Block Schematic (Training)

    A set of N frames (a complete sequence where N = 60, or a partial sequence where N = 15) consists of segmented body regions (silhouettes).

    From each frame, the histogram of gradients is computed and accumulated over the sequences from all the action classes.

    The Eigen space is obtained by performing PCA on the accumulated features, and suitable reduced representations of the features are obtained.

    Each model from 1 to M, where M is the number of action classes, is represented by one GRNN network. It is trained using the HOG descriptor of a frame as the input and the corresponding representation in the Eigen space as the output. In short, each GRNN models the mapping from the HOG space to the Eigen space (a structural sketch of this training loop follows below).
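    Read end to end, the training phase amounts to the loop below. This is only a structural sketch: hog_descriptor, eigen_space, project, and ActionGRNN are illustrative helpers sketched on the following slides, not code from the paper.

```python
import numpy as np

def train(frames_by_class):
    # frames_by_class: {action label m: list of segmented frames}
    # 1. Compute and accumulate HOG descriptors for every class.
    feats = {m: np.stack([hog_descriptor(f) for f in frames])
             for m, frames in frames_by_class.items()}
    H = np.concatenate(list(feats.values())).T   # D x K accumulated matrix
    # 2. One shared Eigen space from PCA on the accumulated features.
    E, mean = eigen_space(H)
    # 3. One GRNN per action class, modeling HOG space -> Eigen space.
    models = {}
    for m, X in feats.items():
        Y = np.stack([project(E, mean, h) for h in X])
        models[m] = ActionGRNN().fit(X, Y)
    return E, mean, models
```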

    Testing Phase

    From an input set of frames from a test sequence, the corresponding HOG descriptors are computed.

    The representation of the HOG descriptors in the Eigen space is computed by projecting them onto the Eigen space obtained in training; this is taken as the reference.

    The estimate of the reduced representation produced by each GRNN is then compared with the reference representation, and the model which gives the closest match is taken as the class of the test action sequence.

  Slide 9/18

    Shape Representation Using HOG

    Computation of the Histogram of Gradients

    The gradient of the frame is taken in the x and y directions.
    The image is divided into K overlapping blocks.
    The orientation of the gradient is divided into n bins.
    For each block, the histogram of orientations weighted by the gradient magnitude is computed.
    The histograms from the various blocks are combined and normalized. (Figs. 2 and 3 illustrate the HOG for a particular frame; a numpy sketch follows below.)
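    A minimal sketch of this style of descriptor, using the 7 × 7 block and 9-bin settings given on the experiments slide. The 50% block overlap and the unsigned [0, π) orientation range are assumptions for illustration; the slides do not specify them.

```python
import numpy as np

def hog_descriptor(frame, n_blocks=7, n_bins=9):
    # Gradient of the frame in the x and y directions.
    frame = frame.astype(np.float64)
    gy, gx = np.gradient(frame)
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation in [0, pi)

    h, w = frame.shape
    # Overlapping blocks with a 50% step (assumed) so n_blocks span each axis.
    bh, bw = 2 * h // (n_blocks + 1), 2 * w // (n_blocks + 1)
    sy, sx = max(bh // 2, 1), max(bw // 2, 1)
    hists = []
    for by in range(n_blocks):
        for bx in range(n_blocks):
            m = mag[by * sy: by * sy + bh, bx * sx: bx * sx + bw].ravel()
            a = ang[by * sy: by * sy + bh, bx * sx: bx * sx + bw].ravel()
            # Histogram of orientations weighted by the gradient magnitude.
            hist, _ = np.histogram(a, bins=n_bins, range=(0, np.pi), weights=m)
            hists.append(hist)
    feat = np.concatenate(hists)                  # 7 * 7 * 9 = 441 values
    return feat / (np.linalg.norm(feat) + 1e-12)  # L2 normalization
```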

    Binary Silhouette

    Produces a gradient where all of its points correspond to the silhouette.
    Produces a HOG which is discriminative and more localized due to the block operation. Thus, the body posture is represented in a discriminative and localized manner.

    Gray-Scale Image

    Noise is present in the gradient image due to illumination variations.
    Some noise is reflected in the HOG descriptor, but since the HOG is partially illumination invariant due to the normalization, the feature descriptors do not vary much.

  Slide 10/18

    Computation of the Eigen Space

    The Eigen space obtained for the two datasets is shown. Each action class is color-coded to illustrate how close the action manifolds are; only a small separation exists between them.

    Considerable overlap is present between the manifolds, and the aim is to use a functional mapping for each manifold to distinguish them.

    We denote the action space as follows:

    $K^{(m)}$ is the number of frames accumulated from the video sequences of action $m$, and $\mathbf{h}_k^{(m)}$ is the corresponding HOG descriptor of dimension $D \times 1$.

    The Eigen space is obtained by performing PCA on the matrix $\mathbf{H} = [\mathbf{h}_1^{(m)}, \mathbf{h}_2^{(m)}, \mathbf{h}_3^{(m)}, \dots, \mathbf{h}_{K^{(m)}}^{(m)}]$ accumulated over all actions $m = 1, \dots, M$, to get the Eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \dots$ corresponding to the largest variances between the HOG descriptors.

    These Eigenvectors with the highest Eigenvalues correspond to the directions along which the temporal variance between the HOG descriptors is maximum.
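    A minimal sketch of this computation, assuming the descriptors from all action classes are stacked as the columns of a single matrix H; the number of retained Eigenvectors is an illustrative choice.

```python
import numpy as np

def eigen_space(H, n_components=10):
    # H: D x K matrix of accumulated HOG descriptors (one column per frame).
    mean = H.mean(axis=1)
    Hc = H - mean[:, None]                     # center the descriptors
    # Left singular vectors of the centered matrix are the Eigenvectors
    # of the covariance, ordered by decreasing variance.
    U, S, _ = np.linalg.svd(Hc, full_matrices=False)
    return U[:, :n_components], mean           # e_1 ... e_L and the mean

def project(E, mean, h):
    # Reduced representation of a single D-dimensional HOG descriptor h.
    return E.T @ (h - mean)
```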

  Slide 11/18

    Modeling of Mapping using GRNN

    The mapping from the HOG descriptor space $\mathbb{R}^{D \times 1}$ to the Eigen space $\mathbb{R}^{L \times 1}$ is represented as $\mathbf{y} = f^{(m)}(\mathbf{h})$, where $\mathbf{y} = \{y_l : l = 1, \dots, L\}$ is the vector representing a point in the Eigen space.

    We aim to model this mapping using a GRNN.

    The GRNN is a one-pass learning algorithm which provides fast convergence to the optimal regression surface.

    It is memory intensive, so we train each GRNN with the cluster centers obtained from K-means clustering.

    Each GRNN is represented by the equation

    $$\hat{\mathbf{y}}(\mathbf{h}) = \frac{\sum_{i=1}^{C} \mathbf{y}_i \exp\left(-\frac{\|\mathbf{h} - \mathbf{h}_i\|^2}{2\sigma^2}\right)}{\sum_{i=1}^{C} \exp\left(-\frac{\|\mathbf{h} - \mathbf{h}_i\|^2}{2\sigma^2}\right)}$$

    where $(\mathbf{h}_i, \mathbf{y}_i)$, $i = 1, \dots, C$, are the cluster centers in the HOG descriptor space and the Eigen space.

    The standard deviation $\sigma$ of the radial basis function for each action class is taken as the median Euclidean distance between that action's cluster centers.
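    A sketch of one per-class GRNN under these definitions. Pairing each HOG-space cluster center with the mean Eigen-space projection of its cluster members is an assumption; the slides only state that the networks are trained on cluster centers from K-means.

```python
import numpy as np
from sklearn.cluster import KMeans

class ActionGRNN:
    def fit(self, X, Y, n_clusters=4):
        # X: HOG descriptors of one action class (one row per frame);
        # Y: their Eigen-space projections. Cluster in the HOG space and
        # pair each center with its cluster's mean projection (assumed).
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        self.cx = km.cluster_centers_
        self.cy = np.stack([Y[km.labels_ == i].mean(axis=0)
                            for i in range(n_clusters)])
        # Sigma: median Euclidean distance between this class's centers.
        d = np.linalg.norm(self.cx[:, None] - self.cx[None, :], axis=-1)
        self.sigma = np.median(d[np.triu_indices(n_clusters, k=1)])
        return self

    def predict(self, h):
        # GRNN regression: Gaussian-kernel-weighted average of the
        # Eigen-space centers, as in the equation above.
        w = np.exp(-np.sum((self.cx - h) ** 2, axis=1)
                   / (2 * self.sigma ** 2))
        return (w @ self.cy) / (w.sum() + 1e-12)
```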

    Classification

    The set of HOG descriptors from consecutive frames of a test sequence is projected onto the Eigen space to get the corresponding reference projections $\{\mathbf{y}_t : t = 1, \dots, T\}$.

    These are compared with the projections of the corresponding frames estimated by each GRNN action model, using the Mahalanobis distance measure.

    The action model which gives the estimate closest to the reference projections determines the class.
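    A sketch of this step, reusing project and ActionGRNN from the earlier sketches. The covariance used in the Mahalanobis distance is not specified on the slide; an inverse covariance matrix VI of the training projections is assumed to be precomputed.

```python
import numpy as np

def classify(frames_hog, E, mean, models, VI):
    # Reference: projections of the test frames onto the Eigen space.
    ref = np.stack([project(E, mean, h) for h in frames_hog])
    best, best_d = None, np.inf
    for label, grnn in models.items():
        # Estimates of the same projections by this action's GRNN.
        est = np.stack([grnn.predict(h) for h in frames_hog])
        # Mean Mahalanobis distance between estimates and reference.
        diff = est - ref
        d = np.mean([np.sqrt(v @ VI @ v) for v in diff])
        if d < best_d:
            best, best_d = label, d
    return best                  # class of the closest action model
```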

  Slide 12/18

    EXPERIMENTS AND RESULTS

  Slide 13/18

    Experiments and Results

    The action framework is evaluated on two datasets:

    The Weizmann Human Action Dataset
    The Cambridge Hand Gesture Dataset

    The histogram of gradients feature descriptor uses:

    7 × 7 overlapping cells
    9 orientation bins
    Normalization by the L2-norm

    The histograms from each block are combined to form a feature vector of size 441 × 1.

    Weizmann Action Dataset

    The dataset consists of 10 action classes, and each action class has 9-10 video sequences. Each video sequence of an action is performed by a different individual. There is variation in the size of the person and the speed of motion.

    Cambridge Hand Gesture Dataset

    It has 3 main action classes corresponding to different shapes of the hand. Each of these classes is further divided by the motion of the hand; in short, there are 9 different action classes.

  Slide 14/18

    Weizmann Database Results

    The various actions are: a1 - bend; a2 - jplace; a3 - jack; a4 - jforward; a5 - run; a6 - side; a7 - wave1; a8 - skip; a9 - wave2; a10 - walk.

    The test sequence is divided into overlapping windows of size $W$ with an overlap of 1 frame.

    Testing is done using a leave-10-sequences-out strategy; in short, all the partial sequences corresponding to the test sequences are left out of the training.

    The confusion matrix is shown on the left; on the right is the average accuracy obtained with the framework for window sizes of $W$ = 10, 12, 15, 18, 20, 23, 25, 28 and 30.
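    A small sketch of how a test sequence can be split into the overlapping partial sequences used above; since the slide's "overlap of 1" is ambiguous, the window step is left as a parameter.

```python
def partial_sequences(frames, W=15, step=1):
    # Overlapping windows of size W over a test sequence. step controls
    # the overlap (step=1 gives maximally overlapping windows; the exact
    # overlap used in the slides is an assumption).
    return [frames[i:i + W] for i in range(0, len(frames) - W + 1, step)]
```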

  Slide 15/18

    Cambridge Hand Gesture Database

    This database has 5 different sets, with each set corresponding to a different kind of illumination. Each action class has 20 sequences.

    For each sequence, skin segmentation was done in order to get the region of interest and to center the region in the image.

    The extracted HOG descriptors contain noise due to the different illumination conditions.

    The testing strategy used was leave-9-test-sequences-out, where each test sequence corresponds to an action class.

    The confusion matrix shown on the left is obtained by considering 4 clusters during training.

    If all the illumination conditions are trained into the system, the overall accuracy is higher.

    Testing was done on each set individually, and the overall accuracy computed for each set is shown on the right.

    For set 1, the overall accuracy is high as the lighting in set 1 is fairly uniform, but sets 2, 3, 4 and 5 give moderate overall accuracy due to extreme non-linear lighting conditions.

  Slide 16/18

    CONCLUSIONS AND FUTURE WORK

  Slide 17/18

    Conclusions and Future Work

    We presented a framework for recognizing actions from partial sequences.

    The framework is invariant to the speed of the action being performed.

    Results show good accuracy on the Weizmann database, but on the Cambridge hand gesture database the illumination conditions affect the accuracy.

    Severe illumination conditions, as seen in the hand gesture database, affect the HOG space, and thus the Eigen space becomes more tuned to the noise.

    Our future work is to develop and use a descriptor which represents a shape from a set of corner points, where the relationships between the points are determined at spatial and temporal scales.

    Other regression techniques and classification methodologies will also be investigated.

  Slide 18/18

    Thank You

    Questions?

    Please contact

    Binu M Nair : [email protected]