
A Motion Classifier for Microsoft Kinect

Chitphon Waithayanon
Master in Computer Science and Information Technology
Department of Mathematics and Computer Science
Faculty of Science, Chulalongkorn University, Bangkok, 10330 Thailand
[email protected]

Chatchawit Aporntewan
Department of Mathematics and Computer Science
Faculty of Science, Chulalongkorn University, Bangkok, 10330 Thailand
[email protected]

Abstract - In this paper we propose a motion classifier using Microsoft Kinect. The main algorithm is based on Dynamic Time Warping (DTW). We tested the classifier with 7 motions. The experimental results show that DTW, coupled with the human joint positions captured by Microsoft Kinect, is capable of classifying the selected motions.

I. INTRODUCTION

In the past, human motion analysis was a complicated task because the input was video images [1,2]. The most difficult part is image processing and feature extraction from 2D images. Recently, Microsoft released a gaming device for the XBOX360, namely "Microsoft Kinect," together with a programming toolkit called "Kinect for Windows SDK Beta" for developing applications on a PC. Kinect provides real-time human skeleton tracking with the position of each human joint in 3D [3,4]. The skeleton is very useful information for human motion analysis. Microsoft Kinect has greatly reduced the programming difficulty and has brought in a new era of the Natural User Interface (NUI).

Since the release of Kinect, a large number of games and applications have employed motion detection to interface with users. Although the algorithms are not disclosed to the public, we believe that most of them are hard-coded, i.e., programmers encode their knowledge of a particular motion. If users do something beyond what the programmers expect, the program will fail to detect the motion. Moreover, the motion is fixed and cannot be changed. In contrast, good software should allow users to customize the NUI. For instance, users may prefer their own motions rather than those defined by programmers. To do so, the software must be able to learn motions with the users' assistance (telling the software the class of each motion, e.g. standing, sitting, jumping, etc.). Later, the software is able to classify motions when a user repeats them.

In this paper, we aim to develop a classifier for human motions. The motion classifier will ease the programming difficulty, speed up software prototyping, and allow users to customize the NUI to their preferences.

II. LITERATURE REVIEW

There are various methods for human motion classification, such as fuzzy rules that operate on the trajectory of body parts [5]; FMDistance, which approximates the total kinetic energy of each joint [6]; the Ensemble-based Human Motion Classification Approach (EHMCA), which extracts descriptors from human motion sequences and then applies singular value decomposition (SVD) to reduce the dimensionality of the feature vectors [7]; clustering methods such as K-means, C-means, and Expectation Maximization (EM) [8,9]; and Dynamic Time Warping (DTW), which aligns two motions and gives a similarity score [10]. To obtain motion capture data, several devices can be used, for example, the RGBD sensor in Microsoft Kinect [11] or custom-designed inertial sensor devices [8,9,12].

III. RESEARCH METHODOLOGY

A. Data Collection via Microsoft Kinect

The joint coordinates (x, y, z) in the depth data, skeletal data, and color image data are based on different coordinate systems. For skeletal data, (x, y, z) is obtained by converting the skeletal coordinates to depth coordinates, which range between 0.0 and 1.0. This value is then converted to the 640x480 color image coordinates, after which x is divided by 640 and y is divided by 480, where 640x480 is the width and height of the screen. The z value, or depth, is measured in millimeters and can be obtained via the method DepthImageSkeletal. The center of the screen is at (0, 0), and the normalized data lies between -1 and +1. Microsoft Kinect defines a position as a 4D vector (x, y, z, w): (x, y, z) is the position in camera space, z is the distance between the Kinect and the human, and w gives the quality level (between 0 and 1) of the position.
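To make the final normalization step concrete, the following is a minimal sketch in Python (not the authors' implementation): a joint already projected onto the 640x480 image plane is rescaled so that the screen center becomes (0, 0) and both axes span -1 to +1. The function name, the y-axis direction, and the exact conventions are our own assumptions.

```python
# Hedged sketch of the screen-coordinate normalization described above.
# WIDTH/HEIGHT and the y-axis flip are assumptions, not SDK constants.

WIDTH, HEIGHT = 640, 480  # color image resolution used in the paper

def normalize_joint(px, py, depth_mm):
    """Map 640x480 image-plane pixel coordinates to [-1, +1],
    with the screen center at (0, 0); depth stays in millimeters."""
    nx = (px / WIDTH) * 2.0 - 1.0    # 0..640  ->  -1..+1
    ny = 1.0 - (py / HEIGHT) * 2.0   # 0..480  ->  +1..-1 (image y grows downward)
    return nx, ny, depth_mm

# A joint at the exact screen center, 2 m from the sensor:
print(normalize_joint(320, 240, 2000.0))  # (0.0, 0.0, 2000.0)
```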

Fig. 1 All human joints that were tracked by Kinect [3].


Fig. 1 shows a total of 20 human joints that the Kinect sensor can track. The Kinect coordinate system is shown in Fig. 2. We used the Head as the origin (0, 0, 0), so all points obtained from Kinect were converted to positions relative to the head. Next, each point was assigned an angle relative to the Shoulder Center. Finally, we used only the joint angles for motion classification.

Fig. 2 Kinect coordinate system [3].

Note that the angle between two vectors is calculated by

$$\theta = \cos^{-1}\left(\frac{a \cdot b}{\|a\|\,\|b\|}\right)$$

where $a$ is the position vector of one of the joints Hand Left, Wrist Left, Hand Right, or Wrist Right, and $b$ is the position vector of the Shoulder Center.
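As an illustration of this preprocessing, the sketch below (Python; the function and variable names are hypothetical, not the authors' code) expresses a joint and the Shoulder Center relative to the Head and computes the angle with the formula above.

```python
# Hedged sketch of the feature extraction in Section III-A: a hand/wrist
# joint is reduced to its angle against the Shoulder Center vector,
# with both vectors taken relative to the Head (the origin).
import math

def angle_to_shoulder_center(joint, shoulder_center, head):
    """Angle (radians) between a joint and the Shoulder Center,
    both expressed relative to the Head."""
    a = [j - h for j, h in zip(joint, head)]             # e.g. Hand Left
    b = [s - h for s, h in zip(shoulder_center, head)]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

# Example: hand straight out to the side vs. shoulder center straight down.
print(angle_to_shoulder_center((1, 0, 0), (0, -1, 0), (0, 0, 0)))  # ~pi/2
```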

B. Classifier Algorithm

We used the Dynamic Time Warping (DTW) technique to align two motions, each represented as angle changes over time. DTW gives a distance between two motions; note that the distance is inversely related to the similarity. Given an unseen motion, the classifier calculates the distance to each known (training) motion. The unseen motion is predicted to belong to the class of the training motion with which it has the shortest distance.

DTW is a technique for finding an optimal alignment between two time-dependent sequences $X = (x_1, x_2, \ldots, x_M)$ and $Y = (y_1, y_2, \ldots, y_N)$: the sequences are warped nonlinearly to match each other, and a distance (using the Euclidean distance) is computed from this matching [13,14]. At the core of the technique lies the warping curve $\phi(k)$, $k = 1, \ldots, T$:

$$\phi(k) = (\phi_x(k), \phi_y(k)) \quad \text{with} \quad \phi_x(k) \in \{1, \ldots, M\}, \;\; \phi_y(k) \in \{1, \ldots, N\}.$$

The warping functions $\phi_x$ and $\phi_y$ remap the time indices of $X$ and $Y$, respectively. Given $\phi$, we compute the average accumulated distortion between the warped time series $X$ and $Y$:

$$d_\phi(X, Y) = \sum_{k=1}^{T} \frac{d(\phi_x(k), \phi_y(k)) \, m_\phi(k)}{M_\phi}$$

where $m_\phi(k)$ is a per-step weighting coefficient and $M_\phi$ is the corresponding normalization constant.
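For illustration, the following is a minimal sketch of DTW using the common symmetric step pattern and an absolute-difference local cost on 1-D angle sequences. The paper itself used the R dtw package [15]; this plain Python reimplementation is only a sketch of the idea, not the authors' code.

```python
# Hedged sketch of the DTW distance via the standard dynamic-programming
# recurrence: D[i][j] is the minimal accumulated cost of aligning the
# first i samples of x with the first j samples of y.

def dtw_distance(x, y):
    """Accumulated DTW distance between two 1-D sequences x and y."""
    m, n = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])   # local distance d(i, j)
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[m][n]

# Example: the same shape shifted in time still aligns with zero cost.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0
```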

IV. EXPERIMENTAL RESULTS

We used 7 hand motions: single circle, double circle, punch, uppercut, square, triangle, and flipped triangle (see Fig. 3-9). Each motion took 5 seconds, and Kinect sampled the joint positions every 0.2 seconds. Only four joints, the left/right hands and left/right wrists, were captured and used. The trajectories of all four joints were concatenated into one long 3D motion sequence and compared with each reference (see Fig. 10-16). Note that we used a package in R Statistics to perform DTW [15]. The distance comparisons are shown in Tables 1-7. There were 3 sets of testing data; the classification procedure is sketched below.
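The sketch below (Python; labels and dummy data are illustrative only, not the experimental data) shows this nearest-reference classification: an unseen motion is assigned the label of the reference with the smallest DTW distance. The dtw_distance function from the sketch in Section III-B is repeated so the example runs standalone.

```python
# Hedged sketch of the classification step. Each motion would be the
# concatenated angle sequences of the four tracked joints (5 s sampled
# every 0.2 s = 25 samples per joint, 100 values in total).

def dtw_distance(x, y):
    INF = float("inf")
    D = [[INF] * (len(y) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(x)][len(y)]

def classify(unseen, references):
    """Return the label of the reference with the smallest DTW distance."""
    return min(references, key=lambda label: dtw_distance(unseen, references[label]))

# Illustrative dummy data only (NOT the motions recorded in the paper).
references = {
    "single circle": [i % 5 for i in range(100)],
    "punch": [0.0] * 50 + [2.0] * 50,
}
print(classify([i % 5 for i in range(100)], references))  # -> single circle
```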

Fig. 3 Single circle motion made by left (blue) and right hand (red).

Fig. 4 Double circle motion made by left (blue) and right hand (red).


Fig. 5 Punch motion made by left (blue) and right hand (red).

Fig. 6 Uppercut motion made by left (blue) and right hand (red).

Fig. 7 Square motion made by left (blue) and right hand (red).

Fig. 8 Triangle motion made by left (blue) and right hand (red).

Fig. 9 Flipped triangle motion made by left (blue) and right hand (red).

Fig. 10 A comparison of single circle with the reference (circle).


Fig. 11 A comparison of single circle with the reference (double circle).

Fig. 12 A comparison of single circle with the reference (punch).

Fig. 13 A comparison of single circle with the reference (uppercut).

Fig. 14 A comparison of single circle with the reference (square).

Fig. 15 A comparison of single circle with the reference (triangle).

Fig. 16 A comparison of single circle with the reference (flipped triangle).


Table 1. Distance between single circle and all references.

Reference Data     DTW Distance       DTW Distance       DTW Distance
                   (single circle 1)  (single circle 2)  (single circle 3)
Single circle            22.82              30.18              37.87
Double circle           116.82             130.24             135.19
Punch                   573.52             562.97             543.89
Uppercut                319.03             293.71             294.20
Square                   86.50             121.45             114.02
Triangle                 94.99              98.12              89.50
Flipped triangle        121.23             102.19             102.36

Table 2. Distance between double circle and all references.

Reference Data     DTW Distance       DTW Distance       DTW Distance
                   (double circle 1)  (double circle 2)  (double circle 3)
Single circle           139.09             104.70             136.50
Double circle            71.38              56.08              58.22
Punch                   302.68             340.93             324.28
Uppercut                303.04             319.28             312.05
Square                  127.81             108.32             133.46
Triangle                109.69              96.42              92.98
Flipped triangle        111.54             119.73             112.60

Table 3. Distance between punch and all references.

Reference Data     DTW Distance   DTW Distance   DTW Distance
                   (punch 1)      (punch 2)      (punch 3)
Single circle         555.28         580.19         638.53
Double circle         355.92         355.32         387.23
Punch                  23.83          43.58          45.65
Uppercut              623.85         678.36         687.97
Square                362.74         352.18         389.30
Triangle              405.06         421.38         461.37
Flipped triangle      344.75         362.69         405.96

Table 4. Distance between uppercut and all references.

Reference Data     DTW Distance   DTW Distance   DTW Distance
                   (uppercut 1)   (uppercut 2)   (uppercut 3)
Single circle        266.0061       296.4954       286.1613
Double circle        302.6803       312.4756       305.813
Punch                568.2701       583.3942       584.8413
Uppercut              40.68694       37.05502       39.51109
Square               395.6499       419.1425       420.8889
Triangle             261.2207       273.2485       263.5611
Flipped triangle     289.5794       303.8263       296.7424

Table 5. Distance between square and all references.

Reference Data     DTW Distance   DTW Distance   DTW Distance
                   (square 1)     (square 2)     (square 3)
Single circle         109.37         106.30         125.38
Double circle         106.15         100.27         111.24
Punch                 345.80         366.37         300.12
Uppercut              404.15         354.50         380.46
Square                 41.54          71.26          73.27
Triangle               84.99          88.80         101.13
Flipped triangle      135.96          89.40          80.45

Table 6. Distance between triangle and all references.

Reference Data     DTW Distance   DTW Distance   DTW Distance
                   (triangle 1)   (triangle 2)   (triangle 3)
Single circle         151.22         124.42         118.94
Double circle         110.75          99.41         108.48
Punch                 420.14         386.48         425.31
Uppercut              248.39         291.92         282.23
Square                128.07          92.49          91.78
Triangle               53.65          38.28          44.64
Flipped triangle      118.12          98.45         149.46

Table 7. Distance between flipped triangle and all references.

Reference Data     DTW Distance          DTW Distance          DTW Distance
                   (flipped triangle 1)  (flipped triangle 2)  (flipped triangle 3)
Single circle           115.98                 97.09                123.43
Double circle           119.42                106.09                117.64
Punch                   370.51                319.90                327.56
Uppercut                308.94                308.47                335.41
Square                  141.48                140.18                140.55
Triangle                 75.80                 90.00                 78.43
Flipped triangle         43.45                 64.04                 47.21

V. CONCLUSION

We have shown a simple but effective method for motion classification. Microsoft Kinect gives joint positions in 3D that are precise enough for performing DTW. The experimental results show that classification among the 7 motions is 100% accurate across the 3 test sets.

REFERENCES

[1] Liang Wang, Li Cheng, and Guoying Zhao, Machine Learning for Human Motion Analysis: Theory and Practice, Medical Information Science Reference, 1st edition, December 9, 2009.
[2] Liang Wang, Li Cheng, Guoying Zhao, and Matti Pietikainen, Machine Learning for Vision-based Motion Analysis, Springer, 1st edition, November 19, 2010.
[3] Kinect for Windows SDK Beta, Programming Guide, Beta 1 Draft Version 1.0a, June 24, 2011.
[4] Kinect for Windows SDK Beta, Skeletal Viewer Walkthrough: C++ and C#, Beta 1 Draft Version 1.0a, June 24, 2011.
[5] Chee Seng Chan, Honghai Liu, and David Brown, "An Effective Human Motion Classification Approach using Knowledge Representation in Qualitative Normalised Templates," IEEE, 2007.
[6] Kensuke Onuma, Christos Faloutsos, and Jessica K. Hodgins, "FMDistance: A fast and effective distance function for motion capture data," EUROGRAPHICS 2008 (K. Mania and E. Reinhard, eds.).
[7] Zhiwen Yu, Xing Wang, and Hau-San Wong, "Ensemble Based 3D Human Motion Classification," 2008 International Joint Conference on Neural Networks (IJCNN 2008), pp. 50-509.
[8] D. K. Arvind and M. M. Bartosik, "Motion capture and classification for real-time interaction with a bipedal robot using on-body, fully wireless, motion capture Specknets," The 18th IEEE International Symposium on Robot and Human Interactive Communication, Toyama, Japan, Sept. 27-Oct. 2, 2009, pp. 1087-1092.
[9] D. K. Arvind and M. M. Bartosik, "Clustering of Motion Data from On-Body Wireless Sensor Networks for Human-Imitative Walking in Bipedal Robots," Advanced Robotics, ICAR 2009.
[10] Kevin Adistambha, Christian H. Ritz, and Ian S. Burnett, "Motion Classification Using Dynamic Time Warping," IEEE, 2008, pp. 622-627.
[11] Jaeyong Sung, Colin Ponce, Bart Selman, and Ashutosh Saxena, "Human Activity Detection from RGBD Images," Association for the Advancement of Artificial Intelligence, 2011.
[12] Tsuyoshi Matsuoka, Asako Soga, and Kazuhiro Fujita, "A Retrieval System for Ballet Steps using Three-dimensional Motion Data," 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 1144-1147.
[13] Meinard Müller, Information Retrieval for Music and Motion, Springer, 1st edition, November 14, 2007.
[14] Ralph Niels and Louis Vuurpijl, "Dynamic Time Warping Applied to Tamil Character Recognition," IEEE, 2005, pp. 730-734.
[15] Toni Giorgino, "Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package," Journal of Statistical Software, 2009.
