An adaptive stacked hourglass network with Kalman ˝lter for...

16
ORIG I NAL AR TI CLE Journal Section An adaptive stacked hourglass network with Kalman lter for estimating 2D human pose in video Tao Hu 1,2 | Chunxia Xiao 1* | Geyong Min 3 | Noushin Najjari 3 1 School of Computer Science, Wuhan University, Wuhan, Hubei,430072, China 2 School of Information Engineering, Hubei Minzu University, Enshi, Hubei,445000, China 3 College of Engineering, Mathematics and Physical Science, University of Exeter, Exeter, United Kingdom Correspondence Chunxia Xiao, School of Computer Science, Wuhan University, Wuhan, Hubei,430072, China Email: [email protected] Funding information The Key Technological Innovation Projects of Hubei Province under Grants 2018AAA062 and 2018AKB035; Wuhan Science and Technology Plan Project under Grant 2017010201010109; The National Key Research and Development Program of China under Grant 2017YFB1002600; National Natural Science Foundation of China under Grants 61672390, 61562025,61972298,61962019; Hubei Chengguang Talented Youth Development Foundation (HBCG). One of the main challenges in computer science and image processing is 2D human pose estimation. Specically, oc- clusion and in particular occlusion of human joints caused by camera angle are of paramount importance. In this pa- per, a new highly accurate network was proposed that can estimate 2D human poses in video images using deep learn- ing. We employ the Single Shot MultiBox Detector network to detect the centre position of each human within a video frame and then use the stacked hourglass network to es- timate the 2D human pose. We approximate the human motion as a linear motion between dierent frames in a cer- tain period; and optimize the human centres based on the local outlier factor and Kalman lters. The same method is applied to optimize the human pose estimations in video, which can address the inaccurate prediction caused by hu- man joints occlusion. The proposed adaptive network is tested using the two well known benchmarks for human pose estimation (MPII and JHMDB datasets), and we also generate some 2D human pose estimating qualitative re- sults of single and multiple people in Internet videos. The experimental results show that the proposed network has strong practicability and can achieve high accuracy on adap- tive estimating the 2D human pose in video. 1

Transcript of An adaptive stacked hourglass network with Kalman ˝lter for...

Page 1: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

OR I G I N A L A RT I C L EJou rna l Se c t i on

An adaptive stacked hourglass network withKalman filter for estimating 2D human pose invideo

Tao Hu1,2 | Chunxia Xiao1∗ | Geyong Min3 |Noushin Najjari3

1School of Computer Science, WuhanUniversity, Wuhan, Hubei,430072, China2School of Information Engineering, HubeiMinzu University, Enshi, Hubei,445000,China3College of Engineering, Mathematics andPhysical Science, University of Exeter,Exeter, United Kingdom

CorrespondenceChunxia Xiao, School of Computer Science,Wuhan University, Wuhan, Hubei,430072,ChinaEmail: [email protected]

Funding informationThe Key Technological Innovation Projectsof Hubei Province under Grants2018AAA062 and 2018AKB035; WuhanScience and Technology Plan Project underGrant 2017010201010109; The NationalKey Research and Development Program ofChina under Grant 2017YFB1002600;National Natural Science Foundation ofChina under Grants 61672390,61562025,61972298,61962019; HubeiChengguang Talented Youth DevelopmentFoundation (HBCG).

One of the main challenges in computer science and imageprocessing is 2D human pose estimation. Specifically, oc-clusion and in particular occlusion of human joints causedby camera angle are of paramount importance. In this pa-per, a new highly accurate network was proposed that canestimate 2D human poses in video images using deep learn-ing. We employ the Single ShotMultiBoxDetector networkto detect the centre position of each human within a videoframe and then use the stacked hourglass network to es-timate the 2D human pose. We approximate the humanmotion as a linear motion between different frames in a cer-tain period; and optimize the human centres based on thelocal outlier factor and Kalman filters. The same method isapplied to optimize the human pose estimations in video,which can address the inaccurate prediction caused by hu-man joints occlusion. The proposed adaptive network istested using the two well known benchmarks for humanpose estimation (MPII and JHMDB datasets), and we alsogenerate some 2D human pose estimating qualitative re-sults of single and multiple people in Internet videos. Theexperimental results show that the proposed network hasstrong practicability and can achieve high accuracy on adap-tive estimating the 2D human pose in video.

1

Page 2: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

2 Tao Hu et al.

K E YWORD S

Human detection, Human pose estimation, Kalman filter, Outlierdetection

1 | INTRODUCTION

Two dimensional human pose estimation is an important problems in multimedia content analysis. It has many ap-plications in robotics, virtual reality, computer animation, intelligent surveillance and life sciences (Kim et al. (2016);Neves et al. (2016); Koppula and Saxena (2016); Rogez and Schmid (2018); Liang et al. (2018); Liu and Wang (2018),Kim et al. (2016)). For example, a Schaeffer gesture recognition system (Oprea et al. (2018)) has been proposed toimprove the estimating accuracy of human pose based on several recurrent neural network models, such as LSTM(Luo et al. (2018)), and VRNN (Chung et al. (2015)). Although most recent pose estimation methods have achievedhigh accuracy with pre-localized people (Wei et al. (2016); Carreira et al. (2016); Newell et al. (2016)) in single images,it is difficult to estimate human poses in realistic video with self-occlusion (Iqbal et al. (2017b)).

There are three core challenges to estimating a 2D human pose in video. The first is detecting the human bodyin each video frame. Second, it is difficult to estimate the 2D pose of the detected human in each video frame. Third,the most difficult part is to find the relevance between each human pose and the estimated results in continuousvideo frames. We focus on optimizing the first and third challenges in this paper. This means that we have to addressvarious forms of human pose, fast motions and human joints occlusion.

In contrast to the previous works, we aim to detect the human location and the association of each person’s jointsthrought a video. To this end, combining the Single Shot MultiBox Detector (Liu et al. (2016)) and stacked hourglassnetworks (Newell et al. (2016)), we design an adaptive network for estimating 2D human poses in videos that canestimate the pose of a person in any position within a video frame.

In particular, we define the problem as an optimization of time series motion data, such as human centre positionand each joint’s position in continuous video frames. The displacement change of the human centre and human jointsduring a certain period can be approximated by a linear motion. The implementation of the optimization methodincludes the local outlier factor (Lu et al. (2017); Salehi et al. (2016); Fanaee-T and Gama (2015)) to determine theoutlier of the input time series data and uses a Kalman filter (Liang et al. (2012); Nguyen et al. (2017); Rodger (2018))to update the value of the outlier. In this manner, we can address self-occlusion and improve the accuracy of stackedhourglass network. Put differently, our network can estimate the centre position of a human and the displacement ofeach human joint in a video more accurately. Finally, we evaluate our network using the following two benchmarks:the MPII human pose dataset (Andriluka et al. (2014)) and the Joint-annotated Human Motion Data Base (JHMDB)dataset (Jhuang et al. (2013)).

The main contributions of this paper are summarized as follows:

1) An adaptive estimating network of 2D human pose in video is designed that can estimate the centre positionof each human in a video frame automatically. The basic stacked hourglass network must use the manual input centrecoordinates of the humans in the image before estimating the 2D human poses. Our network employs the SSDnetwork to estimate the human’s location in a video frame. Regardless of where the human is located, our networkcan detect the human’s centre position in the frame .

2) Due to the occlusion of the human joints and the effect of ambient light, the adaptive stacked hourglass networkcannot estimate the complete human pose. Our network optimizes the human pose prediction from the two aspectsof human centre prediction and joint position estimation. It employs the local outlier factor to determine the outlier’s

Page 3: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 3

F IGURE 1 The overview of the proposed adaptive 2D human pose estimating framewrok

position. The Kalman filter is used to generate a smoother data series than the original pose data series for theestimation. Finally, the data values of the displacement outliers are replaced based on the Kalman filtering results.

We organize the remainder of this paper as follows: The related works of estimating a human 2D pose in a singleimage and video are presented in Section 2.The proposed adaptive 2D human pose estimation method with SSD andstacked hourglass network, along with the time series motion data optimizing method are presented in Section 3. InSection 4, we present the experimental results and discuss the performance of our network. Finally, the paper isconcluded in Section 5.

2 | RELATED WORKS

The classic human pose estimatingmethods (Rihan and Torr (2012); Yang and Ramanan (2013); Pishchulin et al. (2013);Song et al. (2018); Nie et al. (2015); Pang et al. (2015)) need to construct a classifier by extracting a high dimensionalfeature. The learning process of a high dimensional feature is very complex and time consuming. Based on thecomparison between these classic human pose estimating methods, Toshev et al. (Toshev and Szegedy (2014)) firstproposed the "DeepPose" method to estimate a human pose using deep learning. Since then, many human poseestimating methods based on deep learning have appeared. In this paper, we extend the stacked hourglass network(Newell et al. (2016)) with an SSD network (Liu et al. (2016)) to estimate a 2D human pose. Below is a brief overviewof these topics.

Human pose estimation in image: The "DeepPose" method (Toshev and Szegedy (2014)) regresses the x and ycoordinates of joints by deep neural networks directly. Based on the pose machine (Ramakrishna et al. (2014)), theconvolutional pose machines (Wei et al. (2016)) consist of a sequence of Convolutional Neural Networks (CNN)(Wanget al. (2019); Ding et al. (2019)) to construct 2D belief maps of the joints. A stacked hourglass network (Newell et al.(2016)) employs Residual networks to gradually extract the features of a single image and to capture the variousspatial relationships that exist between human joints. Part Affinity Field (Cao et al. (2017)) is designed to estimatethe 2D poses of people in an image, that are generated by a two-branch multi-stage CNN. A Pose Residual Network(Kocabas et al. (2018)) is designed to estimate the 2D poses of multiple people. These methods are state-of-the-artwith evaluations in some popular datasets, such as the MPII human pose dataset. The MPII dataset has labelled thehuman centre position of each image, therefore the use of videos in existing works is limited. In this paper, we willaddress the human pose estimation in video with no labels of human location.

Human pose estimation in video: Due to motion blurriness, uncommon poses and human self-occlusion, estimat-ing a 2D human pose in a video is challenging. Hernández et al. (Hernández et al. (2014)) designed a tracking, featureextraction and action recognition model in video sequences using kinematic features. Song et al. (Song et al. (2017))

Page 4: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

4 Tao Hu et al.

(a) Input Image (b) 2D human pose estimation without accurate human

center location

(c) Human position location in input image based on

SSD network

(d) 2D human pose prediction based on adaptive 2D

human pose estimating method

F IGURE 2 Results of adaptive 2D human pose estimating method.(a) is the input image, (b) is the 2D humanpose estimating results based on stacked hourglass network (Newell et al. (2016)) with non-accurately human centrecoordinates, (c) is the human centre coordinates estimating result based on SSD network, and (d) is the 2D humanpose estimating results base on the proposed model.

proposed incorporating spatial and temporal modelling into a ConvNet to predict a sequence of 2D human poses inarbitrary videos. A pose refinement network (Fieraru et al. (2018)), which learns to directly estimate a refined 2D poseby joint reasoning, is proposed to estimate the poses of multiple people in a video. A spatial-temporal and-or graphmodel (Nie et al. (2015)) is introduced to estimate pose from video by capturing the appearance and geometric varia-tions of 2D poses in each video frame. Luo et al. (Luo et al. (2018)) propose "LSTM Pose Machines" for video poseestimation. "LSTM Pose Machines" is a recurrent CNN model with Long Short-Term Memory (LSTM). LSTM is veryeffective in imposing geometric consistency among frames. Work by (Girdhar et al. (2018)) introduces a lightweightapproach to estimate in videos that includes the following two-stages: key point estimating in short frames and keypoint predictions throughout the video. In this paper, a time series motion data method was proposed to optimize thelocations of humans and human joints.

3 | THE PROPOSED NETWORK

The smart network for adaptive estimation of 2D human pose in video is composed of four parts. FIGURE 1 illustratesthe complete process of the proposed adaptive estimation network. This section mainly focuses on the two parts of

Page 5: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 5

F IGURE 3 Adaptive 2D human pose estimation netwrok

adaptive estimation of a 2D human pose and optimization of the time series motion data.

3.1 | An adaptive 2D human pose estimation method

The stacked hourglass network (Newell et al. (2016)) is used to generate a 2D human pose from a still image; it needscertain position coordinates of a human in the image as an input. If the position coordinates of the human centre arenot accurate, the stacked hourglass network cannot estimate the position of each human’s joints correctly. Due tothese factors, we propose an adaptive 2D human pose estimationmethod based on the Single ShotMultiBoxDetector(SSD) (Liu et al. (2016)) and the stacked hourglass network (Newell et al. (2016)). The SSD network is employed topredict the human centre position in the input video frame and the stacked hourglass network is used to predict the2D human pose in the corresponding video frame. We demonstrate the result of adaptive 2D human pose estimationin FIGURE 2.

The entire network of adaptive 2D human pose estimation is divided into the two stages of human centre positiondetection and human 2D pose estimation, which are demonstrated in FIGURE 3.

Stage 1: Estimating the center position of a human within a video frame. Our network employs an SSD network(Liu et al. (2016)), which is an object detection network with a feature pyramid structure, to detect a human andestimate the centre position of the human in each frame. Additionally, this network employs a classic deep learningmodel, such as VGG − 16 (Simonyan and Zisserman (2015)), to construct the input feature map of the input video

Page 6: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

6 Tao Hu et al.

(a) human pose estimating result based on second

ordered stacked hourglass network

(b) human pose estimating result based on fourth

ordered stacked hourglass network

F IGURE 4 Comparison of human 2D pose estimation with different ordered stacked hourglass network

frame. It then generates different size feature maps of input the feature map using the convolution kernels conv4− 3,conv −7, con6−2, conv7−2, conv8−2, conv9−2 and so on. The SSD network designs multiple pr i or boxes for eachobject in different sized feature maps, which is trained with the matched truth bounding box of the object. Basedon the prior boxes, the network achieves the object classification using the sof tmax function and object locationregression with different sized feature maps.

In our network, we merely estimate the object, whose label is set as "human"; we can obtain the bounding boxof the detected object based on a location regression. Finally, our network can predict a more accurate result of thehuman centre position in the video frame based on the classification and the bounding box.

Stage 2: Estimating the human 2D pose in video frame. Based on the predicted human centre position, at thefirst stage, the network crops the frame into a candidate area whose size is 256 × 256, and it extracts the featuremap. In the next step, the network employs the fourth ordered stacked hourglass network, composed of a series ofResidual modules, to estimate the human 2D pose. The fourth ordered stacked hourglass network uses max poolingto down-sample the feature map of the candidate area and then inputs it to the next hourglass network. Concurrently,this network employs three Residual modules to extract the input feature map’s high dimensional feature, which canpreserve the size of the input before down-sampling. When the network reaches the fourth hourglass network, itemploys the nearest neighbourhood interpolation to up-sample the feature that is extracted by five Residual modules.The up-sampling feature map adds the upper scale feature and the results of the input into the upper-level hourglassnetwork by using one Residual module.

To clearly explain the performance of the fourth ordered stacked hourglass network, the intermediate and finalresults are compared. Given a single image (FIGURE 2(a)), we extract the human pose estimation result when thenetwork reaches the second ordered as shown in FIGURE 4(a). The left sub-image of FIGURE 4(a) shows the entire2D pose of FIGURE 2(a), and the right sub-image shows the heatmap of each human joint. From FIGURE 4(a), we findthat the stacked hourglass network did not predict the correct coordinates of the left ankle, left knee, right ankle orright knee when it reached the second order. Because the right ankle is within the self-occlusion portion of the inputimage, the fourth ordered stacked hourglass network estimates the coordinates of all joints except the right anklewhen the network reaches the fourth order (which is shown as FIGURE 4(b)).

3.2 | Optimizing time series motion data

When a 2D human pose is estimated in a video, there are two types of time series data – human centre position andthe displacement of each human joint. The time series data of a human centre position are estimated from continuous

Page 7: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 7

video frames by the SSD network, and the series data of human joints displacement are estimated from continuousvideo frames by the fourth ordered stacked hourglass network. These two types of time series data are approximatedas linear motion data; thus, the same method can be used to optimize both types of data.

The key to optimizing the time series data is to detect the outlier of these data and update the value of the outlier.In the proposed network, the outlier detection algorithm is employed based on the local outlier factor, denoted asLOF, to estimate the outlier of the input time series data and use a Kalman filter to update the value of the outlier.The LOF uses a KD-tree to perform a nearest neighbour search and to calculate the local density with the nearestneighbours to classify the outlier.

Before calculating the local density of each position, a KD-tree is built for all positions. Based on the KD-tree, wecan generate the k t h neighbours of each individual position. Given a time series of position P = [x , y ], time interval tand time period L, equation (1) is used to calculate the local density LOFk (A) at position P (xA, yA).

LOFk (A) = (

∑B∈Nk (A)

l d (B)

|Nk (A) |/l d (A) (1)

where Nk (A) are the k neighbors of position A and l d (A) is defined as follows:

l d (A) = 1/(

∑B∈Nk (A)

max {k _d i st ance(B), d (A,B)}|Nk (A) |

) (2)

where k _d i st ance(B) is the distance between position B and its k nearest neighbor, and d (A,B) is the distance be-tween position A and B . If LOFk (A) is above a threshold, position A is classified as an outlier.

Simultaneously, we use the Kalman filter to fit the positions of the human centre in all video frames. We modelthe human motion in a video with the following linear difference equation:

Pt = At ,t−1Pt−1 +wt−1 (3)

where Pt = (xt , yt ) is the human centre position at time t in period L, At ,t−1 is the state transition matrix, and wt−1 isthe random noise at t − 1. The Kalman filter’s observation equation is defined as follows:

O t = Ht Pt + Nt (4)

whereHt andNt are the observationmatrix and the observation noisematrix at time t , respectively. The state updatingequation and the state prediction equation of the Kalman filter are described in equations (5) and (6), where Kt is thegain matrix that deduced as equations (7) through (9). Q t andGt are the covariancematrices ofwt and Nt , respectively.

P̂ (t | t ) = P̂ (t | t − 1) + Kt[O t − Ht P̂ (t | t − 1)

](5)

P̂ (t + 1 | t ) = AP̂ (t | t − 1) + AKt[O t − Ht P̂ (t | t − 1)

](6)

Kt = τ−t H

Tt (Ht τ

−t H

Tt + Q t )

−1 (7)

τ−t = At ,t−1τ+t−1A

Tt ,t−1 +Gt−1 (8)

τ+t = (I − KtHt )τ−t (9)

Finally, our network optimizes the time series motion data P with the detection [out l i er ] and the predictionresults P ′ of the Kalman filter. The entire optimization is described in Algorithm 1.

Page 8: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

8 Tao Hu et al.

(a) Fitting trajectory of original time series motion data (b) Outlier detection based on LOF

F IGURE 5 Optimization results of time series positions.

Algorithm 1 Optimizing time series motion dataRequire: Time series motion data PEnsure: Optimized time series motion data P ′

1: initialize: Set P ′ = P2: Predict the [out l i er ] of P based LOF3: Predict the P̂ of P based on Kalman fliter4: Len = Leng th([out l i er ])

5: for i = 1, 2, ..., Len do6: I nd [i ] = out l i er [i ].i ndex

7: P ′(I nd [i ]) = P̂ (I nd [i ])

8: end for9: return:P ′

FIGURE 5 is an example of optimizing the time series motion data. Assuming there is a time series motion data,which are shown as blue points in FIGURE 5(a), random noises are added to some data (shown as red asterisks inFIGURE 5(a)). FIGURE 5(b) is the outlier detection based on LOF, where the outliers are labelled as red points. Thefinal optimization positions of the input time series data are shown as green blocks in FIGURE 5(a). We also fit thetrajectories for input, noise and optimization positions based on the polynomial least squares method, which is shownin FIGURE 5(a). Based on the fitting results, the proposed network can optimize the trajectory of time series motiondata.

4 | EXPERIMENTS

The proposed network was evaluated against the following two benchmarks: (1) the MPII human pose dataset for2D single human pose estimation in a still image and (2) the JHMDB Dataset for 2D human pose estimation in avideo. The proposed network is performed on a server with two NVIDIA GeForce GTX-1080Ti GPUs and 32GB RAM.The adaptive 2D human pose estimation method is first evaluated using the MPII human multi-person dataset. TheJHMDBdataset is employed to evaluate optimization time seriesmotion data. The 2D human poses ofmultiple people

Page 9: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 9

F IGURE 6 PCK comparison with proposed network and fourth ordered stacked hourglass network

in the videos are finally estimated.

4.1 | Testing in the MPII human pose dataset

Based on the fourth ordered stacked hourglass network, the basic 2D human pose estimating network must firstinitialize the human centre position within a still image. The test dataset of MPII contains 2940 images, and thehuman centre position of the images are labelled in the test dataset. To evaluate the performance of the proposednetwork, the network is compared with the basic fourth ordered stacked hourglass network and random noises ofinitial human centre positions are set in the test dataset. To verify the recognition rate of the human centre positionin an image, we train the SSD network with two training datasets: VOC0712 andMPII. Two types of recognition sizesare set in VOC0712: 300 × 300 and 512 × 512. The stacked hourglass network is trained by the MPII dataset.

First, a 2D human pose in the MPII test dataset is estimated with the ground truth of the human centre positionbased on the fourth ordered stacked hourglass network. Then, random noises are added to the ground truth of thehuman centre position in theMPII test dataset. The fourth ordered stacked hourglass network is employed to estimatethe 2D human pose in the MPII test dataset with a noisy human centre position. The different SSD networks withdifferent training datasest and different image sizes are used to predict the human centre position of each image inthe MPII test dataset. The predicted human centre positions are input parameters used to estimate the 2D humanpose in the MPII test dataset based on the fourth ordered stacked hourglass network.

We use the Percentage of Correct Keypoints, which is denoted as PCK, to evaluate the estimation percentageswithin a normalized distance of the ground truth. FIGURE 6 shows the visual PCK comparison results, which describesthe percentage of estimations on Head, Wrist, Ankle, Elbow, Knee and Shoulder.

The testing baseline of in the MPII human pose dataset is fourth ordered stacked hourglass network (Newell et al.(2016)). The quantitative comparison results are shown in TABLE 1. The visualization and quantitative comparisonresults confirm that the human centre position in an image is a very important data point in a stacked hourglass

Page 10: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

10 Tao Hu et al.

TABLE 1 PCK comparison results of proposed network

Method Head Knee Ankle Shoulder Elbow Wrist Average

The proposed method

(VOC_512)83.95% 69.33% 67.03% 82.89% 77.97% 73.38% 75.76%

The proposed method

(VOC_300)80.78% 65.77% 63.00% 79.45% 74.65% 70.95% 72.43%

The proposed method

(MPII_300)58.36% 46.60% 45.48% 56.62% 53.28% 50.93% 51.88%

Newell et al. (2016)

(SHN-noise1: [0.6,0.8])35.60% 12.75% 9.81% 31.29% 27.33% 28.28% 24.18%

Newell et al. (2016)

(SHN-noise3: [0.8,1.0])79.29% 58.41% 54.45% 75.44% 70.14% 67.02% 67.46%

Newell et al. (2016)

(SHN-noise2:[1.0,1.2])72.40% 59.68% 58.44% 71.19% 67.14% 63.89% 65.46%

Newell et al. (2016) (SHN) 96.78% 83.20% 80.39% 95.19% 89.14% 84.17% 88.15%

TABLE 2 Error analysis of FIGURE 5(a)

Original data Noise data Optimization data

SSE 0.0625 1.3405 0.5657

RMSE 0.0898 41.3312 7.3598

MSPE − 64.64% 64.58%

network to estimate the 2D human pose. When we use the VOC0712 with image size of 512×512, to train the SSDnetwork, the proposed network obtains good results for the PCK, for example the PCK of the Head can reach 83.95%and the PCK of a Shoulder can reach 82.89%. Although the PCK of Head has reached in 96.78% based on the stackedhourglass network, there are multiple-humans in the images of the MPII test dataset. If there are many humans in theimage, the SSD network can detect the humans’ positions. However, it cannot determine which human is labelled inthe image. This is why the PCK of the proposed network is lower than the stacked hourglass network. Additionally,the training dataset of used by Newell et al. (2016) disabled the shifting data augmentation of a human location, andtheir work Newell et al. (2016) has a very high recognition rate. The human centre labels in the MPII dataset werehidden in our work, so the performance of the proposed method is lower than the work preformed by Newell et al.(2016).

4.2 | Testing in the JHMDB dataset

The JHMDB dataset has annotated full body joints of human in a video frame. Additionally, it has a videos subset(called sub-JHMDB) that contains 316 clips distributed over 12 categories, such as clap, jump, and kickball. The sub-

Page 11: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 11

F IGURE 7 Qualitative results of 2D pose estimation in JHMDB

F IGURE 8 Qualitative results of 2D pose estimation in Internet videos

JHMDB dataset is used to evaluate the accuracy of estimating 2D human pose in a video. We also use the PCK toevaluate the proposed network on the sub-JHMDB dataset.

Given a visual original trajectory (the blue curve in FIGURE 5(a)) and a noise trajectory (the red curve in FIGURE5(a)), the optimization trajectory (the green curve in FIGURE 5(a)) is estimated by our optimization method of timeseries motion data. We calculate the MSPE between the original and noise trajectories in TABLE 2, and calculatethe Mean Square Percent Error (MSPE) between the original and optimization trajectories. The comparison betweentwoMSPEs confirms that the proposed optimization method of time series motion data could reduce the estimationerrors. To test the effectiveness of the proposed optimization method, the sum of squares due to errors (SSE) and theRoot Mean Squared Errors (RMSE) of the original, noise and optimization trajectories are also calculated; they are fitbased on polynomial least squares.

We compared our work to convolutional channel features (Iqbal et al. (2017a)), the spatial-temporal AOG model(Nie et al. (2015)) and the Thin-Slicing Network (Song et al. (2017)) in the sub-JHMDB dataset. The PCK of thekicking-ball and pullup categories of the sub-JHMDB are counted as shown in TABLE 3.

The average PCK of the sub-JHMDB dataset can reach 83.18% based on the proposed network without optimiza-

Page 12: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

12 Tao Hu et al.

TABLE 3 The PCKs of in sub-JHMDB dataset

method Shoulder Elbow Wrist Knee Ankle Average

Nie et al. (2015) 63.5% 32.5% 21.6% 62.7% 53.1% 46.68%

Song et al. (2017) 95.7% 87.5% 81.6% 92.7% 89.9% 89.46%

Iqbal et al. (2017a) 76.9% 59.3% 55.0% 76.4% 73.0% 68.12%

The proposed method

(without optimization)86.2% 76.6% 86.4% 77.8% 88.9% 83.18%

The proposed method

(with optimization)88.6% 76.8% 88.1% 77.9% 88.9% 84.06%

F IGURE 9 Qualitative results of 2D pose estimation of multi-person in Internet videos

tion (shown as Ours(without optimization) in TABLE 3) and 84.06%with optimization (shown as The proposed method(with optimization) in TABLE 3). Although the average PCK of the proposed method is lower than the Thin-SlicingNetwork (Song et al. (2017)), the proposed method has improved the PCK of a specific joint. For example, the PCK ofWrist reaches 88.1% based on the proposed method. The PCK of the Wrist is just 81.6% based on the Thin-SlicingNetwork. The proposed work preforms 66.5% better than the spatial-temporal AOG model (Nie et al. (2015)), 6.5%better than the Thin-Slicing Network (Song et al. (2017)), and 33.1% better than the convolutional channel features(Iqbal et al. (2017a)) in the sub-JHMDB dataset.

We show some qualitative results of the estimating 2D human pose results on the sub-JHMDB dataset in FIGURE7. FIGURE 8 shows other 2D human estimating results in Internet videos, which also confirmed that the proposednetwork can estimate good 2D human poses in video.

Page 13: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 13

4.3 | Testing for Multiple-people

Finally, we use the proposed network to estimate the 2Dmultiple-people in videos that are accessed from the Internet.FIGURE 9 shows the qualitative results of estimating 2D poses of multiple people in Internet videos based on theproposed network. Since there are no complex location crossovers of two epeeists in the testing video, the proposednetwork can accurately generate 2D human poses in the first row in FIGURE 9. The second and third rows in FIGURE 8are the 2D pose estimation results of pairs skating and a boxing matching. Additionally, the 2D pose estimation resultsof multiple people dancing are shown as the fourth and fifth rows in FIGURE 9. The results of FIGURE 9 indicate thatthe proposed network has a strong practicability to estimate 2D pose of multiple people for some common multi-person scenes.

5 | CONCLUSIONS

We presented an adaptive network with a stacked hourglass network and SSD for video pose estimation in this paper.We considered a critical component of our supposition – the continuity between each joint of a human pose in a video.We presented an optimization method of time series motion data based on LOF outlier detection and a Kalman filterin the proposed network. We achieved significant improvement in estimating accuracy. We showed that the proposednetwork contributed to a better utilization of human centre location and time series data. Finally, we demonstratedthe 2D human pose estimation visualization results with a single person and multiple people in some Internet videos.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments. This work was supportedby The Key Technological Innovation Projects of Hubei Province under Grants 2018AAA062 and 2018AKB035, andWuhan Science and Technology Plan Project under Grant 2017010201010109; and The National Key Research andDevelopment Program of China under Grant 2017YFB1002600, and National Natural Science Foundation of Chinaunder Grants 61672390, 61562025, 61972298, 61962019, and supported by TheHubei Chengguang Talented YouthDevelopment Foundation (HBCG).

referencesAndriluka, M., Pishchulin, L., Gehler, P. and Schiele, B. (2014) 2d human pose estimation: New benchmark and state of the art

analysis. In Computer Vision and Pattern Recognition, 3686–3693.

Cao, Z., Simon, T., Wei, S. E. and Sheikh, Y. (2017) Realtime multi-person 2d pose estimation using part affinity fields. In IEEEConference on Computer Vision and Pattern Recognition, 1302–1310.

Carreira, J., Agrawal, P., Fragkiadaki, K. and Malik, J. (2016) Human pose estimation with iterative error feedback. In ComputerVision and Pattern Recognition, 4733–4742.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C. and Bengio, Y. (2015) A recurrent latent variable model for sequentialdata. In Advances in neural information processing systems, 2980–2988.

Ding, B., Long, C., Zhang, L. and Xiao, C. (2019) Argan: attentive recurrent generative adversarial network for shadow detec-tion and removal. In Proceedings of the IEEE International Conference on Computer Vision, 10213–10222.

Fanaee-T, H. and Gama, J. (2015) Eigenspace method for spatiotemporal hotspot detection. Expert systems, 32, 454–464.

Page 14: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

14 Tao Hu et al.

Fieraru, M., Khoreva, A., Pishchulin, L. and Schiele, B. (2018) Learning to refine human pose estimation. In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition Workshops, 205–214.

Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M. and Tran, D. (2018) Detect-and-track: Efficient pose estimation in videos. InIEEE Conference on Computer Vision and Pattern Recognition.

Hernández, J., Cabido, R., Montemayor, A. S. and Pantrigo, J. J. (2014) Human activity recognition based on kinematic features.Expert Systems, 31, 345–353.

Iqbal, U., Garbade, M. and Gall, J. (2017a) Pose for action - action for pose. In 2017 12th IEEE International Conference onAutomatic Face & Gesture Recognition (FG 2017)(FG), vol. 00, 438–445.

Iqbal, U., Milan, A. and Gall, J. (2017b) Posetrack: Joint multi-person pose estimation and tracking. In Computer Vision andPattern Recognition, 4654–4663.

Jhuang, H., Gall, J., Zuffi, S., Schmid, C. and Black, M. J. (2013) Towards understanding action recognition. In International Conf.on Computer Vision (ICCV), 3192–3199.

Kim, H., Lee, S., Kim, Y., Lee, S., Lee, D., Ju, J. and Myung, H. (2016) Weighted joint-based human behavior recognitionalgorithm using only depth information for low-cost intelligent video-surveillance system. Expert Systems with Applications,45, 131–141.

Kocabas, M., Karagoz, S. and Akbas, E. (2018) Multiposenet: Fast multi-person pose estimation using pose residual network.In Proceedings of the European Conference on Computer Vision (ECCV), 417–433.

Koppula, H. S. and Saxena, A. (2016) Anticipating human activities using object affordances for reactive robotic response. IEEEtransactions on pattern analysis and machine intelligence, 38, 14–29.

Liang, H., Yuan, J. and Thalmann, D. (2012) 3d fingertip and palm tracking in depth image sequences. In ACM InternationalConference on Multimedia, 785–788.

Liang, X., Gong, K., Shen, X. and Lin, L. (2018) Look into person: Joint body parsing & pose estimation network and a newbenchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Liu, H. and Wang, L. (2018) Gesture recognition for human-robot collaboration: A review. International Journal of IndustrialErgonomics, 68, 355–367.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y. and Berg, A. C. (2016) Ssd: Single shot multibox detector. InEuropean Conference on Computer Vision, 21–37.

Lu, W., Wei, X., Xing, W. and Liu, W. (2017) Trajectory-based motion pattern analysis of crowds. Neurocomputing, 247, 213–223.

Luo, Y., Ren, J., Wang, Z., Sun, W., Pan, J., Liu, J., Pang, J. and Lin, L. (2018) Lstm pose machines. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, 5207–5215.

Neves, J., Narducci, F., Barra, S. and Proença, H. (2016) Biometric recognition in surveillance scenarios: a survey. ArtificialIntelligence Review, 46, 515–541.

Newell, A., Yang, K. and Deng, J. (2016) Stacked hourglass networks for human pose estimation. In European Conference onComputer Vision, 483–499.

Nguyen, T., Mann, G. K. I., Vardy, A. and Gosine, R. G. (2017) Developing a cubature multi-state constraint kalman filter forvisual-inertial navigation system. In Conference on Computer and Robot Vision, 321–328.

Nie, B. X., Xiong, C. and Zhu, S. C. (2015) Joint action recognition and pose estimation from video. In Computer Vision andPattern Recognition, 1293–1301.

Page 15: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

Tao Hu et al. 15

Oprea, S., Garcia-Garcia, A., Orts-Escolano, S., Villena-Martinez, V. and Castro-Vargas, J. A. (2018) A long short-term memorybased schaeffer gesture recognition system. Expert Systems, 35, e12247.

Pang, Z., Zhao, Y. and Xiao, C. (2015) Effective skeletons extraction for animated surfaces based on geometry propagation.Computer Animation and Virtual Worlds, 26, 301–309.

Pishchulin, L., Andriluka, M., Gehler, P. and Schiele, B. (2013) Strong appearance and expressive spatial models for humanpose estimation. In IEEE International Conference on Computer Vision, 3487–3494.

Ramakrishna, V., Munoz, D., Hebert, M., Bagnell, J. A. and Sheikh, Y. (2014) Pose Machines: Articulated Pose Estimation viaInference Machines. Springer International Publishing.

Rihan, J. and Torr, P. H. (2012) Fast human pose detection using randomized hierarchical cascades of rejectors. InternationalJournal of Computer Vision, 99, 25–52.

Rodger, J. A. (2018) Advances in multisensor information fusion: A markov–kalman viscosity fuzzy statistical predictor foranalysis of oxygen flow, diffusion, speed, temperature, and time metrics in cpap. Expert Systems, e12270.

Rogez, G. and Schmid, C. (2018) Image-based synthesis for deep 3d human pose estimation. International Journal of ComputerVision, 1–16.

Salehi, M., Leckie, C., Bezdek, J. C., Vaithianathan, T. and Zhang, X. (2016) Fast memory efficient local outlier detection in datastreams. IEEE Transactions on Knowledge & Data Engineering, 28, 3246–3260.

Simonyan, K. and Zisserman, A. (2015) Very deep convolutional networks for large-scale image recognition. In InternationalConference on Learning Representations.

Song, C., Pang, Z., Jing, X. and Xiao, C. (2018) Distance field guided l-1 median skeleton extraction. The Visual Computer, 34,243–255.

Song, J., Wang, L., Gool, L. V. and Hilliges, O. (2017) Thin-slicing network: A deep structured model for pose estimation invideos. In IEEE Conference on Computer Vision and Pattern Recognition, 5563–5572.

Toshev, A. and Szegedy, C. (2014) Deeppose: Human pose estimation via deep neural networks. In IEEE Conference on Com-puter Vision and Pattern Recognition, 1653–1660.

Wang, X., Liang, C., Chen, C., Chen, J., Wang, Z., Han, Z. and Xiao, C. (2019) S3d: Scalable pedestrian detection via score scalesurface discrimination. IEEE Transactions on Circuits and Systems for Video Technology.

Wei, S. E., Ramakrishna, V., Kanade, T. and Sheikh, Y. (2016) Convolutional pose machines. In Computer Vision and PatternRecognition, 4724–4732.

Yang, Y. and Ramanan, D. (2013) Articulated human detection with flexible mixtures of parts. IEEE Trans Pattern Anal MachIntell, 35, 2878–2890.

Tao Hu is working toward the PhD degree at the School of Computer, Wuhan University. Previously,he obtained his B.S. degree in software engineering from School of Computer at Wuhan Universityof Technology, Wuhan, in 2006. And he obtained his Mater degree in software engineering fromSchool of Software and Microelectronics at Peking University, Beijing, in 2009. His current researchinterests include deep learning, computer animation and image processing.

Page 16: An adaptive stacked hourglass network with Kalman ˝lter for …graphvision.whu.edu.cn/papers/hutao2020.pdf · 2020. 3. 9. · The stacked hourglass network (Newell et al. (2016))

16 Tao Hu et al.

Chunxia Xiao is currently a professor at the School of Computer, Wuhan University, China. Hisresearch areas include computer graphics, computer vision, machine learning, and image processing.He has conducted a considerable amount of research in geometry reconstruction, image editing,computational photography, image analysis and synthesis, and he has published more than 100papers in journals and conferences. He received his BSc and MSc degrees from the Mathematics

Department of HunanNormal University in 1999 and 2002, respectively, and received his Ph.D. from the State KeyLab of CAD & CG of Zhejiang University in 2006, China. He became an assistant professor at Wuhan Universityin 2006, and became a professor in 2011. During October 2006 to April 2007, he worked as a postdoc at HongKong University of Science and Technology, and During February 2012 to February 2013, he visited University ofCalifornia-Davis for one year.

GeyongMin is the Chair and Director of High Performance Computing and Networking (HPCN) Re-search Group at the University of Exeter, UK. His main research interests include Next-GenerationInternet, Wireless Networks, Mobile Computing, Cloud Computing, Big Data, Multimedia Systems,Information Security, System Modelling and Performance Optimization. He is an Editorial Boardmember of 8 international journals including IEEE Transactions on Computers and serves as the

Guest Editor for 18 international journals.

Noushin Najjari is a PhD student in Computer Science from College of Engineering, Mathematicsand Physical Sciences, University of Exeter. Noushin does research in Big Data andMachine Learn-ing.