
Computers & Graphics (2019)


TensorPose: Real-time pose estimation for interactive applications

Luiz José Schirmer Silva a, Djalma Lúcio Soares da Silva a, Alberto Raposo a, Luiz Velho b, Hélio Lopes a

a PUC-Rio - Pontifícia Universidade Católica do Rio de Janeiro
b IMPA - Instituto Nacional de Matemática Pura e Aplicada, Rio de Janeiro, Brazil

e-mail: [email protected] (Luiz José Schirmer Silva)

ARTICLE INFO

Article history: Received August 23, 2019

Keywords: Convolutional Neural Networks, Pose estimation, Real-time applications

ABSTRACT

The state of the art has outstanding results for 2D multi-person pose estimation using multi-stage Deep Neural Networks in images with high accuracy. However, the use of these models in real-time applications may be impractical, not just because they are computationally intensive, but also because they suffer from flickering, from the inability to capture temporal correlations among video frames, and from image degradation. To tackle these problems, we expand the use of pose estimation to motion capture in interactive applications. To do so, we propose a novel deep neural network with a streamlined architecture and tensor decomposition for pose estimation with improved processing time, named TensorPose. We introduce an architecture for markerless motion capture using Convolutional Neural Networks combined with sparse optical flow and Kalman Filters. We also apply this architecture in a multi-user environment, based on the Holojam framework, where it is possible to create simultaneous collaborative experiences.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Pose estimation is a challenging problem in computer vision with many real applications in areas including augmented reality, virtual reality, computer animation and 3D scene reconstruction [1, 2, 3, 4, 5]. Usually, the problem to be addressed involves estimating the 2D human pose, i.e., the anatomical keypoints or body parts of persons in images or videos [6]. Recent works have presented great results for multi-person pose estimation in images, but still have some issues when considering videos [7]. These approaches can achieve high accuracy using architectures based on conventional convolutional neural networks; however, mistakes caused by occlusion and motion blur are not uncommon, and those models are computationally very intensive for real-time applications [7, 5]. Also, despite some related work to increase performance, few papers explore the use of temporal coherence and the possibility of using such machine learning models for real-time applications involving animation [8].

OpenPose is considered the state-of-the-art approach for multi-person pose estimation, but it does not achieve the desired performance in terms of frames per second, which makes it difficult to use in interactive applications that require frame rates close to or above 30 FPS. To increase the OpenPose performance, we propose a new pose estimation model, called TensorPose. In our model, we replace conventional convolution operations by successive pointwise and regular convolutions in a reduced space. As we will show, the proposed modifications are analogous to applying a tensor decomposition, more specifically a high-order singular value decomposition (HOSVD). Another issue that we study is temporal coherence, which is not present in previous works since they do not consider the relationship between the processed frames. To track the detected persons in videos and 3D captures, we use sparse optical flow and a Kalman filter to smooth the movement.

In this work, we also propose the creation of an architecture that uses markerless motion capture for real-time applications. Such an architecture brings us the possibility of creating environments for the development of shared and distributed applications. With the collected information, our architecture easily composes an environment for the development of applications that can involve motion tracking using WebGL or a game engine. This software infrastructure can be used to implement multi-user interaction and positional tracking of people and objects, to create a complete sense of immersion in a tangible space, where it is possible to bridge the gap between the physical and virtual spaces. In other words, users in different physical environments and using various devices can share and explore a unique and integrated virtual environment.

In summary, the main contributions of this work are:

• a novel deep neural network with streamlined architecture and tensor decomposition for pose estimation with improved processing time;

• the extension of our model with Sparse Optical Flow and Kalman Filters for motion tracking; and

• an architecture for multi-user applications, where it is possible to create simultaneous collaborative experiences based on the Holojam Framework.

The paper is structured as follows. Section 2 presents the related work on pose estimation and motion capture. Section 3 gives an overview of the pose estimation task using a neural network. Section 4 presents the main properties of and operations with tensors, as well as the decomposition that we use in our approach. Section 5 introduces the main aspects of our deep learning model and the software architecture for interactive applications. Section 6 shows the main aspects of virtual environment communication based on the Holojam Framework. Section 7 shows the results and, finally, Section 8 concludes the work by discussing our contributions and proposing future work.

2. Related Work

2.1. Pose Machines and Tracking

Ramakrishna et al. [9] proposed an inference machine framework and presented a method for articulated human pose estimation, called Pose Machines. A pose machine consists of a sequence of multi-class predictors that is trained to predict the location of each part in each level of the hierarchy. Their method has the benefits of implicitly learning long-range dependencies between image and multi-part cues, tight integration between learning and inference, and a modular sequential design. The model was built on the inference machine framework so that it could learn reliable interconnections between body parts.

Wei et al. [10] introduced the Convolutional Pose Machines (CPMs) for the task of pose estimation. CPMs expand the main idea of pose machine architectures and combine them with the advantages of convolutional neural networks. Their contributions were the ability to learn feature representations for both image and spatial context directly from data, and the ability to handle large training datasets efficiently. CPMs consist of a sequence of convolutional networks that repeatedly produce 2D confidence maps for the location of each body part. At each stage in a CPM, image features and the confidence maps produced by the previous stage are used as input. As stated by Wei et al. [10], the confidence maps provide the subsequent stage with an expressive non-parametric encoding of the spatial uncertainty of location for each part, allowing the CPM to learn rich image-dependent spatial models of the relationships between parts.

Cao et al. [6] proposed an efficient method for 2D multi-person pose estimation with high accuracy on multiple public datasets, extending the original Convolutional Pose Machines. They created a bottom-up representation of association scores via Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. They show that their approach encodes global context sufficiently well to allow for a combinatorial optimization algorithm that solves an assignment problem to detect a person's skeleton. However, they did not consider temporal coherence in the original work. When considering a video, there is no relation between persons detected in multiple frames. Also, this approach is computationally intense, demanding a lot of processing power. Following their algorithm and considering their benchmarks, we achieved only approximately 10 FPS using an Nvidia Titan Xp GPU.

Regarding the tracking problem, one solution is to use dense optical flow to adjust the predicted positions and generate a smooth movement across frames. The Thin-Slicing Network [11] achieved excellent results with this strategy. However, this system is computationally very intensive and is slower than the approaches proposed in other papers.

Luo et al. [7] achieved some improvements in terms of both accuracy and efficiency. They proposed a recurrent CNN model with LSTM for video pose estimation. Although it considers occlusion, some incorrect predictions can be detected when the joint is not visible for an extended period. Also, their tests do not consider multi-person pose estimation.

2.2. Motion Capture and Mesh Recovery

Concerning motion capture, recent papers have presented promising solutions. Shafaei et al. [12] showed an efficient and inexpensive approach to markerless motion capture using a set of sensors, like the Microsoft Kinect. They split the multiview pose estimation task into three subproblems of dense classification, view aggregation, and pose estimation. Their approach also runs in real time. They do not assume a shape model or a specific application, such as home entertainment, making the method applicable to a wider range of scenarios. However, their method considers only one person and not a multi-user system. Also, they use depth sensors like the Kinect, and not regular cameras or stereo cameras as our architecture does.

Tome et al. [13] proposed a CNN-based approach for multi-camera markerless motion capture of the human body. Their technique makes use of 3D reasoning throughout a multi-stage procedure. They extended existing works on single-view reconstruction to a multi-camera setting and showed how such single-view methods can be enhanced by training on the multiview-based annotation of unlabelled data. However, this technique has a high computational cost and, for multi-user detection, cannot be used in real-time applications.

Mehta et al. [8] introduced the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. However, their approach suffers when there are significant movement shifts between frames, as well as occlusion. Also, the method is only capable of tracking a single person.

Kanazawa et al. [14] show an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. They used the generative human body model SMPL [15], which parameterizes the mesh by 3D joint angles and a low-dimensional linear shape space. Also, an adversarial model is presented to adjust the mesh, determining if a pose can be considered acceptable. An image is passed through a convolutional encoder and sent to a 3D regression module that infers the 3D representation of the human. The 3D parameters are also sent to a discriminator, whose goal is to tell if these parameters come from a real human shape and pose. This mesh can be useful to generate humanoid animations, which can immediately be used by animators, modified, measured, manipulated, and retargeted. Again, this approach has a high computational cost and only considers single-person detection.

3. Pose estimation model

This section describes the main aspects of the architecture proposed by Cao et al. [6]. The network architecture consists of a feedforward neural network that predicts a set S of 2D "confidence maps", from which the body parts are located in an image, and also a set L of 2D vector fields that help identify the connection between two body parts to generate a skeleton. The set S = {S_1, S_2, ..., S_n} has n confidence maps, one for each part of the body (hand, elbow, head, etc.), and the set L = {L_1, L_2, ..., L_C} has C 2D vector fields, each used to construct the skeleton by identifying some member of it: an arm or a leg, for example. N candidates for each body part are generated, creating for each pair a bipartite graph. Finally, the results are analyzed following a greedy strategy, using the Hungarian algorithm with some modifications to create a connection between two parts. This process is performed in each frame.

The neural network is divided into two branches: one responsible for generating the confidence maps, and the other, the affinity fields. Also, these two branches are divided into t stages, where the authors of the original paper set t = 6 stages. At first, the images are processed by the first ten layers of a VGG-19 convolutional network, generating a set F of features that will be used in the subsequent stages of OpenPose. These features are used as input for the other layers of the network and, as a result, they yield a set S of confidence maps and a set L of affinity fields (vector fields). Let ρ and Φ represent the convolution layers for the confidence maps and the affinity fields, respectively. At the end of this step, the results are concatenated with the initial set F to produce refined results. This process is performed for each subsequent stage.

To guide the network in its predictions, two objective functions are given at the end of each stage, one for each branch. The objective functions for both branches are based on the distance between the network output data and the ground truth generated for the confidence maps and vector fields, according to Equations 1 and 2 [6]:

f_S^t = \sum_{j=1}^{J} \sum_{p} W(p) \, \| S_j^t(p) - S_j^*(p) \|_2^2 ;   (1)

f_L^t = \sum_{c=1}^{C} \sum_{p} W(p) \, \| L_c^t(p) - L_c^*(p) \|_2^2 ,   (2)

where S_j^* and L_c^* represent the ground truth of the annotated data; J and C are, respectively, the number of confidence maps, one for each part of the body, and the number of vector fields. W(p) is a binary mask with W(p) = 0 when there is no annotated data at the point p of the image. The mask is used because there are images in the training dataset that are not entirely annotated, so it prevents true positive data from being penalized. The overall loss function is defined by the sum of Equations 1 and 2.
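
As an illustration, the masked L2 loss of Equation 1 can be written in a few lines of numpy. This is a minimal sketch, not the authors' training code; the array names (pred_maps, gt_maps, mask) are hypothetical, and a real implementation would evaluate this inside the training framework:

import numpy as np

def stage_loss(pred_maps, gt_maps, mask):
    """Masked L2 loss over J confidence maps of shape (J, H, W).

    mask has shape (H, W) and is 0 where the image has no annotation,
    so unlabeled regions do not contribute to the loss (Eq. 1).
    """
    diff = pred_maps - gt_maps       # (J, H, W)
    sq = np.sum(diff ** 2, axis=0)   # squared error summed over parts
    return np.sum(mask * sq)

The same function applies to the PAF branch of Equation 2 by passing the 2C vector-field channels instead of the J confidence maps.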

The training images, as well as their annotations, were obtained through the COCO (Common Objects in COntext) dataset [16], using more than 100,000 images for the task.

Following the architecture model, they first analyze the set S to identify the possible candidates for a part of the body. In total, 18 confidence maps are generated by branch 1, defined by the convolutional network ρ. Each confidence map of the set S can be viewed as a heatmap. To extract candidates, a technique called non-maximum suppression is used to identify the positions of potential candidates by finding the peaks in the image.

the real connections between them. Therefore, for each con- 80

nection, is built a complete bipartite graph where each node is 81

a part of the body, and each edge is a connection representing 82

an arm or a leg, for example. To find the right relationship, 83

they face a well-known problem from graph theory that is find- 84

ing the best match between the vertices of a bipartite graph: an 85

assignment problem. For this task, the model uses the vector 86

fields that were generated by the output of Φ of the network. 87

An integral line is computed along the segment described by 88

each pair of candidates, and since the integral line shows the in- 89

fluence of a given field along a line. Equation 3 shows how they 90

compute the score of each segment represented by each pair of 91

body parts. Given two points, d1 and d2, as candidates for two 92

parts of a skeleton forming a connection, the score of an edge 93

connecting these two points is: 94

E =

∫ u=1

u=0Lc (p (u)) .

d j2 − d j1

||d j2 − d j1||2du (3)

where Lc represents the vector field and p(u) interpolates the 95

position between two body parts d j1 e d j2, 96

p (u) = (1 − u) d j1 + ud j2. (4)
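
In practice, the integral of Equation 3 is approximated by sampling the field at a fixed number of points along the segment. The following sketch shows one possible discretization in numpy; it is an illustration under the assumption that paf holds the two channels of L_c, not a transcription of the original implementation:

import numpy as np

def paf_score(paf, d1, d2, n_samples=10):
    """Approximate Eq. 3: average projection of the PAF onto the limb direction.

    paf: array (2, H, W) with the x- and y-components of the field L_c.
    d1, d2: (x, y) candidate positions of the two body parts.
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    v = v / norm                                # unit limb direction
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):  # p(u) of Eq. 4
        x, y = ((1.0 - u) * d1 + u * d2).astype(int)
        score += paf[0, y, x] * v[0] + paf[1, y, x] * v[1]
    return score / n_samples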

Afterward, to obtain the optimal solution, they used a classical method called the Hungarian Algorithm. Regarding the detection of multiple people, determining the pose of each one becomes an n-dimensional matching problem. Since this is an NP-hard problem, some restrictions can be made. First, a minimum number of vertices for each body segment is chosen to generate a spanning tree instead of having the complete graph. Afterward, the problem is decomposed into a set of bipartite graphs to determine the matching between adjacent nodes independently. With all the connections defined, they take the union of all the connections that share common vertices to create the skeleton as a whole. Figure 1 illustrates the final result of the pose estimation.

Fig. 1. 2D pose estimation example.
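
For each limb type, the pairwise scores of Equation 3 form a score matrix between the candidates of the two adjacent parts, and the matching can be obtained with an off-the-shelf solver. A minimal sketch, assuming scores is such a matrix (higher is better), could use scipy's Hungarian-algorithm implementation:

import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: PAF score between candidate i of part A and candidate j of part B
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

# linear_sum_assignment minimizes cost, so negate the scores to maximize them
rows, cols = linear_sum_assignment(-scores)
pairs = list(zip(rows.tolist(), cols.tolist()))  # [(0, 0), (1, 1)]

Note that the original work applies the Hungarian algorithm with some modifications and a greedy strategy, so this call is only the textbook version of that step.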

4. Tensor Decompositions and Convolutional Neural Networks

This section presents tensor analysis and its relation to Convolutional Neural Networks. This can lead us to better understand the connection between the HOSVD theory and factorized convolutions. There is a rising interest in exploring efficient architectures for Neural Networks, either for use in embedded device applications or for better allocation of resources and services, reducing network latency and size [17]. Several papers propose different architectures for Convolutional Neural Networks based on factorized convolutions [17, 18, 19], and, in our opinion, the hidden theory behind these applications is related to tensor algebra and high-order decompositions, although few papers discuss these fundamentals.

A 2D convolutional layer of a neural network can be described as a 4-dimensional tensor, where its dimensions are defined by the number of columns and rows, the number of channels of an input image, i.e., the RGB channels, and the number of output channels. In this work, we are interested in improving the pose estimation network for fast applications. To do so, we decompose convolutional layers into smaller tensors, i.e., we make approximations of these layers. This approximation is essential since it reduces the number of operations performed by each layer and, in this way, accelerates our neural network model. The following subsections discuss general aspects of tensors as well as the Tucker decomposition [20, 21, 22, 23].

4.1. Tensor Properties

A tensor can be seen as a high-dimensional matrix, i.e., one with three or more dimensions. The order of a tensor T is the number of its dimensions, also known as ways or modes [22]. In a manner analogous to matrices' rows and columns, tensors have fibers. A matrix column is a mode-1 fiber and a matrix row is a mode-2 fiber [22, 23]. Third-order tensors have column, row, and depth fibers, for example.

In tensor analysis, there are operators that create subarrays by fixing some of a given tensor's indices. In this way, when we fix all indices leaving only one free, we create a fiber. In the same way, we create slices if we fix all indices but leave two free. For a third-order tensor, the frontal, lateral and horizontal slices are obtained by fixing the third, the second and the first indices, respectively.

Unfolding is similar to flattening a matrix: we stack the fibers of a tensor in a given way to obtain a matrix representation [24, 20, 22, 25]. For a tensor T ∈ R^{I_1 × I_2 × ... × I_N}, its mode-n unfolding is a matrix T_[n] ∈ R^{I_n × M}, with M = \prod_{k=1, k \neq n}^{N} I_k, mapping from element (i_1, i_2, ..., i_N) to (i_n, j), with j = \sum_{k=1, k \neq n}^{N} i_k \prod_{m=k+1}^{N} I_m [22].

For example, if we have a 3D tensor T with dimensions (3, 3, 2) defined by the frontal slices:

T_1 = [ 1 2 3 ; 4 5 6 ; 7 8 9 ] ,   T_2 = [ 10 11 12 ; 13 14 15 ; 16 17 18 ]

the mode-1 unfolding matrix is by definition given by:

T_[mode-1] = [ 1 10 2 11 3 12 ; 4 13 5 14 6 15 ; 7 16 8 17 9 18 ]
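
This interleaved layout corresponds to flattening all modes except the first in C order, so the example above can be checked directly in numpy. A small sketch, assuming the slice-stacking convention used in the text:

import numpy as np

T1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
T2 = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])
T = np.stack([T1, T2], axis=2)   # shape (3, 3, 2), frontal slices T1 and T2

unfold_1 = T.reshape(3, -1)      # mode-1 unfolding
# [[ 1 10  2 11  3 12]
#  [ 4 13  5 14  6 15]
#  [ 7 16  8 17  9 18]]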

The rank of a tensor T is the smallest number of rank-one tensors that generate T as their sum [22]. A rank-one tensor is a mode-N tensor that can be written as the outer product of N vectors [22, 23], as we can see in Equation 5. In other words, each element of T ∈ R^{I_1 × I_2 × ... × I_N} is the product of the corresponding vector elements, as defined by Equation 6:

T = v^{(1)} \circ v^{(2)} \circ \cdots \circ v^{(N)} ,   (5)

t_{i_1 i_2 ... i_N} = v^{(1)}_{i_1} v^{(2)}_{i_2} \cdots v^{(N)}_{i_N} .   (6)

The mode-n product of a tensor T ∈ R^{I_1 × I_2 × ... × I_N} by a matrix M ∈ R^{J_n × I_n} is a tensor X = T ×_n M with dimensions (I_1 × I_2 × ... × I_{n-1} × J_n × I_{n+1} × ... × I_N), whose entries are given by Equation 7 [26]:

x_{i_1 ... i_{n-1} \, j_n \, i_{n+1} ... i_N} = \sum_{i_n} t_{i_1 ... i_N} \, m_{j_n i_n} .   (7)
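
A direct way to compute the mode-n product in numpy is to contract the n-th axis of the tensor with the columns of the matrix and move the new axis back into place. A minimal sketch, our own illustration rather than code from the paper:

import numpy as np

def mode_n_product(T, M, n):
    """Compute X = T x_n M for M of shape (J_n, I_n), following Eq. 7."""
    # Contract axis n of T with axis 1 of M; the new J_n axis lands at the end.
    X = np.tensordot(T, M, axes=(n, 1))
    # Move the J_n axis from the last position back to position n.
    return np.moveaxis(X, -1, n)

T = np.random.rand(3, 4, 5)
M = np.random.rand(7, 4)
print(mode_n_product(T, M, 1).shape)   # (3, 7, 5)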

Another property is the Kruskal form of a tensor. It is similar to the matrix property whereby a sum of outer products of two vectors can generate a matrix M. Let us consider a single tensor T. We can generate it as a sum of outer products of N vectors, where the number of terms in the sum is called the Kruskal rank of the tensor [22].

4.2. Tucker Decomposition

A Singular Value Decomposition of a 2D matrix, i.e., a 2-order tensor, can be defined as in Equation 8:

M = U \Sigma V^T ,   (8)

where U is an m × m unitary matrix, Σ is an m × n rectangular diagonal matrix, and V^T is the conjugate transpose of an n × n unitary matrix. This equation can be rewritten as follows, in Equation 9 [26]:

M = C \times_1 U^1 \times_2 U^2 ,   (9)

where U^1 = (u^1_1, u^1_2, ..., u^1_{I_1}) and U^2 = (u^2_1, u^2_2, ..., u^2_{I_2}) are unitary matrices, and C is a core matrix (I_1 × I_2).

The SVD can be generalized to high-order tensors, where this approach is also known as High Order SVD (HOSVD) or Tucker decomposition [26]. Such a mathematical tool considers the orthonormal spaces associated with the different modes of a tensor [21]. For example, Equation 9 can be extended to 3D tensors as follows:

M = C \times_1 U^1 \times_2 U^2 \times_3 U^3 ,   (10)

where U^1, U^2, U^3 contain the 1-mode, 2-mode, and 3-mode singular vectors, respectively, related to the column spaces of the M_{mode-1}, M_{mode-2}, and M_{mode-3} matrix unfoldings, and C is a core tensor with the orthogonality property [26].

We can now apply these Tucker decomposition concepts to our domain [21]. Consider a mode-4 kernel tensor T, which can be seen as a kernel in a convolution layer. All operations for its decomposition can be written in the form described in Equation 11 [21]:

T_{x,y,z,k} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} C_{r_1,r_2,r_3,r_4} \, U^1_{x,r_1} U^2_{y,r_2} U^3_{z,r_3} U^4_{k,r_4} ,   (11)

where C is a core tensor of size (R_1 × R_2 × R_3 × R_4) and the U^* matrices are factor matrices of sizes X × R_1, Y × R_2, Z × R_3, and K × R_4, respectively. In our domain, each dimension is associated with the number of columns and rows, the RGB channels, and the number of output channels in a convolutional layer of a neural network. Although this holds when the input is an RGB image, in successive layers the dimensions and the number of channels vary. Thus, in intermediate layers, our tensors are defined by the spatial dimensions X and Y of the input, the number of filters of the input data, i.e., the "depth", and the number of filters of the output data of the layer. Considering this, the decomposition is performed over the weights of each convolutional layer in our Neural Network and, as we can see, the convolution kernels are 4D tensors. We can boost the speed of a neural network by creating several factorized approximations of regular convolutions with only a small reduction in accuracy. The next section presents the details of our model as well as its relation to tensor decompositions.

5. TensorPose Network and Architecture

In this section, we describe the main aspects of our deep convolutional neural network model and the architecture for interactive applications. At first, we focus on the new CNN model. Secondly, we present the main aspects of our architecture for interactive applications. We describe the modules for inference of 2D and 3D data and for distributed applications.

5.1. Network Model and Skeleton Tracking

In real-time applications, efficiency and approaches for reducing the computational cost are essential. Concerning the model of Cao et al. [6], we propose a new deep neural network based on a streamlined architecture and tensor decompositions. Initially, we modified the generation of the first features using the Depthwise Separable Convolutions of MobileNet [17], which is typically used for embedded devices. We replace the original layers of the VGG network because their performance in this step represents the first bottleneck. Here, we decrease the number of operations required in the convolution process, making an approximation through successive convolutions. The MobileNet model is based on a form of factorized convolutions that transforms a standard convolution into a depthwise and a pointwise convolution [17]. Depthwise Separable Convolution deals not only with the spatial characteristics of the image but also with depth, i.e., the RGB channels. In this process, a kernel is separated into two others to perform two convolutions: a depthwise and a pointwise one. Essentially, the depthwise convolution applies a single filter to each input channel and the pointwise convolution applies a 1x1 kernel to combine the outputs.
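
As an illustration, a depthwise separable block can be expressed in a few lines of PyTorch. This is a generic sketch of the technique, not the exact MobileNet block (which also interleaves batch normalization and ReLU):

import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        # Pointwise: 1x1 convolution combines the per-channel outputs
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )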

A Depthwise Separable Convolution requires approximately 8 to 9 times less computation than a standard convolution, with only a small reduction in accuracy [17]. Based on the architecture of MobileNetV2 [27], we create 12 layers of depthwise separable convolutions instead of regular ones for feature extraction.

Following this same idea, we create another lightweight structure for the intermediate layers of the network, intending to increase FPS performance. In the first three stages, we perform approximations using tensors to speed up the runtime without significant losses in network accuracy. Each intermediate convolution stage was replaced by a block consisting of a pointwise convolution, a standard convolution in a reduced space, and another pointwise convolution. For example, in the intermediate layers of our network, convolutions with n = 128 input and output channels, kernel 3x3 and stride 1x1 were replaced by a block containing a pointwise convolution with 128 input channels and 28 output channels, a regular 3x3 convolution with 28 input and 32 output channels, and another pointwise convolution with 32 input channels and 128 output channels, as sketched below. We use this approach to change stages 1 to 3 of our network.
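
In PyTorch-like form, the replacement block described above could look as follows. This is a minimal illustration with the channel sizes quoted in the text; the original implementation details may differ:

import torch.nn as nn

def factorized_block(ch=128, r3=28, r4=32):
    return nn.Sequential(
        nn.Conv2d(ch, r3, kernel_size=1),             # pointwise: 128 -> 28
        nn.Conv2d(r3, r4, kernel_size=3, padding=1),  # 3x3 conv in reduced space
        nn.Conv2d(r4, ch, kernel_size=1),             # pointwise: 32 -> 128
    )

The ranks r3 and r4 play the role of R_3 and R_4 in the Tucker decomposition of Equation 11.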

This can be seen as analogous to the approximation of a regular convolution described by Kim et al. [21], where a high-order SVD is used for this task. Similarly to the depthwise separable convolutions, we split a regular convolution into three others, which drastically reduces the computation and the model size. Considering the tensor decomposition theory stated in the previous sections, let us relate it to convolutional layers. A regular convolution maps an input tensor into another one of different size by successive operations, as we can see in Equation 12:

conv(x, y, z) = \sum_{i} \sum_{j} \sum_{k} \Theta_{(i,j,k,z)} \, W_{(x-i, \, y-j, \, k)} ,   (12)

where Θ is a kernel of size I × J × K × Z (for example, a kernel with size I × J = 3 × 3, K = 3 channels (RGB), and Z = 128 convolution filters) and W is an input tensor of size X × Y × K, typically an image, with X and Y the image dimensions and K the number of channels. Replacing Θ by our approximation, this approach can be seen as a kernel tensor decomposition [21], as shown in Equation 13:

conv(x, y, z) = \sum_{i} \sum_{j} \sum_{k} approx \; W_{(x-i, \, y-j, \, k)} .   (13)

Regarding Equation 11 in the previous section, Equation 14 defines the tensor named approx, computed by the HOSVD decomposition [20]:

approx = \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} C_{i,j,r_3,r_4} \, U_{k,r_3} \, U_{z,r_4} ,   (14)

where U_{k,r_3} and U_{z,r_4} are the factor matrices of sizes K × R_3 and Z × R_4, respectively, and C is a core tensor of size (I, J, R_3, R_4). Notice that this equation uses only the R_3 and R_4 ranks. In our application, we suppress mode-1 (R_1) and mode-2 (R_2), which are associated with the spatial dimensions of an image, because they are quite small [21].
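
The truncated decomposition of Equation 14 can be computed by taking the leading left singular vectors of the mode-3 and mode-4 unfoldings of the kernel and contracting them back into a core tensor. The sketch below is our own numpy illustration of this partial Tucker step, under the kernel layout (I, J, K, Z) used in the text:

import numpy as np

def partial_hosvd(kernel, R3, R4):
    """Tucker-2 approximation of a conv kernel (I, J, K, Z), as in Eq. 14.

    Only the input-channel (K) and output-channel (Z) modes are truncated;
    the small spatial modes are kept intact.
    """
    I, J, K, Z = kernel.shape
    # Mode-3 unfolding (K x IJZ) and its leading R3 left singular vectors.
    Uk = np.linalg.svd(np.moveaxis(kernel, 2, 0).reshape(K, -1))[0][:, :R3]
    # Mode-4 unfolding (Z x IJK) and its leading R4 left singular vectors.
    Uz = np.linalg.svd(np.moveaxis(kernel, 3, 0).reshape(Z, -1))[0][:, :R4]
    # Core tensor C of size (I, J, R3, R4): project both channel modes.
    core = np.einsum('ijkz,kr,zs->ijrs', kernel, Uk, Uz)
    return core, Uk, Uz

Here core corresponds to the reduced-space 3x3 convolution and Uk, Uz to the two pointwise convolutions of the block.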

When replacing Θ in Equation 12 by Equation 14, we can rearrange the operations, obtaining three convolutional operations that approximate the original one: two pointwise convolutions and one regular convolution in a reduced space. This leads us to the convolutional block previously described. Here, the first pointwise convolution reduces the number of channels from K to R_3; the regular convolution, formerly with K input and Z output channels, now has R_3 input channels and R_4 output channels; finally, the last pointwise convolution restores the original output size. This process not only speeds up our application, but also reduces the number of parameters needed.

A regular convolution requires D^2 Z K X Y multiply-add operations, where D^2 refers to the dimensions of our 3 × 3 kernel and X and Y refer to the spatial dimensions of our data. Considering the decomposition, our approach has a speed-up ratio S defined by Equation 15 [21]:

S = \frac{D^2 Z K X Y}{K R_3 X Y + D^2 R_3 R_4 X Y + Z R_4 X Y} .   (15)

Regarding our model, each decomposed layer requires approximately 9.3 times fewer operations than a regular convolution.
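
Plugging the channel sizes quoted earlier (D = 3, K = Z = 128, R_3 = 28, R_4 = 32) into Equation 15 reproduces this figure; the spatial factor XY cancels out:

# Speed-up ratio of Eq. 15 for the block used in stages 1-3
D, K, Z, R3, R4 = 3, 128, 128, 28, 32
S = (D**2 * Z * K) / (K * R3 + D**2 * R3 * R4 + Z * R4)
print(round(S, 2))   # 9.37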

Figure 2 represents the architecture of our network. Here, we have more convolutional layers, but the number of operations and weights is smaller, since each regular convolution is substituted by three factorized ones. As we can see in Figure 2, for stage 1 the overall number of convolutions, considering all blocks, is 3 × 3 + 2, and for stages 2 and 3 it is 5 × 3 + 2. We defined this model for the first three stages after the feature extraction in an empirical way, where we did not observe significant differences in the accuracy of our network when comparing it to OpenPose. If we extend this approach to the other stages, for example to stage 4, we verify a more significant difference in accuracy, of more than 25%, leading to wrong inference results. Also, if we apply it to all 6 stages, we face an instability problem [21, 28], where it is difficult to find a good learning rate as well as a good weight initialization. We believe this is an effect of applying stacked decomposed layers to the model and, for this reason, we limited the number of decomposed stages to the first three. Figure 3 shows some inference results using our network. As we can see, in most cases we achieve good detection performance, despite some small errors under large movement shifts and blurred images.

Despite the frame rate issues, another problem identified in OpenPose was the lack of temporal coherence in the original paper. In other words, there is no relation between the objects detected in one frame and the ones detected in subsequent frames. Therefore, the reference to each identified person is lost. For the context of our applications, it is necessary to create a module for this task. Temporal coherence enables us to actually implement real motion tracking, where the temporal correspondence between each pair of frames needs to be preserved, considering motion coherence between spatial neighbors and across the temporal dimension [29]. We can also use this to fix the tracking on a single person, removing problems involving people interacting, for example.

To solve this problem, in a first experiment, we created a module for temporal coherence using Optical Flow. Optical flow is the pattern of apparent motion of objects in images between two consecutive frames caused by the movement of an object or the camera. In other words, it is an approximation of image motion based upon local derivatives in a given sequence of images. Some patterns can affect the sequence of images, causing temporal variations in brightness. In sum, it specifies how image pixels move between subsequent frames.

Firstly, we tested the Kanade-Lucas-Tomasi (KLT) algorithm [30], and our preliminary results presented not only an increase in the frame rate of the technique, but also smoothness in the motion capture. For the KLT algorithm, we pass the points of the skeletons detected by the network. At each interval of t frames, these points are iteratively tracked through the optical flow, where the previous frame, the positions of the previous points, and the next frame are passed to the detection function. With each new frame, the location of each point of the tracked skeletons is updated. If any position or point is lost in the process, the tracking is stopped; the network is then executed again and the skeletons are recalculated. The same is done at the end of the interval of t frames to guarantee the consistency of the tracking. Normally we use an interval of 10 frames, for which we did not detect a greater tracking error when compared with running the technique frame by frame.
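
The KLT step maps directly onto OpenCV's pyramidal Lucas-Kanade routine. Below is a minimal sketch of tracking skeleton keypoints between two frames; it is our own illustration, and the variable names and parameter values are assumptions rather than the paper's settings:

import numpy as np
import cv2

def track_keypoints(prev_gray, next_gray, points):
    """Track skeleton keypoints from prev_gray to next_gray with KLT."""
    pts = np.float32(points).reshape(-1, 1, 2)
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None,
        winSize=(21, 21), maxLevel=3)
    # If any point is lost, signal that the CNN must be run again.
    if status.min() == 0:
        return None
    return new_pts.reshape(-1, 2)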

Fig. 2. TensorPose CNN model. The initial features are extracted using the first 12 layers of a MobileNetV2, and in the intermediate layers we replace conventional convolutions by a block containing a pointwise convolution, a regular convolution in a reduced space, and another pointwise convolution. We have more convolutional layers, but the number of operations and weights is smaller, which leads to a boost in performance. At the end of each stage, the results are concatenated with the results of the MobileNet feature extraction to refine the results.

Here, the sparse optical flow deals with far fewer parameters than the use of the CNN and the calculation of the line integral at each frame, since it tracks just the previous points detected by the CNN. The optical flow assumes that the intensity of the pixels of an object does not change with time and that neighboring pixels have the same pattern of movement. Classical approaches, such as optical flow, fail when it comes to environments with illumination changes and long-range motions. To solve these issues, we changed the algorithm to use the Robust Local Optical Flow [31], following the framework proposed by Senst et al., while taking into account an illumination model to deal with varying illumination. Here, the computational cost of this variation, used by the KLT method, is given by the upper bound O(n · N_{large_i}), where n is the number of computed motion vectors and N_{large_i} is the number of pixels of the larger image region i [31]. Also, as support to the tracking process, we use the Kalman filter and Multiple Instance Learning (MIL). The Kalman filter [32] is used to predict the position of the skeleton in the frame after the current one, and to ensure that the references for each detected skeleton are maintained. Given the distance between the points calculated by the network and those predicted by the Kalman filter, a threshold is applied to ensure that the reference points are maintained. We fix this threshold considering a radius of 20 pixels of error. Combining optical flow and the Kalman filter gives us the possibility to interpolate the data over subsequent frames, which enables us to smooth the captured movement. We also use MIL [33] just to ensure that the detected person's reference is maintained. We use it as an additional feature to identify the bounding box of the detected head keypoints, and to maintain the reference of a person even when the network is called again, although its use is optional.
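
A constant-velocity Kalman filter per keypoint is enough for the prediction step described above. A minimal sketch with OpenCV follows (state = position and velocity; the noise values are illustrative assumptions, not the paper's tuning):

import numpy as np
import cv2

def make_keypoint_filter():
    """2D constant-velocity Kalman filter: state (x, y, vx, vy), measured (x, y)."""
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

kf = make_keypoint_filter()
prediction = kf.predict()                             # predicted (x, y, vx, vy)
kf.correct(np.array([[320.0], [240.0]], np.float32))  # update with a detection

The predicted position is then compared with the network's detection; if they differ by more than the 20-pixel radius, the reference is reassigned.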

5.2. TensorPose Architecture

TensorPose was developed with the aim of being a platform for the development of 2D and 3D applications that involve motion capture. In this section, we describe the general aspects of its architecture, as well as the framework for network communication between different applications. Such applications can be computer animation, games and also shared virtual experiences. One of our goals is to provide developers with a way to create environments for shared 2D or 3D experiences where users share a virtual environment, but not necessarily the same physical environment. The motion capture data can be used to track the activities of users in this kind of application. In our architecture, each module is independent in terms of video processing, inference of the captured skeletons, temporal coherence, and information transmission. Figure 4 represents the architecture model.

As one can see in Figure 4, the architecture model is composed of several modules. Next, we describe the functionality of each one.

Pose Client. This module is responsible for processing video data from different cameras. Each frame is prepared using the OpenCV library [34], where we also perform histogram equalization to improve the contrast of the images, and the data is sent to the inference module or the temporal coherence module. Here, the resolution of the image is 1280x960, and the network resolution is 656x386 to fit in GPU memory.
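
A hedged sketch of this preparation step with OpenCV, equalizing the luma channel and resizing to the network input; the exact pipeline of the module is not published, so treat this as an assumption:

import cv2

def prepare_frame(frame_bgr):
    """Equalize contrast and resize a 1280x960 frame to the network resolution."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y = cv2.equalizeHist(y)   # equalize the luma channel only
    equalized = cv2.cvtColor(cv2.merge((y, cr, cb)), cv2.COLOR_YCrCb2BGR)
    return cv2.resize(equalized, (656, 386))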

Pose Inference. This neural network module receives data from the Pose Client and performs the inference of skeletons in an image. Also, post-processing is performed to solve the assignment problem. For each set of keypoints detected and for each person, a dictionary is created, which contains the 2D position of each body part and the connections between them. Each recognized person is given an Id, and their corresponding dictionary is added to a list of tracked persons, which is sent to the module responsible for instantiating the objects that will represent the capture of those people.

Zed SDK. In this stage, we use the ZED stereo camera SDK [35][36] for 3D data capture. This module was developed using the Python API, where the data from a 2D image is sent to the Pose Client to be processed and the 3D data is sent to the depth module.

Fig. 3. Results containing viewpoint variation and occlusion, which are common characteristics in images. Images from the COCO dataset, the Rio 2016 Olympics and the Brazilian National Basketball League.

Depth Front. This module is responsible for processing the depth data of one or more ZED cameras and receiving intrinsic camera data. The ZED has two cameras separated by 12 cm, which capture a high-resolution 3D video of the scene and estimate depth and motion by comparing the displacement of pixels between the left and right images. This module stores a distance value Z for each pixel in the image.

2D/3D Positions. This part of the architecture receives the positions inferred by the network and creates objects identifying each detected person. In the 2D case, it keeps the received positions in image coordinates. In the 3D case, it receives data from the Depth Front and transforms the 2D coordinates into positions in the 3D world. It is also responsible for passing information to the temporal coherence module, processing the Kalman filter, identifying the points to be tracked and ensuring the references of the detected skeletons.

Temporal Coherence. This module tracks the points using the optical flow algorithm. It receives the initial data detected by the network, and every t frames performs the tracking of the points.

Holojam Emitter. This module uses the Holojam platform to send the captured data over the network. It is a Python client that communicates with the Holojam Server. The Holojam platform is described in the following section.


Fig. 4. TensorPose Architecture

6. Virtual environment communication with Holojam

Holojam is a virtual space sharing platform developed by the Future Reality Lab at New York University [37]. This platform consists of the Holojam Node library and the Holojam SDK project. It enables content creators to build complex location-based multiplayer VR experiences in a simple and unified Unity project. The development framework provides an extensible and clean interface, allowing rapid prototyping. Additionally, it abstracts away specific VR hardware, promoting flexible and customizable creation of virtual reality experiences. We use the Holojam protocol framework in our architecture to communicate between applications with very low latency and to send motion capture information through the network, expanding its use beyond VR applications.

6.1. Holojam Node

We use Holojam Node, which is a client-server library developed in Node.js and targeted at applications running on a local network or over the web. One of its main characteristics is to perform low-latency communication between the various clients and the server. It consists of the following components: relay, emitters, sinks, and clients.

The relay component is responsible for routing the messages (updates and events) in a Holojam network. It acts as a central server, which collects data received through a preconfigured (upstream) address and broadcasts this data through a multicast (downstream) address.

In addition to the relay, a Holojam network also has several nodes, called endpoints. Endpoints are either emitters, which only send upstream data; sinks, which only receive data through the downstream; or clients, which receive downstream data and send upstream data. Holojam Node also provides a WebSocket interface through which one can receive and transmit data over the web.

6.2. Holojam Protocol

Packets routed through a Holojam network are either updates or events. An update is essentially an array of flakes (generic Holojam objects). An event has an array containing only one flake. There is also a notification, which is an abstraction of an event that includes just the label.

6.3. Holojam Objects

Holojam provides two types of objects: Nuggets and Flakes. These objects are defined through Google's FlatBuffers Interface Description Language (IDL). We use FlatBuffers to serialize data, as well as to stream data, because it offers very low processing overhead. In the following, we present the structure of the objects used.

Nugget composition:

• It is of event or update type. The default is update.

• It is mandatory to have an array of flakes.


enum NuggetType : byte { UPDATE, EVENT }

// Message (update or event)
table Nugget {
  scope : string;              // Namespace
  origin : string;             // Source
  type : NuggetType = UPDATE;
  flakes : [Flake] (required); // Data array
}

Flake composition:

• A label is mandatory.

table Flake { // Data container
  label : string (required);
  // Optional data
  vector3s : [Vector3];
  vector4s : [Vector4];
  floats : [float];
  ints : [int];
  bytes : [ubyte];
  text : string;
}

6.4. Holojam SDK

The Holojam SDK project is a Unity3D project containing all the elements needed to create a multiplayer virtual reality application. This project provides a C# API that applications can use or extend, allowing rapid prototyping. Also, it abstracts the configuration of the VR hardware used in the project. One of its main components is the implementation of the Holojam client in C#. Thus, all Holojam objects can be easily integrated into a Unity project.

6.5. Holojam and TensorPose applications

All components of the Holojam platform were used in the TensorPose project applications, in 2D applications as well as in 3D and VR applications. Since TensorPose is implemented in Python, the components needed to use Holojam in TensorPose were also developed in Python, including a serializer for the keypoints and the emitter. The serialization of the obtained keypoints was implemented from the FlatBuffers IDL of Holojam. With this, it is possible to send the keypoints to the relay, giving access to the data to any sink/client/websocket that is connected to this relay.
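
A minimal sketch of such an emitter in Python: the keypoints are packed and pushed to the relay's upstream address over UDP. The serialize_flake helper and the address are hypothetical placeholders; the real client uses the FlatBuffers code generated from the IDL above:

import socket

RELAY_UPSTREAM = ("192.168.0.10", 9592)   # hypothetical relay upstream address

def emit_keypoints(skeletons, serialize_flake):
    """Serialize each tracked skeleton as a flake and send one update nugget."""
    payload = b"".join(serialize_flake(s) for s in skeletons)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, RELAY_UPSTREAM)
    sock.close()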

Figure 5 shows the components of the various applications created in the TensorPose project and which other components they are related to.

Fig. 5. TensorPose network infrastructure. The Relay component is responsible for synchronizing and sending information over the network to the clients of our applications.

7. Experiments

Our proposal was tested and compared with the results obtained by OpenPose 1.3. The TensorPose code was implemented in Python, including the post-processing steps, such as the line integral calculation and the assignment problem. In training, 110,000 images were used, as well as 5,000 images in validation, with a maximum of 200 iterations per epoch. In the COCO dataset, approximately 150,000 people and 1.7 million labeled keypoints are available. Considering the inference, we use the COCO validation dataset to compare our modified model and OpenPose. COCO uses a metric called Object Keypoint Similarity (OKS), which measures how close a predicted keypoint is to the ground truth annotation. Also, COCO uses the mean average precision (AP) and average recall (AR) over 10 OKS thresholds as the main competition metrics [6]. In our initial tests, we consider Precision and Recall at 50% and 75% OKS (AP-50, AP-75, AR-50, AR-75), which are usually used for benchmarking. We test both models with the evaluation data from COCO and submit the results to the COCO evaluation server. Equation 16 shows the formula used to calculate the OKS:

OKS = \frac{\sum_i e^{-d_i^2 / (2 s^2 k_i^2)} \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} ,   (16)

where the d_i are the Euclidean distances between each detected keypoint and the ground truth, the v_i are the visibility flags for each annotated keypoint, and s × k (scale × keypoint constant) is related to the standard deviation of a Gaussian defining concentric circles whose radius varies by keypoint type. The metrics AP-50, AP-75, AR-50 and AR-75 are related to how close a prediction is to the annotated data considering the Gaussian standard deviation. The s × k term is computed by evaluating this Gaussian function, centered on the ground-truth position of a keypoint, with a standard deviation that is specific to the keypoint type and scaled by the area of the instance, measured in pixels [38]. Table 1 shows our results in contrast to OpenPose, and Figure 6 shows a visual comparison between the two techniques.
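
Equation 16 can be evaluated directly given the per-keypoint distances and visibility flags. A small numpy sketch, our own illustration of the metric following the COCO convention, in which the denominator 2s²k_i² comes from the per-keypoint Gaussian:

import numpy as np

def oks(d, k, v, s):
    """Object Keypoint Similarity of Eq. 16.

    d: distances between predicted and ground-truth keypoints
    k: per-keypoint constants, v: visibility flags, s: object scale
    """
    d, k, v = np.asarray(d), np.asarray(k), np.asarray(v)
    vis = v > 0
    e = np.exp(-d[vis] ** 2 / (2 * s ** 2 * k[vis] ** 2))
    return e.sum() / vis.sum()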

Also, varying the OKS threshold from 0.5 to 0.95, we measure both precision and recall for the COCO validation dataset, as can be seen in Figure 7, where higher values mean that our predictions are closer to the ground truth annotations at each threshold.

Model       AP 0.50  AP 0.75  AR 0.50  AR 0.75
OpenPose    0.780    0.593    0.807    0.647
TensorPose  0.750    0.536    0.772    0.591

Table 1. Precision and Recall for the OpenPose and TensorPose models.

Here, a higher OKS means a higher overlap between the predicted keypoints and the ground truth. As we can see, our model has about 6.5% lower accuracy than OpenPose, which is mainly caused by our proposed simplification of the intermediate layers. Recall that this simplification brings the advantage of 9.3 times fewer operations in these layers than the original OpenPose. We focus on real-time applications, so this accuracy loss does not represent a significant problem, since some prediction errors are acceptable.

Following the benchmarking and error diagnosis for pose estimation proposed by Ruggero et al. [38], we analyze 4 types of localization errors using the annotations of the COCO dataset:

• Jitter: small position error around the correct keypoint location;

• Miss: large localization error, when the detected keypoint is far away from any body part;

• Inversion: confusion between semantically similar parts of a detected person, i.e., wrong body part associations for the same person due to problems like self-occlusion, for example;

• Swap: confusion between semantically similar parts of different persons, i.e., problems related to the association of body parts of a detected person to another.

Table 2 shows the totals of correct predictions and errors, and Table 3 presents the estimated error associated with each detected joint considering our model. As we can see, most of the errors are related either to small differences between the predicted keypoint and the annotated data or to significant localization errors, which we believe are due to detected false positives. We can also see that most localization errors affect keypoints that are more sensitive to occlusion or that usually involve some interaction between people, such as the arms.

Total num. keypoints: 62,790   Absolute   Percentage
Good prediction                45,300     72.1453%
Jitter                         8,039      12.80%
Inversion                      1,519      2.41%
Swap                           613        0.976%
Miss                           7,319      11.656%

Table 2. Total of correct predictions and errors on the COCO validation dataset, in absolute values and percentages.

In terms of runtime performance, we begin by testing the CNN processing alone.

Keypoints    Jitter %   Inversion %   Swap %   Miss %
nose         9.6        0             2.4      5
eyes         16         4.8           4.2      7.4
ears         13.1       0.3           3.6      6.1
shoulders    12.8       12            21.9     7.1
elbows       12.6       4.6           17.5     14.3
wrists       9.7        10.8          15.2     21.8
hips         13.5       24.2          14       11.7
knees        7.1        22.7          11.1     13.4
ankles       5.5        20.6          10.1     13.2

Table 3. The frequency of localization errors over all predicted keypoints. We detected 62,790 joints in 5,000 images in this test.

Considering a single image with 23 people, we compared our approach with OpenPose. The tests were performed on an Nvidia RTX 2060 GPU with 6 GB of memory, varying the scale of the tested image by a factor of 0.5 and repeating each test 1,000 times. As Table 4 shows, our approach performs better than OpenPose on average (a timing sketch appears after the table).

Image Resolution   OpenPose      TensorPose
328 × 193          ∼55.08 ms     ∼54.86 ms
656 × 386          ∼120.224 ms   ∼84.25 ms
984 × 579          ∼174.75 ms    ∼116.66 ms
1312 × 772         ∼320.18 ms    ∼210.41 ms

Table 4. CNN processing time for OpenPose 1.3 and TensorPose. We vary the scale of the tested image by a factor of 0.5, considering four image scales.
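A minimal sketch of such a timing protocol is shown below; `infer` and `frames` are illustrative placeholders, since the actual benchmark harness is not part of the published material.

```python
import time

def benchmark(infer, frames, repeats=1000):
    """Average per-frame CNN time in ms, mirroring the protocol behind Table 4.

    infer  : callable running the network forward pass on one image
    frames : dict mapping a resolution label to a preloaded test image
    """
    results = {}
    for label, img in frames.items():
        t0 = time.perf_counter()
        for _ in range(repeats):
            infer(img)                    # forward pass only, no post-processing
        results[label] = (time.perf_counter() - t0) / repeats * 1000
    return results
```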

When processing videos, to avoid running the CNN, computing the line integral, and executing the Hungarian algorithm at every frame, we use sparse optical flow. This yields not only a performance improvement, with frame rates above 30 FPS, but also smoother motion tracking, helped further by the Kalman filter. In the first test, we defined a fixed interval of 10 frames during which the optical flow is processed; afterward, its parameters are updated by a new inference step in the neural network. This interval was chosen empirically: we observed no significant difference in tracking accuracy. Our strategy, combining the changes in the network with the use of optical flow over t frames, reduces the overall tracking time and increases the frame rate, as shown in Table 5 (a sketch of the tracking loop follows).
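The sketch below illustrates this loop with OpenCV's pyramidal Lucas-Kanade tracker; `run_cnn` is a hypothetical callable standing in for the TensorPose forward pass plus post-processing, and the per-joint Kalman smoothing step is omitted for brevity.

```python
import cv2
import numpy as np

REDETECT_EVERY = 10  # empirically chosen interval, as described in the text

def track(video, run_cnn):
    """Propagate keypoints with sparse optical flow between CNN detections.

    video   : cv2.VideoCapture
    run_cnn : callable returning an (N, 2) float array of keypoints for a frame
    """
    ok, prev = video.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    pts = run_cnn(prev).reshape(-1, 1, 2).astype(np.float32)
    frame_id = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame_id += 1
        if frame_id % REDETECT_EVERY == 0:
            # refresh the tracked points with a full network inference
            pts = run_cnn(frame).reshape(-1, 1, 2).astype(np.float32)
        else:
            # cheap sparse Lucas-Kanade update between inferences
            pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        prev_gray = gray
        yield pts.reshape(-1, 2)  # feed these into a Kalman filter for smoothing
```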

As with OpenPose, we analyzed the situations in which our approach fails. We face the same problems in highly crowded images where people overlap: the approach tends to merge annotations from different people while missing others [6]. Our application is also less precise than the previous work due to the factorized layers; it has more difficulty detecting joints under occlusion and shows slight noise in the inferred data. As Table 1 shows, however, this is not a significant problem, since the results are comparable and we still obtain a substantial performance gain.

Subsequently, with the trained network, we defined an architecture for building motion-capture applications, connecting different cameras to applications such as scenes in the Unity3D engine and web applications.


Fig. 6. Comparison between the OpenPose model and ours. The first image is generated by OpenPose and the second by our model. Our model is slightly less accurate than the original work and fails to find the exact position of a body joint in particular cases; given the nature of our application, this issue is acceptable. Image from the COCO dataset.

GPU                CPU                                 OpenPose     TensorPose
Nvidia Titan RTX   Intel Core i9 @ 3.3 GHz             ∼11 FPS      ∼36.8 FPS
Nvidia Tesla P40   Intel Xeon E5-2630 v3 @ 2.40 GHz    ∼10 FPS      ∼34.43 FPS
Nvidia TITAN Xp    Intel Core i7 @ 3.3 GHz             ∼8.28 FPS    ∼27 FPS
Nvidia RTX 2060    AMD Ryzen 7 1700 @ 3.2 GHz          ∼6.121 FPS   ∼18.27 FPS
Nvidia GTX 960     Intel Xeon E5-2630 v3 @ 2.40 GHz    ∼3.73 FPS    ∼12 FPS

Table 5. Frame-rate comparison between OpenPose 1.3 and TensorPose. Our model achieves approximately 3× the performance of the base OpenPose model.

[Figure 7: precision-recall curves (axes: precision vs. recall; areaRng: [all], maxDets: [20]). Precision at each OKS threshold: 0.50: 0.750; 0.55: 0.726; 0.60: 0.691; 0.65: 0.645; 0.70: 0.596; 0.75: 0.536; 0.80: 0.454; 0.85: 0.347; 0.90: 0.228; 0.95: 0.099.]

Fig. 7. Precision and recall for all OKS thresholds. AreaRng refers to annotated objects with an area larger than 32² pixels.

The initial module was developed for 2D applications, where capture can be done with ordinary cameras such as webcams. Each frame is captured and sent to the detection module, which performs the inference and detection of the people in the frame. The data for the captured persons are then sent over the network using the Holojam framework [37] (a minimal transport sketch follows).
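As an illustration only, the sketch below broadcasts one frame of skeleton data over UDP multicast; the group address, port, and JSON payload are placeholders of our own and do not reflect Holojam's actual wire format.

```python
import json
import socket

MCAST_GROUP, PORT = "224.1.1.1", 9591  # illustrative address/port, not Holojam's

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)

def send_skeletons(skeletons):
    """Broadcast one frame of tracked skeletons to every listening client.

    skeletons: list of dicts like {"id": 0, "joints": [[x, y], ...]}
    """
    payload = json.dumps({"type": "SendData", "actors": skeletons}).encode()
    sock.sendto(payload, (MCAST_GROUP, PORT))
```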

In addition to 2D capture, we also conducted tests with stereo cameras. We used the ZED camera to map two-dimensional image coordinates to three-dimensional world coordinates. The skeleton is initially processed in the same way but, in a later step, is transformed into 3D space using the camera intrinsics and depth information (a back-projection sketch follows). As in the previous process, the capture and processing modules are independent, with Holojam again used to send the information.
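The transformation is the standard pinhole back-projection; the sketch below illustrates it with hypothetical intrinsics and leaves out ZED SDK specifics.

```python
import numpy as np

def lift_to_3d(keypoints_2d, depth, fx, fy, cx, cy):
    """Back-project 2D keypoints into camera space using the pinhole model.

    keypoints_2d   : (N, 2) pixel coordinates (u, v)
    depth          : (H, W) depth map in meters (e.g., from the ZED SDK)
    fx, fy, cx, cy : camera intrinsics (focal lengths and principal point)
    """
    pts = []
    for u, v in keypoints_2d.astype(int):
        z = depth[v, u]                  # metric depth sampled at the keypoint
        pts.append(((u - cx) * z / fx,   # X
                    (v - cy) * z / fy,   # Y
                    z))                  # Z
    return np.array(pts)
```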

Some applications were developed as proofs of concept using the Unity3D engine as well as WebGL. For the Unity3D applications, the data is sent over the network by the Holojam server using multicast. The application was designed for multiple devices, but unicast connections are also possible. In the Unity 3D scene, "Holojam Unity" objects implement multiplayer support, which includes communication with the Holojam server, components with position recognition, and avatar presence. These objects carry a script component called "Holojam Network", which contains fields indicating how to communicate with the server.

The "Multicast Address" field indicates the IP address used to receive data from the server. This field is not relevant when a unicast connection already exists between the client and the server, since in that case the client receives the data directly. The "Server Address" field can be used to indicate the server IP directly. This address is used to send data to the server, but if the indicated address is not


local (one not starting with 192), Holojam pings it, creating a connection between the server and this client. All data sent and received uses an update data structure; when Holojam sends an object, it labels the update "SendData". The virtual characters are created from the tracked data of the real person using inverse kinematics: a manager implemented in the application reconstructs the entire body of the actor from pairable elements corresponding to the connections between hands, elbows, shoulders, neck, head, and so on. The reconstructed data are used as targets for a 3D model in the scene. Figure 8 shows the Unity3D application.

OpenPose also has a plugin for Unity 3D, although it is limited to running inference locally and does not achieve the necessary performance. The plugin often runs into problems on low-cost GPUs, where it consumes a lot of memory and frequently has to fall back to CPU mode. Also, unlike our application, it does not support shared virtual experiences in which the users' physical presence in the same environment is unnecessary. In our proposal, a server can be dedicated to inference, without intensive use of the local machine. The entire project was developed around network communication using Holojam's low-latency model, and all modules in Figure 4 were designed to be independent.

Similarly, a web application with a unicast connection was developed, as shown in Figure 9. The entire application was written in JavaScript using WebGL. A script called "Manager" receives the data sent by the server, one object per tracked person with the positions of their skeleton. The Manager instantiates objects called characters to identify the actors and updates their positions every frame. It also removes actors from the scene when they leave it or when tracking is lost. The characters are then rendered as ragdolls using the position of each skeleton part directly, without inverse kinematics in this case.

8. Conclusion

In this work we proposed TensorPose, a novel deep neural network with a streamlined architecture and tensor decomposition for pose estimation with improved processing time, and adopted it in a real-time, multi-user, interactive motion-capture application. Our results show an effective optimization of the CNN model: we achieved performance above 30 FPS, in most cases about 3× the performance of the original work. The keypoint-detection accuracy is slightly lower, but this does not represent a great problem for our applications, since our primary objective is the performance gain in FPS. As shown in Figure 3, some failure cases were detected, but they do not represent major problems in our applications, and we report statistics for each kind of error. As proofs of concept, we presented two main applications using Unity3D and WebGL, integrated with the Holojam platform, showing that it is possible to create shared virtual experiences using motion capture with TensorPose. As future work, we expect to adapt our CNN architecture to low-power devices [28]. To do so, we must continue to reduce the computational cost; we intend to explore strategies for a complete form of the factored network, in which all layers are factored, trying to solve the stability error by using different types of weight initialization.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors also thank CAPES and CNPq for their partial support of this research.

References

[1] Ranjan, R, Patel, VM, Chellappa, R. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019;41(1):121–135.
[2] Sindagi, VA, Patel, VM. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters 2018;107:3–16.
[3] Ge, L. Real-time 3d hand pose estimation from depth images. Ph.D. thesis; 2018.
[4] Schwarcz, S, Pollard, T. 3d human pose estimation from deep multi-view 2d pose. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018, p. 2326–2331.
[5] Lin, M, Lin, L, Liang, X, Wang, K, Cheng, H. Recurrent 3d pose sequence machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, p. 810–819.
[6] Cao, Z, Simon, T, Wei, SE, Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR. 2017.
[7] Luo, Y, Ren, J, Wang, Z, Sun, W, Pan, J, Liu, J, et al. Lstm pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 5207–5215.
[8] Mehta, D, Sridhar, S, Sotnychenko, O, Rhodin, H, Shafiei, M, Seidel, HP, et al. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG) 2017;36(4):44.
[9] Ramakrishna, V, Munoz, D, Hebert, M, Bagnell, JA, Sheikh, Y. Pose machines: Articulated pose estimation via inference machines. In: European Conference on Computer Vision. Springer; 2014, p. 33–47.
[10] Wei, SE, Ramakrishna, V, Kanade, T, Sheikh, Y. Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, p. 4724–4732.
[11] Song, J, Wang, L, Van Gool, L, Hilliges, O. Thin-slicing network: A deep structured model for pose estimation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, p. 4220–4229.
[12] Shafaei, A, Little, JJ. Real-time human motion capture with multiple depth cameras. In: 2016 13th Conference on Computer and Robot Vision (CRV). IEEE; 2016, p. 24–31.
[13] Tome, D, Toso, M, Agapito, L, Russell, C. Rethinking pose in 3d: Multi-stage refinement and recovery for markerless motion capture. In: 2018 International Conference on 3D Vision (3DV). IEEE; 2018, p. 474–483.
[14] Kanazawa, A, Black, MJ, Jacobs, DW, Malik, J. End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 7122–7131.
[15] Loper, M, Mahmood, N, Romero, J, Pons-Moll, G, Black, MJ. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) 2015;34(6):248.
[16] Lin, TY, Maire, M, Belongie, S, Hays, J, Perona, P, Ramanan, D, et al. Microsoft coco: Common objects in context. In: European Conference on Computer Vision. Springer; 2014, p. 740–755.
[17] Howard, AG, Zhu, M, Chen, B, Kalenichenko, D, Wang, W, Weyand, T, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861; 2017.


Fig. 8. Unity3D application with TensorPose. One of our use cases is the development of applications that make use of shared virtual environments.

[18] Wang, M, Liu, B, Foroosh, H. Factorized convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, p. 545–553.
[19] Jin, J, Dundar, A, Culurciello, E. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474; 2014.
[20] Tucker, LR. Some mathematical notes on three-mode factor analysis. Psychometrika 1966;31(3):279–311.
[21] Kim, YD, Park, E, Yoo, S, Choi, T, Yang, L, Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530; 2015.
[22] Kolda, TG, Bader, BW. Tensor decompositions and applications. SIAM Review 2009;51(3):455–500.
[23] Smith, S, Karypis, G. Accelerating the tucker decomposition with compressed sparse tensors. In: European Conference on Parallel Processing. Springer; 2017, p. 653–668.
[24] Cichocki, A, Zdunek, R, Phan, AH, Amari, S. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons; 2009.
[25] De Lathauwer, L, De Moor, B, Vandewalle, J. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 2000;21(4):1253–1278.
[26] Symeonidis, P, Nanopoulos, A, Manolopoulos, Y. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 2010;22(2):179–192.
[27] Sandler, M, Howard, A, Zhu, M, Zhmoginov, A, Chen, LC. Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 4510–4520.
[28] Zhou, S, Vinh, NX, Bailey, J, Jia, Y, Davidson, I. Accelerating online cp decompositions for higher order tensors. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2016, p. 1375–1384.


Fig. 9. TensorPose web test.

[29] Tsai, D, Flagg, M, Nakazawa, A, Rehg, JM. Motion coherent tracking using multi-label mrf optimization. International Journal of Computer Vision 2012;100(2):190–202.
[30] Kim, JS, Hwangbo, M, Kanade, T. Realtime affine-photometric klt feature tracker on gpu in cuda framework. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. IEEE; 2009, p. 886–893.
[31] Senst, T, Geistert, J, Sikora, T. Robust local optical flow: Long-range motions and varying illuminations. In: IEEE International Conference on Image Processing. Phoenix, AZ, USA: IEEE; 2016, p. 4478–4482. DOI:10.1109/ICIP.2016.7533207.
[32] Chui, CK, Chen, G, et al. Kalman Filtering with Real-Time Applications. Springer; 2017.
[33] Babenko, B, Yang, MH, Belongie, S. Visual tracking with online multiple instance learning. In: CVPR. 2009.
[34] Bradski, G, Kaehler, A. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc.; 2008.
[35] Burbano, A, Vasiliu, M, Bouaziz, S. 3d cameras benchmark for human tracking in hybrid distributed smart camera networks. In: Proceedings of the 10th International Conference on Distributed Smart Cameras. ACM; 2016, p. 76–83.
[36] Gupta, T, Li, H. Indoor mapping for smart cities: an affordable approach using kinect sensor and zed stereo camera. In: 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE; 2017, p. 1–8.
[37] Masson, T, Perlin, K, et al. Holo-doodle: an adaptation and expansion of collaborative holojam virtual reality. In: ACM SIGGRAPH 2017 VR Village. ACM; 2017, p. 9.
[38] Ruggero Ronchi, M, Perona, P. Benchmarking and error diagnosis in multi-instance pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, p. 369–378.