Real-Time Human Pose Recognition
in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard
Moore, Alex Kipman, Andrew Blake
CVPR 2011
PRESENTER: AHSAN ABDULLAH
PROBLEM
[Figure labels: right elbow, right hand, left shoulder, neck]
APPROACH
• Partitioning into body parts helps localizing the joints
Shotton et al., CVPR 2011
1. capture depth image & remove background
2. infer body parts per pixel
3. cluster pixels to hypothesize body joint positions
4. fit model & track skeleton
PIPELINE
Design Goals
• Efficiency
• Robustness
Compute $P(c_i \mid w_i)$
• pixels $i = (x, y)$
• body part $c_i$
• image window $w_i$
Discriminative approach
• learn classifier $P(c_i \mid w_i)$ from training data
• image windows move with classifier
BODY PART CLASSIFICATION
LEARNING DATA
synthetic (train & test)
real (test)
LEARNING – DATA SYNTHESIS
Record MoCap: 500k frames distilled to 100k poses
Retarget to several models
Render (depth, body parts) pairs
• Depth comparisons
- very fast to compute
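[Figure: input depth image with example pixels x and their offsets Δ]
Feature response: $f(I, \mathbf{x}) = d_I(\mathbf{x}) - d_I(\mathbf{x} + \Delta)$
where $d_I$ is the image depth, $\mathbf{x}$ the image coordinate, and $\Delta$ the offset
Offset scales inversely with depth: $\Delta = \mathbf{v} / d_I(\mathbf{x})$
Background pixels: $d$ = large constant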
FEATURE SET
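To make the depth comparison concrete, here is a minimal Python sketch of the feature response on this slide. The names `depth_at`, `depth_feature`, and `BACKGROUND_DEPTH` are illustrative, not from the paper. The sketch assumes background pixels were already set to the large constant during background removal, and it also assumes (not stated on the slide) that probes outside the image get the same constant.

```python
import numpy as np

# Assumed value for the "large constant" given to background pixels.
BACKGROUND_DEPTH = 1e6


def depth_at(depth, x):
    """Depth at pixel x = (col, row); large constant if x is outside the image."""
    col, row = int(round(x[0])), int(round(x[1]))
    height, width = depth.shape
    if 0 <= row < height and 0 <= col < width:
        return float(depth[row, col])
    return BACKGROUND_DEPTH  # assumption: out-of-image probes act like background


def depth_feature(depth, x, v):
    """f(I, x) = d_I(x) - d_I(x + Delta), with Delta = v / d_I(x)."""
    d_x = depth_at(depth, x)
    delta = np.asarray(v, dtype=float) / d_x  # offset shrinks for far-away pixels
    return d_x - depth_at(depth, np.asarray(x, dtype=float) + delta)
```

Each response needs only two depth reads and a few arithmetic operations, which is why the slide calls the feature very fast to compute.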
Aggregation of decision trees
DECISION FORESTS
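[Figure: node n holds training pixels $Q_n = (I, \mathbf{x})$ with body part distribution $P_n(c)$; the test $f(I, \mathbf{x}; \Delta_n) > \theta_n$ routes them ("no" / "yes") to children l and r with distributions $P_l(c)$ and $P_r(c)$; annotations: "reduce entropy", "for all pixels"]
Take $(\Delta, \theta)$ that maximises information gain [Breiman et al. 84]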
TRAINING DECISION TREES
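The split selection on this slide can be sketched as an exhaustive search over candidate (Δ, θ) pairs, scoring each by information gain. This is a minimal illustration, not the authors' training code: the helpers `entropy`, `information_gain`, and `best_split` are hypothetical names, and the feature responses are assumed to be precomputed for every candidate offset.

```python
import numpy as np


def entropy(labels, n_classes):
    """Shannon entropy (in bits) of the empirical label distribution."""
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels, minlength=n_classes) / labels.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def information_gain(labels, goes_yes, n_classes):
    """Entropy at the node minus the size-weighted entropy of its two children."""
    no_side, yes_side = labels[~goes_yes], labels[goes_yes]
    children = (no_side.size * entropy(no_side, n_classes)
                + yes_side.size * entropy(yes_side, n_classes)) / labels.size
    return entropy(labels, n_classes) - children


def best_split(responses, labels, thresholds, n_classes):
    """Return (gain, candidate index, theta) with maximal information gain.

    responses[k, j] is the feature response of pixel j under candidate offset k.
    """
    best = (-1.0, None, None)
    for k in range(responses.shape[0]):
        for theta in thresholds:
            gain = information_gain(labels, responses[k] > theta, n_classes)
            if gain > best[0]:
                best = (gain, k, theta)
    return best
```

A split that separates the classes perfectly removes all entropy at the node, so its gain equals the node's entropy.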
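Toy example: distinguish left (L) and right (R) sides of the body
[Figure: image window centred at x passed through a two-level tree; split tests $f(I, \mathbf{x}; \Delta_1) > \theta_1$ and $f(I, \mathbf{x}; \Delta_2) > \theta_2$ with "no" / "yes" branches; each of the three leaves stores a distribution $P(c)$ over L and R]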
DECISION TREE CLASSIFICATION
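Classifying one pixel with a trained tree amounts to walking from the root to a leaf. The sketch below assumes a hypothetical `Node` structure and takes the feature function as a parameter, so any implementation of f(I, x; Δ) can be plugged in. It illustrates the traversal only and is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Node:
    """Split node (delta, theta, no, yes) or leaf (posterior only)."""
    posterior: Optional[Sequence[float]] = None  # P(c), set only at leaves
    delta: Optional[Tuple[float, float]] = None
    theta: float = 0.0
    no: Optional["Node"] = None
    yes: Optional["Node"] = None


def classify_pixel(root: Node, feature: Callable, image, x) -> Sequence[float]:
    """Descend from the root, testing f(I, x; delta) > theta at each split node."""
    node = root
    while node.posterior is None:
        node = node.yes if feature(image, x, node.delta) > node.theta else node.no
    return node.posterior
```

The cost per pixel is one feature evaluation per tree level, so a depth-20 tree needs at most 20 depth comparisons.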
Each tree is trained on a different random subset of images
“bagging” helps avoid over-fitting
Average tree posteriors
[Amit & Geman 97]
[Breiman 01]
[Geurts et al. 06]
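[Figure: the same input $(I, \mathbf{x})$ is passed through tree 1 … tree T, giving posteriors $P_1(c)$ … $P_T(c)$]
$P(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, \mathbf{x})$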
DECISION FOREST CLASSIFIER
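A small sketch of the two forest ingredients on this slide: averaging the per-tree posteriors, and giving each tree its own random subset of training images. The function names are illustrative, and sampling with replacement is an assumption, since the slide only says "different random subset".

```python
import numpy as np


def forest_posterior(tree_posteriors):
    """P(c | I, x) = (1/T) * sum over t of P_t(c | I, x), for one pixel.

    tree_posteriors: array-like of shape (T, n_classes), one row per tree.
    """
    return np.asarray(tree_posteriors, dtype=float).mean(axis=0)


def bootstrap_image_indices(n_images, n_trees, rng):
    """One random subset of image indices per tree.

    Sampling with replacement is an assumption made for this sketch.
    """
    return [rng.integers(0, n_images, size=n_images) for _ in range(n_trees)]
```

For example, three trees voting `[0.8, 0.2]`, `[0.6, 0.4]`, and `[0.4, 0.6]` average to approximately `[0.6, 0.4]`, so a single disagreeing tree is outvoted.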
[Figure: ground truth vs. inferred body parts (most likely) for 1 tree, 3 trees, 6 trees]
[Plot: average per-class … (40% to 55%) vs. number of trees (1 to 6)]
NUMBER OF TREES
[Plots: average per-class accuracy (30% to 65%) vs. depth of trees, one panel for synthetic test data and one for real test data]
TREE DEPTH
• Define 3D world space density
• Mean shift for mode detection
Body parts to joint hypotheses
[Figure: pipeline overview at step 3, hypothesize body joints]
[Density equation annotations: pixel index i, bandwidth, 3D coord, 3D coord of i-th pixel, pixel weight, inferred probability, depth at i-th pixel]
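The density equation itself did not survive on this slide, only its annotations. The sketch below therefore rests on assumptions: a Gaussian kernel density over the 3D pixel positions, of the form sum_i w_i * exp(-||(x - x_i) / b||^2), and a pixel weight equal to the inferred probability times the squared depth. Under those assumptions, mean shift repeatedly moves a start point to the kernel-weighted mean of the points until it settles on a mode. Function names are hypothetical.

```python
import numpy as np


def pixel_weights(prob_c, depth):
    """Assumed pixel weight: inferred probability times squared depth."""
    return np.asarray(prob_c, dtype=float) * np.asarray(depth, dtype=float) ** 2


def mean_shift_mode(points, weights, bandwidth, start, n_iter=50, tol=1e-6):
    """Climb to a mode of sum_i w_i * exp(-||(x - x_i) / b||^2).

    points: (N, 3) array of 3D coordinates; weights: (N,) pixel weights.
    Assumes the start lies close enough to the data that the kernel sum is nonzero.
    """
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        sq_dist = ((points - x) ** 2).sum(axis=1) / bandwidth ** 2
        k = weights * np.exp(-sq_dist)                      # kernel-weighted pixels
        new_x = (k[:, None] * points).sum(axis=0) / k.sum()  # weighted mean
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x
```

Running this from several start points per body part and keeping the distinct modes would give the joint hypotheses. How start points are chosen and how modes are merged is left out of this sketch.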
[Figure: input depth, inferred body parts, and inferred joint positions (front view, side view, top view)]
No tracking or smoothing
[Plot: average precision (0.0 to 1.0) per joint: Center Head, Center Neck, Left Shoulder, Right…, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot, Mean AP]
JOINT PREDICTION ACCURACY
[Plot: average precision (0.0 to 1.0) per joint: Center Head, Center Neck, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot, Mean AP]
Joint prediction from ground truth body parts
Joint prediction from inferred body parts
JOINT PREDICTION ACCURACY
• No temporal information
- frame-by-frame
• Very fast
- simple depth image feature
- parallel decision forest classifier
ANALYSIS
Uses…
• 3D joint hypotheses
• kinematic constraints
• temporal coherence
… to give
• full skeleton
• higher accuracy
• invisible joints
• multi-player
[Figure: pipeline steps 1, 2, 3, followed by 4. track skeleton]
KINECT SYSTEM
• Frame-by-frame gives robustness
• Body parts representation for efficiency
• Fast, simple machine learning
• Significant engineering to scale to a
massive, varied training data set
SUMMARY
QUESTIONS