2D Human Pose Estimation in TV Shows
Vittorio Ferrari, Manuel Marin, Andrew Zisserman
Dagstuhl Seminar, July 2008
Goal: detect people and estimate 2D pose in images/video
- Pose: spatial configuration of body parts
- Estimation: localize body parts in the image
- Desired scope: TV shows and feature films; focus on upper-body
The search space of pictorial structures is large
Search space
- 4P dims (a scale factor per part)
- 3P+1 dims (a single scale factor)
- P = 6 for upper-body
- 720x405 image = 10^45 configs!
Kinematic constraints
- reduce space to valid body configs (10^28)
- efficiency by model independencies (10^12)
Body parts
- fixed aspect-ratio rectangles for head, torso, upper and lower arms
- = 4 parameters each: (x, y, θ, s)
[Ramanan and Forsyth; Mori et al.; Sigal and Black; Felzenszwalb and Huttenlocher; Bergtholdt et al.]
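The configuration counts above can be reproduced with back-of-envelope arithmetic. The slide does not give the exact quantization of orientation and scale, so the bin counts below are illustrative assumptions; only the image size and the number of parts come from the slide.

```python
import math

# Size of the naive pictorial-structure search space under an ASSUMED
# discretization (N_THETA and N_SCALE are illustrative, not the paper's).
W, H = 720, 405          # image size from the slide
N_THETA = 36             # assumed orientation bins
N_SCALE = 10             # assumed scale bins
P = 6                    # upper-body parts: head, torso, 2x upper, 2x lower arm

per_part = W * H * N_THETA * N_SCALE   # states of one (x, y, theta, s) rectangle
naive = per_part ** P                  # independent parts: the "4P dims" case

print(f"states per part: {per_part:.2e}")
print(f"naive configs  : ~10^{round(math.log10(naive))}")
```

With these bin counts the naive product lands in the same astronomical regime as the slide's 10^45; kinematic constraints (valid bodies only, ~10^28) and tree-structured independencies (~10^12) are what make inference tractable.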
The challenge of unconstrained imagery
- varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing
- extremely difficult when nothing is known about appearance/pose/location
Approach overview
Pipeline: human detection → foreground highlighting → image parsing → learn appearance models over multiple frames → spatio-temporal parsing
- human detection, foreground highlighting: reduce search space (new additions)
- image parsing: estimate pose, every frame independently (Ramanan NIPS 2006)
- appearance models over multiple frames, spatio-temporal parsing: multiple frames jointly (new additions)
Search space reduction by human detection
Idea: get approximate location and scale with a detector generic over pose and appearance
Building an upper-body detector
- based on Dalal and Triggs CVPR 2005
- train = 96 frames x 12 perturbations (no Buffy)
+ fast
+ also detects back views
Benefits for pose estimation
+ fixes scale of body parts
+ sets bounds on x,y locations
- little info about pose (arms)
[figures: train/test examples; detected vs. enlarged window]
Upper body detection and temporal association [video]
Code available! www.robots.ox.ac.uk/~vgg/software/UpperBody/index.html
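The two benefits listed above (fixed part scale, bounded x,y locations) amount to deriving a small search region from each detection window. A minimal sketch, where the enlargement factor and the scale ratio are illustrative assumptions rather than the paper's actual constants:

```python
# Sketch: shrinking the pose search space from a generic upper-body
# detection, as on the slide. The detection fixes the parts' scale and
# bounds their (x, y) locations; `enlarge` and `ratio` are assumptions.

def search_region_from_detection(det_x, det_y, det_w, det_h, enlarge=1.5):
    """Enlarged window in which all upper-body parts must lie."""
    cx, cy = det_x + det_w / 2.0, det_y + det_h / 2.0
    w, h = det_w * enlarge, det_h * enlarge
    return (cx - w / 2, cy - h / 2, w, h)

def part_scale_from_detection(det_h, ratio=0.4):
    """Fix the (common) part scale from the detection height."""
    return det_h * ratio   # e.g. torso length as a fraction of window height

region = search_region_from_detection(100, 50, 80, 80)
scale = part_scale_from_detection(80)
```

Instead of 720x405 candidate positions per part, the parser only needs to consider locations inside `region` at a single scale, which is where most of the 10^45 → tractable reduction comes from.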
Search space reduction by foreground highlighting
Idea: exploit knowledge about the structure of the search area to initialize GrabCut (Rother et al. 2004)
Initialization
- learn fg/bg models from regions where the person is likely present/absent
- clamp central strip to fg
- don't clamp bg (arms can be anywhere)
Benefits for pose estimation
+ further reduces clutter
+ conservative (no loss 95.5% of the time)
+ needs no knowledge of the background
+ allows for moving background
[figures: initialization and output]
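The initialization above can be sketched as a GrabCut label mask: a hard-foreground central strip, probable labels everywhere else, and crucially no hard background so the arms can be anywhere. The mask codes match OpenCV's GrabCut convention; the strip proportions are illustrative assumptions.

```python
import numpy as np

# GrabCut initialization mask for the enlarged detection window:
# central strip clamped to foreground, nothing clamped to background.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3   # same values as cv2.GC_*

def init_grabcut_mask(h, w, strip_frac=0.3, inner_frac=0.7):
    mask = np.full((h, w), GC_PR_BGD, np.uint8)       # outer: probable bg
    ix0, ix1 = int(w * (0.5 - inner_frac / 2)), int(w * (0.5 + inner_frac / 2))
    mask[:, ix0:ix1] = GC_PR_FGD                      # inner: probable fg
    sx0, sx1 = int(w * (0.5 - strip_frac / 2)), int(w * (0.5 + strip_frac / 2))
    mask[:, sx0:sx1] = GC_FGD                         # central strip: hard fg
    return mask

mask = init_grabcut_mask(120, 100)
# This mask would seed cv2.grabCut(img, mask, None, bgd, fgd, 5,
# cv2.GC_INIT_WITH_MASK); the paper additionally learns fg/bg colour
# models from the regions where the person is likely present/absent.
```

Because only probable labels appear outside the strip, GrabCut stays conservative: it can only reassign pixels it is confident about, matching the 95.5% no-loss figure above.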
Pose estimation by image parsing (Ramanan NIPS 2006)
Goal: estimate posterior of part configuration

P(L | I) ∝ exp( Σᵢ Φ(lᵢ) + Σ_{(i,j)∈E} Ψ(lᵢ, lⱼ) )

Φ = image evidence (given edge/appearance models)
Ψ = spatial prior (kinematic constraints)

Algorithm
1. inference with Φ = edges (edge parse)
2. learn appearance models of body parts and background
3. inference with Φ = edges + appearance (edge + appearance parse)

Advantages of space reduction
+ much more robust
+ much faster (10x-100x)
[figures: edge parse → appearance models → edge + appearance parse; comparison without foreground highlighting]
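Since the model is a tree, step 1 and step 3 are exact inference by sum-product message passing. A toy sketch, collapsed to a 1D grid of part positions so the messages stay readable; Φ stands in for the image evidence and Ψ for the kinematic prior, and all the numbers (offsets, variances, evidence) are illustrative, not the paper's.

```python
import numpy as np

# Toy sum-product parse on a star-shaped tree (torso root, three children).

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

rng = np.random.default_rng(0)
N = 50                                        # candidate positions per part
children = ["head", "left_arm", "right_arm"]  # children of the torso root

phi_torso = rng.normal(size=N)                        # log evidence, torso
phi = {c: rng.normal(size=N) for c in children}       # log evidence, children
offset = {"head": -8, "left_arm": 5, "right_arm": 5}  # preferred child offset

pos = np.arange(N)
def log_psi(c):  # kinematic prior: Gaussian on (child - parent - offset)
    d = pos[None, :] - pos[:, None] - offset[c]       # [parent, child]
    return -0.5 * (d / 3.0) ** 2

# messages child -> root, then the exact root marginal P(l_torso | I)
log_belief = phi_torso.copy()
for c in children:
    log_belief += logsumexp(phi[c][None, :] + log_psi(c), axis=1)
posterior = np.exp(log_belief - logsumexp(log_belief, axis=0))
```

Running the same machinery with edge-only Φ, then re-running with Φ augmented by the learned appearance models, is exactly the two-pass structure of the algorithm above.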
Failure of direct pose estimation (Ramanan NIPS 2006, unaided)
Transferring appearance models across frames
Idea: refine parsing of difficult frames, based on appearance models from confident ones (exploit continuity of appearance)
Algorithm
1. select frames with low entropy of P(L|I)
2. integrate their appearance models by minimizing KL-divergence
3. re-parse every frame using the integrated appearance models
Advantages
+ improves parse in difficult frames
+ better than Ramanan CVPR 2005: richer and more robust models; needs no specific pose at any point in the video
[figures: lowest-entropy frames; higher-entropy frame; parse → re-parse]
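Steps 1 and 2 can be sketched compactly. The slide integrates models by minimizing KL-divergence; as one simple concrete instance (an assumption here, not the paper's exact estimator), averaging the normalised histograms p_i yields the distribution q minimizing Σᵢ KL(pᵢ || q).

```python
import numpy as np

# Sketch: pick low-entropy (confident) frames, integrate their models.

def entropy(p):
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_confident(posteriors, k):
    """Indices of the k lowest-entropy posterior maps."""
    H = np.array([entropy(p) for p in posteriors])
    return sorted(np.argsort(H)[:k].tolist())

def integrate_models(hists):
    """Average of normalised histograms (argmin_q sum_i KL(p_i || q))."""
    return np.mean([h / h.sum() for h in hists], axis=0)

p_uniform = np.ones(10)                       # uncertain frame
p_peaked = np.zeros(10); p_peaked[3] = 1.0    # confident frame
confident = select_confident([p_uniform, p_peaked], k=1)
model = integrate_models([p_uniform, p_peaked])
```

The peaked posterior has near-zero entropy and is selected; its appearance model then helps re-parse the uncertain frame, which is the whole point of the transfer.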
The repulsive model
Idea: extend the kinematic tree with edges preferring non-overlapping left/right arms
Model
- add repulsive edges
- inference with Loopy Belief Propagation
Ψ = kinematic constraints
Λ = repulsive prior

P(L | I) ∝ exp( Σᵢ Φ(lᵢ) + Σ_{(i,j)∈E} Ψ(lᵢ, lⱼ) + Σ_{k∈{U,L}} Λ(l_LkA, l_RkA) )

Advantage
+ less double-counting
[figure: graphical model over l_H, l_T, l_LUA, l_RUA, l_LLA, l_RLA with kinematic edges Ψ and repulsive edges Λ]
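A minimal sketch of what a repulsive term Λ can look like: penalise the overlap between the left and right arm of the same kind, so the parser is discouraged from explaining both arms with the same image region. The penalty weight is an illustrative choice, not the paper's.

```python
# Repulsive prior sketch: log Lambda decreases with rectangle overlap.

def overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def log_repulsive(l_left, l_right, weight=0.05):
    """log Lambda: 0 for disjoint arms, increasingly negative with overlap."""
    return -weight * overlap_area(l_left, l_right)

same = log_repulsive((10, 10, 20, 40), (10, 10, 20, 40))   # coincident arms
apart = log_repulsive((10, 10, 20, 40), (60, 10, 20, 40))  # disjoint arms
```

Because these edges close loops in the graph, exact tree inference no longer applies, which is why the slide switches to Loopy Belief Propagation.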
Spatio-temporal parsing
Transferring appearance models exploits continuity of appearance; after the repulsive model, now exploit continuity of position over time.
[pipeline: image parsing → learn appearance models over multiple frames → spatio-temporal parsing]
Spatio-temporal parsing
Idea: extend the model to a spatio-temporal block
Model
- add temporal edges
- inference with Loopy Belief Propagation
Φ = image evidence (given edge + appearance models)
Ψ = spatial prior (kinematic constraints)
Ω = temporal prior (continuity constraints)
Λ = repulsive prior (anti double-counting)
[figure: graphical model over frames 1..n with per-frame nodes l_H, l_T, l_LA, l_RA, spatial edges Ψ, repulsive edges Λ, and temporal edges Ω along time]
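To see what the temporal prior Ω buys, here is a deliberately reduced sketch: a single part on a 1D grid, where the temporal chain admits exact Viterbi decoding (the full block model couples all parts and needs loopy BP, as above). The quadratic smoothness weight is an illustrative assumption.

```python
import numpy as np

# Single-part temporal chain: Omega = -smooth * (position change)^2.

def viterbi_track(unary, smooth=1.0):
    """unary: (T, N) log-evidence per frame; returns the smoothed track."""
    T, N = unary.shape
    pos = np.arange(N)
    omega = -smooth * (pos[None, :] - pos[:, None]) ** 2   # [prev, cur]
    score = unary[0].copy()
    back = np.zeros((T, N), int)
    for t in range(1, T):
        cand = score[:, None] + omega                      # [prev, cur]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# frame-wise argmax jumps to an outlier in frame 1; Viterbi stays smooth
unary = np.full((3, 5), -1.0)
unary[:, 2] = 0.0        # consistent evidence at position 2
unary[1, 2] = -1.0
unary[1, 4] = 0.5        # spurious strong response at position 4 in frame 1
track = viterbi_track(unary, smooth=1.0)
```

Per-frame decoding follows the spurious response in the middle frame, while the temporally coupled solution keeps the track at position 2, which is exactly the uncertainty-reducing effect the next slide reports.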
Spatio-temporal parsing
Integrated spatio-temporal parsing reduces uncertainty
[pipeline recap: image parsing → learn appearance models over multiple frames → spatio-temporal parsing]
[figure: comparison with vs. without spatio-temporal parsing]
Full-body pose estimation in easier conditions
[videos] Weizmann action dataset (Blank et al. ICCV 05)
Evaluation dataset
- >70000 frames over 4 episodes of Buffy the Vampire Slayer (>1000 shots)
- uncontrolled and extremely challenging: low contrast, scale changes, moving camera and background, extensive clutter, any clothing, any pose
- figure overlays = final output, after spatio-temporal inference
Quantitative evaluation
Ground-truth: 69 shots x 4 = 276 frames; x 6 = 1656 body parts (sticks)
Upper-body detector fires on 243 frames (88%, 1458 body parts)
Data available! http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html
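With stick (line segment) ground truth, part-level accuracy is naturally measured by checking predicted endpoints against the annotated ones. A hedged sketch in that spirit: the 0.5 threshold below is the commonly used PCP setting and an assumption here, since the slide does not state its exact criterion.

```python
import numpy as np

# A predicted stick counts as correct when both endpoints lie within
# a fraction of the ground-truth segment length of the true endpoints.

def part_correct(pred, gt, frac=0.5):
    """pred, gt: ((x1, y1), (x2, y2)) stick endpoints."""
    gt = np.asarray(gt, float); pred = np.asarray(pred, float)
    length = np.linalg.norm(gt[0] - gt[1])
    return bool(np.linalg.norm(pred[0] - gt[0]) <= frac * length and
                np.linalg.norm(pred[1] - gt[1]) <= frac * length)

gt = ((0, 0), (0, 10))
ok_small_error = part_correct(((1, 0), (0, 9)), gt)   # slightly off
ok_big_shift = part_correct(((0, 6), (0, 16)), gt)    # shifted > half length
```

Averaging this indicator over the 1458 evaluated body parts yields the accuracy percentages reported on the next slide.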
Quantitative evaluation: pose estimation accuracy
- 62.6% with repulsive model (≅ with spatio-temporal parsing)
- 59.4% with appearance transfer
- 57.9% with foreground highlighting (best single-frame)
- 41.2% with detection
- 9.6% Ramanan NIPS 2006 unaided
Conclusions
+ both reduction techniques improve results
+ small improvement by appearance transfer …
+ … method is good also on static images
+ repulsive model brings us further
≅ spatio-temporal parsing has mostly aesthetic impact