Recovering Human Body Configurations: Combining Segmentation and Recognition

Greg Mori, Xiaofeng Ren, and Jitendra Malik (UC Berkeley)

Alexei A. Efros (Oxford)

The goal

Given an image:
  Detect a human figure
  Localize joints and limbs
  Create a skeleton of their pose
  Create a segmentation mask of the person

Other approaches: Simple features

Model people as generalized cylinders (1980s)
Easily implemented bottom up
Often use a tree to express relations
Problems:
  Cylinders are common
  Often dependencies between body parts
  Really need context

Other approaches: Probable pose

Often use probable pose
Template matching
Top-down constraints on pose
But even highly improbable poses are still possible

Other approaches: Frequent simplifications

Nude models
Limited poses
Background subtraction or limited clutter

“Arguably the most difficult recognition problem in computer vision”

Variation in clothing
Variation in limbs
Variation in pose

Solution: “Islands of Saliency”

Use low-level features that are informative independent of context
Based on these islands, one is able to fill in gaps with context

Algorithm

Algorithm: Segmenting into regions and superpixels

Segmentation

Combine boundary finder (Martin et al., 2002) with Normalized Cuts (Malik, Belongie, et al., 2001)
Groups similar pixels into regions

Segmentation: Regions

40 regions
Most salient parts of body become regions
Limbs usually two “half-limbs”

Segmentation: Superpixels

200 regions (oversegmentation)
Retains virtually all structures in original
Still reduces complexity from 400,000 pixels to 200 superpixels

Algorithm: Finding salient limbs and torsos

Finding limbs

Candidates: all 40 regions
Four cues for half-limb detection
  Contour: Probability of the boundary
    Average probability of the region’s boundary, as measured by Martin’s boundary finder
  Shape: How close to a rectangle
    Area of overlap with reconstructed rectangle
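
As a rough illustration of the contour and shape cues, the sketch below works on a region given as a list of (row, col) pixels. The boundary-probability map `pb_map`, the 4-neighbour definition of a boundary pixel, and the use of an axis-aligned bounding box as the "reconstructed rectangle" are all assumptions made for this example; the slides do not say how the rectangle is reconstructed.

```python
def boundary_pixels(region):
    """Pixels of the region with at least one 4-neighbour outside it."""
    region = set(region)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c) for (r, c) in region
            if any((r + dr, c + dc) not in region for dr, dc in steps)}


def contour_cue(region, pb):
    """Average boundary probability over the region's boundary pixels.

    `pb` maps (row, col) -> probability of boundary; missing pixels count as 0.
    """
    border = boundary_pixels(region)
    return sum(pb.get(p, 0.0) for p in border) / len(border)


def shape_cue(region):
    """Fraction of the bounding rectangle covered by the region.

    Assumption: an axis-aligned bounding box stands in for the
    reconstructed rectangle. A perfect rectangle scores 1.0.
    """
    rows = [r for r, _ in region]
    cols = [c for _, c in region]
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return len(set(region)) / box_area


# A 2x4 rectangle fills its box; an L-shaped region does not.
rectangle = [(r, c) for r in range(2) for c in range(4)]
l_shape = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
pb_map = {p: 0.8 for p in rectangle}  # made-up boundary probabilities
```

Here `shape_cue(rectangle)` is 1.0, while the L-shaped region covers only 5 of the 9 pixels in its bounding box.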

Finding limbs

Shading
  Limbs are roughly cylindrical, so should have 3D pop out due to shading
  Compare I_x-, I_x+, I_y-, I_y+ for region to mean of I_x-, I_x+, I_y-, I_y+ for training set
Focus cue
  Background is often not in focus
  C_focus = E_high / (a E_low + b)
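
The focus formula can be tried out on a 1-D intensity profile. In this minimal sketch the split into low- and high-frequency energy (a 3-tap moving average and its residual) and the default values of `a` and `b` are assumptions for illustration only; the slides give just the ratio itself.

```python
def focus_cue(e_high, e_low, a=1.0, b=1.0):
    """C_focus = E_high / (a * E_low + b), as on the slide.

    The defaults for a and b are placeholders, not values from the talk.
    """
    return e_high / (a * e_low + b)


def band_energies(signal):
    """Return (E_high, E_low) for a 1-D intensity profile.

    Assumption: a 3-tap moving average is the low-pass part and the
    residual (signal minus average) is the high-pass part.
    """
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - 1):i + 2]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    e_low = sum(v * v for v in low) / n
    e_high = sum(v * v for v in high) / n
    return e_high, e_low


sharp = [0, 10, 0, 10, 0, 10, 0, 10]   # strong fine detail (in focus)
blurred = [4, 5, 5, 5, 5, 5, 5, 6]     # almost no fine detail (out of focus)
c_sharp = focus_cue(*band_energies(sharp))
c_blurred = focus_cue(*band_energies(blurred))
```

With these toy profiles `c_sharp` comes out well above `c_blurred`, which is the behaviour the cue relies on to separate an in-focus figure from a blurred background.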

Finding limbs

Cues are combined by summing
Use logistic regression to learn weights (training set of hand-labeled half-limbs)
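
A small sketch of the cue combination: a weighted sum of cue values, with the weights fitted by logistic regression. The plain gradient-descent trainer, the learning rate, and the four toy samples are assumptions for this example and are not taken from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def limb_score(cues, weights, bias):
    """Weighted sum of the cue values plus a bias term."""
    return sum(w * c for w, c in zip(weights, cues)) + bias


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_cues = len(samples[0])
    weights, bias = [0.0] * n_cues, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_cues, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(limb_score(x, weights, bias)) - y
            for j in range(n_cues):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Toy (contour, shape, shading, focus) values; label 1 = hand-labeled half-limb.
X = [(0.9, 0.8, 0.7, 0.9), (0.8, 0.9, 0.6, 0.7),
     (0.2, 0.3, 0.4, 0.1), (0.1, 0.4, 0.3, 0.2)]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
scores = [limb_score(x, w, b) for x in X]
```

After training, the two positive examples score higher than the two negatives, so sorting regions by `limb_score` puts the likely half-limbs first.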

Evaluation: Cues

[Plot: number of hits (vertical axis) against number of candidates generated (horizontal axis)]

Evaluation: Performance

Evaluation summary

Not very good detectors
Strength of boundary best cue
Combining cues yields better performance
On average 4.08 of top 8 candidates produced were hits
89% have at least 3 hits among top 8
Motivates search for 3 half-limbs combined with head and torso
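
The two summary statistics above are simple to compute from per-image hit counts. The sketch below uses made-up counts for ten images, not the data behind the reported 4.08 and 89%.

```python
def summarize_hits(hits_per_image, min_hits=3):
    """Return (mean hits among the top candidates,
    fraction of images with at least min_hits hits)."""
    n = len(hits_per_image)
    mean_hits = sum(hits_per_image) / n
    frac_enough = sum(1 for h in hits_per_image if h >= min_hits) / n
    return mean_hits, frac_enough


# Hypothetical number of hits among the top 8 candidates, one entry per image.
toy_hits = [5, 4, 3, 6, 2, 4, 5, 3, 4, 4]
mean_hits, frac_enough = summarize_hits(toy_hits)
```

The second number is the one that justifies searching for three half-limbs: it estimates how often at least three true half-limbs are available among the top candidates.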

Finding torsos

Unlike half-limbs, typically several regions
Consider all sets of adjacent regions within some range of total sizes
Set of cues:
  Contour
  Shape
  Focus
  (No shading)

Finding torsos

Find orientation of torso
Find best matching head
  Again contour, shape, and focus cues, with a disk as the shape
Score for torso, score for head, and score for relative position of head to torso are multiplied to create the score for the oriented torso
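
A minimal sketch of the multiplicative score. The representation of a head candidate as a `(head_score, relative_position_score)` pair is a hypothetical data layout chosen for the example.

```python
def oriented_torso_score(torso_score, head_score, relative_position_score):
    """Product of the three scores, so one weak term pulls the whole score down."""
    return torso_score * head_score * relative_position_score


def best_head(torso_score, head_candidates):
    """Pick the head candidate that maximizes the oriented-torso score.

    `head_candidates` is a list of (head_score, relative_position_score) pairs.
    """
    return max(head_candidates,
               key=lambda h: oriented_torso_score(torso_score, h[0], h[1]))


candidates = [(0.9, 0.2), (0.6, 0.8), (0.7, 0.5)]
chosen = best_head(0.8, candidates)
```

The head with the highest individual score, `(0.9, 0.2)`, loses to `(0.6, 0.8)` because its position relative to the torso is implausible.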

Evaluation

Success if all four torso points within 60 pixels of ground truth
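
The success criterion can be written as a short check. This sketch assumes the four points are listed in matching order and that distance is Euclidean; the slide states neither.

```python
import math


def torso_is_correct(predicted, ground_truth, max_dist=60.0):
    """True if every predicted torso point is within max_dist pixels
    of the corresponding ground-truth point."""
    return all(math.hypot(p[0] - g[0], p[1] - g[1]) <= max_dist
               for p, g in zip(predicted, ground_truth))


# Made-up torso corner coordinates in pixels.
truth = [(100, 100), (200, 100), (200, 300), (100, 300)]
close = [(110, 120), (190, 90), (230, 310), (100, 340)]
far = [(110, 120), (190, 90), (230, 310), (100, 400)]
```

`close` passes because its worst point is 40 pixels off, while `far` fails on a single point that is 100 pixels off.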

Algorithm: Pruning to form partial configurations

Body building

From 5-7 half-limbs and ~50 candidate oriented torsos, form partial configurations consisting of:
  One torso
  Three half-limbs, each assigned to:
    One of 8 half-limb body parts
    One of two polarities
2-3 million partial configurations!
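
One plausible way to see where a count of this size comes from is sketched below. The assumption that the three half-limbs receive three distinct body-part labels is mine, not something the slides state, so the exact figures are illustrative only.

```python
from math import comb, perm


def count_partial_configurations(n_half_limbs, n_torsos=50,
                                 n_parts=8, n_polarities=2):
    """Count configurations of one torso plus 3 labeled, polarized half-limbs.

    Assumption: the three half-limbs get three distinct body-part labels.
    """
    limb_triples = comb(n_half_limbs, 3)   # which 3 half-limbs are used
    labelings = perm(n_parts, 3)           # distinct body parts, in order
    polarities = n_polarities ** 3         # one polarity per half-limb
    return n_torsos * limb_triples * labelings * polarities


counts = {n: count_partial_configurations(n) for n in (5, 6, 7)}
# counts == {5: 1344000, 6: 2688000, 7: 4704000}
```

Under these assumptions, 6 candidate half-limbs give about 2.7 million configurations, the same order of magnitude as the figure on the slide.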

Enforce constraints

Relative widths
  Foreshortening doesn’t affect width of limbs much
  Use anthropometric data to rule out limbs more than 4 standard deviations wider than expected
Length of limbs relative to torso
  Assume torso not too foreshortened
    No more than +/- 40% angle with image plane
  Again, prune limbs more than 4 standard deviations away from mean length, relative to torso
  Seems to be making some assumptions of probable pose
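
The two pruning tests differ in one detail: the width test is one-sided (only limbs that are too wide are rejected), while the length test is two-sided. A minimal sketch follows, with the means and standard deviations passed in as hypothetical anthropometric statistics expressed relative to the torso.

```python
def width_ok(width, mean_width, std_width, k=4.0):
    """Reject only limbs more than k standard deviations wider than expected."""
    return width <= mean_width + k * std_width


def length_ok(length_ratio, mean_ratio, std_ratio, k=4.0):
    """Reject limbs whose length relative to the torso is more than
    k standard deviations away from the mean, in either direction."""
    return abs(length_ratio - mean_ratio) <= k * std_ratio


# Made-up statistics: limb width 0.20 +/- 0.02 and length 0.60 +/- 0.05 of torso size.
too_wide = width_ok(0.30, 0.20, 0.02)     # False: beyond 0.20 + 4 * 0.02
thin_limb = width_ok(0.05, 0.20, 0.02)    # True: thin limbs are not pruned
long_limb = length_ok(1.00, 0.60, 0.05)   # False: 8 standard deviations away
```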

Enforce constraints

Adjacency
  Upper limbs must be adjacent to torso
  Lower limbs must be adjacent to upper limbs
Symmetry in clothing: color histograms must not be overly dissimilar for corresponding segments
  E.g. right and left upper arms should be similar
  Makes some small assumptions about variations in clothing
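
Both constraints reduce to cheap checks. In the sketch below the chi-squared histogram distance, the threshold of 0.3, and the representation of adjacency as a set of touching region pairs are assumptions for illustration; the slides do not name the distance measure or threshold that was used.

```python
def chi2_distance(hist_a, hist_b):
    """Chi-squared distance between two normalized histograms (0 = identical)."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(hist_a, hist_b) if a + b > 0)


def clothing_symmetric(hist_left, hist_right, max_distance=0.3):
    """True if corresponding segments have similar enough color histograms."""
    return chi2_distance(hist_left, hist_right) <= max_distance


def adjacency_ok(upper_limb, lower_limb, torso, adjacent):
    """`adjacent` is a set of frozenset pairs of region ids that touch."""
    return (frozenset((upper_limb, torso)) in adjacent
            and frozenset((lower_limb, upper_limb)) in adjacent)


# Toy 3-bin color histograms and a toy adjacency relation.
left_arm, right_arm, background = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]
touching = {frozenset(("torso", "upper_arm")),
            frozenset(("upper_arm", "lower_arm"))}
```

Here the two arm histograms pass the symmetry test, an arm paired with a background-colored region does not, and swapping the upper and lower arm violates adjacency.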

Body building: slimming down

Reduces to ~1000 partial configurations
Sorted by linear combination of the torso and the three half-limb scores
  (This score can be used to improve torso detection)
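
Ranking by a linear combination can be sketched as follows. The weights `w_torso` and `w_limb` are placeholders, since the slides do not give their values, and the `(torso_score, limb_scores)` tuple is a hypothetical layout.

```python
def configuration_score(torso_score, limb_scores, w_torso=1.0, w_limb=1.0):
    """Linear combination of the torso score and the three half-limb scores."""
    return w_torso * torso_score + w_limb * sum(limb_scores)


def rank_configurations(configs, **weights):
    """Sort (torso_score, limb_scores) pairs from best to worst."""
    return sorted(configs,
                  key=lambda c: configuration_score(c[0], c[1], **weights),
                  reverse=True)


configs = [(0.5, (0.2, 0.3, 0.1)),
           (0.4, (0.9, 0.8, 0.7)),
           (0.9, (0.1, 0.1, 0.1))]
ranked = rank_configurations(configs)
torso_heavy = rank_configurations(configs, w_torso=5.0)
```

With equal weights the configuration with strong limbs ranks first; weighting the torso five times more heavily promotes the configuration with the best torso instead, which shows how much the ordering depends on the chosen weights.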

Algorithm

Extending to full limbs
  Adding additional rectangles evaluated on adjacent superpixels to empty limb joints
  Want high internal similarity and high dissimilarity to surroundings
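
The stated preference (similar inside, different from the surroundings) could be scored in many ways. The sketch below is one hypothetical choice, using the mean intensities of superpixels: squared contrast between inside and surroundings minus the variance inside. It is not the measure used in the talk.

```python
from statistics import mean, pvariance


def extension_score(inside, surround):
    """Higher when the candidate rectangle is internally uniform and
    differs from its surroundings.

    `inside` and `surround` are lists of superpixel mean intensities.
    """
    internal_spread = pvariance(inside)
    contrast = (mean(inside) - mean(surround)) ** 2
    return contrast - internal_spread


# A uniform bright limb on a dark background vs. a patchy candidate.
good = extension_score([0.80, 0.82, 0.79], [0.20, 0.25, 0.30])
poor = extension_score([0.20, 0.80, 0.50], [0.45, 0.50, 0.55])
```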

Algorithm

Summary

“Arguably the most difficult problem in computer vision”
  Not solved here
Method here is appealing:
  Don’t need to store exemplars
  Island of saliency approach seems useful in many contexts
  Use some configural knowledge to make reasonable guesses
  Good illustration of integrating recognition and segmentation
