
3D Menagerie: Modeling the 3D Shape and Pose of Animals

Silvia Zuffi¹  Angjoo Kanazawa²  David Jacobs²  Michael J. Black³

¹IMATI-CNR, Milan, Italy  ²University of Maryland, College Park, MD  ³Max Planck Institute for Intelligent Systems, Tübingen, Germany

[email protected], {kanazawa, djacobs}@umiacs.umd.edu, [email protected]

Figure 1: Animals from images. We learn an articulated, 3D, statistical shape model of animals using very little training data. We fit the shape and pose of the model to 2D image cues, showing how it generalizes to previously unseen shapes.

Abstract

There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently, we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.

1. Introduction

The detection, tracking, and analysis of animals has many applications in biology, neuroscience, ecology, farming, and entertainment. Despite the wide applicability, the computer vision community has focused more heavily on modeling humans, estimating human pose, and analyzing human behavior. Can we take the best practices learned from the analysis of humans and apply these directly to animals? To address this, we take an approach for 3D human pose and shape modeling and extend it to modeling animals.

Specifically, we learn a generative model of the 3D pose and shape of animals and then fit this model to 2D image data as illustrated in Fig. 1. We focus on a subset of four-legged mammals that all have the same number of "parts" and model members of the families Felidae, Canidae, Equidae, Bovidae, and Hippopotamidae. Our goal is to build a statistical shape model like SMPL [23], which captures human body shape variation in a low-dimensional Euclidean subspace, models the articulated structure of the body, and can be fit to image data [7].

Animals, however, differ from humans in several important ways. First, the shape variation across species far exceeds the kinds of variations seen between humans. Even within the canine family, there is a huge variability in dog shapes as a result of selective breeding. Second, all these animals have tails, which are highly deformable and obviously not present in human shape models. Third, obtaining 3D data to train a model is much more challenging. SMPL and previous models like it (e.g. SCAPE [5]) rely on a large database of thousands of 3D scans of many people (capturing shape variation in the population) and a wide range of poses (capturing pose variation). Humans are particularly easy and cooperative subjects. It is impractical to bring a large number of wild animals into a lab environment for scanning, and it would be difficult to take scanning equipment into the wild to capture animal shapes in nature.

Since scanning live animals is impractical, we instead scan realistic toy animals to create a dataset of 41 scans of a range of quadrupeds, as illustrated in Fig. 2. We show that a model learned from toys generalizes to real animals.

The key to building a statistical 3D shape model is that all the 3D data must be in correspondence. This involves registering a common template mesh to every scan. This is a hard problem, which we approach by introducing a novel part-based model and inference scheme that extends the "stitched puppet" (SP) model [34]. Our new Global-Local Stitched Shape model (GLoSS) aligns a template to different shapes, providing a coarse registration between very different animals (Fig. 5, left). The GLoSS registrations are somewhat crude but provide a reasonable initialization for a model-free refinement, where the template mesh vertices deform towards the scan surface under an As-Rigid-As-Possible (ARAP) constraint [30] (Fig. 5, right).

Our template mesh is segmented into parts, with blend weights, so that it can be reposed using linear blend skinning (LBS). We "pose normalize" the refined registrations and learn a low-dimensional shape space using principal component analysis (PCA). This is analogous to the SMPL shape space [23].

Using the articulated structure of the template and its blend weights, we obtain a model where new shapes can be generated and reposed. With the learned shape model, we refine the registration of the template to the scans using co-registration [17], which regularizes the registration by penalizing deviations from the model fit to the scan. We update the shape space and iterate to convergence.

The final Skinned Multi-Animal Linear model (SMAL) provides a shape space of animals trained from 41 scans. Because quadrupeds have shape variations in common, the model generalizes to new animals not seen in training. This allows us to fit SMAL to 2D data using manually detected keypoints and segmentations. As shown in Fig. 1 and Fig. 9, our model can generate realistic animal shapes in a variety of poses.

In summary, we describe a method to create a realistic 3D model of animals and fit this model to 2D data. The problem is much harder than modeling humans, and we develop new tools to extend previous methods to learn an animal model. This opens up new directions for research on animal shape and motion capture.

2. Related Work

There is a long history on representing, classifying, and analyzing animal shapes in 2D [31]. Here we focus only on work in 3D. The idea of part-based 3D models of animals also has a long history. Similar in spirit to our GLoSS model, Marr and Nishihara [25] suggested that a wide range of animal shapes could be modeled by a small set of 3D shape primitives connected in a kinematic tree.

Animal shape from 3D scans. There is little work that systematically addresses the 3D scanning [3] and modeling of animals. The range of sizes and shapes, together with the difficulty of handling live animals and dealing with their movement, makes traditional scanning difficult. Previous 3D shape datasets like TOSCA [9] have a limited set of 3D animals that are artist-designed and of limited realism.

Animal shape from images. Previous work on modeling animal shape starts from the assumption that obtaining 3D animal scans is impractical and focuses on using image data to extract 3D shape. Cashman and Fitzgibbon [10] take a template of a dolphin and learn a low-dimensional model of its deformations from hand-clicked keypoints and manual segmentation. They optimize their model to minimize reprojection error to the keypoints and contour. They also show results for a pigeon and a polar bear. The formulation is elegant, but the approach suffers from an overly smooth shape representation; this is not so problematic for dolphins, but it is for other animals. The key limitation, however, is that they do not model articulation.

Kanazawa et al. [18] deform a 3D animal template to match hand-clicked points in a set of images. They learn separate deformable models for cats and horses using spatially varying stiffness values. Our model is stronger in that it captures articulation separately from shape variation. Further, we model the shape variation across a wide range of animals to produce a statistical shape model.

Ntouskos et al. [26] take multiple views of different animals from the same class, manually segment the parts in each view, and then fit geometric primitives to the segmented parts. They assemble these to form an approximate 3D shape. Vicente and Agapito [32] extract a template from a reference image and then deform it to fit a new image using keypoints and the silhouette. The results are of low resolution when applied to complex shapes.

Our work is complementary to these previous approaches that only use image data to learn 3D shapes. Future work should combine 3D scans with image data to obtain even richer models.


Figure 2: Toys. Example 3D scans of animal figurines used for training our model.

Animal shape from video. Ramanan et al. [27] model animals as a 2D kinematic chain of parts and learn the parts and their appearance from video. Bregler et al. [8] track features on a non-rigid object (e.g. a giraffe neck) and extract a 3D surface as well as its low-dimensional modes of deformation. Del Pero et al. [13] track and segment animals in video but do not address 3D shape reconstruction. Favreau et al. [14] focus on animating a 3D model of an animal given a 2D video sequence. Reinert et al. [28] take a video sequence of an animal and, using an interactive sketching/tracking approach, extract a textured 3D model of the animal. The 3D shape is obtained by fitting generalized cylinders to each sketched stroke over multiple frames.

None of these methods model the kinds of detail available in 3D scans, nor do they model the 3D articulated structure of the body. Most importantly, none try to learn a 3D shape space spanning multiple animals.

Human shape from 3D scans. Our approach is inspired by a long history of learning 3D shape models of humans. Blanz and Vetter [6] began this direction by aligning 3D scans of faces and computing a low-dimensional shape model. Faces have less shape variability and are less articulated than animals, simplifying mesh registration and modeling. Modeling articulated human body shape is significantly harder, but several models have been proposed [4, 5, 12, 16, 23]. Chen et al. [11] model both humans and sharks, factoring deformations into pose and shape. The 3D shark model is learned from synthetic data and they do not model articulation. Khamis et al. [20] learn an articulated hand model with shape variation from depth images.

We base our method on SMPL [23], which combines a low-dimensional shape space with an articulated blend-skinned model. SMPL is learned from 3D scans of 4000 people in a common pose and another 1800 scans of 60 people in a wide variety of poses. In contrast, we have much less data and at the same time much more shape variability to represent. Despite this, we show that we can learn a useful animal model for computer vision applications. More importantly, this provides a path to making better models using more scans as well as image data. We also go beyond SMPL to add a non-rigid tail and more parts than are present in human bodies.

Figure 3: Template mesh. It is segmented into 33 parts, and here posed in the neutral pose.

3. Dataset

We created a dataset of 3D animals by scanning toy figurines (Fig. 2) using an Artec hand-held 3D scanner. We also tried scanning taxidermy animals in a museum but found, surprisingly, that the shapes of the toys looked more realistic. We collected a total of 41 scans from several species: 1 cat, 5 cheetahs, 8 lions, 7 tigers, 2 dogs, 1 fox, 1 wolf, 1 hyena, 1 deer, 1 horse, 6 zebras, 4 cows, 3 hippos. We estimated a scaling factor so animals from different manufacturers were comparable in size. Like previous 3D human datasets [29], and methods that create animals from images [10, 18], we collected a set of 36 hand-clicked keypoints that we use to aid mesh registration. For more information see [1].

4. Global/Local Stitched Shape Model

The Global/Local Stitched Shape model (GLoSS) is a 3D articulated model where body shape deformations are locally defined for each part and the parts are assembled together by minimizing a stitching cost at the part interfaces. The model is inspired by the SP model [34], but has significant differences from it. In contrast to SP, the shape deformations of each part are analytic, rather than learned. This makes it more approximate but, importantly, allows us to apply it to novel animal shapes without requiring a priori training data. Second, GLoSS is a globally differentiable model that can be fit to data with gradient-based techniques.

To define a GLoSS model we need the following: a 3D template mesh of an animal with the desired polygon count, its segmentation into parts, skinning weights, and an animation sequence. To define the mesh topology, we use a 3D mesh of a lioness downloaded from the Turbosquid website. The mesh is rigged and skinning weights are defined. We manually segment the mesh into N = 33 parts (Fig. 3) and make it symmetric along its sagittal plane.

We now summarize the GLoSS parametrization. Let $i$ be a part index, $i \in (1 \cdots N)$. The model variables are: part location $l_i \in \mathbb{R}^{3 \times 1}$; part absolute 3D rotation $r_i \in \mathbb{R}^{3 \times 1}$, expressed as a Rodrigues vector; intrinsic shape variables $s_i \in \mathbb{R}^{n_s \times 1}$; and pose deformation variables $d_i \in \mathbb{R}^{n_d \times 1}$. Let $\pi_i = \{l_i, r_i, s_i, d_i\}$ be the set of variables for part $i$ and $\Pi = \{l, r, s, d\}$ the set of variables for all parts. The vector of vertex coordinates, $p_i \in \mathbb{R}^{3 \times n_i}$, for part $i$ in a global reference frame is computed as:

$$p_i(\pi_i) = R(r_i)\,\hat{p}_i + l_i, \quad (1)$$

where $n_i$ is the number of vertices in the part, and $R \in SO(3)$ is the rotation matrix obtained from $r_i$. The $\hat{p}_i \in \mathbb{R}^{3 \times n_i}$ are points in a local coordinate frame, computed as:

$$\mathrm{vec}(\hat{p}_i) = t_i + m_{p,i} + B_{s,i}\,s_i + B_{p,i}\,d_i. \quad (2)$$

Here $t_i \in \mathbb{R}^{3 n_i \times 1}$ is the part template, $m_{p,i} \in \mathbb{R}^{3 n_i \times 1}$ is the vector of average pose displacements, $B_{s,i} \in \mathbb{R}^{3 n_i \times n_s}$ is a matrix with columns representing a basis of intrinsic shape displacements, and $B_{p,i} \in \mathbb{R}^{3 n_i \times n_d}$ is the matrix of pose-dependent deformations. These deformation matrices are defined below.

Pose deformation space. We compute the part-based pose deformation space from examples. For this we use an animation of the lioness template using linear blend skinning (LBS). Each frame of the animation is a pose deformation sample. We perform PCA on the vertices of each part in a local coordinate frame, obtaining a vector of average pose deformations $m_{p,i}$ and the basis matrix $B_{p,i}$.

Shape deformation space. We define a synthetic shape space for each body part. This space includes 7 deformations of the part template, namely scale, scale along x, scale along y, scale along z, and three stretch deformations that are defined as follows. Stretch for x does not modify the x coordinate of the template points, while it scales the y and z coordinates in proportion to the value of x. Similarly we define the stretch for y and z. This defines a simple analytic deformation space for each part. We model the distribution of the shape coefficients as a Gaussian distribution with zero mean and diagonal covariance, where we set the variance of each dimension arbitrarily.
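To make the parametrization concrete, here is a minimal NumPy/SciPy sketch of Eqs. (1)-(2) together with the analytic per-part shape basis just described. The function names, the centering of the part template about its centroid, and the use of SciPy's rotation utilities are our illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def analytic_shape_basis(part_template):
    """Columns of B_s for one part: uniform scale, per-axis scales, and
    three 'stretch' modes (Sec. 4). part_template: (n_i, 3) array."""
    t = part_template - part_template.mean(axis=0)  # deform about part centroid (our choice)
    n = t.shape[0]
    B = np.zeros((3 * n, 7))
    B[:, 0] = t.ravel()                              # uniform scale
    for a in range(3):                               # scale along x, y, z
        d = np.zeros_like(t)
        d[:, a] = t[:, a]
        B[:, 1 + a] = d.ravel()
    for a in range(3):                               # stretch along axis a:
        d = np.zeros_like(t)                         # other axes scale with coord a
        for b in range(3):
            if b != a:
                d[:, b] = t[:, b] * t[:, a]
        B[:, 4 + a] = d.ravel()
    return B

def part_vertices(t_i, m_p, B_s, B_p, s_i, d_i, r_i, l_i):
    """Eqs. (1)-(2): local shape, then rigid placement in the world frame."""
    p_local = (t_i.ravel() + m_p + B_s @ s_i + B_p @ d_i).reshape(-1, 3)
    R = Rotation.from_rotvec(r_i).as_matrix()        # Rodrigues vector -> rotation
    return p_local @ R.T + l_i
```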

5. Initial Registration

The initial registration of the template to the scans is performed in two steps. First, we optimize the GLoSS model with a gradient-based method. This brings the model close to the scan. Then, we perform a model-free registration of the mesh vertices to the scan using As-Rigid-As-Possible (ARAP) regularization [30] to capture the fine details.

GLoSS-based registration. To fit GLoSS to a scan, we minimize the following objective:

$$E(\Pi) = E_m(d, s) + E_{stitch}(\Pi) + E_{curv}(\Pi) + E_{data}(\Pi) + E_{pose}(r), \quad (3)$$

where

$$E_m(d, s) = k_{sm} E_{sm}(s) + k_s \sum_{i=1}^{N} E_s(s_i) + k_d \sum_{i=1}^{N} E_d(d_i)$$

is a model term, where $E_s$ is the squared Mahalanobis distance from the synthetic shape distribution and $E_d$ is a squared L2 norm. The term $E_{sm}$ represents the constraint that symmetric parts should have similar shape deformations. We impose similarity between left and right limbs, front and back paws, and sections of the torso. This last constraint favors sections of the torso having similar length.

The stitching term $E_{stitch}$ is the sum of squared distances of the corresponding points at the interfaces between parts (cf. [34]). Let $C_{ij}$ be the set of vertex-vertex correspondences between part $i$ and part $j$. Then

$$E_{stitch}(\Pi) = k_{st} \sum_{(i,j) \in C} \sum_{(k,l) \in C_{ij}} \|p_{i,k}(\pi_i) - p_{j,l}(\pi_j)\|^2, \quad (4)$$

where $C$ is the set of part connections. Minimizing this term favors connected parts.
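A direct sketch of this term, under our own assumed data layout (parts as arrays of posed world-frame vertices, interfaces as lists of index pairs):

```python
import numpy as np

def stitch_energy(parts, interfaces, k_st=1.0):
    """Eq. (4): squared distances between duplicated interface vertices.

    parts:      dict part_id -> (n_i, 3) posed world-frame vertices p_i(pi_i)
    interfaces: dict (i, j)  -> list of (k, l) vertex index correspondences
    """
    e = 0.0
    for (i, j), corr in interfaces.items():
        for k, l in corr:
            diff = parts[i][k] - parts[j][l]   # gap between the two copies
            e += k_st * float(diff @ diff)
    return e
```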

The data term is defined as:

$$E_{data}(\Pi) = k_{kp} E_{kp}(\Pi) + k_{m2s} E_{m2s}(\Pi) + k_{s2m} E_{s2m}(\Pi), \quad (5)$$

where $E_{m2s}$ and $E_{s2m}$ are distances from the model to the scan and from the scan to the model, respectively:

$$E_{m2s}(\Pi) = \sum_{i=1}^{N} \sum_{k=1}^{n_i} \rho\big(\min_{s \in \mathcal{S}} \|p_{i,k}(\pi_i) - s\|^2\big), \quad (6)$$

$$E_{s2m}(\Pi) = \sum_{l=1}^{S} \rho\big(\min_{p} \|p(\Pi) - s_l\|^2\big), \quad (7)$$

where $\mathcal{S}$ is the set of $S$ scan vertices and $\rho$ is the Geman-McClure robust error function [15]. The term $E_{kp}(\Pi)$ matches model keypoints with scan keypoints, and is defined as the sum of squared distances between corresponding keypoints. This term is important to enable matching between extremely different animal shapes.
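Both robust distance terms reduce to nearest-neighbor searches followed by a robustifier. A sketch using k-d trees; the Geman-McClure form $\rho(r^2) = r^2/(r^2 + \sigma^2)$ is the standard one [15], but the scale $\sigma$ used here is our assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def geman_mcclure(r2, sigma=1.0):
    # rho(r^2) = r^2 / (r^2 + sigma^2); sigma is a guessed scale
    return r2 / (r2 + sigma ** 2)

def bidirectional_distance(model_pts, scan_pts, sigma=1.0):
    """Robust model-to-scan (Eq. 6) and scan-to-model (Eq. 7) terms.
    model_pts, scan_pts: (n, 3) arrays of 3D points."""
    d_m2s, _ = cKDTree(scan_pts).query(model_pts)   # closest scan point per model vertex
    d_s2m, _ = cKDTree(model_pts).query(scan_pts)   # closest model vertex per scan point
    e_m2s = geman_mcclure(d_m2s ** 2, sigma).sum()
    e_s2m = geman_mcclure(d_s2m ** 2, sigma).sum()
    return e_m2s, e_s2m
```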

The curvature term favors parts that have a similar pairwise relationship as those in the template:

$$E_{curv}(\Pi) = k_c \sum_{(i,j) \in C} \sum_{(k,l) \in C_{ij}} \Big| \|n_{i,k}(\pi_i) - n_{j,l}(\pi_j)\|^2 - \|n^{(t)}_{i,k} - n^{(t)}_{j,l}\|^2 \Big|,$$

where $n_i$ and $n_j$ are vectors of vertex normals on part $i$ and part $j$, respectively; analogous quantities on the template are denoted with a superscript $(t)$.


Figure 4: GLoSS fitting. (a) Initial template and scan. (b) GLoSS fit to scan. (c) GLoSS model showing the parts. (d) Merged mesh with global topology obtained by removing the duplicated vertices at the part interfaces.

Figure 5: Registration results. Comparing GLoSS (left) with the ARAP refinement (right). The fit to the scan is much tighter after refinement.

Lastly, $E_{pose}$ is a pose prior on the tail parts learned from animations of the tail. The values of the energy weights are manually defined and kept constant for all the toys.

We initialize the registration of each scan by aligning the model in the neutral pose to the scan based on the median value of their vertices. Given this, we minimize Eq. 3 using the Chumpy auto-differentiation package [2]. Doing so aligns the lioness GLoSS model to all the toy scans. Figure 4a-c shows an example of fitting GLoSS (colored) to a scan (white), and Fig. 5 (first and second columns) shows some of the obtained registrations. To compare the GLoSS-based registration with SP, we computed SP registrations for the big cats family. We obtain an average scan-to-mesh distance of 4.39 (σ = 1.66) for SP, and 3.22 (σ = 1.34) for GLoSS.

Figure 6: Registrations of toy scans in the neutral pose. (Columns: scan, ARAP registration, neutral pose.)

ARAP-based refinement. The GLoSS model gives a good initial registration. Given this, we turn each GLoSS mesh from its part-based topology into a global topology where interface points are not duplicated (Fig. 4d). We then further align the vertices $v$ to the scans by minimizing an energy function defined by a data term equal to Eq. 5 and an As-Rigid-As-Possible (ARAP) regularization term [30]:

$$E(v) = E_{data}(v) + E_{arap}(v). \quad (8)$$

This model-free optimization brings the mesh vertices closer to the scan and therefore more accurately captures the shape of the animal (see Fig. 5).
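The ARAP term is used here purely as a regularizer inside Eq. 8. For intuition, the sketch below shows the classic local step of ARAP [30]: fitting the best rotation of each vertex one-ring via SVD (the Kabsch solution). Uniform edge weights and the function signature are our simplifications; the original formulation uses cotangent weights.

```python
import numpy as np

def arap_local_rotations(V_rest, V_def, one_rings):
    """Per-vertex rotations that best map rest-pose one-ring edges onto
    the deformed ones. V_rest, V_def: (n, 3); one_rings: index arrays."""
    rotations = []
    for i, ring in enumerate(one_rings):
        P = V_rest[ring] - V_rest[i]        # rest edge vectors
        Q = V_def[ring] - V_def[i]          # deformed edge vectors
        U, _, Vt = np.linalg.svd(P.T @ Q)   # SVD of the edge covariance
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:            # guard against reflections
            Vt[-1] *= -1.0
            R = Vt.T @ U.T
        rotations.append(R)
    return rotations
```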

6. Skinned Multi-Animal Linear Model

The above registrations are now sufficiently accurate to create a first shape model, which we refine further below to produce the full SMAL model.

Pose normalization. Given the pose estimated with GLoSS, we bring all the registered templates into the same neutral pose using LBS. The resulting meshes are not symmetric. This is due to various reasons: inaccurate pose estimation, limitations of linear blend skinning, the toys may not be symmetric, and pose differences across sides of the body create different deformations. We do not want to learn this asymmetry. To address this, we average the vertices after we have mirrored the mesh to obtain the registrations in the neutral pose (Fig. 6). Also, the fact that mouths are sometimes open and other times closed presents a challenge for registration, as inside-mouth points are not observed in the scan when the animal has a closed mouth. To address this, palate and tongue points in the registration are regressed from the mouth points using a simple linear model learned from the template. Finally, we smooth the meshes with Laplacian smoothing.

Shape model. Pose normalization removes the non-linear effects of part rotations on the vertices. In the neutral pose we can thus model the statistics of the shape variation in a Euclidean space. We compute the mean shape and the principal components, which capture shape differences between the animals.
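Learning and using this shape space amounts to PCA on the flattened, pose-normalized registrations. A minimal sketch, assuming the meshes are stacked in a single array; the 30 components match the dimensionality mentioned for co-registration below:

```python
import numpy as np

def learn_shape_space(registrations, n_components=30):
    """PCA shape space from pose-normalized registrations (Sec. 6).

    registrations: (n_scans, V, 3) meshes in the neutral pose, in
    vertex-to-vertex correspondence with the template.
    """
    X = registrations.reshape(len(registrations), -1)  # (n_scans, 3V)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]                 # principal directions (rows)
    betas = (X - mean) @ basis.T              # per-scan shape coefficients
    return mean, basis, betas

def shape_from_betas(mean, basis, betas):
    """Neutral-pose vertices for shape coefficients beta (before LBS posing)."""
    return (mean + betas @ basis).reshape(-1, 3)
```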


Figure 7: PCA space. First 4 principal components. Mean shape is in the center. The width of the arrow represents the order of the components. We visualise deviations of ±2 std.

SMAL. The SMAL model is a function $M(\beta, \theta, \gamma)$ of shape $\beta$, pose $\theta$ and translation $\gamma$. $\beta$ is a vector of the coefficients of the learned PCA shape space, $\theta \in \mathbb{R}^{3N} = \{r_i\}_{i=1}^{N}$ is the relative rotation of the $N = 33$ joints in the kinematic tree, and $\gamma$ is the global translation applied to the root joint. Analogous to SMPL, the SMAL function returns a 3D mesh, where the template model is shaped by $\beta$, articulated by $\theta$ through LBS, and shifted by $\gamma$.

Fitting. To fit SMAL to scans we minimize the objective:

$$E(\beta, \theta, \gamma) = E_{pose}(\theta) + E_s(\beta) + E_{data}(\beta, \theta, \gamma), \quad (9)$$

where $E_{pose}(\theta)$ and $E_s(\beta)$ are squared Mahalanobis distances from prior distributions for pose and shape, respectively. $E_{data}(\beta, \theta, \gamma)$ is defined as in Eq. 5 but over the SMAL model. For optimization we use Chumpy [2].

Co-registration. To refine the registrations and the SMAL model further, we then perform co-registration [17]. The key idea is to first perform a SMAL model optimization to align the current model to the scans, and then run a model-free step where we couple, or regularize, the model-free registration to the current SMAL model by adding a coupling term to Eq. 8:

$$E_{coup}(v) = k_o \sum_{i=1}^{V} |v^0_i - v_i|, \quad (10)$$

where $V$ is the number of vertices in the template, $v^0_i$ is vertex $i$ of the model fit to the scan, and the $v_i$ are the coupled mesh vertices being optimized. During co-registration we use a shape space with 30 dimensions. We perform 4 iterations of registration and model building and observe the registration errors decrease and converge (see Sup. Mat. [1]). With the registrations to the toys in the last iteration we learn the shape space of our final SMAL model.
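The overall loop is easy to state even though each step hides an optimization. In this sketch, coupling_energy follows Eq. 10 directly, while fit_smal, register_free, and rebuild_shape_space are hypothetical callables standing in for the model-based fit (Eq. 9), the coupled free-form registration (Eqs. 8 and 10), and the PCA re-learning:

```python
import numpy as np

def coupling_energy(v, v_model, k_o=1.0):
    """Eq. (10): couple free vertices v to the current SMAL fit v_model."""
    return k_o * np.abs(v - v_model).sum()

def co_registration(scans, fit_smal, register_free, rebuild_shape_space,
                    n_iters=4):
    """Alternation of Sec. 6. The three callables are placeholders:
      fit_smal(scan)            -> model vertices (minimizes Eq. 9)
      register_free(scan, v0)   -> free vertices (Eq. 8 plus Eq. 10)
      rebuild_shape_space(regs) -> re-runs PCA on the new registrations
    """
    for _ in range(n_iters):                 # 4 iterations in the paper
        registrations = []
        for scan in scans:
            v0 = fit_smal(scan)              # align the model to the scan
            v = register_free(scan, v0)      # refine, coupled to the model
            registrations.append(v)
        rebuild_shape_space(registrations)   # update the PCA shape space
    return registrations
```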

Figure 8: Visualization (using t-SNE [24]) of different animal families using 8 PCs. Large dots indicate the mean of the PCA coefficients for each family.

Animal shape space. After refining with co-registration, the final principal components are visualized in Fig. 7. The global SMAL shape space captures the shape variability of animals across different families. The first component captures scale differences; our training set includes adult and young animals. The learned space nicely separates shape characteristics of animal families. This is illustrated in Fig. 8 with a t-SNE visualization [24] of the first 8 dimensions of the PCA coefficients in the training set. The meshes correspond to the mean shape for each family. We also define family-specific shape models by computing a Gaussian over the PCA coefficients of the class. We compare generic and family-specific models below.

7. Fitting Animals to Images

We now fit the SMAL model, $M(\beta, \theta, \gamma)$, to image cues by optimizing the shape and pose parameters. We fit the model to a combination of 2D keypoints and 2D silhouettes, both manually extracted, as in previous work [10, 18].

We denote $\Pi(\cdot; f)$ as the perspective camera projection with focal length $f$, where $\Pi(v_i; f)$ is the projection of the $i$'th vertex onto the image plane and $\Pi(M; f) = \hat{S}$ is the projected model silhouette. We assume an identity camera placed at the origin and that the global rotation of the 3D mesh is defined by the rotation of the root joint.

To fit SMAL to an image, we formulate an objective function and minimize it with respect to $\Theta = \{\beta, \theta, \gamma, f\}$. The function is a sum of the keypoint and silhouette reprojection errors, a shape prior, and two pose priors:

$$E(\Theta) = E_{kp}(\Theta; x) + E_{silh}(\Theta; S) + E_{\beta}(\beta) + E_{\theta}(\theta) + E_{lim}(\theta). \quad (11)$$

Each energy term is weighted by a hyper-parameter defining its importance.

Keypoint reprojection. See [1] for a definition of keypoints, which include surface points and joints. Since keypoints may be ambiguous, we assign a set of up to four vertices to represent each model keypoint and take the average of their projection to match the target 2D keypoint. Specifically, for the $k$'th keypoint, let $x$ be the labeled 2D keypoint and $\{v_{k_j}\}_{j=1}^{k_m}$ be the assigned set of vertices; then

$$E_{kp}(\Theta) = \sum_k \rho\Big(\big\|x - \frac{1}{|k_m|} \sum_{j=1}^{|k_m|} \Pi(v_{k_j}; \Theta)\big\|^2\Big), \quad (12)$$

where $\rho$ is the Geman-McClure robust error function [15].
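A sketch of Eq. (12), assuming the projected vertices are precomputed; the robustifier scale is our guess:

```python
import numpy as np

def gm(r2, sigma=100.0):
    # Geman-McClure robustifier [15]; sigma in pixels is an assumed value
    return r2 / (r2 + sigma ** 2)

def keypoint_energy(verts_2d, keypoints_2d, kp_vertex_sets):
    """Eq. (12): each model keypoint is the mean projection of its
    (up to four) assigned vertices, compared to the labeled 2D point.

    verts_2d:       (V, 2) projected model vertices Pi(v; Theta)
    keypoints_2d:   (K, 2) labeled 2D keypoints
    kp_vertex_sets: length-K list of index arrays into verts_2d
    """
    e = 0.0
    for x, idx in zip(keypoints_2d, kp_vertex_sets):
        proj = verts_2d[idx].mean(axis=0)   # average of assigned projections
        e += gm(float(((x - proj) ** 2).sum()))
    return e
```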

Silhouette reprojection. We encourage silhouette coverage and consistency similar to [19, 21, 33] using a bi-directional distance:

$$E_{silh}(\Theta) = \sum_{\hat{x} \in \hat{S}} D_S(\hat{x}) + \sum_{x \in S} \rho\big(\min_{\hat{x} \in \hat{S}} \|\hat{x} - x\|^2\big), \quad (13)$$

where $S$ is the ground truth silhouette, $\hat{S}$ is the projected model silhouette, and $D_S$ is the L2 distance transform field of $S$, such that if point $\hat{x}$ is inside the silhouette, $D_S(\hat{x}) = 0$. Since the silhouette terms have small basins of attraction, we optimize the term over multiple scales in a coarse-to-fine manner.
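Both silhouette terms can be evaluated on binary masks with a distance transform and a nearest-neighbor query. A sketch under those assumptions (the mask layout and σ are ours):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from scipy.spatial import cKDTree

def silhouette_energy(sil_gt, sil_model, sigma=10.0):
    """Eq. (13) on boolean (H, W) masks: S = sil_gt, S-hat = sil_model."""
    d_gt = distance_transform_edt(~sil_gt)      # D_S: zero inside S
    e_cover = d_gt[sil_model].sum()             # model pixels pay distance to S
    mys, mxs = np.nonzero(sil_model)
    gys, gxs = np.nonzero(sil_gt)
    tree = cKDTree(np.stack([mxs, mys], axis=1))        # model silhouette pixels
    d, _ = tree.query(np.stack([gxs, gys], axis=1))     # nearest model pixel
    e_consist = (d ** 2 / (d ** 2 + sigma ** 2)).sum()  # Geman-McClure [15]
    return e_cover + e_consist
```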

Shape prior. We encourage $\beta$ to be close to the prior distribution of shape coefficients by defining $E_{\beta}$ to be the squared Mahalanobis distance with zero mean and variance given by the PCA eigenvalues. When the animal family is known, we can make our fits more specific by using the mean and covariance of training samples of the particular family.

Pose priors. $E_{\theta}$ is also defined as the squared Mahalanobis distance, using the mean and covariance of the poses across all the training samples and a walking sequence. To make the pose prior symmetric, we double the training data by reflecting the poses along the template's sagittal plane. Since we do not have many examples, we further constrain the pose with limit bounds:

$$E_{lim}(\theta) = \max(\theta - \theta_{max}, 0) + \max(\theta_{min} - \theta, 0). \quad (14)$$

$\theta_{max}$ and $\theta_{min}$ are the maximum and minimum range of values for each dimension of $\theta$, respectively, which we define by hand. We do not limit the global rotation.

Optimization. Following [7], we first initialize the depth of $\gamma$ using the torso points. Then we solve for the global rotation $\{\theta_i\}_{i=0}^{3}$ and $\gamma$ using $E_{kp}$ over points on the torso. Using these as the initialization, we solve Eq. 11 for the entire $\Theta$ without $E_{silh}$. Similar to previous methods [7, 18], we employ a staged approach where the weights on pose and shape priors are gradually lowered over three stages. This helps avoid getting trapped in local optima. We then finally include the $E_{silh}$ term and solve Eq. 11 starting from this initialization. Solving for the focal length is important, and we regularize $f$ by adding another term that forces $\gamma$ to be close to its initial estimate. The entire optimization is done using OpenDR and Chumpy [2, 22]. Optimization for a single image typically takes less than a minute on a common Linux machine.
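Eq. (14) is a simple hinge penalty, and the staged schedule is a list of weight settings. In the sketch below the limit term follows the equation directly, while the stage weights are invented values for illustration; the paper's weights are tuned by hand:

```python
import numpy as np

def limit_energy(theta, theta_min, theta_max):
    """Eq. (14): hinge penalty on pose dimensions outside hand-set bounds
    (the global rotation is excluded from the bounds in the paper)."""
    return (np.maximum(theta - theta_max, 0.0).sum()
            + np.maximum(theta_min - theta, 0.0).sum())

# Staged schedule of Sec. 7: prior weights lowered over three stages,
# then the silhouette term is switched on. The numbers are made up.
STAGES = [
    dict(w_shape=10.0, w_pose=10.0, use_silh=False),
    dict(w_shape=5.0,  w_pose=5.0,  use_silh=False),
    dict(w_shape=1.0,  w_pose=1.0,  use_silh=False),
    dict(w_shape=1.0,  w_pose=1.0,  use_silh=True),   # final: add E_silh
]
```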

8. Experiments

We have shown how to learn a SMAL animal model from a small set of toy figurines. Now the question is: does this model capture the shape variation of real animals? Here we test this by fitting the model to annotated images of real animals. We fit using class-specific and generic shape models, and show that the shape space generalizes to new animal families not present in training (within reason).

Data. For fitting, we use 19 semantic keypoints of [13] plus an extra point for the tail tip. Note that these keypoints differ from those used in the 3D alignment. We fit frames in the TigDog dataset, reusing their annotation, frames from the Muybridge footage, and images downloaded from the Internet. For images without annotation, we click the same 20 keypoints for all animals, which takes about one minute for each image. We also hand-segmented all the images. No images were re-visited to improve their annotations, and we found the model to be robust to noise in the exact location of the keypoints. See [1] for data, annotations, and results.

Results. The model fits to real images of animals are shown in Fig. 1 and 9. The weights for each term in Eq. 11 are tuned by hand and held fixed for fitting all images. All results use the animal-specific shape space except for those in mint green, which use the generic shape model. Despite being trained on scans of toys, our model generalizes to images of real animals, capturing their shape well. Variability in animal families with extreme shape characteristics (e.g. lion manes, skinny horse legs, hippo faces) is modeled well. Both the generic and class-specific models capture the shape of real animals well.

Similar to the case of humans [7], our main failures are due to inherent depth ambiguity, both in global rotation and pose (Fig. 10). In Fig. 11 we show the results of fitting the generic shape model to classes of animals not seen in the training set: boar, donkey, sheep and pigs. While characteristic shape properties such as the pig snout cannot be exactly captured, these fits suggest that the learned PCA space can generalize to new animals within a range of quadrupeds.

9. Conclusions

Human shape modeling has a long history, while animal modeling is in its infancy. We have made small steps towards making the building of animal models practical. We showed that starting with toys, we can learn a model that generalizes to images of real animals as well as to types of animals not seen during training. This gives a procedure for building richer models from more animals and more scans. While we have shown that toys are a good starting point, we would clearly like a much richer model. For that we believe that we need to incorporate image and video evidence. Our fits to images provide a starting point from which to learn richer deformations to explain 2D image evidence.


Figure 9: Fits to real images using manually obtained 2D points and segmentation. Colors indicate animal family. We show the input image, the fit overlaid, and views from −45° and 45°. All results except for those in mint colors use the animal-specific shape prior. The SMAL model, learned from toy figurines, generalizes to real animal shapes.

Figure 10: Failure examples due to depth ambiguity in pose and global rotation.

Here we have focused on a limited set of quadrupeds. A key issue is dealing with varying numbers of parts (e.g. horns, tusks, trunks) and parts of widely different shape (e.g. elephant ears). Moving beyond the class of animals here will involve creating a vocabulary of reusable shape parts and new ways of composing them.

Acknowledgements. We thank Seyhan Sitti for scanning the toys and Federica Bogo and Javier Romero for their help with the silhouette term. AK and DJ were supported by the National Science Foundation under grant no. IIS-1526234.

Figure 11: Generalization of SMAL to animal species not present in the training set.

References

[1] http://smal.is.tue.mpg.de.
[2] http://chumpy.org.
[3] Digital life. http://www.digitallife3d.com/, accessed November 12, 2016.
[4] B. Allen, B. Curless, Z. Popovic, and A. Hertzmann. Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '06, pages 147–156. Eurographics Association, 2006.
[5] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: Shape Completion and Animation of PEople. ACM Trans. Graph. (Proc. SIGGRAPH), 24(3):408–416, 2005.
[6] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, pages 187–194. ACM, 1999.
[7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conf. on Computer Vision (ECCV), Oct. 2016.
[8] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In CVPR, pages 2:690–696, 2000.
[9] A. Bronstein, M. Bronstein, and R. Kimmel. Numerical Geometry of Non-Rigid Shapes. Springer Publishing Company, 2008.
[10] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):232–244, Jan. 2013.
[11] Y. Chen, T. Kim, and R. Cipolla. Inferring 3D shapes and deformations from single views. In Proc. European Conf. on Computer Vision, Part III, pages 300–313, 2010.
[12] Y. Chen, Z. Liu, and Z. Zhang. Tensor-based human body modeling. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 105–112, June 2013.
[13] L. Del Pero, S. Ricco, R. Sukthankar, and V. Ferrari. Behavior discovery and alignment of articulated object classes from unstructured video. International Journal of Computer Vision, 121(2):303–325, 2017.
[14] L. Favreau, L. Reveret, C. Depraz, and M.-P. Cani. Animal gaits from video. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 277–286. Eurographics Association, 2004.
[15] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 52(4):5–21, 1987.
[16] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H. Seidel. A statistical model of human pose and body shape. Computer Graphics Forum, 28(2):337–346, 2009.
[17] D. Hirshberg, M. Loper, E. Rachlin, and M. Black. Coregistration: Simultaneous alignment and modeling of articulated 3D shape. In European Conf. on Computer Vision (ECCV), LNCS 7577, Part IV, pages 242–255. Springer-Verlag, Oct. 2012.
[18] A. Kanazawa, S. Kovalsky, R. Basri, and D. Jacobs. Learning 3D deformation of animals from 2D images. Comput. Graph. Forum, 35(2):365–374, May 2016.
[19] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.
[20] S. Khamis, J. Taylor, J. Shotton, C. Keskin, S. Izadi, and A. Fitzgibbon. Learning an efficient model of hand shape variation from depth images. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[21] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conf. on Computer Vision (ECCV), pages 154–169, 2014.
[23] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
[24] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[25] D. Marr and K. Nishihara. Representation and recognition of the spatial organization of three dimensional shapes. Proceedings of the Royal Society of London, Series B, Biological Sciences, 200(1140):269–294, 1978.
[26] V. Ntouskos, M. Sanzari, B. Cafaro, F. Nardi, F. Natola, F. Pirri, and M. Ruiz. Component-wise modeling of articulated objects. In IEEE International Conference on Computer Vision (ICCV), December 2015.
[27] D. Ramanan, D. A. Forsyth, and K. Barnard. Building models of animals from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1319–1334, 2006.
[28] B. Reinert, T. Ritschel, and H.-P. Seidel. Animated 3D creatures from single-view video by skeletal sketching. In GI '16: Proceedings of the 42nd Graphics Interface Conference, 2016.
[29] K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and D. Burnsides. Civilian American and European Surface Anthropometry Resource (CAESAR) final report. Technical Report AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002.
[30] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, pages 109–116, 2007.
[31] D. W. Thompson. On Growth and Form. Cambridge University Press, 1917.
[32] S. Vicente and L. Agapito. Balloon shapes: Reconstructing and deforming objects with volume from images. In Conference on 3D Vision (3DV), 2013.
[33] A. Weiss, D. Hirshberg, and M. Black. Home 3D body scans from noisy image and range data. In Int. Conf. on Computer Vision (ICCV), Nov. 2011.
[34] S. Zuffi and M. J. Black. The stitched puppet: A graphical model of 3D human shape and pose. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3537–3546, June 2015.