Deep Visual Analogy-Making

Scott Reed, Yi Zhang, Yuting Zhang, Honglak Lee

University of Michigan, Ann Arbor

Text analogies

We are familiar with word analogies like the following:

KING : QUEEN :: MAN : WOMAN

PARIS : FRANCE :: BEIJING : CHINA

BILL : HILLARY :: BARACK : MICHELLE

Neural word embeddings have been found to exhibit regularities allowing analogical reasoning by *vector* addition.

Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." In NIPS, 2013.
Mikolov, Tomas, et al. "Linguistic regularities in continuous space word representations." In NAACL, 2013.

https://code.google.com/p/word2vec/
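The same arithmetic is easy to check in a few lines of Python. This is a minimal sketch with made-up toy vectors; with real word2vec embeddings the same nearest-neighbor query works at scale:

```python
import numpy as np

# Toy 3-d vectors standing in for learned word embeddings; the numbers
# are made up purely to illustrate the vector-arithmetic idea.
vecs = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.6, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def nearest(target, exclude):
    # Return the vocabulary word whose vector is most cosine-similar to `target`.
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# KING : QUEEN :: MAN : ?   =>   queen - king + man  ~=  woman
print(nearest(vecs["queen"] - vecs["king"] + vecs["man"],
              exclude={"king", "queen", "man"}))
```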

[Figure: 2D projection of embeddings, showing King, Queen, Man, and Woman.]

Visual analogy-making

We can also make up *visual* analogy problems: changing color, changing shape, changing size.

Can we take a similar approach as for the neural word embedding models?

Solving the analogy requires 2 things:

1. We understand the visual relationship of the first pair of images.
2. We can correctly apply the transformation to a query image.

Related work

- Tenenbaum and Freeman, 2000. Separating style and content with bilinear models: factorize the representation into style and content units so they can be separately adjusted.
- Hertzmann et al., 2001. Image analogies: change image textures / style by example.
- Dollár et al., 2007. Learning to traverse image manifolds (locally-smooth manifold learning): traverse the image manifold induced by transformations (e.g. out-of-plane rotations).
- Memisevic and Hinton, 2010. Learning to represent spatial transformations with factored higher-order Boltzmann machines: the Boltzmann machine learns to represent the relation between a transformation pair and applies the transformation to queries.
- Susskind et al., 2011. Modeling the joint density of two images under a variety of transformations.
- Hwang et al., 2013. Analogy-preserving semantic embedding for visual object categorization: use image analogies as a regularizer to improve classification performance.

Very recent / contemporary work

- Zhu et al., 2014. Multi-view perceptron: deep network disentangling face identity and viewpoint.
- Michalski et al., 2014. Modeling deep temporal dependencies with recurrent grammar cells: multiplicative and recurrent sequence prediction, multi-step transformations.
- Kiros et al., 2014. Unifying visual-semantic embeddings with multimodal neural language models: regularities in a multi-modal embedding space; showed some correct analogy image *retrieval* by vector addition.
- Dosovitskiy et al., 2015. Learning to generate chairs with convolutional neural networks: showed that high-quality images can be rendered by a convnet.
- Kulkarni et al., 2015. Deep convolutional inverse graphics network: deep VAE model with a disentangled representation.
- Cohen and Welling, 2014. Learning the irreducible representations of commutative Lie groups; and 2015, Transformation properties of learned visual representations: a model with tractable probabilistic inference over compact commutative Lie groups (including rotation and cyclic translation), later extended to 3D rotation (NORB).

What we do differently:
- simple deep convolutional encoder-decoder architecture
- training objective is end-to-end analogy completion
- we can also learn disentangled representations as a special case


Here I will walk through a cartoon example of our approach.

Analogy image prediction objective:

L = Σ_{a,b,c,d} || d - g( f(c) + T( f(b) - f(a), f(c) ) ) ||²

Research questions:
1) What form should encoder f and decoder g take?
2) What form should the transformation T take?
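Written as code, the objective is a short wrapper around the three modules it asks about; f, g, and T here are placeholders whose forms are exactly the two research questions (a minimal PyTorch sketch):

```python
import torch.nn.functional as F

def analogy_loss(f, g, T, a, b, c, d):
    """Squared-error analogy completion: predict d from (a, b, c)."""
    increment = T(f(b) - f(a), f(c))   # transformation inferred from the reference pair
    pred = g(f(c) + increment)         # apply to the query embedding, then decode
    return F.mse_loss(pred, d)
```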

1) What form should f and g take?
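In this work f and g are a convolutional encoder and decoder. A minimal PyTorch sketch, with layer sizes chosen for illustration rather than matching the paper's exact architecture:

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    # Maps a 3x64x64 image to a k-dim embedding (sizes are illustrative).
    def __init__(self, k=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, k),
        )
    def forward(self, x):
        return self.net(x)

class ConvDecoder(nn.Module):
    # Maps a k-dim embedding back to a 3x64x64 image.
    def __init__(self, k=128):
        super().__init__()
        self.fc = nn.Linear(k, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )
    def forward(self, z):
        h = self.fc(z).view(-1, 128, 8, 8)
        return self.net(h)
```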

2) What form should T take?

- Add: T(f(b) - f(a), f(c)) = f(b) - f(a)
- Multiply: T(f(b) - f(a), f(c)) = W ×_1 (f(b) - f(a)) ×_2 f(c), where a 3-way tensor W mixes the difference with the query embedding
- Deep: T(f(b) - f(a), f(c)) = MLP([f(b) - f(a); f(c)])
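A sketch of the three parameterizations of T; the embedding size K is a hypothetical choice, and the layer sizes may differ from the paper's:

```python
import torch
import torch.nn as nn

K = 128  # embedding dimension (hypothetical choice)

def T_add(delta, z):
    # Additive: the increment is just the embedding difference.
    return delta

class TMul(nn.Module):
    # Multiplicative: a 3-way tensor W mixes the difference with the query embedding.
    def __init__(self, k=K):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, k, k) * 0.01)
    def forward(self, delta, z):
        # increment[l] = sum_{i,j} W[i, j, l] * delta[i] * z[j]
        return torch.einsum('ijl,bi,bj->bl', self.W, delta, z)

class TDeep(nn.Module):
    # Deep: an MLP on the concatenation [delta; z].
    def __init__(self, k=K):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))
    def forward(self, delta, z):
        return self.mlp(torch.cat([delta, z], dim=1))
```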

Manifold regularization

Idea: we also want the increment T to be close to the difference of embeddings f(d) - f(c).
- Stronger local gradient signal for the encoder
- In practice, helps to traverse image manifolds
- Allows repeated application of analogies

Use a weighted combination of the analogy objective and this manifold term (see the sketch below).
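A sketch of the combined objective; the weight alpha is a hypothetical hyperparameter name, not the paper's notation:

```python
import torch.nn.functional as F

def regularized_analogy_loss(f, g, T, a, b, c, d, alpha=0.01):
    increment = T(f(b) - f(a), f(c))
    pred = g(f(c) + increment)
    l_analogy = F.mse_loss(pred, d)
    # Manifold term: no decoder here -- we match the increment to the
    # actual step f(d) - f(c) in embedding space.
    l_manifold = F.mse_loss(increment, f(d) - f(c))
    return l_analogy + alpha * l_manifold
```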

* Note: there is no decoder in the manifold term. It forces the transformation increment T to match the actual step on the manifold from c to d.

Traversing image manifolds - algorithm:

z = f(c)
for i = 1 to N do
    z = z + T(f(b) - f(a), z)
    x_i = g(z)
end
return generated images x
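The same loop in Python, reusing the modules sketched above:

```python
def traverse_manifold(f, g, T, a, b, c, n_steps):
    """Repeatedly apply the reference transformation, staying in embedding space."""
    delta = f(b) - f(a)       # transformation from the reference pair
    z = f(c)                  # start at the query embedding
    frames = []
    for _ in range(n_steps):
        z = z + T(delta, z)   # step along the manifold
        frames.append(g(z))   # decode the current point to an image
    return frames
```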

[Figure: reference pair (a, b) and query c, with generated traversal frames x1, x2, x3, x4.]

Learning a disentangled representation

Disentangling + analogy training

Perform analogy-making on the pose units, and disentangle the identity units from them.

Classification + analogy training

Perform analogy-making on the pose units, and classification on the separate identity units. Note that the identity units are also used in decoding. A sketch of this split appears below.
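A minimal sketch of the split, assuming the first num_pose embedding dimensions are the pose units and id_classifier is a hypothetical identity classifier (shown here with the additive transformation):

```python
import torch

def disentangled_analogy_step(f, g, id_classifier, a, b, c, num_pose):
    # Split the query embedding into pose units and identity units.
    z = f(c)
    pose, identity = z[:, :num_pose], z[:, num_pose:]
    # Analogy-making acts only on the pose units...
    delta = (f(b) - f(a))[:, :num_pose]
    new_pose = pose + delta
    # ...while the identity units pass straight through to the decoder
    image = g(torch.cat([new_pose, identity], dim=1))
    # and also feed a classifier for the identity loss.
    logits = id_classifier(identity)
    return image, logits
```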

Experiments

Shape predictions: additive model
[Figure: rotate / scale / shift rows; columns ref, out, query; predictions at t = 1, 2, 3, 4.]

Shape predictions: multiplicative model
[Figure: same layout.]

Shape predictions: deep model
[Figure: same layout, with predictions at t = 1, 2, 3, 4.]

Repeated rotation prediction

Shapes quantitative comparison

The multiplicative (mul) model is slightly better than the additive (add) model, but only the deep network model (deep) can learn repeated rotation analogies.

[Figure panels: Rotation, Scaling, Translation, Scale + Translate, Rotate + Translate, Scale + Rotate.]

Note that a single model can do all of these (multi-task); we do not train one model for each transformation.

[Figure: Walk, Thrust, and Spell-cast reference animations, with a query start frame.]

Transfer the *trajectory* from the reference to the query frame. At each step, we get a new transformation f(x_t) - f(x_{t-1}) and apply it to the current query embedding (all updates happening on the manifold), as in the sketch below.
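A sketch of the transfer loop; reference_frames is assumed to be the ordered frames of the reference animation:

```python
def transfer_animation(f, g, T, reference_frames, query_start):
    """Replay a reference trajectory starting from the query frame."""
    z = f(query_start)
    outputs = []
    for x_prev, x_t in zip(reference_frames[:-1], reference_frames[1:]):
        delta = f(x_t) - f(x_prev)   # per-step transformation from the reference
        z = z + T(delta, z)          # applied to the current query embedding
        outputs.append(g(z))
    return outputs
```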

Animation transfer - quantitative

Additive and disentangling objectives perform comparably, generating reasonable results. The best performance by a wide margin is achieved by disentangling + attribute classifier training, generating almost perfect results.

Extrapolating animations by analogy

Idea: generate training examples in which the transformation is advancing frames in the animation (see the sketch below).
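A minimal sketch of how such analogy tuples could be constructed; the exact sampling scheme in the paper may differ, and step is a hypothetical parameter:

```python
def extrapolation_examples(frames, step=1):
    # Every (frame_i, frame_{i+step}) pair encodes the same "advance the
    # animation" transformation, so any two such pairs form an analogy
    # tuple (a, b, c, d) for training.
    pairs = [(frames[i], frames[i + step]) for i in range(len(frames) - step)]
    return [(a, b, c, d) for (a, b) in pairs for (c, d) in pairs]
```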

Disentangling car pose and appearance

Pose units are discriminative for same-or-different pose verification, but not for ID verification. ID units are discriminative for ID verification, but less discriminative for pose.

Repeated rotation analogy applied to 3D car CAD models.

Conclusions

We proposed novel deep architectures that can perform visual analogy-making by simple operations in an embedding space.

Convolutional encoder-decoder networks can effectively generate transformed images.

Modeling transformations by vector addition in embedding space works for simple problems, but multi-layer transformation networks are needed for more complex ones, such as repeated rotation.

Analogy and disentangling training methods can be combined, and analogy representations can overcome limitations of disentangled representations by learning the transformation manifold.

Thank you! Questions?