
Recurrent Transformer Networks for Semantic Correspondence

Seungryong Kim, Stephen Lin, Sangryul Jeon, Dongbo Min, and Kwanghoon Sohn

Neural Information Processing Systems (NeurIPS) 2018

Semantic Correspondence
• Establishing dense correspondences between semantically similar images (i.e., different instances of the same object class)

Introduction

Background

Recurrent Transformer Networks

Experimental Results and Discussion

Challenges in Semantic Correspondence
• Photometric/geometric deformations and a lack of supervision

Problem Formulation
• Given a pair of images $I^s$ and $I^t$, infer a field of affine transformations, one per pixel,

$$\mathbf{T}_i = [\mathbf{A}_i, \mathbf{f}_i]$$

that maps pixel $i$ to $i' = i + \mathbf{f}_i$
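To make the parametrization concrete, here is a minimal PyTorch sketch of how such a per-pixel affine transformation could act: a neighbor of pixel $i$ at local offset $u$ is mapped to $i + \mathbf{A}_i u + \mathbf{f}_i$, so the center offset $u = 0$ lands at $i' = i + \mathbf{f}_i$. The function name and the local-grid sampling are illustrative assumptions, not the paper's implementation.

```python
import torch

# Hypothetical sketch: apply T_i = [A_i, f_i] to local offsets u around pixel i,
# mapping a neighbor at offset u to i + A_i @ u + f_i (so u = 0 goes to i + f_i).
def warp_local_offsets(i, A_i, f_i, offsets):
    # i: (2,) pixel coordinate; A_i: (2, 2) affine part; f_i: (2,) flow/translation
    # offsets: (N, 2) local sampling offsets, e.g., a regular grid around i
    return i + offsets @ A_i.T + f_i

i = torch.tensor([64.0, 64.0])
A_i = torch.eye(2)                       # identity affine part
f_i = torch.tensor([3.0, -2.0])          # translational flow f_i
ys, xs = torch.meshgrid(torch.arange(-1.0, 2.0),
                        torch.arange(-1.0, 2.0), indexing="ij")
grid = torch.stack([ys, xs], dim=-1).reshape(-1, 2)   # 3x3 neighborhood offsets
print(warp_local_offsets(i, A_i, f_i, grid))          # mapped target coordinates
```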

Intuition of RTNs
• Estimate the transformation field recursively: at each step, the target features are re-sampled under the current transformation estimate and matched against the source features to refine that estimate

Network Configuration

Feature Extraction Networks
• To extract features $D^s$ and $D^t$, the input images $I^s$ and $I^t$ are passed through convolutional networks with shared parameters $\mathbf{W}_F$ such that

$$D = F(I\,|\,\mathbf{W}_F),$$

using CAT-FCSS, VGGNet (conv4-4), or ResNet (conv4-23)
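As a hedged illustration of this step, the sketch below truncates an ImageNet-pretrained VGG-19 from torchvision after relu4-4 to stand in for the conv4-4 features; the CAT-FCSS and ResNet (conv4-23) backbones would be truncated analogously.

```python
import torch
import torchvision

# Sketch of the shared extractor D = F(I | W_F), assuming an ImageNet-pretrained
# VGG-19 truncated after relu4-4 (index 26 of torchvision's vgg19.features)
# stands in for the conv4-4 features mentioned above.
backbone = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:27].eval()

def extract_features(image):
    """image: (B, 3, H, W) normalized RGB -> (B, 512, H/8, W/8) dense features."""
    with torch.no_grad():
        return backbone(image)

I_s, I_t = torch.randn(1, 3, 240, 240), torch.randn(1, 3, 240, 240)
D_s, D_t = extract_features(I_s), extract_features(I_t)  # same W_F for both images
```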

Recurrent Geometric Matching Networks
• Constrained correlation volume

$$C(D^s_i, D^t(\mathbf{T}_j)) = \langle D^s_i, D^t(\mathbf{T}_j)\rangle \,/\, \big\| \langle D^s_i, D^t(\mathbf{T}_j)\rangle \big\|_2,$$

where the $L_2$ norm is taken over the candidates $j \in M_i$
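A minimal sketch of this computation for a single pixel $i$, assuming the target features have already been sampled at the warped candidate positions $\mathbf{T}_j$, $j \in M_i$, and that the $L_2$ norm in the denominator runs over the candidate dimension:

```python
import torch

def correlation_volume(d_s_i, d_t_warped):
    # d_s_i: (C,) source feature at pixel i
    # d_t_warped: (N, C) target features sampled at the candidates T_j, j in M_i
    raw = d_t_warped @ d_s_i                      # <D^s_i, D^t(T_j)> per candidate j
    return raw / raw.norm(p=2).clamp_min(1e-8)    # L2-normalize across candidates

d_s_i = torch.randn(512)
d_t_warped = torch.randn(81, 512)                 # e.g., a 9x9 candidate window M_i
C_i = correlation_volume(d_s_i, d_t_warped)       # (81,) constrained correlations
```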

• Recurrent geometry estimation

$$\mathbf{T}^{k}_i - \mathbf{T}^{k-1}_i = F\big(C(D^s_i, D^t(\mathbf{T}^{k-1}_i))\,|\,\mathbf{W}_G\big)$$
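The sketch below mirrors this residual update, $\mathbf{T}^k = \mathbf{T}^{k-1} + F(\cdot\,|\,\mathbf{W}_G)$, for a dense 6-DoF affine field. `GeometryNet` and `dummy_corr` are hypothetical stand-ins: the actual matching-network architecture and the warping/correlation step follow the paper and are not reproduced here.

```python
import torch
import torch.nn as nn

class GeometryNet(nn.Module):
    """Hypothetical stand-in for the matching network F(. | W_G)."""
    def __init__(self, num_candidates=81, dof=6):   # 6-DoF affine field [A_i | f_i]
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(num_candidates, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, dof, 3, padding=1))

    def forward(self, corr):                        # corr: (B, N, H, W)
        return self.head(corr)                      # per-pixel residual update of T

def recurrent_matching(D_s, D_t, correlate, net, iters=4):
    B, _, H, W = D_s.shape
    T = torch.zeros(B, 6, H, W)          # identity field: zero residual to start
    for _ in range(iters):               # the talk reports convergence in 3-5 steps
        corr = correlate(D_s, D_t, T)    # C(D^s_i, D^t(T_i^{k-1})), (B, 81, H, W)
        T = T + net(corr)                # T^k = T^{k-1} + F(. | W_G)
    return T

def dummy_corr(D_s, D_t, T):
    # toy stand-in for the warping + correlation step, so the sketch runs end to end
    return torch.einsum("bchw,bchw->bhw", D_s, D_t).unsqueeze(1).expand(-1, 81, -1, -1)

T = recurrent_matching(torch.randn(1, 512, 30, 30),
                       torch.randn(1, 512, 30, 30), dummy_corr, GeometryNet())
```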

Weakly-supervised Learning
• Intuition: the matching score between the source feature $D^s$ at each pixel $i$ and the target feature $D^t(\mathbf{T}_i)$ should be maximized while keeping the scores of other transformation candidates low

$$L(D^s_i, D^t(\mathbf{T})) = -\sum_{j \in M_i} p^{*}_j \log\big(p(D^s_i, D^t(\mathbf{T}_j))\big)$$

where the function $p(D^s_i, D^t(\mathbf{T}_j))$ is a softmax probability

$$p(D^s_i, D^t(\mathbf{T}_j)) = \frac{\exp\big(C(D^s_i, D^t(\mathbf{T}_j))\big)}{\sum_{l \in M_i} \exp\big(C(D^s_i, D^t(\mathbf{T}_l))\big)}$$

and $p^{*}_j$ denotes a class label defined as 1 if $j = i$ and 0 otherwise

Ablation Study
• RTNs converge in 3-5 iterations
• Accuracy improves as the search window grows up to 9 × 9, but larger window sizes reduce accuracy

Results on TSS Benchmark

Results on PF-WILLOW/PF-PASCAL Benchmarks

Comparison with Existing Approaches
• Methods for geometric invariance in the regularization step: geometric matching methods [Rocco'17,'18]; inference uses the source/target image pair; $\mathbf{T}_i$ is learned w/ $\mathbf{T}^{*}_i$ obtained through self- or meta-supervision
• Methods for geometric invariance in the feature extraction step: STN-based methods [Choy'16, Kim'18]; inference is based on only the source or target image; $\mathbf{A}_i$ is learned wo/ $\mathbf{A}^{*}_i$, while $\mathbf{f}_i$ is learned w/ $\mathbf{f}^{*}_i$
• Recurrent Transformer Networks (RTNs): weave the advantages of both existing STN-based methods and geometric matching methods!

[Qualitative comparisons: Source, Target, DCTM or CAT-FCSS, SCNet, Gmat. w/Inl., RTNs]

• ResNet features exhibit the best performance!

• Fine-tuned features show improved accuracy!

• Learning the feature extraction networks and geometric matching networks jointly can boost accuracy!

• RTNs show state-of-the-art performance!

Project webpage: http://diml.yonsei.ac.kr/~srkim/RTNs