Recurrent Transformer Networks for Semantic Correspondence
Seungryong Kim, Stephen Lin, Sangryul Jeon, Dongbo Min, and Kwanghoon Sohn
Neural Information Processing Systems (NeurIPS) 2018
Semantic Correspondence
• Establishing dense correspondences between semantically similar images (different instances within the same object class)
Challenges in Semantic Correspondence
• Photometric/geometric deformations, lack of supervision
Problem Formulation
• Given a pair of images I_s and I_t, infer a field of affine transformations T_i = (A_i, f_i) for each pixel i that maps pixel i to i' = i + f_i
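The per-pixel mapping can be sketched as follows; the helper name `warp_pixel` is an assumption for illustration, not code from the paper. With an affine pair (A_i, f_i), setting A_i to the identity reduces the mapping to a pure translation (flow):

```python
import numpy as np

def warp_pixel(i, A, f):
    """Map a source pixel coordinate i (2-vector) with affine (A, f)."""
    return A @ i + f

i = np.array([10.0, 20.0])
A = np.eye(2)              # identity: no rotation/scale/shear
f = np.array([3.0, -1.0])  # translation component f_i
assert np.allclose(warp_pixel(i, A, f), [13.0, 19.0])
```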
Intuition of RTNs
Network Configuration
Feature Extraction Networks
• To extract features D^s and D^t, input images I_s and I_t are passed through convolutional networks with shared parameters w_F such that D = F(I | w_F)
• Backbones: CAT-FCSS, VGGNet (conv4-4), ResNet (conv4-23)
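A minimal sketch of the shared-weight extraction pattern, with a single hand-rolled 3×3 convolution standing in for the real backbones (CAT-FCSS/VGGNet/ResNet); all function and variable names here are illustrative assumptions:

```python
import numpy as np

def extract_features(image, w):
    """Toy stand-in for F(I | w_F): one 3x3 convolution, valid padding."""
    H, W = image.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * w)
    return out

rng = np.random.default_rng(0)
I_s, I_t = rng.random((8, 8)), rng.random((8, 8))
w_F = rng.random((3, 3))           # the same parameters w_F are
D_s = extract_features(I_s, w_F)   # applied to both images:
D_t = extract_features(I_t, w_F)   # D^s = F(I_s | w_F), D^t = F(I_t | w_F)
assert D_s.shape == D_t.shape == (6, 6)
```

The key point mirrored here is weight sharing: both images pass through the same F with the same w_F.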
Recurrent Geometric Matching Networks
• Constrained correlation volume
  C(D_i^s, D_t(T_i)) = ⟨D_i^s, D_t(T_i)⟩ / ‖⟨D_i^s, D_t(T_j)⟩‖_2
  (the inner product is normalized by its L2 norm over the candidate transformations T_j)
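The normalization can be sketched as follows, assuming (as a simplification) that the candidate target features are stacked into a matrix; `correlation_volume` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def correlation_volume(d_s_i, D_t_cands):
    """Correlate one source feature against all candidate target features."""
    # raw correlations <D_i^s, D_t(T_j)> for each candidate j
    raw = D_t_cands @ d_s_i
    # normalize by the L2 norm over all candidates
    return raw / np.linalg.norm(raw)

d_s_i = np.array([1.0, 0.0, 0.0])
cands = np.array([[2.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
C = correlation_volume(d_s_i, cands)
# the normalized correlation vector has unit L2 norm
assert np.isclose(np.linalg.norm(C), 1.0)
```

Normalizing over the candidates keeps the scores comparable across pixels regardless of feature magnitude.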
• Recurrent geometry estimation
  T_i^k − T_i^{k−1} = F(C(D_i^s, D_t(T_i^{k−1})) | w_G)
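A 1-D toy version of the recurrent update pattern, where a hand-written residual stands in for the learned network F(· | w_G); this is a sketch of the iteration structure only (names and the residual rule are assumptions), not the actual model:

```python
def recurrent_estimate(T0, target, steps=5, rate=0.5):
    """Iterate T^k = T^{k-1} + residual, a stand-in for F(C(...) | w_G)."""
    T = T0
    history = [T]
    for _ in range(steps):
        residual = rate * (target - T)   # pretend network output Delta T^k
        T = T + residual                 # T^k = T^{k-1} + Delta T^k
        history.append(T)
    return history

hist = recurrent_estimate(T0=0.0, target=4.0)
# each iteration refines the estimate toward the target geometry
assert abs(hist[-1] - 4.0) < abs(hist[0] - 4.0)
```

The point mirrored here is that each iteration re-evaluates the correlation under the current transformation and emits a correction, which matches the reported convergence in a few iterations.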
Weakly-supervised Learning
• Intuition: the matching score between the source feature D^s at each pixel i and the target feature D_t(T_i) should be maximized while keeping the scores of other transformation candidates low

  L(D_i^s, D_t) = −Σ_{j∈N_i} c_j* log(p(D_i^s, D_t(T_j)))

where p(D_i^s, D_t(T_j)) is a softmax probability

  p(D_i^s, D_t(T_j)) = exp(C(D_i^s, D_t(T_j))) / Σ_{l∈N_i} exp(C(D_i^s, D_t(T_l)))

and c_j* denotes a class label defined as 1 if j = i, 0 otherwise
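Since c_j* is one-hot at j = i, the loss reduces to cross-entropy on the softmax over candidate correlations. A minimal numpy sketch (function names are assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def weak_loss(C_row, center):
    """-sum_j c*_j log p_j with c*_j one-hot at j = center -> -log p_center."""
    p = softmax(C_row)
    return -np.log(p[center])

# correlations over candidate transforms in the window N_i;
# index 1 plays the role of the center candidate T_i
C_row = np.array([0.1, 2.0, -0.5])
loss = weak_loss(C_row, center=1)
assert loss > 0
# a sharper correlation peak at the center lowers the loss
assert weak_loss(np.array([0.1, 5.0, -0.5]), 1) < loss
```

This captures the stated intuition: the score at the estimated transformation is pushed up while competing candidates in the window are pushed down, with no ground-truth correspondences required.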
Ablation Study
• RTNs converge in 3–5 iterations
• Accuracy improves up to a 9 × 9 window, but larger window sizes reduce accuracy
Results on TSS Benchmark
Results on PF-WILLOW/PF-PASCAL Benchmarks

Comparison of approaches to geometric invariance:
• Geometric matching methods [Rocco'17, '18]: handle geometric invariance in the regularization step; inference uses both source and target images; w_G is learned with w_F fixed, using self- or meta-supervision
• STN-based methods [Choy'16, Kim'18]: handle geometric invariance in the feature extraction step; inference is based on only the source or target image; w_F is learned without w_G, or w_F is learned with w_G fixed
• Recurrent Transformer Networks (RTNs): weave the advantages of both existing STN-based methods and geometric matching methods!
[Qualitative comparison: Source, Target, DCTM, SCNet, Gmat. w/Inl., RTNs]
[Qualitative comparison: Source, Target, CAT-FCSS, SCNet, Gmat. w/Inl., RTNs]
• ResNet features exhibit the best performance!
• Fine-tuned features show improved accuracy!
• Learning the feature extraction and geometric matching networks jointly boosts accuracy!
• RTNs achieve state-of-the-art performance!
Project webpage: http://diml.yonsei.ac.kr/~srkim/RTNs