Recurrent Transformer Networks for Semantic Correspondence
Seungryong Kim, Stephen Lin, Sangryul Jeon, Dongbo Min, and Kwanghoon Sohn
Neural Information Processing Systems (NeurIPS) 2018
Semantic Correspondence
• Establishing dense correspondences between semantically similar images (different instances within the same object class)
Introduction
Background
Recurrent Transformer Networks
Experimental Results and Discussion
Challenges in Semantic Correspondence
• Photometric/geometric deformations, lack of supervision
Problem Formulation
• Given a pair of images I_s and I_t, infer a field of affine transformations T_i = (A_i, f_i) for each pixel i that maps pixel i to i′ = i + f_i
Intuition of RTNs
Network Configuration
Feature Extraction Networks
• To extract the features D_s and D_t, the input images I_s and I_t are passed through convolutional networks with parameters W_F such that D_s = F(I_s | W_F), using CAT-FCSS, VGGNet (conv4-4), or ResNet (conv4-23) as the backbone
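As a toy illustration of D = F(I | W_F) (not the actual CAT-FCSS/VGG/ResNet backbones from the paper), a single convolution-plus-ReLU layer in NumPy can stand in for the feature extractor; shapes and weights here are hypothetical:

```python
import numpy as np

def conv2d_relu(x, w):
    """Naive 'valid' convolution followed by ReLU.
    x: (H, W, Cin) input image; w: (k, k, Cin, Cout) filter bank."""
    k = w.shape[0]
    h, wd = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(h):
        for j in range(wd):
            patch = x[i:i + k, j:j + k, :]
            # contract patch against every output filter at once
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))     # toy "image" I
w_f = rng.standard_normal((3, 3, 3, 4))  # hypothetical parameters W_F
feat = conv2d_relu(img, w_f)             # toy feature map D
print(feat.shape)  # (6, 6, 4)
```

A real backbone stacks many such layers; here one layer suffices to show the mapping from image to dense feature map.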
Recurrent Geometric Matching Networks
• Constrained correlation volume
C(D_i^s, D^t(T_i)) = ⟨D_i^s, D^t(T_i)⟩ / √( Σ_{i′∈N_i} ⟨D_i^s, D^t(T_{i′})⟩² )
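A minimal NumPy sketch of this normalized correlation, assuming a single source feature vector and a window of K candidate target features (shapes here are hypothetical):

```python
import numpy as np

def constrained_correlation(d_s, d_t_candidates):
    """Correlation of a source feature d_s (C,) against target features
    sampled at K candidate transformations, d_t_candidates (K, C),
    normalized over the candidate set as  c_k / sqrt(sum_k c_k^2)."""
    raw = d_t_candidates @ d_s               # (K,) inner products
    denom = np.sqrt(np.sum(raw ** 2)) + 1e-8  # normalize over candidates
    return raw / denom

rng = np.random.default_rng(0)
d_s = rng.standard_normal(16)
cands = rng.standard_normal((9, 16))  # e.g. a 3x3 window of candidates
corr = constrained_correlation(d_s, cands)
print(corr.shape)  # (9,)
```

The normalization keeps the correlation scores comparable across pixels: the squared scores over the candidate window sum to (approximately) one.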
• Recurrent geometry estimation
T_i^k − T_i^{k−1} = F(C(D_i^s, D^t(T_i^{k−1})) | W_G)
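The recurrent update can be sketched as an iterative residual refinement. In the paper the residual comes from a learned matching network with parameters W_G; here a hypothetical stand-in function plays that role:

```python
import numpy as np

def recurrent_refine(t_init, residual_fn, num_iters=5):
    """Iteratively refine a per-pixel transformation field:
    T^k = T^{k-1} + F(...), where residual_fn stands in for the
    learned residual predictor F(C(D_s, D_t(T^{k-1})) | W_G)."""
    t = t_init.copy()
    history = [t.copy()]
    for _ in range(num_iters):
        t = t + residual_fn(t)  # add the predicted residual
        history.append(t.copy())
    return t, history

# toy stand-in: residuals that shrink toward a fixed point t* = 1
target = np.ones((4, 4, 2))   # hypothetical 4x4 field of 2-d offsets
t0 = np.zeros_like(target)
t_final, hist = recurrent_refine(t0, lambda t: 0.5 * (target - t), num_iters=5)
print(np.max(np.abs(t_final - target)))  # 0.5**5 = 0.03125
```

With this toy residual the error halves per step, mirroring how a few recurrent iterations suffice in practice (the ablation below reports convergence in 3-5 iterations).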
Weakly-supervised Learning
• Intuition: the matching score between the source feature D^s at each pixel i and the target feature D^t(T_i) should be maximized, while keeping the scores of other transformation candidates low
L(D_i^s, D^t) = − Σ_{j∈N_i} p_j* log(p(D_i^s, D^t(T_j)))
where p(D_i^s, D^t(T_j)) is a softmax probability
p(D_i^s, D^t(T_j)) = exp(C(D_i^s, D^t(T_j))) / Σ_{l∈N_i} exp(C(D_i^s, D^t(T_l)))
and p_j* denotes a class label defined as 1 if j = i, 0 otherwise
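Since the label is one-hot at the identity candidate j = i, the loss reduces to a cross-entropy over the softmaxed correlation scores. A NumPy sketch (toy scores, hypothetical candidate indexing):

```python
import numpy as np

def matching_loss(corr_scores, true_idx):
    """Cross-entropy over transformation candidates: softmax of the
    correlation scores with a one-hot label at the correct candidate."""
    c = corr_scores - np.max(corr_scores)  # numerically stable softmax
    p = np.exp(c) / np.sum(np.exp(c))
    return -np.log(p[true_idx])

scores = np.array([0.1, 2.0, 0.3])  # suppose candidate j=1 is the identity T_i
print(matching_loss(scores, 1) < matching_loss(scores, 0))  # True
```

Minimizing this loss pushes the correct candidate's score up and all other candidates' scores down, which is exactly the stated intuition and needs only image pairs (no ground-truth correspondences) as supervision.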
Ablation Study
• RTNs converge in 3-5 iterations
• Accuracy improves up to a 9 × 9 candidate window, but larger window sizes reduce accuracy
Results on TSS Benchmark
Results on PF-WILLOW/PF-PASCAL Benchmarks
• Methods with geometric invariance in the regularization step — geometric matching methods [Rocco'17,'18]: inference uses both the source and target images; W_G is learned w/o W_F, using self- or meta-supervision
• Methods with geometric invariance in the feature extraction step — STN-based methods [Choy'16, Kim'18]: inference is based on only the source or target image; the transformation parameters are learned w/ W_F
• Recurrent Transformer Networks (RTNs) weave the advantages of both existing STN-based methods and geometric matching methods: W_G is learned jointly w/ W_F
[Qualitative comparisons: Source, Target, DCTM, SCNet, CAT-FCSS, Gmat. w/Inl., and RTNs]
• ResNet features exhibit the best performance
• Fine-tuned features show improved accuracy
• Learning the feature extraction networks and geometric matching networks jointly further boosts accuracy
• RTNs achieve state-of-the-art performance
Project webpage: http://diml.yonsei.ac.kr/~srkim/RTNs