G2DeNet: Global Gaussian Distribution Embedding Network ... · Place Recognition (Pitts30k) NetVLAD...

G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition. Qilong Wang (1), Peihua Li (1), Lei Zhang (2). (1) Dalian University of Technology, (2) Hong Kong Polytechnic University


Page 1:

G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition

Qilong Wang (1), Peihua Li (1), Lei Zhang (2)

(1) Dalian University of Technology, (2) Hong Kong Polytechnic University

Page 2:

Tendency of CNN architectures

LeNet-5 → AlexNet-8 → VGG-VD-19 / GoogLeNet-22 → ResNet-152 / Inception-V4

CNN architectures tend to be deeper and wider, and more accurate.

Only convolution, non-linear (ReLU), and pooling layers.

Page 3:

Trainable structural layers

Images → Conv. layers → …… → Loss

Bilinear pooling (COV) [B-CNN, ICCV’15]
O2P layer (LogCOV) [DeepO2P, ICCV’15]
Mean Map Embedding [DMMs, arXiv’15]
VLAD Coding [NetVLAD, CVPR’16]

Modeling outputs of the last convolutional layer as trainable structural layers.

Page 4:

Trainable structural layers

Fine-grained Visual Classification

B-CNN [D,D] (84.1, 84.1, 91.3) vs. VGG-VD16 (76.4, 74.1, 79.8): ~8% improvement

T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.

Page 5:

Trainable structural layers

Place Recognition (Pitts30k)

NetVLAD (+AlexNet) (85.6) vs. AlexNet (69.8): ~15% improvement

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

Page 6:

Trainable structural layers

Scene Categorization (Place205)

DMMs + GoogLeNet (49.00) vs. GoogLeNet (47.5)

J. B. Oliva, D. J. Sutherland, B. Póczos, and J. G. Schneider. Deep mean maps. arXiv, abs/1511.04150, 2015.

Page 7:

Trainable structural layers

Integration of trainable structural layers into deep CNNs achieves significant improvements in many challenging vision tasks.

Page 8:

Parametric probability distribution modeling

Gaussian Distribution
Gaussian Mixture Model
Gaussian-Laplacian Model
……

① Modeling abundant statistics of features.

② Producing fixed-size representations regardless of varying feature sizes.

Promising modeling performance (> coding methods)

Nakayama et al. CVPR’10
Serra et al. CVIU’15
Wang et al. CVPR’16

High computational efficiency

Closed-form solution of parameter estimation
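The two properties listed above (a closed-form estimate and a fixed-size output regardless of the number of features) can be illustrated with a minimal NumPy sketch. The function name `gaussian_params` and the feature-map sizes are hypothetical, chosen only for illustration; the maximum-likelihood estimate with a 1/N normalization is assumed.

```python
import numpy as np

def gaussian_params(features):
    """Closed-form ML estimates of mean and covariance for an (N, D) feature array."""
    n = features.shape[0]
    mu = features.mean(axis=0)
    centered = features - mu
    sigma = centered.T @ centered / n  # ML estimate, normalized by N (assumption)
    return mu, sigma

rng = np.random.default_rng(0)
small = rng.normal(size=(49, 8))    # e.g. a 7x7 feature map with 8 channels (illustrative)
large = rng.normal(size=(196, 8))   # e.g. a 14x14 feature map with 8 channels (illustrative)

for feats in (small, large):
    mu, sigma = gaussian_params(feats)
    # The representation size depends only on D, not on the number of features N.
    print(feats.shape, "->", mu.shape, sigma.shape)
```

Both inputs yield a D-dimensional mean and a D×D covariance, which is what makes the Gaussian usable as a fixed-size image representation.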

Page 9:

Embedding of global Gaussian in CNN

Images → Conv. layers → Global Gaussian → …… → Loss

Page 10:

Global Gaussian distribution embedding network (G2DeNet)

A trainable global Gaussian embedding layer for modeling convolutional features.

The first attempt to plug a parametric probability distribution into deep CNNs.

Images → Conv. layers → Global Gaussian Embedding Layer → …… → Loss, with loss $f(Z)$

Global Gaussian Embedding Layer: X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z

Matrix Partition Sub-layer:

$$f_{MPL}(X) = \frac{1}{N} A X^T X A^T + \frac{2}{N}\left(A X^T \mathbf{1} b^T\right)_{sym} + B$$

Square-rooted SPD Matrix Sub-layer:

$$f_{ESRL}(Y) = Y^{\frac{1}{2}}$$

Global Gaussian:

$$\mathcal{N}(\mu, \Sigma) \mapsto \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}^{\frac{1}{2}}$$

Backward pass: $\frac{\partial f(Z)}{\partial Y}$, then $\frac{\partial f(Z)}{\partial X}$

Page 11:

Challenges

Q: How to construct our trainable global Gaussian embedding layer?

A: The key is to give the explicit forms of Gaussian distributions.

Forward propagation: Riemannian geometry structure; algebraic structure

Backward propagation: differentiable

Page 12:

Gaussian embedding

[TPAMI’17] shows that the space of Gaussians is equipped with a Lie group structure.

The space of Gaussians is a Riemannian manifold with a special geometric structure.

Gaussian → (Cholesky decomp.) → positive upper triangular matrix → (left polar decomp.) → SPD matrix

Cholesky decomp.: $\Sigma^{-1} = L L^T$

$$\mathcal{N}(\mu, \Sigma) \mapsto A_{\mu, L^{-T}} = \begin{bmatrix} L^{-T} & \mu \\ 0^T & 1 \end{bmatrix}$$

Left polar decomp.: $A_{\mu, L^{-T}} = PO$

$$P = \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}^{\frac{1}{2}}$$

[TPAMI’17] Peihua Li, Qilong Wang et al. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 2017.
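The chain Gaussian → positive upper triangular matrix → SPD matrix can be checked numerically. The sketch below is a minimal NumPy illustration under the assumption that the Cholesky factor satisfies Σ⁻¹ = L Lᵀ and that the triangular matrix is A = [[L⁻ᵀ, μ], [0ᵀ, 1]]; the helper names (`spd_sqrt`, `embed_via_polar`, `embed_direct`) are hypothetical.

```python
import numpy as np

def spd_sqrt(M):
    """Square root of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def embed_via_polar(mu, sigma):
    """Build A from the Cholesky factor of inv(sigma), then take A = P O (left polar)."""
    d = mu.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(sigma))  # inv(sigma) = L L^T, L lower triangular
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = np.linalg.inv(L).T                # L^{-T}, upper triangular
    A[:d, d] = mu
    A[d, d] = 1.0
    P = spd_sqrt(A @ A.T)                         # SPD factor of the left polar decomposition
    O = np.linalg.solve(P, A)                     # orthogonal factor, O = P^{-1} A
    return A, P, O

def embed_direct(mu, sigma):
    """Square root of the block matrix [[sigma + mu mu^T, mu], [mu^T, 1]]."""
    d = mu.shape[0]
    M = np.empty((d + 1, d + 1))
    M[:d, :d] = sigma + np.outer(mu, mu)
    M[:d, d] = mu
    M[d, :d] = mu
    M[d, d] = 1.0
    return spd_sqrt(M)

rng = np.random.default_rng(0)
R = rng.normal(size=(3, 3))
sigma = R @ R.T + 0.5 * np.eye(3)   # an arbitrary SPD covariance for the demo
mu = rng.normal(size=3)

A, P, O = embed_via_polar(mu, sigma)
print("A is upper triangular:", np.allclose(A, np.triu(A)))
print("O is orthogonal:", np.allclose(O @ O.T, np.eye(4)))
print("P matches the block-matrix square root:", np.allclose(P, embed_direct(mu, sigma)))
```

The agreement follows from A Aᵀ = [[L⁻ᵀL⁻¹ + μμᵀ, μ], [μᵀ, 1]] with L⁻ᵀL⁻¹ = Σ, so P can be computed directly from μ and Σ without forming A.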

Page 13:

Global Gaussian embedding layer

Gaussian Embedding:

$$\mathcal{N}(\mu, \Sigma) \mapsto P = \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}^{\frac{1}{2}}$$

1. Matrix Partition Sub-layer:

$$Y = \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix} = f_{MPL}(X) = \frac{1}{N} A X^T X A^T + \frac{2}{N}\left(A X^T \mathbf{1} b^T\right)_{sym} + B$$

2. Square-rooted SPD Matrix Sub-layer:

$$Z = f_{ESRL}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^T$$

Y is a function of the convolutional features X. The square root of Y is computed via SVD.
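The forward pass through the two sub-layers can be sketched in a few lines of NumPy. The slide does not spell out A, b, and B, so the definitions below (A = [I; 0ᵀ], b = the last standard basis vector, B = b bᵀ) are assumptions chosen so that Y equals the block matrix [[Σ + μμᵀ, μ], [μᵀ, 1]]. The function names are hypothetical, and a symmetric eigendecomposition is used in place of SVD, which yields the same square root for an SPD matrix.

```python
import numpy as np

def mpl_forward(X):
    """Matrix partition sub-layer: Y = (1/N) A X^T X A^T + (2/N) (A X^T 1 b^T)_sym + B."""
    n, d = X.shape
    A = np.vstack([np.eye(d), np.zeros((1, d))])  # (D+1, D), assumed to be [I; 0^T]
    b = np.zeros((d + 1, 1))
    b[d, 0] = 1.0                                 # assumed: last standard basis vector
    B = b @ b.T                                   # assumed: single 1 in the bottom-right corner
    cross = A @ X.T @ np.ones((n, 1)) @ b.T       # A X^T 1 b^T
    # (2/N) * (cross)_sym with (M)_sym = (M + M^T) / 2
    return A @ X.T @ X @ A.T / n + (cross + cross.T) / n + B

def esrl_forward(Y):
    """Square-rooted SPD matrix sub-layer: Z = U Lambda^{1/2} U^T."""
    lam, U = np.linalg.eigh(Y)
    return (U * np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # N = 50 convolutional features of dimension D = 4 (illustrative)
Y = mpl_forward(X)
Z = esrl_forward(Y)
print(Y.shape, Z.shape)
```

The output Z is a (D+1)×(D+1) SPD matrix whatever the value of N, and squaring it recovers Y.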

Page 14:

BP for global Gaussian embedding layer

Images → Conv. layers → Global Gaussian Embedding Layer (X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z) → …… → Loss

The goal is to compute $\frac{\partial f(Z)}{\partial X}$. The first step is to compute $\frac{\partial f(Z)}{\partial Y}$.

Page 15:

BP for square-rooted SPD matrix sub-layer

Compute $\frac{\partial f(Z)}{\partial Y}$, with $Y = U \Lambda U^T$:

$$\frac{\partial f}{\partial Y} : dY = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$$

$$dU = 2U\left(K^T \circ \left(U^T dY\, U\right)_{sym}\right), \quad d\Lambda = \left(U^T dY\, U\right)_{diag}.$$

$$\frac{\partial f}{\partial Y} = U\left\{ 2\left(K^T \circ \left(U^T \frac{\partial f}{\partial U}\right)\right)_{sym} + \left(\frac{\partial f}{\partial \Lambda}\right)_{diag} \right\} U^T, \quad K_{ij} = \frac{1}{\sigma_i^2 - \sigma_j^2} \ (i \neq j)$$

[DeepO2P, ICCV’15]: Catalin Ionescu et al. Matrix Backpropagation for Deep Networks with Structured Layers. ICCV, 2015.

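Page 16:

BP for square-rooted SPD matrix sub-layer

Compute $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial \Lambda}$, with $Z = f_{ESRL}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^T$:

$$\frac{\partial f}{\partial Z} : dZ = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$$

$$dZ = 2\left(dU\, \Lambda^{\frac{1}{2}} U^T\right)_{sym} + \frac{1}{2} U \Lambda^{-\frac{1}{2}} d\Lambda\, U^T$$

$$\frac{\partial f}{\partial U} = 2\left(\frac{\partial f}{\partial Z}\right)_{sym} U \Lambda^{\frac{1}{2}}, \quad \frac{\partial f}{\partial \Lambda} = \frac{1}{2} \Lambda^{-\frac{1}{2}} U^T \frac{\partial f}{\partial Z} U.$$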
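The backward pass through the square-root sub-layer can be sanity-checked against finite differences. The sketch below is a minimal NumPy illustration, not the authors' implementation. It uses the expressions for ∂f/∂U and ∂f/∂Λ from this slide, and combines them with the standard symmetric-eigendecomposition form K_ij = 1/(λ_i − λ_j) (i ≠ j, zero on the diagonal) with no extra factor of 2, which is an assumption and may differ in notation or scaling from the K used on the slides. The function names and the linear test loss f(Z) = Σ C∘Z are hypothetical.

```python
import numpy as np

def sqrt_spd(Y):
    """Forward pass: Z = U Lambda^{1/2} U^T."""
    lam, U = np.linalg.eigh(Y)
    return (U * np.sqrt(lam)) @ U.T

def sqrt_spd_backward(Y, grad_Z):
    """Gradient of f with respect to Y given grad_Z = df/dZ, for Z = Y^{1/2}."""
    lam, U = np.linalg.eigh(Y)
    root = np.sqrt(lam)
    G = (grad_Z + grad_Z.T) / 2                          # (df/dZ)_sym
    grad_U = 2.0 * (G @ U) * root                        # 2 (df/dZ)_sym U Lambda^{1/2}
    grad_lam = 0.5 * np.diag(U.T @ grad_Z @ U) / root    # diagonal of (1/2) Lambda^{-1/2} U^T (df/dZ) U
    gap = lam[:, None] - lam[None, :]                    # gap[i, j] = lam_i - lam_j
    np.fill_diagonal(gap, 1.0)                           # placeholder to avoid division by zero
    K = 1.0 / gap                                        # assumed form: K_ij = 1 / (lam_i - lam_j)
    np.fill_diagonal(K, 0.0)
    inner = K.T * (U.T @ grad_U)
    inner = (inner + inner.T) / 2                        # symmetrize
    return U @ (inner + np.diag(grad_lam)) @ U.T

rng = np.random.default_rng(0)
R = rng.normal(size=(5, 5))
Y = R @ R.T + np.eye(5)            # an arbitrary SPD input for the demo
C = rng.normal(size=(5, 5))        # f(Z) = sum(C * Z), so df/dZ = C
grad_Y = sqrt_spd_backward(Y, C)

# Directional finite-difference check along a random symmetric direction D.
D = rng.normal(size=(5, 5))
D = (D + D.T) / 2
eps = 1e-6
numeric = (np.sum(C * sqrt_spd(Y + eps * D)) - np.sum(C * sqrt_spd(Y - eps * D))) / (2 * eps)
analytic = np.sum(grad_Y * D)
print("finite-difference gap:", abs(numeric - analytic))
```

Note that the 1/(λ_i − λ_j) terms become unstable when two eigenvalues of Y nearly coincide, so a practical implementation would need some safeguard for that case.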

Page 17:

BP for global Gaussian embedding layer

The goal is to compute $\frac{\partial f}{\partial X}$ given $\frac{\partial f}{\partial Y}$, with

$$Y = f_{MPL}(X) = \frac{1}{N} A X^T X A^T + \frac{2}{N}\left(A X^T \mathbf{1} b^T\right)_{sym} + B$$

$$\frac{\partial f}{\partial X} : dX = \frac{\partial f}{\partial Y} : dY$$

$$\frac{\partial f}{\partial X} = \frac{2}{N}\left(X A^T + \mathbf{1} b^T\right)\left(\frac{\partial f}{\partial Y}\right)_{sym} A$$
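The gradient of the matrix partition sub-layer can likewise be checked numerically. The sketch below is a minimal NumPy illustration under the same assumed definitions as the forward formula requires (A = [I; 0ᵀ], b = the last standard basis vector, B = b bᵀ); the function names and the linear test loss are hypothetical.

```python
import numpy as np

def partition_mats(d):
    """Assumed constants: A = [I; 0^T] of shape (D+1, D), b = last basis vector (D+1, 1)."""
    A = np.vstack([np.eye(d), np.zeros((1, d))])
    b = np.zeros((d + 1, 1))
    b[d, 0] = 1.0
    return A, b

def mpl_forward(X):
    """Y = (1/N) A X^T X A^T + (2/N) (A X^T 1 b^T)_sym + B, with B = b b^T."""
    n, d = X.shape
    A, b = partition_mats(d)
    cross = A @ X.T @ np.ones((n, 1)) @ b.T
    return A @ X.T @ X @ A.T / n + (cross + cross.T) / n + b @ b.T

def mpl_backward(X, grad_Y):
    """df/dX = (2/N) (X A^T + 1 b^T) (df/dY)_sym A."""
    n, d = X.shape
    A, b = partition_mats(d)
    G = (grad_Y + grad_Y.T) / 2
    return (2.0 / n) * (X @ A.T + np.ones((n, 1)) @ b.T) @ G @ A

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # N = 20 features of dimension D = 3 (illustrative)
C = rng.normal(size=(4, 4))        # f(Y) = sum(C * Y), so df/dY = C
grad_X = mpl_backward(X, C)

# Directional finite-difference check along a random direction D.
D = rng.normal(size=X.shape)
eps = 1e-6
numeric = (np.sum(C * mpl_forward(X + eps * D)) - np.sum(C * mpl_forward(X - eps * D))) / (2 * eps)
analytic = np.sum(grad_X * D)
print("finite-difference gap:", abs(numeric - analytic))
```

Chaining this with the gradient of the square-root sub-layer gives the full backward pass of the embedding layer, from ∂f/∂Z to ∂f/∂X.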

Page 18:

Global Gaussian distribution embedding network (G2DeNet)

Images → Conv. layers → Global Gaussian Embedding Layer (X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z) → …… → Loss

Gaussian Embedding.

Structural Backpropagation: $\frac{\partial f}{\partial X}$ and $\frac{\partial f}{\partial Y}$.

Page 19:

Experiments on MS-COCO

[Figure: Convergence curve of our G2DeNet-FC with AlexNet on MS-COCO.]

Comparison of classification errors on MS-COCO.

Method | Err.
DeepO2P [ICCV’15] | 28.6
DeepO2P-FC (S) [ICCV’15] | 28.9
DeepO2P-FC [ICCV’15] | 25.2
G2DeNet (Ours) | 24.4
G2DeNet-FC (S) (Ours) | 22.6
G2DeNet-FC (Ours) | 21.5

890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation instances. [DeepO2P, ICCV’15]

Page 20:

Experiments on MS-COCO

[Figure: Convergence curve of our G2DeNet-FC with AlexNet on MS-COCO.]

Comparison of classification errors on MS-COCO.

Method | Err.
AlexNet (baseline) | 25.3
DeepO2P [ICCV’15] | 28.6
DeepO2P-FC (S) [ICCV’15] | 28.9
DeepO2P-FC [ICCV’15] | 25.2
DMMs-FC [arXiv’15] | 24.6
G2DeNet (Ours) | 24.4
G2DeNet-FC (S) (Ours) | 22.6
G2DeNet-FC (Ours) | 21.5

890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation instances. [DeepO2P, ICCV’15]

Page 21:

Experiments on FGVR - Benchmarks

Dataset | Classes | Training | Test
Birds CUB-200-2011 | 200 | 5,994 | 5,794
FGVC-Aircraft | 100 | 6,667 | 3,333
FGVC-Cars | 196 | 8,144 | 8,041

Page 22:

Experiments on FGVR - Results

Methods | Birds CUB-200-2011 | FGVC-Aircraft | FGVC-Cars
FC-CNN | 76.4 | 74.1 | 79.8
FV-CNN | 77.5 | 77.6 | 85.7
VLAD-CNN | 79.0 | 80.6 | 85.6
NetFV [TPAMI’17] | 79.9 | 79.0 | 86.2
NetVLAD [CVPR’16] | 81.9 | 81.8 | 88.6
B-CNN [ICCV’15] | 84.1 | 84.1 | 91.3
G2DeNet (Ours) | 87.1 | 89.0 | 92.5

Comparison of counterparts using VGG-VD16 without bounding boxes or part annotations, under the same settings as B-CNN.

NetFV [TPAMI’17]: Lin et al. Bilinear CNNs for Fine-grained Visual Recognition. TPAMI, 2017.

Page 23:

Experiments on FGVR - Results

Methods | Birds CUB-200-2011 | FGVC-Aircraft | FGVC-Cars | Remarks
PG-Alignment [CVPR’15] | 82.0 | - | 92.6 | PG-Alignment + BB
PD [CVPR’16] | 84.5 | - | - | PD+FC+SWFV-CNN
BoT [CVPR’16] | - | 88.4 | 92.5 | Bag of Triplets + BB
SPDA-CNN [CVPR’16] | 85.1 | - | - | SPDA-CNN + Ensemble
Boosted CNN [BMVC’16] | 86.2 | 88.5 | 92.1 | Boosted CNN + B-CNN
RA-CNN [CVPR’17] | 85.3 | - | 92.5 | Recurrent attention CNN
CVL [CVPR’17] | 85.6 | - | - | Combining Vision and Language
KP-CNN [CVPR’17] | 86.2 | 86.9 | 92.4 | Kernel Pooling for CNN
G2DeNet (Ours) | 87.1 (87.5) | 89.0 | 92.5 | VGG-VD16 w/o BB & Part

Comparison of various state-of-the-art methods.

Page 24:

Experiments on ablation - Training methods

VD16-NoTr: Global Gaussian embedding layer + VGG-VD16 pre-trained on ImageNet.

VD16-FT: Global Gaussian embedding layer + fine-tuned VGG-VD16.

G2DeNet: VGG-VD16 pre-trained on ImageNet + end-to-end training of G2DeNet.

Effects of different training methods on G2DeNet using VGG-VD16 on the Birds dataset.

Page 25:

Experiments on ablation - Embedding methods

Comparison of different Gaussian embedding methods for G2DeNet on the Birds dataset.

Method | Gaussian Embedding | Acc. (%)
Nakayama et al. [CVPR’2010] | $\left[\mu^T,\ \mathrm{vec}\left(\Sigma + \mu\mu^T\right)^T\right]$ | 83.5
Calvo et al. or Lovric et al. [JMVA’1990 & JMVA’2000] | $\begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}$ | 84.1
Calvo et al. or Lovric et al. + Log-Euclidean [ICCV’2013] | $\log \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}$ | 83.8
Ours | $\begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}^{\frac{1}{2}}$ | 87.1

Page 26:

Conclusion

The first attempt to plug a global Gaussian distribution into deep CNNs.

Future work: more CNN architectures and computer vision applications.

Please refer to our poster [ID #11] for more details.

Images → Conv. layers → Global Gaussian → …… → Loss