Transcript of: G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition
G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to
Visual Recognition
Qilong Wang1 Peihua Li1 Lei Zhang2
1Dalian University of Technology, 2Hong Kong Polytechnic University
Tendency of CNN architectures
LeNet-5 → AlexNet-8 → VGG-VD-19 / GoogLeNet-22 → ResNet-152 / Inception-V4
CNN architectures tend to be deeper & wider, and more accurate!
Built from only convolution, non-linearity (ReLU), and pooling.
Trainable structural layers
[Figure: Images → Conv. layers → trainable structural layer → Loss]
Bilinear pooling (COV) [B-CNN, ICCV'15]
O2P layer (LogCOV) [DeepO2P, ICCV'15]
Mean Map Embedding [DMMs, arXiv'15]
VLAD Coding [NetVLAD, CVPR'16]
Modeling the outputs of the last convolutional layer with trainable structural layers.
Trainable structural layers
Fine-grained visual classification (Birds / Aircraft / Cars accuracy):
B-CNN [D,D] (84.1, 84.1, 91.3) vs. VGG-VD16 (76.4, 74.1, 79.8): ~8% improvement
T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
Trainable structural layers
Place recognition (Pitts30k):
NetVLAD (+AlexNet) (85.6) vs. AlexNet (69.8): ~15% improvement
R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
Trainable structural layers
Scene categorization (Place205):
DMMs + GoogLeNet (49.00) vs. GoogLeNet (47.5)
J. B. Oliva, D. J. Sutherland, B. Póczos, and J. G. Schneider. Deep mean maps. arXiv, abs/1511.04150, 2015.
Integration of trainable structural layers into deep
CNNs achieves significant improvements in many
challenging vision tasks.
Parametric probability distribution modeling
Gaussian Distribution
Gaussian Mixture Model
Gaussian-Laplacian Model
……
① Modeling abundant statistics of features.
② Producing fixed-size representations regardless of varying feature set sizes.
Promising modeling performance (better than coding methods):
Nakayama et al. CVPR'10; Serra et al. CVIU'15; Wang et al. CVPR'16
High computational efficiency: closed-form solution for parameter estimation.
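Both properties are visible in a few lines of NumPy (an illustrative sketch; `fit_gaussian` is a hypothetical helper, not from the paper): the maximum-likelihood parameters have closed forms, and the output size depends only on the feature dimension D, not on the number of features N.

```python
import numpy as np

def fit_gaussian(features):
    """Closed-form ML estimates of a global Gaussian N(mu, Sigma).

    features: (N, D) array; N may vary from image to image,
    but the size of the returned representation depends only on D.
    """
    N, D = features.shape
    mu = features.mean(axis=0)                            # closed-form mean
    Sigma = features.T @ features / N - np.outer(mu, mu)  # closed-form covariance
    return mu, Sigma
```

Two feature sets of different sizes (say 50 and 200 features) therefore yield representations of identical shape.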
Embedding of global Gaussian in CNN
[Figure: Images → Conv. layers → Global Gaussian → Loss]
Global Gaussian distribution embedding network (G2DeNet)
A trainable global Gaussian embedding layer for modeling convolutional features.
The first attempt to plug a parametric probability distribution into deep CNNs.
Global Gaussian Embedding Layer
[Figure: Images → Conv. layers → Global Gaussian Embedding Layer → Loss; the layer consists of a Matrix Partition Sub-layer (X → Y) followed by a Square-rooted SPD Matrix Sub-layer (Y → Z).]
Global Gaussian: $\mathcal{N}(\mu,\Sigma) \mapsto \begin{bmatrix} \Sigma+\mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}^{\frac{1}{2}}$
Matrix Partition Sub-layer: $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N}\big(A X^{T} X A^{T} + 2\,(A X^{T}\mathbf{1} b^{T})_{\mathrm{sym}}\big) + B$
Square-rooted SPD Matrix Sub-layer: $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}}$
Challenges
Q: How to construct our trainable global Gaussian embedding layer?
A: The key is to give explicit forms of Gaussian distributions.
Forward propagation: exploit the Riemannian geometry structure and the algebraic structure of the space of Gaussians.
Backward propagation: the Gaussian embedding must be differentiable.
[TPAMI'17] shows the space of Gaussians is equipped with a Lie group structure.
The space of Gaussians is a Riemannian manifold with special geometric structure.
[TPAMI'17] Peihua Li, Qilong Wang et al. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 2017.
Gaussian → positive upper triangular (affine) matrix → SPD matrix:
$\mathcal{N}(\mu, \Sigma) \mapsto A_{\mu,\Sigma} = \begin{bmatrix} L & \mu \\ 0^{T} & 1 \end{bmatrix}$ (Cholesky decomposition $\Sigma = L L^{T}$)
$A_{\mu,\Sigma} = P\,O$ (left polar decomposition, $O$ orthogonal)
$P = \big(A_{\mu,\Sigma} A_{\mu,\Sigma}^{T}\big)^{\frac{1}{2}} = \begin{bmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}^{\frac{1}{2}}$ (unique SPD matrix)
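The chain (Gaussian → triangular affine matrix → SPD matrix via left polar decomposition) can be checked numerically. A NumPy sketch with a toy $(\mu, \Sigma)$, assuming nothing beyond the formulas above:

```python
import numpy as np

rng = np.random.RandomState(0)
D = 3
mu = rng.randn(D)
M = rng.randn(D, D)
Sigma = M @ M.T + np.eye(D)          # a toy SPD covariance

# Gaussian -> affine matrix A = [[L, mu], [0, 1]] with Sigma = L L^T (Cholesky).
L = np.linalg.cholesky(Sigma)
A = np.block([[L, mu[:, None]], [np.zeros((1, D)), np.ones((1, 1))]])

# Left polar decomposition A = P O: P = (A A^T)^{1/2} is SPD, O is orthogonal.
lam, U = np.linalg.eigh(A @ A.T)
P = (U * np.sqrt(lam)) @ U.T
O = np.linalg.solve(P, A)

# P^2 recovers the partitioned Gaussian matrix [[Sigma + mu mu^T, mu], [mu^T, 1]].
target = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                   [mu[None, :], np.ones((1, 1))]])
```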
Global Gaussian embedding layer
Gaussian embedding: $\mathcal{N}(\mu, \Sigma) \mapsto P = \begin{bmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}^{\frac{1}{2}}$
1. Matrix Partition Sub-layer: $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N}\big(A X^{T} X A^{T} + 2\,(A X^{T}\mathbf{1} b^{T})_{\mathrm{sym}}\big) + B$
2. Square-rooted SPD Matrix Sub-layer: $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^{T}$
Y is a function of the convolutional features X. The square root of Y is computed via SVD ($Y = U \Lambda U^{T}$).
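The two sub-layer formulas can be sketched in NumPy (an illustrative sketch, not the authors' released implementation; here `X` is the N×D matrix of convolutional features, with $A = [I_D; 0]$, $b = e_{D+1}$ and $B = bb^T$ as constants):

```python
import numpy as np

def g2denet_forward(X):
    """Forward pass of the global Gaussian embedding layer (sketch).

    Matrix partition sub-layer:
        Y = (1/N)(A X^T X A^T + 2 sym(A X^T 1 b^T)) + B
          = [[Sigma + mu mu^T, mu], [mu^T, 1]].
    Square-rooted SPD sub-layer:
        Z = Y^{1/2}, via eigendecomposition of the SPD matrix Y.
    """
    N, D = X.shape
    A = np.vstack([np.eye(D), np.zeros((1, D))])   # (D+1) x D
    b = np.zeros(D + 1); b[D] = 1.0
    AXt1bt = A @ X.T @ np.ones((N, 1)) @ b[None, :]    # A X^T 1 b^T
    Y = (A @ X.T @ X @ A.T + AXt1bt + AXt1bt.T) / N + np.outer(b, b)
    lam, U = np.linalg.eigh(Y)                     # Y is SPD, so eigh = its SVD
    Z = (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    return Y, Z
```

Note that Y equals $(1/N)\tilde{X}^T\tilde{X}$ with $\tilde{X} = [X, \mathbf{1}]$, so it is always positive semi-definite and the square root is well defined.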
Global Gaussian Embedding Layer (architecture recap)
[Figure: Images → Conv. layers → Matrix Partition Sub-layer (X → Y) → Square-rooted SPD Matrix Sub-layer (Y → Z) → Loss]
BP for global Gaussian embedding layer
The goal is to compute $\partial f/\partial X$. The first step is to compute $\partial f/\partial Y$.
BP for square-rooted SPD matrix sub-layer
Compute $\partial f/\partial Y$ by the chain rule $df : dY \rightarrow df : (dU, d\Lambda)$, using the eigendecomposition (the SVD of the SPD matrix) $Y = U \Lambda U^{T}$:
$d\Lambda = \big(U^{T}\, dY\, U\big)_{\mathrm{diag}}, \qquad dU = U\big(\tilde{K}^{T} \circ (U^{T}\, dY\, U)\big)$
$\frac{\partial f}{\partial Y} = \Big(U\Big(\tilde{K}^{T} \circ \big(U^{T}\tfrac{\partial f}{\partial U}\big) + \big(\tfrac{\partial f}{\partial \Lambda}\big)_{\mathrm{diag}}\Big)U^{T}\Big)_{\mathrm{sym}}$
where $\tilde{K}_{ij} = 1/(\lambda_i - \lambda_j)$ for $i \neq j$ and $0$ otherwise [DeepO2P, ICCV'15].
[DeepO2P, ICCV'15]: Catalin Ionescu et al. Matrix Backpropagation for Deep Networks with Structured Layers. ICCV, 2015.
BP for square-rooted SPD matrix sub-layer
Compute $\partial f/\partial U$ and $\partial f/\partial \Lambda$ from $\partial f/\partial Z$, given $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^{T}$:
$dZ = 2\big(dU\, \Lambda^{\frac{1}{2}} U^{T}\big)_{\mathrm{sym}} + \tfrac{1}{2}\, U \Lambda^{-\frac{1}{2}}\, d\Lambda\, U^{T}$
$\frac{\partial f}{\partial U} = 2\Big(\frac{\partial f}{\partial Z}\Big)_{\mathrm{sym}} U \Lambda^{\frac{1}{2}}, \qquad \frac{\partial f}{\partial \Lambda} = \frac{1}{2}\, \Lambda^{-\frac{1}{2}} \Big(U^{T}\, \frac{\partial f}{\partial Z}\, U\Big)_{\mathrm{diag}}$
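One way to realize these backward steps in NumPy (an illustrative sketch with my own helper names, not the authors' code): fold $\partial f/\partial U$ and $\partial f/\partial \Lambda$ back into $\partial f/\partial Y$ in the eigenbasis of Y.

```python
import numpy as np

def sym(M):
    return 0.5 * (M + M.T)

def sqrtm_spd_backward(Y, dfdZ):
    """Backprop through Z = Y^{1/2} for SPD Y, following the sub-layer equations.

    Assumes distinct eigenvalues of Y (generic case), as in the slide's K matrix.
    """
    lam, U = np.linalg.eigh(Y)            # Y = U diag(lam) U^T
    s = np.sqrt(lam)
    G = sym(dfdZ)
    dfdU = 2.0 * G @ U * s                # df/dU = 2 (df/dZ)_sym U Lambda^{1/2}
    dfdlam = 0.5 * np.diag(U.T @ G @ U) / s   # df/dLambda, diagonal part
    # K^T entry (i, j) = 1/(lam_j - lam_i) for i != j, 0 on the diagonal.
    denom = lam[None, :] - lam[:, None]
    np.fill_diagonal(denom, 1.0)
    K = 1.0 / denom
    np.fill_diagonal(K, 0.0)
    inner = K * (U.T @ dfdU) + np.diag(dfdlam)
    return sym(U @ inner @ U.T)           # df/dY, symmetrized
```

A finite-difference check of $f(Y) = \langle C, Y^{1/2}\rangle$ against this gradient is a good sanity test.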
BP for global Gaussian embedding layer
The goal is to compute $\partial f/\partial X$ given $\partial f/\partial Y$:
$\frac{\partial f}{\partial X} = \frac{2}{N}\big(X A^{T} + \mathbf{1} b^{T}\big)\Big(\frac{\partial f}{\partial Y}\Big)_{\mathrm{sym}} A$
given $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N}\big(A X^{T} X A^{T} + 2\,(A X^{T}\mathbf{1} b^{T})_{\mathrm{sym}}\big) + B$.
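The matrix-partition gradient is similarly short once A and b are materialized. A NumPy sketch (`mpl_forward`/`mpl_backward` are hypothetical helper names, with $A = [I_D; 0]$ and $b = e_{D+1}$ as in the forward formula):

```python
import numpy as np

def mpl_forward(X):
    """Y = (1/N)(A X^T X A^T + 2 sym(A X^T 1 b^T)) + B, with B = b b^T."""
    N, D = X.shape
    A = np.vstack([np.eye(D), np.zeros((1, D))])
    b = np.zeros(D + 1); b[D] = 1.0
    M = A @ X.T @ np.ones((N, 1)) @ b[None, :]
    return (A @ X.T @ X @ A.T + M + M.T) / N + np.outer(b, b)

def mpl_backward(X, dfdY):
    """df/dX = (2/N)(X A^T + 1 b^T)(df/dY)_sym A."""
    N, D = X.shape
    A = np.vstack([np.eye(D), np.zeros((1, D))])
    b = np.zeros(D + 1); b[D] = 1.0
    sym_dfdY = 0.5 * (dfdY + dfdY.T)
    return (2.0 / N) * (X @ A.T + np.ones((N, 1)) @ b[None, :]) @ sym_dfdY @ A
```

Because Y is quadratic in X, a central finite difference of $f(X) = \langle C, Y(X)\rangle$ matches this gradient essentially exactly.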
BP for global Gaussian embedding layer (architecture recap)
[Figure: Images → Conv. layers → Matrix Partition Sub-layer (X → Y) → Square-rooted SPD Matrix Sub-layer (Y → Z) → Loss]
Global Gaussian distribution embedding network (G2DeNet)
Gaussian embedding.
Structural backpropagation of $\partial f/\partial X$ and $\partial f/\partial Y$.
Experiments on MS-COCO
[Figure: convergence curve of our G2DeNet-FC with AlexNet on MS-COCO.]
890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation ones [DeepO2P, ICCV'15].
Comparison of classification errors on MS-COCO:
Method | DeepO2P [ICCV'15] | DeepO2P-FC (S) [ICCV'15] | DeepO2P-FC [ICCV'15]
Err.   | 28.6              | 28.9                     | 25.2
Method | G2DeNet (Ours)    | G2DeNet-FC (S) (Ours)    | G2DeNet-FC (Ours)
Err.   | 24.4              | 22.6                     | 21.5
Experiments on MS-COCO
[Figure: convergence curve of our G2DeNet-FC with AlexNet on MS-COCO.]
Comparison of classification errors on MS-COCO:
Method | AlexNet (baseline) | DeepO2P [ICCV'15] | DeepO2P-FC (S) [ICCV'15] | DeepO2P-FC [ICCV'15]
Err.   | 25.3               | 28.6              | 28.9                     | 25.2
Method | DMMs-FC [arXiv'15] | G2DeNet (Ours)    | G2DeNet-FC (S) (Ours)    | G2DeNet-FC (Ours)
Err.   | 24.6               | 24.4              | 22.6                     | 21.5
890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation ones [DeepO2P, ICCV'15].
Experiments on FGVR - Benchmarks
Birds (CUB-200-2011): 200 classes, 5,994 training / 5,794 test images
FGVC-Aircraft: 100 classes, 6,667 training / 3,333 test images
FGVC-Cars: 196 classes, 8,144 training / 8,041 test images
Experiments on FGVR - Results
Methods            | Birds (CUB-200-2011) | FGVC-Aircraft | FGVC-Cars
FC-CNN             | 76.4 | 74.1 | 79.8
FV-CNN             | 77.5 | 77.6 | 85.7
VLAD-CNN           | 79.0 | 80.6 | 85.6
NetFV [TPAMI'17]   | 79.9 | 79.0 | 86.2
NetVLAD [CVPR'16]  | 81.9 | 81.8 | 88.6
B-CNN [ICCV'15]    | 84.1 | 84.1 | 91.3
G2DeNet (Ours)     | 87.1 | 89.0 | 92.5
Comparison of different counterparts using VGG-VD16, without bounding box & part annotations, sharing the same settings as B-CNN.
NetFV [TPAMI'17]: Lin et al. Bilinear CNNs for Fine-grained Visual Recognition. TPAMI, 2017.
Experiments on FGVR - Results
Methods                 | Birds (CUB-200-2011) | FGVC-Aircraft | FGVC-Cars | Remarks
PG-Alignment [CVPR'15]  | 82.0        | -    | 92.6 | PG-Alignment + BB
PD [CVPR'16]            | 84.5        | -    | -    | PD + FC + SWFV-CNN
BoT [CVPR'16]           | -           | 88.4 | 92.5 | Bag of Triplets + BB
SPDA-CNN [CVPR'16]      | 85.1        | -    | -    | SPDA-CNN + Ensemble
Boosted CNN [BMVC'16]   | 86.2        | 88.5 | 92.1 | Boosted CNN + B-CNN
RA-CNN [CVPR'17]        | 85.3        | -    | 92.5 | Recurrent attention CNN
CVL [CVPR'17]           | 85.6        | -    | -    | Combining vision and language
KP-CNN [CVPR'17]        | 86.2        | 86.9 | 92.4 | Kernel pooling for CNN
G2DeNet (Ours)          | 87.1 (87.5) | 89.0 | 92.5 | VGG-VD16 w/o BB & Part
Comparison with various state-of-the-art methods.
Experiments on ablation - Training methods
VD16-NoTr: global Gaussian embedding layer + VGG-VD16 pre-trained on ImageNet.
VD16-FT: global Gaussian embedding layer + fine-tuned VGG-VD16.
G2DeNet: VGG-VD16 pre-trained on ImageNet + G2DeNet trained end-to-end.
[Figure: effects of different training methods on G2DeNet using VGG-VD16 on the Birds dataset.]
Experiments on ablation - Embedding methods
Method | Gaussian embedding | Acc. (%)
Nakayama et al. [CVPR'2010] | $[\mu^{T}, \mathrm{vec}(\Sigma)^{T}]^{T}$ | 83.5
Calvo et al. or Lovric et al. [JMVA'1990 & JMVA'2000] | $\begin{bmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}$ | 84.1
Calvo et al. or Lovric et al. + Log-Euclidean [ICCV'2013] | $\log\begin{bmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}$ | 83.8
Ours | $\begin{bmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{bmatrix}^{\frac{1}{2}}$ | 87.1
Comparison of different Gaussian embedding methods for G2DeNet on the Birds dataset.
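To make the four embeddings in the table concrete, the sketch below builds each of them for a toy $(\mu, \Sigma)$ (illustrative only; the accuracies come from the paper's experiments, not from this snippet):

```python
import numpy as np

rng = np.random.RandomState(0)
D = 3
mu = rng.randn(D)
M = rng.randn(D, D)
Sigma = M @ M.T + np.eye(D)          # a toy SPD covariance

# Shared SPD building block P = [[Sigma + mu mu^T, mu], [mu^T, 1]].
P = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
              [mu[None, :], np.ones((1, 1))]])

e_concat = np.concatenate([mu, Sigma.ravel()])    # Nakayama et al.: [mu; vec(Sigma)]
e_spd = P                                         # Calvo/Lovric: the SPD matrix itself

lam, U = np.linalg.eigh(P)                        # P is SPD, so matrix log/sqrt exist
e_log = (U * np.log(lam)) @ U.T                   # + Log-Euclidean: log(P)
e_sqrt = (U * np.sqrt(lam)) @ U.T                 # ours: P^{1/2}
```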
Conclusion
The first attempt to plug a global Gaussian distribution into deep CNNs.
Future work: more CNN architectures and computer vision applications.
Please refer to our poster [ID #11] for more details.