Deep Visual Learning on Hypersphere - Nvidia
Weiyang Liu*, Zhen Liu*
College of Computing, Georgia Institute of Technology
Outline
• Why Learning on Hypersphere
• Loss Design - Large-Margin Learning on Hypersphere
• Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
• Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
• Conclusion
Why Learning on Hypersphere
• An empirical observation
• Setting the output feature dimension to 2 in a CNN
• Directly visualizing the features without using t-SNE
Deep features are naturally distributed over a sphere!
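• Euclidean distance is not well suited to high-dimensional data
More specifically, the squared Euclidean distance expands as
$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x^\top y$
In high-dimensional space, vectors tend to be nearly orthogonal to each other ($x^\top y \approx 0$), so this reduces to
$\|x - y\|^2 \approx \|x\|^2 + \|y\|^2$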
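The near-orthogonality claim can be checked numerically. The sketch below is only an illustration under an assumption that is not from the slides: vectors are drawn from an isotropic Gaussian. The helper name `mean_abs_cosine` is hypothetical.

```python
import numpy as np

def mean_abs_cosine(dim, n_pairs=2000, seed=0):
    """Average |cos(angle)| between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return float(np.mean(np.abs(cos)))

# The average |cosine| shrinks as the dimension grows, so the inner-product
# term in the squared distance contributes less and less.
for dim in (2, 32, 512):
    print(dim, mean_abs_cosine(dim))
```

As the dimension grows, the printed average moves toward zero, which is the regime in which the distance is dominated by the two norms.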
• Learning features on a hypersphere regularizes the feature space well.
In deep metric learning, features have to be normalized before entering the loss function.
Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015
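A minimal sketch of what normalizing features before the loss means in practice, assuming plain L2 normalization of each embedding (the helper `l2_normalize` and the random embeddings are illustrative, not taken from FaceNet). Once features lie on the unit hypersphere, squared Euclidean distance and cosine similarity carry the same information.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Project each row of `features` onto the unit hypersphere."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

rng = np.random.default_rng(0)
emb = l2_normalize(rng.standard_normal((4, 128)))  # hypothetical embeddings

a, b = emb[0], emb[1]
sq_dist = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(angle between a and b)
print(sq_dist, 2.0 - 2.0 * cosine)
```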
Large-Margin Learning on Hypersphere
• Standard CNNs usually use the softmax loss as the learning objective.
How can a margin be incorporated on the hypersphere?
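• The intuition (from binary classification)
If x belongs to class 1, the original softmax requires:
$W_1^\top x > W_2^\top x$, i.e. $\|W_1\|\,\|x\|\cos\theta_1 > \|W_2\|\,\|x\|\cos\theta_2$
We want to make the classification more rigorous in order to produce a decision margin:
$\|W_1\|\,\|x\|\cos(m\theta_1) > \|W_2\|\,\|x\|\cos\theta_2$, with $0 \le \theta_1 \le \pi/m$ and an integer margin $m \ge 1$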
Original Softmax Loss → (imposing a large margin) → Large-Margin Softmax Loss → (normalizing classifier weights) → Angular Softmax Loss
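The loss formulas themselves are not preserved in this transcript. The following NumPy sketch shows one common way to write the angular softmax loss, assuming zero biases, unit-norm classifier weights, and the piecewise surrogate psi(theta) = (-1)^k cos(m*theta) - 2k on [k*pi/m, (k+1)*pi/m]. The function names and the random data are hypothetical, and this is not the authors' released implementation.

```python
import numpy as np

def psi(theta, m):
    """Monotonically decreasing surrogate for cos(m * theta) on [0, pi]."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    sign = np.where(k % 2 == 0, 1.0, -1.0)
    return sign * np.cos(m * theta) - 2.0 * k

def angular_softmax_loss(x, W, labels, m=4):
    """x: (n, d) nonzero features, W: (d, C) weights, labels: (n,) class ids."""
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize each class weight
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    cos = np.clip((x @ W_hat) / x_norm, -1.0, 1.0)         # cos(theta_j) for every class
    idx = np.arange(x.shape[0])
    theta_y = np.arccos(cos[idx, labels])                  # angle to the target class
    logits = x_norm * cos
    logits[idx, labels] = x_norm[:, 0] * psi(theta_y, m)   # margin only on the target
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, labels].mean())

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W = rng.standard_normal((16, 5))
y = rng.integers(0, 5, size=8)
loss_m1 = angular_softmax_loss(x, W, y, m=1)  # m = 1: softmax with normalized weights
loss_m4 = angular_softmax_loss(x, W, y, m=4)  # m = 4: stricter angular requirement
print(loss_m1, loss_m4)
```

With m = 1 the surrogate is just cos(theta), so the margin is switched off. Larger m lowers the target-class logit for the same features, which is what forces the learned angles to shrink.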
Learned Feature Visualization
• 2D Feature Visualization on MNIST
• 3D Feature Visualization on CASIA Face Dataset
(Figure panels: m=1, m=2, m=3, m=4)
Experimental Results
• Face Verification
LFW and YTF datasets
SphereFace uses the angular large-margin softmax loss, achieving state-of-the-art performance with only 0.5M training samples.
• Million-scale Face Recognition Challenge
MegaFace Challenge
SphereFace ranked No. 1 from December 2016 to April 2017, and the current No. 1 entry is also based on SphereFace.
Demo
SphereNet
• Traditional Convolution
• HyperSpherical Convolution (SphereConv)
SphereConv normalizes each local patch of a feature map and each weight vector.
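The description above can be sketched as a single-channel, stride-1 "valid" convolution in which both the local patch and the filter are L2-normalized, so that every output is the cosine of the angle between them. This is an illustrative NumPy loop under those assumptions (cosine response, no bias, small `eps` for stability), not the SphereNet implementation; `sphere_conv2d` is a hypothetical name.

```python
import numpy as np

def sphere_conv2d(image, kernels, eps=1e-8):
    """image: (H, W); kernels: (K, kh, kw). Returns cosines, shape (K, H-kh+1, W-kw+1)."""
    K, kh, kw = kernels.shape
    H, W = image.shape
    w = kernels.reshape(K, -1)
    w = w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)  # normalize each filter
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            patch = image[i:i + kh, j:j + kw].ravel()
            patch = patch / (np.linalg.norm(patch) + eps)      # normalize the local patch
            out[:, i, j] = w @ patch                           # cos(angle) per filter
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))
ker = rng.standard_normal((2, 3, 3))
out = sphere_conv2d(img, ker)
print(out.shape, out.min(), out.max())
```

Because only angles survive, rescaling the input or the filters leaves the output essentially unchanged, and every response is bounded in [-1, 1].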
SphereNet - Intuition from Fourier Transform
• Semantic information is mostly preserved when the magnitude is corrupted, but not when the phase (angular information) is corrupted
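One way to try this observation is to combine the Fourier magnitude of one image with the phase of another and look at which image the result resembles. The sketch below only sets up that experiment; the helper name and the random arrays are placeholders, and real images would need to be substituted to see the effect.

```python
import numpy as np

def mix_magnitude_phase(mag_source, phase_source):
    """Image with the Fourier magnitude of `mag_source` and the phase of `phase_source`."""
    F_mag = np.fft.fft2(mag_source)
    F_phase = np.fft.fft2(phase_source)
    mixed = np.abs(F_mag) * np.exp(1j * np.angle(F_phase))
    return np.real(np.fft.ifft2(mixed))

rng = np.random.default_rng(0)
img_a = rng.random((16, 16))  # placeholder for a real grayscale image
img_b = rng.random((16, 16))  # placeholder for a second image

phase_of_a = mix_magnitude_phase(img_b, img_a)      # keeps a's phase, corrupts its magnitude
magnitude_of_a = mix_magnitude_phase(img_a, img_b)  # keeps a's magnitude, corrupts its phase
```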
Decoupled Convolution
Observation: The final feature is naturally decoupled, where the magnitude represents the intra-class variation.
General Framework - Decoupled Convolution
• Decoupling angle and magnitude of feature vectors
• Allowing different designs of convolution operators for different tasks
Decoupled Convolution: Magnitude (intra-class variation) and Angle (semantic difference)
Example Choices of Magnitude
• SphereConv
• BallConv
• TanhConv
• LinearConv
Example Choices of Angle
• Linear
• Cosine
• Squared Cosine
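The formulas behind these named choices are not preserved in this transcript. The sketch below assumes a decoupled operator of the form h(||x||) * g(theta), with forms for h and g that should be checked against the DCNet paper before being relied on. The hyperparameters `alpha` and `rho` and all function names are illustrative.

```python
import numpy as np

# Magnitude functions h (assumed forms; alpha and rho are hypothetical hyperparameters).
def h_sphere(x_norm, alpha=1.0):
    return np.full_like(np.asarray(x_norm, dtype=float), alpha)  # constant: magnitude ignored

def h_ball(x_norm, alpha=1.0, rho=1.0):
    return alpha * np.minimum(x_norm, rho) / rho                 # linear, saturating at rho

def h_tanh(x_norm, alpha=1.0, rho=1.0):
    return alpha * np.tanh(x_norm / rho)                         # smooth, bounded

def h_linear(x_norm, alpha=1.0):
    return alpha * x_norm                                        # unbounded

# Angle functions g (assumed forms), each mapping theta in [0, pi] to [-1, 1].
def g_linear(theta):
    return 1.0 - 2.0 * theta / np.pi

def g_cosine(theta):
    return np.cos(theta)

def g_sq_cosine(theta):
    return np.sign(np.cos(theta)) * np.cos(theta) ** 2

def decoupled_response(w, x, h, g):
    """Response of one filter w to one patch x as magnitude term times angle term."""
    x_norm = np.linalg.norm(x)
    cos = np.clip(w @ x / (np.linalg.norm(w) * x_norm), -1.0, 1.0)
    return h(x_norm) * g(np.arccos(cos))

w = np.array([0.6, 0.8])   # unit-norm filter
x = np.array([3.0, 1.0])
print(decoupled_response(w, x, h_linear, g_cosine), w @ x)  # same value for unit-norm w
print(decoupled_response(w, x, h_tanh, g_cosine))           # bounded response
```

Under these assumed forms, pairing the linear magnitude with the cosine angle recovers the ordinary inner product for a unit-norm filter, while the sphere, ball and tanh choices keep the response bounded.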
Generalization
With SphereConv, the top-1 accuracy of CNNs on ImageNet improves by roughly 0.9 percentage points (0.85 to 0.95 across the three architectures below).
Top-1 Accuracy (%, center crop) of baseline and SphereNet on ImageNet.

| Model | Plain-CNN-9 | Plain-CNN-12 | ResNet-27 |
|---|---|---|---|
| Baseline | 58.31 | 61.42 | 65.54 |
| SphereNet | 59.23 | 62.27 | 66.49 |

* Different from the original NeurIPS paper:
1) In ResNet, we use a fully connected layer instead of average pooling to obtain the final feature. We found this to be crucial for SphereNet.
2) We add L2 weight decay, which slows down the optimization but results in better performance.
Adversarial Robustness and Optimization
* Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Mądry.
Optimization Without BatchNorm
• Without BatchNorm, decoupled convolutions outperform the baseline.
• The bounded TanhConv can be optimized while unbounded ones fail.
Accuracies of different convolution operators on Plain-CNN-9 without BatchNorm. N/C indicates ‘not converged’.
Adversarial Robustness
Bounded convolution operators are more robust against both the fast gradient sign method (FGSM) attack and its multi-step version.
(Result panels: natural training; adversarial training)
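For readers unfamiliar with the two attacks named above, here is a minimal sketch of FGSM and a multi-step variant. It assumes a toy logistic-regression model rather than the CNNs evaluated in the slides, so that the input gradient can be written in closed form. All names and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on one sample (y in {0, 1})."""
    p = sigmoid(w @ x + b)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fgsm(x, y, w, b, eps):
    """Single step: move eps along the sign of the input gradient of the loss."""
    grad_x = (sigmoid(w @ x + b) - y) * w   # d(loss)/dx for the logistic model
    return x + eps * np.sign(grad_x)

def multi_step_fgsm(x, y, w, b, eps, steps=5):
    """Several smaller steps, clipped so the total change stays within eps (L_inf)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = fgsm(x_adv, y, w, b, eps / steps)
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.standard_normal(10), 0.1   # hypothetical model parameters
x, y = rng.standard_normal(10), 1     # hypothetical sample and label
x_fgsm = fgsm(x, y, w, b, eps=0.1)
x_multi = multi_step_fgsm(x, y, w, b, eps=0.1)
print(bce_loss(x, y, w, b), bce_loss(x_fgsm, y, w, b), bce_loss(x_multi, y, w, b))
```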
Attacking decoupled convolutions with bounded magnitude requires a perturbation of larger norm.
L2 and L_inf norms needed to attack models on samples in the test set.
Minimum Hyperspherical Energy
Intuition:
Better generalization ← More diversity of neurons ← Less redundancy
Paper [1] shows that, in a one-hidden-layer network, maximizing diversity can eliminate spurious local minima.
If two weight vectors in one layer are close to each other, there is probably more redundancy.
[1] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
Proposed regularization: add repulsion forces between every pair of weight vectors (in one layer).
It connects to the Thomson problem: finding the minimum-energy configuration of electrons on the surface of a unit sphere.
Loss function:
This optimization problem is generally non-trivial. With s = 2, the problem is actually NP-hard.
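The loss formula itself is not preserved in this transcript. A minimal sketch, assuming the energy is the Riesz s-energy of the weight vectors after projection onto the unit hypersphere (sum over pairs of ||w_i - w_j||^(-s) for s > 0, and of log(1/||w_i - w_j||) for s = 0). The paper's exact normalization and pair counting may differ, and `hyperspherical_energy` is a hypothetical name.

```python
import numpy as np

def hyperspherical_energy(W, s=2.0, eps=1e-12):
    """W: (N, d), one weight vector per row. Energy over unordered pairs i < j."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)      # project onto the sphere
    dist = np.linalg.norm(W_hat[:, None, :] - W_hat[None, :, :], axis=-1)
    iu = np.triu_indices(W.shape[0], k=1)                     # each pair counted once
    d = np.maximum(dist[iu], eps)                             # guard against coincident vectors
    if s > 0:
        return float(np.sum(d ** (-s)))
    return float(np.sum(np.log(1.0 / d)))

# Three vectors on the unit circle: evenly spread versus clustered.
spread = np.deg2rad([0.0, 120.0, 240.0])
clustered = np.deg2rad([0.0, 10.0, 20.0])
W_spread = np.stack([np.cos(spread), np.sin(spread)], axis=1)
W_clustered = np.stack([np.cos(clustered), np.sin(clustered)], axis=1)

print(hyperspherical_energy(W_spread), hyperspherical_energy(W_clustered))
```

The evenly spread configuration has the lower energy, so adding this term to the training loss pushes the neurons of a layer apart in angle.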
Although the orthonormal loss seems similar, it does not yield the ideal configuration of weights, even in the 3D case.
MHE Loss is compatible with weight decay:
- MHE regularizes the angles of weights
- Weight decay regularizes the magnitude of weights
Collinearity issue:
In this toy example, optimizing the original MHE results in collinear weight vectors.
Half-space MHE:
Optimizing the pairwise angles between lines (instead of vectors).
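A small sketch of the collinearity issue and of one way to realize the half-space variant: compute the energy over each weight vector together with its negation, so that a vector and its opposite count as the same line. The s = 2 Riesz energy and the function names are assumptions made for illustration.

```python
import numpy as np

def riesz_energy(V, s=2.0, eps=1e-12):
    """Sum over unordered pairs of ||v_i - v_j||^(-s) for unit-normalized rows of V."""
    V_hat = V / np.linalg.norm(V, axis=1, keepdims=True)
    dist = np.linalg.norm(V_hat[:, None, :] - V_hat[None, :, :], axis=-1)
    iu = np.triu_indices(V.shape[0], k=1)
    return float(np.sum(np.maximum(dist[iu], eps) ** (-s)))

def half_space_energy(W, s=2.0):
    """Energy over the augmented set {w_i} and {-w_i}, penalizing collinear lines."""
    return riesz_energy(np.concatenate([W, -W], axis=0), s=s)

W_antipodal = np.array([[1.0, 0.0], [-1.0, 0.0]])   # two collinear vectors
W_orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])

# Original energy prefers the collinear (antipodal) pair ...
print(riesz_energy(W_antipodal), riesz_energy(W_orthogonal))
# ... while the half-space energy prefers the orthogonal pair.
print(half_space_energy(W_antipodal), half_space_energy(W_orthogonal))
```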
MHE - Ablation Study
MHE on a 9-layer plain CNN on the CIFAR-10/100 datasets.
• MHE consistently improves the performance of networks.
• When the network is hard to optimize due to redundancy of neurons (small width/large depth), MHE helps more.
MHE with different depths of network on CIFAR-100.
MHE with different widths of network on CIFAR-100.
MHE Application - Image Recognition
MHE can improve the performance of networks on ImageNet.
Top-1 error (center crop) of models on ImageNet.
MHE Application - Face Recognition
We add the MHE loss to the angular softmax loss in SphereFace. We call the resulting model SphereFace+.
Synergy:
• Angular softmax loss - intra-class compactness
• MHE loss - inter-class separability
Comparison to State-of-the-art results.
Comparison between SphereFace and SphereFace+.
MHE Application - Class Imbalanced Recognition
Applying MHE to the final classifier enforces the prior that all categories have the same importance and thus improves performance.
Results on class imbalanced recognition on CIFAR-10.
* Single - Reduce the number of samples in only one category by 90%. Multiple - Reduce the number of samples in multiple categories with different weights. Details are shown in the paper.
The category with less data tends to be ignored.
Visualization of the final CNN features.
MHE Application - GAN
With MHE added to the discriminator, the inception score of spectral GAN improves from 7.42 to 7.68.
Conclusion
• We introduce a hyperspherical learning framework for deep visual learning, where all the neurons and classifiers are learned over a hypersphere.
• Large-margin learning on the hypersphere is very beneficial to tasks like biometric verification and person re-identification, where features are expected to have large inter-class variation.
• Hyperspherical networks and decoupled networks are natural generalizations that apply hyperspherical learning to every layer of the network.
• Minimum hyperspherical energy is a generic regularization that diversifies the neurons on a hypersphere and can improve generalization.
Source Code
SphereFace: https://github.com/wy1iu/sphereface
SphereNet: https://github.com/wy1iu/SphereNet
DCNet: https://github.com/wy1iu/DCNets
MHE: https://github.com/wy1iu/MHE
SphereFace+: https://github.com/wy1iu/sphereface-plus
Appendix
Architecture for SphereNet on ImageNet experiment
Plain-CNN-9: 7x7 conv - maxpool - 3*(3x3 conv, 64) - 3*(3x3 conv, 128) - 3*(3x3 conv, 256) - fc(512) - classifier
Plain-CNN-12: 7x7 conv - maxpool - 3*(3x3 conv, 64) - 3*(3x3 conv, 128) - 3*(3x3 conv, 256) - 3*(3x3 conv, 512) - fc(512) - classifier
ResNet-27: 7x7 conv - maxpool - 3*(3x3 ResBlock, 64) - 3*(3x3 ResBlock, 128) - 3*(3x3 ResBlock, 256) - 3*(3x3 ResBlock, 512) - fc(512) - classifier
Architecture for MHE ablation study on CIFAR-10/100