AI&BigData Lab. Artem Chernodub "Image Recognition...
Lazy Deep Learning for Images
Recognition in ZZ Photo app
Artem Chernodub, George Paschenko
IMMSP NASU
AI&Big Data Lab, 23 April, 2015, Odessa.
ZZ Photo
$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$
Biologically inspired models
Neuroscience
Machine Learning
2 / 55
Biological Neural Networks
3 / 55
Artificial Neural Networks
Traditional (Shallow) Neural Networks
Deep Neural Networks
Deep Feedforward Neural Networks
Recurrent Neural Networks
4 / 55
Conventional Methods vs Deep Learning
5 / 55
Deep Learning = Learning of Representations (Features)
The traditional model of pattern recognition (since the late 50's):
fixed/engineered features + trainable classifier
Hand-crafted Feature Extractor → Trainable Classifier
End-to-end learning / Feature learning / Deep learning:
trainable features + trainable classifier
Trainable Feature Extractor → Trainable Classifier
6 / 55
ImageNet
Le et al. “Building high-level features using large-scale unsupervised learning” ICML
2012.
Model                   # of parameters   Accuracy, %
Deep Net                10M               15.8
Best state of the art   N/A               9.3
Training data: 16M images, 20K categories
7 / 55
Deep Face (Facebook)
Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-Level
Performance in Face Verification // CVPR 2014.
Model           # of parameters   Accuracy, %
Deep Face Net   128M              97.35
Human level     N/A               97.5
Training data: 4M facial images
8 / 55
TIMIT Phoneme Recognition
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep
recurrent neural networks // IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 6645–6649. IEEE.
Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann
machines // IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 4354–4357.
Model                      # of parameters   Phoneme error rate
Hidden Markov Model, HMM   N/A               27.3%
Deep Belief Network, DBN   ~4M               26.7%
Deep RNN                   4.3M              17.7%
Training data: 462 speakers train / 24 speakers test, 3.16 / 0.14 hrs.
9 / 55
Google Large Vocabulary Speech Recognition
H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Recurrent Neural Network
Architectures for Large Scale Acoustic Modeling // INTERSPEECH’2014.
K. Vesely, A. Ghoshal, L. Burget, D. Povey. Sequence-discriminative training of deep
neural networks // INTERSPEECH’2014.
Model                      # of parameters   Cross-entropy
ReLU DNN                   85M               11.3
Deep Projection LSTM RNN   13M               10.7
Training data: 3M utterances (1900 hrs).
10 / 55
Classic Feedforward Neural Networks (before 2006).
• Single hidden layer (Kolmogorov-Cybenko Universal
Approximation Theorem as the main hope).
• Vanishing gradients effect prevents using more layers.
• Less than 10K free parameters.
• Feature preprocessing stage is often critical.
11 / 55
Training the traditional (shallow) Neural Network: derivative + optimization
12 / 55
1) forward propagation pass
$$z_j = f\Big(\sum_i w^{(1)}_{ji} x_i\Big),$$
$$\tilde{y}(k+1) = g\Big(\sum_j w^{(2)}_j z_j\Big),$$
where $z_j$ is the postsynaptic value for the $j$-th hidden neuron, $w^{(1)}$ are the hidden layer’s
weights, $f()$ are the hidden layer’s activation functions, $w^{(2)}$ are the output layer’s weights,
and $g()$ are the output layer’s activation functions.
13 / 55
2) backpropagation pass
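Local gradients calculation:
$$\delta^{OUT} = t(k+1) - \tilde{y}(k+1),$$
$$\delta^{HID}_j = f'(z_j)\, w^{(2)}_j\, \delta^{OUT}.$$
Derivatives calculation:
$$\frac{\partial E(k)}{\partial w^{(2)}_j} = \delta^{OUT} z_j,$$
$$\frac{\partial E(k)}{\partial w^{(1)}_{ji}} = \delta^{HID}_j x_i.$$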
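The two passes can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes f = tanh, a linear output g, and the loss E = 0.5 (t − y)², none of which are fixed by the slides. Under that loss the gradient of E is the negative of the delta products, so the sign convention is spelled out in the comments.

```python
import numpy as np

def forward(x, W1, w2):
    """z_j = f(sum_i W1[j, i] * x_i), y = g(sum_j w2[j] * z_j)."""
    z = np.tanh(W1 @ x)   # f = tanh (assumed)
    y = float(w2 @ z)     # g = identity (assumed)
    return z, y

def backward(x, t, W1, w2):
    """Gradients of E = 0.5 * (t - y)**2 built from the local gradients."""
    z, y = forward(x, W1, w2)
    delta_out = t - y                             # output local gradient
    delta_hid = (1.0 - z ** 2) * w2 * delta_out   # tanh' = 1 - z^2
    # delta_out is (target - output), so dE/dw is minus the delta products.
    grad_w2 = -delta_out * z
    grad_W1 = -np.outer(delta_hid, x)
    return grad_W1, grad_w2

def loss(x, t, W1, w2):
    """Squared-error loss used for the finite-difference check."""
    return 0.5 * (t - forward(x, W1, w2)[1]) ** 2

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), 0.5
W1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)
grad_W1, grad_w2 = backward(x, t, W1, w2)
```

A gradient-descent step would then be `w2 -= lr * grad_w2` and `W1 -= lr * grad_W1`, with `lr` a hypothetical learning rate.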
14 / 55
Bad effect of vanishing (exploding) gradients: a problem
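$$\frac{\partial E(k)}{\partial w^{(m)}_{ji}} = \delta^{(m)}_j z^{(m-1)}_i,$$
$$\delta^{(m)}_j = {f'_j}^{(m)} \sum_i w^{(m+1)}_{ij} \delta^{(m+1)}_i,$$
$$\Rightarrow \frac{\partial E(k)}{\partial w^{(m)}_{ji}} \to 0 \quad \text{for } m \to 1.$$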
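The shrinking of the local gradients toward the first layers can be observed numerically. The sketch below is an illustration under assumptions that are not from the slides: a 10-layer sigmoid network, width 50, Gaussian weights scaled by 1/sqrt(width), and an all-ones error signal at the top.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def local_gradient_norms(n_layers=10, width=50, seed=0):
    """Return ||delta^(m)|| for every layer of a random sigmoid network."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
               for _ in range(n_layers)]
    # Forward pass, remembering f'(a) = s * (1 - s) at every layer.
    z = rng.normal(size=width)
    derivs = []
    for W in weights:
        z = sigmoid(W @ z)
        derivs.append(z * (1.0 - z))
    # Backward pass: delta^(m) = f'^(m) * sum_i w^(m+1)_ij * delta^(m+1)_i.
    upstream = np.ones(width)   # dE/dz at the top layer (arbitrary choice)
    norms = []
    for m in reversed(range(n_layers)):
        delta = derivs[m] * upstream
        norms.append(np.linalg.norm(delta))
        upstream = weights[m].T @ delta
    return norms[::-1]          # index 0 is the layer closest to the input

norms = local_gradient_norms()
```

Since the sigmoid derivative never exceeds 0.25, each step down the stack scales the local gradient by a factor well below one in this setup, so `norms[0]` ends up orders of magnitude smaller than `norms[-1]`.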
15 / 55
Bad effect of vanishing (exploding) gradients: two hypotheses
1) increased frequency and severity of bad local minima
2) pathological curvature, like the type seen in the well-known Rosenbrock function:
$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$
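One way to quantify "pathological curvature" is the condition number of the Hessian. The sketch below evaluates it for the Rosenbrock function at its minimum (1, 1); using the condition number as the measure is an illustrative choice, not something stated on the slide.

```python
import math

def rosenbrock(x, y):
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def rosenbrock_hessian(x, y):
    """Analytic second derivatives of the Rosenbrock function."""
    f_xx = 2.0 - 400.0 * y + 1200.0 * x ** 2
    f_xy = -400.0 * x
    f_yy = 200.0
    return f_xx, f_xy, f_yy

def condition_number(f_xx, f_xy, f_yy):
    """Eigenvalue ratio of a symmetric positive-definite 2x2 matrix."""
    trace = f_xx + f_yy
    det = f_xx * f_yy - f_xy ** 2
    root = math.sqrt(trace ** 2 - 4.0 * det)
    return (trace + root) / (trace - root)

cond = condition_number(*rosenbrock_hessian(1.0, 1.0))
```

At the minimum the Hessian is [[802, -400], [-400, 200]], and the ratio of its eigenvalues is roughly 2500: the surface is thousands of times steeper across the valley than along it, which is what slows plain gradient descent.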
16 / 55
Deep Feedforward Neural Networks
• 2-stage training process: i) unsupervised pre-training; ii) fine tuning
(vanishing gradients problem is beaten!).
• Number of hidden layers > 1 (usually 6-9).
• 100K – 100M free parameters.
• No (or less) feature preprocessing stage.
17 / 55
Sparse Autoencoders
18 / 55
Dimensionality reduction
• Use a stacked RBM as deep autoencoder
1. Train RBM with images as input & output
2. Limit one layer to few dimensions
Information has to pass through middle layer
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507. 19 / 55
Dimensionality reduction
Olivetti face data, 25x25 pixel images reconstructed from 30 dimensions (625 → 30)
[Figure: rows of face images labeled Original, Deep RBM, PCA]
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507. 20 / 55
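The PCA baseline in this comparison (625 pixels compressed to 30 numbers and reconstructed) can be sketched with an SVD. The data here is a synthetic stand-in, not the Olivetti faces, and only the linear PCA side is shown; the deep autoencoder is not reproduced.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project rows of X onto the top-k principal components and back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, sorted by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    codes = Xc @ components.T               # k-dimensional codes
    return codes, codes @ components + mean

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "images" of 625 pixels lying near a 30-dim subspace.
latent = rng.normal(size=(200, 30))
mixing = rng.normal(size=(30, 625))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 625))

codes, X_hat = pca_reconstruct(X, k=30)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

On data that really is close to a 30-dimensional linear subspace the relative error is tiny; real face images are not, which is where a nonlinear encoder has room to do better.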
How to use unsupervised pre-training stage / 1
21 / 55
How to use unsupervised pre-training stage / 2
22 / 55
How to use unsupervised pre-training stage / 3
23 / 55
How to use unsupervised pre-training stage / 4
24 / 55
Unlabeled data
Unlabeled data is readily available
Example: Images from the web
1. Download 10’000’000 images
2. Train a 9-layer DNN
3. Concepts are formed by DNN
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507. 25 / 55
Dimensionality reduction
PCA Deep RBM
804’414 Reuters news stories, reduction to 2 dimensions
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507. 26 / 55
Hierarchy of trained representations
Low-level feature → Middle-level feature → Top-level feature
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus
2013]
27 / 55
Hessian-Free optimization: Deep Learning with no pre-training stage
J. Martens. Deep Learning via Hessian-free Optimization // Proceedings of the 27th
International Conference on Machine Learning (ICML), 2010.
28 / 55
FLOPS comparison
https://ru.wikipedia.org/wiki/FLOPS
Type     Name                                         Flops                                                              Cost
Mobile   Raspberry Pi 1st Gen, 700 MHz                0.04 Gflops                                                        $35
Mobile   Apple A8                                     1.4 Gflops                                                         $700 (in iPhone 6)
CPU      Intel Core i7-4930K (Ivy Bridge), 3.7 GHz    140 Gflops                                                         $700
CPU      Intel Core i7-5960X (Haswell), 3.0 GHz       350 Gflops                                                         $1300
GPU      NVidia GTX 980                               4612 Gflops (single precision), 144 Gflops (double precision)     $600 + cost of PC (~$1000)
GPU      NVidia Tesla K80                             8740 Gflops (single precision), 2910 Gflops (double precision)    $4500 + cost of PC (~$1500)
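The table can be turned into a price/performance ranking with a few lines of arithmetic. The sketch below uses the single-precision figures and, for the GPUs, adds the quoted PC cost to the card price; treating cost that way is an assumption made for this example.

```python
# (name, Gflops, total cost in USD) taken from the table above.
devices = [
    ("Raspberry Pi 1st Gen", 0.04, 35),
    ("Apple A8 (iPhone 6)", 1.4, 700),
    ("Intel Core i7-4930K", 140, 700),
    ("Intel Core i7-5960X", 350, 1300),
    ("NVidia GTX 980", 4612, 600 + 1000),    # card + PC
    ("NVidia Tesla K80", 8740, 4500 + 1500), # card + PC
]

def gflops_per_dollar(gflops, cost):
    return gflops / cost

# Best value first.
ranking = sorted(
    ((name, gflops_per_dollar(g, c)) for name, g, c in devices),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranking:
    print(f"{name:24s} {value:8.4f} Gflops/$")
```

By this measure the GTX 980 comes out on top at about 2.9 Gflops per dollar, roughly ten times the better of the two CPUs.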
29 / 55
Deep Networks Training time using GPU
• Pretraining – from 2-3 weeks to 2-3 months.
• Fine-tuning (final supervised training) – from
1 day to 1 week.
30 / 55
Tools for training Deep Neural Networks
D. Kruchinin, E. Dolotov, K. Kornyakov, V. Kustikova, P. Druzhkov. The Comparison of
Deep Learning Libraries on the Problem of Handwritten Digit Classification // Analysis
of Images, Social Networks and Texts (AIST), 2015, April, 9-11th, Yekaterinburg.
31 / 55
Lazy Deep Learning: motivation
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
32 / 55
Lazy Deep Learning: benchmark results
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
33 / 55
Convolutional Neural Networks: Return of Jedi
Andrej Karpathy and Fei-Fei Li. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 34 / 55
AlexNet, 2012 — MeGa HiT
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
35 / 55
AlexNet, results on ILSVRC-2012
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
36 / 55
Convolution Layer
Andrej Karpathy and Fei-Fei Li. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 37 / 55
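What a convolution layer computes for one filter can be sketched as a sliding dot product. This is a naive single-channel version with no padding, written for clarity rather than speed; the function name and the example values are illustrative only.

```python
import numpy as np

def conv2d_naive(image, kernel, stride=1):
    """Valid cross-correlation of a 2-D image with a 2-D kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Receptive field of output unit (i, j).
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
result = conv2d_naive(image, kernel)
```

Here every output equals `image[i, j] - image[i + 1, j + 1]`, which is -5 for this ramp image, so `result` is a 3x3 array filled with -5.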
Pooling layer
Andrej Karpathy and Fei-Fei Li. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
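Max pooling, the variant used in AlexNet-style networks, keeps the largest value in each window. A minimal single-channel sketch, assuming a 2x2 window with stride 2 (common defaults, not taken from the slide):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D feature map without padding."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
pooled = max_pool2d(x)
```

Each 2x2 block collapses to its maximum, so the 4x4 map becomes `[[4, 8], [12, 16]]`.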
38 / 55
Implementation tricks: im2col
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
39 / 55
Implementation tricks: im2col for convolution
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
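The im2col trick unrolls every receptive field into a column so that the whole convolution becomes one matrix product. The sketch below is a single-channel, stride-1, no-padding illustration of the idea; the function names are hypothetical and the loops are kept explicit for readability.

```python
import numpy as np

def im2col(image, kh, kw):
    """Stack every kh x kw patch of the image as one column."""
    H, W = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=image.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = image[i:i + kh, j:j + kw].ravel()
            col += 1
    return cols

def conv2d_im2col(image, kernels):
    """Apply n filters of shape (kh, kw) with a single matrix product."""
    n, kh, kw = kernels.shape
    H, W = image.shape
    cols = im2col(image, kh, kw)
    out = kernels.reshape(n, kh * kw) @ cols   # (n, out_h * out_w)
    return out.reshape(n, H - kh + 1, W - kw + 1)

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
kernels = rng.normal(size=(2, 3, 3))
out = conv2d_im2col(image, kernels)
```

The price is memory, since overlapping patches are duplicated in `cols`; the gain is that the heavy lifting is delegated to an optimized matrix-multiply routine.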
40 / 55
Activation functions
Andrej Karpathy and Fei-Fei Li. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
$f(x) = \max(0, x)$
$f'(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$
ReLU activation function
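In code, ReLU and its derivative are one-liners. The sketch follows the slide's convention of taking the derivative at x = 0 to be 1.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """f'(x) = 1 for x >= 0 and 0 otherwise (slide convention at x = 0)."""
    return (np.asarray(x) >= 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
y = relu(x)
dy = relu_grad(x)
```

Because the derivative is exactly 1 on the active side, it does not shrink the backpropagated signal the way a saturating sigmoid does.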
41 / 55
Development of AlexNet on OpenCV
VGG MatConvNet: CNNs for MATLAB http://www.vlfeat.org/matconvnet/
mexopencv: MATLAB-OpenCV interface http://kyamagu.github.io/mexopencv/matlab
[Diagram: MatConvNet (MATLAB + CUDA) and OpenCV app (C++), linked by YAML files]
42 / 55
ZZ Photo – photo organizer
Free beta version is available on http://zzphoto.me
43 / 55
MIT-8 toy problem: formulation
• 8 classes
• 2688 images in total
• TRAIN: 2000 images, 250 per class
• TEST: 688 images, ~86 per class
S. Banerji, A. Verma, C. Liu. Novel Color LBP Descriptors for Scene and Image
Texture Classification // Cross Disciplinary Biometric Systems, 2012, 15th
International Conference on Image Processing, Computer Vision, and Pattern
Recognition, Las Vegas, Nevada, pp. 205-225. 44 / 55
MIT-8 toy problem: results
#  Method                                               Acc. TRAIN   Acc. TEST
1  LBP + SVM with RBF kernel                            27.2%        19.0%
2  LPQ + SVM with RBF kernel                            38.4%        30.5%
3  LBP + SVM with χ2 kernel                             94.2%        74.0%
4  LPQ + SVM with χ2 kernel                             99.1%        82.2%
5  Deep CNN (AlexNet) + SVM with RBF kernel (LAZY DL)   95.1%        91.8%
6  Deep CNN (AlexNet) + SVM with χ2 kernel (LAZY DL)    100.0%       93.2%
7  Deep CNN (AlexNet) + MLP (LAZY DL)                   100.0%       92.3%
Original results, to be published. 45 / 55
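Several rows pair the features with an SVM that uses a χ2 kernel. The slides do not say which variant was used, so the sketch below assumes the common exponential form K(x, y) = exp(-γ Σ (x_i - y_i)² / (x_i + y_i)) for non-negative feature vectors; `gamma` is a hypothetical parameter name. The resulting matrix could be passed to any SVM implementation that accepts a precomputed kernel.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-squared kernel between rows of X and rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]
    total = X[:, None, :] + Y[None, :, :]
    # Bins where both values are zero contribute nothing (avoids 0 / 0).
    ratio = np.divide(diff ** 2, total,
                      out=np.zeros_like(diff), where=total > 0)
    return np.exp(-gamma * ratio.sum(axis=2))

# Toy non-negative "histogram" features, one row per image.
features = np.array([[1.0, 0.0, 3.0],
                     [0.0, 2.0, 2.0],
                     [1.0, 0.0, 3.0]])
K = chi2_kernel(features, features)
```

Identical rows give a kernel value of 1, and the first two rows give exp(-3.2), since 1/1 + 4/2 + 1/5 = 3.2.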
Pets detection problem (Kaggle Dataset + random Other images)
• Kaggle Dataset + random “other” images;
• 2 classes (cats & dogs vs. other);
• TRAIN: 1,000 samples;
• TEST: 12,000 samples.
46 / 55
Viola-Jones Object Detector
• Very popular for Human Face Detection.
• May be trained for Cat and Dog Face detection.
• Available free in OpenCV library (http://opencv.org).
O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. The Truth about Cats and
Dogs // Proceedings of the International Conference on Computer Vision (ICCV), 2011.
J. Liu, A. Kanazawa, D. Jacobs, P. Belhumeur. Dog Breed Classification Using Part
Localization // Lecture Notes in Computer Science Volume 7572, 2012, pp 172-185.
47 / 55
Images pyramid for Viola-Jones
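An image pyramid lets a fixed-size detection window find objects of different sizes: the image is shrunk repeatedly and the same window is scanned over every level. The sketch below only computes the level sizes; the scale factor of 1.25 and the 24-pixel minimum are assumed example values, and the function name is hypothetical.

```python
def pyramid_sizes(width, height, scale=1.25, min_size=24):
    """List (width, height) of each pyramid level, largest first."""
    sizes = []
    w, h = float(width), float(height)
    # Stop once the detection window no longer fits into the level.
    while min(w, h) >= min_size:
        sizes.append((int(round(w)), int(round(h))))
        w /= scale
        h /= scale
    return sizes

levels = pyramid_sizes(96, 96, scale=2.0, min_size=24)
```

With a 96x96 image and a scale factor of 2 this gives three levels: 96x96, 48x48 and 24x24. A smaller scale factor produces more levels and finer size coverage at a higher scanning cost.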
48 / 55
Viola-Jones Object Detector Classifier Structure
P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple
features // Proceedings of the 2001 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, CVPR 2001.
49 / 55
Pets detection results: FAR vs FRR graphs
Original results, to be published. 50 / 55
Pets detection results: FAR = 0.5%
#  Method                                                   Error
1  Viola-Jones Face Detector for Cats & Dogs + LBP + SVM    79.73%
2  AlexNet, argmax (STANDARD DL, ImageNet-2012, 1000)       26.11%
3  AlexNet, sum (STANDARD DL, ImageNet-2012, 1000)          26.11%
4  AlexNet + SVM linear (LAZY DL)                           4.35%
Original results, to be published. 51 / 55
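Reporting an error at a fixed FAR means choosing the decision threshold so that at most 0.5% of the negative ("other") samples are accepted, then measuring how many true positives are rejected. The sketch below shows that procedure on made-up scores; how the talk's numbers were actually computed is not stated, so this is only one plausible reading.

```python
import numpy as np

def frr_at_far(pos_scores, neg_scores, target_far=0.005):
    """Pick a threshold with FAR <= target_far and return the resulting FRR.

    A sample is accepted when its score is strictly above the threshold.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]  # descending
    n_allowed = int(np.floor(target_far * neg.size))
    threshold = neg[n_allowed]   # only the n_allowed best negatives pass
    far = float(np.mean(neg > threshold))
    frr = float(np.mean(pos <= threshold))
    return threshold, far, frr

# Made-up classifier scores, for illustration only.
neg_scores = np.arange(1000) / 1000.0          # 0.000 ... 0.999
pos_scores = np.array([0.9, 0.995, 0.9999, 1.0])
threshold, far, frr = frr_at_far(pos_scores, neg_scores)
```

In this toy case five of the 1000 negatives are accepted (FAR = 0.5%) and one of the four positives is rejected (FRR = 25%).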
Pets detection results: ROC curve
Original results, to be published. 52 / 55
Labeled Faces in the Wild (LFW) Dataset
G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller. Labeled Faces in the Wild: A
Database for Studying Face Recognition in Unconstrained Environments // University
of Massachusetts, Amherst, Technical Report 07-49, October, 2007
• more than 13,000 images of faces collected from the web.
• Pairs comparison, restricted mode.
• test: 10-fold cross-validation, 6000 face pairs.
53 / 55
Face Recognition on LFW, results
Y. Taigman, M. Yang, M. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-Level
Performance in Face Verification, 2014, CVPR.
#  Method                                       Accuracy
1  Principal Component Analysis (EigenFaces)    60.2%
2  Local Binary Pattern Histograms (LBP)        72.4%
3  Deep CNN (AlexNet) + Euclid (LAZY DL)        71.0%
4  DeepFace by Facebook (STANDARD DL)           97.25%
54 / 55
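The "features + Euclid" style of pair verification can be sketched as a distance threshold: two faces are declared the same person when their feature vectors are close enough. The feature values and the threshold below are made up for illustration; in practice the threshold would be tuned on the training folds.

```python
import numpy as np

def same_person(feats_a, feats_b, threshold):
    """Compare pairs row by row using the Euclidean distance."""
    diff = np.asarray(feats_a, dtype=float) - np.asarray(feats_b, dtype=float)
    dist = np.linalg.norm(diff, axis=1)
    return dist < threshold

# Toy 2-D "features" for three face pairs (real CNN features are much longer).
feats_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
feats_b = np.array([[0.1, 0.0], [4.0, 5.0], [2.0, 0.3]])
labels = np.array([True, False, True])   # ground truth: same person?

pred = same_person(feats_a, feats_b, threshold=0.5)
accuracy = float(np.mean(pred == labels))
```

The pair distances here are 0.1, 5.0 and 0.3, so a threshold of 0.5 classifies all three pairs correctly.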