Transcript of: Analyzing and Introducing Structures in Deep Convolutional Neural Networks
Edouard Oyallon
Estimation problem
• Training set: images used to predict labels such as "rhino" / "rhinos"; a visually similar image may still be not a "rhino".
• High-dimensional classification: how to estimate $y(x)$?
• Data: $(x_i, y_i) \in \mathbb{R}^{224^2} \times \{1, \dots, 1000\}$, $i < 10^6$.
• High variance: how to reduce it? One source is geometric variability: groups acting on images (translation, rotation, scaling).
• What is the nature of the other sources of variability?
Deep learning: a technical breakthrough
• Deep learning has made it possible to solve a large number of tasks that were considered extremely challenging for a computer. Ex.: vision, the game of Go, speech recognition, artistic style transfer…
• The technique is generic, and its success implies that it reduces large sources of variability.
• How, and why?
Deep Convolutional Neural Network
• A cascade of convolutional operators $W_j$ and non-linear operators $\rho_j$:
$x_{j+1} = \rho_j W_j x_j$, so that $x_J = \rho_{J-1} W_{J-1} \cdots \rho_1 W_1 \rho_0 W_0 x_0 \triangleq \Phi x$
• A classifier is applied on top of the final representation $x_J$ (a minimal sketch follows).
Ref.: ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky et al.
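A minimal sketch of this cascade in PyTorch (the depth, width, and choice of ReLU are illustrative assumptions, not the specific networks discussed in this talk):

```python
import torch
import torch.nn as nn

class Cascade(nn.Module):
    """x_{j+1} = rho_j(W_j x_j), followed by a linear classifier."""
    def __init__(self, depth=4, channels=64, classes=1000):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(depth):
            layers += [nn.Conv2d(in_ch, channels, 3, padding=1),  # W_j: convolution
                       nn.ReLU()]                                 # rho_j: non-linearity
            in_ch = channels
        self.phi = nn.Sequential(*layers)               # x -> x_J = Phi(x)
        self.classifier = nn.Linear(channels, classes)

    def forward(self, x):
        xJ = self.phi(x)
        return self.classifier(xJ.mean(dim=(2, 3)))     # spatial average + linear

logits = Cascade()(torch.randn(1, 3, 224, 224))         # shape (1, 1000)
```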
Why is mathematics about deep learning important?
• Pure black box: few mathematical results explain the cascade. Many rely on a low-dimensional "manifold hypothesis", yet the variability in images is too high.
• No stability results: "small" variations of the input might have a large impact on the system. And this does happen.
• No model of the data: we do not understand the nature of the sources of variability that are reduced.
• Shall we learn each layer from scratch (geometric priors?), given that the deep cascade makes features hard to interpret?
Ref.: Intriguing properties of neural networks, C. Szegedy et al.
Ref.: Understanding deep learning requires rethinking generalization, C. Zhang et al.
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
Overview
Problem: how can we incorporate and analyze structures in deep networks?
1. Scattering for complex image classification
2. Beyond scattering: analyzing a supervised CNN
3. Best of both worlds: scattering and CNNs
4. Beyond the Euclidean group: hierarchical CNNs
Fighting the curse of dimensionality
• Objective: build a representation $\Phi x$ of $x$ such that a simple (say, Euclidean) classifier can estimate the label $y$.
• Designing $\Phi$: it must be regular with respect to the class:
$\|\Phi x - \Phi x'\| \le \epsilon \Rightarrow y(x) = y(x')$
• Dimensionality reduction and separation are necessary to break the curse of dimensionality:
$\Phi : \mathbb{R}^D \to \mathbb{R}^d$, with $D \gg d$.
Wavelets
• $\psi$ is a wavelet iff $\int \psi(u)\,du = 0$ and $\int |\psi(u)|^2\,du < \infty$.
• Typically localized in space and frequency.
• Rotation and dilation of a wavelet: $\psi_{j,\theta}(u) = \frac{1}{2^{2j}}\, \psi\!\left(\frac{r_\theta u}{2^j}\right)$.
• Design wavelets selective to rotation variabilities.
(Figure: isotropic $|\hat\psi|$ vs. non-isotropic $|\hat\psi|$.)
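As an illustration, a numerical sketch of such a rotated and dilated family (a 2D Morlet-like wavelet; all parameter values here are assumptions):

```python
import numpy as np

def morlet_2d(M, N, sigma, theta, xi):
    """Zero-mean oriented wavelet: Gaussian envelope times a plane wave."""
    y, x = np.mgrid[-M // 2:M // 2, -N // 2:N // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotate coordinates by theta
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    wave = gauss * np.exp(1j * xi * xr)
    beta = wave.sum() / gauss.sum()                   # enforce int psi = 0
    return wave - beta * gauss

# psi_{j,theta}(u) = 2^{-2j} psi(2^{-j} r_theta u): dilate sigma, shrink xi
filters = {(j, l): morlet_2d(64, 64, sigma=0.8 * 2 ** j, theta=np.pi * l / 8,
                             xi=3 * np.pi / (4 * 2 ** j)) / 2 ** (2 * j)
           for j in range(3) for l in range(8)}
print(abs(filters[0, 0].sum()))                       # ~0: admissibility check
```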
Wavelet Transform
• Wavelet transform: $Wx = \{x \star \psi_{j,\theta},\; x \star \phi_J\}_{\theta,\, j \le J}$.
• Isometric, linear operator on $L^2$, with
$\|Wx\|^2 = \sum_{\theta,\, j \le J} \int |x \star \psi_{j,\theta}|^2 + \int |x \star \phi_J|^2$
• Covariant with translations $L_a$: $W L_a = L_a W$.
• Nearly commutes with diffeomorphisms $L_\tau$: $\|[W, L_\tau]\| \le C \|\nabla \tau\|$.
• A good baseline to describe an image!
(Figure: tiling of the frequency plane $(\omega_1, \omega_2)$ by the wavelet filters.)
Ref.: Group Invariant Scattering, S. Mallat
Scattering Transform
• The scattering transform at scale $J$ is the cascading of a complex wavelet transform with a modulus non-linearity, followed by a low-pass filtering $\phi_J$:
$S_J x = \{\,x \star \phi_J,\; |x \star \psi_{j_1,\theta_1}| \star \phi_J,\; ||x \star \psi_{j_1,\theta_1}| \star \psi_{j_2,\theta_2}| \star \phi_J\,\}$
(the three groups of coefficients correspond to depth: order 0, order 1, order 2).
• Mathematically well defined for a large class of wavelets.
Ref.: Group Invariant Scattering, S. Mallat
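A compact sketch of these three orders (assuming `psis` and `phi_hat` hold filters already in the Fourier domain, e.g. built from the Morlet sketch above; this is not the optimized GPU algorithm discussed later):

```python
import numpy as np

def conv(x, h_hat):
    # circular convolution via the FFT
    return np.fft.ifft2(np.fft.fft2(x) * h_hat)

def scattering(x, psis, phi_hat):
    S = [conv(x, phi_hat).real]                      # order 0: x * phi_J
    for (j1, t1), p1 in psis.items():
        u1 = np.abs(conv(x, p1))                     # |x * psi_{j1,theta1}|
        S.append(conv(u1, phi_hat).real)             # order 1
        for (j2, t2), p2 in psis.items():
            if j2 > j1:                              # energy flows toward coarser scales
                u2 = np.abs(conv(u1, p2))
                S.append(conv(u2, phi_hat).real)     # order 2
    return np.stack(S)                               # (n_paths, H, W)
```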
A successful representation in vision
• Successfully used in several applications:
  - digits: small deformations + translation (Ref.: Invariant Scattering Convolution Networks, J. Bruna and S. Mallat);
  - textures: rotation + scale (Ref.: Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination, L. Sifre and S. Mallat).
• In these settings, all variabilities are known, and the design of the scattering transform is guided by the Euclidean group.
• To which extent can we compete with other architectures on more complex problems (e.g. where variabilities are more complex)?
Separable Roto-Translation Scattering
• A simplification of the roto-translation scattering: instead of a joint convolution $\circledast_{(u,\theta_1)}$ along space and angle, apply a spatial convolution $\star_u$ followed by a separate convolution $\star_{\theta_1}$ along the angle variable.
• It discriminates angular variabilities thanks to a wavelet transform along $\theta_1$ (no averaging!).
• $\psi$: spatial wavelets; $\bar\psi$: angular wavelets; final low-pass $\star\, \phi_J$.
• We combine it with a Gaussian SVM.
(Figure: translation scattering vs. roto-translation scattering vs. separable roto-translation scattering, applied to $x(u_1, u_2)$.)
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
How much learning is really required?
• Few adaptations to the datasets; performances are given without intensive data augmentation.

Dataset                                                 Type          Accuracy
Caltech101 (10^4 images, 101 classes, 256 x 256)        Scattering    80
                                                        Unsupervised  77
                                                        Supervised    93
CIFAR100 (5.10^4 color images, 100 classes, 32 x 32)    Scattering    57
                                                        Unsupervised  61
                                                        Supervised    82

• Scattering is competitive with unsupervised representations.
• How can we explain the gap with supervised?
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
Studying empirically a CNN
Supervised learning: how? why?
• Which ingredient permits CNNs to outperform the scattering transform?
• We introduce a CNN which depends only on its width $K$ and its non-linearity $\rho$, in order to study it.
• Good performance with limited engineering: no max-pooling or ad-hoc non-linear module, only 13 layers.

         Depth  #params  CIFAR10  CIFAR100
Ours     13     28M      95.4     79.6
SGDR     28     150M     96.2     82.3
WResNet  28     37M      95.8     80.0
All-CNN  9      1.3M     92.8     66.6

(Architecture: Input -> B_3 -> B_K x5 -> downsample /2 -> B_K x4 -> downsample /2 -> B_K x3 -> A (spatial averaging) -> L (linear) -> Output. A block $B_l$ maps $x_j$ to $x_{j+1}$ via a $3 \times 3$ convolutional operator $W_j$ followed by $\rho$; a sketch follows.)
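A sketch of such a width-$K$ network in PyTorch (the stage layout follows the diagram above; batch normalization and the exact hyperparameters are assumptions):

```python
import torch.nn as nn

def block(in_ch, out_ch):
    # B_l: 3x3 convolution W_j followed by the non-linearity rho
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

def make_cnn(K=128, classes=10):
    layers, in_ch = [], 3
    for i, n_blocks in enumerate((5, 4, 3)):   # three stages of B_K blocks
        for _ in range(n_blocks):
            layers.append(block(in_ch, K)); in_ch = K
        if i < 2:
            layers.append(nn.AvgPool2d(2))     # the two /2 downsamplings
    return nn.Sequential(*layers,
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # A: averaging
                         nn.Linear(K, classes))                  # L: linear
```

Twelve convolutional blocks plus the final linear layer roughly reproduce the 13-layer design above.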
Example of sandbox application: on the non-linearity in CNNs
• "More non-linear is better." Does the non-linearity need to contract, to be continuous, or to remove the phase?
• Apply the ReLU on only a fraction $\frac{k}{K}$ of the coefficients of a layer $x$:
$\rho_k(x)(\cdot, l) = \begin{cases} \mathrm{ReLU}(x(\cdot, l)), & \text{if } l \le k \\ x(\cdot, l), & \text{otherwise} \end{cases}$
(Figure: accuracy, from 60 to 100, as a function of the ratio $\frac{k}{K} \in \{0, 0.1, 0.3, 0.5, 0.7, 0.9\}$.)
• The non-linearity $\rho(x) = \mathrm{sign}(x)\sqrt{|x| + 0.1}$ also reaches good performance (89% acc.) on CIFAR10.
• Traditional pointwise non-linearities can be weakened (a sketch of both follows).
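A sketch of these two non-linearities (tensor shapes are illustrative):

```python
import torch

def partial_relu(x, k):
    # ReLU on the first k channels only; identity on the remaining K - k
    out = x.clone()
    out[:, :k] = torch.relu(x[:, :k])
    return out

def weak_nonlinearity(x):
    # rho(x) = sign(x) * sqrt(|x| + 0.1): continuous and only weakly contracting
    return torch.sign(x) * torch.sqrt(torch.abs(x) + 0.1)

x = torch.randn(8, 128, 32, 32)    # (batch, K channels, H, W)
y = partial_relu(x, k=64)          # ratio k/K = 0.5
z = weak_nonlinearity(x)
```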
Progressive properties
• We aim to show that the cascade $\rho W_1, \rho W_2, \dots, \rho W_{13}$ acting on $\mathbb{R}^D$ permits a progressive contraction and separation with respect to the depth.
• Localized classifiers (a Gaussian SVM and a Nearest Neighbor (NN) classifier) are evaluated on the features at each depth.
(Figure: SVM and NN accuracies, from 30 to 100, vs. depth $1, \dots, 12$; both improve with depth towards the best performance of the full network.)
• How can we explain it?
• In the following, representations are spatially averaged.
Understanding the progressive improvement
• Averaged intra-class distances at depth $j$, for a class $c$:
$\frac{1}{5000^2} \sum_{y(x_j) = y(x'_j) = c} \|x_j - x'_j\|$
• Cumulated variances: $\Sigma_j^P = \sum_{p \le P} \lambda_p^j$, the cumulative eigenvalues of a PCA of the representation at depth $j$, as a function of the PCA axis $P$.
(Figure: averaged intra-class distances vs. depth $1, \dots, 12$; cumulated variances vs. $P$, one curve per depth $j = 1, \dots, 12$.)
• No clear behavior: can we refine those measures? (A sketch of both measures follows.)
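A sketch of these two diagnostics (the data layout is an assumption: `feats_j` is an (n_samples, dim) array of depth-$j$ features, `labels` the class labels):

```python
import numpy as np

def intra_class_distance(feats_j, labels, c):
    xc = feats_j[labels == c]
    # average pairwise Euclidean distance within class c
    d = np.linalg.norm(xc[:, None, :] - xc[None, :, :], axis=-1)
    return d.mean()

def cumulated_variances(feats_j):
    x = feats_j - feats_j.mean(0)
    # squared singular values / n = PCA eigenvalues lambda_p^j
    lam = np.linalg.svd(x, compute_uv=False) ** 2 / len(x)
    return np.cumsum(lam)    # Sigma_j^P for P = 1, 2, ...
```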
Local Support Vectors: exploring regularity
• Estimating the intrinsic dimension of the classification boundary is hard (curse of dimensionality!).
• Notation: $x_j^{(l)}$ is the $l$-th nearest neighbour of the feature $x_j$ at depth $j$; $y(x)$ is the class of $x$.
• We introduce Local Support Vectors (LSV):
$\Delta_j^1 = \{\,x_j \mid y(x_j^{(1)}) \ne y(x_j)\,\}$
• We extend this definition recursively to define $k$-LSV, the features that are not classified by any $l$-nearest-neighbour vote, $l < k$, at depth $j$:
$\Delta_j^{k+1} = \{\,x_j \in \Delta_j^k \mid \mathrm{card}\{\,l \le k+1 : y(x_j) \ne y(x_j^{(l)})\,\} > \frac{k+1}{2}\,\}$
• Regularity measure at depth $j$: $k$-LSV approximately indicates how local the Euclidean metric is. (A counting sketch follows.)
(Figure: nested sets; e.g. a point in $\Delta^1 \cap (\Delta^2)^c$ is a 1-LSV but not a 2-LSV, while a well-classified point lies in no $\Delta^k$.)
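A sketch of the $k$-LSV counts with scikit-learn (the recursion follows the definition above; the variable names are mine):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsv_counts(feats_j, labels, k_max=10):
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(feats_j)
    _, idx = nn.kneighbors(feats_j)
    nbr_labels = labels[idx[:, 1:]]             # drop self-match (column 0)
    mismatch = (nbr_labels != labels[:, None])  # neighbour label disagreements
    counts, in_delta = [], mismatch[:, 0]       # Delta^1: 1st neighbour wrong
    for k in range(1, k_max):
        counts.append(int(in_delta.sum()))
        # Delta^{k+1}: in Delta^k and a majority of the first k+1 neighbours disagree
        in_delta = in_delta & (mismatch[:, :k + 1].sum(1) > (k + 1) / 2)
    return counts
```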
Progressive contraction-separation measure
• We compute the number of $k$-LSVs $|\Delta_j^k|$ at each depth $j$ on CIFAR10.
(Figure: number of $k$-LSVs, from 0 to 14000, as a function of $k = 1, \dots, 12$, one curve per depth $j$.)
• The number of 1-LSVs decreases with depth: better local separation of the different classes.
• At small depth, convergence in $k$ is slow: several far neighbors are enough to predict an LSV's label.
• At large depth, convergence is fast: only close neighbors permit to predict an LSV's label; the $\ell^2$ metric becomes meaningful, i.e. a contraction occurred.
• How does the separation/contraction happen?
Scattering + CNN
• Can we combine the best of both worlds to better understand CNNs?

            Good perf.  Interpretable
CNN         Yes         No
Scattering  No          Yes

• We demonstrate interpretability, and that no informative variability is lost for training a CNN.
• Remark: scattering is stable and CNNs are unstable, thus Scattering+CNN has no reason to be more stable.
An ideal input for a modern CNN
• Scattering is stable: $\|S_J x - S_J y\| \le \|x - y\|$.
• It linearizes small deformations $L_\tau x(u) = x(u - \tau(u))$:
$\|S_J L_\tau x - S_J x\| \le C \|\nabla \tau\| \|x\|$
• It is invariant to local translations: if $|a| \ll 2^J$, then $S_J L_a x \approx S_J x$.
• For $g \in SO_2(\mathbb{R})$, acting by $g.x(u) \triangleq x(g^{-1}u)$ for all $u$, $S_J x(u, \lambda)$ is covariant with $SO_2(\mathbb{R})$:
$S_J(g.x)(u, \lambda) = S_J x(g^{-1}u,\, g^{-1}\lambda) \triangleq g.S_J x(u, \lambda)$
Ref.: Group Invariant Scattering, S. Mallat
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Scaling scattering on GPUs
(Figure: computation flows, 1st wavelet, modulus, 2nd wavelet, low-pass, for the ScatNet algorithm vs. the proposed algorithm.)
• The proposed algorithm saves a lot of memory and yields a x100 speed-up on GPU.
Ref.: Thesis, EO
ImageNet/CIFAR
• State-of-the-art results on ImageNet2012 with $x \to S_J x \to$ ResNet:

                Top 1  Top 5  #params
Scat+ResNet-10  69     90     12.8M
VGG-16          69     90     138M
ResNet-18       69     89     11.7M
ResNet-200      79     95     64.7M

• Demonstrates no loss of information + fewer layers.
• Scattering + a 5-layer perceptron on CIFAR: 85% acc. (SOTA w.r.t. non-convolutional learned representations).
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Benchmarking small data
• We show that incorporating geometrical invariants helps learning (with limited adaptation), again with $x \to S_J x \to$ ResNet.
• State-of-the-art results on STL10 and CIFAR10.
• CIFAR10 (10 classes), keeping 100, 500 and 1000 training samples and testing on 10k:

#train       100  500  1000  Full
WRN 16-8     35   47   60    96
VGG 16       26   47   56    93
Scat+ResNet  38   55   62    93

• STL10: 5k training, 8k testing, 10 classes + 100k unlabeled images (not used!!):

             Acc.
Scat+ResNet  76
Supervised   70
Unsupervised 76

Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Shared Local Encoder
• On top of the scattering $S_4$, we cascade $1 \times 1$ convolutions $W_1, W_2, W_3$ (the encoder): the same weights are applied at each spatial location, each covering a $2^4 \times 2^4$ pixel neighbourhood, followed by fully connected layers $W_4, W_5, W_6$ (the classifier).
• AlexNet performance with only $1 \times 1$ convolutions:

          Top 1  Top 5
Scat+SLE  57     80
FV+FCs    56     79
FV+SVM    54     75
AlexNet   56     81

• Outperforms unsupervised encoders based on SIFT + Fisher Vectors (FV).
• Is it really a classifier?
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
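A sketch of the SLE in PyTorch (the channel counts and widths are assumptions; $C = 417$ is only an illustrative number of scattering channels):

```python
import torch
import torch.nn as nn

C = 417  # number of scattering channels (illustrative)
sle = nn.Sequential(
    nn.Conv2d(C, 1024, kernel_size=1), nn.ReLU(),     # W1: shared local encoder
    nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(),  # W2
    nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(),  # W3
)
classifier = nn.Sequential(nn.Flatten(),
                           nn.Linear(1024 * 14 * 14, 1000))  # FC layers W4..W6 collapsed

Sx = torch.randn(1, C, 14, 14)     # S_4(x) on a 14x14 spatial grid (224 / 2^4)
logits = classifier(sle(Sx))
```

The $1 \times 1$ convolutions act on the scattering channels independently at each position, i.e. a shared MLP on each local scattering descriptor.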
A local descriptor for classification
• We analyse the scattering's encoder $W_1 \circ S_4$, which is a descriptor on neighbourhoods of size $2^4 \times 2^4$ pixels.
• Good transfer-learning performance on Caltech101 (83%)! Analogous to previously reported performances.
• The atoms' indices of $W_1$ are structured by the orders 0, 1, 2 of $S_4$:
$(W_1 S_4 x)_k = w_{0,k}\,(x \star \phi_J) + \sum_{j_1,\theta_1} w_{(j_1,\theta_1),k}\,(|x \star \psi_{j_1,\theta_1}| \star \phi_J) + \sum_{j_2,j_1,\theta_1,\theta_2} w_{(j_1,j_2,\theta_1,\theta_2),k}\,(||x \star \psi_{j_2,\theta_2}| \star \psi_{j_1,\theta_1}| \star \phi_J)$
• What is the nature of the recombination? Take a Fourier transform of the weights along the angles:
Fourier along $\theta_1$: $\hat w_{(j_1,\omega_{\theta_1}),k} = \mathcal{F}_{\theta_1}(w_{(j_1,\cdot),k})(\omega_{\theta_1})$
Fourier along $(\theta_1, \theta_2)$: $\hat w_{(j_1,j_2,\omega_{\theta_1},\omega_{\theta_2}),k} = \mathcal{F}_{(\theta_1,\theta_2)}(w_{(j_1,j_2,\cdot,\cdot),k})(\omega_{\theta_1}, \omega_{\theta_2})$
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Ref.: Visualizing and Understanding Convolutional Networks, M. Zeiler, R. Fergus
Analysis of w
• Method: similar to the analysis of AlexNet's first layer.
• Energy of the weights at each angular frequency:
$\Omega_1(\omega_{\theta_1}) = \sum_{k,j_1} |\hat w_{(j_1,\omega_{\theta_1}),k}|^2$
$\Omega_2(\omega_{\theta_1}, \omega_{\theta_2}) = \sum_{k,j_1,j_2} |\hat w_{(j_1,j_2,\omega_{\theta_1},\omega_{\theta_2}),k}|^2$
(Figure: distributions of the order-1 and order-2 weight amplitudes; $\Omega_1$ as a function of $\omega_{\theta_1} \in [-3, 3]$, and $\Omega_2$ over the $(\omega_{\theta_1}, \omega_{\theta_2})$ plane on a log scale, both concentrated at low angular frequencies.)
• The Fourier basis sparsifies the operator: invariance to rotation is explicitly learned.
• Thresholding 80% of the coefficients in Fourier costs only a 2% accuracy loss.
• Can we find more complex invariances than rotation?
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
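A sketch of these statistics (the weight-tensor layouts and sizes are assumptions, with random weights standing in for the trained ones):

```python
import numpy as np

K, J, L = 1024, 4, 8                      # atoms, scales, angles (illustrative)
w1 = np.random.randn(K, J, L)             # w_{(j1, theta1), k}
w2 = np.random.randn(K, J, J, L, L)       # w_{(j1, j2, theta1, theta2), k}

w1_hat = np.fft.fft(w1, axis=-1)          # Fourier along theta1
w2_hat = np.fft.fft2(w2, axes=(-2, -1))   # Fourier along (theta1, theta2)

Omega1 = (np.abs(w1_hat) ** 2).sum(axis=(0, 1))     # Omega_1(omega_theta1)
Omega2 = (np.abs(w2_hat) ** 2).sum(axis=(0, 1, 2))  # Omega_2(omega_theta1, omega_theta2)

# Sparsity check: energy kept after thresholding 80% of the Fourier coefficients
thresh = np.quantile(np.abs(w2_hat), 0.8)
kept = np.abs(w2_hat) >= thresh
print((np.abs(w2_hat[kept]) ** 2).sum() / (np.abs(w2_hat) ** 2).sum())
```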
Identifying the variabilities?
• CNNs exhibit some covariance w.r.t. variabilities.
• If we understood how this is done, we could use it to engineer CNNs with nice properties. For a linear displacement operator $L$:
$\sup_L \frac{\|\Phi L x - \Phi x\|}{\|L x - x\|} < \infty \;\Rightarrow\; \exists\ \text{a "weak" derivative } \partial_x \Phi \;\Rightarrow\; \Phi L x \approx \Phi x + \partial_x \Phi\, L + o(\|L\|)$
(up to a projection).
Ref.: Understanding deep features with computer-generated imagery, M. Aubry, B. Russell
Introducing structures to understand better
• SLE learned invariance: invariants to a group are explicitly built via convolutions.
• Are all the non-informative variabilities linked to group actions?
• Can we structure the channel axis as we did?
• We will apply n-d convolutions, with n > 3, i.e. beyond the order of the scattering convolutions (which is 3); a numerical sketch follows:
$x \star k\,(u_1, \dots, u_n) = \sum_{v_1, \dots, v_n} x(u_1 - v_1, \dots, u_n - v_n)\, k(v_1, \dots, v_n)$
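A numerical sketch of such an n-d convolution (here $n = 4$; all shapes are illustrative):

```python
import numpy as np
from scipy.ndimage import convolve

x = np.random.randn(8, 8, 16, 16)   # x(u1, ..., u4): two spatial + two structured axes
k = np.random.randn(3, 3, 5, 5)     # kernel k(v1, ..., v4)
y = convolve(x, k, mode='wrap')     # circular n-d convolution over all four axes
print(y.shape)                      # (8, 8, 16, 16)
```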
Objective: flattening the variability
(Figure: standard CNN layers $x(u, \lambda)$, where the channel index $\lambda$ is unordered, vs. organised layers, where an order is defined on the neurons of a layer so that $\lambda = (\lambda_1, \lambda_2)$.)
Symmetry group hypothesis
• To each classification problem corresponds a canonical and unique symmetry group $G$, which is high-dimensional:
$\forall x, \forall g \in G,\; \Phi x = \Phi(g.x)$
• We hypothesise that there exist Lie groups $G_0 \subset G_1 \subset \dots \subset G_J \subset G$ and CNNs $\Phi_j$ such that, writing $x_j = \Phi_j(x)$:
$\forall g_j \in G_j,\; \Phi_j(g_j.x) = \Phi_j(x)$
• Examples are given by the Euclidean group: $G_0 = \mathbb{R}^2$, $G_1 = G_0 \ltimes SL_2(\mathbb{R})$.
Ref.: Understanding deep convolutional networks, S. Mallat
Multiscale Hierarchical CNN
• A CNN that is convolutional along the channel axis: each new layer adds one variable,
$x_{j+1}(v_1, \dots, v_j, v_{j+1}) = \rho_j\big(x_j \star_{v_1, \dots, v_j} k_{v_{j+1}}\big)(v_1, \dots, v_j)$
where $\star_{v_1, \dots, v_j}$ denotes a convolution along the spatial variable $u$ and the channel indices $v_1, \dots, v_j$, and $k_{v_{j+1}}$ the kernel indexed by the new variable.
• For $x_j$, we refer to the variable $v_j$ as an attribute that discriminates the previously obtained layer.
• The representation is finally averaged, making it invariant along translations by $v$:
$x_J(v_J) = \sum_{v_1, \dots, v_{J-1}} x_{J-1}(v_1, \dots, v_{J-1}, v_J)$
Ref.: Multiscale Hierarchical Convolutional Networks, J. Jacobsen, EO, S. Mallat, A.W.M. Smeulders
Ref.: Understanding deep convolutional networks, S. Mallat
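A sketch of one such layer (my reading of the formula above, under assumptions: PyTorch's Conv3d treats the previous attribute $v_j$ as a convolution axis, and the output channels enumerate the new attribute $v_{j+1}$):

```python
import torch
import torch.nn as nn

B, V, H, W = 4, 16, 32, 32
xj = torch.randn(B, 1, V, H, W)   # x_j(v_j, u), with a single "true" channel

# convolve jointly along (v_j, u1, u2); out_channels indexes v_{j+1}
conv = nn.Conv3d(1, V, kernel_size=(5, 3, 3), padding=(2, 1, 1))
xj1 = torch.relu(conv(xj))        # x_{j+1}(v_{j+1}, v_j, u)
print(xj1.shape)                  # (4, 16, 16, 32, 32)
```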
Hierarchical CNN: numerical results
(Figure: number of parameters, in millions, of the HCNN vs. an equivalent CNN.)
• We demonstrate a reduction in #params while reaching 91% on CIFAR10.
• Translations along $v_{J-1}$ are present in the last layer $x_J(v_{J-1}, v_J)$, but not in the previous layers.
• Incorporating more structures? A modelling issue?
Ref.: Multiscale Hierarchical Convolutional Networks, J. Jacobsen, EO, S. Mallat, A.W.M. Smeulders
Conclusion
• The problem was to introduce and analyze structures in CNNs.
• We demonstrated that:
  - state-of-the-art performance is possible without learning the first-layer weights;
  - CNNs progressively contract and separate the classes;
  - Scattering + CNN is a robust baseline;
  - and we implemented a new class of CNNs.
Thank you!