Transcript of: Analyzing and Introducing Structures in Deep Convolutional Neural Networks
Edouard Oyallon
Estimation problem
• Training set: images used to predict labels such as "rhino" / "rhinos"; a visually similar image may still be not a "rhino".
• High-dimensional classification: how to estimate $y(x)$?
• Data: $(x_i, y_i) \in \mathbb{R}^{224^2} \times \{1, \dots, 1000\}$, $i < 10^6$.
• High variance: how to reduce it? One source is geometric variability: groups acting on images (translation, rotation, scaling).
• What is the nature of the other sources of variability?
Deep learning: a technical breakthrough
• Deep learning has made it possible to solve a large number of tasks that were considered extremely challenging for a computer. Ex.: vision, the game of Go, speech recognition, artistic style transfer…
• The technique is generic, and its success implies that it reduces large sources of variability.
• How, and why?
Deep Convolutional Neural Network
• A cascade of convolutional operators $W_j$ and non-linear operators $\rho_j$:
$x_{j+1} = \rho_j W_j x_j$, so that $x_J = \rho_{J-1} W_{J-1} \cdots \rho_1 W_1 \rho_0 W_0 x_0 \triangleq \Phi x$
• A classifier is applied on top of the final representation $x_J$ (a minimal sketch follows).
Ref.: ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky et al.
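A minimal sketch of this cascade in PyTorch (the depth, width, and choice of ReLU are illustrative assumptions, not the specific networks discussed in this talk):

```python
import torch
import torch.nn as nn

class Cascade(nn.Module):
    """x_{j+1} = rho_j(W_j x_j), followed by a linear classifier."""
    def __init__(self, depth=4, channels=64, classes=1000):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(depth):
            layers += [nn.Conv2d(in_ch, channels, 3, padding=1),  # W_j: convolution
                       nn.ReLU()]                                 # rho_j: non-linearity
            in_ch = channels
        self.phi = nn.Sequential(*layers)               # x -> x_J = Phi(x)
        self.classifier = nn.Linear(channels, classes)

    def forward(self, x):
        xJ = self.phi(x)
        return self.classifier(xJ.mean(dim=(2, 3)))     # spatial average + linear

logits = Cascade()(torch.randn(1, 3, 224, 224))         # shape (1, 1000)
```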
Why is mathematics about deep learning important?
• Pure black box: few mathematical results explain the cascade. Many rely on a low-dimensional "manifold hypothesis", yet the variability in images is too high.
• No stability results: "small" variations of the input might have a large impact on the system. And this does happen.
• No model of the data: we do not understand the nature of the sources of variability that are reduced.
• Shall we learn each layer from scratch (geometric priors?), given that the deep cascade makes features hard to interpret?
Ref.: Intriguing properties of neural networks, C. Szegedy et al.
Ref.: Understanding deep learning requires rethinking generalization, C. Zhang et al.
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
Overview
Problem: how can we incorporate and analyze structures in deep networks?
1. Scattering for complex image classification
2. Beyond scattering: analyzing a supervised CNN
3. Best of both worlds: scattering and CNNs
4. Beyond the Euclidean group: hierarchical CNNs
Fighting the curse of dimensionality
• Objective: build a representation $\Phi x$ of $x$ such that a simple (say, Euclidean) classifier can estimate the label $y$.
• Designing $\Phi$: it must be regular with respect to the class:
$\|\Phi x - \Phi x'\| \le \epsilon \Rightarrow y(x) = y(x')$
• Dimensionality reduction and separation are necessary to break the curse of dimensionality:
$\Phi : \mathbb{R}^D \to \mathbb{R}^d$, with $D \gg d$.
Wavelets
• $\psi$ is a wavelet iff $\int \psi(u)\,du = 0$ and $\int |\psi(u)|^2\,du < \infty$.
• Typically localized in space and frequency.
• Rotation and dilation of a wavelet: $\psi_{j,\theta}(u) = \frac{1}{2^{2j}}\, \psi\!\left(\frac{r_\theta u}{2^j}\right)$.
• Design wavelets selective to rotation variabilities.
(Figure: isotropic $|\hat\psi|$ vs. non-isotropic $|\hat\psi|$.)
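As an illustration, a numerical sketch of such a rotated and dilated family (a 2D Morlet-like wavelet; all parameter values here are assumptions):

```python
import numpy as np

def morlet_2d(M, N, sigma, theta, xi):
    """Zero-mean oriented wavelet: Gaussian envelope times a plane wave."""
    y, x = np.mgrid[-M // 2:M // 2, -N // 2:N // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotate coordinates by theta
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    wave = gauss * np.exp(1j * xi * xr)
    beta = wave.sum() / gauss.sum()                   # enforce int psi = 0
    return wave - beta * gauss

# psi_{j,theta}(u) = 2^{-2j} psi(2^{-j} r_theta u): dilate sigma, shrink xi
filters = {(j, l): morlet_2d(64, 64, sigma=0.8 * 2 ** j, theta=np.pi * l / 8,
                             xi=3 * np.pi / (4 * 2 ** j)) / 2 ** (2 * j)
           for j in range(3) for l in range(8)}
print(abs(filters[0, 0].sum()))                       # ~0: admissibility check
```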
Wavelet Transform
• Wavelet transform: $Wx = \{x \star \psi_{j,\theta},\; x \star \phi_J\}_{\theta,\, j \le J}$.
• Isometric, linear operator on $L^2$, with
$\|Wx\|^2 = \sum_{\theta,\, j \le J} \int |x \star \psi_{j,\theta}|^2 + \int |x \star \phi_J|^2$
• Covariant with translations $L_a$: $W L_a = L_a W$.
• Nearly commutes with diffeomorphisms $L_\tau$: $\|[W, L_\tau]\| \le C \|\nabla \tau\|$.
• A good baseline to describe an image!
(Figure: tiling of the frequency plane $(\omega_1, \omega_2)$ by the wavelet filters.)
Ref.: Group Invariant Scattering, S. Mallat
Scattering Transform
• The scattering transform at scale $J$ is the cascading of a complex wavelet transform with a modulus non-linearity, followed by a low-pass filtering $\phi_J$:
$S_J x = \{\,x \star \phi_J,\; |x \star \psi_{j_1,\theta_1}| \star \phi_J,\; ||x \star \psi_{j_1,\theta_1}| \star \psi_{j_2,\theta_2}| \star \phi_J\,\}$
(the three groups of coefficients correspond to depth: order 0, order 1, order 2).
• Mathematically well defined for a large class of wavelets.
Ref.: Group Invariant Scattering, S. Mallat
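A compact sketch of these three orders (assuming `psis` and `phi_hat` hold filters already in the Fourier domain, e.g. built from the Morlet sketch above; this is not the optimized GPU algorithm discussed later):

```python
import numpy as np

def conv(x, h_hat):
    # circular convolution via the FFT
    return np.fft.ifft2(np.fft.fft2(x) * h_hat)

def scattering(x, psis, phi_hat):
    S = [conv(x, phi_hat).real]                      # order 0: x * phi_J
    for (j1, t1), p1 in psis.items():
        u1 = np.abs(conv(x, p1))                     # |x * psi_{j1,theta1}|
        S.append(conv(u1, phi_hat).real)             # order 1
        for (j2, t2), p2 in psis.items():
            if j2 > j1:                              # energy flows toward coarser scales
                u2 = np.abs(conv(u1, p2))
                S.append(conv(u2, phi_hat).real)     # order 2
    return np.stack(S)                               # (n_paths, H, W)
```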
A successful representation in vision
• Successfully used in several applications:
  - digits: small deformations + translation (Ref.: Invariant Scattering Convolution Networks, J. Bruna and S. Mallat);
  - textures: rotation + scale (Ref.: Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination, L. Sifre and S. Mallat).
• In these settings, all variabilities are known, and the design of the scattering transform is guided by the Euclidean group.
• To which extent can we compete with other architectures on more complex problems (e.g. where variabilities are more complex)?
Separable Roto-Translation Scattering
• A simplification of the roto-translation scattering: instead of a joint convolution $\circledast_{(u,\theta_1)}$ along space and angle, apply a spatial convolution $\star_u$ followed by a separate convolution $\star_{\theta_1}$ along the angle variable.
• It discriminates angular variabilities thanks to a wavelet transform along $\theta_1$ (no averaging!).
• $\psi$: spatial wavelets; $\bar\psi$: angular wavelets; final low-pass $\star\, \phi_J$.
• We combine it with a Gaussian SVM.
(Figure: translation scattering vs. roto-translation scattering vs. separable roto-translation scattering, applied to $x(u_1, u_2)$.)
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
How much learning is really required?
• Few adaptations to the datasets; performances are given without intensive data augmentation.

Dataset                                                 Type          Accuracy
Caltech101 (10^4 images, 101 classes, 256 x 256)        Scattering    80
                                                        Unsupervised  77
                                                        Supervised    93
CIFAR100 (5.10^4 color images, 100 classes, 32 x 32)    Scattering    57
                                                        Unsupervised  61
                                                        Supervised    82

• Scattering is competitive with unsupervised representations.
• How can we explain the gap with supervised?
Ref.: Deep Roto-Translation Scattering for Object Classification, EO and S. Mallat
Studying empirically a CNN
Supervised learning: how? why?
• Which ingredient permits CNNs to outperform the scattering transform?
• We introduce a CNN which depends only on its width $K$ and its non-linearity $\rho$, in order to study it.
• Good performance with limited engineering: no max-pooling or ad-hoc non-linear module, only 13 layers.

         Depth  #params  CIFAR10  CIFAR100
Ours     13     28M      95.4     79.6
SGDR     28     150M     96.2     82.3
WResNet  28     37M      95.8     80.0
All-CNN  9      1.3M     92.8     66.6

(Architecture: Input -> B_3 -> B_K x5 -> downsample /2 -> B_K x4 -> downsample /2 -> B_K x3 -> A (spatial averaging) -> L (linear) -> Output. A block $B_l$ maps $x_j$ to $x_{j+1}$ via a $3 \times 3$ convolutional operator $W_j$ followed by $\rho$; a sketch follows.)
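A sketch of such a width-$K$ network in PyTorch (the stage layout follows the diagram above; batch normalization and the exact hyperparameters are assumptions):

```python
import torch.nn as nn

def block(in_ch, out_ch):
    # B_l: 3x3 convolution W_j followed by the non-linearity rho
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

def make_cnn(K=128, classes=10):
    layers, in_ch = [], 3
    for i, n_blocks in enumerate((5, 4, 3)):   # three stages of B_K blocks
        for _ in range(n_blocks):
            layers.append(block(in_ch, K)); in_ch = K
        if i < 2:
            layers.append(nn.AvgPool2d(2))     # the two /2 downsamplings
    return nn.Sequential(*layers,
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # A: averaging
                         nn.Linear(K, classes))                  # L: linear
```

Twelve convolutional blocks plus the final linear layer roughly reproduce the 13-layer design above.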
Example of sandbox application: on the non-linearity in CNNs
• "More non-linear is better." Does the non-linearity need to contract, to be continuous, or to remove the phase?
• Apply the ReLU on only a fraction $\frac{k}{K}$ of the coefficients of a layer $x$:
$\rho_k(x)(\cdot, l) = \begin{cases} \mathrm{ReLU}(x(\cdot, l)), & \text{if } l \le k \\ x(\cdot, l), & \text{otherwise} \end{cases}$
(Figure: accuracy, from 60 to 100, as a function of the ratio $\frac{k}{K} \in \{0, 0.1, 0.3, 0.5, 0.7, 0.9\}$.)
• The non-linearity $\rho(x) = \mathrm{sign}(x)\sqrt{|x| + 0.1}$ also reaches good performance (89% acc.) on CIFAR10.
• Traditional pointwise non-linearities can be weakened (a sketch of both follows).
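A sketch of these two non-linearities (tensor shapes are illustrative):

```python
import torch

def partial_relu(x, k):
    # ReLU on the first k channels only; identity on the remaining K - k
    out = x.clone()
    out[:, :k] = torch.relu(x[:, :k])
    return out

def weak_nonlinearity(x):
    # rho(x) = sign(x) * sqrt(|x| + 0.1): continuous and only weakly contracting
    return torch.sign(x) * torch.sqrt(torch.abs(x) + 0.1)

x = torch.randn(8, 128, 32, 32)    # (batch, K channels, H, W)
y = partial_relu(x, k=64)          # ratio k/K = 0.5
z = weak_nonlinearity(x)
```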
Progressive properties
• We aim to show that the cascade $\rho W_1, \rho W_2, \dots, \rho W_{13}$ acting on $\mathbb{R}^D$ permits a progressive contraction and separation with respect to the depth.
• Localized classifiers (a Gaussian SVM and a Nearest Neighbor (NN) classifier) are evaluated on the features at each depth.
(Figure: SVM and NN accuracies, from 30 to 100, vs. depth $1, \dots, 12$; both improve with depth towards the best performance of the full network.)
• How can we explain it?
• In the following, representations are spatially averaged.
Understanding the progressive improvement
• Averaged intra-class distances at depth $j$, for a class $c$:
$\frac{1}{5000^2} \sum_{y(x_j) = y(x'_j) = c} \|x_j - x'_j\|$
• Cumulated variances: $\Sigma_j^P = \sum_{p \le P} \lambda_p^j$, the cumulative eigenvalues of a PCA of the representation at depth $j$, as a function of the PCA axis $P$.
(Figure: averaged intra-class distances vs. depth $1, \dots, 12$; cumulated variances vs. $P$, one curve per depth $j = 1, \dots, 12$.)
• No clear behavior: can we refine those measures? (A sketch of both measures follows.)
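A sketch of these two diagnostics (the data layout is an assumption: `feats_j` is an (n_samples, dim) array of depth-$j$ features, `labels` the class labels):

```python
import numpy as np

def intra_class_distance(feats_j, labels, c):
    xc = feats_j[labels == c]
    # average pairwise Euclidean distance within class c
    d = np.linalg.norm(xc[:, None, :] - xc[None, :, :], axis=-1)
    return d.mean()

def cumulated_variances(feats_j):
    x = feats_j - feats_j.mean(0)
    # squared singular values / n = PCA eigenvalues lambda_p^j
    lam = np.linalg.svd(x, compute_uv=False) ** 2 / len(x)
    return np.cumsum(lam)    # Sigma_j^P for P = 1, 2, ...
```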
Local Support Vectors: exploring regularity
• Estimating the intrinsic dimension of the classification boundary is hard (curse of dimensionality!).
• Notation: $x_j^{(l)}$ is the $l$-th nearest neighbour of the feature $x_j$ at depth $j$; $y(x)$ is the class of $x$.
• We introduce Local Support Vectors (LSV):
$\Delta_j^1 = \{\,x_j \mid y(x_j^{(1)}) \ne y(x_j)\,\}$
• We extend this definition recursively to define $k$-LSV, the features that are not classified by any $l$-nearest-neighbour vote, $l < k$, at depth $j$:
$\Delta_j^{k+1} = \{\,x_j \in \Delta_j^k \mid \mathrm{card}\{\,l \le k+1 : y(x_j) \ne y(x_j^{(l)})\,\} > \frac{k+1}{2}\,\}$
• Regularity measure at depth $j$: $k$-LSV approximately indicates how local the Euclidean metric is. (A counting sketch follows.)
(Figure: nested sets; e.g. a point in $\Delta^1 \cap (\Delta^2)^c$ is a 1-LSV but not a 2-LSV, while a well-classified point lies in no $\Delta^k$.)
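A sketch of the $k$-LSV counts with scikit-learn (the recursion follows the definition above; the variable names are mine):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsv_counts(feats_j, labels, k_max=10):
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(feats_j)
    _, idx = nn.kneighbors(feats_j)
    nbr_labels = labels[idx[:, 1:]]             # drop self-match (column 0)
    mismatch = (nbr_labels != labels[:, None])  # neighbour label disagreements
    counts, in_delta = [], mismatch[:, 0]       # Delta^1: 1st neighbour wrong
    for k in range(1, k_max):
        counts.append(int(in_delta.sum()))
        # Delta^{k+1}: in Delta^k and a majority of the first k+1 neighbours disagree
        in_delta = in_delta & (mismatch[:, :k + 1].sum(1) > (k + 1) / 2)
    return counts
```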
Progressive contraction-separation measure
• We compute the number of $k$-LSVs $|\Delta_j^k|$ at each depth $j$ on CIFAR10.
(Figure: number of $k$-LSVs, from 0 to 14000, as a function of $k = 1, \dots, 12$, one curve per depth $j$.)
• The number of 1-LSVs decreases with depth: better local separation of the different classes.
• At small depth, convergence in $k$ is slow: several far neighbors are enough to predict an LSV's label.
• At large depth, convergence is fast: only close neighbors permit to predict an LSV's label; the $\ell^2$ metric becomes meaningful, i.e. a contraction occurred.
• How does the separation/contraction happen?
Scattering + CNN
• Can we combine the best of both worlds to better understand CNNs?

            Good perf.  Interpretable
CNN         Yes         No
Scattering  No          Yes

• We demonstrate interpretability, and that no informative variability is lost for training a CNN.
• Remark: scattering is stable and CNNs are unstable, thus Scattering+CNN has no reason to be more stable.
An ideal input for a modern CNN
• Scattering is stable: $\|S_J x - S_J y\| \le \|x - y\|$.
• It linearizes small deformations $L_\tau x(u) = x(u - \tau(u))$:
$\|S_J L_\tau x - S_J x\| \le C \|\nabla \tau\| \|x\|$
• It is invariant to local translations: if $|a| \ll 2^J$, then $S_J L_a x \approx S_J x$.
• For $g \in SO_2(\mathbb{R})$, acting by $g.x(u) \triangleq x(g^{-1}u)$ for all $u$, $S_J x(u, \lambda)$ is covariant with $SO_2(\mathbb{R})$:
$S_J(g.x)(u, \lambda) = S_J x(g^{-1}u,\, g^{-1}\lambda) \triangleq g.S_J x(u, \lambda)$
Ref.: Group Invariant Scattering, S. Mallat
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Scaling scattering on GPUs
(Figure: computation flows, 1st wavelet, modulus, 2nd wavelet, low-pass, for the ScatNet algorithm vs. the proposed algorithm.)
• The proposed algorithm saves a lot of memory and yields a x100 speed-up on GPU.
Ref.: Thesis, EO
ImageNet/CIFAR
• State-of-the-art results on ImageNet2012 with $x \to S_J x \to$ ResNet:

                Top 1  Top 5  #params
Scat+ResNet-10  69     90     12.8M
VGG-16          69     90     138M
ResNet-18       69     89     11.7M
ResNet-200      79     95     64.7M

• Demonstrates no loss of information + fewer layers.
• Scattering + a 5-layer perceptron on CIFAR: 85% acc. (SOTA w.r.t. non-convolutional learned representations).
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Benchmarking small data
• We show that incorporating geometrical invariants helps learning (with limited adaptation), again with $x \to S_J x \to$ ResNet.
• State-of-the-art results on STL10 and CIFAR10.
• CIFAR10 (10 classes), keeping 100, 500 and 1000 training samples and testing on 10k:

#train       100  500  1000  Full
WRN 16-8     35   47   60    96
VGG 16       26   47   56    93
Scat+ResNet  38   55   62    93

• STL10: 5k training, 8k testing, 10 classes + 100k unlabeled images (not used!!):

             Acc.
Scat+ResNet  76
Supervised   70
Unsupervised 76

Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Shared Local Encoder
• On top of the scattering $S_4$, we cascade $1 \times 1$ convolutions $W_1, W_2, W_3$ (the encoder): the same weights are applied at each spatial location, each covering a $2^4 \times 2^4$ pixel neighbourhood, followed by fully connected layers $W_4, W_5, W_6$ (the classifier).
• AlexNet performance with only $1 \times 1$ convolutions:

          Top 1  Top 5
Scat+SLE  57     80
FV+FCs    56     79
FV+SVM    54     75
AlexNet   56     81

• Outperforms unsupervised encoders based on SIFT + Fisher Vectors (FV).
• Is it really a classifier?
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
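A sketch of the SLE in PyTorch (the channel counts and widths are assumptions; $C = 417$ is only an illustrative number of scattering channels):

```python
import torch
import torch.nn as nn

C = 417  # number of scattering channels (illustrative)
sle = nn.Sequential(
    nn.Conv2d(C, 1024, kernel_size=1), nn.ReLU(),     # W1: shared local encoder
    nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(),  # W2
    nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(),  # W3
)
classifier = nn.Sequential(nn.Flatten(),
                           nn.Linear(1024 * 14 * 14, 1000))  # FC layers W4..W6 collapsed

Sx = torch.randn(1, C, 14, 14)     # S_4(x) on a 14x14 spatial grid (224 / 2^4)
logits = classifier(sle(Sx))
```

The $1 \times 1$ convolutions act on the scattering channels independently at each position, i.e. a shared MLP on each local scattering descriptor.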
A local descriptor for classification
• We analyse the scattering's encoder $W_1 \circ S_4$, which is a descriptor on neighbourhoods of size $2^4 \times 2^4$ pixels.
• Good transfer-learning performance on Caltech101 (83%)! Analogous to previously reported performances.
• The atoms' indices of $W_1$ are structured by the orders 0, 1, 2 of $S_4$:
$(W_1 S_4 x)_k = w_{0,k}\,(x \star \phi_J) + \sum_{j_1,\theta_1} w_{(j_1,\theta_1),k}\,(|x \star \psi_{j_1,\theta_1}| \star \phi_J) + \sum_{j_2,j_1,\theta_1,\theta_2} w_{(j_1,j_2,\theta_1,\theta_2),k}\,(||x \star \psi_{j_2,\theta_2}| \star \psi_{j_1,\theta_1}| \star \phi_J)$
• What is the nature of the recombination? Take a Fourier transform of the weights along the angles:
Fourier along $\theta_1$: $\hat w_{(j_1,\omega_{\theta_1}),k} = \mathcal{F}_{\theta_1}(w_{(j_1,\cdot),k})(\omega_{\theta_1})$
Fourier along $(\theta_1, \theta_2)$: $\hat w_{(j_1,j_2,\omega_{\theta_1},\omega_{\theta_2}),k} = \mathcal{F}_{(\theta_1,\theta_2)}(w_{(j_1,j_2,\cdot,\cdot),k})(\omega_{\theta_1}, \omega_{\theta_2})$
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
Ref.: Visualizing and Understanding Convolutional Networks, M. Zeiler, R. Fergus
Analysis of w
• Method: similar to the analysis of AlexNet's first layer.
• Energy of the weights at each angular frequency:
$\Omega_1(\omega_{\theta_1}) = \sum_{k,j_1} |\hat w_{(j_1,\omega_{\theta_1}),k}|^2$
$\Omega_2(\omega_{\theta_1}, \omega_{\theta_2}) = \sum_{k,j_1,j_2} |\hat w_{(j_1,j_2,\omega_{\theta_1},\omega_{\theta_2}),k}|^2$
(Figure: distributions of the order-1 and order-2 weight amplitudes; $\Omega_1$ as a function of $\omega_{\theta_1} \in [-3, 3]$, and $\Omega_2$ over the $(\omega_{\theta_1}, \omega_{\theta_2})$ plane on a log scale, both concentrated at low angular frequencies.)
• The Fourier basis sparsifies the operator: invariance to rotation is explicitly learned.
• Thresholding 80% of the coefficients in Fourier costs only a 2% accuracy loss.
• Can we find more complex invariances than rotation?
Ref.: Scaling the Scattering Transform: Deep Hybrid Networks, EO, E. Belilovsky, S. Zagoruyko
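A sketch of these statistics (the weight-tensor layouts and sizes are assumptions, with random weights standing in for the trained ones):

```python
import numpy as np

K, J, L = 1024, 4, 8                      # atoms, scales, angles (illustrative)
w1 = np.random.randn(K, J, L)             # w_{(j1, theta1), k}
w2 = np.random.randn(K, J, J, L, L)       # w_{(j1, j2, theta1, theta2), k}

w1_hat = np.fft.fft(w1, axis=-1)          # Fourier along theta1
w2_hat = np.fft.fft2(w2, axes=(-2, -1))   # Fourier along (theta1, theta2)

Omega1 = (np.abs(w1_hat) ** 2).sum(axis=(0, 1))     # Omega_1(omega_theta1)
Omega2 = (np.abs(w2_hat) ** 2).sum(axis=(0, 1, 2))  # Omega_2(omega_theta1, omega_theta2)

# Sparsity check: energy kept after thresholding 80% of the Fourier coefficients
thresh = np.quantile(np.abs(w2_hat), 0.8)
kept = np.abs(w2_hat) >= thresh
print((np.abs(w2_hat[kept]) ** 2).sum() / (np.abs(w2_hat) ** 2).sum())
```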
Identifying the variabilities?
• CNNs exhibit some covariance w.r.t. variabilities.
• If we understood how this is done, we could use it to engineer CNNs with nice properties. For a linear displacement operator $L$:
$\sup_L \frac{\|\Phi L x - \Phi x\|}{\|L x - x\|} < \infty \;\Rightarrow\; \exists\ \text{a "weak" derivative } \partial_x \Phi \;\Rightarrow\; \Phi L x \approx \Phi x + \partial_x \Phi\, L + o(\|L\|)$
(up to a projection).
Ref.: Understanding deep features with computer-generated imagery, M. Aubry, B. Russell
Introducing structures to understand better
• SLE learned invariance: invariants to a group are explicitly built via convolutions.
• Are all the non-informative variabilities linked to group actions?
• Can we structure the channel axis as we did?
• We will apply n-d convolutions, with n > 3, i.e. beyond the order of the scattering convolutions (which is 3); a numerical sketch follows:
$x \star k\,(u_1, \dots, u_n) = \sum_{v_1, \dots, v_n} x(u_1 - v_1, \dots, u_n - v_n)\, k(v_1, \dots, v_n)$
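A numerical sketch of such an n-d convolution (here $n = 4$; all shapes are illustrative):

```python
import numpy as np
from scipy.ndimage import convolve

x = np.random.randn(8, 8, 16, 16)   # x(u1, ..., u4): two spatial + two structured axes
k = np.random.randn(3, 3, 5, 5)     # kernel k(v1, ..., v4)
y = convolve(x, k, mode='wrap')     # circular n-d convolution over all four axes
print(y.shape)                      # (8, 8, 16, 16)
```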
Objective: flattening the variability
(Figure: standard CNN layers $x(u, \lambda)$, where the channel index $\lambda$ is unordered, vs. organised layers, where an order is defined on the neurons of a layer so that $\lambda = (\lambda_1, \lambda_2)$.)
Symmetry group hypothesis
• To each classification problem corresponds a canonical and unique symmetry group $G$, which is high-dimensional:
$\forall x, \forall g \in G,\; \Phi x = \Phi(g.x)$
• We hypothesise that there exist Lie groups $G_0 \subset G_1 \subset \dots \subset G_J \subset G$ and CNNs $\Phi_j$ such that, writing $x_j = \Phi_j(x)$:
$\forall g_j \in G_j,\; \Phi_j(g_j.x) = \Phi_j(x)$
• Examples are given by the Euclidean group: $G_0 = \mathbb{R}^2$, $G_1 = G_0 \ltimes SL_2(\mathbb{R})$.
Ref.: Understanding deep convolutional networks, S. Mallat
Multiscale Hierarchical CNN
• A CNN that is convolutional along the channel axis: each new layer adds one variable,
$x_{j+1}(v_1, \dots, v_j, v_{j+1}) = \rho_j\big(x_j \star_{v_1, \dots, v_j} k_{v_{j+1}}\big)(v_1, \dots, v_j)$
where $\star_{v_1, \dots, v_j}$ denotes a convolution along the spatial variable $u$ and the channel indices $v_1, \dots, v_j$, and $k_{v_{j+1}}$ the kernel indexed by the new variable.
• For $x_j$, we refer to the variable $v_j$ as an attribute that discriminates the previously obtained layer.
• The representation is finally averaged, making it invariant along translations by $v$:
$x_J(v_J) = \sum_{v_1, \dots, v_{J-1}} x_{J-1}(v_1, \dots, v_{J-1}, v_J)$
Ref.: Multiscale Hierarchical Convolutional Networks, J. Jacobsen, EO, S. Mallat, A.W.M. Smeulders
Ref.: Understanding deep convolutional networks, S. Mallat
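A sketch of one such layer (my reading of the formula above, under assumptions: PyTorch's Conv3d treats the previous attribute $v_j$ as a convolution axis, and the output channels enumerate the new attribute $v_{j+1}$):

```python
import torch
import torch.nn as nn

B, V, H, W = 4, 16, 32, 32
xj = torch.randn(B, 1, V, H, W)   # x_j(v_j, u), with a single "true" channel

# convolve jointly along (v_j, u1, u2); out_channels indexes v_{j+1}
conv = nn.Conv3d(1, V, kernel_size=(5, 3, 3), padding=(2, 1, 1))
xj1 = torch.relu(conv(xj))        # x_{j+1}(v_{j+1}, v_j, u)
print(xj1.shape)                  # (4, 16, 16, 32, 32)
```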
Hierarchical CNN: numerical results
(Figure: number of parameters, in millions, of the HCNN vs. an equivalent CNN.)
• We demonstrate a reduction in #params while reaching 91% on CIFAR10.
• Translations along $v_{J-1}$ are present in the last layer $x_J(v_{J-1}, v_J)$, but not in the previous layers.
• Incorporating more structures? A modelling issue?
Ref.: Multiscale Hierarchical Convolutional Networks, J. Jacobsen, EO, S. Mallat, A.W.M. Smeulders
Conclusion
• The problem was to introduce and analyze structures in CNNs.
• We demonstrated that:
  - state-of-the-art performance is possible without learning the first-layer weights;
  - CNNs progressively contract and separate the classes;
  - Scattering + CNN is a robust baseline;
  - and we implemented a new class of CNNs.
Thank you!