Harmonic Analysis of Deep Convolutional Neural Networks

Helmut Bolcskei

Department of Information Technology and Electrical Engineering

October 2017

joint work with Thomas Wiatowski and Philipp Grohs

ImageNet

[Example ImageNet images and their CNN-assigned labels: ski, rock, coffee, plant]

CNNs win the ImageNet 2015 challenge [He et al., 2015]

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”

Feature extraction and classification

input: f = [image]

non-linear feature extraction  →  feature vector Φ(f)  →  linear classifier

output:  Shannon,      if 〈w, Φ(f)〉 > 0
         von Neumann,  if 〈w, Φ(f)〉 < 0

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

Directly on the data f (decision rule 〈w, f〉 > 0 vs. 〈w, f〉 < 0): not possible!

On the feature vector Φ(f) = [‖f‖, 1]ᵀ (decision rule 〈w, Φ(f)〉 > 0 vs. 〈w, Φ(f)〉 < 0): possible with w = [1, −1]ᵀ

⇒ Φ is invariant to the angular component of the data

⇒ Linear separability in feature space!
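
This separation is easy to verify numerically. The following is a minimal sketch (my own, not from the talk), assuming two toy classes drawn on circles of radii 0.5 and 1.5: in the data domain no linear classifier separates them, but after the feature map Φ(f) = [‖f‖, 1]ᵀ the weight vector w = [1, −1]ᵀ does.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(radius, n=200):
    """Sample n points on a circle of the given radius (plus small noise)."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.05 * rng.standard_normal(n)
    return np.stack([r * np.cos(angles), r * np.sin(angles)], axis=1)

inner, outer = sample_class(0.5), sample_class(1.5)   # two categories of data

# Feature map Phi(f) = [||f||, 1]^T: invariant to the angular component of the data
def phi(points):
    return np.stack([np.linalg.norm(points, axis=1), np.ones(len(points))], axis=1)

# Linear classifier w = [1, -1]^T in feature space: sign(<w, Phi(f)>) = sign(||f|| - 1)
w = np.array([1.0, -1.0])
pred_inner = phi(inner) @ w   # negative for the inner circle (||f|| < 1)
pred_outer = phi(outer) @ w   # positive for the outer circle (||f|| > 1)

print("inner circle correctly classified:", np.mean(pred_inner < 0))
print("outer circle correctly classified:", np.mean(pred_outer > 0))
```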

Translation invariance

Handwritten digits from the MNIST database [LeCun & Cortes, 1998]

Feature vector should be invariant to spatial location ⇒ translation invariance

Deformation insensitivity

Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

[Network tree: the input f is convolved with filters gλ1 and passed through a modulus in the first layer, giving feature maps |f ∗ gλ1|; these are filtered again in the second layer, giving ||f ∗ gλ1| ∗ gλ2|, and so on. The feature vector Φ(f) collects the low-pass filtered outputs · ∗ χ1, · ∗ χ2, · ∗ χ3 of every layer.]

General scattering networks guarantee [Wiatowski & HB, 2015]

- (vertical) translation invariance
- small deformation sensitivity

essentially irrespective of filters, non-linearities, and poolings!
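
To make the construction concrete, here is a minimal sketch (my own, with hypothetical Gabor-style band-pass filters gλ and a Gaussian low-pass χ; not the exact filters of the references) of a two-layer scattering-type computation on a 1-D signal: each layer convolves with band-pass filters and takes the modulus, and the outputs of every layer are low-pass filtered by χ and collected into Φ(f).

```python
import numpy as np

N = 256
t = np.arange(N)
f = np.cos(2 * np.pi * 0.11 * t) * np.exp(-((t - 128) / 40.0) ** 2)   # toy input signal

def bandpass(center, width):
    """Hypothetical Gabor-style band-pass filter g_lambda (time domain)."""
    x = np.arange(-32, 33)
    return np.exp(-x**2 / (2 * width**2)) * np.cos(2 * np.pi * center * x)

def lowpass(width=8.0):
    """Gaussian low-pass output filter chi."""
    x = np.arange(-32, 33)
    h = np.exp(-x**2 / (2 * width**2))
    return h / h.sum()

g_layer1 = [bandpass(c, 4.0) for c in (0.05, 0.11, 0.2)]   # filters of layer 1
g_layer2 = [bandpass(c, 4.0) for c in (0.05, 0.2)]         # filters of layer 2
chi = lowpass()

conv = lambda a, b: np.convolve(a, b, mode="same")

features = [conv(f, chi)]                        # layer-0 output: f * chi
u1 = [np.abs(conv(f, g)) for g in g_layer1]      # propagated signals |f * g_lambda1|
features += [conv(u, chi) for u in u1]           # layer-1 outputs: |f * g_lambda1| * chi
u2 = [np.abs(conv(u, g)) for u in u1 for g in g_layer2]   # ||f * g_lambda1| * g_lambda2|
features += [conv(u, chi) for u in u2]           # layer-2 outputs

Phi = np.concatenate(features)                   # feature vector Phi(f)
print("number of feature maps:", len(features), " feature vector length:", Phi.size)
```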

Building blocks

Basic operations in the n-th network layer

[Diagram: in layer n, the input is convolved with each filter gλn and then passed through a non-linearity and a pooling operation]

Filters: Semi-discrete frame Ψn := {χn} ∪ {gλn}λn∈Λn

    An‖f‖₂² ≤ ‖f ∗ χn‖₂² + Σ_{λn∈Λn} ‖f ∗ gλn‖₂² ≤ Bn‖f‖₂²,  ∀f ∈ L2(Rd)

e.g.: structured filters, unstructured filters, or learned filters

Non-linearities: Point-wise and Lipschitz-continuous

    ‖Mn(f) − Mn(h)‖₂ ≤ Ln‖f − h‖₂,  ∀ f, h ∈ L2(Rd)

⇒ Satisfied by virtually all non-linearities used in the deep learning literature!
e.g.: ReLU: Ln = 1; modulus: Ln = 1; logistic sigmoid: Ln = 1/4; ...

Pooling: In continuous time according to

    f ↦ Sn^{d/2} Pn(f)(Sn·),

where Sn ≥ 1 is the pooling factor and Pn : L2(Rd) → L2(Rd) is Rn-Lipschitz-continuous

⇒ Emulates most poolings used in the deep learning literature!
e.g.: pooling by sub-sampling Pn(f) = f with Rn = 1, or pooling by averaging Pn(f) = f ∗ φn with Rn = ‖φn‖₁
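
By Parseval (and up to the Fourier normalization convention), the semi-discrete frame condition is equivalent to An ≤ |χ̂n(ω)|² + Σ_{λn} |ĝλn(ω)|² ≤ Bn for a.e. ω, so the frame bounds of a concrete filter set can be estimated on a frequency grid. A small sketch with hypothetical Gaussian filters:

```python
import numpy as np

# Frequency grid and example filters, specified by their frequency responses
# (hypothetical Gaussian band-pass filters plus a Gaussian low-pass chi).
omega = np.linspace(-np.pi, np.pi, 1024)
chi_hat = np.exp(-omega**2 / (2 * 0.3**2))                       # low-pass filter chi_n
g_hats = [np.exp(-(np.abs(omega) - c)**2 / (2 * 0.3**2))         # band-pass filters g_lambda_n
          for c in (0.8, 1.6, 2.4)]

# Littlewood-Paley sum |chi^(w)|^2 + sum_lambda |g_lambda^(w)|^2
lp = np.abs(chi_hat)**2 + sum(np.abs(g)**2 for g in g_hats)

A_n, B_n = lp.min(), lp.max()   # estimated frame bounds
print(f"A_n ≈ {A_n:.3f}, B_n ≈ {B_n:.3f}")
# Dividing every filter by sqrt(B_n) normalizes the upper frame bound to 1, which
# (for the modulus and sub-sampling, where L_n = R_n = 1) satisfies the condition
# B_n <= min{1, L_n^-2 R_n^-2} used in the invariance theorem below.
```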

Vertical translation invariance

Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy

    Bn ≤ min{1, Ln⁻² Rn⁻²},  ∀n ∈ N.

Let the pooling factors be Sn ≥ 1, n ∈ N. Then,

    |||Φn(Ttf) − Φn(f)||| = O( ‖t‖ / (S1 · … · Sn) ),

for all f ∈ L2(Rd), t ∈ Rd, n ∈ N.

⇒ Features become more invariant with increasing network depth!

Full translation invariance: If lim_{n→∞} S1 · S2 · … · Sn = ∞, then

    lim_{n→∞} |||Φn(Ttf) − Φn(f)||| = 0.

The condition Bn ≤ min{1, Ln⁻² Rn⁻²}, ∀n ∈ N, is easily satisfied by normalizing the filters {gλn}λn∈Λn.

⇒ applies to general filters, non-linearities, and poolings
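
The statement can be probed numerically. The sketch below (my own toy construction: Gabor-like filters, modulus non-linearity, sub-sampling by S = 2, Gaussian output low-pass) measures the relative feature distance between a signal and its translate at depths n = 1, 2, 3; by the theorem this distance should shrink roughly like ‖t‖/(S1 · … · Sn), though the exact numbers depend on the toy filters chosen here.

```python
import numpy as np

N, S = 1024, 2                             # signal length, pooling factor per layer
t_grid = np.arange(N)
f = np.exp(-((t_grid - 200) / 25.0) ** 2) * np.cos(2 * np.pi * 0.07 * t_grid)  # smooth toy signal
f_shift = np.roll(f, 3)                    # translated input T_t f with t = 3 samples

x = np.arange(-32, 33)
chi = np.exp(-x**2 / (2 * 12.0**2)); chi /= chi.sum()            # output low-pass filter
g_filters = [np.exp(-x**2 / (2 * 6.0**2)) * np.cos(2 * np.pi * c * x)
             for c in (0.05, 0.12)]                              # toy band-pass filters

conv = lambda a, b: np.convolve(a, b, mode="same")

def layer(signals):
    """One network layer: filtering, modulus non-linearity, pooling by sub-sampling."""
    return [np.abs(conv(u, g))[::S] for u in signals for g in g_filters]

def feature(signals):
    """Layer output Phi_n: low-pass filtering with chi followed by sub-sampling."""
    return np.concatenate([conv(u, chi)[::S] for u in signals])

u, u_shift = [f], [f_shift]
for n in range(1, 4):
    u, u_shift = layer(u), layer(u_shift)
    d = np.linalg.norm(feature(u) - feature(u_shift)) / np.linalg.norm(feature(u))
    # The theorem predicts that |||Phi_n(T_t f) - Phi_n(f)||| shrinks roughly like
    # ||t|| / (S_1 ... S_n) as n grows; exact values depend on these toy filters.
    print(f"depth n = {n}: relative feature distance ≈ {d:.4f}")
```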

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:

    lim_{J→∞} |||ΦW(Ttf) − ΦW(f)||| = 0,  ∀f ∈ L2(Rd), ∀t ∈ Rd

- features become invariant in every network layer, but needs J → ∞
- applies to the wavelet transform and modulus non-linearity without pooling

“Vertical” translation invariance:

    lim_{n→∞} |||Φn(Ttf) − Φn(f)||| = 0,  ∀f ∈ L2(Rd), ∀t ∈ Rd

- features become more invariant with increasing network depth
- applies to general filters, general non-linearities, and general poolings

Non-linear deformations

Non-linear deformation (Fτf)(x) = f(x − τ(x)), where τ : Rd → Rd

[Figures: the same image under a “small” and under a “large” deformation τ]

Deformation sensitivity for signal classes

Consider (Fτf)(x) = f(x − τ(x)) = f(x − e^{−x²})

[Plots of f1(x) vs. (Fτf1)(x) and of f2(x) vs. (Fτf2)(x): the same τ deforms the two signals to very different extents]

For given τ, the amount of deformation induced can depend drastically on f ∈ L2(Rd)
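
This can be illustrated numerically. The sketch below (my own) applies the deformation τ(x) = e^{−x²} to a slowly varying signal and to a rapidly oscillating one (both hypothetical choices) and prints the relative L2 change, which differs drastically between the two.

```python
import numpy as np

x = np.linspace(-5, 5, 4001)
dx = x[1] - x[0]
tau = np.exp(-x**2)                       # deformation tau(x) = e^{-x^2}

def deform(f_func):
    """Return f and the deformed signal (F_tau f)(x) = f(x - tau(x))."""
    return f_func(x), f_func(x - tau)

l2 = lambda u: np.sqrt(np.sum(np.abs(u) ** 2) * dx)

f1 = lambda s: np.exp(-s**2 / 8.0)                      # slowly varying signal
f2 = lambda s: np.exp(-s**2 / 8.0) * np.cos(20 * s)     # rapidly oscillating signal

for name, f_func in [("f1 (slowly varying)", f1), ("f2 (oscillating)", f2)]:
    f_val, f_def = deform(f_func)
    print(f"{name}: relative L2 deformation error = {l2(f_def - f_val) / l2(f_val):.3f}")
```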

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:

    |||ΦW(Fτf) − ΦW(f)||| ≤ C (2^{−J} ‖τ‖∞ + J ‖Dτ‖∞ + ‖D²τ‖∞) ‖f‖W,

for all f ∈ HW ⊆ L2(Rd)

- The signal class HW and the corresponding norm ‖·‖W depend on the mother wavelet (and hence the network)
- Signal class description complexity is implicit via the norm ‖·‖W
- The bound depends explicitly on higher-order derivatives of τ
- The bound is coupled to horizontal translation invariance
  lim_{J→∞} |||ΦW(Ttf) − ΦW(f)||| = 0, ∀f ∈ L2(Rd), ∀t ∈ Rd

Our deformation sensitivity bound:

    |||Φ(Fτf) − Φ(f)||| ≤ C_C ‖τ‖∞^α,  ∀f ∈ C ⊆ L2(Rd)

- The signal class C (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network
- Signal class description complexity is explicit via C_C:
  L-band-limited functions: C_C = O(L); cartoon functions of size K: C_C = O(K^{3/2}); M-Lipschitz functions: C_C = O(M)
- Decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)
- The bound implicitly depends on the derivative of τ via the condition ‖Dτ‖∞ ≤ 1/(2d)
- The bound is decoupled from vertical translation invariance
  lim_{n→∞} |||Φn(Ttf) − Φn(f)||| = 0, ∀f ∈ L2(Rd), ∀t ∈ Rd

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!

e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]

- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion

Such depths (and breadths) pose formidable computational challenges in training and operating the network!

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee a trivial null-space for the feature extractor Φ

Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector

For a fixed (possibly small) depth, design CNNs that capture “most” of the input signal energy

Building blocks

Basic operations in the n-th network layer

f → gλn → | · | → ↓S   (for each filter gλn in layer n)

Filters: Semi-discrete frame Ψn := {χn} ∪ {gλn}λn∈Λn
Non-linearity: Modulus | · |
Pooling: Sub-sampling with pooling factor S ≥ 1

Demodulation effect of modulus non-linearity

Components of the feature vector are given by |f ∗ gλn| ∗ χn+1

[Spectral illustration: f̂(ω) is band-pass filtered by ĝλn(ω), so f̂(ω)·ĝλn(ω) lives away from ω = 0; the modulus moves spectral energy towards ω = 0, where (|f ∗ gλn|)^(ω) is then captured by the low-pass filter χn+1 and passed on to Φ(f)]

Do all non-linearities demodulate?

[Figure: spectrum F(f ∗ gλ) of a high-pass filtered signal, supported away from ω = 0, with marks at ±2R]

- Modulus: Yes! |F(|f ∗ gλ|)| is concentrated around ω = 0 ... but has (small) tails
- Modulus squared: Yes, and sharply so! ... but | · |² is not Lipschitz-continuous!
- Rectified linear unit: No!
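
The demodulation effect can be checked numerically. The sketch below (my own, using a toy modulated Gaussian as a stand-in for a high-pass filtered signal f ∗ gλ) prints the fraction of spectral energy near ω = 0 after applying the modulus, the squared modulus, and the ReLU; the modulus-based non-linearities move a large share of the energy to low frequencies, the ReLU much less so. (The “sharply so” statement about | · |² refers to its spectrum being compactly supported, which this crude energy fraction does not capture.)

```python
import numpy as np

N = 4096
n = np.arange(N)
# A toy "high-pass filtered" signal f * g_lambda: a smooth envelope modulated to high frequency
envelope = np.exp(-((n - N / 2) / 300.0) ** 2)
x = envelope * np.cos(2 * np.pi * 0.3 * n)

def low_freq_energy_fraction(u, cutoff=0.05):
    """Fraction of the spectral energy of u at normalized frequencies |freq| < cutoff."""
    U = np.fft.fft(u)
    freqs = np.fft.fftfreq(N)
    return np.sum(np.abs(U[np.abs(freqs) < cutoff]) ** 2) / np.sum(np.abs(U) ** 2)

for name, y in [("modulus |.|", np.abs(x)),
                ("modulus squared |.|^2", np.abs(x) ** 2),
                ("ReLU", np.maximum(x, 0.0)),
                ("no non-linearity", x)]:
    print(f"{name:22s} low-frequency energy fraction: {low_freq_energy_fraction(y):.3f}")
```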

First goal: Quantify feature map energy decay

[Scattering tree as before, annotated with W1(f) and W2(f): Wn(f) denotes the total energy of the propagated signals (feature maps) in the n-th network layer, e.g. the |f ∗ gλ1| in the first layer and the ||f ∗ gλ1| ∗ gλ2| in the second]

Assumptions (on the filters)

i) Analyticity: For every filter gλn there exists a (not necessarily canonical) orthant Hλn ⊆ Rd such that

    supp(ĝλn) ⊆ Hλn.

ii) High-pass: There exists δ > 0 such that

    Σ_{λn∈Λn} |ĝλn(ω)|² = 0,  a.e. ω ∈ Bδ(0).

⇒ Comprises various constructions of WH filters, wavelets, ridgelets, (α)-curvelets, shearlets

[Figure: frequency-domain (ω1, ω2) tiling of analytic band-limited curvelets]

Input signal classes

Sobolev functions of order s ≥ 0:

    Hs(Rd) = { f ∈ L2(Rd) | ∫_Rd (1 + |ω|²)^s |f̂(ω)|² dω < ∞ }

Hs(Rd) contains a wide range of practically relevant signal classes

- square-integrable functions L2(Rd) = H0(Rd)
- L-band-limited functions L2_L(Rd) ⊆ Hs(Rd), ∀L > 0, ∀s ≥ 0
- cartoon functions [Donoho, 2001] C_CART ⊆ Hs(Rd), ∀s ∈ [0, 1/2)

[Example: handwritten digits from the MNIST database [LeCun & Cortes, 1998]]

Exponential energy decay

Theorem
Let the filters be wavelets with mother wavelet satisfying

    supp(ψ̂) ⊆ [r⁻¹, r],  r > 1,

or Weyl-Heisenberg (WH) filters with prototype function satisfying

    supp(ĝ) ⊆ [−R, R],  R > 0.

Then, for every f ∈ Hs(Rd), there exists β > 0 such that

    Wn(f) = O( a^{−n(2s+β)/(2s+β+1)} ),

where a = (r² + 1)/(r² − 1) in the wavelet case, and a = 1/2 + 1/R in the WH case.

⇒ the decay factor a is explicit and can be tuned via r, R

Exponential energy decay

Exponential energy decay:

    Wn(f) = O( a^{−n(2s+β)/(2s+β+1)} )

- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to

    |f̂(ω)| ≤ µ (1 + |ω|²)^{−(s/2 + 1/4 + β/4)},  ∀ |ω| ≥ L,

  for some µ > 0, and L acts as an “effective bandwidth”

- smoother input signals (i.e., s ↑) lead to faster energy decay

- pooling through sub-sampling f ↦ S^{1/2} f(S·) leads to decay factor a/S

What about general filters? ⇒ polynomial energy decay!
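
As a small worked example (my own numbers), the decay factor a and the resulting bound on Wn(f) can be computed directly from the formulas above:

```python
import numpy as np

def decay_factor_wavelet(r):   # a = (r^2 + 1) / (r^2 - 1), for r > 1
    return (r**2 + 1) / (r**2 - 1)

def decay_factor_wh(R):        # a = 1/2 + 1/R, for R > 0
    return 0.5 + 1.0 / R

def energy_bound(a, n, s, beta):
    """Upper bound O(a^{-n(2s+beta)/(2s+beta+1)}) on W_n(f), up to the implicit constant."""
    return a ** (-n * (2 * s + beta) / (2 * s + beta + 1))

a_wav, a_wh = decay_factor_wavelet(r=2.0), decay_factor_wh(R=1.0)
print(f"wavelets (r = 2):   a = {a_wav:.3f}")
print(f"WH filters (R = 1): a = {a_wh:.3f}")

# Decay of the bound across layers for s = 1, beta = 1 (values chosen for illustration only)
for n in range(1, 6):
    print(f"n = {n}: wavelet bound ∝ {energy_bound(a_wav, n, 1.0, 1.0):.4f}, "
          f"WH bound ∝ {energy_bound(a_wh, n, 1.0, 1.0):.4f}")
```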

... our second goal ... trivial null-space for Φ

Why trivial null-space?

[Feature-space figure: a linear classifier w separates 〈w, Φ(f)〉 > 0 from 〈w, Φ(f)〉 < 0; a feature vector Φ(f∗) = 0 lies on every such decision hyperplane]

Non-trivial null-space: ∃ f∗ ≠ 0 such that Φ(f∗) = 0

⇒ 〈w, Φ(f∗)〉 = 0 for all w!

⇒ these f∗ become unclassifiable!

... our second goal ...

Trivial null-space for the feature extractor:  { f ∈ L2(Rd) | Φ(f) = 0 } = { 0 }

The feature extractor Φ(·) = ⋃_{n=0}^∞ Φn(·) shall satisfy

    A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,  ∀f ∈ L2(Rd),

for some A, B > 0.

“Energy conservation”

Theorem
For the frame upper bounds {Bn}n∈N and frame lower bounds {An}n∈N, define B := ∏_{n=1}^∞ max{1, Bn} and A := ∏_{n=1}^∞ min{1, An}. If

    0 < A ≤ B < ∞,

then

    A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,  ∀ f ∈ L2(Rd).

- For Parseval frames (i.e., An = Bn = 1, n ∈ N), this yields |||Φ(f)|||² = ‖f‖₂²

- Connection to energy decay:

    ‖f‖₂² = Σ_{k=0}^{n−1} |||Φk(f)|||² + Wn(f),  where Wn(f) → 0
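
The layer-wise decomposition can be verified numerically for a toy Parseval filter bank (my own construction: indicator-band filters whose squared frequency responses sum to one). Since the modulus preserves the L2 norm, the output energy of layer 0 plus the propagated energy W1(f) equals ‖f‖₂² exactly:

```python
import numpy as np

N = 1024
rng = np.random.default_rng(1)
f = rng.standard_normal(N)                      # toy input signal
freqs = np.abs(np.fft.fftfreq(N))               # normalized frequencies in [0, 0.5]

# Toy Parseval filter bank in the frequency domain: the squared magnitudes of the
# low-pass chi and the band-pass filters g1, g2 sum to one at every frequency.
chi_hat = (freqs < 0.10).astype(float)
g1_hat = ((freqs >= 0.10) & (freqs < 0.25)).astype(float)
g2_hat = (freqs >= 0.25).astype(float)
assert np.allclose(chi_hat**2 + g1_hat**2 + g2_hat**2, 1.0)

filt = lambda u, h_hat: np.fft.ifft(np.fft.fft(u) * h_hat)   # circular convolution

energy = lambda u: np.sum(np.abs(u) ** 2)
phi0_energy = energy(filt(f, chi_hat))                                           # |||Phi_0(f)|||^2
w1_energy = energy(np.abs(filt(f, g1_hat))) + energy(np.abs(filt(f, g2_hat)))    # W_1(f)

print(f"||f||^2              = {energy(f):.4f}")
print(f"|||Phi_0|||^2 + W_1  = {phi0_energy + w1_energy:.4f}")   # equal, up to round-off
```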

... and our third goal ...

For a given CNN, specify the number of layers needed to capture “most” of the input signal energy

How many layers n are needed to have at least ((1 − ε) · 100)% of the input signal energy be contained in the feature vector, i.e.,

    (1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φk(f)|||² ≤ ‖f‖₂²,  ∀f ∈ L2(Rd)?

Number of layers needed

Theorem
Let the frame bounds satisfy An = Bn = 1, n ∈ N. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If

    n ≥ ⌈ log_a( L / (1 − √(1 − ε)) ) ⌉,

then

    (1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φk(f)|||² ≤ ‖f‖₂².

⇒ also guarantees a trivial null-space for ⋃_{k=0}^n Φk(f)

- the lower bound depends on
  - the description complexity of the input signals (i.e., the bandwidth L)
  - the decay factor (wavelets: a = (r² + 1)/(r² − 1), WH filters: a = 1/2 + 1/R)

- similar estimates hold for Sobolev input signals and for general filters (polynomial decay!)

Number of layers needed

Numerical example for bandwidth L = 1:

(1 − ε)               0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)         2     3      4     6      8     11
WH filters (R = 1)       2     4      5     8     10     14
general filters          2     3      7    19     39    199

Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]

- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
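
The wavelet and WH rows of this table follow directly from n ≥ ⌈log_a(L / (1 − √(1 − ε)))⌉ with a = (r² + 1)/(r² − 1) and a = 1/2 + 1/R. A short check (my own sketch; the "general filters" row relies on the polynomial-decay estimate and is not reproduced here):

```python
import math

def layers_needed(a, L, eps):
    """Smallest n with n >= ceil(log_a(L / (1 - sqrt(1 - eps))))."""
    return math.ceil(math.log(L / (1.0 - math.sqrt(1.0 - eps)), a))

L = 1.0
a_wavelet = (2.0**2 + 1) / (2.0**2 - 1)   # r = 2  ->  a = 5/3
a_wh = 0.5 + 1.0 / 1.0                    # R = 1  ->  a = 3/2

for captured in (0.25, 0.5, 0.75, 0.9, 0.95, 0.99):   # captured energy fraction = 1 - eps
    eps = 1.0 - captured
    print(f"1 - eps = {captured:>4}: wavelets -> {layers_needed(a_wavelet, L, eps):2d} layers, "
          f"WH -> {layers_needed(a_wh, L, eps):2d} layers")
```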

... our fourth and last goal ...

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy

Recall: Let the filters be wavelets with mother wavelet satisfying supp(ψ̂) ⊆ [r⁻¹, r], r > 1, or Weyl-Heisenberg filters with prototype function satisfying supp(ĝ) ⊆ [−R, R], R > 0.

For fixed depth N, we want to choose r in the wavelet case and R in the WH case so that

    (1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φk(f)|||² ≤ ‖f‖₂²,  ∀f ∈ L2(Rd).

Depth-constrained networks

Theorem
Let the frame bounds satisfy An = Bn = 1, n ∈ N. Let the input signal f be L-band-limited, and fix ε ∈ (0, 1) and N ∈ N. If, in the wavelet case,

    1 < r ≤ √( (κ + 1)/(κ − 1) ),

or, in the WH case,

    0 < R ≤ 1/(κ − 1/2),

where κ := ( L / (1 − √(1 − ε)) )^{1/N}, then

    (1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φk(f)|||² ≤ ‖f‖₂².
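
A quick numerical sketch of the depth-constrained design (my own, based on the reconstruction of the theorem above): given L, ε, and a target depth N, compute κ and the largest admissible r and R; at these extreme values the decay factors equal κ, so a^N ≥ L/(1 − √(1 − ε)) and N layers suffice.

```python
import math

def design(L, eps, N):
    """Largest admissible mother-wavelet bandwidth r and WH support R for depth N."""
    kappa = (L / (1.0 - math.sqrt(1.0 - eps))) ** (1.0 / N)
    r_max = math.sqrt((kappa + 1.0) / (kappa - 1.0))     # wavelet case: 1 < r <= r_max
    R_max = 1.0 / (kappa - 0.5)                          # WH case: 0 < R <= R_max
    return kappa, r_max, R_max

L, eps = 1.0, 0.05          # capture 95% of the input signal energy
for N in (2, 3, 5):
    kappa, r_max, R_max = design(L, eps, N)
    # Sanity check: the decay factors at r_max / R_max equal kappa.
    a_wavelet = (r_max**2 + 1) / (r_max**2 - 1)
    a_wh = 0.5 + 1.0 / R_max
    print(f"N = {N}: kappa = {kappa:.3f}, r <= {r_max:.3f}, R <= {R_max:.3f}, "
          f"check: a_wavelet = {a_wavelet:.3f}, a_WH = {a_wh:.3f}")
```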

Depth-width tradeoff

[Figure: spectral supports of the mother wavelet ψ̂ and of the wavelet filters g1, g2, g3, centered at scales r, r², r³ and covering frequencies up to L]

Smaller depth N ⇒ smaller “bandwidth” r of the mother wavelet

⇒ larger number of wavelets (O(log_r(L))) to cover the spectral support [−L, L] of the input signal

⇒ larger number of filters in the first layer

⇒ depth-width tradeoff

Yours truly

Experiment: Handwritten digit classification

- Dataset: MNIST database of handwritten digits [LeCun & Cortes, 1998]; 60,000 training and 10,000 test images

- Φ-network: D = 3 layers; same filters, non-linearities, and pooling operators in all layers

- Classifier: SVM with radial basis function kernel [Vapnik, 1995]

- Dimensionality reduction: Supervised orthogonal least squares scheme [Chen et al., 1991]

Classification error in percent (rows: n.p. = no pooling, sub. = pooling by sub-sampling, max. = max-pooling, avg. = average pooling):

                  Haar wavelet                  Bi-orthogonal wavelet
        abs    ReLU   tanh   LogSig      abs    ReLU   tanh   LogSig
n.p.    0.57   0.57   1.35   1.49        0.51   0.57   1.12   1.22
sub.    0.69   0.66   1.25   1.46        0.61   0.61   1.20   1.18
max.    0.58   0.65   0.75   0.74        0.52   0.64   0.78   0.73
avg.    0.55   0.60   1.27   1.35        0.58   0.59   1.07   1.26

- modulus and ReLU perform better than tanh and LogSig

- results with pooling (S = 2) are competitive with those without pooling, at significantly lower computational cost

- state-of-the-art: 0.43 [Bruna and Mallat, 2013]
  - similar feature extraction network with directional, non-separable wavelets and no pooling
  - significantly higher computational complexity
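
For concreteness, here is a minimal sketch (my own, not the authors' code) of the kind of feature extractor used in this experiment: a 2-layer network with 2-D Haar-like filters, a modulus non-linearity, and pooling by sub-sampling (S = 2); the resulting feature vector would then be reduced in dimension and fed to an SVM with an RBF kernel. Filter choices and sizes are illustrative only.

```python
import numpy as np

def haar_filters():
    """2-D Haar-like horizontal, vertical and diagonal detail filters (illustrative)."""
    h = np.array([[1.0, -1.0], [1.0, -1.0]]) / 2.0
    v = np.array([[1.0, 1.0], [-1.0, -1.0]]) / 2.0
    d = np.array([[1.0, -1.0], [-1.0, 1.0]]) / 2.0
    return [h, v, d]

def conv2_same(img, ker):
    """Tiny 'same'-size 2-D convolution via zero padding (no SciPy dependency)."""
    kh, kw = ker.shape
    pad = np.pad(img, ((0, kh - 1), (0, kw - 1)))
    out = np.zeros_like(img)
    for i in range(kh):
        for j in range(kw):
            out += ker[i, j] * pad[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def layer(maps, S=2):
    """One layer: Haar filtering, modulus non-linearity, pooling by sub-sampling."""
    return [np.abs(conv2_same(m, g))[::S, ::S] for m in maps for g in haar_filters()]

def features(img, depth=2):
    """Feature vector: concatenation of the (sub-sampled) maps of every layer."""
    maps, out = [img], [img[::2, ::2].ravel()]
    for _ in range(depth):
        maps = layer(maps)
        out += [m.ravel() for m in maps]
    return np.concatenate(out)

img = np.random.default_rng(0).random((28, 28))   # stand-in for an MNIST digit
print("feature vector length:", features(img).size)
# In the experiment, such features (after supervised dimensionality reduction)
# are classified with an SVM using a radial basis function kernel.
```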

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay

    Wn(f) = O(a^{−n}),

for some unspecified a > 1.

- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
- applies to 1-D real-valued band-limited input signals f ∈ L2(R)

Energy decay: Related work

[Czaja and Li, 2016]: Exponential energy decay

    Wn(f) = O(a^{−n}),

for some unspecified a > 1.

- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
- applies to d-dimensional complex-valued input signals f ∈ L2(Rd)