Yanhui Su (yanhui [email protected])
March 30, 2018
Outline Functions Functionals Operators Bounds Optimal
Outline
1 Approximation of functions by a sigmoidal function
2 Approximations of continuous functionals by a sigmoidal function
3 Universal approximation by neural networks with arbitrary activation functions
4 Universal approximation bounds for superpositions of a sigmoidal function
5 Optimal approximation with sparsely connected deep neural networks
Basic Notations
1 d: dimension of the input layer;
2 L: number of layers;
3 N_l: number of neurons in the l-th layer, l = 1, · · · , L;
4 ρ : R → R: activation function;
5 W_l : R^{N_{l−1}} → R^{N_l}, 1 ≤ l ≤ L, x ↦ A_l x + b_l;
6 (A_l)_{ij}, (b_l)_i: the network weights.

Definition 1
A map Φ : R^d → R^{N_L} given by

    Φ(x) = W_L ρ(W_{L−1} ρ(· · · ρ(W_1 x))),    x ∈ R^d,

is called a neural network.
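The composition in Definition 1 is easy to sketch directly. Below is a minimal Python evaluation of Φ; the weight layout (a list of (A_l, b_l) pairs) and the function names are choices of this illustration, not notation from the slides.

```python
def affine(A, b, x):
    """Compute W(x) = A x + b, with A a list of rows and b a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) + bi
            for row, bi in zip(A, b)]

def phi(weights, rho, x):
    """Evaluate Phi(x) = W_L rho(W_{L-1} rho(... rho(W_1 x))) as in Definition 1:
    rho acts componentwise after every layer except the last."""
    for A, b in weights[:-1]:
        x = [rho(t) for t in affine(A, b, x)]
    A, b = weights[-1]
    return affine(A, b, x)

# Example: with rho = ReLU, this 2-layer network (d = 1, N_1 = 2, N_2 = 1)
# computes |x|, since relu(x) + relu(-x) = |x|.
relu = lambda t: max(t, 0.0)
weights = [([[1.0], [-1.0]], [0.0, 0.0]),   # W_1 : R -> R^2
           ([[1.0, 1.0]], [0.0])]           # W_2 : R^2 -> R
```
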
A classical result of Cybenko
We say that σ is sigmoidal if

    σ(x) → 1 as x → +∞,    σ(x) → 0 as x → −∞.
A classical result on approximation of neural networks is:
Theorem 2
(Cybenko [6]) Let σ be any continuous sigmoidal function. Then finite sums of the form

    G(x) = ∑_{j=1}^{N} α_j σ(y_j · x + θ_j)    (1)

are dense in C(I^d).
In [5], T.P. Chen, H. Chen and R.W. Liu gave a constructive proof which only assumes that σ is a bounded sigmoidal function.
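The constructive flavor of [5] can be checked numerically: a steep sigmoid is an approximate step function, so a staircase built from a sum of the form (1) converges uniformly to a continuous target. A toy Python sketch; the steepness parameter τ, the grid, and the function names are assumptions of this illustration.

```python
import math

def sigma(x):
    """Logistic sigmoid, clipped to avoid overflow."""
    if x < -60.0:
        return 0.0
    if x > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def cybenko_sum(f, N, tau=200.0):
    """Build G(x) = c0*sigma(tau*(x+1)) + sum_j a_j * sigma(tau*(x - m_j)),
    a sum of the form (1) approximating f on [0, 1]: each steep sigmoid is
    an approximate step at a cell midpoint m_j, and the jumps a_j reproduce
    f on the grid t_j = j/N, giving a staircase approximation."""
    ts = [j / N for j in range(N + 1)]
    mids = [(ts[j - 1] + ts[j]) / 2.0 for j in range(1, N + 1)]
    jumps = [f(ts[j]) - f(ts[j - 1]) for j in range(1, N + 1)]
    c0 = f(ts[0])
    def G(x):
        val = c0 * sigma(tau * (x + 1.0))  # approximately c0 on [0, 1]
        for a, m in zip(jumps, mids):
            val += a * sigma(tau * (x - m))
        return val
    return G
```

For f(x) = x² the uniform error on a fine grid drops as N grows, as the density statement predicts.
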
Approximations of continuous functionals on Lp space
Theorem 3
(Chen and Chen [3]) Suppose that U is a compact set in L^p[a, b] (1 < p < ∞), f is a continuous functional defined on U, and σ(x) is a bounded sigmoidal function. Then for any ε > 0, there exist h > 0, a positive integer m, m + 1 points a = x_0 < x_1 < · · · < x_m = b with x_j = a + j(b − a)/m, j = 0, 1, · · · , m, a positive integer N, and constants c_i, θ_i, ξ_{i,j}, i = 1, · · · , N, j = 0, 1, · · · , m, such that

    | f(u) − ∑_{i=1}^{N} c_i σ( ∑_{j=0}^{m} ξ_{i,j} (1/(2h)) ∫_{x_j−h}^{x_j+h} u(t) dt + θ_i ) | < ε

holds for all u ∈ U. Here it is assumed that u(x) = 0 if x ∉ [a, b].
Approximations of continuous functionals on C[a, b]
Theorem 4
(Chen and Chen [3]) Suppose that U is a compact set in C[a, b], f is a continuous functional defined on U, and σ(x) is a bounded sigmoidal function. Then for any ε > 0, there exist m + 1 points a = x_0 < · · · < x_m = b, a positive integer N, and constants c_i, θ_i, ξ_{i,j}, i = 1, · · · , N, j = 0, 1, · · · , m, such that for any u ∈ U,

    | f(u) − ∑_{i=1}^{N} c_i σ( ∑_{j=0}^{m} ξ_{i,j} u(x_j) + θ_i ) | < ε
An example in dynamical system
Suppose that the input u(x) and the output s(x) = G(u(x)) satisfy

    ds(x)/dx = g(s(x), u(x), x),    s(a) = s_0,

where g satisfies a Lipschitz condition. Then

    (Gu)(x) = s_0 + ∫_a^x g((Gu)(t), u(t), t) dt.

It can be shown that G is a continuous functional on C[a, b]. If the input set U ⊂ C[a, b] is compact, then the output at a specified time d can be approximated by

    ∑_{i=1}^{N} c_i σ( ∑_{j=1}^{m} ξ_{i,j} u(x_j) + θ_i ).
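The operator G above can be realized numerically. The sketch below uses a forward Euler scheme to map an input signal u to the output at time d; the step count n and the particular right-hand side g are illustrative assumptions, not part of the theorem.

```python
import math

def apply_G(u, g, s0=0.0, a=0.0, d=1.0, n=1000):
    """Approximate (Gu)(d), where ds/dx = g(s(x), u(x), x), s(a) = s0,
    by the forward Euler scheme with n steps on [a, d]."""
    h = (d - a) / n
    s, x = s0, a
    for _ in range(n):
        s += h * g(s, u(x), x)
        x += h
    return s

# Sanity check: for g(s, u, x) = -s + u and constant input u = 1 with s0 = 0,
# the exact solution is s(x) = 1 - exp(-x).
```
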
Approximation by Arbitrary Functions
Definition 5
If a function g : R → R satisfies that all linear combinations of the form

    ∑_{i=1}^{N} c_i g(λ_i x + θ_i),    λ_i, θ_i, c_i ∈ R, i = 1, · · · , N,

are dense in every C[a, b], then g is called a Tauber–Wiener (TW) function.
Theorem 6
(Chen and Chen [4]) Suppose that g(x) ∈ C(R) ∩ S′(R). Then g ∈ (TW) if and only if g is not a polynomial.
Approximation by Arbitrary Functions
Theorem 7
(Chen and Chen [4]) Suppose that K is a compact set in R^d, U is a compact set in C(K), and g ∈ (TW). Then for any ε > 0, there are a positive integer N, θ_i ∈ R, ω_i ∈ R^d, i = 1, · · · , N, which are all independent of f ∈ U, and constants c_i(f) depending on f, i = 1, · · · , N, such that

    | f(x) − ∑_{i=1}^{N} c_i(f) g(ω_i · x + θ_i) | < ε

holds for all x ∈ K, f ∈ U. Moreover, every c_i(f) is a continuous functional defined on U.
Approximation to functionals by Arbitrary Functions
The following theorem can be viewed as a generalization of Theorem 4 from the sigmoidal-function case.
Theorem 8
(Chen and Chen [4]) Suppose that g ∈ (TW), X is a Banach space, K ⊂ X is a compact set, V is a compact set in C(K), and f is a continuous functional defined on V. Then for any ε > 0, there are positive integers N and m, points x_1, · · · , x_m ∈ K, and constants c_i, θ_i, ξ_{ij} ∈ R, i = 1, · · · , N, j = 1, · · · , m, such that

    | f(u) − ∑_{i=1}^{N} c_i g( ∑_{j=1}^{m} ξ_{ij} u(x_j) + θ_i ) | < ε

holds for all u ∈ V.
Approximation to operators by Arbitrary Functions
Theorem 9
(Chen and Chen [4]) Suppose that g ∈ (TW), X is a Banach space, and K_1 ⊂ X, K_2 ⊂ R^d are two compact sets. Let V be a compact set in C(K_1) and G a nonlinear continuous operator which maps V to C(K_2). Then for any ε > 0, there are positive integers M, N, m, constants c_i^k, ζ_k, ξ_{ij}^k, θ_i^k ∈ R, points ω_k ∈ R^d, x_j ∈ K_1, i = 1, · · · , M, k = 1, · · · , N, j = 1, · · · , m, such that

    | G(u)(y) − ∑_{i=1}^{M} ∑_{k=1}^{N} c_i^k g( ∑_{j=1}^{m} ξ_{ij}^k u(x_j) + θ_i^k ) · g(ω_k · y + ζ_k) | < ε

holds for all u ∈ V, y ∈ K_2.
Basic notations
1 F̃(dω) = e^{iθ(ω)} F(dω): the Fourier distribution (i.e. complex-valued measure) of a function f(x) on R^d, with

    f(x) = ∫ e^{iω·x} F̃(dω)    (2)

2 B: a bounded set in R^d that contains 0;
3 Γ_B: the set of functions f on B for which representation (2) holds for x ∈ B with some complex-valued measure F̃(dω) for which ∫ |ω| F(dω) is finite;
4 Γ_{C,B}: the set of all functions f in Γ_B such that, for some F representing f on B,

    ∫ |ω|_B F(dω) ≤ C,

where |ω|_B = sup_{x∈B} |ω · x|.
Universal approximation bounds
Theorem 10
(Barron [1]) For every function f in Γ_{C,B}, every sigmoidal function σ, every probability measure µ, and every n ≥ 1, there exists a linear combination f_n(x) of n sigmoidal functions such that

    ∫_B (f̃(x) − f_n(x))² µ(dx) ≤ (2C)²/n,

where f̃(x) = f(x) − f(0).

In Theorem 10, the approximation result is proved without restrictions on the |y_j|, which yields the difficult problem of searching over an unbounded domain.
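The 1/n rate in Theorem 10 comes from a probabilistic (Maurey-type) argument: any point in the convex hull of a set bounded by C is approximated by the average of n i.i.d. draws with mean squared error at most C²/n. The toy check below exercises exactly this sampling step; the sign-vector setting and all names are assumptions of the illustration, not Barron's function class.

```python
import random

def sampling_error(n, d=50, trials=1000, seed=0):
    """Average squared distance between f = E[g] = 0 and the mean of n i.i.d.
    random sign vectors g in R^d (so |g|^2 = d = C^2). The Maurey bound
    predicts E|f - f_n|^2 <= C^2 / n = d / n; in this toy setting it holds
    with equality, since each coordinate of the mean has variance 1/n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = [0.0] * d
        for _ in range(n):
            for i in range(d):
                avg[i] += rng.choice((-1.0, 1.0)) / n
        total += sum(a * a for a in avg)  # squared distance to f = 0
    return total / trials
```
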
Universal approximation bounds
Given τ > 0, C > 0 and a bounded set B, let

    G_{σ,τ} = { γ σ(τ(α · x + b)) : |γ| ≤ 2C, |α|_B ≤ 1, |b| ≤ 1 }.

Theorem 11
(Barron [1]) For every f ∈ Γ_{C,B}, τ > 0, n ≥ 1, every probability measure µ, and every sigmoidal function σ with 0 ≤ σ ≤ 1, there is a function f_n in the convex hull of n functions in G_{σ,τ} such that

    ‖f̃ − f_n‖ ≤ 2C (1/n^{1/2} + δ_τ),

where ‖ · ‖ denotes the L²(µ, B) norm, f̃ = f(x) − f(0), and

    δ_τ = inf_{0<ε≤1/2} ( 2ε + sup_{|z|≥ε} |σ(τz) − 1_{z>0}| ).
Basic Notations
1 Ω: a domain in R^d;
2 C: a function class in L^2(Ω);
3 M: the network's connectivity (i.e. the total number of nonzero edge weights). If M is small relative to the number of possible connections, we say that the network is sparsely connected;
4 NN_{L,M,d,ρ}: the class of networks Φ : R^d → R with L layers, connectivity no more than M, and activation function ρ. Moreover, we let

    NN_{∞,M,d,ρ} := ⋃_{L∈N} NN_{L,M,d,ρ},    NN_{L,∞,d,ρ} := ⋃_{M∈N} NN_{L,M,d,ρ},

    NN_{∞,∞,d,ρ} := ⋃_{L∈N} NN_{L,∞,d,ρ}.
Best M-term Approximation Error
Definition 12
(DeVore and Lorentz [7]) Given C ⊂ L^2(Ω) and a representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω), we define, for f ∈ C and M ∈ N,

    Γ_M^D(f) := inf_{I_M ⊂ N, #I_M = M, c_i ∈ R} ‖ f − ∑_{i∈I_M} c_i ϕ_i ‖_{L^2(Ω)}.

We call Γ_M^D(f) the best M-term approximation error of f with respect to D. The supremal γ > 0 such that there exists C > 0 with

    sup_{f∈C} Γ_M^D(f) ≤ C M^{−γ},    ∀M ∈ N,

will be referred to as γ*(C, D).
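For an orthonormal system D, the infimum in Definition 12 is attained by keeping the M largest coefficients, and by Parseval the error is the ℓ² norm of the discarded tail. A minimal sketch, using a finite coefficient vector as a stand-in for the expansion of f:

```python
def best_M_term_error(coeffs, M):
    """Best M-term approximation error of f = sum_i c_i phi_i in an
    orthonormal system: keep the M largest |c_i|; by Parseval the error is
    the l2 norm of the remaining coefficients."""
    tail = sorted((abs(c) for c in coeffs), reverse=True)[M:]
    return sum(c * c for c in tail) ** 0.5
```

With c_i = 1/i the discarded tail sum behaves like 1/M, so Γ_M decays like M^{−1/2}; in the language of the definition, γ*(C, D) = 1/2 for this toy class.
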
Best M-term Approximation Error
1 It is conceivable that the optimal approximation rate for C in any representation system reflects specific properties of C. However, a countable and dense representation system D ⊂ L^2(R^d) results in γ*(C, D) = ∞.
2 In numerical computation, we need efficient methods to approximate any f ∈ C by a linear combination of finitely many elements of D. However, searching over the full index set N is computationally infeasible.
3 In [8], Donoho suggests restricting the search for the optimal coefficient set to the first π(M) coefficients, where π is some polynomial. This approach is known as polynomial-depth search.
Effective Best M-term Approximation Error
To overcome these problems, Donoho [8] and Grohs [9] proposedthe following
Definition 13
Given C ⊂ L^2(Ω) and a representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω). For γ > 0, we say that C has effective best M-term approximation rate M^{−γ} in D if there exist a univariate polynomial π and constants C, D > 0 such that for all M ∈ N and f ∈ C,

    ‖ f − ∑_{i∈I_M} c_i ϕ_i ‖_{L^2(Ω)} ≤ C M^{−γ}

for some index set I_M ⊂ {1, · · · , π(M)} with #I_M = M and coefficients (c_i)_{i∈I_M} satisfying max_{i∈I_M} |c_i| ≤ D. The supremal γ > 0 such that C has effective best M-term approximation rate M^{−γ} in D will be referred to as γ*,eff(C, D).
Best M-edge Approximation Error
Definition 14
(Bolcskei et al. [2]) Given C ⊂ L^2(Ω), we define, for f ∈ C and M ∈ N,

    Γ_M^{NN}(f) := inf_{Φ∈NN_{∞,M,d,ρ}} ‖f − Φ‖_{L^2(Ω)}.

We call Γ_M^{NN}(f) the best M-edge approximation error of f. The supremal γ > 0 such that there exists C > 0 with

    sup_{f∈C} Γ_M^{NN}(f) ≤ C M^{−γ},    ∀M ∈ N,

will be referred to as γ*_{NN}(C, ρ).
Best M-edge Approximation Error
The following theorem from [10] shows that Definition 14 suffers from difficulties similar to those of Definition 12.
Theorem 15
There exists a function ρ : R → R that is C^∞, strictly increasing, and satisfies lim_{x→∞} ρ(x) = 1 and lim_{x→−∞} ρ(x) = 0, such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there exists a neural network Φ with activation function ρ and three layers of dimensions N_1 = 3d, N_2 = 6d + 3, and N_3 = 1 satisfying

    sup_{x∈[0,1]^d} |f(x) − Φ(x)| ≤ ε.
Effective Best M-edge Approximation Error
Definition 16
(Bolcskei et al. [2]) For γ > 0, C ⊂ L^2(Ω) is said to have effective best M-edge approximation rate M^{−γ} by neural networks with activation function ρ if there exist L ∈ N, a univariate polynomial π, and a constant C > 0 such that for all M ∈ N and f ∈ C,

    ‖f − Φ‖_{L^2(Ω)} ≤ C M^{−γ}

for some Φ ∈ NN_{L,M,d,ρ} with the weights of Φ all bounded in absolute value by π(M). The supremal γ > 0 such that C has effective best M-edge approximation rate M^{−γ} will henceforth be denoted as γ*,eff_{NN}(C, ρ).
Min-Max Rate Distortion Theory in [8, 9]
Definition 17
Let C ⊂ L^2(Ω). For each l ∈ N, we denote by

    E^l := { E : C → {0, 1}^l }

the set of binary encoders of C of length l, and we let

    D^l := { D : {0, 1}^l → L^2(Ω) }

be the set of binary decoders of length l. An encoder-decoder pair (E, D) ∈ E^l × D^l is said to achieve distortion ε > 0 over the function class C if

    sup_{f∈C} ‖D(E(f)) − f‖_{L^2(Ω)} ≤ ε.
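A concrete encoder-decoder pair helps fix ideas: represent f by k coefficients in [−1, 1] and quantize each with b bits, so the code length is l = k·b and the error per coefficient is at most 2^{−b}. Uniform quantization is an illustrative choice of this sketch, not the scheme of [8, 9].

```python
def encode(coeffs, bits):
    """Binary encoder E: map each coefficient in [-1, 1] to a 'bits'-bit
    integer level; total code length is len(coeffs) * bits."""
    levels = 1 << bits
    return [min(levels - 1, int((c + 1.0) / 2.0 * levels)) for c in coeffs]

def decode(code, bits):
    """Binary decoder D: map each level back to the midpoint of its bin,
    so the reconstruction error per coefficient is at most 2**-bits."""
    levels = 1 << bits
    return [(q + 0.5) / levels * 2.0 - 1.0 for q in code]
```

Halving the distortion costs one extra bit per coefficient; this trade-off between code length and distortion is exactly what the minimax code length L(ε, C) quantifies.
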
Min-Max Rate Distortion Theory in [8, 9]
Definition 18
Let C ⊂ L^2(Ω). For ε > 0, the minimax code length L(ε, C) is

    L(ε, C) := min{ l ∈ N : ∃(E, D) ∈ E^l × D^l such that sup_{f∈C} ‖D(E(f)) − f‖_{L^2(Ω)} ≤ ε }.

Moreover, the optimal exponent γ*(C) is defined by

    γ*(C) := inf{ γ ∈ R : L(ε, C) = O(ε^{−γ}) }.
Min-Max Rate Distortion Theory in [8, 9]
Theorem 19
Let C ⊂ L^2(Ω), and let the optimal effective best M-term approximation rate of C in D ⊂ L^2(Ω) be M^{−γ*,eff(C,D)}. Then

    γ*,eff(C, D) ≤ 1/γ*(C).

If the representation system D satisfies

    γ*,eff(C, D) = 1/γ*(C),

then D is said to be optimal for the function class C.
Fundamental Bound on Effective M-edge Approximation
Theorem 20
(Bolcskei et al. [2]) Let C ⊂ L^2(Ω) and let

    Learn : (0, 1) × C → NN_{∞,∞,d,ρ}

be a map such that, for each pair (ε, f) ∈ (0, 1) × C, every weight of the neural network Learn(ε, f) can be encoded with no more than c log_2(ε^{−1}) bits while guaranteeing that

    sup_{f∈C} ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε.

Then

    sup_{ε∈(0,1/2)} ε^{1/γ} · sup_{f∈C} M(Learn(ε, f)) = ∞,    ∀γ > 1/γ*(C).
Fundamental Bound on Effective M-edge Approximation
The main idea of the proof of Theorem 20 is to encode the topology and weights of the network Learn(ε, f) by encoder-decoder pairs (E, D) ∈ E^{l(ε)} × D^{l(ε)} achieving distortion ε over C with

    l(ε) ≤ C_0 · sup_{f∈C} M(Learn(ε, f)) log_2(M(Learn(ε, f))) log_2(1/ε),

where C_0 > 0 is a constant.
Fundamental Bound on Effective M-edge Approximation
Corollary 21
(Bolcskei et al. [2]) Let Ω ⊂ R^d be bounded, and C ⊂ L^2(Ω). Then, for all ρ : R → R that are Lipschitz continuous or differentiable with polynomially bounded first derivative, we have

    γ*,eff_{NN}(C, ρ) ≤ 1/γ*(C).

We call a function class C ⊂ L^2(Ω) optimally representable by neural networks with activation function ρ : R → R if

    γ*,eff_{NN}(C, ρ) = 1/γ*(C).
From Representation Systems to Neural Networks
Definition 22
(Bolcskei et al. [2]) Let D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) be a representation system. Then D is said to be representable by neural networks (with activation function ρ) if there exist L, R ∈ N such that for all η > 0 and every i ∈ N there is a neural network Φ_{i,η} ∈ NN_{L,R,d,ρ} with

    ‖ϕ_i − Φ_{i,η}‖_{L^2(Ω)} ≤ η.

If, in addition, the neural networks Φ_{i,η} ∈ NN_{L,R,d,ρ} have weights that are uniformly polynomially bounded in (i, η^{−1}), and if ρ is either Lipschitz continuous or differentiable with polynomially bounded derivative, we call the representation system (ϕ_i)_{i∈N} effectively representable by neural networks (with activation function ρ).
From Representation Systems to Neural Networks
Theorem 23
(Bolcskei et al. [2]) Let Ω ⊂ R^d be bounded, and suppose that C ⊂ L^2(Ω) is effectively representable in the representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω). Suppose that D is effectively representable by neural networks. Then, for all γ < γ*,eff(C, D), there exist constants c, L > 0 and a map

    Learn : (0, 1) × L^2(Ω) → NN_{L,∞,d,ρ}

such that for every f ∈ C the following statements hold:
1 there exists k ∈ N such that each weight of the network Learn(ε, f) is bounded by ε^{−k};
2 the error bound ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε holds true; and
3 the neural network Learn(ε, f) has at most c ε^{−1/γ} edges.
From Representation Systems to Neural Networks
Specifically, in [2], the authors show that all function classes that are optimally approximated by a general class of representation systems, the so-called affine systems, can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis, such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules.
Reference
[1] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39(3), 930–945, 1993.
[2] H. Bolcskei, P. Grohs, G. Kutyniok and P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714, 2017.
[3] T.P. Chen and H. Chen, Approximations of continuous functionals by neural networks with application to dynamic systems, IEEE Transactions on Neural Networks, 4(6), 910–918, 1993.
[4] T.P. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Transactions on Neural Networks, 6(4), 911–917, 1995.
[5] T.P. Chen, H. Chen and R.W. Liu, A constructive proof and an extension of Cybenko's approximation theorem, In Computing Science and Statistics, 163–168, Springer, 1992.
[6] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, 2(4), 303–314, 1989.
[7] R.A. DeVore and G.G. Lorentz, Constructive Approximation, Springer Science & Business Media, 1993.
[8] D. Donoho, Unconditional bases are optimal bases for data compression and for statistical estimation, Applied and Computational Harmonic Analysis, 1(1), 100–115, 1993.
[9] P. Grohs, Optimally sparse data representations, In Harmonic and Applied Analysis, 199–248, Springer, 2015.
[10] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing, 25(1–3), 81–91, 1999.
Thank You for Your Attention!