Yanhui Su (yanhui [email protected])
March 30, 2018
Outline Functions Functionals Operators Bounds Optimal
Outline
1 Approximation of functions by a sigmoidal function
2 Approximations of continuous functionals by a sigmoidal function
3 Universal approximation by neural networks with arbitrary activation functions
4 Universal approximation bounds for superpositions of a sigmoidal function
5 Optimal approximation with sparsely connected deep neural networks
Basic Notations
1 d: dimension of the input layer;
2 L: number of layers;
3 N_l: number of neurons in the l-th layer, l = 1, · · · , L;
4 ρ : R → R: activation function;
5 W_l : R^{N_{l−1}} → R^{N_l}, 1 ≤ l ≤ L, x ↦ A_l x + b_l;
6 (A_l)_{ij}, (b_l)_i: the network weights.

Definition 1
A map Φ : R^d → R^{N_L} given by

    Φ(x) = W_L ρ(W_{L−1} ρ(· · · ρ(W_1 x))),    x ∈ R^d,

is called a neural network.
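The composition in Definition 1 is easy to sketch directly. Below is a minimal Python evaluation of Φ; the weight layout (a list of (A_l, b_l) pairs) and the function names are choices of this illustration, not notation from the slides.

```python
def affine(A, b, x):
    """Compute W(x) = A x + b, with A a list of rows and b a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) + bi
            for row, bi in zip(A, b)]

def phi(weights, rho, x):
    """Evaluate Phi(x) = W_L rho(W_{L-1} rho(... rho(W_1 x))) as in Definition 1:
    rho acts componentwise after every layer except the last."""
    for A, b in weights[:-1]:
        x = [rho(t) for t in affine(A, b, x)]
    A, b = weights[-1]
    return affine(A, b, x)

# Example: with rho = ReLU, this 2-layer network (d = 1, N_1 = 2, N_2 = 1)
# computes |x|, since relu(x) + relu(-x) = |x|.
relu = lambda t: max(t, 0.0)
weights = [([[1.0], [-1.0]], [0.0, 0.0]),   # W_1 : R -> R^2
           ([[1.0, 1.0]], [0.0])]           # W_2 : R^2 -> R
```
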
A classical result of Cybenko
We say that σ is sigmoidal if

    σ(x) → 1 as x → +∞,    σ(x) → 0 as x → −∞.
A classical result on approximation of neural networks is:
Theorem 2
(Cybenko [6]) Let σ be any continuous sigmoidal function. Then finite sums of the form

    G(x) = ∑_{j=1}^{N} α_j σ(y_j · x + θ_j)    (1)

are dense in C(I^d).
In [5], T.P. Chen, H. Chen and R.W. Liu gave a constructive proof which only assumes that σ is a bounded sigmoidal function.
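The constructive flavor of [5] can be checked numerically: a steep sigmoid is an approximate step function, so a staircase built from a sum of the form (1) converges uniformly to a continuous target. A toy Python sketch; the steepness parameter τ, the grid, and the function names are assumptions of this illustration.

```python
import math

def sigma(x):
    """Logistic sigmoid, clipped to avoid overflow."""
    if x < -60.0:
        return 0.0
    if x > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def cybenko_sum(f, N, tau=200.0):
    """Build G(x) = c0*sigma(tau*(x+1)) + sum_j a_j * sigma(tau*(x - m_j)),
    a sum of the form (1) approximating f on [0, 1]: each steep sigmoid is
    an approximate step at a cell midpoint m_j, and the jumps a_j reproduce
    f on the grid t_j = j/N, giving a staircase approximation."""
    ts = [j / N for j in range(N + 1)]
    mids = [(ts[j - 1] + ts[j]) / 2.0 for j in range(1, N + 1)]
    jumps = [f(ts[j]) - f(ts[j - 1]) for j in range(1, N + 1)]
    c0 = f(ts[0])
    def G(x):
        val = c0 * sigma(tau * (x + 1.0))  # approximately c0 on [0, 1]
        for a, m in zip(jumps, mids):
            val += a * sigma(tau * (x - m))
        return val
    return G
```

For f(x) = x² the uniform error on a fine grid drops as N grows, as the density statement predicts.
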
Approximations of continuous functionals on Lp space
Theorem 3
(Chen and Chen [3]) Suppose that U is a compact set in L^p[a, b] (1 < p < ∞), f is a continuous functional defined on U, and σ(x) is a bounded sigmoidal function. Then for any ε > 0, there exist h > 0, a positive integer m, m + 1 points a = x_0 < x_1 < · · · < x_m = b with x_j = a + j(b − a)/m, j = 0, 1, · · · , m, a positive integer N, and constants c_i, θ_i, ξ_{i,j}, i = 1, · · · , N, j = 0, 1, · · · , m, such that

    | f(u) − ∑_{i=1}^{N} c_i σ( ∑_{j=0}^{m} ξ_{i,j} (1/(2h)) ∫_{x_j−h}^{x_j+h} u(t) dt + θ_i ) | < ε

holds for all u ∈ U. Here it is assumed that u(x) = 0 if x ∉ [a, b].
Approximations of continuous functionals on C[a, b]
Theorem 4
(Chen and Chen [3]) Suppose that U is a compact set in C[a, b], f is a continuous functional defined on U, and σ(x) is a bounded sigmoidal function. Then for any ε > 0, there exist m + 1 points a = x_0 < · · · < x_m = b, a positive integer N, and constants c_i, θ_i, ξ_{i,j}, i = 1, · · · , N, j = 0, 1, · · · , m, such that for any u ∈ U,

    | f(u) − ∑_{i=1}^{N} c_i σ( ∑_{j=0}^{m} ξ_{i,j} u(x_j) + θ_i ) | < ε
An example in dynamical system
Suppose that the input u(x) and the output s(x) = G(u(x)) satisfy

    ds(x)/dx = g(s(x), u(x), x),    s(a) = s_0,

where g satisfies a Lipschitz condition. Then

    (Gu)(x) = s_0 + ∫_a^x g((Gu)(t), u(t), t) dt.

It can be shown that G is a continuous functional on C[a, b]. If the input set U ⊂ C[a, b] is compact, then the output at a specified time d can be approximated by

    ∑_{i=1}^{N} c_i σ( ∑_{j=1}^{m} ξ_{i,j} u(x_j) + θ_i ).
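The operator G above can be realized numerically. The sketch below uses a forward Euler scheme to map an input signal u to the output at time d; the step count n and the particular right-hand side g are illustrative assumptions, not part of the theorem.

```python
import math

def apply_G(u, g, s0=0.0, a=0.0, d=1.0, n=1000):
    """Approximate (Gu)(d), where ds/dx = g(s(x), u(x), x), s(a) = s0,
    by the forward Euler scheme with n steps on [a, d]."""
    h = (d - a) / n
    s, x = s0, a
    for _ in range(n):
        s += h * g(s, u(x), x)
        x += h
    return s

# Sanity check: for g(s, u, x) = -s + u and constant input u = 1 with s0 = 0,
# the exact solution is s(x) = 1 - exp(-x).
```
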
Approximation by Arbitrary Functions
Definition 5
If a function g : R → R satisfies that all linear combinations of the form

    ∑_{i=1}^{N} c_i g(λ_i x + θ_i),    λ_i, θ_i, c_i ∈ R, i = 1, · · · , N,

are dense in every C[a, b], then g is called a Tauber–Wiener (TW) function.
Theorem 6
(Chen and Chen [4]) Suppose that g(x) ∈ C(R) ∩ S′(R). Then g ∈ (TW) if and only if g is not a polynomial.
Approximation by Arbitrary Functions
Theorem 7
(Chen and Chen [4]) Suppose that K is a compact set in R^d, U is a compact set in C(K), and g ∈ (TW). Then for any ε > 0, there are a positive integer N, θ_i ∈ R, ω_i ∈ R^d, i = 1, · · · , N, which are all independent of f ∈ U, and constants c_i(f) depending on f, i = 1, · · · , N, such that

    | f(x) − ∑_{i=1}^{N} c_i(f) g(ω_i · x + θ_i) | < ε

holds for all x ∈ K, f ∈ U. Moreover, every c_i(f) is a continuous functional defined on U.
Approximation to functionals by Arbitrary Functions
The following theorem can be viewed as a generalization of Theorem 4 from the sigmoidal-function case.
Theorem 8
(Chen and Chen [4]) Suppose that g ∈ (TW), X is a Banach space, K ⊂ X is a compact set, V is a compact set in C(K), and f is a continuous functional defined on V. Then for any ε > 0, there are positive integers N and m, points x_1, · · · , x_m ∈ K, and constants c_i, θ_i, ξ_{ij} ∈ R, i = 1, · · · , N, j = 1, · · · , m, such that

    | f(u) − ∑_{i=1}^{N} c_i g( ∑_{j=1}^{m} ξ_{ij} u(x_j) + θ_i ) | < ε

holds for all u ∈ V.
Approximation to operators by Arbitrary Functions
Theorem 9
(Chen and Chen [4]) Suppose that g ∈ (TW), X is a Banach space, and K_1 ⊂ X, K_2 ⊂ R^d are two compact sets. Let V be a compact set in C(K_1) and G a nonlinear continuous operator which maps V to C(K_2). Then for any ε > 0, there are positive integers M, N, m, constants c_i^k, ζ_k, ξ_{ij}^k, θ_i^k ∈ R, points ω_k ∈ R^d, x_j ∈ K_1, i = 1, · · · , M, k = 1, · · · , N, j = 1, · · · , m, such that

    | G(u)(y) − ∑_{i=1}^{M} ∑_{k=1}^{N} c_i^k g( ∑_{j=1}^{m} ξ_{ij}^k u(x_j) + θ_i^k ) · g(ω_k · y + ζ_k) | < ε

holds for all u ∈ V, y ∈ K_2.
Basic notations
1 F̃(dω) = e^{iθ(ω)} F(dω): the Fourier distribution (i.e. complex-valued measure) of a function f(x) on R^d, with

    f(x) = ∫ e^{iω·x} F̃(dω)    (2)

2 B: a bounded set in R^d that contains 0;
3 Γ_B: the set of functions f on B for which representation (2) holds for x ∈ B with some complex-valued measure F̃(dω) for which ∫ |ω| F(dω) is finite;
4 Γ_{C,B}: the set of all functions f in Γ_B such that, for some F representing f on B,

    ∫ |ω|_B F(dω) ≤ C,

where |ω|_B = sup_{x∈B} |ω · x|.
Universal approximation bounds
Theorem 10
(Barron [1]) For every function f in Γ_{C,B}, every sigmoidal function σ, every probability measure µ, and every n ≥ 1, there exists a linear combination f_n(x) of n sigmoidal functions such that

    ∫_B (f̃(x) − f_n(x))² µ(dx) ≤ (2C)²/n,

where f̃(x) = f(x) − f(0).

In Theorem 10, the approximation result is proved without restrictions on the |y_j|, which yields the difficult problem of searching over an unbounded domain.
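The 1/n rate in Theorem 10 comes from a probabilistic (Maurey-type) argument: any point in the convex hull of a set bounded by C is approximated by the average of n i.i.d. draws with mean squared error at most C²/n. The toy check below exercises exactly this sampling step; the sign-vector setting and all names are assumptions of the illustration, not Barron's function class.

```python
import random

def sampling_error(n, d=50, trials=1000, seed=0):
    """Average squared distance between f = E[g] = 0 and the mean of n i.i.d.
    random sign vectors g in R^d (so |g|^2 = d = C^2). The Maurey bound
    predicts E|f - f_n|^2 <= C^2 / n = d / n; in this toy setting it holds
    with equality, since each coordinate of the mean has variance 1/n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = [0.0] * d
        for _ in range(n):
            for i in range(d):
                avg[i] += rng.choice((-1.0, 1.0)) / n
        total += sum(a * a for a in avg)  # squared distance to f = 0
    return total / trials
```
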
Universal approximation bounds
Given τ > 0, C > 0 and a bounded set B, let

    G_{σ,τ} = { γ σ(τ(α · x + b)) : |γ| ≤ 2C, |α|_B ≤ 1, |b| ≤ 1 }.

Theorem 11
(Barron [1]) For every f ∈ Γ_{C,B}, τ > 0, n ≥ 1, every probability measure µ, and every sigmoidal function σ with 0 ≤ σ ≤ 1, there is a function f_n in the convex hull of n functions in G_{σ,τ} such that

    ‖f̃ − f_n‖ ≤ 2C (1/n^{1/2} + δ_τ),

where ‖ · ‖ denotes the L²(µ, B) norm, f̃ = f(x) − f(0), and

    δ_τ = inf_{0<ε≤1/2} ( 2ε + sup_{|z|≥ε} |σ(τz) − 1_{z>0}| ).
Basic Notations
1 Ω: a domain in R^d;
2 C: a function class in L^2(Ω);
3 M: the network's connectivity (i.e. the total number of nonzero edge weights). If M is small relative to the number of possible connections, we say that the network is sparsely connected;
4 NN_{L,M,d,ρ}: the class of networks Φ : R^d → R with L layers, connectivity no more than M, and activation function ρ. Moreover, we let

    NN_{∞,M,d,ρ} := ⋃_{L∈N} NN_{L,M,d,ρ},    NN_{L,∞,d,ρ} := ⋃_{M∈N} NN_{L,M,d,ρ},

    NN_{∞,∞,d,ρ} := ⋃_{L∈N} NN_{L,∞,d,ρ}.
Best M-term Approximation Error
Definition 12
(DeVore and Lorentz [7]) Given C ⊂ L^2(Ω) and a representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω), we define, for f ∈ C and M ∈ N,

    Γ_M^D(f) := inf_{I_M ⊂ N, #I_M = M, c_i ∈ R} ‖ f − ∑_{i∈I_M} c_i ϕ_i ‖_{L^2(Ω)}.

We call Γ_M^D(f) the best M-term approximation error of f with respect to D. The supremal γ > 0 such that there exists C > 0 with

    sup_{f∈C} Γ_M^D(f) ≤ C M^{−γ},    ∀M ∈ N,

will be referred to as γ*(C, D).
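For an orthonormal system D, the infimum in Definition 12 is attained by keeping the M largest coefficients, and by Parseval the error is the ℓ² norm of the discarded tail. A minimal sketch, using a finite coefficient vector as a stand-in for the expansion of f:

```python
def best_M_term_error(coeffs, M):
    """Best M-term approximation error of f = sum_i c_i phi_i in an
    orthonormal system: keep the M largest |c_i|; by Parseval the error is
    the l2 norm of the remaining coefficients."""
    tail = sorted((abs(c) for c in coeffs), reverse=True)[M:]
    return sum(c * c for c in tail) ** 0.5
```

With c_i = 1/i the discarded tail sum behaves like 1/M, so Γ_M decays like M^{−1/2}; in the language of the definition, γ*(C, D) = 1/2 for this toy class.
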
Best M-term Approximation Error
1 It is conceivable that the optimal approximation rate for C in any representation system reflects specific properties of C. However, a countable and dense representation system D ⊂ L^2(R^d) results in γ*(C, D) = ∞.
2 In numerical computation, we need efficient methods to approximate any f ∈ C by a linear combination of finitely many elements of D. However, searching over the full index set N is computationally infeasible.
3 In [8], Donoho suggests restricting the search for the optimal coefficient set to the first π(M) coefficients, where π is some polynomial. This approach is known as polynomial-depth search.
Effective Best M-term Approximation Error
To overcome these problems, Donoho [8] and Grohs [9] proposedthe following
Definition 13
Given C ⊂ L^2(Ω) and a representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω). For γ > 0, we say that C has effective best M-term approximation rate M^{−γ} in D if there exist a univariate polynomial π and constants C, D > 0 such that for all M ∈ N and f ∈ C,

    ‖ f − ∑_{i∈I_M} c_i ϕ_i ‖_{L^2(Ω)} ≤ C M^{−γ}

for some index set I_M ⊂ {1, · · · , π(M)} with #I_M = M and coefficients (c_i)_{i∈I_M} satisfying max_{i∈I_M} |c_i| ≤ D. The supremal γ > 0 such that C has effective best M-term approximation rate M^{−γ} in D will be referred to as γ*,eff(C, D).
Best M-edge Approximation Error
Definition 14
(Bolcskei et al. [2]) Given C ⊂ L^2(Ω), we define, for f ∈ C and M ∈ N,

    Γ_M^{NN}(f) := inf_{Φ∈NN_{∞,M,d,ρ}} ‖f − Φ‖_{L^2(Ω)}.

We call Γ_M^{NN}(f) the best M-edge approximation error of f. The supremal γ > 0 such that there exists C > 0 with

    sup_{f∈C} Γ_M^{NN}(f) ≤ C M^{−γ},    ∀M ∈ N,

will be referred to as γ*_{NN}(C, ρ).
Best M-edge Approximation Error
The following theorem from [10] shows that Definition 14 suffers from difficulties similar to those of Definition 12.
Theorem 15
There exists a function ρ : R → R that is C^∞, strictly increasing, and satisfies lim_{x→∞} ρ(x) = 1 and lim_{x→−∞} ρ(x) = 0, such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there exists a neural network Φ with activation function ρ and three layers of dimensions N_1 = 3d, N_2 = 6d + 3, and N_3 = 1 satisfying

    sup_{x∈[0,1]^d} |f(x) − Φ(x)| ≤ ε.
Effective Best M-edge Approximation Error
Definition 16
(Bolcskei et al. [2]) For γ > 0, C ⊂ L^2(Ω) is said to have effective best M-edge approximation rate M^{−γ} by neural networks with activation function ρ if there exist L ∈ N, a univariate polynomial π, and a constant C > 0 such that for all M ∈ N and f ∈ C,

    ‖f − Φ‖_{L^2(Ω)} ≤ C M^{−γ}

for some Φ ∈ NN_{L,M,d,ρ} with the weights of Φ all bounded in absolute value by π(M). The supremal γ > 0 such that C has effective best M-edge approximation rate M^{−γ} will henceforth be denoted as γ*,eff_{NN}(C, ρ).
Min-Max Rate Distortion Theory in [8, 9]
Definition 17
Let C ⊂ L^2(Ω). For each l ∈ N, we denote by

    E^l := { E : C → {0, 1}^l }

the set of binary encoders of C of length l, and we let

    D^l := { D : {0, 1}^l → L^2(Ω) }

be the set of binary decoders of length l. An encoder-decoder pair (E, D) ∈ E^l × D^l is said to achieve distortion ε > 0 over the function class C if

    sup_{f∈C} ‖D(E(f)) − f‖_{L^2(Ω)} ≤ ε.
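A concrete encoder-decoder pair helps fix ideas: represent f by k coefficients in [−1, 1] and quantize each with b bits, so the code length is l = k·b and the error per coefficient is at most 2^{−b}. Uniform quantization is an illustrative choice of this sketch, not the scheme of [8, 9].

```python
def encode(coeffs, bits):
    """Binary encoder E: map each coefficient in [-1, 1] to a 'bits'-bit
    integer level; total code length is len(coeffs) * bits."""
    levels = 1 << bits
    return [min(levels - 1, int((c + 1.0) / 2.0 * levels)) for c in coeffs]

def decode(code, bits):
    """Binary decoder D: map each level back to the midpoint of its bin,
    so the reconstruction error per coefficient is at most 2**-bits."""
    levels = 1 << bits
    return [(q + 0.5) / levels * 2.0 - 1.0 for q in code]
```

Halving the distortion costs one extra bit per coefficient; this trade-off between code length and distortion is exactly what the minimax code length L(ε, C) quantifies.
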
Min-Max Rate Distortion Theory in [8, 9]
Definition 18
Let C ⊂ L^2(Ω). For ε > 0, the minimax code length L(ε, C) is

    L(ε, C) := min{ l ∈ N : ∃(E, D) ∈ E^l × D^l such that sup_{f∈C} ‖D(E(f)) − f‖_{L^2(Ω)} ≤ ε }.

Moreover, the optimal exponent γ*(C) is defined by

    γ*(C) := inf{ γ ∈ R : L(ε, C) = O(ε^{−γ}) }.
Min-Max Rate Distortion Theory in [8, 9]
Theorem 19
Let C ⊂ L^2(Ω), and let the optimal effective best M-term approximation rate of C in D ⊂ L^2(Ω) be M^{−γ*,eff(C,D)}. Then

    γ*,eff(C, D) ≤ 1/γ*(C).

If the representation system D satisfies

    γ*,eff(C, D) = 1/γ*(C),

then D is said to be optimal for the function class C.
Fundamental Bound on Effective M-edge Approximation
Theorem 20
(Bolcskei et al. [2]) Let C ⊂ L^2(Ω) and let

    Learn : (0, 1) × C → NN_{∞,∞,d,ρ}

be a map such that, for each pair (ε, f) ∈ (0, 1) × C, every weight of the neural network Learn(ε, f) can be encoded with no more than c log_2(ε^{−1}) bits while guaranteeing that

    sup_{f∈C} ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε.

Then

    sup_{ε∈(0,1/2)} ε^{1/γ} · sup_{f∈C} M(Learn(ε, f)) = ∞,    ∀γ > 1/γ*(C).
Fundamental Bound on Effective M-edge Approximation
The main idea of the proof of Theorem 20 is to encode the topology and weights of the network Learn(ε, f) by encoder-decoder pairs (E, D) ∈ E^{l(ε)} × D^{l(ε)} achieving distortion ε over C with

    l(ε) ≤ C_0 · sup_{f∈C} M(Learn(ε, f)) log_2(M(Learn(ε, f))) log_2(1/ε),

where C_0 > 0 is a constant.
Fundamental Bound on Effective M-edge Approximation
Corollary 21
(Bolcskei et al. [2]) Let Ω ⊂ R^d be bounded, and C ⊂ L^2(Ω). Then, for all ρ : R → R that are Lipschitz continuous or differentiable with polynomially bounded first derivative, we have

    γ*,eff_{NN}(C, ρ) ≤ 1/γ*(C).

We call a function class C ⊂ L^2(Ω) optimally representable by neural networks with activation function ρ : R → R if

    γ*,eff_{NN}(C, ρ) = 1/γ*(C).
From Representation Systems to Neural Networks
Definition 22
(Bolcskei et al. [2]) Let D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) be a representation system. Then D is said to be representable by neural networks (with activation function ρ) if there exist L, R ∈ N such that for all η > 0 and every i ∈ N there is a neural network Φ_{i,η} ∈ NN_{L,R,d,ρ} with

    ‖ϕ_i − Φ_{i,η}‖_{L^2(Ω)} ≤ η.

If, in addition, the neural networks Φ_{i,η} ∈ NN_{L,R,d,ρ} have weights that are uniformly polynomially bounded in (i, η^{−1}), and if ρ is either Lipschitz continuous or differentiable with polynomially bounded derivative, we call the representation system (ϕ_i)_{i∈N} effectively representable by neural networks (with activation function ρ).
From Representation Systems to Neural Networks
Theorem 23
(Bolcskei et al. [2]) Let Ω ⊂ R^d be bounded, and suppose that C ⊂ L^2(Ω) is effectively representable in the representation system D = (ϕ_i)_{i∈N} ⊂ L^2(Ω). Suppose that D is effectively representable by neural networks. Then, for all γ < γ*,eff(C, D), there exist constants c, L > 0 and a map

    Learn : (0, 1) × L^2(Ω) → NN_{L,∞,d,ρ}

such that for every f ∈ C the following statements hold:
1 there exists k ∈ N such that each weight of the network Learn(ε, f) is bounded by ε^{−k};
2 the error bound ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε holds true; and
3 the neural network Learn(ε, f) has at most c ε^{−1/γ} edges.
From Representation Systems to Neural Networks
Specifically, in [2], the authors show that all function classes that are optimally approximated by a general class of representation systems, the so-called affine systems, can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis, such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules.
Reference
[1] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39(3), 930–945, 1993.
[2] H. Bolcskei, P. Grohs, G. Kutyniok and P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714, 2017.
[3] T.P. Chen and H. Chen, Approximations of continuous functionals by neural networks with application to dynamic systems, IEEE Transactions on Neural Networks, 4(6), 910–918, 1993.
[4] T.P. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Transactions on Neural Networks, 6(4), 911–917, 1995.
[5] T.P. Chen, H. Chen and R.W. Liu, A constructive proof and an extension of Cybenko's approximation theorem, In Computing Science and Statistics, 163–168, Springer, 1992.
[6] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, 2(4), 303–314, 1989.
[7] R.A. DeVore and G.G. Lorentz, Constructive Approximation, Springer Science & Business Media, 1993.
[8] D. Donoho, Unconditional bases are optimal bases for data compression and for statistical estimation, Applied and Computational Harmonic Analysis, 1(1), 100–115, 1993.
[9] P. Grohs, Optimally sparse data representations, In Harmonic and Applied Analysis, 199–248, Springer, 2015.
[10] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing, 25(1–3), 81–91, 1999.
Thank You for Your Attention!