Fat-Shattering Dimension and Majorizing Measures: two ways to...

22
Fat-Shattering Dimension and Majorizing Measures: two ways to quantify complexity David Pollard Statistics Department Yale University http://www.stat.yale.edu/ ~ pollard Talk given at the Workshop on Simplicity and Likelihood Yale University November 2009 http://cowles.econ.yale.edu/conferences/simplicity/

Transcript of Fat-Shattering Dimension and Majorizing Measures: two ways to...

Page 1: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Fat-Shattering Dimension and Majorizing Measures:

two ways to quantify complexity

David PollardStatistics DepartmentYale Universityhttp://www.stat.yale.edu/~pollard

Talk given at the

Workshop on Simplicity and LikelihoodYale University

November 2009

http://cowles.econ.yale.edu/conferences/simplicity/

Page 2: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

VC classes of sets

Q = all quadrants (−∞, a]× (−∞, b] in R2

I How many patterns picked out from {x1, x2, x3}?

Q111

Q110

Q100

Q010

Q000

x1

x2

x3

V =

0 0 00 1 01 0 01 1 01 1 1

I For all x1, x2, x3 cannot get all 23 patterns: V has < 8 rows.For some x1, x2 can get all 22 patterns.

I VCdim(Q) = 2

Page 3: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I [packing numbers for Q as subset of L1(P ),

for each probability measure P on R2]

If Q1, . . . , QN ∈ Q and

P (Qi∆Qi′) > ε for 1 ≤ i < i′ ≤ N

then

N ≤ constant× (1/ε)2

I Hard to prove with exponent 2 (I think).

Easy to prove with exponent 4.

I Haussler (1995) got constant×(1/ε)VCdim for general VC classes

of sets.

Page 4: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Fat shattering?

ε-shattering: Kearns and Schapire (1994, §6)

=

fat-shattering: Bartlett, Long, and Williamson (1996);

Anthony and Bartlett (2002, §11.3)

≈(not) stable: Talagrand (1987a);

shatter at levels α, β: Talagrand (1996a);

??? Talagrand (1984);

Fremlin??

Page 5: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I F = a set of real-valued functions on a set X

x = (x1, x2, . . . , xn) ∈ Xn

levels: α = (α1, . . . , αn) and β = (β1, . . . , βn) with βj ≥ αj

I F picks out pattern K ⊆ {1,2, . . . , n} at levels α and β means:There exists fK ∈ F such that

fK(xj)

≥ βj if j ∈ K≤ αj if j ∈ Kc

I S(x, α, β,F) = number of distinct patterns picked out ≤ 2n

I F shatters x at levels α and β means: S(x, α, β,F) = 2n

I sdim(ε,F) is largest n such that F shatters some x in Xn at somelevels α and β with βj ≥ αj + 2ε for all j

Page 6: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Example:

F = all increasing functions f on R with 0 ≤ f ≤ 1

I Why must we have

0 ≤ α1 ≤ · · · ≤ α3 + 2ε ≤ β3 ≤ α4 . . . ?

I Why is sdim(ε,F) ≤ (2ε)−1?

x1 x2 x3 x40

1

α1

α2?α4

β2

α3

β3

β1

?

Page 7: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Suppose 0 ≤ f ≤ 1 for all f in F

and X1, X2, · · · ∼ iid P

I

∆n := supF |n−1∑i≤n f(Xi)−

∫f dP | = supF |Pnf − Pf |

I Talagrand (1987a) and Talagrand (1996a):

failure of ∆n → 0 a.s. (uniform SLLN)

iff

(roughly) ∃ subset A with PA > 0 for which (almost) every sample

from (nonatomic) P (· | A) can be shattered at some fixed levels α

and β (with αj ≡ α1 and βj ≡ β1)

Page 8: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Mendelson and Vershynin (2003)

I If 0 ≤ f ≤ 1 for all f in F

and P is a probability measure and α ≥ 1and f1, . . . , fN in F with∫

|fi − fi′|αdP > (cαε)

α for 1 ≤ i < i′ ≤ N

(a packing assumption) then

N ≤ (Cα/ε)6 sdim(ε,F)

for known constants cα and Cα.

I Good consequences for bounding ∆n

when X1, X2, · · · ∼ iid P if∫ 1

0

√sdim(ε,F) log(1/ε) dε <∞

Page 9: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

M&V method:

I For some sample x1, . . . , xm from P ,

with m = something involving logN and ε,

discretize:

V [i, j] := bfi(xj)/εc

to get an N ×m matrix V with entries in Sp = {0,1, . . . , p}for p = b1/εc.

I For a good realization x1, . . . , xm get

♥ m−1∑j≤m |V [i, j]− V [i′, j]|α > C′α for 1 ≤ i < i′ ≤ N

for some magic constant C′α

Page 10: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Key question:

For how many distinct (J, z) with J ⊆ Sp and z ∈ ZJ

can V “surround” (J, z) in the sense:

for each K ⊆ J there is an iK for which

V [iK, j]

≥ zj + 1 if j ∈ K≤ zj − 1 if j ∈ J\K

?

Page 11: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

V =

6 3 25 1 14 2 31 0 52 6 4

J = (1,2) z = (3,2)0

1

2

3

4

5

6

Page 12: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Answer:

If ♥ holds then

# distinct (J, z) surrounded by V is ≥√N − 1.

Page 13: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Majorizing measures

I Long and glorious history:

Fernique (1975),

. . . ,

Talagrand (1987b), Talagrand (1990),

Ledoux and Talagrand (1991, Chapter 11),

Talagrand (1996b), Talagrand (2001), Talagrand (2005),

Kwapien and Rosinski (2004),

Bednorz (2006), Bednorz (2007), . . .

(List woefully incomplete)

Page 14: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Ψ : R+ → R+ convex, increasing, with Ψ(0) = 0

eg. Ψgaus(x) := exp(x2/2)− 1 useful for Gaussian processes

I Stochastic process {Zt : t ∈ T} indexed by metric space (T, d)

with ‖Zs − Zt‖Ψ ≤ d(s, t), that is,

(|Zt − Zs|d(s, t)

)≤ 1 for all s 6= t

I Probability measure µ on T is a majorizing measure if

supt∈T

∫ diam(T )

0Ψ−1

(1

µB(t, r)

)dr <∞

where B(t, r) = closed ball of radius r around t.

Page 15: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I (Talagrand 1987b) for Gaussian process

P supt∈T

Zt <∞

if and only if there exists a majorizing measure (using Ψgaus)

I Actually, Talagrand gave explicit bounds.

Page 16: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

One way to build a MM: (under mild conditions on Ψ)

I Tk is a maximal set of points separated by at least diam(T )/2k,

for k = 1,2, . . .

I Nk := #Tk

I µk = uniform distribution on Tk, mass 1/Nk at each point

I µ =∑∞k=1 2−kµk is a MM if

∑∞k=1 2−kΨ−1(Nk) <∞

I Other kinds of MM’s exist and are useful

Page 17: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

I Use MM to construct nested partitions for “chaining argument”

(Talagrand 2005)

I Use MM as a local smoothing operator [Kwapien and Rosinski

(2004), Bednorz (2006), Bednorz (2007)]; get bounds on supre-

mum (and increments) of stochastic process

I Traditional chaining arguments seem to require “local complex-

ity” the same everywhere in T

I MM gives control where “local complexity” differs from one part

of T to another

Page 18: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Wild analogies and speculation

I Arguments for minimax rates of convergence of estimators often

use Bayes argument with uniform prior on a maximal set of points

separated by at least εn

I Some minimax arguments have a suggestive similarity to MM

arguments

I MM like a (possibly nonuniform) Bayes prior?

Page 19: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

References

Anthony, M. and P. Bartlett (2002). Neural Network Learning:

Theoretical Foundations. Cambridge University Press.

Bartlett, P. L., P. M. Long, and R. C. Williamson (1996).

Fat-shattering and the learnability of real-valued functions.

Journal of Computer and System Sciences 52, 434–452.

Bednorz, W. (2006). A theorem on majorizing measures. An-

nals of Probability 34, 1771–1781.

Bednorz, W. (2007). Holder continuity of random processes.

Journal of Theoretical Probability 20, 917–934.

Page 20: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Fernique, X. (1975). Regularite des trajectoires des fonctions

aleatoires gaussiennes. Springer Lecture Notes in Mathe-

matics 480, 1–97.

Haussler, D. (1995). Sphere packing numbers for subsets of

the Boolean n-cube with bounded Vapnik-Chervonenkis di-

mension. Journal of Combinatorial Theory 69, 217–232.

Kearns, M. J. and R. E. Schapire (1994). Efficient distribution-

free learning of probabilistic concepts. Journal of Computer

and System Sciences 48, 464–497.

Kwapien, S. and J. Rosinski (2004). Sample Holder continuity

of stochastic processes and majorizing measures. Progress

in Probability 58, 155–163.

Page 21: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Ledoux, M. and M. Talagrand (1991). Probability in BanachSpaces: Isoperimetry and Processes. New York: Springer.

Mendelson, S. and R. Vershynin (2003). Entropy and the com-binatorial dimension. Inventiones Mathematicae 152, 37–55.

Talagrand, M. (1984). Pettis Integral and Measure Theory,Volume 307 of Memoirs AMS. American Mathematical So-ciety.

Talagrand, M. (1987a). The Glivenko-Cantelli problem. Annalsof Probability 15(3), 837–870.

Talagrand, M. (1987b). Regularity of gaussian processes. ActaMathematica 159, 99–149.

Page 22: Fat-Shattering Dimension and Majorizing Measures: two ways to …pollard/Manuscripts+Notes/Yale_nov09... · 2009-11-14 · Fat-Shattering Dimension and Majorizing Measures: two ways

Talagrand, M. (1990). Sample boundedness of stochastic pro-cesses under increment conditions. Annals of Probability18(1), 1–49.

Talagrand, M. (1996a). The Glivenko-Cantelli problem, tenyears later. Journal of Theoretical Probability 9(2), 371–384.

Talagrand, M. (1996b). Majorizing measures: The genericchaining (in special invited paper). Annals of Probability24, 1049–1103.

Talagrand, M. (2001). Majorizing measures without measures.The Annals of Probability 29(1), 411–417.

Talagrand, M. (2005). The Generic Chaining: Upper and lowerbounds of stochastic processes. Springer-Verlag.