Probabilistic ‘generalization’ of functions and dimension-based uniform convergence results


MARTIN ANTHONY

Department of Mathematics, London School of Economics and Political Science,

Houghton Street, London WC2A 2AE, UK. Email: [email protected]

Received March 1996 and accepted August 1997

In this largely expository article, we highlight the significance of various types of 'dimension' for obtaining uniform convergence results in probability theory, and we demonstrate how these results lead to certain notions of generalization for classes of binary-valued and real-valued functions. We also present new results on the generalization ability of certain types of artificial neural networks with real output.

Keywords: Generalization, uniform convergence, Vapnik–Chervonenkis dimension

Statistics and Computing (1998) 8, 5–14. ISSN 0960-3174. © 1998 Chapman & Hall

1. Introduction

There are many approaches to the notion of 'generalization' in theories of machine learning and psychology. In this article, we review several mathematical approaches and we present some new results on the generalization ability of simple artificial neural networks. We start by describing one popular mathematical approach to generalization in the context of concept learning. We emphasize how this model of generalization relates to results in probability theory. We then examine how one might extend this notion of generalization to a more general framework, that of generalizing real functions. Again, we link these models with results in probability theory. We demonstrate concrete applications of one of the models by presenting some new results on the generalization ability of certain types of neural networks.

2. PAC-generalization

The models of generalization we discuss in this article are based on what has become known as the 'probably approximately correct', or PAC, model of computational learning theory (or statistical learning theory). This model was introduced by Valiant (1984). In Valiant's formulation, much stress was placed on the computational complexity of learning algorithms, which is not something we shall address here. The main probabilistic tools which have become useful for the analysis of this model and its variants have their roots in the work of Vapnik and others (see Vapnik, 1982, 1995; Vapnik and Chervonenkis, 1971). The books by Natarajan (1991), Anthony and Biggs (1992) and Kearns and Vazirani (1995) contain general discussions of PAC learning.

In its simplest form, the PAC model of learning may be described as follows. There is a set of examples $X$, and a target function $t : X \to \{0,1\}$. It is known that $t$ belongs to some set $C$ of functions, but that is all that is known about it. There is assumed to be some fixed (but unknown) probability measure $\mu$ on the set $X$ of examples. (More precisely, we mean a probability measure defined on a fixed $\sigma$-algebra $R$ of subsets of $X$. It is usually assumed that $X$ is a complete separable metric space, in which case we take $R$ to be the Borel algebra on $X$.) In this framework, a 'learner' receives a training sample

$$\mathbf{s} = \mathbf{x}(t) = \big((x_1, t(x_1)), (x_2, t(x_2)), \ldots, (x_m, t(x_m))\big) \in (X \times \{0,1\})^m,$$

a sequence of labelled examples. For the PAC model, it is assumed that the examples $x_1, x_2, \ldots$ are drawn independently at random from $X$, according to $\mu$. The aim is to find a good approximation to $t$ from a set $H$ of functions (possibly different from $C$). The class $H$ must satisfy some measurability conditions in order for the following definitions and results to be valid; these conditions are quite natural and a discussion may be found in Blumer et al. (1989). A learning algorithm is a mapping $L$ from samples of the form $\mathbf{x}(t)$, where $\mathbf{x} \in \bigcup_{m=1}^{\infty} X^m$ and $t \in C$, to $H$. Generally, the error of $h \in H$ with respect to $t$ and $\mu$ is defined to be $\mathrm{er}_\mu(h) = \mu(\{x \in X : h(x) \neq t(x)\})$. With these notations, the definition of PAC learning is as follows.

Definition 2.1 Suppose that $L$ is a learning algorithm as described above. We say that $L$ is probably approximately correct (or PAC) if, given $\epsilon, \delta \in (0,1)$, there is $m_L(\epsilon,\delta)$ such that for any probability measure $\mu$ on $X$ and any target function $t \in C$,

$$\mu^m\big(\{\mathbf{x} \in X^m : \mathrm{er}_\mu(L(\mathbf{x}(t))) > \epsilon\}\big) < \delta,$$

for $m \geq m_L(\epsilon,\delta)$. (In other words, with probability at least $1-\delta$, for a large enough sample, $L(\mathbf{x}(t))$ has error less than $\epsilon$.)

Note that a PAC learning algorithm must work for every probability measure $\mu$ and every possible target function $t$, and that $m_L(\epsilon,\delta)$ depends on neither the probability measure nor the target. PAC learning is therefore 'distribution-independent' learning.

If $C \subseteq H$ then it is possible (and perhaps sensible) to use a learning algorithm $L$ with the property that the function $h = L(\mathbf{x}(t))$ satisfies $h(x_i) = t(x_i)$ for $1 \leq i \leq m$; in other words, the output function of the learning algorithm agrees with the target function on the examples it saw during training. We say that such an $L$ is consistent (Blumer et al., 1989). We make the following definition.

Definition 2.2 Let $H$ be a set of functions from some set $X$ to $\{0,1\}$. Then $H$ PAC-generalizes if for all $\epsilon, \delta \in (0,1)$ there is $m_0(\epsilon,\delta)$ such that, for any $t : X \to \{0,1\}$ and any probability measure $\mu$ on $X$, if $m \geq m_0(\epsilon,\delta)$ then, with $\mu^m$-probability at least $1-\delta$, a sample $\mathbf{x} \in X^m$ is such that the following holds:

$$h \in H \text{ and } h(x_i) = t(x_i) \text{ for } i = 1,2,\ldots,m \implies \mathrm{er}_\mu(h) = \mu(\{x \in X : h(x) \neq t(x)\}) < \epsilon.$$
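As a concrete illustration of a consistent learning algorithm and of the behaviour described in Definition 2.2, here is a minimal Python sketch, not taken from the paper: the class of threshold functions (h_a(x) = 1 iff x >= a) on X = [0,1], the particular target and the uniform distribution are assumptions chosen purely for this example.

import random

def threshold(a):
    """The function h_a : [0,1] -> {0,1} with h_a(x) = 1 iff x >= a."""
    return lambda x: 1 if x >= a else 0

def consistent_learner(sample):
    """Return a threshold function agreeing with every labelled example.

    A simple consistent rule: place the threshold at the smallest
    positively-labelled point (or above all points if none is positive).
    """
    positives = [x for x, y in sample if y == 1]
    a = min(positives) if positives else 1.1
    return threshold(a)

def error(h, t, n_test=100_000):
    """Monte Carlo estimate of er_mu(h) = mu{x : h(x) != t(x)} under uniform mu."""
    xs = [random.random() for _ in range(n_test)]
    return sum(h(x) != t(x) for x in xs) / n_test

if __name__ == "__main__":
    t = threshold(0.3)                      # the unknown target, a member of C
    for m in (10, 100, 1000):
        xs = [random.random() for _ in range(m)]
        h = consistent_learner([(x, t(x)) for x in xs])
        print(m, round(error(h, t), 4))     # the error of the consistent h shrinks as m grows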

It is clear that if $H$ PAC-generalizes, if $C \subseteq H$, and if $L$ is a consistent learning algorithm, then $L$ is a PAC learning algorithm; see Blumer et al. (1989) and Anthony and Biggs (1992).

More generally, when it is not the case that $C \subseteq H$, or when $C$ is unknown, it may be impossible to find a consistent learning algorithm. To deal with this (and with other considerations, such as there being no well-defined target function), the model can be extended by considering probability measures on $X \times \{0,1\}$, rather than probability measures on $X$ coupled with functions from $X$ to $\{0,1\}$. (Any probability measure $\mu$ on $X$ together with a function $t : X \to \{0,1\}$ can be represented in the obvious way by a probability measure $P$ on $X \times \{0,1\}$; see Blumer et al. (1989) and Anthony and Shawe-Taylor (1994a).) In this more general model (which is discussed in Blumer et al. (1989) and Haussler (1992), for example), the error of $h \in H$ with respect to a probability measure $P$ on $X \times \{0,1\}$ is taken to be

$$\mathrm{er}_P(h) = P\big(\{(x,y) \in X \times \{0,1\} : h(x) \neq y\}\big).$$

In this context, a learning algorithm takes as input a $P^m$-random sample

$$\mathbf{s} = \big((x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\big).$$

In this more general setting there may be no function in $H$ with zero error, so we modify slightly the aim of learning: we now hope to produce a function $h \in H$ with near-minimal error with respect to the measure $P$.

Definition 2.3 A learning algorithm $L$ is said to be probably approximately optimal if for any $\epsilon, \delta \in (0,1)$ there is $m_L(\epsilon,\delta)$ such that, given any probability measure $P$ on $X \times \{0,1\}$, if $m \geq m_L(\epsilon,\delta)$ then with $P^m$-probability at least $1-\delta$, $\mathbf{s} \in (X \times \{0,1\})^m$ is such that

$$\mathrm{er}_P(L(\mathbf{s})) < \inf_{h \in H} \mathrm{er}_P(h) + \epsilon.$$

(A more general definition than this can be given, where the aim is to produce a function whose error is almost as small as the smallest one could hope to find in a class $F$, which might be different from $H$. This is known as agnostic learning and is not something we shall discuss further here. See Kearns et al. (1992) and Maass (1993).)

By analogy with a consistent algorithm for the simple PAC framework discussed earlier, it might be thought that a good approach to the present problem is to use a learning algorithm $L$ which chooses an $h \in H$ which appears, on the basis of the sample $\mathbf{s}$, to have small error. Formally, we define the observed error of $h \in H$ on a sample $\mathbf{s}$ to be

$$\mathrm{er}_{\mathbf{s}}(h) = \frac{1}{m}\,\big|\{i : h(x_i) \neq y_i\}\big|,$$

and we should hope that if the sample is large enough then, with high probability, the observed error is close to the (actual) error of $h$. We make the following more general definition of PAC-generalization.

Definition 2.4 Let $H$ be a set of functions from $X$ to $\{0,1\}$. Then $H$ PAC-generalizes if for any $\epsilon, \delta \in (0,1)$, there is $m_0(\epsilon,\delta)$ such that for any probability measure $P$ on $X \times \{0,1\}$,

$$P^m\Big(\Big\{\mathbf{s} \in (X \times \{0,1\})^m : \sup_{h \in H} |\mathrm{er}_{\mathbf{s}}(h) - \mathrm{er}_P(h)| \geq \epsilon\Big\}\Big) < \delta$$

for $m \geq m_0(\epsilon,\delta)$. (In other words, for $m \geq m_0(\epsilon,\delta)$, with probability at least $1-\delta$, if we take a random sample drawn according to $P$ then for every $h \in H$, the observed error and the (actual) error differ by less than $\epsilon$.)

It can be seen fairly easily that if $H$ PAC-generalizes, then a learning algorithm $L$ which chooses $h$ minimizing the observed error on the training sample is probably approximately optimal; see Haussler (1992). (It is clear that if this definition holds true for $H$ then so does the earlier definition, Definition 2.2, of PAC-generalization. In this sense, the present definition subsumes the previous one.)
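To make the 'choose h with smallest observed error' strategy concrete, here is a minimal Python sketch of empirical risk minimization over a small finite class; the grid of threshold functions, the data-generating distribution and the 10% label noise are assumptions introduced only for this illustration.

import random

def threshold(a):
    """h_a(x) = 1 iff x >= a, a {0,1}-valued hypothesis on [0,1]."""
    return lambda x: 1 if x >= a else 0

# A small finite hypothesis class H: thresholds on a grid.
H = [(a, threshold(a)) for a in [i / 20 for i in range(21)]]

def observed_error(h, sample):
    """er_s(h) = (1/m) |{i : h(x_i) != y_i}|."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def minimize_observed_error(sample):
    """Return the hypothesis in H with minimal observed error on the sample."""
    return min(H, key=lambda pair: observed_error(pair[1], sample))

if __name__ == "__main__":
    # Labels follow the threshold 0.37 but are flipped with probability 0.1,
    # so no h in H has zero error and near-minimal error is the right target.
    def draw(m):
        sample = []
        for _ in range(m):
            x = random.random()
            y = 1 if x >= 0.37 else 0
            if random.random() < 0.1:
                y = 1 - y
            sample.append((x, y))
        return sample

    a_hat, h_hat = minimize_observed_error(draw(2000))
    print("chosen threshold:", a_hat)   # typically close to 0.35 or 0.40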

3. Uniform convergence of probabilities

Blumer et al. (1989) observed that 'uniform convergence' results in probability theory, such as those obtained by Vapnik and Chervonenkis (1971), are immediately applicable to PAC-generalization.

Let $Z$ be a set and $K$ a $\sigma$-algebra of subsets of $Z$. Suppose that $E \subseteq K$ is a set of events. For $\mathbf{z} = (z_1, z_2, \ldots, z_m) \in Z^m$, denote by $\hat{P}_{\mathbf{z}}(A)$ the relative frequency of event $A \in E$ on the sample $\mathbf{z}$,

$$\hat{P}_{\mathbf{z}}(A) = \frac{1}{m}\,\big|\{i : z_i \in A\}\big|.$$

(Although the relative frequency, as defined, does not depend explicitly on a probability measure $P$ as the notation would suggest, this notation is useful since we shall be interested in the relative frequencies of events on $P$-random samples.) Standard results in probability theory assure us that, given any $A \in E$, the relative frequency $\hat{P}_{\mathbf{z}}(A)$ converges in probability to the probability $P(A)$. However, we shall require something far stronger. One says that the relative frequencies (of events in $E$) converge uniformly to probabilities if there is a function $f(m,\epsilon)$ with the property that for every $\epsilon \in (0,1)$, $f(m,\epsilon) \to 0$ as $m \to \infty$, and which is such that

$$\forall P, \quad P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{A \in E} |\hat{P}_{\mathbf{z}}(A) - P(A)| \geq \epsilon\Big\}\Big) < f(m,\epsilon),$$

where '$\forall P$' means for all probability measures $P$ on $(Z,K)$. Note that there are two senses in which the convergence is uniform: the rate of convergence of $\hat{P}_{\mathbf{z}}(A)$ to $P(A)$ can be bounded by a quantity $f(m,\epsilon)$ which is independent of both the probability measure $P$ and the event $A \in E$.

It is fairly straightforward to see how such a uniform convergence result implies PAC-generalization (an observation made, for example, in Blumer et al. (1989) and Haussler (1992)). For (using the notation of the previous section), let us take $Z = X \times \{0,1\}$ and take $K$ to be the product $\sigma$-algebra $R \otimes 2^{\{0,1\}}$. For each $h \in H$, let

$$E_h = \{(x,y) \in X \times \{0,1\} : h(x) \neq y\}$$

and let $E = \{E_h : h \in H\}$. With $P$ as in the discussion of PAC-generalization, $P(E_h) = \mathrm{er}_P(h)$ and, for $\mathbf{s} \in Z^m$, $\hat{P}_{\mathbf{s}}(E_h) = \mathrm{er}_{\mathbf{s}}(h)$. Now, if the relative frequencies of the events in $E$ converge uniformly to their probabilities, then for all $\epsilon \in (0,1)$, given any $\delta \in (0,1)$, there is $m_0(\epsilon,\delta)$ such that for $m \geq m_0(\epsilon,\delta)$,

$$\forall P, \quad P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{A \in E} |\hat{P}_{\mathbf{z}}(A) - P(A)| \geq \epsilon\Big\}\Big) < \delta.$$

But this means precisely that, for $m \geq m_0(\epsilon,\delta)$, and for any probability measure $P$ on $X \times \{0,1\}$,

$$P^m\Big(\Big\{\mathbf{s} \in (X \times \{0,1\})^m : \sup_{h \in H} |\mathrm{er}_{\mathbf{s}}(h) - \mathrm{er}_P(h)| \geq \epsilon\Big\}\Big) < \delta,$$

which is exactly what is required (see Definition 2.4) for $H$ to PAC-generalize.

The paper of Vapnik and Chervonenkis (1971) gave the first such general uniform convergence result. In order to describe it, we first need the notion of the growth function of the set $E$ of events. For $S \subseteq Z$, let

$$E \cap S = \{A \cap S : A \in E\}.$$

Then the growth function $\Pi_E : \mathbb{N} \to \mathbb{N}$ is given by

$$\Pi_E(m) = \max_{|S| = m} |E \cap S|.$$

Vapnik and Chervonenkis (1971) proved the following result.

Theorem 3.1 (Vapnik and Chervonenkis (1971)) Let $E$ be a set of events on $(Z,K)$ and $P$ any probability measure on $(Z,K)$. Then, for $m \geq 2/\epsilon^2$,

$$P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{A \in E} |\hat{P}_{\mathbf{z}}(A) - P(A)| \geq \epsilon\Big\}\Big) \leq 4\,\Pi_E(2m)\,e^{-\epsilon^2 m/8}.$$

As it stands, this is not explicitly a uniform convergence result. However, it leads directly to one for classes $E$ whose Vapnik–Chervonenkis dimension, a measure of the 'richness' of the class developed in Vapnik and Chervonenkis (1971), is finite. Noting that $\Pi_E(m) \leq 2^m$ for all $m$, one says that $E$ has finite Vapnik–Chervonenkis dimension $d$ if $\Pi_E(d+1) < 2^{d+1}$ and $\Pi_E(d) = 2^d$. (Otherwise, that is, if $\Pi_E(m) = 2^m$ for all $m$, the class has infinite Vapnik–Chervonenkis dimension.) The Vapnik–Chervonenkis dimension of $E$, often called the VC-dimension, is denoted $\mathrm{VCdim}(E)$. Vapnik and Chervonenkis observed that if $E$ has finite VC-dimension $d$ then the growth function $\Pi_E(m)$ is bounded by a polynomial in $m$ of degree $d$. (A similar bound, known as 'Sauer's Lemma', is given in Sauer (1972).) Therefore, when $E$ has finite VC-dimension, the bound given in Theorem 3.1 takes the form of a negative exponential multiplied by a polynomial, and therefore converges to 0 as $m$ tends to infinity. Thus, finite VC-dimension is a sufficient condition for relative frequencies to converge uniformly to probabilities. In fact, this is also a necessary condition (Vapnik and Chervonenkis, 1971; Blumer et al., 1989).
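As a numerical aside (not from the paper), the following Python sketch evaluates the Sauer bound on the growth function and substitutes it into the right-hand side of Theorem 3.1; the particular values of d and epsilon are assumptions for the example, and the point is only that the polynomial factor is eventually overwhelmed by the negative exponential.

import math

def sauer_bound(m, d):
    """Sauer's Lemma: Pi_E(m) <= sum_{i=0}^{d} C(m, i) when VCdim(E) = d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def vc_bound(m, d, eps):
    """Right-hand side of Theorem 3.1 with Pi_E(2m) replaced by its Sauer bound."""
    return 4 * sauer_bound(2 * m, d) * math.exp(-eps ** 2 * m / 8)

if __name__ == "__main__":
    d, eps = 10, 0.1
    for m in (10_000, 50_000, 100_000, 500_000):
        print(m, vc_bound(m, d, eps))   # decays to 0 once m is large relative to d and eps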


Theorem 3.2 (Vapnik and Chervonenkis, 1971) Let $E$ be a set of subsets of a set $Z$. Then the relative frequencies of the events in $E$ converge uniformly to their probabilities if and only if $E$ has finite VC-dimension.

We have seen how uniform convergence results apply directly to prove PAC-generalization, by taking the events $E$ to be the error sets $E_h$ for $h \in H$. If we identify $\{0,1\}$-valued functions with their supports, then we may define the VC-dimension of the set $H$ of functions. It is very easy to see that the VC-dimensions of $E = \{E_h : h \in H\}$ and $H$ are the same (Blumer et al., 1989); therefore, a necessary and sufficient condition for $H$ to PAC-generalize (in the sense of Definition 2.4) is that $H$ has finite VC-dimension. (On the necessity side, Blumer et al. (1989) prove a number of stronger assertions, among them that $H$ PAC-generalizes in the weaker sense of Definition 2.2 only if $H$ has finite VC-dimension.)

We remark that a number of different uniform convergence results along the lines of Theorem 3.1 have been obtained, some of which, such as those in Vapnik (1982), Blumer et al. (1989), Haussler (1992) and Anthony and Shawe-Taylor (1994a), provide better bounds on the rate of uniform convergence.

4. Generalization of real functions

We now discuss how the previous notions of generalization have been extended to classes of real-valued functions. For simplicity, we shall assume, unless indicated otherwise, that our sets $H$ of functions map from a set $X$ into the real interval $[0,1]$. (It is easy to modify the theories presented here to deal with function classes mapping into other bounded subsets of the reals, as in Haussler (1992), for instance.) Most of what we shall say here can be made significantly more general: indeed, in Haussler (1992), a theory is developed for function classes mapping into any complete separable metric space.

It is clear that a different approach must be taken for classes of real-valued functions. For example, if one wanted to extend Definition 2.2 to real functions, it would be inappropriate, given a target real function $t$ and a real function $h$, to define the error of $h$ with respect to $t$ (and a probability measure $\mu$) to be $\mu(\{x \in X : h(x) \neq t(x)\})$. One should not merely be interested in whether $h(x)$ equals $t(x)$, but in how close $h(x)$ is to $t(x)$.

In this section we present a number of the models of generalization for real functions which have been studied. Later, we explore the connections between these and uniform convergence results in probability theory, highlighting the importance of certain extensions of the VC-dimension.

4.1. PAC-generalization

Suppose that $H$ is a set of functions from a set $X$ to $[0,1]$. In order to measure how close one function is to a target function or, more generally, how well it 'fits' a probability measure $P$ on $Z = X \times [0,1]$, it is useful to use a loss function (Haussler, 1992). In this approach, developed extensively by Haussler (1992), a loss function is a function $l : [0,1] \times [0,1] \to [0,M]$, for some $M > 0$, and $l(y,y')$ may be thought of as the 'distance' between $y$ and $y'$. Having said this, it should be emphasized that $l$ need not be a metric. For example, a much-used loss function is the square loss, given by $l(y,y') = (y-y')^2$, and this is not a metric. Another commonly-used loss function is the linear loss, $l(y,y') = |y-y'|$, which is a metric. There are a number of other loss functions which are appropriate for a range of different problems; see the discussion in Haussler (1992). From now on, without any great loss of generality, we shall assume that $M = 1$. (Results for general $M$ are given in Haussler (1992).)

generalization, we assume that we have, as above, a func-tion class H from X to �0; 1�, and a loss functionl : �0; 1� � �0; 1� ! �0; 1�. For the de®nition to be as generalas possible, rather than assume that there is a targetfunction from X to �0; 1� together with a probability mea-sure on X , we assume that we have some unknown prob-ability measure P on Z � X � �0; 1�. Given h 2H, the errorof h with respect to P is de®ned to be the expected value ofthe quantity l�h�x�; y�, where �x; y� 2 Z is distributed ac-cording to P , that is,

erP �h� � E�x;y��P �l�h�x�; y�� � EP �l�h�x�; y��:The corresponding estimate of the error on a sample

z � ��x1; y1�; �x2; y2�; . . . ; �xm; ym�� 2 Zm

is

erz�h� � 1

m

Xm

i�1l�h�xi�; yi�:
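For concreteness, here is a minimal Python sketch (not from the paper) of these two error notions for the square loss, with the true error er_P(h) approximated by Monte Carlo sampling; the particular hypothesis, target and distribution are assumptions for the example.

import random

def square_loss(y, y_prime):
    """l(y, y') = (y - y')^2, a loss function on [0,1] x [0,1]."""
    return (y - y_prime) ** 2

def empirical_error(h, sample, loss=square_loss):
    """er_z(h) = (1/m) * sum_i l(h(x_i), y_i)."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def true_error(h, draw, loss=square_loss, n=200_000):
    """Monte Carlo approximation of er_P(h) = E_P[ l(h(x), y) ]."""
    return sum(loss(h(x), y) for x, y in (draw() for _ in range(n))) / n

if __name__ == "__main__":
    # P: x uniform on [0,1], y = x^2 plus a little bounded noise, clipped to [0,1].
    def draw():
        x = random.random()
        y = min(1.0, max(0.0, x * x + random.uniform(-0.05, 0.05)))
        return x, y

    h = lambda x: x                      # a candidate hypothesis in H
    sample = [draw() for _ in range(500)]
    print(empirical_error(h, sample), true_error(h, draw))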

We have the following definition of PAC-generalization in this context. Again, the motivating idea is that we want to be sure that, with high probability, on a large enough sample, the sample-based estimate of the error of any $h \in H$ is close to the true error of $h$.

Definition 4.1 Let $H$ be a set of functions from a set $X$ into $[0,1]$. We say that $H$ PAC-generalizes if for any $\epsilon, \delta \in (0,1)$, there is $m_0(\epsilon,\delta)$ such that for any probability measure $P$ on $Z = X \times [0,1]$,

$$P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{h \in H} |\mathrm{er}_{\mathbf{z}}(h) - \mathrm{er}_P(h)| \geq \epsilon\Big\}\Big) < \delta$$

for $m \geq m_0(\epsilon,\delta)$.

This definition of generalization leads to certain types of 'learning' result (and, indeed, one could motivate it by the


desire to obtain such results). Specifically, regarding a learning algorithm as a function from $\bigcup_{m=1}^{\infty} Z^m$ to $H$, it is straightforward to show that the following result on (the obvious extension of) probably approximately optimal learning holds.

Theorem 4.2 (Haussler, 1992) Suppose that $H$ is a class of functions from $X$ to $[0,1]$ and that $l$ is a loss function. Suppose also that $H$ PAC-generalizes (with respect to the loss function $l$). Then, there is a learning algorithm $L$ such that the following holds: given $\epsilon, \delta \in (0,1)$, there is $m_L(\epsilon,\delta)$ such that for any probability measure $P$ on $Z = X \times [0,1]$ and for $m \geq m_L(\epsilon,\delta)$, with probability at least $1-\delta$, a $P^m$-random sample $\mathbf{z} \in Z^m$ is such that

$$\mathrm{er}_P(L(\mathbf{z})) < \inf_{h \in H} \mathrm{er}_P(h) + \epsilon.$$

4.2. Generalization from interpolation

Another approach to the generalization of real functions is to consider 'generalization from approximate interpolation', where we can develop two distinct models (Anthony et al., 1996; Anthony and Bartlett, 1995). This approach is less general than the loss functions approach, in that it extends Definition 2.2 rather than Definition 2.4. For these models of generalization, we do have a target real function $t : X \to \mathbb{R}$ together with a probability measure $\mu$ on $X$, and the aim is to find a good approximation to $t$ from $H$.

To motivate these definitions of generalization, we recall Definition 2.2, the first, most basic, definition of PAC-generalization for classes of $\{0,1\}$-valued functions. The key part of the definition was the requirement that for $m \geq m_0(\epsilon,\delta)$, with $\mu^m$-probability at least $1-\delta$, a $\mu^m$-randomly drawn $\mathbf{x} \in X^m$ is such that

$$h \in H \text{ and } h(x_i) = t(x_i) \text{ for } i = 1,2,\ldots,m \implies \mathrm{er}_\mu(h) = \mu(\{x \in X : h(x) \neq t(x)\}) < \epsilon.$$

As we mentioned earlier, it would be too coarse to carry this definition over, as it stands, to real function classes. Suppose that we replace the too-stringent '$h(x) = t(x)$' by the condition '$h(x)$ is within $\eta$ of $t(x)$', where $\eta$ is some small, chosen number in $(0,1)$. That is, we wish to be sure that, with probability at least $1-\delta$, if $h$ is an $\eta$-interpolant of $t$ on the sample, in the sense that $t(x_i) - \eta < h(x_i) < t(x_i) + \eta$ for $i = 1,2,\ldots,m$, then $h(x)$ is within $\eta$ of $t(x)$ on a set of measure at least $1-\epsilon$. Then we arrive at the following definition (Anthony and Shawe-Taylor, 1994b; Anthony et al., 1996).

Definition 4.3 Let $H$ be a set of functions from $X$ to $[0,1]$. We say that $H$ generalizes from approximate interpolation if for all $\epsilon, \delta, \eta \in (0,1)$, there is $m_0(\eta,\epsilon,\delta)$ such that, for all probability measures $\mu$ on $X$ and all $t : X \to \mathbb{R}$, if $m \geq m_0(\eta,\epsilon,\delta)$ then with $\mu^m$-probability at least $1-\delta$, a sample $\mathbf{x} = (x_1, x_2, \ldots, x_m) \in X^m$ is such that the following implication holds for every $h \in H$:

$$|h(x_i) - t(x_i)| < \eta \ (1 \leq i \leq m) \implies \mu(\{x : |h(x) - t(x)| \geq \eta\}) < \epsilon.$$

This definition may seem to be rather demanding: one expects, with high probability, to be able to deduce from the fact that $h$ is within $\eta$ of $t$ on a random sample that it is within $\eta$ of $t$ almost everywhere else (with respect to $\mu$). We may weaken this requirement by requiring such an $h$ to be close to $t$ almost everywhere, but not necessarily as close as $\eta$. This results in the following definition (Anthony and Bartlett, 1995).

Definition 4.4 Let $H$ be a set of functions from $X$ to $[0,1]$. We say that $H$ weakly generalizes from approximate interpolation if for all $\epsilon, \delta, \eta, \gamma \in (0,1)$, there is $m_0(\eta,\gamma,\epsilon,\delta)$ such that, for all probability measures $\mu$ on $X$ and all $t : X \to \mathbb{R}$, if $m \geq m_0(\eta,\gamma,\epsilon,\delta)$ then with $\mu^m$-probability at least $1-\delta$, a sample $\mathbf{x} = (x_1, x_2, \ldots, x_m) \in X^m$ is such that the following implication holds for every $h \in H$:

$$|h(x_i) - t(x_i)| < \eta \ (1 \leq i \leq m) \implies \mu(\{x : |h(x) - t(x)| \geq \eta + \gamma\}) < \epsilon.$$

5. Uniform convergence of empiricals

5.1. Definitions and connections with generalization

We have seen how PAC-generalization for $\{0,1\}$-valued function classes relates directly to the uniform convergence of relative frequencies to probabilities. Here, we explain how PAC-generalization for classes of real functions relates to results in probability theory on the uniform convergence of empiricals to expectations.

Let $Z$ be a set and $K$ a $\sigma$-algebra on $Z$. Suppose that $F$ is a set of random variables on $Z$, each with range $[0,1]$. For $\mathbf{z} \in Z^m$, the empirical estimate of the random variable $f \in F$ on $\mathbf{z}$ is

$$\hat{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^{m} f(z_i).$$

One says that the empiricals (of $F$) converge uniformly to expectations if there is a function $f(m,\epsilon)$ such that for every $\epsilon \in (0,1)$, $f(m,\epsilon) \to 0$ as $m \to \infty$, and

$$\forall P, \quad P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{f \in F} |\hat{E}_{\mathbf{z}}(f) - E(f)| \geq \epsilon\Big\}\Big) < f(m,\epsilon),$$

where '$\forall P$' means for all probability measures defined on $(Z,K)$. Note that this is an extension of the notion of the uniform convergence of relative frequencies to probabilities, since the probability and relative frequency of an event are, respectively, the expectation and the empirical estimate of its indicator (or characteristic) function.

Returning to the loss functions approach to function generalization, given any $h \in H$, let $l_h$ be the function from


$Z$ to $[0,1]$ defined by $l_h(x,y) = l(h(x),y)$. Then, as is easily verified,

$$\mathrm{er}_P(h) = E_P(l_h), \qquad \mathrm{er}_{\mathbf{z}}(h) = \hat{E}_{\mathbf{z}}(l_h).$$

From this, it is clear that PAC-generalization as defined in Definition 4.1 is equivalent to the uniform convergence of empiricals to expectations for the class of random variables $F = \{l_h : h \in H\}$, usually called the loss space and denoted $l_H$ (Haussler, 1992).

5.2. Covering numbers

The notion of covering numbers turns out to be central to results on uniform convergence of empiricals to expectations. Given a pseudo-metric space $(Y,d)$ and a subset $S$ of $Y$, we say that the set $T \subseteq Y$ is an $\epsilon$-cover for $S$ (where $\epsilon > 0$) if, for every $s \in S$ there is $t \in T$ such that $d(s,t) \leq \epsilon$. For a fixed $\epsilon > 0$ we denote by $N(\epsilon, S, d)$ the cardinality of the smallest $\epsilon$-cover for $S$. (We define $N(\epsilon, S, d)$ to be $\infty$ if there is no such cover.)

Suppose that $F$ is a set of $[0,1]$-random variables defined on $Z$. For $\mathbf{z} = (z_1, z_2, \ldots, z_m)$, we denote by $F(\mathbf{z})$ the following subset of $\mathbb{R}^m$:

$$F(\mathbf{z}) = \big\{\big(f(z_1), f(z_2), \ldots, f(z_m)\big) : f \in F\big\}.$$

Pollard (1984) (see also Haussler, 1992) obtained the following result.

Theorem 5.1 Suppose that $F$ is a permissible* set of $[0,1]$-valued random variables on $Z$ and that $P$ is any probability measure on $Z$. Then, for any positive integer $m$ and any $\epsilon > 0$,

$$P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{f \in F} |\hat{E}_{\mathbf{z}}(f) - E_P(f)| \geq \epsilon\Big\}\Big) \leq 4\, E_{P^{2m}}\Big[N\Big(\frac{\epsilon}{16}, F(\mathbf{z}), d\Big)\Big] \exp\Big(-\frac{\epsilon^2 m}{128}\Big),$$

where $d$ is the $L^1$-metric, given, on $\mathbb{R}^{2m}$, by $d(r,s) = \frac{1}{2m} \sum_{i=1}^{2m} |r_i - s_i|$.
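To make the covering numbers appearing in Theorem 5.1 concrete, here is a minimal Python sketch (an illustration, not part of the paper): it forms F(z) for a small finite class and builds an epsilon-cover greedily under the normalized L1 pseudo-metric. A greedy cover is generally not the smallest one, so its size only gives an upper bound on N(epsilon, F(z), d); the toy class of clipped 'ramp' functions is an assumption chosen for the example.

def l1_distance(r, s):
    """The normalized L1 pseudo-metric: d(r, s) = (1/m) * sum_i |r_i - s_i|."""
    return sum(abs(a - b) for a, b in zip(r, s)) / len(r)

def restriction(functions, z):
    """F(z) = { (f(z_1), ..., f(z_m)) : f in F }, as a list of tuples."""
    return [tuple(f(zi) for zi in z) for f in functions]

def greedy_cover_size(points, eps, dist=l1_distance):
    """Size of a greedily built eps-cover of `points`; an upper bound on the covering number."""
    cover = []
    for p in points:
        if all(dist(p, c) > eps for c in cover):
            cover.append(p)
    return len(cover)

if __name__ == "__main__":
    # A toy class F of clipped 'ramp' functions on Z = [0,1], restricted to a sample z.
    F = [lambda x, a=a / 50: min(1.0, max(0.0, x - a + 0.5)) for a in range(51)]
    z = [i / 20 for i in range(21)]
    print(greedy_cover_size(restriction(F, z), eps=0.05))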

Haussler (1992) has improved this result in a number of ways, but the form given here is sufficient for our purposes.

From this result, it can be seen that uniform convergence of empiricals to expectations will occur if the covering numbers can be bounded in such a way that the bound of Theorem 5.1 tends to 0 as $m \to \infty$. Such bounds have been obtained in terms of 'dimensions' which characterise the 'richness' of the class $F$ in much the same way as the VC-dimension does for $\{0,1\}$-valued classes.

5.3. The pseudo-dimension

The pseudo-dimension (also known as the Pollard dimension) was introduced by Pollard (1984). With the notation as above, we say that $\mathbf{z} \in Z^m$ is P-shattered by $F$ if some translate $r + F(\mathbf{z})$ of $F(\mathbf{z})$ intersects all orthants of $\mathbb{R}^m$. The pseudo-dimension of $F$, denoted $\mathrm{Pdim}(F)$, is the largest length of a P-shattered $\mathbf{z}$ (or it is infinite, if there is no bound on the lengths of P-shattered $\mathbf{z}$). We state the definition formally in a rather more explicit way.

Definition 5.2 (Pseudo-dimension) With the usual notation, $\mathbf{z} \in Z^m$ is pseudo-shattered by $F$ if there are $r_1, r_2, \ldots, r_m \in \mathbb{R}$ such that for any $b \in \{0,1\}^m$, there is $f_b \in F$ with

$$f_b(z_i) \geq r_i \iff b_i = 1.$$

The largest $d$ such that some $\mathbf{z} \in Z^d$ is P-shattered is the pseudo-dimension of $F$, denoted $\mathrm{Pdim}(F)$. (When this maximum does not exist, the pseudo-dimension is taken to be infinite.)
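As an illustration (not from the paper), the brute-force check below decides whether a given sample z is pseudo-shattered by a small finite class, by searching over witness values r_i and over all sign patterns b; the restriction of the witnesses to achievable values, and the toy class of affine functions, are choices made only for this example, and the search is exponential in the sample length.

from itertools import product

def pseudo_shattered(functions, z):
    """Brute-force test of Definition 5.2 for a finite class on a short sample z.

    Witness values r_i are drawn from the achievable values f(z_i); if some real
    witnesses work, then witnesses of this restricted form work as well.
    """
    m = len(z)
    candidate_values = [sorted({f(zi) for f in functions}) for zi in z]
    for r in product(*candidate_values):
        patterns = {tuple(1 if f(zi) >= ri else 0 for zi, ri in zip(z, r))
                    for f in functions}
        if len(patterns) == 2 ** m:      # every b in {0,1}^m is realized
            return True
    return False

if __name__ == "__main__":
    # A finite family of affine functions on R; affine functions pseudo-shatter
    # any pair of distinct points but no triple (their linear dimension is 2).
    F = [lambda x, a=a, b=b: a * x + b
         for a in (-1.0, -0.5, 0.0, 0.5, 1.0) for b in (-1.0, 0.0, 1.0)]
    print(pseudo_shattered(F, [0.0, 1.0]))        # True
    print(pseudo_shattered(F, [0.0, 1.0, 2.0]))   # False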

Note that when the class $F$ in fact maps into the set $\{0,1\}$ rather than the interval $[0,1]$, the definition of pseudo-dimension reduces to the VC-dimension. Furthermore, when $F$ is a vector space of real functions, the pseudo-dimension of $F$ is precisely the linear dimension of $F$; see Haussler (1992).

The following results follow from one due to Pollard (1984); see Haussler (1992).

Theorem 5.3 Suppose that $d$ is the $L^1$-metric on $\mathbb{R}^k$, where $k$ is any positive integer, and that $\mathbf{z} \in Z^k$. Suppose also that $F$ has finite pseudo-dimension. Then, for any $\epsilon \in (0,1)$,

$$N(\epsilon, F(\mathbf{z}), d) < 2\left(\frac{2e}{\epsilon} \ln \frac{2e}{\epsilon}\right)^{\mathrm{Pdim}(F)}.$$

Thus, when $F$ has finite pseudo-dimension, $d$, we see from Theorem 5.1 that, for any $P$,

$$P^m\Big(\Big\{\mathbf{z} \in Z^m : \sup_{f \in F} |\hat{E}_{\mathbf{z}}(f) - E(f)| \geq \epsilon\Big\}\Big) < f(m,\epsilon),$$

where

$$f(m,\epsilon) = 8\left(\frac{32e}{\epsilon} \ln \frac{32e}{\epsilon}\right)^{d} \exp\left(-\frac{\epsilon^2 m}{128}\right) \to 0 \quad \text{as } m \to \infty,$$

for each fixed $\epsilon$. One therefore has the following result.

Theorem 5.4 If $F$ has finite pseudo-dimension then the empiricals of the random variables in $F$ converge uniformly to their expectations.

* The set of random variables must satisfy some measurability conditions; see Pollard (1984) and Haussler (1992) for details. These are not particularly stringent, and we refer to a class with the required properties as a permissible class.

Returning to function generalization, and recalling that what we require for PAC-generalization is uniform convergence of empiricals to expectations for the loss space $l_H$, we obtain the following result of Haussler (1992).

Corollary 5.5 If $l_H$ has finite pseudo-dimension then $H$ PAC-generalizes.

For certain loss functions $l$, the pseudo-dimension of $l_H$ can be related directly to the pseudo-dimension of $H$; see Haussler (1992).

5.4. Scale-sensitive pseudo-dimension

Recently, it has been shown that finiteness of the pseudo-dimension of $F$ is a stronger condition than is needed for uniform convergence of empiricals to expectations. Alon et al. (1993) have determined a weaker sufficient condition for the uniform convergence. Their characterization involves a 'scale-sensitive' counterpart of the pseudo-dimension, which was introduced by Kearns and Schapire (1990) in their work on the learnability of probabilistic concepts, and which, after Bartlett et al. (1994), we shall call the fat-shattering function.

Definition 5.6 (fat-shattering) Suppose that $F$ is a set of functions from $Z$ to $[0,1]$ and that $\gamma > 0$. We say that $\mathbf{z} \in Z^m$ is $\gamma$-shattered if there is $r = (r_1, r_2, \ldots, r_m) \in \mathbb{R}^m$ such that for every $b = (b_1, b_2, \ldots, b_m) \in \{0,1\}^m$, there is a function $f_b \in F$ with $f_b(z_i) \geq r_i + \gamma$ if $b_i = 1$ and $f_b(z_i) \leq r_i - \gamma$ if $b_i = 0$.

Thus, $\mathbf{z}$ is $\gamma$-shattered if it is shattered with a 'width of shattering' of at least $\gamma$. We define the fat-shattering function, $\mathrm{fat}_F : \mathbb{R}^+ \to \mathbb{N} \cup \{0,\infty\}$, by defining $\mathrm{fat}_F(\gamma)$ to be the largest $d$ such that some $\mathbf{z} \in Z^d$ is $\gamma$-shattered. (We define $\mathrm{fat}_F(\gamma) = \infty$ if there is no maximum such $d$.) It is easy to see that $\mathrm{Pdim}(F) = \lim_{\gamma \to 0} \mathrm{fat}_F(\gamma)$. It should be noted, however, that it is possible for the pseudo-dimension to be infinite, even when $\mathrm{fat}_F(\gamma)$ is finite for all positive $\gamma$. We shall say that $F$ has finite fat-shattering function whenever it is the case that for all $\gamma \in (0,1)$, $\mathrm{fat}_F(\gamma)$ is finite.

Alon et al. (1993) bounded the covering numbers in terms of the fat-shattering function.

Theorem 5.7 (Alon et al. (1993)) Suppose that $F$ is a set of $[0,1]$-valued random variables on $Z$ and that $F$ has finite fat-shattering function. Let $m$ be a positive integer. Suppose $\gamma > 0$ and that $d = \mathrm{fat}_F(\gamma/4)$. Let

$$B = \sum_{i=1}^{d} \binom{m}{i} \left(\frac{2}{\gamma}\right)^{i}.$$

Then, provided $m \geq \log B + 1$, for any $\mathbf{z} \in Z^m$,

$$N(\gamma, F(\mathbf{z}), d_\infty) < 2\left(m\left(\frac{2}{\gamma}\right)^2\right)^{\log B},$$

where $d_\infty$ is the $L^\infty$-metric on $\mathbb{R}^m$, given by $d_\infty(r,s) = \max_{1 \leq i \leq m} |r_i - s_i|$.

Now, for any $r, s \in \mathbb{R}^m$, the $L^1$-distance is bounded by the $L^\infty$-distance:

$$d(r,s) = \frac{1}{m} \sum_{i=1}^{m} |r_i - s_i| \leq \max_{1 \leq i \leq m} |r_i - s_i| = d_\infty(r,s).$$

This means that any $\epsilon$-cover with respect to $d_\infty$ is also an $\epsilon$-cover with respect to $d$, and hence, for any $F$, for all $\epsilon$, and all $\mathbf{z}$, $N(\epsilon, F(\mathbf{z}), d) \leq N(\epsilon, F(\mathbf{z}), d_\infty)$. Thus, the bound of Theorem 5.7 is a bound on the required covering numbers. This bound is sub-exponential in $m$ and so, as earlier, it leads to a uniform convergence result: if $F$ has finite fat-shattering function then the empiricals converge to their expectations uniformly. In fact, as shown in Alon et al. (1993), the converse is also true.
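Before moving on, here is a purely numerical Python sketch (not from the paper) of the quantity B and of the covering-number bound of Theorem 5.7; base-2 logarithms are an assumption, since the statement above does not fix the base, and the bound is reported on a log2 scale to avoid overflow. For large m its logarithm grows roughly like a constant times (log m)^2, which is the sub-exponential behaviour just used.

import math

def B(m, d, gamma):
    """B = sum_{i=1}^{d} C(m, i) * (2/gamma)^i, as in Theorem 5.7."""
    return sum(math.comb(m, i) * (2 / gamma) ** i for i in range(1, d + 1))

def log2_covering_bound(m, d, gamma):
    """log2 of the bound 2 * (m * (2/gamma)^2)^(log B), assuming base-2 logarithms."""
    log_B = math.log2(B(m, d, gamma))
    assert m >= log_B + 1, "Theorem 5.7 requires m >= log B + 1"
    return 1 + log_B * math.log2(m * (2 / gamma) ** 2)

if __name__ == "__main__":
    d, gamma = 5, 0.1
    for m in (10_000, 100_000, 1_000_000):
        # Grows far more slowly than m itself, i.e. the bound is sub-exponential in m.
        print(m, round(log2_covering_bound(m, d, gamma)))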

Theorem 5.8 (Alon et al. (1993)) One has uniform convergence of empiricals to expectations for the class $F$ of $[0,1]$-random variables if and only if $F$ has finite fat-shattering function.

We have the following immediate corollary for generalization.

Theorem 5.9 (Alon et al. (1993)) The class of functions $H$ from $X$ to $[0,1]$ PAC-generalizes (with respect to loss function $l$) if and only if the loss space $l_H$ has finite fat-shattering function.

6. Generalization from interpolation

6.1. The strong model

The problem of generalization from approximate interpolation may be regarded to some extent as a problem within the loss functions approach to function learning. To see this, let us fix $\eta \in (0,1)$ and take $l_\eta$ to be the loss function given by

$$l_\eta(y,y') = 0 \text{ if } |y - y'| < \eta, \qquad l_\eta(y,y') = 1 \text{ if } |y - y'| \geq \eta.$$

Then, for fixed $\eta$, generalization from interpolation is equivalent to a restricted form of PAC-generalization with respect to $l_\eta$, in which we only consider distributions $P$ on $Z = X \times \mathbb{R}$ which correspond (in the obvious fashion) to pairs $(t,\mu)$ where $t$ is a function on $X$ and $\mu$ is a probability measure on $X$. The loss space $l^{\eta}_H$ is $\{0,1\}$-valued, so its pseudo-dimension and fat-shattering function are precisely its VC-dimension. In Anthony et al. (1996), a scale-sensitive dimension $\mathrm{Bdim}_H$ is defined as follows: for $\gamma \in (0,1)$,

$$\mathrm{Bdim}_H(\gamma) = \mathrm{VCdim}(l^{\gamma}_H).$$


We call this the band dimension and say that $H$ has finite band dimension if $\mathrm{Bdim}_H(\gamma)$ is finite for all $\gamma$. In Anthony et al. (1996), the following result is obtained.

Theorem 6.1 The class $H$ generalizes from approximate interpolation if and only if it has finite band dimension.

The band dimension is not a well-known measure of dimension, having appeared only rarely in other work on learning theory (such as Natarajan, 1993). However, it can be related to the pseudo-dimension (Anthony et al., 1996), as follows.

Theorem 6.2 Suppose that $H$ maps from a domain $X$ into a bounded real interval. Then $H$ has finite band dimension if and only if it has finite pseudo-dimension. Furthermore, there are constants $c_1, c_2 > 0$ such that for all $\gamma \in (0,1)$,

$$\frac{c_1\, \mathrm{Pdim}(H)}{\log(1/\gamma)} \leq \mathrm{Bdim}_H(\gamma) \leq c_2\, \mathrm{Pdim}(H).$$

Corollary 6.3 $H$ generalizes from approximate interpolation if and only if $H$ has finite pseudo-dimension.

Thus, although it looks like a very difficult definition to satisfy, generalization from interpolation holds for many natural classes of functions. The fact that finite pseudo-dimension of $H$ is necessary for generalization from approximate interpolation is, in some ways, contrary to the results for PAC-generalization; there, the finiteness condition on the pseudo-dimension (namely, $\mathrm{Pdim}(l_H) < \infty$) is not necessary and can be replaced by finiteness of the fat-shattering function.

The following result from Anthony et al. (1996) provides an indication of the appropriate size of $m_0(\eta,\epsilon,\delta)$.

Theorem 6.4 Suppose that $H$ has finite pseudo-dimension $d$. Then there is a constant $c$ such that a sufficient value of $m_0(\eta,\epsilon,\delta)$ for generalization from approximate interpolation is

$$\frac{c}{\epsilon}\left(\mathrm{Pdim}(H) \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right).$$
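As a small numerical illustration (not from the paper), the sketch below evaluates the sufficient sample length of Theorem 6.4; the constant c is not determined in the theorem, so the value c = 1 used here is an arbitrary assumption and only the way the bound scales with Pdim(H), epsilon and delta is meaningful.

import math

def m0_interpolation(pdim, eps, delta, c=1.0):
    """Sufficient sample length from Theorem 6.4; c is an undetermined constant set to 1 here."""
    return math.ceil((c / eps) * (pdim * math.log(1 / eps) + math.log(1 / delta)))

if __name__ == "__main__":
    for pdim in (5, 50, 500):
        print(pdim, m0_interpolation(pdim, eps=0.05, delta=0.01))   # grows linearly in Pdim(H)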

The paper by Anthony et al. (1996) contains many other results on generalizing from approximate interpolation, including a characterization of those measures of dimension whose finiteness is a necessary and sufficient condition for such generalization.

6.2. The weak model

Since finite pseudo-dimension is a sufficient condition for generalization from approximate interpolation, it is also a sufficient condition for weak generalization from approximate interpolation. However, the following result has been obtained.

Theorem 6.5 $H$ weakly generalizes from approximate interpolation if and only if $H$ has finite fat-shattering function.

Unlike generalization from approximate interpolation, the problem of weak generalization cannot be expressed directly as a problem involving loss functions. The 'if' part of Theorem 6.5 follows from the following 'convergence' result, which is implicit in Anthony and Bartlett (1995).

Theorem 6.6 Suppose that $H$ is a class of functions mapping from a domain $X$ to the real interval $[0,1]$ and that $H$ has finite fat-shattering function. Let $t$ be any function from $X$ to $\mathbb{R}$ and let $\gamma, \eta, \epsilon \in (0,1)$. Let $\mu$ be any probability distribution on $X$ and $m$ any positive integer. Define the subset $Q$ of $X^m$ to be the set of $\mathbf{x}$ for which there exists $h \in H$ such that

$$\mu\big(\{x \in X : |h(x) - t(x)| \geq \eta + \gamma\}\big) > \epsilon \quad \text{and} \quad |h(x_i) - t(x_i)| < \eta \ (1 \leq i \leq m).$$

Then

$$\mu^m(Q) < 2\, E_{\mu^{2m}}\Big[N\Big(\frac{\gamma}{2}, H(\mathbf{z}), d_\infty\Big)\Big]\, 2^{-\epsilon m/2},$$

where $d_\infty$ is the $L^\infty$-metric on $\mathbb{R}^{2m}$, given by $d_\infty(r,s) = \max_{1 \leq i \leq 2m} |r_i - s_i|$.

Combining this with the result of Alon et al. (1993), Theorem 5.7, gives the 'if' part of Theorem 6.5. The following result from Anthony and Bartlett (1995) indicates a suitable value of $m_0(\eta,\gamma,\epsilon,\delta)$.

Theorem 6.7 Suppose that $H$ maps from a domain $X$ into $[0,1]$ and that $H$ has finite fat-shattering function. There is a constant $K$ such that a sufficient sample length for weak generalization from approximate interpolation is

$$m_0(\eta,\gamma,\epsilon,\delta) = \frac{K}{\epsilon}\left(\ln\frac{1}{\delta} + \mathrm{fat}_H(\gamma/8)\, \ln^2\!\left(\frac{\mathrm{fat}_H(\gamma/8)}{\gamma\epsilon}\right)\right).$$

The proof that finite fat-shattering function is necessary can be found in Anthony and Bartlett (1995).

6.3. Some applications to artificial neural networks

In this section, we explain how the results on generalization from interpolation can be used to obtain results for certain types of neural network. (See, for example, Hertz et al. (1991) for the basic definitions of neural networks.) We consider here artificial neural networks $N$ having one hidden layer, where each hidden node has a linear threshold activation function, and in which the activation function of the single output node is the identity function (so that it outputs the weighted sum of its inputs). We do not assume that the number of hidden units is known. We do, however, make the assumption that the weights from the hidden layer to the output node are bounded in such a way that the sum of the absolute values of all these weights is at most a fixed, known, constant $B$. (If bounds are given on the number $k$ of hidden nodes and on the absolute value of each weight from the hidden layer to the output node, then we certainly have such a bound $B$: thus the restriction we impose on the weights is weaker than imposing a restriction on the number of hidden units and on the magnitude of the weights.) Each state of such a neural network $N$ is described by weight vectors $b \in \mathbb{R}^k$ and $w_1, w_2, \ldots, w_k \in \mathbb{R}^n$. The assumption on the weights is that $\sum_{i=1}^{k} |b_i| \leq B$. Let $\mathrm{sign} : \mathbb{R} \to \{0,1\}$ be given by $\mathrm{sign}(x) = 1 \iff x \geq 0$. Then we may describe the set of functions computable by the neural network explicitly, as follows. Let $\omega = (b, w_1, w_2, \ldots, w_k)$ denote a typical state of the network. Then the function computed by the network in state $\omega$ is $h_\omega : \mathbb{R}^n \to \mathbb{R}$ given by

$$h_\omega(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{k} b_j\, \mathrm{sign}\left(\sum_{i=1}^{n} w_{ji} x_i\right).$$

The set H of all functions computable by such a neuralnetwork architecture is then

H ��

hx : x � �b;w1;w2; . . . ;wk�; where k 2 N

andXk

j�1jbjj � B

�:
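For concreteness, here is a minimal Python sketch (not from the paper) of the function h_omega computed by a state of such a network; the particular weights in the example are arbitrary, subject only to the constraint that the output weights sum in absolute value to at most B.

def sign(x):
    """sign : R -> {0,1}, with sign(x) = 1 iff x >= 0."""
    return 1 if x >= 0 else 0

def h(state, x):
    """The network output h_omega(x) = sum_j b_j * sign(w_j . x).

    `state` is a pair (b, w): b is the list of k output weights and w is a
    list of k weight vectors, each of length n; x is an input of length n.
    """
    b, w = state
    return sum(bj * sign(sum(wji * xi for wji, xi in zip(wj, x)))
               for bj, wj in zip(b, w))

if __name__ == "__main__":
    B = 2.0                                    # bound on the sum of |b_j|
    b = [1.0, -0.5, 0.5]                       # 1.0 + 0.5 + 0.5 = 2.0 <= B
    w = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
    print(h((b, w), [0.3, 0.4]))               # -0.5 for this state and input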

Results from Gurvits and Koiran (1995) show that

$$\mathrm{fat}_H(\alpha) \leq K\, \frac{B^2 n^2}{\alpha^2} \ln\left(\frac{Bn}{\alpha}\right),$$

where $K$ is some fixed constant.

The following result is obtained by using this bound together with Theorem 6.5 and Theorem 6.7 (or, rather, their simple modifications for classes of functions mapping into the interval $[0,B]$). For the sake of simplicity, we have not explicitly determined the constants involved.

Theorem 6.8 There are constants $c$, $\epsilon_0$ and $\gamma_0$ such that the following holds. Let $\epsilon$, $\delta$ and $\eta$ be fixed positive numbers less than 1, with $\epsilon < \epsilon_0$ and $\gamma < \gamma_0$ (where $\gamma$ denotes the quantity $cBn(\ln m)^2/(\sqrt{\epsilon}\sqrt{m})$ appearing below). Suppose that $N$ is any one-hidden-layer network of the type described above and let $\mu$ be any probability distribution on $X$, the set of all inputs. Suppose that the target function $t$ is computable by the network. Suppose also that $L$ is a learning algorithm for $N$ with the property that if $\mathbf{s} = ((x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m))$ is any training sample for $t$ of length $m$, then $h_\omega = L(\mathbf{s})$ satisfies $|h_\omega(x_i) - y_i| < \eta$ for $1 \leq i \leq m$. Then, if a training sample $\mathbf{s}$ is generated by a $\mu^m$-random choice of $(x_1, x_2, \ldots, x_m)$, where $m > (16/\epsilon)\ln(4/\delta)$, then, with probability at least $1-\delta$, $L(\mathbf{s}) = h_\omega$ is such that

$$\mu\left(\left\{x : |h_\omega(x) - t(x)| \geq \eta + \frac{cBn(\ln m)^2}{\sqrt{\epsilon}\sqrt{m}}\right\}\right) < \epsilon.$$

Proof. It follows from Theorem 6.6 together with Theorem 5.7 (see Anthony and Bartlett, 1995) that a sufficient sample length $m_0(\eta,\gamma,\epsilon,\delta)$ for (the class of functions computable by) $N$ to weakly generalize from approximate interpolation is

$$m_0(\eta,\gamma,\epsilon,\delta) = \frac{8}{\epsilon}\left(3d \ln^2\!\left(\frac{1200\, d\, B^2}{\epsilon \gamma^2 \ln^2 2}\right) + \ln\frac{4}{\delta}\right),$$

where $d = \mathrm{fat}_H(\gamma/8)$. Let $m/2 > (8/\epsilon)\ln(4/\delta)$. Then $m$ will be at least $m_0$ if

$$\frac{m}{2} > \frac{24 d}{\epsilon} \ln^2\!\left(\frac{1200\, d\, B^2}{\epsilon \gamma^2 \ln^2 2}\right).$$

Recalling that $\mathrm{fat}_H(\alpha)$ is bounded by $K(B^2 n^2/\alpha^2)\ln(Bn/\alpha)$, there is a constant $c_1$ such that this will be true if $m > c_1 b^2 (\ln b)^3$, where $b = Bn/(\gamma\sqrt{\epsilon})$. There is a constant $c_2$ such that this inequality holds if $b < c_2 \sqrt{m}/(\ln m)^2$; for then, $b^2(\ln b)^3$ is of order no more than

$$\frac{m}{(\ln m)^4}(\ln m)^3 = \frac{m}{\ln m}.$$

(Clearly, this same argument works provided we take $b < c_2 \sqrt{m}/(\ln m)^{3/2 + x}$, where $x$ is any fixed positive number, but the present choice ($x = 1/2$) suffices for our purpose.) Now,

$$\frac{Bn}{\gamma\sqrt{\epsilon}} = b < \frac{c_2 \sqrt{m}}{(\ln m)^2}$$

means we may take $\gamma = c\,Bn(\ln m)^2/(\sqrt{\epsilon}\sqrt{m})$ for some constant $c$, as required. □
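As a rough numerical illustration (not from the paper) of the quantities used in this proof, the sketch below combines the Gurvits and Koiran (1995) bound on fat_H with the sample-length expression of Theorem 6.7; neither constant K is determined in the text, so both are set to 1 here, which is an arbitrary assumption, and only the scaling behaviour of the result is meaningful.

import math

def fat_bound(alpha, B, n, K=1.0):
    """Gurvits-Koiran bound: fat_H(alpha) <= K * (B^2 n^2 / alpha^2) * ln(B n / alpha)."""
    return K * (B ** 2) * (n ** 2) / alpha ** 2 * math.log(B * n / alpha)

def m0_weak(gamma, eps, delta, B, n, K=1.0):
    """Sufficient sample length from Theorem 6.7, with fat_H(gamma/8) replaced by its bound.

    The parameter eta does not enter the bound, so it is omitted here.
    """
    d = fat_bound(gamma / 8, B, n)
    return math.ceil((K / eps) * (math.log(1 / delta)
                                  + d * math.log(d / (gamma * eps)) ** 2))

if __name__ == "__main__":
    print(m0_weak(gamma=0.1, eps=0.05, delta=0.01, B=2.0, n=10))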

Thus, if we fix in advance the accuracy and confidence, $\epsilon$ and $\delta$, and if we have a learning procedure which will $\eta$-interpolate on any training sample of length $m$ for $t$, we can be confident (with probability at least $1-\delta$) that the final state of the network will estimate $t(x)$ to within an error margin of $\eta + cBn(\ln m)^2/(\sqrt{\epsilon}\sqrt{m})$ on most inputs. (Formally, it will compute within this error margin on all inputs but for those in some set having probability less than $\epsilon$.)

Consider now the subclass of these neural networks in which the number of hidden units is fixed at some number $k$. We can bound the pseudo-dimension of the class of functions computable by the network $N$ in terms of $k$ (Gurvits and Koiran, 1995): specifically, the pseudo-dimension is of order $kn\ln(kn)$. (In fact, the same bound also holds without the restriction on the weights into the output node.) From Theorem 6.4, we have the following result.

Theorem 6.9 Let $\epsilon$, $\delta$ and $\eta$ be fixed positive numbers. Let $N$ be a network of the type just described, having $n$ input nodes and $k$ hidden nodes. Suppose that $\mu$ is a probability distribution on $X$, the set of all inputs, and that the target function $t$ is computable by the network. Suppose also that the learning algorithm $L$ is such that if $\mathbf{s} = ((x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m))$ is any training sample for $t$ of length $m$, then $h_\omega = L(\mathbf{s})$ satisfies $|h_\omega(x_i) - y_i| < \eta$ for $1 \leq i \leq m$. Then there is a constant $c$, depending only on $\epsilon$ and $\delta$, such that the following holds for $m > c\,kn\ln(kn)$: if a training sample $\mathbf{s}$ is generated by a $\mu^m$-random choice of $(x_1, x_2, \ldots, x_m)$, then, with probability at least $1-\delta$, $L(\mathbf{s}) = h_\omega$ satisfies

$$\mu\big(\{x \in X : |h_\omega(x) - t(x)| \geq \eta\}\big) < \epsilon.$$

Acknowledgements

The author's work was supported in part by the European Union through the ESPRIT project 'Neurocolt'.

References

Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993) Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Press.

Anthony, M. and Bartlett, P. (1995) Function learning from interpolation, submitted. (An extended abstract appears in Proceedings of Eurocolt'95, Springer-Verlag.)

Anthony, M. and Biggs, N. (1992) Computational Learning Theory: An Introduction. Cambridge Tracts in Theoretical Computer Science (30). Cambridge: Cambridge University Press.

Anthony, M. and Shawe-Taylor, J. (1994a) A result of Vapnik with applications. Discrete Applied Mathematics, 47, 207–17.

Anthony, M. and Shawe-Taylor, J. (1994b) Valid generalisation of functions from close approximations on a sample. In Proceedings of Euro-COLT'93. Oxford University Press.

Anthony, M., Bartlett, P., Ishai, Y. and Shawe-Taylor, J. (1996) Valid generalisation from approximate interpolation. Combinatorics, Probability and Computing, 5, 191–214.

Bartlett, P. L., Long, P. M. and Williamson, R. C. (1994) Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory. New York: ACM Press.

Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. K. (1989) Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–65.

Gurvits, L. and Koiran, P. (1995) Approximation and learning of convex superpositions. In Proceedings of Eurocolt'95, Springer-Verlag Lecture Notes in Artificial Intelligence, pp. 222–36. Springer-Verlag.

Haussler, D. (1992) Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1), 78–150.

Hertz, J., Krogh, A. and Palmer, R. (1991) Introduction to the Theory of Neural Computation. Redwood City: Addison-Wesley.

Kearns, M. J. and Schapire, R. E. (1990) Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 1990 IEEE Symposium on Foundations of Computer Science. IEEE Press.

Kearns, M. J. and Vazirani, U. (1995) Introduction to Computational Learning Theory. MIT Press.

Kearns, M. J., Schapire, R. E. and Sellie, L. M. (1992) Toward efficient agnostic learning. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pp. 341–52. New York, NY: ACM Press.

Maass, W. (1993) Agnostic PAC-learning of functions on analog neural nets (extended abstract). In Advances in Neural Information Processing Systems, 6. Morgan Kaufmann.

Natarajan, B. K. (1991) Machine Learning: A Theoretical Approach. San Mateo, California: Morgan Kaufmann.

Natarajan, B. K. (1993) Occam's razor for functions. In Proceedings of the Sixth ACM Workshop on Computational Learning Theory, July 1993. ACM Press.

Pollard, D. (1984) Convergence of Stochastic Processes. Springer-Verlag.

Sauer, N. (1972) On the density of families of sets. Journal of Combinatorial Theory (A), 13, 145–7.

Valiant, L. G. (1984) A theory of the learnable. Communications of the ACM, 27(11), 1134–42.

Vapnik, V. N. (1982) Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.

Vapnik, V. N. (1995) The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Vapnik, V. N. and Chervonenkis, A. Y. (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264–80.
