{"title": "Polynomial Uniform Convergence of Relative Frequencies to Probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 904, "page_last": 911, "abstract": null, "full_text": "Polynomial Uniform Convergence of Relative Frequencies to Probabilities\n\nAlberto Bertoni, Paola Campadelli*, Anna Morpurgo, Sandra Panizza\n\nDipartimento di Scienze dell'Informazione\nUniversita degli Studi di Milano\nvia Comelico, 39 - 20135 Milano - Italy\n\n(* Also at CNR, Istituto di Fisiologia dei Centri Nervosi, via Mario Bianco 9, 20131 Milano, Italy.)\n\nAbstract\n\nWe define the concept of polynomial uniform convergence of relative frequencies to probabilities in the distribution-dependent context. Let X_n = {0,1}^n, let P_n be a probability distribution on X_n and let F_n ⊆ 2^(X_n) be a family of events. The family {(X_n, P_n, F_n)}_(n≥1) has the property of polynomial uniform convergence if the probability that the maximum difference (over F_n) between the relative frequency and the probability of an event exceeds a given positive ε is at most δ (0 < δ < 1) when the sample on which the frequency is evaluated has size polynomial in n, 1/ε, 1/δ. Given a t-sample (x_1, ..., x_t), let C_n^(t)(x_1, ..., x_t) be the Vapnik-Chervonenkis dimension of the family {{x_1, ..., x_t} ∩ f | f ∈ F_n} and let M(n,t) be the expectation E(C_n^(t)/t). We show that {(X_n, P_n, F_n)}_(n≥1) has the property of polynomial uniform convergence iff there exists β > 0 such that M(n,t) = O(n/t^β). Applications to distribution-dependent PAC learning are discussed.\n\n1 INTRODUCTION\n\nThe probably approximately correct (PAC) learning model proposed by Valiant [Valiant, 1984] provides a complexity-theoretical basis for learning from examples produced by an arbitrary distribution. As shown in [Blumer et al., 1989], a central notion for distribution-free learnability is the Vapnik-Chervonenkis dimension, which allows one to estimate the sample size adequate to learn at a given level of approximation and confidence. This combinatorial notion was defined in [Vapnik & Chervonenkis, 1971] to study the problem of uniform convergence of relative frequencies of events to their corresponding probabilities in a distribution-free framework.\n\nIn this work we define the concept of polynomial uniform convergence of relative frequencies of events to probabilities in the distribution-dependent setting. More precisely, consider, for any n, a probability distribution on {0,1}^n and a family of events F_n ⊆ 2^({0,1}^n); our requirement is that the probability that the maximum difference (over F_n) between the relative frequency and the probability of an event exceed a given arbitrarily small positive constant ε be at most δ (0 < δ < 1) when the sample on which we evaluate the relative frequencies has size polynomial in n, 1/ε, 1/δ.\n\nThe main result we present here is a necessary and sufficient condition for polynomial uniform convergence in terms of \"average information per example\".\n\nIn section 2 we give preliminary notations and results; in section 3 we introduce the concept of polynomial uniform convergence in the distribution-dependent context and state our main result, which we prove in section 4. Some applications to distribution-dependent PAC learning are discussed in section 5.\n\n2 PRELIMINARY DEFINITIONS AND RESULTS\n\nLet X be a set of elementary events on which a probability measure P is defined, and let F be a collection of boolean functions on X, i.e. functions f : X → {0,1}. For f ∈ F the set f^(-1)(1) is called an event, and P_f denotes its probability. A t-sample (or sample of size t) on X is a sequence x = (x_1, ..., x_t), where x_k ∈ X (1 ≤ k ≤ t).
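As a concrete illustration of these definitions (a sketch of ours, not part of the paper), one can take X = {0,1}^n with a product measure and compute P_f by direct enumeration; the bit probability `p`, the function names, and the sampler interface are all illustrative assumptions.

```python
import itertools
import random

def event_probability(f, n, p):
    """P_f: probability of the event f^{-1}(1) under the product distribution
    on {0,1}^n in which each bit is 1 independently with probability p
    (an illustrative choice of P; any distribution on X would do)."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=n):
        px = 1.0
        for b in x:
            px *= p if b == 1 else 1.0 - p
        if f(x) == 1:
            total += px
    return total

def draw_t_sample(n, p, t, rng):
    """A t-sample on X: t independent draws from the same product distribution."""
    return [tuple(int(rng.random() < p) for _ in range(n)) for _ in range(t)]
```

For instance, for f(x) = x_1 (the indicator of the first bit) the enumeration returns P_f = p, since the remaining bits sum out to 1.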
\nLet X^(t) denote the space of t-samples and P^(t) the probability distribution induced by P on X^(t), such that P^(t)(x_1, ..., x_t) = P(x_1)P(x_2)···P(x_t).\n\nGiven a t-sample x and a function f ∈ F, let ν_f^(t)(x) be the relative frequency of f in the t-sample x, i.e.\n\nν_f^(t)(x) = (Σ_(i=1)^t f(x_i)) / t.\n\nConsider now the random variable Π_F^(t) : X^(t) → [0,1], defined over (X^(t), P^(t)), where\n\nΠ_F^(t)(x_1, ..., x_t) = sup_(f∈F) |ν_f^(t)(x_1, ..., x_t) - P_f|.\n\nThe relative frequencies of the events are said to converge to the probabilities uniformly over F if, for every ε > 0, lim_(t→∞) P^(t){x | Π_F^(t)(x) > ε} = 0.\n\nIn order to study the problem of uniform convergence of the relative frequencies to the probabilities, the notion of index Δ_F(x) of a family F with respect to a t-sample x has been introduced [Vapnik & Chervonenkis, 1971]. For a fixed t-sample x = (x_1, ..., x_t), Δ_F(x_1, ..., x_t) is the number of distinct sets of the form {x_1, ..., x_t} ∩ f^(-1)(1) with f ∈ F.\n\nObviously Δ_F(x_1, ..., x_t) ≤ 2^t; a set {x_1, ..., x_t} is said to be shattered by F iff Δ_F(x_1, ..., x_t) = 2^t; the maximum t such that there is a set {x_1, ..., x_t} shattered by F is called the Vapnik-Chervonenkis dimension d_F of F. The following result holds [Vapnik & Chervonenkis, 1971].\n\nTheorem 2.1 For all probability distributions on X, the relative frequencies of the events converge (in probability) to their corresponding probabilities uniformly over F iff d_F < ∞.\n\nWe recall that the Vapnik-Chervonenkis dimension is a very useful notion in the distribution-independent PAC learning model [Blumer et al., 1989]. In the distribution-dependent framework, where the probability measure P is fixed and known, let us consider the expectation E[log_2 Δ_F(x)], called the entropy H_F(t) of the family F on samples of size t; obviously H_F(t) depends on the probability distribution P.
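To make the index Δ_F and the entropy H_F(t) concrete, here is a small brute-force sketch (ours, not the paper's); the family, the sampler interface, and all function names are illustrative assumptions.

```python
import math
import random

def index_delta(F, sample):
    """Delta_F(x_1,...,x_t): the number of distinct labelings (f(x_1),...,f(x_t)),
    equivalently distinct intersections {x_1,...,x_t} ∩ f^{-1}(1), induced by F."""
    return len({tuple(f(x) for x in sample) for f in F})

def is_shattered(F, sample):
    """The sample is shattered iff Delta_F = 2^t, i.e. F induces every labeling."""
    return index_delta(F, sample) == 2 ** len(sample)

def estimate_entropy(F, sampler, t, trials=200, seed=0):
    """Monte Carlo estimate of H_F(t) = E[log2 Delta_F(x)], drawing t-samples
    point by point with `sampler`, a function rng -> point of X (an assumed
    interface standing in for the fixed, known distribution P)."""
    rng = random.Random(seed)
    return sum(math.log2(index_delta(F, [sampler(rng) for _ in range(t)]))
               for _ in range(trials)) / trials
```

For the family of the two constant functions, every sample gets exactly two labelings, so Δ_F ≡ 2 and the estimate returns H_F(t) = 1 for every t.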
The relevance of this notion is shown by the following result [Vapnik & Chervonenkis, 1971].\n\nTheorem 2.2 A necessary and sufficient condition for the relative frequencies of the events in F to converge uniformly over F (in probability) to their corresponding probabilities is that\n\nlim_(t→∞) H_F(t)/t = 0.\n\n3 POLYNOMIAL UNIFORM CONVERGENCE\n\nConsider the family {(X_n, P_n, F_n)}_(n≥1), where X_n = {0,1}^n, P_n is a probability distribution on X_n and F_n is a family of boolean functions on X_n.\n\nSince X_n is finite, the frequencies trivially converge uniformly to the probabilities; therefore we are interested in studying the problem of convergence with constraints on the sample size. To be more precise, we introduce the following definition.\n\nDefinition 3.1 Given the family {(X_n, P_n, F_n)}_(n≥1), the relative frequencies of the events in F_n converge polynomially to their corresponding probabilities uniformly over F_n iff there exists a polynomial p(n, 1/ε, 1/δ) such that\n\n∀ε, δ > 0 ∀n (t ≥ p(n, 1/ε, 1/δ) ⇒ P_n^(t){x | Π_(F_n)^(t)(x) > ε} < δ).\n\nIn this context ε and δ are the approximation and confidence parameters, respectively.\n\nThe problem we consider now is to characterize the families {(X_n, P_n, F_n)}_(n≥1) such that the relative frequencies of events in F_n converge polynomially to the probabilities. Let us introduce the random variable C_n^(t) : X_n^(t) → N, defined as\n\nC_n^(t)(x_1, ..., x_t) = max{#A | A ⊆ {x_1, ..., x_t} ∧ A is shattered by F_n}.\n\nIn this notation it is understood that C_n^(t) refers to F_n. The random variable C_n^(t) and the index function Δ_(F_n) are related to one another; in fact, the following result can be easily proved.\n\nLemma 3.1 C_n^(t)(x) ≤ log_2 Δ_(F_n)(x) ≤ C_n^(t)(x) log_2 t.\n\nLet M(n,t) = E(C_n^(t)/t) be the expectation of the random variable C_n^(t)/t.
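The random variable C_n^(t) and the parameter M(n,t) can likewise be estimated numerically. The exhaustive search below is our own illustrative sketch (exponential in the sample size, so only for tiny examples), and the sampler interface is an assumption, not the paper's method.

```python
import random
from itertools import combinations

def c_t(F, sample):
    """C^(t)(sample): size of the largest subset of the sample shattered by F,
    found by exhaustive search over all subsets."""
    best = 0
    for k in range(1, len(sample) + 1):
        for S in combinations(sample, k):
            # S is shattered iff F induces all 2^k labelings on it.
            if len({tuple(f(x) for x in S) for f in F}) == 2 ** k:
                best = k
    return best

def estimate_M(F, sampler, t, trials=100, seed=0):
    """Monte Carlo estimate of M(n, t) = E(C^(t)/t), drawing each t-sample
    point by point with `sampler`, a function rng -> point of X_n."""
    rng = random.Random(seed)
    return sum(c_t(F, [sampler(rng) for _ in range(t)]) / t
               for _ in range(trials)) / trials
```

For the two coordinate projections on {0,1}^2, a pair of points can receive at most two of its four labelings, so C^(t) ≤ 1 and the estimate never exceeds 1/t.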
From Lemma 3.1 it readily follows that\n\nM(n,t) ≤ H_(F_n)(t)/t ≤ M(n,t) log_2 t;\n\ntherefore M(n,t) is very close to H_(F_n)(t)/t, which can be interpreted as \"average information per example\" for samples of size t.\n\nOur main result shows that M(n,t) is a useful measure to verify whether {(X_n, P_n, F_n)}_(n≥1) satisfies the property of polynomial convergence, as stated by the following theorem.\n\nTheorem 3.1 Given {(X_n, P_n, F_n)}_(n≥1), the following conditions are equivalent:\n\nC1. The relative frequencies of events in F_n converge polynomially to their corresponding probabilities.\n\nC2. There exists β > 0 such that M(n,t) = O(n/t^β).\n\nC3. There exists a polynomial ψ(n, 1/ε) such that\n\n∀ε ∀n (t ≥ ψ(n, 1/ε) ⇒ M(n,t) < ε).\n\nProof.\n\n• C2 ⇒ C3 is readily verified. In fact, condition C2 says there exist α, β > 0 such that M(n,t) ≤ αn/t^β; now, observing that t > (αn/ε)^(1/β) implies αn/t^β < ε, condition C3 immediately follows.\n\n• C3 ⇒ C2. As stated by condition C3, there exist a, b, c > 0 such that if t ≥ an^b/ε^c then M(n,t) < ε. Solving the first inequality with respect to ε gives, in the worst case, ε = (an^b/t)^(1/c), and substituting for ε in the second inequality yields M(n,t) ≤ (an^b/t)^(1/c) = a^(1/c) n^(b/c) / t^(1/c). If b/c ≤ 1 we immediately obtain M(n,t) ≤ a^(1/c) n^(b/c) / t^(1/c) ≤ a^(1/c) n / t^(1/c). Otherwise, if b/c > 1, since M(n,t) ≤ 1, we have M(n,t) ≤ min{1, a^(1/c) n^(b/c) / t^(1/c)} ≤ min{1, (a^(1/c) n^(b/c) / t^(1/c))^(c/b)} ≤ a^(1/b) n / t^(1/b). □\n\nThe proof of the equivalence between propositions C1 and C3 will be given in the next section.\n\n4 PROOF OF THE MAIN THEOREM\n\nFirst of all, we prove that condition C3 implies condition C1. The proof is based on the following lemma, which is obtained by minor modifications of [Vapnik & Chervonenkis, 1971] (Lemma 2, Theorem 4, and Lemma 4).
\n\nLemma 4.1 Given the family {(X_n, P_n, F_n)}_(n≥1), if lim_(t→∞) H_(F_n)(t)/t = 0 then\n\n∀ε ∀δ ∀n (t > 132 t_0/(ε²δ) ⇒ P_n^(t){x | Π_(F_n)^(t)(x) > ε} < δ),\n\nwhere t_0 is such that H_(F_n)(t_0)/t_0 ≤ ε²/64.\n\nAs a consequence, we can prove the following.\n\nTheorem 4.1 Given {(X_n, P_n, F_n)}_(n≥1), if there exists a polynomial ψ(n, 1/ε) such that\n\n∀ε ∀n (t ≥ ψ(n, 1/ε) ⇒ H_(F_n)(t)/t < ε),\n\nthen the relative frequencies of events in F_n converge polynomially to their probabilities.\n\nProof (outline). It is sufficient to observe that if we choose t_0 = ψ(n, 64/ε²), by hypothesis it holds that H_(F_n)(t_0)/t_0 < ε²/64; therefore, from Lemma 4.1, if\n\nt > 132 t_0/(ε²δ) = (132/(ε²δ)) ψ(n, 64/ε²),\n\nthen P_n^(t){x | Π_(F_n)^(t)(x) > ε} < δ. □\n\nAn immediate consequence of Theorem 4.1 and of the relation M(n,t) ≤ H_(F_n)(t)/t ≤ M(n,t) log_2 t is that condition C3 implies condition C1.\n\nWe now prove that condition C1 implies condition C3. For the sake of simplicity it is convenient to introduce the following notations:\n\na_n^(t) = C_n^(t)/t,  Pa(n, ε, t) = P_n^(t){x | a_n^(t)(x) < ε}.\n\nThe following lemma, which relates the problem of polynomial uniform convergence of a family of events to the parameter Pa(n, ε, t), will only be stated, since it can be proved by minor modifications of Theorem 4 in [Vapnik & Chervonenkis, 1971].\n\nLemma 4.2 If t ≥ 16/ε² then P_n^(t){x | Π_(F_n)^(t)(x) > ε} ≥ (1/4)(1 - Pa(n, 8ε, 2t)).\n\nA relevant property of Pa(n, ε, t) is given by the following lemma.\n\nLemma 4.3 For every a ≥ 1, Pa(n, ε/a, at) ≤ Pa(n, ε, t)^a.\n\nProof. Let (x_1, ..., x_a) be an at-sample obtained by the concatenation of a elements x_1, ..., x_a ∈ X^(t). It is easy to verify that C_n^(at)(x_1, ..., x_a) ≥ max_(i=1,...,a) C_n^(t)(x_i). Therefore\n\nP_n^(at){C_n^(at)(x_1, ..., x_a) ≤ k} ≤ P_n^(at){C_n^(t)(x_1) ≤ k ∧ ... ∧ C_n^(t)(x_a) ≤ k}.\n\nBy the independence of the events C_n^(t)(x_i) ≤ k we obtain\n\nP_n^(at){C_n^(at)(x_1, ..., x_a) ≤ k} ≤ Π_(i=1)^a P_n^(t){C_n^(t)(x_i) ≤ k}.\n\nRecalling that a_n^(t) = C_n^(t)/t and substituting k = εt, the thesis follows. □\n\nA relation between Pa(n, ε, t) and the parameter M(n,t), which we have introduced to characterize the polynomial uniform convergence of {(X_n, P_n, F_n)}_(n≥1), is shown in the following lemma.\n\nLemma 4.4 For every ε (0 < ε < 1/4), if M(n,t) > 2√ε then Pa(n, ε, t) < 1/2.\n\nProof. For the sake of simplicity, let m = M(n,t). If m > σ > 0, we have\n\nσ < m = ∫_0^1 a dPa = ∫_0^(σ/2) a dPa + ∫_(σ/2)^1 a dPa ≤ (σ/2) Pa(n, σ/2, t) + 1 - Pa(n, σ/2, t).\n\nSince 0 < σ < 1, we obtain\n\nPa(n, σ/2, t) ≤ (1 - σ)/(1 - σ/2) ≤ 1 - σ/2.\n\nBy applying Lemma 4.3 it is proved that, for every a ≥ 1,\n\nPa(n, σ/(2a), at) ≤ (1 - σ/2)^a.\n\nFor a = 2/σ we obtain\n\nPa(n, σ²/4, 2t/σ) ≤ (1 - σ/2)^(2/σ) < 1/2.\n\nIt is easy to verify that C_n^(at)(x_1, ..., x_a) ≤ Σ_(i=1)^a C_n^(t)(x_i) for every a ≥ 1; this implies M(n, at) ≤ M(n, t) for a ≥ 1, hence M(n, √ε t) ≥ M(n, t) > 2√ε. Applying the inequality above with sample size √ε t and σ = 2√ε then yields Pa(n, (2√ε)²/4, 2(√ε t)/(2√ε)) = Pa(n, ε, t) < 1/2, which is the thesis. □\n\nTheorem 4.2 If for the family {(X_n, P_n, F_n)}_(n≥1) the relative frequencies of events in F_n converge polynomially to their probabilities, then there exists a polynomial ψ(n, 1/ε) such that\n\n∀ε ∀n (t ≥ ψ(n, 1/ε) ⇒ M(n,t) ≤ ε).\n\nProof. By contradiction. Let us suppose that {(X_n, P_n, F_n)}_(n≥1) polynomially converges and that for all polynomial functions ψ(n, 1/ε) there exist ε, n, t such that t ≥ ψ(n, 1/ε) and M(n,t) > ε.\n\nSince M(n,t) is a monotone, non-increasing function with respect to t, it follows that for every ψ there exist ε, n such that M(n, ψ(n, 1/ε)) > ε. Let p be the polynomial given by the hypothesis of polynomial convergence, so that t ≥ p(n, 32/ε², 8) implies P_n^(t){x | Π_(F_n)^(t)(x) > ε²/32} < 1/8, and fix a polynomial ψ(n, 1/ε) ≥ 2 max{16/(ε²/32)², p(n, 32/ε², 8)}; let ε, n be such that M(n, ψ(n, 1/ε)) > ε. By Lemma 4.4, applied with approximation parameter ε²/4 (so that 2√(ε²/4) = ε),\n\nPa(n, ε²/4, ψ(n, 1/ε)) < 1/2.  (1)\n\nOn the other hand, from Lemma 4.2, for t = ψ(n, 1/ε)/2 ≥ 16/(ε²/32)²,\n\nP_n^(t){x | Π_(F_n)^(t)(x) > ε²/32} ≥ (1/4)(1 - Pa(n, ε²/4, 2t)).  (2)\n\nSince 2t = ψ(n, 1/ε), from assertions (1) and (2) we obtain P_n^(t){x | Π_(F_n)^(t)(x) > ε²/32} > (1/4)(1/2) = 1/8, while t ≥ p(n, 32/ε², 8) gives P_n^(t){x | Π_(F_n)^(t)(x) > ε²/32} < 1/8: the contradiction 1/8 < 1/8 can thus be derived. □\n\nAn immediate consequence of Theorem 4.2 is that, in Theorem 3.1, condition C1 implies condition C3. Theorem 3.1 is thus proved.\n\n5 DISTRIBUTION-DEPENDENT PAC LEARNING\n\nIn this section we briefly recall the notion of learnability in the distribution-dependent PAC model and discuss some applications of the previous results. Given {(X_n, P_n, F_n)}_(n≥1), a labelled t-sample S_f for f ∈ F_n is a sequence ((x_1, f(x_1)), ..., (x_t, f(x_t))), where (x_1, ..., x_t) is a t-sample on X_n. We say that f_1, f_2 ∈ F_n are ε-close with respect to P_n iff P_n{x | f_1(x) ≠ f_2(x)} < ε.\n\nA learning algorithm A for {(X_n, P_n, F_n)}_(n≥1) is an algorithm that, given in input ε, δ > 0 and a labelled t-sample S_f with f ∈ F_n, outputs the representation of a function g which, with probability 1 - δ, is ε-close to f. The family {(X_n, P_n, F_n)}_(n≥1) is said to be polynomially learnable iff there exists a learning algorithm A working in time bounded by a polynomial p(n, 1/ε, 1/δ).\n\nBounds on the sample size necessary to learn at approximation ε and confidence 1 - δ have been given in terms of ε-covers [Benedek & Itai, 1988]; classes which are not learnable in the distribution-free model, but are learnable for some specific distribution, have been exhibited (e.g. 1-term DNF [Kucera et al., 1988]).\n\nThe following notion is expressed in terms of relative frequencies.
\n\nDefinition 5.1 A quasi-consistent algorithm for the family {(X_n, P_n, F_n)}_(n≥1) is an algorithm that, given in input δ, ε > 0 and a labelled t-sample S_f with f ∈ F_n, outputs in time bounded by a polynomial p(n, 1/ε, 1/δ) the representation of a function g ∈ F_n such that\n\nP_n^(t){x | ν_(f≠g)^(t)(x) > ε} < δ,\n\nwhere ν_(f≠g)^(t)(x) denotes the relative frequency in the sample x of the event {x | f(x) ≠ g(x)}.\n\nBy Theorem 3.1 the following result can easily be derived.\n\nTheorem 5.1 Given {(X_n, P_n, F_n)}_(n≥1), if there exists β > 0 such that M(n,t) = O(n/t^β) and there exists a quasi-consistent algorithm for {(X_n, P_n, F_n)}_(n≥1), then {(X_n, P_n, F_n)}_(n≥1) is polynomially learnable.\n\n6 CONCLUSIONS AND OPEN PROBLEMS\n\nWe have characterized the property of polynomial uniform convergence of {(X_n, P_n, F_n)}_(n≥1) by means of the parameter M(n,t). In particular we proved that {(X_n, P_n, F_n)}_(n≥1) has the property of polynomial convergence iff there exists β > 0 such that M(n,t) = O(n/t^β), but no attempt has been made to obtain better upper and lower bounds on the sample size in terms of M(n,t).\n\nWith respect to the relation between polynomial uniform convergence and PAC learning in the distribution-dependent context, we have shown that if a family {(X_n, P_n, F_n)}_(n≥1) satisfies the property of polynomial uniform convergence then it can be PAC learned with a sample of size bounded by a polynomial function in n, 1/ε, 1/δ. It is an open problem whether the converse implication also holds.\n\nAcknowledgements\n\nThis research was supported by CNR, project Sistemi Informatici e Calcolo Parallelo.\n\nReferences\n\nG. Benedek, A. Itai. (1988) \"Learnability by Fixed Distributions\". Proc. COLT'88, 80-90.\n\nA. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth. (1989) \"Learnability and the Vapnik-Chervonenkis Dimension\". J. ACM 36, 929-965.\n\nL. Kucera, A. Marchetti-Spaccamela, M. Protasi. (1988) \"On the Learnability of DNF Formulae\". Proc. XV Coll.
on Automata, Languages, and Programming, L.N.C.S. 317, Springer-Verlag.\n\nL.G. Valiant. (1984) \"A Theory of the Learnable\". Communications of the ACM 27 (11), 1134-1142.\n\nV.N. Vapnik, A.Ya. Chervonenkis. (1971) \"On the uniform convergence of relative frequencies of events to their probabilities\". Theory of Prob. and its Appl. 16 (2), 265-280.", "award": [], "sourceid": 583, "authors": [{"given_name": "Alberto", "family_name": "Bertoni", "institution": null}, {"given_name": "Paola", "family_name": "Campadelli", "institution": null}, {"given_name": "Anna", "family_name": "Morpurgo", "institution": null}, {"given_name": "Sandra", "family_name": "Panizza", "institution": null}]}