{"title": "Generalisation in Feedforward Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 215, "page_last": 222, "abstract": null, "full_text": "Generalisation in Feedforward Networks \n\nAdam Kowalczyk and Herman Ferra \nTelecom Australia, Research Laboratories \n\n770 Blackburn Road, Clayton, Vic. 3168, Australia \n\n(a.kowalczyk@trl.oz.au, h.ferra@trl.oz.au) \n\nAbstract \n\nWe discuss a model of consistent learning with an additional re(cid:173)\nstriction on the probability distribution of training samples, the \ntarget concept and hypothesis class. We show that the model pro(cid:173)\nvides a significant improvement on the upper bounds of sample \ncomplexity, i.e. the minimal number of random training samples \nallowing a selection of the hypothesis with a predefined accuracy \nand confidence. Further, we show that the model has the poten(cid:173)\ntial for providing a finite sample complexity even in the case of \ninfinite VC-dimension as well as for a sample complexity below \nVC-dimension. This is achieved by linking sample complexity to \nan \"average\" number of implement able dichotomies of a training \nsample rather than the maximal size of a shattered sample, i.e. \nVC-dimension. \n\n1 \n\nIntroduction \n\nA number offundamental results in computational learning theory [1, 2, 11] links the \ngeneralisation error achievable by a set of hypotheses with its Vapnik-Chervonenkis \ndimension (VC-dimension, for short) which is a sort of capacity measure. They \nprovide in particular some theoretical bounds on the sample complexity, i.e. a \nminimal number of training samples assuring the desired accuracy with the desired \nconfidence. However there are a few obvious deficiencies in these results: (i) the \nsample complexity bounds are unrealistically high (c.f. Section 4.), and (ii) for \nsome networks they do not hold at all since VC-dimension is infinite, e.g. some \nradial basis networks [7]. 
One may expect that there are at least three main reasons for this state of affairs: (a) the VC-dimension is too crude a measure of capacity; (b) since the bounds are universal, they may be forced too high by some malicious distributions; (c) the particular estimates themselves are too crude, and so might be improved with time. In this paper we attack the problem along the lines of (a) and (b), since this is most promising. Indeed, even a rough analysis of some proofs of lower bounds (e.g. [1]) shows that some of these estimates were determined by clever constructions of discrete, malicious distributions on sets of "shattered samples" (of the size of the VC-dimension). Thus such bounds on the sample complexity are not necessarily tight in more realistic cases, e.g. continuous distributions and "non-malicious" target concepts, a point eagerly made by critics of the formalism. The problem is to find restrictions on target concepts and probability distributions which produce a significant improvement. The current paper discusses such a proposition, which significantly improves the upper bounds on sample complexity.

2 A Restricted Model of Consistent Learning

First we introduce a few necessary concepts and some basic notation. We assume we are given a space of samples $X$ with a probability measure $\mu$, a set $H$ of binary functions $X \to \{0,1\}$ called the hypothesis space, and a target concept $t \in H$. For an $n$-sample $\bar{x} = (x_1, \ldots, x_n) \in X^n$ and $h \in H$, the vector $(h(x_1), \ldots, h(x_n)) \in \{0,1\}^n$ will be denoted by $h(\bar{x})$. We define two projections $\pi_{\bar{x}}$ and $\pi_{t,\bar{x}}$ of $H$ onto $\{0,1\}^n$ as follows: $\pi_{\bar{x}}(h) = h(\bar{x}) = (h(x_1), \ldots, h(x_n))$ and $\pi_{t,\bar{x}}(h) = \pi_{\bar{x}}(|t - h|) = |t - h|(\bar{x})$ for every $h \in H$. Below we shall use the notation $|S|$ for the cardinality of a set $S$.
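To make the projection $\pi_{\bar{x}}$ tangible, here is a minimal sketch (our own toy example; the domain and threshold class below are assumptions chosen for concreteness, not objects from the paper) that enumerates $\pi_{\bar{x}}(H)$ for a 3-sample and computes the density $|\pi_{\bar{x}}(H)|/2^n$:

```python
# Illustrative assumptions: a small sample space and the class of
# thresholds h_t(x) = 1 if x >= t, else 0.
X = list(range(10))
H = [lambda x, t=t: int(x >= t) for t in range(11)]

def projection(H, xs):
    """pi_x(H): the set of distinct dichotomies (h(x_1), ..., h(x_n))."""
    return {tuple(h(x) for x in xs) for h in H}

xs = (2, 5, 7)                       # an n-sample with n = 3
proj = projection(H, xs)
density = len(proj) / 2 ** len(xs)   # Pr_H(x) = |pi_x(H)| / 2^n
print(sorted(proj), density)
```

On a sorted 3-sample the thresholds realise only the 4 monotone dichotomies (0,0,0), (0,0,1), (0,1,1), (1,1,1), giving density 4/8 = 0.5 rather than the full 2^3 dichotomies a shattered sample would require.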
The average density of the set of projections $\pi_{\bar{x}}(H)$ (or $\pi_{t,\bar{x}}(H)$) in $\{0,1\}^n$ is defined as
$$\Pr_H(\bar{x}) \stackrel{def}{=} |\pi_{\bar{x}}(H)|/2^n = |\pi_{t,\bar{x}}(H)|/2^n$$
(equivalently, this is the probability of a random vector in $\{0,1\}^n$ belonging to the set $\pi_{\bar{x}}(H)$). Now we define two associated quantities:
$$\Pr_{H,\mu}(n) \stackrel{def}{=} \int \Pr_H(\bar{x})\,\mu^n(d\bar{x}) = 2^{-n}\int |\pi_{\bar{x}}(H)|\,\mu^n(d\bar{x}), \qquad \Pr_{H,max}(n) \stackrel{def}{=} \max_{\bar{x}\in X^n} \Pr_H(\bar{x}). \qquad (1)$$

We recall that $d_H \stackrel{def}{=} \max\{n : \exists_{\bar{x}\in X^n}\,|\pi_{\bar{x}}(H)| = 2^n\}$ is called the Vapnik-Chervonenkis dimension (VC-dimension) of $H$ [1, 11]. If $d_H < \infty$ then Sauer's lemma implies the estimates (cf. [1, 2, 10])
$$\Pr_{H,\mu}(n) \le \Pr_{H,max}(n) \le 2^{-n}\Phi(d_H, n) \le 2^{-n}(en/d_H)^{d_H}, \qquad (2)$$
where $\Phi(d,n) \stackrel{def}{=} \sum_{i=0}^{d} \binom{n}{i}$ (we assume $\binom{n}{i} \stackrel{def}{=} 0$ if $i > n$).

Now we are ready to formulate the main assumption of the model. We say that the space of hypotheses $H$ is $(\mu^n, C)$-uniform around $t \in 2^X$ if for every set $S \subset \{0,1\}^n$
$$\int |\pi_{t,\bar{x}}(H) \cap S|\,\mu^n(d\bar{x}) \le C\,|S|\,\Pr_{H,\mu}(n). \qquad (3)$$

The meaning of this condition is obvious: we postulate that on average the number of different projections $\pi_{t,\bar{x}}(h)$ of hypotheses $h \in H$ falling into $S$ has a bound proportional to the probability $\Pr_{H,\mu}(n)$ of a random vector in $\{0,1\}^n$ belonging to the set $\pi_{t,\bar{x}}(H)$. Another heuristic interpretation of (3) is as follows. Imagine that the elements of $\pi_{t,\bar{x}}(H)$ are almost uniformly distributed in $\{0,1\}^n$, i.e. with average density $\rho_{\bar{x}} \le C|\pi_{t,\bar{x}}(H)|/2^n$. Then the "mass" of the volume $|S|$ is $|\pi_{t,\bar{x}}(H) \cap S| \le \rho_{\bar{x}}|S|$, and so its average $\int |\pi_{t,\bar{x}}(H) \cap S|\,\mu^n(d\bar{x})$ has the estimate $\le |S|\int \rho_{\bar{x}}\,\mu^n(d\bar{x}) \le C\,|S|\,\Pr_{H,\mu}(n)$.

Of special interest is the particular case of consistent learning [1], i.e. when the target concept and the hypothesis fully agree on the training sample. In this case, for any $\epsilon > 0$ we introduce the notation
$$Q_\epsilon(m) \stackrel{def}{=} \{\bar{x} \in X^m : \exists_{h\in H}\ er_{t,\bar{x}}(h) = 0\ \&\ er_{t,\mu}(h) \ge \epsilon\},$$
where $er_{t,\bar{x}}(h) = \sum_{i=1}^{m}|t-h|(x_i)/m$ and $er_{t,\mu}(h) = \int |t-h|(x)\,\mu(dx)$ denote error rates on the training sample $\bar{x} = (x_1, \ldots, x_m)$ and on $X$, respectively. Thus $Q_\epsilon(m)$ is the set of all $m$-samples for which there exists a hypothesis in $H$ with no error on the sample and error at least $\epsilon$ on $X$.

Theorem 1. If the hypothesis space $H$ is $(\mu^{2m}, C)$-uniform around $t \in H$ then for any $\epsilon > 8/m$
$$\mu^m(Q_\epsilon(m)) \le 2C\,\Pr_{H,\mu}(2m)\sum_{j=\lceil m\epsilon/2\rceil}^{m} \binom{2m}{j}\,2^{-j} \qquad (4)$$
$$\le 2C\,(3/2)^{2m}\,\Pr_{H,\mu}(2m). \qquad (5)$$

Proof of the theorem is given in the Appendix.

Given $\epsilon, \delta > 0$, the integer $m_L(\delta, \epsilon) \stackrel{def}{=} \min\{m > 0 : \mu^m(Q_\epsilon(m)) \le \delta\}$ will be called the sample complexity, following the terminology of computational learning theory (cf. [1]). Note that in our case the sample complexity also depends (implicitly) on the target concept $t$, the hypothesis space $H$ and the probability measure $\mu$.

Corollary 2. If the hypothesis space $H$ is $(\mu^n, C)$-uniform around $t \in H$ for every $n > 0$, then
$$m_L(\delta, \epsilon) \le \max\{8/\epsilon,\ \min\{m : 2C\,\Pr_{H,\mu}(2m)(3/2)^{2m} < \delta\}\} \qquad (6)$$
$$\le \max\{8/\epsilon,\ 6.9\,d_H + 2.4\log_2(C/\delta)\}. \qquad (7)$$

For moderate $8/\epsilon$ and $d_H \gg \log_2(C/\delta)$, the estimate (7) of Corollary 2 reduces to the estimate $m_L(\delta, \epsilon) \lesssim 6.9\,d_H$, essentially independent of $\delta$ and $\epsilon$. This predicts that under the assumption of the corollary a transition to perfect generalisation occurs for training samples of size $\approx 6.9\,d_H$, which is in agreement with some statistical physics predictions showing such a transition occurring below $\approx 1.5\,d_H$ for some simple neural networks (cf. [5]).

Proof outline. Estimate (6) follows from estimate (5). Estimate (7) can be derived from the second bound in (5) (virtually by repeating the proof of [1, Theorem 8.4.1] with substitution of $4\log_2(4/3)$ for $\epsilon$ and $\delta/C$ for $\delta$). Q.E.D.

[Figure 1]

... $> 0$, and for any $m \ge m_2$, $d_{VC}(\bar{x}) = d_H = m_2$ with probability $> 0$.
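For orientation, the quantity minimised in estimate (6) can be evaluated numerically once $\Pr_{H,\mu}(2m)$ is replaced by its Sauer upper bound $2^{-2m}\Phi(d_H, 2m)$ from (2). The sketch below is ours, not the paper's; the choice $C = 1$ and the ceiling on $8/\epsilon$ are illustrative assumptions:

```python
import math

def phi(d, n):
    """Sauer's quantity Phi(d, n) = sum_{i=0}^{d} C(n, i)."""
    return sum(math.comb(n, i) for i in range(min(d, n) + 1))

def sample_bound(d_H, delta, eps, C=1.0):
    """Smallest m with 2*C*2^{-2m}*Phi(d_H, 2m)*(3/2)^{2m} < delta,
    capped below by 8/eps as in estimate (6).  Note that
    2^{-2m} * (3/2)^{2m} = (3/4)^{2m}."""
    m = 1
    while 2 * C * phi(d_H, 2 * m) * 0.75 ** (2 * m) >= delta:
        m += 1
    return max(math.ceil(8 / eps), m)

m = sample_bound(d_H=50, delta=0.01, eps=0.01)
# For these values the 8/eps term (= 800) dominates; for larger eps the
# search term of order 6.9 * d_H takes over.
```

For small $\epsilon$ the theorem's requirement $\epsilon > 8/m$ forces $m > 8/\epsilon$, which is why the cap can dominate the Sauer-based search term.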
In this regard the abstract perceptron resembles the linear threshold multilayer perceptron (with $m_1$ and $m_2$ corresponding to $n_{hl} + 1$ and $d_H$, respectively). However, the main advantage of this model is that we can derive the following estimate:
$$\Pr_{H,\mu}(m) \le 2^{-m}\sum_{i=0}^{m-m_2} \binom{m-m_2}{i}\,p^{m-m_2-i}(1-p)^i\,\Phi(\min(m-i,\,m_1),\,m). \qquad (11)$$

Using this estimate we find that for sufficiently low $p$ (and sufficiently large $m_2$) the sample complexity upper bound (6) is determined by $m_1$ and can even be lower than $m_2 = d_H$ (cf. Figure 1.b). In particular, the sample complexity determined by Eqns. (6) and (11) can be finite even if $d_H = m_2 = \infty$ (cf. the curve $\epsilon(m, 0)$ for $p = .05$ in Fig. 1.b, which is the same for $m_2 = 1000$ and $m_2 = \infty$).

4 Discussion

The paper strongly depends on the postulate (3) of $(\mu^n, C)$-uniformity. We admit that this is an ad hoc assumption here, as we give neither examples where it is satisfied nor a method to determine the constant $C$. From this point of view our results at the current stage have no predictive power, perhaps only explanatory power. The paper should be viewed as an attempt to explain within the VC-formalism some known generalisation properties of neural networks which are out of the reach of the formalism to date, such as the empirically observed peak generalisation for a backpropagation network trained with samples of size well below the VC-dimension [8], or the phase transitions to perfect generalisation below $1.5 \times$ VC-dimension [5]. We see the formalism in this paper as one of a number of possible approaches in this direction. There are other possibilities here as well (e.g. [5, 12]), and in particular other, weaker versions of $(\mu^n, C)$-uniformity can be used, leading to similar results. For instance, in Theorem 1 and Corollary 2 it was enough to assume $(\mu^n, C)$-uniformity for a special class of sets $S$ ($S = S_j^{0,m}$, cf.
the Appendix); we intend to discuss other options in this regard on another occasion.

Now we relate this research to some previous results (e.g. [2, 4]) which imply the following estimates on sample complexity (cf. [1, Theorems 8.6.1-2]):
$$\max\left(\frac{d_H - 1}{32\epsilon},\ \frac{-\ln\delta}{\epsilon}\right) \le m_L(\delta, \epsilon) \le \left\lceil\frac{4}{\epsilon}\left(d_H\log_2\frac{12}{\epsilon} + \log_2\frac{2}{\delta}\right)\right\rceil, \qquad (12)$$
where the lower bound is proved for all $\epsilon \le 1/8$ and $\delta \le 1/100$; here $m_L(\delta, \epsilon)$ is the "universal" sample complexity, i.e. for all target concepts $t$ and all probability distributions $\mu$. For $\epsilon = \delta = 0.01$ and $d_H \gg 1$ this estimate yields $3\,d_H < m_L(.01, .01) < 4000\,d_H$. These bounds should be compared against the estimates of Corollary 2, of which (7) provides a much tighter upper bound, $m_L(.01, .01) \le 6.9\,d_H$, if the assumption of $(\mu^m, C)$-uniformity of the hypothesis space around the target concept $t$ is satisfied.

5 Conclusions

We have shown that under appropriate restrictions on the probability distribution and target concept, the upper bound on sample complexity (and "perfect generalisation") can be lowered to $\approx 6.9 \times$ VC-dimension, and in some cases even below the VC-dimension (with a strong possibility that the multilayer perceptron could be such a case).

We showed that there are parameters other than the VC-dimension potentially impacting on the generalisation capabilities of neural networks. In particular we showed by example (the abstract perceptron) that a system may have finite sample complexity and infinite VC-dimension at the same time.

The formalism of this paper predicts a transition to perfect generalisation at relatively low training sample sizes, but it is too crude to predict scaling laws for learning curves (cf. [5, 12] and references therein).

Acknowledgement.
The permission of the Managing Director, Research and Information Technology, Telecom Australia, to publish this paper is gratefully acknowledged.

References

[1] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.

[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, Oct. 1989.

[3] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elec. Comp., EC-14:326-334, 1965.

[4] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82:247-261, 1989.

[5] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. Technical report, 1994.

[6] A. Kowalczyk. Separating capacity of analytic neuron. In Proc. ICNN'94, Orlando, 1994.

[7] A. Macintyre and E. Sontag. Finiteness results for sigmoidal "neural" networks. In Proc. of the 25th Annual ACM Symp. on Theory of Computing, pages 325-334, 1993.

[8] G.L. Martin and J.A. Pitman. Recognizing hand-printed letters and digits using backpropagation learning. Neural Computation, 3:258-267, 1991.

[9] A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In Proceedings of the 1993 World Congress on Neural Networks, 1993.

[10] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series A), 13:145-147, 1972.

[11] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

[12] V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851-876, 1994.
6 Appendix: Sketch of the proof of Theorem 1

The proof is a modification of the proof of [1, Theorem 8.3.1]. We divide it into three stages.

Stage 1. Let
$$R_j \stackrel{def}{=} \{(\bar{x},\bar{y}) \in X^m \times X^m \cong X^{2m} : \exists_{h\in H}\ er_{t,\bar{x}}(h) = 0\ \&\ er_{t,\bar{y}}(h) = j/m\} \qquad (13)$$
for $j \in \{0, 1, \ldots, m\}$. Using a Chernoff bound on the "tail" of the binomial distribution it can be shown [1, Lemma 8.3.2] that for $m \ge 8/\epsilon$
$$\mu^m(Q_\epsilon(m)) \le 2\sum_{j \ge \lceil m\epsilon/2\rceil} \mu^{2m}(R_j). \qquad (14)$$

Stage 2. Now we use a combinatorial argument to estimate $\mu^{2m}(R_j)$. We consider the $2^m$-element commutative group $G_m$ of transformations of $X^m \times X^m \cong X^{2m}$ generated by all "co-ordinate swaps" of the form
$$(x_1, \ldots, x_m, y_1, \ldots, y_m) \mapsto (x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m, y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_m)$$
for $1 \le i \le m$. We assume also that $G_m$ transforms $\{0,1\}^m \times \{0,1\}^m \cong \{0,1\}^{2m}$ in a similar fashion. Note that
$$\pi_{t,\sigma(\bar{x},\bar{y})}(H) = \sigma\big(\pi_{t,(\bar{x},\bar{y})}(H)\big). \qquad (15)$$
As the transformations $\sigma \in G_m$ preserve the measure $\mu^{2m}$ on $X^m \times X^m$, we obtain
$$2^m\mu^{2m}(R_j) = |G_m|\,\mu^{2m}(R_j) = \sum_{\sigma\in G_m}\int \mu^{2m}(d\bar{x}\,d\bar{y})\,\chi_{R_j}(\sigma(\bar{x},\bar{y})) = \int \mu^{2m}(d\bar{x}\,d\bar{y})\sum_{\sigma\in G_m}\chi_{R_j}(\sigma(\bar{x},\bar{y})). \qquad (16)$$
Let $S_j^{0,m} \stackrel{def}{=} \{\bar{h} = (\bar{h}_1, \bar{h}_2) \in \{0,1\}^m \times \{0,1\}^m : \bar{h}_1 = 0\ \&\ \|\bar{h}_2\| = j\}$ and $S_j^{2m} \stackrel{def}{=} \{\bar{h} \in \{0,1\}^{2m} : \|\bar{h}\| = j\}$, where $\|\bar{h}\| \stackrel{def}{=} h_1 + \cdots + h_m$ for any $\bar{h} = (h_1, \ldots, h_m) \in \{0,1\}^m$. Then $|S_j^{2m}| = \binom{2m}{j}$, $\sigma(S_j^{0,m}) \subset S_j^{2m}$ for any $\sigma \in G_m$, and
$$R_j = \{(\bar{x},\bar{y}) \in X^m \times X^m : \exists\,\bar{h} \in \pi_{t,(\bar{x},\bar{y})}(H) \cap S_j^{0,m}\}. \qquad (17)$$
Thus from Eqn. (16) we obtain
$$2^m\mu^{2m}(R_j) \le \int \mu^{2m}(d\bar{x}\,d\bar{y})\sum_{\sigma\in G_m}\ \sum_{\bar{h}\in\pi_{t,(\bar{x},\bar{y})}(H)\cap S_j^{2m}}\chi_{S_j^{0,m}}(\sigma\bar{h}) = \int \mu^{2m}(d\bar{x}\,d\bar{y})\sum_{\bar{h}\in\pi_{t,(\bar{x},\bar{y})}(H)\cap S_j^{2m}}|\{\sigma\in G_m : \sigma\bar{h}\in S_j^{0,m}\}| \le \int \mu^{2m}(d\bar{x}\,d\bar{y})\,|\pi_{t,(\bar{x},\bar{y})}(H)\cap S_j^{2m}|\,2^{m-j}.$$
Applying now the condition of $(\mu^{2m}, C)$-uniformity (Eqn. (3)) and Eqn.
(17), and dividing by $2^m$, we get
$$\mu^{2m}(R_j) \le C\,\binom{2m}{j}\,2^{-j}\,\Pr_{H,\mu}(2m).$$

Stage 3. On substitution of the above estimate into (14) we obtain estimate (4). To derive (5), observe that $\sum_{j=\lceil m\epsilon/2\rceil}^{m}\binom{2m}{j}\,2^{-j} \le (1 + 1/2)^{2m}$. Q.E.D.
", "award": [], "sourceid": 913, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}, {"given_name": "Herman", "family_name": "Ferr\u00e1", "institution": null}]}