{"title": "MLP Can Provably Generalize Much Better than VC-bounds Indicate", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "MLP  can provably generalise much better \n\nthan VC-bounds indicate. \n\nA.  Kowalczyk  and  H.  Ferra \nTelstra Research Laboratories \n\n770  Blackburn Road,  Clayton,  Vic.  3168,  Australia \n\n({ a.kowalczyk, h.ferra}@trl.oz.au) \n\nAbstract \n\nResults  of a  study of the  worst  case  learning  curves  for  a  partic(cid:173)\nular class of probability  distribution on input  space  to MLP  with \nhard  threshold  hidden  units are  presented.  It is  shown in  partic(cid:173)\nular,  that  in  the  thermodynamic  limit  for  scaling  by  the  number \nof connections to the first  hidden layer, although the true learning \ncurve behaves as  ~ a-I for  a  ~ 1, its VC-dimension  based bound \nis trivial  (= 1)  and  its VC-entropy bound is trivial for  a  ::;  6.2.  It \nis also shown that bounds following the true learning curve can be \nderived  from  a  formalism  based on the density of error patterns. \n\n1 \n\nIntroduction \n\nThe VC-formalism and its extensions link the generalisation capabilities of a binary \nvalued neural network with its counting function l  ,  e.g.  via upper bounds implied by \nVC-dimension or VC-entropy on  this function  [17,  18].  For linear perceptrons the \ncounting function  is  constant for  almost every selection  of a  fixed  number of input \nsamples [2],  and essentially equal  to its upper  bound determined by VC-dimension \nand  Sauer's  Lemma.  However,  in  the  case  for  multilayer  perceptrons  (MLP)  the \ncounting function  depends  essentially on  the selected input  samples.  For instance, \nit has  been shown recently that for  MLP with sigmoidal units although the largest \nnumber of input samples which can be shattered, Le.  VC-dimension, equals O(w 2 ) \n[6],  there is always a non-zero probability of finding a  (2w + 2)-element input sample \nwhich  cannot be  shattered,  where  w  is  the number of weights  in  the network  [16]. \nIn the case of MLP using  Heaviside rather than sigmoidal activations  (McCulloch(cid:173)\nPitts neurons), a similar claim can be made:  VC-dimension is O(wl1og21lt}  [13,  15], \n\n1 Known  also as the  partition function  in  computational learning theory. \n\n\fMLP Can Provably Generalize Much Better than VC-bounds Indicate \n\n191 \n\nwhere WI  is  the number of weights to the first  hidden layer of 11.1  units,  but there is \na  non-zero probability of finding a sample of size  WI + 2 which  cannot be shattered \n[7,  8].  The  results  on  these  \"hard  to  shatter  samples\"  for  the  two  MLP  types \ndiffer  significantly  in  terms  of techniques  used  for  derivation.  For  the  sigmoidal \ncase  the result is  \"existential\"  (based on  recent advances in  \"model  theory\") while \nin  the  Heaviside  case  the  proofs  are  constructive,  defining  a  class  of probability \ndistributions  from  which  \"hard  to shatter\"  samples  can  be  drawn  randomly;  the \nresults in  this  case are also  more explicit  in  that a  form  for  the  counting function \nmay be given  [7,  8]. \n\nCan  the  existence  of such  hard  to  shatter  samples  be  essential  for  generalisation \ncapabilities of MLP? Can they be an essential factor for improvement of theoretical \nmodels  of generalisation?  
In this paper we show that, at least for the McCulloch-Pitts case with specific (continuous) probability distributions on the input space, the answer is "yes". We estimate "directly" the true learning curve in this case and show that its bounds based on VC-dimension or VC-entropy are loose in the low sample regime (for training samples having fewer than 12 × w₁ examples), even for the linear perceptron. We also show that a modification of the VC-formalism given in [9, 10] provides a significantly better bound. This latter part is a more rigorous and formal extension and re-interpretation of some results in [11, 12]. All the results are presented in the thermodynamic limit, i.e. for MLP with w₁ → ∞ and training sample size increasing proportionally, which simplifies their mathematical form.

2 Overview of the formalism

On a sample space X we consider a class H of binary functions h : X → {0, 1} which we shall call a hypothesis space. Further, we assume that we are given a probability distribution μ on X and a target concept t : X → {0, 1}. The quadruple L = (X, μ, H, t) will be called a learning system.

In the usual way, with each hypothesis h ∈ H we associate the generalisation error and the training error, defined for any training m-sample x̄ = (x₁, ..., x_m) ∈ X^m as

    \epsilon_h \stackrel{\mathrm{def}}{=} E_x\big[|t(x) - h(x)|\big], \qquad \epsilon_{h,\bar{x}} \stackrel{\mathrm{def}}{=} \frac{1}{m} \sum_{i=1}^{m} |t(x_i) - h(x_i)|.

Given a learning threshold 0 ≤ λ ≤ 1, let us introduce the auxiliary random variable

    \epsilon_\lambda^{\max}(\bar{x}) \stackrel{\mathrm{def}}{=} \max\{\epsilon_h \;;\; h \in H \;\&\; \epsilon_{h,\bar{x}} \le \lambda\}

for x̄ ∈ X^m, giving the worst generalisation error over all hypotheses with training error ≤ λ on the m-sample x̄ ∈ X^m.² The basic objects of interest in this paper are the learning curves³, defined as

    \epsilon_\lambda^{wc}(m) \stackrel{\mathrm{def}}{=} E_{\bar{x} \in X^m}\big[\epsilon_\lambda^{\max}(\bar{x})\big].

2.1 Thermodynamic limit

Now we introduce the thermodynamic limit of the learning curve. The underlying idea of such asymptotic analysis is to capture the essential features of learning systems of very large size. Mathematically, it turns out that in the thermodynamic limit the functional forms of learning curves simplify significantly and analytic characterisations of them become possible.

We are given a sequence of learning systems, or shortly, L_N = (X_N, μ_N, H_N, t_N), N = 1, 2, ..., and a scaling N ↦ τ_N ∈ R⁺ with the property τ_N → ∞; the scaling can be thought of as a measure of the size (complexity) of a learning system, e.g. the VC-dimension of H_N. The thermodynamic limit of the scaled learning curves is defined for α > 0 as follows⁴:

    \epsilon_\lambda^{wc,\infty}(\alpha) \stackrel{\mathrm{def}}{=} \limsup_{N \to \infty} \epsilon_{\lambda,N}^{wc}(\lfloor \alpha \tau_N \rfloor).    (1)

Here, and below, the additional subscript N refers to the N-th learning system.

² In this paper max(S), where S ⊂ R, denotes the maximal element of the closure of S, or ∞ if no such element exists. Similarly we understand min(S).

³ Note that our learning curve is determined by the worst generalisation error of acceptable hypotheses and in this respect differs from the "average generalisation error" learning curves considered elsewhere, e.g. [3, 5].

⁴ We recall that ⌊x⌋ denotes the largest integer ≤ x and that limsup_{N→∞} x_N is defined as lim_{N→∞} of the monotonic sequence N ↦ sup{x_n ; n ≥ N}. Note that in contrast to the ordinary limit, limsup always exists.
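To make these definitions concrete, here is a small illustrative sketch (ours, not part of the paper) that Monte Carlo estimates the consistent-learning curve ε₀^{wc}(m) for a toy learning system the paper does not analyse: X = [0, 1] with uniform μ, H the threshold functions h_a(x) = 1[x ≥ a], and target t = h_{0.5}. For this class the set of consistent hypotheses after an m-sample is an explicit parameter interval, so ε₀^{max}(x̄) has a closed form (the max is attained on the closure, as in footnote 2):

    import numpy as np

    # Toy illustration (our construction): worst-case learning curve
    # eps_0^wc(m) for thresholds h_a(x) = 1[x >= a] on [0,1], uniform mu,
    # target h_theta. A hypothesis h_a is consistent with the sample iff
    # lo < a <= hi, where lo is the largest example labelled 0 and hi the
    # smallest example labelled 1; its generalisation error is |a - theta|.
    rng = np.random.default_rng(0)

    def worst_consistent_error(m, theta=0.5, trials=2000):
        errs = np.empty(trials)
        for k in range(trials):
            x = rng.uniform(0.0, 1.0, size=m)
            lo = np.max(x[x < theta], initial=0.0)   # closest 0-labelled point
            hi = np.min(x[x >= theta], initial=1.0)  # closest 1-labelled point
            errs[k] = max(theta - lo, hi - theta)    # sup over consistent a
        return errs.mean()

    for m in (4, 16, 64, 256):
        print(m, round(worst_consistent_error(m), 4))  # decays roughly like 1/m

Replacing H by PC(d) or an MLP class would require a search over hypotheses, but the structure of the estimate is the same.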
2.2 Error pattern density formalism

This subsection briefly presents a thermodynamic version of a modified VC-formalism discussed previously in [9]; more details and proofs can be found in [10]. The main innovation of this approach comes from splitting error patterns into error shells and using estimates on the size of these error shells rather than on the total number of error patterns. We shall see from the examples discussed in the following section that this improves results significantly.

The space {0, 1}^m of all binary m-vectors naturally splits into m + 1 error pattern shells E_i^m, i = 0, 1, ..., m, with the i-th shell composed of all vectors with exactly i entries equal to 1. For each h ∈ H and x̄ = (x₁, ..., x_m) ∈ X^m, let v_h(x̄) ∈ {0, 1}^m denote the vector (error pattern) having 1 in the j-th position if and only if h(x_j) ≠ t(x_j). As the i-th error shell has \binom{m}{i} elements, the average error pattern density falling into this error shell is

    \Delta_i^m \stackrel{\mathrm{def}}{=} \binom{m}{i}^{-1} E_{\bar{x} \in X^m}\Big[\#\big(\{v_h(\bar{x}) \;;\; h \in H\} \cap E_i^m\big)\Big]    (i = 0, 1, ..., m),    (2)

where # denotes the cardinality of a set⁵.

⁵ Note the difference to the concept of error shells used in [4], which are partitions of the finite hypothesis space H according to the generalisation error values. Both formalisms are related, though, and the central result in [4], Theorem 4, can be derived from our Theorem 1 below.

Theorem 1 Let there be given a sequence of learning systems L_N = (X_N, μ_N, H_N, t_N), a scaling τ_N and a function φ : R⁺ × (0, 1) → R⁺ such that

    \ln \Delta_i^{m,N} \le -\tau_N\, \varphi\Big(\frac{m}{\tau_N}, \frac{i}{m}\Big) + o(\tau_N)    (3)

for all m, N = 1, 2, ... and 0 ≤ i ≤ m. Then

    \epsilon_\lambda^{wc,\infty}(\alpha) \le \epsilon_{\lambda,\beta}(\alpha)    (4)

for any 0 ≤ λ ≤ 1 and α, β > 0, where

    \epsilon_{\lambda,\beta}(\alpha) \stackrel{\mathrm{def}}{=} \max\Big\{\epsilon \in (0, 1) \;;\; \exists_{0 \le y \le \lambda,\; \epsilon \le x \le 1}\;\; \alpha\big(\mathcal{H}(y) + \beta\,\mathcal{H}(x)\big) - \varphi\Big(\alpha + \alpha\beta,\, \frac{y + \beta x}{1 + \beta}\Big) \ge 0 \Big\}

and \mathcal{H}(y) \stackrel{\mathrm{def}}{=} -y \ln y - (1 - y) \ln(1 - y) denotes the entropy function.

3 Main results: applications of the formalism

3.1 VC-bounds

We consider a learning sequence L_N = (X_N, μ_N, H_N, t_N), t_N ∈ H_N (the realisable case), and the scaling of this sequence by VC-dimension [17], i.e. we assume τ_N = d_VC(H_N) → ∞. For λ = 0 (the consistent learning case) the following bound on the N-th learning system can be derived [1, 17]:

    \epsilon_{0,N}^{wc}(m) \le \int_0^1 \min\Big(1,\; 2\Big(\frac{2em}{d_{VC}(H_N)}\Big)^{d_{VC}(H_N)} 2^{-m\epsilon/2}\Big)\, d\epsilon.    (5)

In the thermodynamic limit, i.e. as N → ∞, we get for any α > 1/e

    \epsilon_0^{wc,\infty}(\alpha) \le \min\Big(1,\, \frac{2 \log_2(2e\alpha)}{\alpha}\Big).    (6)

Note that this bound is independent of the probability distributions μ_N.

3.2 Piecewise constant functions

Let PC(d) denote the class of piecewise constant binary functions on the unit segment [0, 1) with up to d ≥ 0 discontinuities and with their values defined as 1 at all these discontinuity points.
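As a concrete aside (ours, with hypothetical names), a function in PC(d) can be represented by its sorted discontinuity points together with the value taken on the leading piece; the sketch below evaluates such a function, honouring the convention above that the value at each discontinuity point is 1:

    import numpy as np

    # Minimal sketch (ours) of a member of PC(d): determined by its sorted
    # discontinuity points a_1 < ... < a_k, k <= d, and its value on the
    # leading piece; the value alternates from piece to piece and, by the
    # convention above, equals 1 at every discontinuity point itself.
    def pc_eval(x, discontinuities, first_value=0):
        x = np.asarray(x, dtype=float)
        idx = np.searchsorted(discontinuities, x, side="right")  # pieces crossed
        vals = (first_value + idx) % 2           # alternating piece values
        return np.where(np.isin(x, discontinuities), 1, vals)

    # A member of PC(3): pieces [0,.2):0, [.2,.5):1, [.5,.9):0, [.9,1):1.
    print(pc_eval([0.1, 0.2, 0.3, 0.7, 0.95], [0.2, 0.5, 0.9]))  # [0 1 1 0 1]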
We consider here the learning sequence L_N = ([0, 1), μ_N, PC(d_N), t_N), where μ_N is any continuous probability distribution on [0, 1), d_N is a monotonic sequence of positive integers diverging to ∞, and the targets t_N ∈ PC(d_{t_N}) are such that the limit δ_t := lim_{N→∞} d_{t_N}/d_N exists. (Without loss of generality we can assume that all μ_N are the uniform distribution on [0, 1).)

For this learning sequence the following can be established.

Claim 1. The function defined for α > 1 and 0 ≤ x ≤ 1 as

    \varphi(\alpha, x) \stackrel{\mathrm{def}}{=} -\alpha(1 - x)\, \mathcal{H}\Big(\frac{1 + \delta_t}{2\alpha(1 - x)}\Big) - \alpha x\, \mathcal{H}\Big(\frac{1 + \delta_t}{2\alpha x}\Big) + \alpha\, \mathcal{H}(x)

for 2αx(1 - x) > 1, and as 0 otherwise, satisfies assumption (3) with respect to the scaling τ_N := d_N.

Claim 2. The following two-sided bound on the learning curve holds:

    \frac{1 - \delta_t}{2\alpha}\Big(1 + \ln\frac{2\alpha}{1 - \delta_t}\Big) \le \epsilon_0^{wc,\infty}(\alpha) \le \frac{1 + \delta_t}{2\alpha}\Big(1 + \ln\frac{2\alpha}{1 + \delta_t}\Big).    (7)

We outline the main steps of the proofs of these two claims now.

For Claim 1 we start with a combinatorial argument establishing that in the particular case of a constant target

    \Delta_i^{m,N} = \binom{m}{i}^{-1} \sum_{j=0}^{\lfloor d_N/2 \rfloor} \binom{m - i - 1}{j - 1} \binom{i - 1}{j - 1} \quad \text{for } d_N + d_{t_N} < \min(2i,\, 2(m - i)),

and Δ_i^{m,N} = 1 otherwise. Next we observe that the above sum is dominated by its largest term, whose exponential rate is expressed through the entropy function. This easily gives Claim 1 for a constant target (δ_t = 0). Now we observe that this particular case yields an upper bound for the general case (of a non-constant target) if we use the "effective" number of discontinuities d_N + d_{t_N} instead of d_N.

For Claim 2 we start with an estimate from [12, 11], derived from the Mauldon result [14], for the constant target t_N = const and m ≥ d_N. This immediately implies the expression

    \epsilon_0^{wc,\infty}(\alpha) = \frac{1}{2\alpha}\big(1 + \ln(2\alpha)\big)    (8)

for the constant target, which extends to the estimate (7) via straightforward lower and upper bounds on the "effective" number of discontinuities in the case of a non-constant target.
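As a quick numerical aside (ours, not the authors'), the closed forms above can be compared directly. Assuming the expressions (6) and (8) as reconstructed here, a few lines of Python show how far the distribution-independent VC bound sits above the exact curve for the constant target:

    import numpy as np

    # Exact thermodynamic learning curve (8) for a constant target, versus
    # the VC-dimension based bound (6) under the scaling tau_N = d_VC.
    def true_curve(a):                 # eq. (8): (1 + ln(2a)) / (2a)
        return (1.0 + np.log(2.0 * a)) / (2.0 * a)

    def vc_bound(a):                   # eq. (6): min(1, 2 log2(2ea) / a)
        return np.minimum(1.0, 2.0 * np.log2(2.0 * np.e * a) / a)

    for a in (1.0, 2.0, 6.0, 12.0, 20.0):
        print(f"alpha={a:5.1f}   true={true_curve(a):.3f}   VC={vc_bound(a):.3f}")
    # The VC bound stays trivial (= 1) until alpha is roughly 12, while the
    # true curve is already below 0.3 at alpha = 6.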
\"-: .... \n\n.... \" .. \n.... ''':2: :'C: 2:'.:: ::CC:.: .. ......... ~. \" ........................... ~.\"\".csc . \n\nVC Entropy \nEPD,0,=0.2 \nEPD,o,= O.O \nTC+, 0,=0.2 \nTCO,o,=O.O \nTC-, 0, = 0.2 \n\n0.0 \n\n10.0 \n\n20.0 \n\nscaled training sample size (a) \n\nFigure 1:  Plots of different estimates for thermodynamic limit of learning curves for \nthe sequence of multilayer perceptrons as in Claim 3 for consistent learning (A  =  0). \nEstimates on true learning curve from  (7) are for 8t  = 0 ('TCO') and 8t  = 0.2 ('TC+' \nand 'TC-' for  the upper and lower bound, respectively) .  Two upper bounds of the \nform  (4)  from the modified VC-formalism for r.p  as in Claim 1 and f3  =  1 are plotted \nfor  8t  =  0.0  and  8t  =  0.2  (marked  EPD).  For  comparison,  we  plot  also the  bound \n\n(10)  based on the VC-entropy;  VC  bound  (5)  being trivial for  this scaling, = 1,  c.f. \n\nCorollary 2, is not shown. \n\nwith changes to the error pattern densities  /)/fN,  the learning curves, etc., as small \nas  desired.  This observa.tion implies the follo~ing result: \nClaim  3  For  any  sequence  of multilayer  perceptrons,  M LpnN (WIN),  WIN  ~ \nthere  exists  a  sequence  of  continuous  probability  distributions  J.1.N  on \n00, \nR nN  with  properties  as \nFor  any  sequence  of  targets  tN  E \nM LpnN (WltN)'  both  Claim  1  and  Claim  2  of Section  3.2  hold  for  the  learn-\ning  sequence  (RnN, J.1.N,  M LpnN (WIN), tN)  with  scaling  TN  d~ nIN  and  8t  = \nIn  particular  bound  (4)  on  the  learning  curve  holds  for  r.p \nlimN-+oo WltN IWIN. \nas in  Claim  1. \n\nfollows. \n\nCorollary 2  If additionally  the number of units in  first  hidden  layer 1llN  ~ 00, \nthen  the  thermodynamic limit  of VC-bound  (5)  with  respect  to  the  scaling TN  = \nWIN  is  trivial,  i.e.  =  1 for  all a  > O. \n\ntt'lN \n\n-\n\nProof.  The bound  (5)  is trivial for m  ~ 12dN,  where dN d~ dvc(M LpnN (WltN )). \nAs  dN  =  O(WIN IOg2(1lIN))  [13,  15]  for  any  continuous  probability on  the  input \nspace, this bound is trivial  for  any a  =  ~ < 12..4H...  ~ 00  if N  ~ 00. 0 \nThere is a possibility that VC  dimension based bounds are applicable but fail  to cap(cid:173)\nture the true behavior because of their independence from the distribution.  One op(cid:173)\ntion to remedy the situation is to try a distribution-specific estimate such as VC en(cid:173)\ntropy (i.e.  the expectation of the logarithm of the counting function  IIN(XI, ... , xm) \nwhich  is  the  number  of  dichotomies  realised  by  the  perceptron  for  the  m-tuple \nof input  points  [18]) .  However,  in  our  case,  lIN (Xl , ... , xm)  has  the  lower  bound \n2 \",min(wlN/2 m-l) (m)  \u00a3, \nIS  VIrtu  y t  e ex-\nL.\"i=O \npression  from  Sauer's lemma  with  VC-dimension  replaced  by  WIN 12.  Thus using \n\n'  or Xl, ... , Xm  m genera  pOSItIOn,  w  IC \n\nl '  . \n\nh'  h' \n\nall \n\nWIN \n\n' \n\ni\n\n\u2022 \n\n. \n\nh \n\n\f196 \n\nA.  Kowalczyk and H.  Ferra \n\nVC  entropy  instead  of  VC  dimension  (and  Sauer's  Lemma)  we  cannot  hope  for \na  better result  than  bounds  of the  form  (5)  with  WlN 12  replacing  VC-dimension \nresulting in  the bound \n\n(0 > lie) \n\n(10) \n\nin  the  thermodynamic  limit  with  respect  to  the  scaling  TN  =  WlN. \n(Note  that \nmore  \"optimistic\"  VC  entropy  based  bounds can be  obtained  if prior distribution \non hypothesis space is  given and taken into account  [3].) 
The plots of learning curves are shown in Figure 1.

[Figure 1 appears here: generalisation error ε (0 to 1) plotted against the scaled training sample size α (0 to 20) for the six estimates listed in the caption.]

Figure 1: Plots of different estimates of the thermodynamic limit of learning curves for the sequence of multilayer perceptrons as in Claim 3, for consistent learning (λ = 0). Estimates of the true learning curve from (7) are given for δ_t = 0 ('TC0') and δ_t = 0.2 ('TC+' and 'TC-' for the upper and lower bound, respectively). Two upper bounds of the form (4) from the modified VC-formalism, with φ as in Claim 1 and β = 1, are plotted for δ_t = 0.0 and δ_t = 0.2 (marked 'EPD'). For comparison we also plot the bound (10) based on the VC-entropy; the VC bound (5), being trivial for this scaling (= 1, cf. Corollary 2), is not shown.

Acknowledgement. The permission of the Director of Telstra Research Laboratories to publish this paper is gratefully acknowledged.

References

[1] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, Oct. 1989.
[2] T.M. Cover. Geometrical and statistical properties of linear inequalities with applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326-334, 1965.
[3] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83-113, 1994.
[4] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proc. COLT'94, pages 76-87, 1994.
[5] S.B. Holden and M. Niranjan. On the practical applicability of VC dimension bounds. Neural Computation, 7:1265-1288, 1995.
[6] P. Koiran and E.D. Sontag. Neural networks with quadratic VC-dimension. In Proc. NIPS 8, pages 197-203. The MIT Press, Cambridge, MA, 1996.
[7] A. Kowalczyk. Counting function theorem for multi-layer networks. In Proc. NIPS 6, pages 375-382. Morgan Kaufmann Publishers, Inc., 1994.
[8] A. Kowalczyk. Estimates of storage capacity of multi-layer perceptron with threshold logic hidden units. Neural Networks, to appear.
[9] A. Kowalczyk and H. Ferra. Generalisation in feedforward networks. In Proc. NIPS 6, pages 215-222. The MIT Press, Cambridge, MA, 1994.
[10] A. Kowalczyk. An asymptotic version of EPD-bounds on generalisation in learning systems. Preprint, 1996.
[11] A. Kowalczyk, J. Szymanski, and R.C. Williamson. Learning curves from a modified VC-formalism: a case study. In Proc. of ICNN'95, pages 2939-2943. IEEE, 1995.
[12] A. Kowalczyk, J. Szymanski, P.L. Bartlett, and R.C. Williamson. Examples of learning curves from a modified VC-formalism. In Proc. NIPS 8, pages 344-350. The MIT Press, 1996.
[13] W. Maass. Neural nets with superlinear VC-dimension. Neural Computation, 6:877-884, 1994.
[14] J.G. Mauldon. Random division of an interval. Proc. Cambridge Phil. Soc., 41:331-336, 1951.
[15] A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In Proc. of the 1993 World Congress on Neural Networks, 1993.
[16] E. Sontag. Shattering all sets of k points in "general position" requires (k-1)/2 parameters. Report 96-01, Rutgers Center for Systems and Control, 1996.
[17] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
[18] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
", "award": [], "sourceid": 1219, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}, {"given_name": "Herman", "family_name": "Ferr\u00e1", "institution": null}]}