{"title": "On Stochastic Complexity and Admissible Models for Neural Network Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 818, "page_last": 824, "abstract": null, "full_text": "On Stochastic Complexity and Admissible Models for Neural Network Classifiers\n\nPadhraic Smyth\n\nCommunications Systems Research\n\nJet Propulsion Laboratory\n\nCalifornia Institute of Technology\n\nPasadena, CA 91109\n\nAbstract\n\nGiven some training data, how should we choose a particular network\nclassifier from a family of networks of different complexities? In this paper\nwe discuss how the application of stochastic complexity theory to classifier\ndesign problems can provide some insights into this problem. In particular\nwe introduce the notion of admissible models, whereby the complexity of\nmodels under consideration is affected by (among other factors) the class\nentropy, the amount of training data, and our prior belief. We also\ndiscuss the implications of these results with respect to neural\narchitectures and demonstrate the approach on real data from a medical diagnosis\ntask.\n\n1  Introduction and Motivation\n\nIn this paper we examine in a general sense the application of Minimum Description\nLength (MDL) techniques to the problem of selecting a good classifier from a large\nset of candidate models or hypotheses. Pattern recognition algorithms differ from\nmore conventional statistical modeling techniques in the sense that they typically\nchoose from a very large number of candidate models to describe the available data.\nHence, the problem of searching through this set of candidate models is frequently\na formidable one, often approached in practice by the use of greedy algorithms. 
In\nthis context, techniques which allow us to eliminate portions of the hypothesis space\nare of considerable interest. We will show in this paper that it is possible to use the\nintrinsic structure of the MDL formalism to eliminate large numbers of candidate\nmodels given only minimal information about the data. Our results depend on the\nvery simple notion that models which are obviously too complex for the problem\n(e.g., models whose complexity exceeds that of the data itself) can be discarded\nfrom further consideration in the search for the most parsimonious model.\n\n2  Background on Stochastic Complexity Theory\n\n2.1  General Principles\n\nStochastic complexity prescribes a general theory of inductive inference from data,\nwhich, unlike more traditional inference techniques, takes into account the\ncomplexity of the proposed model in addition to the standard goodness-of-fit of the\nmodel to the data. For a detailed rationale the reader is referred to the work of\nRissanen (1984) or Wallace and Freeman (1987) and the references therein. Note\nthat the Minimum Description Length (MDL) technique (as Rissanen's approach\nhas become known) is implicitly related to Maximum A Posteriori (MAP) Bayesian\nestimation techniques if cast in the appropriate framework.\n\n2.2  Minimum Description Length and Stochastic Complexity\n\nFollowing the notation of Barron and Cover (1991), we have N data points,\ndescribed as a sequence of tuples of observations $\\{x_i^1, \\ldots, x_i^K, y_i\\}$, $1 \\le i \\le N$, to be\nreferred to as $\\{x_i, y_i\\}$ for short. 
The $x_i^k$ correspond to values taken on by the K\nrandom variables $X^k$ (which may be continuous or discrete), while, for the purposes\nof this paper, the $y_i$ are elements of the finite alphabet of the discrete m-ary class\nvariable Y. Let $\\Gamma_N = \\{M_1, \\ldots, M_{|\\Gamma_N|}\\}$ be the family of candidate models under\nconsideration. Note that by defining $\\Gamma_N$ as a function of N, the number of data\npoints, we allow the possibility of considering more complicated models as more\ndata arrives. For each $M_j \\in \\Gamma_N$ let $C(M_j)$ be non-negative numbers such that\n\n$$\\sum_j 2^{-C(M_j)} \\le 1.$$\n\nThe $C(M_j)$ can be interpreted as the cost in bits of specifying model $M_j$; in turn,\n$2^{-C(M_j)}$ is the prior probability assigned to model $M_j$ (suitably normalized). Let\nus use $\\mathcal{C} = \\{C(M_1), \\ldots, C(M_{|\\Gamma_N|})\\}$ to refer to a particular coding scheme for $\\Gamma_N$.\nHence the total description length of the data plus a model $M_j$ is defined as\n\n$$L(M_j, \\{x_i, y_i\\}) = C(M_j) + \\log \\frac{1}{p(\\{y_i\\} \\mid M_j(\\{x_i\\}))},$$\n\ni.e., we first describe the model and then the class data relative to the given model\n(as a function of $\\{x_i\\}$, the feature data). The stochastic complexity of the data\n$\\{x_i, y_i\\}$ relative to $\\mathcal{C}$ and $\\Gamma_N$ is the minimum description length\n\n$$I(\\{x_i, y_i\\}) = \\min_{M_j \\in \\Gamma_N} L(M_j, \\{x_i, y_i\\}).$$\n\nThe problem of finding the model of shortest description length is intractable in\nthe general case; nonetheless the idea of finding the best model we can is well\nmotivated, works well in practice and is far preferable to the alternative approach\nof ignoring the complexity issue entirely. 
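As a rough illustration of the definitions in Section 2.2, the following Python sketch computes a description length as the model cost plus the negative log-likelihood of the class labels, and takes the minimum over a small candidate family. The function names, the model costs and the label probabilities are hypothetical and chosen only for illustration; logarithms are assumed to be base 2 so that every quantity is measured in bits.

```python
import math

def description_length(model_cost_bits, label_probs):
    '''Total description length in bits: C(M) + sum_i log2(1 / p(y_i | M(x_i))).

    label_probs holds the probability the model assigns to each observed label.
    '''
    fit_bits = sum(-math.log2(p) for p in label_probs)
    return model_cost_bits + fit_bits

def satisfies_kraft(costs_bits):
    '''Check the Kraft inequality sum_j 2^(-C(M_j)) <= 1 for a coding scheme.'''
    return sum(2.0 ** -c for c in costs_bits) <= 1.0

def stochastic_complexity(candidates):
    '''Minimum description length over (cost_bits, label_probs) candidates.'''
    return min(description_length(c, probs) for c, probs in candidates)

# Hypothetical toy family: a cheap, poorly fitting model and a costlier, better one.
candidates = [
    (1.0, [0.5, 0.5, 0.5, 0.5]),  # 1 bit of model cost, 4 bits of fit
    (2.0, [0.5, 1.0, 1.0, 1.0]),  # 2 bits of model cost, 1 bit of fit
]
```

Here the second model wins despite its higher cost, because the extra bit spent on the model saves three bits when describing the labels.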
\n\n3  Admissible Stochastic Complexity Models\n\n3.1  Definition of Admissibility\n\nWe will find it useful to define the notion of an admissible model for the classification\nproblem: the set of admissible models $\\Omega_N$ ($\\subseteq \\Gamma_N$) is defined as all models whose\ncomplexity is such that there exists no other model whose description length is\nknown to be smaller. In other words, inadmissible models are\nthose which have complexity in bits greater than the smallest known description length;\nclearly they cannot be better than the best known model in terms of description\nlength and can be eliminated from consideration. Hence, $\\Omega_N$ is defined dynamically\nand is a function of how many description lengths we have already calculated in our\nsearch. Typically $\\Gamma_N$ may be pre-defined, such as the class of all 3-layer feed-forward\nneural networks with particular activation functions. We would like to restrict our\nsearch for a good model to the set $\\Omega_N \\subseteq \\Gamma_N$ as far as possible (since non-admissible\nmodels are of no practical use). In practice it may be difficult to determine the\nexact boundaries of $\\Omega_N$, particularly when $|\\Gamma_N|$ is large (with decision trees or\nneural networks for example). Note that the notion of admissibility described here\nis particularly useful when we seek a minimal description length, or equivalently a\nmodel of maximal a posteriori probability; in situations where one's goal is to\naverage over a number of possible models (in a Bayesian manner) a modification of\nthe admissibility criterion would be necessary. 
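The dynamic character of the admissible set can be sketched as a simple pruning loop. The Python fragment below is only an illustration under stated assumptions: the function name, the model names, their costs and their description lengths are all hypothetical, and candidates are visited in order of increasing cost (an ordering the text does not prescribe). A candidate is skipped as soon as its cost in bits exceeds the smallest description length found so far.

```python
def admissible_search(models, evaluate):
    '''Visit models in order of increasing cost, skipping inadmissible ones.

    models: list of (name, cost_bits) pairs.
    evaluate: function returning the full description length for a name.
    Returns (best_name, best_length, evaluated_names).
    '''
    best_name, best_length = None, float('inf')
    evaluated = []
    for name, cost_bits in sorted(models, key=lambda m: m[1]):
        # A model whose cost alone exceeds a known description length
        # cannot give the shortest description, so it is inadmissible.
        if cost_bits > best_length:
            continue
        length = evaluate(name)
        evaluated.append(name)
        if length < best_length:
            best_name, best_length = name, length
    return best_name, best_length, evaluated

# Hypothetical costs and description lengths, for illustration only.
models = [('small', 10.0), ('medium', 40.0), ('large', 90.0), ('huge', 400.0)]
lengths = {'small': 120.0, 'medium': 85.0, 'large': 110.0, 'huge': 420.0}
best = admissible_search(models, lengths.get)
```

In this toy run only the two cheapest models are ever evaluated: once a description length of 85 bits is known, every model costing more than 85 bits is discarded without being fitted.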
\n\n3.2  Results for Admissible Models\n\nSimple techniques for eliminating obvious non-admissible models are of interest: for\nthe classification problem a necessary condition that a model $M_j$ be admissible is\nthat\n\n$$C(M_j) \\le N \\cdot H(X) \\le N \\log(m)$$\n\nwhere H(X) is the entropy of the m-ary class variable X. The obvious interpretation\nin words is that any admissible model must have complexity less than that of the\ndata itself. It is easy to show in addition that the complexity of any admissible\nmodel is upper bounded by the parameters of the classification problem:\n\nHence, the size of the space of admissible models can also be bounded:\n\nOur approach suggests that for classification at least, once we know N and the\nnumber of classes m, there are strict limitations on how many admissible models we\ncan consider. Of course the theory does not state that considering a larger subset\nwill necessarily result in a less optimal model being found; however, it is difficult to\nargue the case for including large numbers of models which are clearly too complex\nfor the problem. At best, such an approach will lead to an inefficient search, whereas\nat worst a very poor model will be chosen, perhaps as a result of the use of a poor\ncoding scheme for the unnecessarily large hypothesis space.\n\n3.3  Admissible Models and Bayes Risk\n\nThe notion of minimal compression (the minimum achievable goodness-of-fit) is\nintimately related in the classification problem to the minimal Bayes risk for the\nproblem (Kovalevsky, 1980). Let $M_B$ be any model (not necessarily unique) which\nachieves the optimal Bayes risk (i.e., minimizes the classifier error) for the\nclassification problem. 
In particular, $C(\\{y_i\\} \\mid M_B(\\{x_i\\}))$ is not necessarily zero; indeed\nin most practical problems of interest it is non-zero, due to the ambiguity in the\nmapping from the feature space to the class variable. In addition, $M_B$ may not be\ndefined in the set $\\Gamma_N$, and hence $M_B$ need not even be admissible. If, in the limit\nas $N \\to \\infty$, $M_B \\notin \\Gamma_\\infty$, then there is a fundamental approximation error in the\nrepresentation being used, i.e., the family of models under consideration is not flexible\nenough to optimally represent the mapping from $\\{x_i\\}$ to $\\{y_i\\}$. Smyth (1991) has\nshown how information about the Bayes error rate for the problem (if available)\ncan be used to further tighten the bounds on admissibility.\n\n4  Applying Minimum Description Length Principles to Neural Network Design\n\nIn principle the admissibility results can be applied to a variety of classifier design\nproblems; applications to Markov model selection and decision tree design are\ndescribed elsewhere (Smyth, 1991). In this paper we limit our attention to the\nproblem of automatically selecting a feedforward multi-layer network architecture.\n\n4.1  Calculation of the Goodness-of-Fit\n\nAs is clear from the preceding discussion, application of the MDL principle to\nclassifier selection requires that the classifier produce a posterior probability estimate of\nthe class labels. In the context of a network model this is not a problem provided the\nnetwork is trained to provide such estimates. This requires a simple modification\nof the objective function to a log-likelihood function $-\\sum_{i=1}^{N} \\log(\\hat{p}(y_i \\mid x_i))$, where $y_i$\nis the class label of the ith training datum and $\\hat{p}(\\cdot)$ is the network's estimate of $p(\\cdot)$. 
\nThis function has been proposed in the literature in the past under the guise of a\ncross-entropy measure (for the special case of binary classes) and more recently it\nhas been derived from the more basic arguments of Maximum Mutual Information\n(MMI) (Bridle, 1990) and Maximum Likelihood (ML) Estimation (Gish, 1991). The\ncross-entropy function for network training is nothing more than the goodness-of-fit\ncomponent of the description length criterion. Hence, both MMI and ML (since\nthey are equivalent in this case) are special cases of the MDL procedure wherein\nthe complexity term is a constant and is left out of the optimization (all models are\nassumed to be equally likely and likelihood alone is used as the decision criterion).\n\n4.2  Complexity Penalization for Multi-layer Perceptron Models\n\nIt has been proposed in the past (Barron, 1989) to use a penalty term of (k/2) log N,\nwhere k is the number of parameters (weights and biases) in the network. The\norigins of this complexity measure lie in general arguments originally proposed by\nRissanen (1984). However, this penalty term is too large. Cybenko (1990) has\npointed out that existing successful applications of networks have far more\nparameters than could possibly be justified by a statistical analysis, given the amount of\ntraining data used to construct the network. The critical factor lies in the precision\nto which these parameters are stated in the final model. In essence the principle\nof MDL (and Bayesian techniques) dictates that the data only justifies the stating\nof any parameter in the model to some finite precision, inversely proportional to\nthe inherent variance of the estimate. 
Approximate techniques for the calculation\nof the complexity terms in this manner have been proposed (Weigend, Huberman\nand Rumelhart, this volume) but a complete description length analysis has not yet\nappeared in the literature.\n\n4.3  Complexity Penalization for a Discrete Network Model\n\nIt turns out that there are alternatives to multi-layer perceptrons whose complexity\nis much easier to calculate. We will look in particular at the rule-based network\nof Goodman et al. (1990). In this model the hidden units correspond to Boolean\ncombinations of discrete input variables. The link weights from hidden to output\n(class) nodes are proportional to log conditional probabilities of the class given the\nactivation of a hidden node. The output nodes form estimates of the posterior class\nprobabilities by a simple summation followed by a normalization. The implicit\nassumption of conditional independence is ameliorated in practice by the fact that\nthe hidden units are chosen in a manner to ensure that the assumption is violated\nas little as possible.\n\nThe complexity penalty for the network is calculated as being (1/2) log N per link\nfrom the hidden to output layers, plus an appropriate coding term for the\nspecification of the hidden units. Hence, the description length of a network with k hidden\nunits would be\n\n$$L = -\\sum_{i=1}^{N} \\log(\\hat{p}(y_i \\mid x_i)) + \\frac{k}{2} \\log N - \\sum_{i=1}^{k} \\log \\pi(o_i)$$\n\nwhere $o_i$ is the order of the ith hidden node and $\\pi(o_i)$ is a prior probability on the\norders. Using this definition of description length we get from our earlier results\non admissible models that the number of hidden units in the architecture is upper\nbounded by\n\n$$k \\le \\frac{N H(C)}{0.5 \\log N + \\log K + 1}$$\n\nwhere K is the number of binary input attributes. 
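The bound on the number of hidden units is easy to evaluate numerically. The Python sketch below is a hedged illustration rather than the author's procedure: the function name is hypothetical, logarithms are assumed to be base 2, and the inputs are the figures quoted for the medical diagnosis task in Section 4.4 (N = 439 training samples, a class entropy of roughly 1 bit, and nine input characteristics, which are treated here as K = 9 attributes).

```python
import math

def max_admissible_hidden_units(n_samples, class_entropy_bits, n_attributes):
    '''Largest integer k with k <= N*H(C) / (0.5*log2(N) + log2(K) + 1).'''
    # Cost in bits attributed to each hidden unit by the bound.
    bits_per_unit = 0.5 * math.log2(n_samples) + math.log2(n_attributes) + 1.0
    return math.floor(n_samples * class_entropy_bits / bits_per_unit)

# Assumed inputs taken from the medical diagnosis example.
k_max = max_admissible_hidden_units(439, 1.0, 9)
```

Under these assumptions each hidden unit accounts for a little over 8.5 bits, so the 439 bits available for describing the class labels admit at most 51 hidden units, which agrees with the limit stated for that example.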
\n\n4.4  Application to a Medical Diagnosis Problem\n\nWe consider the application of our techniques to the discovery of a parsimonious\nnetwork for breast cancer diagnosis, using the discrete network model. A common\ntechnique in breast cancer diagnosis is to obtain a fine needle aspirate (FNA) from\nthe patient. The FNA sample is then evaluated under a microscope by a physician\nwho makes a diagnosis. Ground truth in the form of binary class labels (\"benign\"\nor \"malignant\") is obtained by re-examination or biopsy at a later stage. Wolberg\nand Mangasarian (1991) described the collection of a database of such information.\nThe feature information consisted of subjective evaluations of nine FNA sample\ncharacteristics such as uniformity of cell size, marginal adhesion and mitoses. The\ntraining data consists of 439 such FNA samples obtained from real patients which\nwere later assigned class labels. Given that the prior class entropy is almost 1 bit,\none can immediately state from our bounds that networks with more than 51 hidden\nunits are inadmissible. Furthermore, as we evaluate different models we can narrow\nthe region of admissibility using the results stated earlier. Figure 1 gives a graphical\ninterpretation of this procedure.\n\n[Plot: number of hidden units (vertical axis, 0 to 40) against description length in bits (horizontal axis, 100 to 350), showing the upper bound on admissible complexity and the inadmissible region.]\n\nFigure 1. 
Inadmissible region as a function of description length\n\nThe algorithm effectively moves up the left-hand axis, adding hidden units in a\ngreedy manner. Initially the description length (the lower curve) decreases rapidly\nas we capture the gross structure in the data. For each model for which we calculate a\ndescription length, we can in turn calculate an upper bound on admissibility (the\nupper curve); this bound is linear in description length. Hence, for example, by\nthe time we have 5 hidden units we know that any models with more than 21 hidden\nunits are inadmissible. Finally a local minimum of the description length function\nis reached at 12 units, at which point we know that the optimal solution can have at\nmost 16 hidden units. As a matter of interest, the resulting network with 12 hidden\nunits correctly classified 94 of 96 independent test cases.\n\n5  Conclusion\n\nThere are a variety of related issues which arise in this context which we can only\nbriefly mention due to space constraints. For example, how does the prior \"model\nentropy\", $H(\\Omega_N) = -\\sum_i p(M_i) \\log(p(M_i))$, affect the complexity of the search\nproblem? Questions also naturally arise as to how $\\Omega_N$ should grow as a function of\nN in an incremental learning scenario.\n\nIn conclusion, it should not be construed from this paper that consideration of\nadmissible models is the major factor in inductive inference; certainly the choice\nof description lengths for the various models and the use of efficient optimization\ntechniques for seeking the parameters of each model remain the cornerstones of\nsuccess. Nonetheless, our results provide useful theoretical insight and are practical\nto the extent that they provide a \"sanity check\" for model selection in MDL. 
\n\nAcknowledgments\n\nThe research described in this paper was performed at the Jet Propulsion\nLaboratory, California Institute of Technology, under a contract with the National\nAeronautics and Space Administration. In addition this work was supported in part by\nthe Air Force Office of Scientific Research under grant number AFOSR-90-0199.\n\nReferences\n\nA. R. Barron (1989), 'Statistical properties of artificial neural networks,' in\nProceedings of 1989 IEEE Conference on Decision and Control.\n\nA. R. Barron and T. M. Cover (1991), 'Minimum complexity density estimation,'\nto appear in IEEE Trans. Inform. Theory.\n\nJ. Bridle (1990), 'Training stochastic model recognition algorithms as networks can\nlead to maximum mutual information estimation of parameters,' in D. S. Touretzky\n(ed.), Advances in Neural Information Processing Systems 1, pp. 211-217, San\nMateo, CA: Morgan Kaufmann.\n\nG. Cybenko (1990), 'Complexity theory of neural networks and classification\nproblems,' preprint.\n\nH. Gish (1991), 'Maximum likelihood training of neural networks,' to appear in\nProceedings of the Third International Workshop on AI and Statistics, (D. Hand,\ned.), Chapman and Hall: London.\n\nR. M. Goodman, C. Higgins, J. W. Miller, and P. Smyth (1990), 'A rule-based\napproach to neural network classifiers,' in Proceedings of the 1990 International\nNeural Network Conference, Paris, France.\n\nV. A. Kovalevsky (1980), Image Pattern Recognition, translated from Russian by\nA. Brown, New York: Springer Verlag, p. 79.\n\nJ. Rissanen (1984), 'Universal coding, information, prediction, and estimation,'\nIEEE Trans. Inform. Theory, vol. 30, pp. 629-636.\n\nP. 
Smyth (1991), 'Admissible stochastic complexity models for classification\nproblems,' to appear in Proceedings of the Third International Workshop on AI and\nStatistics, (D. Hand, ed.), Chapman and Hall: London.\n\nC. S. Wallace and P. R. Freeman (1987), 'Estimation and inference by compact\ncoding,' J. Royal Stat. Soc. B, vol. 49, no. 3, pp. 240-251.\n\nW. H. Wolberg and O. L. Mangasarian (1991), 'Multi-surface method of pattern\nseparation applied to breast cytology diagnosis,' Proceedings of the National Academy\nof Sciences, in press.\n", "award": [], "sourceid": 435, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}