{"title": "Occam's Razor", "book": "Advances in Neural Information Processing Systems", "page_first": 294, "page_last": 300, "abstract": null, "full_text": "Occam's Razor \n\nCarl Edward Rasmussen \n\nDepartment of Mathematical Modelling \n\nTechnical University of Denmark \n\nBuilding 321, DK-2800 Kongens Lyngby, Denmark \ncarl@imm.dtu.dk  http://bayes.imm.dtu.dk \n\nZoubin Ghahramani \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\n17 Queen Square, London WC1N 3AR, England \n\nzoubin@gatsby.ucl.ac.uk  http://www.gatsby.ucl.ac.uk \n\nAbstract \n\nThe Bayesian paradigm apparently only sometimes gives rise to Occam's \nRazor; at other times very large models perform well. We give simple \nexamples of both kinds of behaviour. The two views are reconciled when \nmeasuring complexity of functions, rather than of the machinery used to \nimplement them. We analyze the complexity of functions for some linear \nin the parameters models that are equivalent to Gaussian Processes, and \nalways find Occam's Razor at work. \n\n1  Introduction \n\nOccam's Razor is a well-known principle of \"parsimony of explanations\" which is influential \nin scientific thinking in general and in problems of statistical inference in particular. In \nthis paper we review its consequences for Bayesian statistical models, where its behaviour \ncan be easily demonstrated and quantified. One might think that one has to build a prior \nover models which explicitly favours simpler models. But as we will see, Occam's Razor is \nin fact embodied in the application of Bayesian theory. This idea is known as an \"automatic \nOccam's Razor\" [Smith & Spiegelhalter, 1980; MacKay, 1992; Jefferys & Berger, 1992]. \n\nWe focus on complex models with large numbers of parameters which are often referred to \nas non-parametric. 
We will use the term to refer to models in which we do not necessarily \nknow the roles played by individual parameters, and inference is not primarily targeted at \nthe parameters themselves, but rather at the predictions made by the models. These types \nof models are typical for applications in machine learning. \n\nFrom a non-Bayesian perspective, arguments are put forward for adjusting model complexity \nin the light of limited training data, to avoid over-fitting. Model complexity is often \nregulated by adjusting the number of free parameters in the model and sometimes complexity \nis further constrained by the use of regularizers (such as weight decay). If the model \ncomplexity is either too low or too high, performance on an independent test set will suffer, \ngiving rise to a characteristic Occam's Hill. Typically an estimator of the generalization \nerror or an independent validation set is used to control the model complexity. \n\nFrom the Bayesian perspective, authors seem to take two conflicting stands on the question \nof model complexity. One view is to infer the probability of the model for each of several \ndifferent model sizes and use these probabilities when making predictions. An alternate \nview suggests that we simply choose a \"large enough\" model and sidestep the problem of \nmodel size selection. Note that both views assume that parameters are averaged over. \nExample: Should we use Occam's Razor to determine the optimal number of hidden units in a \nneural network or should we simply use as many hidden units as is computationally possible? \nWe now describe these two views in more detail. \n\n1.1  View 1: Model size selection \n\nOne of the central quantities in Bayesian learning is the evidence, the probability of the data \ngiven the model, $P(Y|M_i)$, computed as the integral over the parameters w of the likelihood \ntimes the prior. 
The evidence is related to the probability of the model, $P(M_i|Y)$, through \nBayes' rule: \n\n$P(M_i|Y) = P(Y|M_i) P(M_i) / P(Y),$ \n\nwhere it is not uncommon that the prior on models $P(M_i)$ is flat, such that $P(M_i|Y)$ is \nproportional to the evidence. Figure 1 explains why the evidence discourages overcomplex \nmodels, and can be used to select\u00b9 the most probable model. \n\nIt is also possible to understand how the evidence discourages overcomplex models and \ntherefore embodies Occam's Razor by using the following interpretation. The evidence is \nthe probability that if you randomly selected parameter values from your model class, you \nwould generate data set Y. Models that are too simple will be very unlikely to generate \nthat particular data set, whereas models that are too complex can generate many possible \ndata sets, so again, they are unlikely to generate that particular data set at random. \n\n1.2  View 2: Large models \n\nIn non-parametric Bayesian models there is no statistical reason to constrain models, as \nlong as our prior reflects our beliefs. In fact, since constraining the model order (i.e. number \nof parameters) to some small number would not usually fit in with our prior beliefs \nabout the true data generating process, it makes sense to use large models (no matter how \nmuch data you have) and pursue the infinite limit if you can\u00b2. For example, we ought not \nto limit the number of basis functions in function approximation a priori since we don't \nreally believe that the data was actually generated from a small number of fixed basis functions. \nTherefore, we should consider models with as many parameters as we can handle \ncomputationally. \n\nNeal [1996] showed how multilayer perceptrons with large numbers of hidden units \nachieved good performance on small data sets. He used sophisticated MCMC techniques \nto implement averaging over parameters. 
Following this line of thought there is no model \ncomplexity selection task: We don't need to evaluate evidence (which is often difficult) \nand we don't need or want to use Occam's Razor to limit the number of parameters in our \nmodel. \n\n\u00b9We really ought to average together predictions from all models weighted by their probabilities. \nHowever, if the evidence is strongly peaked, or for practical reasons, we may want to select one as an \napproximation. \n\n\u00b2For some models, the limit of an infinite number of parameters is a simple model which can be \ntreated tractably. Two examples are the Gaussian Process limit of Bayesian neural networks [Neal, \n1996], and the infinite limit of Gaussian mixture models [Rasmussen, 2000]. \n\n[Figure 1 graphic omitted. Recoverable labels: curve \"too complex\"; horizontal axis \"All possible data sets\", with Y marked.] \n\nFigure 1: Left panel: the evidence as a function of an abstract one dimensional representation \nof \"all possible\" datasets. Because the evidence must \"normalize\", very complex \nmodels which can account for many data sets only achieve modest evidence; simple models \ncan reach high evidences, but only for a limited set of data. When a dataset Y is observed, \nthe evidence can be used to select between model complexities. Such selection cannot be \ndone using just the likelihood, $P(Y|w, M_i)$. Right panel: neural networks with different \nnumbers of hidden units form a family of models, posing the model selection problem. 
\n\n2  Linear in the parameters models - Example: the Fourier model \n\nFor simplicity, consider function approximation using the class of models that are linear in \nthe parameters; this class includes many well-known models such as polynomials, splines, \nkernel methods, etc.: \n\n$y(x) = \\sum_i w_i \\phi_i(x) \\Leftrightarrow y = w^T \\Phi,$ \n\nwhere y is the scalar output, w are the unknown weights (parameters) of the model, $\\phi_i(x)$ \nare fixed basis functions, $\\Phi_{in} = \\phi_i(x^{(n)})$ and $x^{(n)}$ is the (scalar or vector) input for example \nnumber n. For example, a Fourier model for scalar inputs has the form: \n\n$y(x) = a_0 + \\sum_{d=1}^{D} a_d \\sin(dx) + b_d \\cos(dx),$ \n\nwhere $w = \\{a_0, a_1, b_1, \\ldots, a_D, b_D\\}$. Assuming an independent Gaussian prior on the \nweights: \n\n$p(w|S, c) \\propto \\exp\\big(-\\frac{S}{2} \\big[c_0 a_0^2 + \\sum_{d=1}^{D} c_d (a_d^2 + b_d^2)\\big]\\big),$ \n\nwhere S is an overall scale and $c_d$ are precisions (inverse variances) for weights of order \n(frequency) d. It is easy to show that Gaussian priors over weights imply Gaussian Process \npriors over functions\u00b3. The covariance function for the corresponding Gaussian Process \nprior is: \n\n$K(x, x') = \\big[\\sum_{d=0}^{D} \\cos(d(x - x')) / c_d\\big] / S.$ \n\n\u00b3Under the prior, the joint density of any (finite) set of outputs y is Gaussian. \n\n[Figure 2 graphic omitted. Top: twelve panels titled Order 0 to Order 11. Bottom: plot against Model order, 0 to 11.] \n\nFigure 2: Top: 12 different model orders for the \"unscaled\" model: $c_d \\propto 1$. The mean \npredictions are shown with a full line, the dashed and dotted lines limit the 50% and 95% \ncentral mass of the predictive distribution (which is Student-t). Bottom: posterior probability \nof the models, normalised over the 12 models. The probabilities of the models exhibit \nan Occam's Hill, discouraging models that are either \"too small\" or \"too big\". \n\n2.1  Inference in the Fourier model \n\nGiven data $\\mathcal{D} = \\{x^{(n)}, y^{(n)} | n = 1, \\ldots, N\\}$ with independent Gaussian noise with precision \n$\\tau$, the likelihood is: \n\n$p(y|x, w, \\tau) \\propto \\prod_{n=1}^{N} \\exp\\big(-\\frac{\\tau}{2} [y^{(n)} - w^T \\Phi_n]^2\\big).$ \n\nFor analytical convenience, let the scale of the prior be proportional to the noise precision, \n$S = C\\tau$, and put vague\u2074 Gamma priors on $\\tau$ and C: \n\n$p(\\tau) \\propto \\tau^{\\alpha_1 - 1} \\exp(-\\beta_1 \\tau), \\quad p(C) \\propto C^{\\alpha_2 - 1} \\exp(-\\beta_2 C),$ \n\nthen we can integrate over weights and noise to get the evidence as a function of prior \nhyperparameters, C (the overall scale) and c (the relative scales): \n\n$E(C, c) = \\int\\int p(y|x, w, \\tau) p(w|C, \\tau, c) p(\\tau) p(C) \\, d\\tau \\, dw = \\frac{\\beta_1^{\\alpha_1} \\beta_2^{\\alpha_2} \\Gamma(\\alpha_1 + N/2)}{(2\\pi)^{N/2} \\Gamma(\\alpha_1) \\Gamma(\\alpha_2)}$ \n$\\times |A|^{-1/2} \\big[\\beta_1 + \\frac{1}{2} y^T (I - \\Phi A^{-1} \\Phi^T) y\\big]^{-\\alpha_1 - N/2} C^{D + \\alpha_2 - 1/2} \\exp(-\\beta_2 C) \\, c_0^{1/2} \\prod_{d=1}^{D} c_d,$ \n\n\u2074We choose vague priors by setting $\\alpha_1 = \\alpha_2 = \\beta_1 = \\beta_2 = 0.2$ throughout. \n\n[Figure 3 graphic omitted. Four panels titled Scaling Exponent=0, Scaling Exponent=2, Scaling Exponent=3 and Scaling Exponent=4.] \n\nFigure 3: Functions drawn at random from the Fourier model with order D = 6 (dark) \nand D = 500 (light) for four different scalings; limiting behaviour from left to right: \ndiscontinuous, Brownian, borderline smooth, smooth. \n\nwhere $A = \\Phi^T \\Phi + C \\, \\mathrm{diag}(\\tilde{c})$, and the tilde indicates duplication of all components except \nfor the first. We can optimize\u2075 the overall scale C of the weights (using e.g. Newton's \nmethod). How do we choose the relative scales, c? The answer to this question turns out \nto be intimately related to the two different views of Bayesian inference. \n\n2.2  Example \n\nTo illustrate the behaviour of this model we use data generated from a step function that \nchanges from -1 to 1, corrupted by independent additive Gaussian noise with variance \n0.25. Note that the true function cannot be implemented exactly with a model of finite \norder, as would typically be the case in realistic modelling situations (the true function is \nnot \"realizable\" or the model is said to be \"incomplete\"). The input points are arranged in \ntwo lumps of 16 and 8 points, the step occurring in the middle of the larger, see figure 2. \n\nIf we choose the scaling precisions to be independent of the frequency of the contributions, \n$c_d \\propto 1$ (while normalizing the sum of the inverse precisions), we achieve predictions as \ndepicted in figure 2. We clearly see an Occam's Razor behaviour. A model order of around \nD = 6 is preferred. One might say that the limited data does not support models more \ncomplex than this. 
One way of understanding this is to note that as the model order grows, \nthe prior parameter volume grows, but the relative posterior volume decreases, because \nparameters must be accurately specified in the complex model to ensure good agreement \nwith the data. The ratio of prior to posterior volumes is the Occam Factor, which may be \ninterpreted as a penalty to pay for fitting parameters. \n\nIn the present model, it is easy to draw functions at random from the prior by simply drawing \nvalues for the coefficients from their prior distributions. The left panel of figure 3 shows \nsamples from the prior for the previous example for D = 6 and D = 500. With increasing \norder the functions get more and more dominated by high frequency components. In most \nmodelling applications, however, we have some prior expectations about smoothness. By \nscaling the precision factors $c_d$ we can achieve that the prior over functions converges to \nfunctions with particular characteristics as D grows towards infinity. Here we will focus \non scalings of the form $c_d = d^\\gamma$ for different values of $\\gamma$, the scaling exponent. As an \nexample, if we choose the scaling $c_d = d^3$ we do not get an Occam's Razor in terms of the \norder of the model, figure 4. Note that the predictions and their errorbars become almost \nindependent of the model order as long as the order is large enough. Note also that the \nerrorbars for these large models seem more reasonable than for D = 6 in figure 2 (where a \nspurious \"dip\" between the two lumps of data is predicted with high confidence). With this \nchoice of scaling, it seems that the \"large models\" view is appropriate. \n\n\u2075Of course, we ought to integrate over C, but unfortunately that is difficult. \n\n[Figure 4 graphic omitted. Top: twelve panels titled Order 0 to Order 11. Bottom: plot against Model order, 0 to 11.] \n\nFigure 4: The same as figure 2, except that the scaling $c_d = d^3$ was used here, leading to a \nprior which converges to smooth functions as $D \\to \\infty$. There is no Occam's Razor; instead \nwe see that as long as the model is complex enough, the evidence is flat. We also notice \nthat the predictive density of the model is unchanged as long as D is sufficiently large. \n\n3  Discussion \n\nIn the previous examples we saw that, depending on the scaling properties of the prior over \nparameters, both the Occam's Razor view and the large models view can seem appropriate. \nHowever, the example was unsatisfactory because it is not obvious how to choose the scaling \nexponent $\\gamma$. We can gain more insight into the meaning of $\\gamma$ by analysing properties of \nfunctions drawn from the prior in the limit of large D. It is useful to consider the expected \nsquared difference of outputs corresponding to nearby inputs, separated by $\\Delta$: \n\n$G(\\Delta) = E[(f(x) - f(x + \\Delta))^2],$ \n\nin the limit as $\\Delta \\to 0$. In the table in figure 5 we have computed these limits for various \nvalues of $\\gamma$, together with the characteristics of these functions. For example, a property \nof smooth functions is that $G(\\Delta) \\propto \\Delta^2$. Using this kind of information may help to \nchoose good values for $\\gamma$ in practical applications. Indeed, we can attempt to infer the \n\"characteristics of the function\" $\\gamma$ from the data. 
In figure 5 we show how the evidence \ndepends on $\\gamma$ and the overall scale C for a model of large order (D = 200). It is seen \nthat the evidence has a maximum around $\\gamma = 3$. In fact we are seeing Occam's Razor \nagain! This time it is not in terms of the dimension of the model, but rather in terms of \nthe complexity of the functions under the priors implied by different values of $\\gamma$. Large \nvalues of $\\gamma$ correspond to priors with most probability mass on simple functions, whereas \nsmall values of $\\gamma$ correspond to priors that allow more complex functions. Note that the \n\"optimal\" setting $\\gamma = 3$ was exactly the model used in figure 4. \n\n[Figure 5 graphic omitted. Left panel title: log Evidence (D=200, max=-27.48).] \n\n$\\gamma$ | $\\lim_{D \\to \\infty} G(\\Delta)$ | properties \n<1 | 1 | discontinuous \n2 | $\\Delta$ | Brownian \n3 | $\\Delta^2 (1 - \\ln \\Delta)$ | borderline smooth \n>3 | $\\Delta^2$ | smooth \n\nFigure 5: Left panel: the evidence as a function of the scaling exponent, $\\gamma$, and overall scale \nC, has a maximum at $\\gamma = 3$. The table shows the characteristics of functions for different \nvalues of $\\gamma$. Examples of these functions are shown in figure 3. \n\n4  Conclusion \n\nWe have reviewed the automatic Occam's Razor for Bayesian models and seen how, while \nnot necessarily penalising the number of parameters, this process is active in terms of the \ncomplexity of functions. Although we have only presented simplistic examples, the explanations \nof the behaviours rely on very basic principles that are generally applicable. Which \nof the two differing Bayesian views is most attractive depends on the circumstances: sometimes \nthe large model limit may be computationally demanding; also, it may be difficult \nto analyse the scaling properties of priors for some models. 
On the other hand, in typical \napplications of non-parametric models, the \"large model\" view may be the most convenient \nway of expressing priors since typically, we don't seriously believe that the \"true\" generative \nprocess can be implemented exactly with a small model. Moreover, optimizing (or \nintegrating) over continuous hyperparameters may be easier than optimizing over the discrete \nspace of model sizes. In the end, whichever view we take, Occam's Razor is always \nat work discouraging overcomplex models. \n\nAcknowledgements \n\nThis work was supported by the Danish Research Councils through the Computational \nNeural Network Center (CONNECT) and the THOR Center for Neuroinformatics. Thanks \nto Geoff Hinton for asking a puzzling question which stimulated the writing of this paper. \n\nReferences \n\nJefferys, W. H. & Berger, J. O. (1992) Ockham's Razor and Bayesian Analysis. Amer. Sci., 80:64-72. \n\nMacKay, D. J. C. (1992) Bayesian Interpolation. Neural Computation, 4(3):415-447. \n\nNeal, R. M. (1996) Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, \nNew York: Springer-Verlag. \n\nRasmussen, C. E. (2000) The Infinite Gaussian Mixture Model, in S. A. Solla, T. K. Leen and \nK.-R. M\u00fcller (editors), Adv. Neur. Inf. Proc. Sys. 12, MIT Press, pp. 554-560. \n\nSmith, A. F. M. & Spiegelhalter, D. J. (1980) Bayes factors and choice criteria for linear models. \nJ. Roy. Stat. Soc., 42:213-220. \n", "award": [], "sourceid": 1925, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}