{"title": "Bayesian Averaging is Well-Temperated", "book": "Advances in Neural Information Processing Systems", "page_first": 265, "page_last": 271, "abstract": null, "full_text": "Bayesian averaging is well-temperated \n\nLars Kai Hansen \n\nDepartment of Mathematical Modelling \nTechnical University of Denmark B321 \n\nDK-2800 Lyngby, Denmark \n\nlkhansen@imm.dtu.dk \n\nAbstract \n\nBayesian predictions are stochastic just like predictions of any other \ninference scheme that generalize from a finite sample. While a sim(cid:173)\nple variational argument shows that Bayes averaging is generaliza(cid:173)\ntion optimal given that the prior matches the teacher parameter \ndistribution the situation is less clear if the teacher distribution is \nunknown. I define a class of averaging procedures, the temperated \nlikelihoods, including both Bayes averaging with a uniform prior \nand maximum likelihood estimation as special cases. I show that \nBayes is generalization optimal in this family for any teacher dis(cid:173)\ntribution for two learning problems that are analytically tractable: \nlearning the mean of a Gaussian and asymptotics of smooth learn(cid:173)\ners. \n\n1 \n\nIntroduction \n\nLearning is the stochastic process of generalizing from a random finite sample of \ndata. Often a learning problem has natural quantitative measure of generalization. \nIf a loss function is defined the natural measure is the generalization error, i.e., the \nexpected loss on a random sample independent of the training set. Generalizability \nis a key topic of learning theory and much progress has been reported. Analytic \nresults for a broad class of machines can be found in the litterature [8, 12, 9, 10] \ndescribing the asymptotic generalization ability of supervised algorithms that are \ncontinuously parameterized. Asymptotic bounds on generalization for general ma(cid:173)\nchines have been advocated by Vapnik [11]. 
Generalization results valid for finite training sets can only be obtained for specific learning machines, see e.g. [5]. A very rich framework for the analysis of generalization for Bayesian averaging and other schemes is defined in [6]. \n\nAveraging has become popular as a tool for improving the generalizability of learning machines. In the context of (time series) forecasting, averaging has been investigated intensely for decades [3]. Neural network ensembles were shown to improve generalization by simple voting in [4], and later work has generalized these results to other types of averaging. Boosting, Bagging, Stacking, and Arcing are recent examples of averaging procedures based on data resampling that have proven useful; see [2] for a recent review with references. However, Bayesian averaging in particular is attaining a kind of cult status. Bayesian averaging is indeed provably optimal in a number of ways (admissibility, the likelihood principle, etc.) [1]. While it follows by construction that Bayes is generalization optimal if given the correct prior information, i.e., the teacher parameter distribution, the situation is less clear if the teacher distribution is unknown. Hence, the pragmatic Bayesians downplay the role of the prior. Instead the averaging aspect is emphasized and \"vague\" priors are invoked. It is important to note that whatever prior is used, Bayesian predictions are stochastic just like the predictions of any other inference scheme that generalizes from a finite sample. \n\nIn this contribution I analyse two scenarios where averaging can improve generalizability, and I show that the vague Bayes average is in fact optimal among the averaging schemes investigated. Averaging is shown to reduce variance at the cost of introducing bias, and Bayes happens to implement the optimal bias-variance trade-off. 
\n\n2 Bayes and generalization \n\nConsider a model that is smoothly parametrized and whose predictions can be described in terms of a density function 1. Predictions in the model are based on a given training set: a finite sample D = {x_α}_{α=1}^N of the stochastic vector x whose density - the teacher - is denoted p(x|θ₀). In other words, the true density is assumed to be defined by a fixed, but unknown, teacher parameter vector θ₀. The model, denoted H, involves the parameter vector θ, and the predictive density is given by \n\np(x|D,H) = ∫ p(x|θ,H) p(θ|D,H) dθ.   (1) \n\nHere p(θ|D,H) is the parameter distribution produced in the training process. In a maximum likelihood scenario this distribution is a delta function centered on the most likely parameters under the model for the given data set. In ensemble averaging approaches, like boosting, bagging or stacking, the distribution is obtained by training on resampled training sets. In a Bayesian scenario, the parameter distribution is the posterior distribution, \n\np(θ|D,H) = p(D|θ,H) p(θ|H) / ∫ p(D|θ',H) p(θ'|H) dθ',   (2) \n\nwhere p(θ|H) is the prior distribution (the probability density of the parameters if D is empty). In the sequel we consider only one model, hence we suppress the model conditioning label H. \n\nThe generalization error is the average negative log density (also known simply as the \"log loss\" - in some applied statistics works known as the \"deviance\"), \n\nΓ(D|θ₀) = ∫ -log p(x|D) p(x|θ₀) dx.   (3) \n\nThe expected value of the generalization error for training sets produced by the given teacher is given by \n\nΓ(θ₀) = ∫∫ -log p(x|D) p(x|θ₀) dx p(D|θ₀) dD.   (4) \n\n1 This does not limit us to conventional density estimation; pattern recognition and many functional approximation problems can be formulated as density estimation problems as well. 
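To make Eqs. (1)-(4) concrete, here is a minimal numerical sketch (my illustration, not part of the paper) for a hypothetical 1D Gaussian model with known variance and a flat prior, where the posterior is available in closed form; the predictive density of Eq. (1) is approximated by averaging the model density over posterior samples, and the log loss of Eq. (3) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: teacher p(x|theta0) = N(theta0, sigma^2), sigma known.
sigma, theta0, N = 1.0, 0.3, 10
D = rng.normal(theta0, sigma, size=N)            # training sample

# With a flat prior the posterior (Eq. 2) is N(mean(D), sigma^2 / N).
post_mean, post_var = D.mean(), sigma**2 / N

# Predictive density (Eq. 1): average the model density over posterior samples.
thetas = rng.normal(post_mean, np.sqrt(post_var), size=5000)

def predictive(x):
    return np.mean(np.exp(-0.5 * ((x - thetas) / sigma) ** 2)
                   / np.sqrt(2 * np.pi * sigma**2))

# Generalization error (Eq. 3): expected log loss under the teacher density,
# estimated by Monte Carlo with fresh samples from the teacher.
xs = rng.normal(theta0, sigma, size=5000)
gamma = np.mean([-np.log(predictive(x)) for x in xs])

# In this conjugate case the exact predictive is N(mean(D), sigma^2 (1 + 1/N)),
# so gamma can be checked against the closed-form log loss of that Gaussian.
s2 = sigma**2 * (1 + 1 / N)
exact = 0.5 * np.log(2 * np.pi * s2) + ((D.mean() - theta0)**2 + sigma**2) / (2 * s2)
```

The Monte Carlo estimate agrees with the closed-form log loss of the exact Gaussian predictive to within sampling noise.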
\n\nPlaying the game of \"guessing a probability distribution\" [6], we not only face a random training set, we also face a teacher drawn from the teacher distribution p(θ₀). The teacher-averaged generalization must then be defined as \n\nΓ = ∫ Γ(θ₀) p(θ₀) dθ₀.   (5) \n\nThis is the typical generalization error for a random training set from the randomly chosen teacher, produced by the model H. The generalization error is minimized by Bayes averaging if the teacher distribution is used as prior. To see this, form the Lagrangian functional \n\nL[q(x|D)] = ∫∫∫ -log q(x|D) p(x|θ₀) dx p(D|θ₀) dD p(θ₀) dθ₀ + λ ∫ q(x|D) dx,   (6) \n\ndefined on positive functions q(x|D). The second term is used to ensure that q(x|D) is a normalized density in x. Now compute the variational derivative to obtain \n\nδL/δq(x|D) = -1/q(x|D) ∫ p(x|θ₀) p(D|θ₀) p(θ₀) dθ₀ + λ.   (7) \n\nEquating this derivative to zero, we recover the predictive distribution of Bayesian averaging, \n\nq(x|D) = ∫ p(x|θ) p(D|θ) p(θ) dθ / ∫ p(D|θ') p(θ') dθ',   (8) \n\nwhere we used that λ = ∫ p(D|θ) p(θ) dθ is the appropriate normalization constant. It is easily verified that this is indeed the global minimum of the averaged generalization error. We also note that if the Bayes average is performed with another prior than the teacher distribution p(θ₀), we can expect a higher generalization error. The important question from a Bayesian point of view is then: are there cases where averaging with generic priors (e.g. vague or uniform priors) can be shown to be optimal? \n\n3 Temperated likelihoods \n\nTo come closer to a quantitative statement about when and why vague Bayes is the better procedure, we will analyse two problems for which some analytical progress is possible. 
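The optimality of the matched prior can be illustrated numerically. The sketch below (my construction, not from the paper) uses a fully conjugate Gaussian setup - teacher parameters θ₀ ~ N(0, τ²), data x|θ₀ ~ N(θ₀, σ²) - for which the teacher-averaged generalization error of Eq. (5) has a closed form for any Gaussian prior N(0, τ_p²); the matched width τ_p = τ should then yield the smallest error:

```python
import numpy as np

# Hypothetical conjugate setup: theta0 ~ N(0, tau^2), x | theta0 ~ N(theta0, sigma^2).
sigma, tau, N = 1.0, 1.0, 5

def avg_gen_error(tau_p):
    """Teacher-averaged log loss (Eq. 5) of the Bayes predictive density
    built with the (possibly wrong) prior N(0, tau_p^2)."""
    w = (N / sigma**2) / (N / sigma**2 + 1 / tau_p**2)   # posterior shrinkage factor
    post_var = 1 / (N / sigma**2 + 1 / tau_p**2)
    s2 = post_var + sigma**2                             # predictive variance
    # Expectation over theta0 and D of (posterior mean - theta0)^2:
    mse = (1 - w)**2 * tau**2 + w**2 * sigma**2 / N
    return 0.5 * np.log(2 * np.pi * s2) + (sigma**2 + mse) / (2 * s2)

# The matched prior (tau_p = tau) is not beaten by any mismatched prior width.
errs = {tp: avg_gen_error(tp) for tp in (0.2, 0.5, 1.0, 2.0, 10.0)}
```

Priors that are too narrow or too wide both inflate the teacher-averaged error, in line with the variational argument above.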
We will consider a one-parameter family of learning procedures including both a Bayes and the maximum likelihood procedure, \n\np(θ|β,D,H) = p^β(D|θ) / ∫ p^β(D|θ') dθ',   (9) \n\nwhere β is a positive parameter (playing the role of an inverse temperature). The procedures in this family are all averaging procedures, and β controls the width of the average. Vague Bayes (here used synonymously with Bayes with a uniform prior) is recovered for β = 1, while the maximum posterior procedure is obtained by cooling to zero width, β → ∞. In this context the generalization design question can be phrased as follows: is there an optimal temperature in the family of the temperated likelihoods? \n\n3.1 Example: 1D normal variates \n\nLet the teacher distribution be given by \n\np(x|θ₀) = (1/√(2πσ²)) exp(-(x - θ₀)²/(2σ²)).   (10) \n\nThe model density is of the same form, with θ unknown and σ² assumed to be known. For N examples the posterior (with a uniform prior) is \n\np(θ|D) = √(N/(2πσ²)) exp(-(N/(2σ²)) (x̄ - θ)²),   (11) \n\nwith x̄ = (1/N) Σ_α x_α. The temperated likelihood is obtained by raising to the β'th power and normalizing, \n\np(θ|D,β) = √(βN/(2πσ²)) exp(-(βN/(2σ²)) (x̄ - θ)²).   (12) \n\nThe predictive distribution is found by integrating w.r.t. θ, \n\np(x|D,β) = ∫ p(x|θ) p(θ|D,β) dθ = (1/√(2πσ_β²)) exp(-(x - x̄)²/(2σ_β²)),   (13) \n\nwith σ_β² = σ²(1 + 1/(βN)). We note that this distribution is wider for all the averaging procedures than it is for maximum likelihood (β → ∞), i.e., it is less variant. For very small β the predictive distribution is almost independent of the data set, hence highly biased. \n\nIt is straightforward to compute the generalization error of the predictive distribution for general β. First we compute the generalization error for the specific training set D: \n\nΓ(D,β,θ₀) = ∫ 
-log p(x|D,β) p(x|θ₀) dx = (1/2) log(2πσ_β²) + ((x̄ - θ₀)² + σ²)/(2σ_β²).   (14) \n\nThe average generalization error is then found by averaging w.r.t. the sampling distribution, using x̄ ~ N(θ₀, σ²/N): \n\nΓ(β) = ∫ Γ(D,β,θ₀) p(D|θ₀) dD = (1/2) log(2πσ_β²) + (σ²/(2σ_β²)) (1/N + 1).   (15) \n\nWe first note that the generalization error is independent of the teacher parameter θ₀; this happens because θ is a \"location\" parameter. The β-dependency of the averaged generalization error is depicted in Figure 1. Solving ∂Γ(β)/∂β = 0 we find that the optimal β solves \n\nσ_β² = σ²(1/(βN) + 1) = σ²(1/N + 1)  ⟹  β = 1.   (16) \n\nNote that this result holds for any N and is independent of the teacher parameter. Bayes averaging at unit temperature is optimal for any given value of θ₀, hence for any teacher distribution. We may say that the vague Bayes scheme is robust to the teacher distribution in this case. Clearly this is a much stronger optimality than the more general result proven above. \n\n3.2 Bias-variance tradeoff \n\nIt is interesting to decompose the generalization error in Eq. (15) into bias and variance components. We follow Heskes [7] and define the bias error as the generalization error of the geometric average distribution, \n\nB(β) = ∫ -log p̄(x) p(x|θ₀) dx,   (17) \n\n[Figure 1 plot: averaged generalization error, bias, and variance (0 to 0.7) versus temperature 1/β (0 to 4.5).] \n\nFigure 1: Bias-variance trade-off as a function of the width of the temperated likelihood ensemble (temperature = 1/β) for N = 1. The bias is computed as the generalization error of the predictive distribution obtained from the geometric average distribution w.r.t. training set fluctuations, as proposed by Heskes. 
The predictive distribution produced by Bayesian averaging corresponds to unit temperature (vertical line) and it achieves the minimal generalization error. Maximum-likelihood estimation, for reference, is recovered as the zero width/temperature limit. \n\nwith \n\np̄(x) = Z⁻¹ exp(∫ log[p(x|D)] p(D|θ₀) dD).   (18) \n\nInserting from Eq. (13), we find \n\np̄(x) = (1/√(2πσ_β²)) exp(-(x - θ₀)²/(2σ_β²)).   (19) \n\nIntegrating over the teacher distribution we find \n\nB(β) = (1/2) log(2πσ_β²) + σ²/(2σ_β²).   (20) \n\nThe variance error is given by V(β) = Γ(β) - B(β), \n\nV(β) = σ²/(2Nσ_β²).   (21) \n\nWe can now quantify the statements above. By averaging, a bias is introduced - the predictive distribution becomes wider - which initially decreases the variance contribution, so that the generalization error, being the sum of the two, decreases. At still higher temperatures the bias becomes too strong and the generalization error starts to increase. The Bayes average at unit temperature is the optimal trade-off within the given family of procedures. \n\n3.3 Asymptotics for smoothly parameterized models \n\nWe now go on to show that a similar result also holds for general learning problems in the limit of large data sets. We consider a system parameterized by a finite-dimensional parameter vector θ. For a given large training set and for a smooth likelihood function, the temperated likelihood is approximately Gaussian, centered at the maximum posterior parameters [13]; hence the normalized temperated posterior reads \n\np(θ|β,D,H) = |βN A(D,θ_ML)/(2π)|^{1/2} exp(-(βN/2) δθᵀ A(D,θ_ML) δθ),   (22) \n\nwhere δθ = θ - θ_ML, with θ_ML = θ_ML(D) denoting the maximum likelihood solution for the given training sample. 
The second derivative, or Hessian, matrix is given by \n\nA(D,θ) = (1/N) Σ_{α=1}^N A(x_α, θ),   (23) \n\nA(x,θ) = -∂² log p(x|θ) / ∂θ∂θᵀ.   (24) \n\nThe predictive distribution is given by \n\np(x|β,D) = ∫ p(x|θ) p(θ|β,D) dθ.   (25) \n\nWe write p(x|θ) = exp(-ℓ(x|θ)) and expand ℓ(x|θ) around θ_ML to second order, to find \n\np(x|θ) ≈ p(x|θ_ML) exp(-a(x|θ_ML)ᵀ δθ - (1/2) δθᵀ A(x|θ_ML) δθ),   (26) \n\nwhere a(x|θ) denotes the gradient of ℓ(x|θ) = -log p(x|θ). We are then in position to perform the integration over the posterior to find the normalized predictive distribution, \n\np(x|β,D) = p(x|θ_ML) (|βN A(D)| / |βN A(D) + A(x)|)^{1/2} exp((1/2) a(x|θ_ML)ᵀ (βN A(D) + A(x))⁻¹ a(x|θ_ML)).   (27) \n\nProceeding as above, we compute the generalization error \n\nΓ(β,θ₀) = ∫∫ -log p(x|β,D) p(x|θ₀) dx p(D|θ₀) dD.   (28) \n\nFor sufficiently smooth likelihoods, fluctuations in the maximum likelihood parameters will be asymptotically normal, see e.g. [8], and furthermore fluctuations in A(D) can be neglected; this means that we can approximate \n\nβN A(D) + A(x) ≈ (1 + βN) A₀,   A₀ = ∫ A(x|θ₀) p(x|θ₀) dx,   (29) \n\nwhere A₀ is the averaged Fisher information matrix. With these approximations (valid as N → ∞) the generalization error can be found: \n\nΓ(β,θ₀) ≈ Γ(∞) + (d/2) log(1 + 1/(βN)) - (d/2) (1 + 1/N)/(1 + βN),   (30) \n\nwith d = dim(θ) denoting the dimension of the parameter vector and Γ(∞) the maximum likelihood limit. As in the 1D example (Eq. (15)), we find that the generalization error is asymptotically independent of the teacher parameters. It is minimized for β = 1 and we conclude that Bayes is well-temperated in the asymptotic limit, and that this holds for any teacher distribution. In the Bayes literature this is referred to as the prior being overwhelmed by data [1]. Decomposing the error into bias and variance contributions, we find similar results as in the 1D example: Bayes introduces the optimal bias by averaging at unit temperature. 
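Both optimality results - the exact 1D expressions of Eqs. (15), (20) and (21), and the asymptotic excess error of Eq. (30) - are easy to check numerically. The sketch below is my illustration rather than part of the paper; the dimension d = 10 and the sample sizes are arbitrary choices:

```python
import numpy as np

sigma, N = 1.0, 1                        # N = 1, as in Figure 1
betas = np.linspace(0.05, 10.0, 4000)    # grid over inverse temperatures

# 1D example (Secs. 3.1-3.2): sigma_beta^2 from Eq. (13), then Eqs. (15), (20), (21).
s2 = sigma**2 * (1 + 1 / (betas * N))
gamma = 0.5 * np.log(2 * np.pi * s2) + sigma**2 * (1 / N + 1) / (2 * s2)   # Eq. (15)
bias = 0.5 * np.log(2 * np.pi * s2) + sigma**2 / (2 * s2)                  # Eq. (20)
var = sigma**2 / (2 * N * s2)                                              # Eq. (21)

assert np.allclose(gamma, bias + var)    # Gamma(beta) = B(beta) + V(beta)
beta_opt_1d = betas[np.argmin(gamma)]    # should sit at beta = 1 (Eq. 16)

# Asymptotic result (Eq. 30): excess error over the maximum likelihood limit,
# here for a hypothetical model with d = 10 parameters and N = 100 examples.
d, N_big = 10, 100
excess = (0.5 * d * np.log(1 + 1 / (betas * N_big))
          - 0.5 * d * (1 + 1 / N_big) / (1 + betas * N_big))
beta_opt_asym = betas[np.argmin(excess)]
```

In both cases the grid minimum falls at β = 1, and the asymptotic excess at the minimum is negative, i.e., Bayes averaging improves on maximum likelihood.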
\n\n4 Discussion \n\nWe have seen two examples of Bayes averaging being optimal, in particular improving on maximum likelihood estimation. We found that averaging introduces a bias and reduces variance, so that the generalization error (being the sum of bias and variance) initially decreases. Bayesian averaging at unit temperature provides the optimal width of the averaging distribution. For larger temperatures (widths) the bias is too strong and the generalization error increases. Both examples were special in the sense that they lead to generalization errors that are independent of the random teacher parameter. This is not generic, of course; rather, the generic case is that a mis-specified prior can lead to arbitrarily large learning catastrophes. \n\nAcknowledgments \n\nI thank the organizers of the 1999 Max Planck Institute Workshop on Statistical Physics of Neural Networks, Michael Biehl, Wolfgang Kinzel and Ido Kanter, where this work was initiated. I thank Carl Edward Rasmussen, Jan Larsen, and Manfred Opper for stimulating discussions on Bayesian averaging. This work was funded by the Danish Research Councils through the Computational Neural Network Center CONNECT and the THOR Center for Neuroinformatics. \n\nReferences \n\n[1] C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer Verlag, New York (1994). A. O'Hagan: Bayesian Inference. Kendall's Advanced Theory of Statistics, Vol. 2B. The University Press, Cambridge (1994). \n\n[2] L. Breiman: Using adaptive bagging to debias regressions. Technical Report 547, Statistics Dept., U.C. Berkeley (1999). \n\n[3] R.T. Clemen: Combining forecasts: A review and annotated bibliography. Journal of Forecasting 5, 559 (1989). \n\n[4] L.K. Hansen and P. Salamon: Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993-1001 (1990). 
\n\n[5] L.K. Hansen: Stochastic Linear Learning: Exact Test and Training Error Averages. Neural Networks 6, 393-396 (1993). \n\n[6] D. Haussler and M. Opper: Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk. Annals of Statistics 25, 2451-2492 (1997). \n\n[7] T. Heskes: Bias/Variance Decomposition for Likelihood-Based Estimators. Neural Computation 10, 1425-1433 (1998). \n\n[8] L. Ljung: System Identification: Theory for the User. Englewood Cliffs, New Jersey: Prentice-Hall (1987). \n\n[9] J. Moody: \"Note on Generalization, Regularization, and Architecture Selection in Nonlinear Learning Systems,\" in B.H. Juang, S.Y. Kung & C.A. Kamm (eds.), Proceedings of the First IEEE Workshop on Neural Networks for Signal Processing, Piscataway, New Jersey: IEEE, 1-10 (1991). \n\n[10] N. Murata, S. Yoshizawa & S. Amari: Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on Neural Networks 5, 865-872 (1994). \n\n[11] V. Vapnik: Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York (1982). \n\n[12] H. White: \"Consequences and Detection of Misspecified Nonlinear Regression Models,\" Journal of the American Statistical Association 76(374), 419-433 (1981). \n\n[13] D.J.C. MacKay: Bayesian Interpolation. Neural Computation 4, 415-447 (1992). \n\n", "award": [], "sourceid": 1709, "authors": [{"given_name": "Lars", "family_name": "Hansen", "institution": null}]}