Part of Advances in Neural Information Processing Systems 5 (NIPS 1992)
David Wolpert
The Bayesian "evidence" approximation has recently been employed to determine the noise and weight-penalty terms used in back-propagation. This paper shows that for neural nets it is far easier to use the exact result than it is to use the evidence approximation. Moreover, unlike the evi(cid:173) dence approximation, the exact result neither has to be re-calculated for every new data set, nor requires the running of computer code (the exact result is closed form). In addition, it turns out that the evidence proce(cid:173) dure's MAP estimate for neural nets is, in toto, approximation error. An(cid:173) other advantage of the exact analysis is that it does not lead one to incor(cid:173) rect intuition, like the claim that using evidence one can "evaluate differ(cid:173) ent priors in light of the data". This paper also discusses sufficiency conditions for the evidence approximation to hold, why it can sometimes give "reasonable" results, etc.
1 THE EVIDENCE APPROXIMATION
It has recently become popular to consider the problem of training neural nets from a Bayesian viewpoint (Buntine and Weigend 1991, MacKay 1992). The usual way of doing this starts by assuming that there is some underlying target function $f$ from $R^n$ to $R$, parameterized by an $N$-dimensional weight vector $w$. We are provided with a training set $L$ of noise-corrupted samples of $f$. Our goal is to make a guess for $w$, basing that guess only on $L$.

Now assume we have i.i.d. additive Gaussian noise, resulting in $P(L \mid w, \beta) \propto \exp(-\beta \chi^2(w, L))$, where $\chi^2(w, L)$ is the usual sum-squared training set error and $\beta$ reflects the noise level. Assume further that $P(w \mid \alpha) \propto \exp(-\alpha W(w))$, where $W(w)$ is the sum of the squares of the weights. If the values of $\alpha$ and $\beta$ are known and fixed, to the values $\alpha_t$ and $\beta_t$ respectively, then Bayes' theorem gives $P(w \mid L, \alpha_t, \beta_t) \propto \exp\!\big(-[\beta_t \chi^2(w, L) + \alpha_t W(w)]\big)$.
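As an illustrative aside (not part of the original text), the MAP search implied by this setup is just weight-decay-regularized least squares: the negative log-posterior is, up to an additive constant, $\beta_t \chi^2(w, L) + \alpha_t W(w)$. The sketch below only restates the likelihood and prior defined above; the names `net`, `chi2`, `weight_penalty`, and `neg_log_posterior` are hypothetical, and `net` is a generic stand-in for the parameterized $f$.

```python
import numpy as np

# Minimal sketch, assuming the setup above: the likelihood is proportional to
# exp(-beta * chi2(w, L)) and the weight prior to exp(-alpha * W(w)), so the
# negative log-posterior is, up to an additive constant,
#     beta_t * chi2(w, L) + alpha_t * W(w),
# i.e. the familiar weight-decay-regularized sum-squared error.

def chi2(w, net, inputs, targets):
    """chi^2(w, L): sum-squared error on L of the net parameterized by w."""
    preds = np.array([net(w, x) for x in inputs])
    return float(np.sum((preds - targets) ** 2))

def weight_penalty(w):
    """W(w): the sum of the squares of the weights."""
    return float(np.sum(w ** 2))

def neg_log_posterior(w, net, inputs, targets, alpha_t, beta_t):
    """-log P(w | L, alpha_t, beta_t), up to an additive constant."""
    return beta_t * chi2(w, net, inputs, targets) + alpha_t * weight_penalty(w)

# Toy usage with a linear "net" f(x) = w . x and a synthetic training set L.
rng = np.random.default_rng(0)
net = lambda w, x: w @ x
inputs = rng.normal(size=(20, 3))
targets = inputs @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
w = rng.normal(size=3)
print(neg_log_posterior(w, net, inputs, targets, alpha_t=0.01, beta_t=1.0))
```

For fixed $\alpha_t$ and $\beta_t$, minimizing this quantity over $w$ is exactly the MAP estimate that back-propagation with a weight-penalty term searches for under the stated noise model and prior.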