{"title": "Bayesian Query Construction for Neural Network Models", "book": "Advances in Neural Information Processing Systems", "page_first": 443, "page_last": 450, "abstract": null, "full_text": "Bayesian Query Construction for Neural \n\nNetwork Models \n\nGerhard Paass \n\nJorg Kindermann \n\nGerman National Research Center for Computer Science (GMD) \n\nD-53757 Sankt Augustin, Germany \n\npaass@gmd.de \n\nkindermann@gmd.de \n\nAbstract \n\nIf data collection is costly, there is much to be gained by actively se(cid:173)\nlecting particularly informative data points in a sequential way. In \na Bayesian decision-theoretic framework we develop a query selec(cid:173)\ntion criterion which explicitly takes into account the intended use \nof the model predictions. By Markov Chain Monte Carlo methods \nthe necessary quantities can be approximated to a desired preci(cid:173)\nsion. As the number of data points grows, the model complexity \nis modified by a Bayesian model selection strategy. The proper(cid:173)\nties of two versions of the criterion ate demonstrated in numerical \nexperiments. \n\n1 \n\nINTRODUCTION \n\nIn this paper we consider the situation where data collection is costly, as when \nfor example, real measurements or technical experiments have to be performed. In \nthis situation the approach of query learning ('active data selection', 'sequential \nexperimental design', etc.) has a potential benefit. Depending on the previously \nseen examples, a new input value ('query') is selected in a systematic way and \nthe corresponding output is obtained. The motivation for query learning is that \nrandom examples often contain redundant information, and the concentration on \nnon-redundant examples must necessarily improve generalization performance. \n\nWe use a Bayesian decision-theoretic framework to derive a criterion for query con(cid:173)\nstruction. The criterion reflects the intended use of the predictions by an appropriate \n\n\f444 \n\nGerhard Paass. 
loss function. We limit our analysis to the selection of the next data point, given a set of data already sampled. The proposed procedure derives the expected loss for candidate inputs and selects a query with minimal expected loss. \n\nThere are several published surveys of query construction methods [Ford et al. 89, Plutowski White 93, Sollich 94]. Most current approaches, e.g. [Cohn 94], rely on the information matrix of the parameters. Then, however, all parameters receive equal attention regardless of their influence on the intended use of the model [Pronzato Walter 92]. In addition, the estimates are valid only asymptotically. Bayesian approaches have been advocated by [Berger 80] and applied to neural networks [MacKay 92]. In [Sollich Saad 95] their relation to maximum information gain is discussed. In this paper we show that by using Markov Chain Monte Carlo methods it is possible to determine all quantities necessary for the selection of a query. This approach is valid in small sample situations, and the procedure's precision can be increased with additional computational effort. With the square loss function, the criterion reduces to a variant of the familiar integrated mean square error [Plutowski White 93]. \n\nIn the next section we develop the query selection criterion from a decision-theoretic point of view. In the third section we show how the criterion can be calculated using Markov Chain Monte Carlo methods, and we discuss a strategy for model selection. In the last section, the results of two experiments with MLPs are described. \n\n2 A DECISION-THEORETIC FRAMEWORK \n\nAssume we have an input vector x and a scalar output y distributed as y ∼ p(y | x, w), where w is a vector of parameters. The conditional expected value is a deterministic function f(x, w) := E(y | x, w), so that y = f(x, w) + ε, where ε is a zero-mean error term. 
\nSuppose we have iteratively collected observations D(n) := ((x_1, y_1), ..., (x_n, y_n)). We get the Bayesian posterior p(w | D(n)) = p(D(n) | w) p(w) / ∫ p(D(n) | w) p(w) dw and the predictive distribution p(y | x, D(n)) = ∫ p(y | x, w) p(w | D(n)) dw, where p(w) is the prior distribution. \n\nWe consider the situation where, based on some data x, we have to perform an action a whose result depends on the unknown output y. Some decisions may have more severe effects than others. The loss function L(y, a) ∈ [0, ∞) measures the loss if y is the true value and we have taken the action a ∈ A. In this paper we consider real-valued actions, e.g. setting the temperature a in a chemical process. We have to select an a ∈ A knowing only the input x. According to the Bayes Principle [Berger 80, p.14] we should follow a decision rule d : x → a such that the average risk ∫ R(w, d) p(w | D(n)) dw is minimal, where the risk is defined as R(w, d) := ∫ L(y, d(x)) p(y | x, w) p(x) dy dx. Here p(x) is the distribution of future inputs, which is assumed to be known. \n\nFor the square loss function L(y, a) = (y − a)^2, the conditional expectation d(x) := E(y | x, D(n)) is the optimal decision rule. In a control problem the loss may be larger at specific critical points. This can be addressed with a weighted square loss function L(y, a) := h(y)(y − a)^2, where h(y) ≥ 0 [Berger 80]. The expected loss for an action is ∫ (y − a)^2 h(y) p(y | x, D(n)) dy. Replacing the predictive density p(y | x, D(n)) with the weighted predictive density p̃(y | x, D(n)) := h(y) p(y | x, D(n)) / G(x), where G(x) := ∫ h(y) p(y | x, D(n)) dy, we get the optimal decision rule d(x) := ∫ y p̃(y | x, D(n)) dy and the average loss G(x) ∫ (y − d(x))^2 p̃(y | x, D(n)) dy for a given input x. 
With these modifications, all later derivations for the square loss function may be applied to the weighted square loss. \n\nThe aim of query sampling is the selection of a new observation x̄ in such a way that the average risk will be maximally reduced. Together with its still unknown y-value, x̄ defines a new observation (x̄, ȳ) and new data D(n) ∪ (x̄, ȳ). To determine this risk we have to perform the following conceptual steps for a candidate query x̄: \n\n1. Future data: Construct the possible sets of 'future' observations D(n) ∪ (x̄, ȳ), where ȳ ∼ p(y | x̄, D(n)). \n\n2. Future posterior: Determine a 'future' posterior distribution of parameters p(w | D(n) ∪ (x̄, ȳ)) that depends on ȳ in the same way as though it had actually been observed. \n\n3. Future loss: Assuming d*_{ȳ,x̄}(x) is the optimal decision rule for given values of x̄, ȳ, and x, compute the resulting loss as \n\nr̄_{ȳ,x̄}(x) := ∫ L(y, d*_{ȳ,x̄}(x)) p(y | x, w) p(w | D(n) ∪ (x̄, ȳ)) dy dw    (1) \n\n4. Averaging: Integrate this quantity over the future trial inputs x distributed as p(x) and the different possible future outputs ȳ, yielding \n\nr̄_{x̄} := ∫ r̄_{ȳ,x̄}(x) p(x) p(ȳ | x̄, D(n)) dx dȳ. \n\nThis procedure is repeated until an x̄ with minimal average risk is found. Since local optima are typical, a global optimization method is required. Subsequently we try to determine whether the current model is still adequate or whether we have to increase its complexity (e.g. by adding more hidden units). \n\n3 COMPUTATIONAL PROCEDURE \n\nLet us assume that the real data D(n) were generated according to a regression model y = f(x, w) + ε with i.i.d. Gaussian noise ε ∼ N(0, σ^2(w)). For example, f(x, w) may be a multilayer perceptron or a radial basis function network. Since the error terms are independent, the posterior density is p(w | D(n)) ∝ p(w) ∏_{i=1}^n p(y_i | x_i, w), even in the case of query sampling [Ford et al. 89]. 
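The factorized posterior p(w | D(n)) ∝ p(w) ∏ p(y_i | x_i, w) above can be sampled, for instance, by a random-walk Metropolis chain. The following Python sketch is illustrative only and not from the paper: it substitutes a single tanh unit for the MLP, and it assumes a fixed noise level σ = 0.05, a broad Gaussian prior, and arbitrary chain settings. It shows how a sample W(B) of the kind used in the next section may be generated.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Toy stand-in for the network output f(x, w): a single tanh unit.
    return w[0] * np.tanh(w[1] * x)

def log_posterior(w, X, Y, sigma=0.05, prior_scale=10.0):
    # log p(w | D(n)) up to an additive constant: Gaussian likelihood
    # p(y_i | x_i, w) times a broad Gaussian prior p(w).
    resid = Y - f(X, w)
    return (-0.5 * np.sum(resid ** 2) / sigma ** 2
            - 0.5 * np.sum(w ** 2) / prior_scale ** 2)

def metropolis(X, Y, n_steps=20000, step=0.05, dim=2):
    # Random-walk Metropolis chain targeting p(w | D(n)); the second
    # half of the chain is thinned to give the sample W(B).
    w = np.zeros(dim)
    lp = log_posterior(w, X, Y)
    sample = []
    for t in range(n_steps):
        w_prop = w + step * rng.standard_normal(dim)
        lp_prop = log_posterior(w_prop, X, Y)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
            w, lp = w_prop, lp_prop
        if t >= n_steps // 2 and t % 200 == 0:     # discard burn-in, thin
            sample.append(w.copy())
    return np.array(sample)

# Tiny synthetic data set D(n) from a known target with noise N(0, 0.05^2)
X = np.linspace(-5, 5, 8)
Y = 1.5 * np.tanh(0.8 * X) + 0.05 * rng.standard_normal(8)
W_B = metropolis(X, Y)
print(W_B.shape)  # (50, 2)
```

In practice the paper uses far longer runs (hundreds of thousands of Metropolis steps, see Section 4); the thinning schedule here is only a placeholder.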
\n\nAs the analytic derivation of the posterior is infeasible except in trivial cases, we have to use approximations. One approach is to employ a normal approximation [MacKay 92], but this is unreliable if the number of observations is small compared to the number of parameters. We use Markov Chain Monte Carlo procedures [Paaß 91, Neal 93] to generate a sample W(B) := {w_1, ..., w_B} of parameters distributed according to p(w | D(n)). If the number of sampling steps approaches infinity, the distribution of the simulated w_b approximates the posterior arbitrarily well. \n\nTo take into account the range of future y-values, we create a set of them by simulation. For each w_b ∈ W(B) a number of ȳ ∼ p(y | x̄, w_b) is generated. Let Y(x̄,R) := {ȳ_1, ..., ȳ_R} be the resulting set. Instead of performing a new Markov Chain Monte Carlo run to generate a new sample according to p(w | D(n) ∪ (x̄, ȳ)), we use the old set W(B) of parameters and reweight them (importance sampling). In this way we may approximate integrals of some function g(w) with respect to p(w | D(n) ∪ (x̄, ȳ)) [Kalos Whitlock 86, p.92]: \n\n∫ g(w) p(w | D(n) ∪ (x̄, ȳ)) dw ≈ [∑_{b=1}^B g(w_b) p(ȳ | x̄, w_b)] / [∑_{b=1}^B p(ȳ | x̄, w_b)]    (2) \n\nThe approximation error approaches zero as the size of W(B) increases. \n\n3.1 APPROXIMATION OF FUTURE LOSS \n\nConsider the future loss r̄_{ȳ,x̄}(x_t) given the new observation (x̄, ȳ) and a trial input x_t. In the case of the square loss function, (1) can be transformed to \n\nr̄_{ȳ,x̄}(x_t) = ∫ [f(x_t, w) − E(y | x_t, D(n) ∪ (x̄, ȳ))]^2 p(w | D(n) ∪ (x̄, ȳ)) dw + ∫ σ^2(w) p(w | D(n) ∪ (x̄, ȳ)) dw    (3) \n\nwhere σ^2(w) := Var(y | x, w) is independent of x. Assume a set X_T = {x_1, ..., x_T} is given which is representative of trial inputs for the distribution p(x). Define S(x̄, ȳ) := ∑_{b=1}^B p(ȳ | x̄, w_b) for ȳ ∈ Y(x̄,R). 
Then from equations (2) and (3) we get E(y | x_t, D(n) ∪ (x̄, ȳ)) ≈ 1/S(x̄, ȳ) ∑_{b=1}^B f(x_t, w_b) p(ȳ | x̄, w_b) and \n\nr̄_{ȳ,x̄}(x_t) ≈ 1/S(x̄, ȳ) ∑_{b=1}^B σ^2(w_b) p(ȳ | x̄, w_b) + 1/S(x̄, ȳ) ∑_{b=1}^B [f(x_t, w_b) − E(y | x_t, D(n) ∪ (x̄, ȳ))]^2 p(ȳ | x̄, w_b)    (4) \n\nThe final value of r̄_{x̄} is obtained by averaging over the different ȳ ∈ Y(x̄,R) and the different trial inputs x_t ∈ X_T. To reduce the variance, the trial inputs x_t should be selected by importance sampling (2) to concentrate them on regions with high current loss (see (5) below). To facilitate the search for an x̄ with minimal r̄_{x̄} we reduce the extent of random fluctuations of the ȳ values. Let (v_1, ..., v_R) be a vector of random numbers v_r ∼ N(0, 1), and let j_r be randomly selected from {1, ..., B}. Then for each x̄ the possible observations ȳ_r ∈ Y(x̄,R) are defined as ȳ_r := f(x̄, w_{j_r}) + v_r σ(w_{j_r}). In this way the difference between neighboring inputs is not affected by noise, and search procedures can exploit gradients. \n\n3.2 CURRENT LOSS \n\nAs a proxy for the future loss, we may use the current loss at x̄, \n\nr_curr(x̄) = p(x̄) ∫ L(y, d*(x̄)) p(y | x̄, D(n)) dy    (5) \n\nwhere p(x̄) weights the inputs according to their relevance. For the square loss function the average loss at x̄ is the conditional variance Var(y | x̄, D(n)). We get \n\nr_curr(x̄) = p(x̄) ∫ (f(x̄, w) − E(y | x̄, D(n)))^2 p(w | D(n)) dw + p(x̄) ∫ σ^2(w) p(w | D(n)) dw    (6) \n\nIf E(y | x̄, D(n)) ≈ 1/B ∑_{b=1}^B f(x̄, w_b) and the sample W(B) := {w_1, ..., w_B} is representative of p(w | D(n)), we can approximate the current loss by \n\nr̂_curr(x̄) ≈ p(x̄)/B ∑_{b=1}^B (f(x̄, w_b) − E(y | x̄, D(n)))^2 + p(x̄)/B ∑_{b=1}^B σ^2(w_b)    (7) \n\nIf the input distribution p(x̄) is uniform, the second term is independent of x̄. 
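As a minimal sketch, assuming posterior samples w_b and noise variances σ^2(w_b) are already available (e.g. from a Metropolis run), the current-loss criterion (7) can be evaluated on a grid of candidate queries as follows; the function and variable names are illustrative and not from the paper:

```python
import numpy as np

def current_loss(x_grid, W_B, sigma2_B, f, p_x=None):
    """Approximate r_curr(x) of equation (7) on a grid of candidate queries.

    W_B      : array (B, dim) of posterior parameter samples w_b
    sigma2_B : array (B,) of noise variances sigma^2(w_b)
    f        : network function f(x, w), vectorized over x
    p_x      : optional input-density weights on x_grid (uniform if None)
    """
    # Predictions f(x, w_b) for every posterior sample: shape (B, len(x_grid)).
    F = np.stack([f(x_grid, w) for w in W_B])
    post_mean = F.mean(axis=0)                       # E(y | x, D(n))
    var_term = ((F - post_mean) ** 2).mean(axis=0)   # first sum in (7)
    noise_term = np.mean(sigma2_B)                   # second sum, x-independent
    weights = np.ones_like(x_grid) if p_x is None else p_x
    return weights * (var_term + noise_term)

def select_query(x_grid, W_B, sigma2_B, f):
    # Query construction with the current-loss proxy: take the candidate
    # input whose approximate current loss (7) is largest.
    return x_grid[np.argmax(current_loss(x_grid, W_B, sigma2_B, f))]
```

Note that with a uniform p(x) the noise term shifts all candidates equally, so the selected query depends only on the spread of f(x, w_b) across the posterior sample, in line with the remark above.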
\n\n3.3 COMPLEXITY REGULARIZATION \n\nNeural network models can represent arbitrary mappings between finite-dimensional spaces if the number of hidden units is sufficiently large [Hornik Stinchcombe 89]. As the number of observations grows, more and more hidden units are necessary to capture the details of the mapping. Therefore we use a sequential procedure to increase the capacity of our networks during query learning. White and Wooldridge call this approach the \"method of sieves\" and provide some asymptotic results on its consistency [White Wooldridge 91]. Gelfand and Dey compare Bayesian approaches for model selection and prove that, in the case of nested models M1 and M2, model choice by the ratio of the marginal likelihoods p(D(n) | M_i) := ∫ p(D(n) | w, M_i) p(w | M_i) dw will always choose the full model regardless of the data as n → ∞ [Gelfand Dey 94]. They show that the pseudo-Bayes factor, a Bayesian variant of crossvalidation, is not affected by this paradox: \n\nA(M1, M2) := ∏_{j=1}^n p(y_j | x_j, D(n,j), M1) / ∏_{j=1}^n p(y_j | x_j, D(n,j), M2)    (8) \n\nHere D(n,j) := D(n) \ (x_j, y_j). As the difference between p(w | D(n)) and p(w | D(n,j)) is usually small, we use the full posterior as the importance function (2) and get \n\np(y_j | x_j, D(n,j), M_i) = ∫ p(y_j | x_j, w, M_i) p(w | D(n,j), M_i) dw ≈ B / (∑_{b=1}^B 1 / p(y_j | x_j, w_b, M_i))    (9) \n\n4 NUMERICAL DEMONSTRATION \n\nIn a first experiment we tested the approach for a small 1-2-1 MLP target function with Gaussian noise N(0, 0.05^2). We assumed the square loss function and a uniform input distribution p(x) over [−5, 5]. Using the \"true\" architecture for the approximating model, we started with a single randomly generated observation. 
\n\nFigure 1: Future loss exploration: predicted posterior mean, future loss, and current loss for 12 observations (left), and root mean square error of prediction (right). \n\nWe estimated the future loss by (4) for 100 different inputs and selected the input with the smallest future loss as the next query. B = 50 parameter vectors were generated, requiring 200,000 Metropolis steps. Simultaneously we approximated the current loss criterion by (7). The left side of figure 1 shows the typical relation of both measures. In most situations the future loss is low in the same regions where the current loss (posterior standard deviation of the mean prediction) is high. The queries are concentrated in areas of high variation, and the estimated posterior mean approximates the target function quite well. \n\nIn the right part of figure 1 the RMSE of prediction averaged over 12 independent experiments is shown. After a few observations the RMSE drops sharply. In our example there is no marked difference between the prediction errors resulting from the future loss and the current loss criterion (also averaged over 12 experiments). Considering the substantial computing effort, this favors the current loss criterion. The dots indicate the RMSE for randomly generated data (averaged over 8 experiments) using the same Bayesian prediction procedure. Because only few data points were located in the critical region of high variation, the RMSE is much larger. \n\nIn the second experiment, a 2-3-1 MLP defined the target function f(x, w_0), to which Gaussian noise of standard deviation 0.05 was added. 
f(x, w_0) is shown in the left part of figure 2. We used five MLPs with 2-6 hidden units as candidate models M1, ..., M5 and generated B = 45 samples W(B) of the posterior p(w | D(n), M_i), where D(n) is the current data. We started with 30,000 Metropolis steps for small values of n and increased this to 90,000 Metropolis steps for larger values of n. For a network with 6 hidden units and n = 50 observations, 10,000 Metropolis steps took about 30 seconds on a Sparc10 workstation. Next, we used equation (9) to compare the different models, and then used the optimal model to calculate the current loss (7) on a regular grid of 41 × 41 = 1681 query points x̄. Here we assumed the square loss function and a uniform input distribution p(x) over [−5, 5] × [−5, 5]. We selected the query point with maximal current loss and determined the final query point with a hillclimbing algorithm. In this way we were rather sure to get close to the true global optimum. \n\nFigure 2: Current loss exploration: MLP target function and root mean square error. \n\nThe main result of the experiment is summarized in the right part of figure 2. It shows, averaged over 3 experiments, the root mean square error between the true mean value and the posterior mean E(y | x) on the grid of 1681 inputs in relation to the sample size. Three phases of the exploration can be distinguished (see figure 3). In the beginning a search is performed with many queries on the border of the input area. 
After about 20 observations the algorithm knows enough detail about the true function to concentrate on the relevant parts of the input space. This leads to a marked reduction of the mean square error. After 40 observations the systematic part of the true function has been captured nearly perfectly. In the last phase of the experiment the algorithm merely reduces the uncertainty caused by the random noise. In contrast, the randomly generated data do not have sufficient information on the details of f(x, w), and therefore the error decreases only gradually. Because of space constraints we cannot report experiments with radial basis functions, which led to similar results. \n\nFigure 3: Square root of current loss (upper row) and absolute deviation from the true function (lower row) for 10, 25, and 40 observations (which are indicated by dots). \n\nAcknowledgements \n\nThis work is part of the joint project 'REFLEX' of the German Fed. Department of Science and Technology (BMFT), grant number 01 IN 111A/4. We would like to thank Alexander Linden, Mark Ring, and Frank Weber for many fruitful discussions. \n\nReferences \n\n[Berger 80] Berger, J. (1980): Statistical Decision Theory, Foundations, Concepts, and Methods. Springer Verlag, New York. \n\n[Cohn 94] Cohn, D. (1994): Neural Network Exploration Using Optimal Experimental Design. In J. Cowan et al. (eds.): NIPS 5. Morgan Kaufmann, San Mateo. \n\n[Ford et al. 89] Ford, I., Titterington, D.M., Kitsos, C.P. (1989): Recent Advances in Nonlinear Design. Technometrics, 31, p. 49-60. \n\n[Gelfand Dey 94] Gelfand, A.E., Dey, D.K. (1994): Bayesian Model Choice: Asymptotics and Exact Calculations. J. Royal Statistical Society B, 56, p. 501-514. \n\n[Hornik Stinchcombe 89] Hornik, K., Stinchcombe, M. (1989): Multilayer Feedforward Networks are Universal Approximators. Neural Networks 2, p. 359-366. 
\n\n[Kalos Whitlock 86] Kalos, M.H., Whitlock, P.A. (1986): Monte Carlo Methods. Wiley, New York. \n\n[MacKay 92] MacKay, D. (1992): Information-Based Objective Functions for Active Data Selection. Neural Computation 4, p. 590-604. \n\n[Neal 93] Neal, R.M. (1993): Probabilistic Inference using Markov Chain Monte Carlo Methods. Tech. Report CRG-TR-93-1, Dep. of Computer Science, Univ. of Toronto. \n\n[Paaß 91] Paaß, G. (1991): Second Order Probabilities for Uncertain and Conflicting Evidence. In: P.P. Bonissone et al. (eds.): Uncertainty in Artificial Intelligence 6. Elsevier, Amsterdam, p. 447-456. \n\n[Plutowski White 93] Plutowski, M., White, H. (1993): Selecting Concise Training Sets from Clean Data. IEEE Tr. on Neural Networks, 4, p. 305-318. \n\n[Pronzato Walter 92] Pronzato, L., Walter, E. (1992): Nonsequential Bayesian Experimental Design for Response Optimization. In V. Fedorov, W.G. Müller, I.N. Vuchkov (eds.): Model Oriented Data-Analysis. Physica Verlag, Heidelberg, p. 89-102. \n\n[Sollich 94] Sollich, P. (1994): Query Construction, Entropy and Generalization in Neural Network Models. To appear in Physical Review E. \n\n[Sollich Saad 95] Sollich, P., Saad, D. (1995): Learning from Queries for Maximum Information Gain in Unlearnable Problems. This volume. \n\n[White Wooldridge 91] White, H., Wooldridge, J. (1991): Some Results for Sieve Estimation with Dependent Observations. In W. Barnett et al. (eds.): Nonparametric and Semiparametric Methods in Econometrics and Statistics, New York, Cambridge Univ. Press. \n", "award": [], "sourceid": 1000, "authors": [{"given_name": "Gerhard", "family_name": "Paass", "institution": null}, {"given_name": "J\u00f6rg", "family_name": "Kindermann", "institution": null}]}