{"title": "Bayesian Query Construction for Neural Network Models", "book": "Advances in Neural Information Processing Systems", "page_first": 443, "page_last": 450, "abstract": null, "full_text": "Bayesian Query Construction for Neural Network Models\n\nGerhard Paass\n\nJ\u00f6rg Kindermann\n\nGerman National Research Center for Computer Science (GMD)\n\nD-53757 Sankt Augustin, Germany\n\npaass@gmd.de\n\nkindermann@gmd.de\n\nAbstract\n\nIf data collection is costly, there is much to be gained by actively selecting particularly informative data points in a sequential way. In a Bayesian decision-theoretic framework we develop a query selection criterion which explicitly takes into account the intended use of the model predictions. By Markov Chain Monte Carlo methods the necessary quantities can be approximated to a desired precision. As the number of data points grows, the model complexity is modified by a Bayesian model selection strategy. The properties of two versions of the criterion are demonstrated in numerical experiments.\n\n1 INTRODUCTION\n\nIn this paper we consider the situation where data collection is costly, as when, for example, real measurements or technical experiments have to be performed. In this situation the approach of query learning ('active data selection', 'sequential experimental design', etc.) has a potential benefit. Depending on the previously seen examples, a new input value ('query') is selected in a systematic way and the corresponding output is obtained. The motivation for query learning is that random examples often contain redundant information, and the concentration on non-redundant examples must necessarily improve generalization performance. 
\n\nWe use a Bayesian decision-theoretic framework to derive a criterion for query construction. The criterion reflects the intended use of the predictions by an appropriate loss function. We limit our analysis to the selection of the next data point, given a set of data already sampled. The proposed procedure derives the expected loss for candidate inputs and selects a query with minimal expected loss.\n\nThere are several published surveys of query construction methods [Ford et al. 89, Plutowski White 93, Sollich 94]. Most current approaches, e.g. [Cohn 94], rely on the information matrix of parameters. Then, however, all parameters receive equal attention regardless of their influence on the intended use of the model [Pronzato Walter 92]. In addition, the estimates are valid only asymptotically. Bayesian approaches have been advocated by [Berger 80], and applied to neural networks [MacKay 92]. In [Sollich Saad 95] their relation to maximum information gain is discussed. In this paper we show that by using Markov Chain Monte Carlo methods it is possible to determine all quantities necessary for the selection of a query. This approach is valid in small sample situations, and the procedure's precision can be increased with additional computational effort. With the square loss function, the criterion is reduced to a variant of the familiar integrated mean square error [Plutowski White 93].\n\nIn the next section we develop the query selection criterion from a decision-theoretic point of view. In the third section we show how the criterion can be calculated using Markov Chain Monte Carlo methods and we discuss a strategy for model selection. In the last section, the results of two experiments with MLPs are described. 
\n\n2 A DECISION-THEORETIC FRAMEWORK\n\nAssume we have an input vector $x$ and a scalar output $y$ distributed as $y \\sim p(y \\mid x, w)$, where $w$ is a vector of parameters. The conditional expected value is a deterministic function $f(x, w) := E(y \\mid x, w)$, where $y = f(x, w) + \\epsilon$ and $\\epsilon$ is a zero-mean error term. Suppose we have iteratively collected observations $D_{(n)} := ((x_1, y_1), \\ldots, (x_n, y_n))$. If $p(w)$ is the prior distribution, we get the Bayesian posterior $p(w \\mid D_{(n)}) = p(D_{(n)} \\mid w) p(w) / \\int p(D_{(n)} \\mid w) p(w) dw$ and the predictive distribution $p(y \\mid x, D_{(n)}) = \\int p(y \\mid x, w) p(w \\mid D_{(n)}) dw$.\n\nWe consider the situation where, based on some data $x$, we have to perform an action $a$ whose result depends on the unknown output $y$. Some decisions may have more severe effects than others. The loss function $L(y, a) \\in [0, \\infty)$ measures the loss if $y$ is the true value and we have taken the action $a \\in A$. In this paper we consider real-valued actions, e.g. setting the temperature $a$ in a chemical process. We have to select an $a \\in A$ knowing only the input $x$. According to the Bayes Principle [Berger 80, p.14] we should follow a decision rule $d : x \\to a$ such that the average risk $\\int R(w, d) p(w \\mid D_{(n)}) dw$ is minimal, where the risk is defined as $R(w, d) := \\int L(y, d(x)) p(y \\mid x, w) p(x) dy dx$. Here $p(x)$ is the distribution of future inputs, which is assumed to be known.\n\nFor the square loss function $L(y, a) = (y - a)^2$, the conditional expectation $d(x) := E(y \\mid x, D_{(n)})$ is the optimal decision rule. In a control problem the loss may be larger at specific critical points. This can be addressed with a weighted square loss function $L(y, a) := h(y)(y - a)^2$, where $h(y) \\ge 0$ [Berger 80, p.1U]. The expected loss for an action is $\\int (y - a)^2 h(y) p(y \\mid x, D_{(n)}) dy$. 
Replacing the predictive density $p(y \\mid x, D_{(n)})$ with the weighted predictive density $\\bar{p}(y \\mid x, D_{(n)}) := h(y) p(y \\mid x, D_{(n)}) / G(x)$, where $G(x) := \\int h(y) p(y \\mid x, D_{(n)}) dy$, we get the optimal decision rule $d(x) := \\int y \\bar{p}(y \\mid x, D_{(n)}) dy$ and the average loss $G(x) \\int (y - \\bar{E}(y \\mid x, D_{(n)}))^2 \\bar{p}(y \\mid x, D_{(n)}) dy$ for a given input $x$, where $\\bar{E}$ denotes the expectation under $\\bar{p}$. With these modifications, all later derivations for the square loss function may be applied to the weighted square loss.\n\nThe aim of query sampling is the selection of a new observation $\\tilde{x}$ in such a way that the average risk will be maximally reduced. Together with its still unknown $y$-value $\\tilde{y}$, $\\tilde{x}$ defines a new observation $(\\tilde{x}, \\tilde{y})$ and new data $D_{(n)} \\cup (\\tilde{x}, \\tilde{y})$. To determine this risk for a candidate query $\\tilde{x}$ we have to perform the following conceptual steps:\n\n1. Future data: Construct the possible sets of 'future' observations $D_{(n)} \\cup (\\tilde{x}, \\tilde{y})$, where $\\tilde{y} \\sim p(\\tilde{y} \\mid \\tilde{x}, D_{(n)})$.\n\n2. Future posterior: Determine a 'future' posterior distribution of parameters $p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y}))$ that depends on $\\tilde{y}$ in the same way as though it had actually been observed.\n\n3. Future loss: Assuming $d^*_{\\tilde{x},\\tilde{y}}(x)$ is the optimal decision rule for given values of $\\tilde{x}$, $\\tilde{y}$, and $x$, compute the resulting loss as\n\n$$ r_{\\tilde{x},\\tilde{y}}(x) := \\int L(y, d^*_{\\tilde{x},\\tilde{y}}(x)) p(y \\mid x, w) p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y})) dy dw \\qquad (1) $$\n\n4. Averaging: Integrate this quantity over the future trial inputs $x$ distributed as $p(x)$ and the different possible future outputs $\\tilde{y}$, yielding\n\n$$ r_{\\tilde{x}} := \\int r_{\\tilde{x},\\tilde{y}}(x) p(x) p(\\tilde{y} \\mid \\tilde{x}, D_{(n)}) dx d\\tilde{y}. $$\n\nThis procedure is repeated until an $\\tilde{x}$ with minimal average risk is found. Since local optima are typical, a global optimization method is required. 
Subsequently we try to determine whether the current model is still adequate or whether we have to increase its complexity (e.g. by adding more hidden units).\n\n3 COMPUTATIONAL PROCEDURE\n\nLet us assume that the real data $D_{(n)}$ was generated according to a regression model $y = f(x, w) + \\epsilon$ with i.i.d. Gaussian noise $\\epsilon \\sim N(0, \\sigma^2(w))$. For example, $f(x, w)$ may be a multilayer perceptron or a radial basis function network. Since the error terms are independent, the posterior density is $p(w \\mid D_{(n)}) \\propto p(w) \\prod_{i=1}^{n} p(y_i \\mid x_i, w)$ even in the case of query sampling [Ford et al. 89].\n\nAs the analytic derivation of the posterior is infeasible except in trivial cases, we have to use approximations. One approach is to employ a normal approximation [MacKay 92], but this is unreliable if the number of observations is small compared to the number of parameters. We use Markov Chain Monte Carlo procedures [Paa\u00df 91, Neal 93] to generate a sample $W_{(B)} := \\{w_1, \\ldots, w_B\\}$ of parameters distributed according to $p(w \\mid D_{(n)})$. As the number of sampling steps approaches infinity, the distribution of the simulated $w_b$ approximates the posterior arbitrarily well.\n\nTo take into account the range of future $y$-values, we create a set of them by simulation. For each $w_b \\in W_{(B)}$ a number of $\\tilde{y} \\sim p(\\tilde{y} \\mid \\tilde{x}, w_b)$ is generated. Let $Y_{(\\tilde{x},R)} := \\{\\tilde{y}_1, \\ldots, \\tilde{y}_R\\}$ be the resulting set. Instead of performing a new Markov Chain Monte Carlo run to generate a new sample according to $p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y}))$, we use the old set $W_{(B)}$ of parameters and reweight them (importance sampling). 
\nIn this way we may approximate integrals of some function $g(w)$ with respect to $p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y}))$ [Kalos Whitlock 86, p.92]:\n\n$$ \\int g(w) p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y})) dw \\approx \\frac{\\sum_{b=1}^{B} g(w_b) p(\\tilde{y} \\mid \\tilde{x}, w_b)}{\\sum_{b=1}^{B} p(\\tilde{y} \\mid \\tilde{x}, w_b)} \\qquad (2) $$\n\nThe approximation error approaches zero as the size of $W_{(B)}$ increases.\n\n3.1 APPROXIMATION OF FUTURE LOSS\n\nConsider the future loss $r_{\\tilde{x},\\tilde{y}}(x_t)$ given new observation $(\\tilde{x}, \\tilde{y})$ and trial input $x_t$. In the case of the square loss function, (1) can be transformed to\n\n$$ r_{\\tilde{x},\\tilde{y}}(x_t) = \\int [f(x_t, w) - E(y \\mid x_t, D_{(n)} \\cup (\\tilde{x}, \\tilde{y}))]^2 p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y})) dw + \\int \\sigma^2(w) p(w \\mid D_{(n)} \\cup (\\tilde{x}, \\tilde{y})) dw \\qquad (3) $$\n\nwhere $\\sigma^2(w) := Var(y \\mid x, w)$ is independent of $x$. Assume a set $X_T = \\{x_1, \\ldots, x_T\\}$ is given, which is representative of trial inputs for the distribution $p(x)$. Define $S(\\tilde{x}, \\tilde{y}) := \\sum_{b=1}^{B} p(\\tilde{y} \\mid \\tilde{x}, w_b)$ for $\\tilde{y} \\in Y_{(\\tilde{x},R)}$. Then from equations (2) and (3) we get $E(y \\mid x_t, D_{(n)} \\cup (\\tilde{x}, \\tilde{y})) \\approx \\frac{1}{S(\\tilde{x}, \\tilde{y})} \\sum_{b=1}^{B} f(x_t, w_b) p(\\tilde{y} \\mid \\tilde{x}, w_b)$ and\n\n$$ r_{\\tilde{x},\\tilde{y}}(x_t) \\approx \\frac{1}{S(\\tilde{x}, \\tilde{y})} \\sum_{b=1}^{B} \\sigma^2(w_b) p(\\tilde{y} \\mid \\tilde{x}, w_b) + \\frac{1}{S(\\tilde{x}, \\tilde{y})} \\sum_{b=1}^{B} [f(x_t, w_b) - E(y \\mid x_t, D_{(n)} \\cup (\\tilde{x}, \\tilde{y}))]^2 p(\\tilde{y} \\mid \\tilde{x}, w_b) \\qquad (4) $$\n\nThe final value of $r_{\\tilde{x}}$ is obtained by averaging over the different $\\tilde{y} \\in Y_{(\\tilde{x},R)}$ and different trial inputs $x_t \\in X_T$. To reduce the variance, the trial inputs $x_t$ should be selected by importance sampling (2) to concentrate them on regions with high current loss (see (5) below). To facilitate the search for an $\\tilde{x}$ with minimal $r_{\\tilde{x}}$ we reduce the extent of random fluctuations of the $\\tilde{y}$ values. Let $(v_1, \\ldots, v_R)$ be a vector of random numbers $v_r \\sim N(0, 1)$, and let $j_r$ be randomly selected from $\\{1, \\ldots, B\\}$. Then for each $\\tilde{x}$ the possible observations $\\tilde{y}_r \\in Y_{(\\tilde{x},R)}$ are defined as $\\tilde{y}_r := f(\\tilde{x}, w_{j_r}) + v_r \\sigma(w_{j_r})$. 
In this way the difference between neighboring inputs is not affected by noise, and search procedures can exploit gradients.\n\n3.2 CURRENT LOSS\n\nAs a proxy for the future loss, we may use the current loss at $x$,\n\n$$ r_{curr}(x) = p(x) \\int L(y, d^*(x)) p(y \\mid x, D_{(n)}) dy \\qquad (5) $$\n\nwhere $p(x)$ weights the inputs according to their relevance. For the square loss function the average loss at $x$ is the conditional variance $Var(y \\mid x, D_{(n)})$. We get\n\n$$ r_{curr}(x) = p(x) \\int (f(x, w) - E(y \\mid x, D_{(n)}))^2 p(w \\mid D_{(n)}) dw + p(x) \\int \\sigma^2(w) p(w \\mid D_{(n)}) dw \\qquad (6) $$\n\nIf $E(y \\mid x, D_{(n)}) \\approx \\frac{1}{B} \\sum_{b=1}^{B} f(x, w_b)$ and the sample $W_{(B)} := \\{w_1, \\ldots, w_B\\}$ is representative of $p(w \\mid D_{(n)})$, we can approximate the current loss with\n\n$$ \\hat{r}_{curr}(x) \\approx \\frac{p(x)}{B} \\sum_{b=1}^{B} (f(x, w_b) - E(y \\mid x, D_{(n)}))^2 + \\frac{p(x)}{B} \\sum_{b=1}^{B} \\sigma^2(w_b) \\qquad (7) $$\n\nIf the input distribution $p(x)$ is uniform, the second term is independent of $x$.\n\n3.3 COMPLEXITY REGULARIZATION\n\nNeural network models can represent arbitrary mappings between finite-dimensional spaces if the number of hidden units is sufficiently large [Hornik Stinchcombe 89]. As the number of observations grows, more and more hidden units are necessary to capture the details of the mapping. Therefore we use a sequential procedure to increase the capacity of our networks during query learning. White and Wooldridge call this approach the \"method of sieves\" and provide some asymptotic results on its consistency [White Wooldridge 91]. 
Gelfand and Dey compare Bayesian approaches for model selection and prove that, in the case of nested models $M_1$ and $M_2$, model choice by the popular Bayes factor, the ratio of the marginal likelihoods $p(D_{(n)} \\mid M_i) := \\int p(D_{(n)} \\mid w, M_i) p(w \\mid M_i) dw$, will always choose the full model regardless of the data as $n \\to \\infty$ [Gelfand Dey 94]. They show that the pseudo-Bayes factor, a Bayesian variant of crossvalidation, is not affected by this paradox\n\n$$ A(M_1, M_2) := \\prod_{j=1}^{n} p(y_j \\mid x_j, D_{(n,j)}, M_1) / \\prod_{j=1}^{n} p(y_j \\mid x_j, D_{(n,j)}, M_2) \\qquad (8) $$\n\nHere $D_{(n,j)} := D_{(n)} \\setminus (x_j, y_j)$. As the difference between $p(w \\mid D_{(n)})$ and $p(w \\mid D_{(n,j)})$ is usually small, we use the full posterior as the importance function (2) and get\n\n$$ p(y_j \\mid x_j, D_{(n,j)}, M_i) = \\int p(y_j \\mid x_j, w, M_i) p(w \\mid D_{(n,j)}, M_i) dw \\approx B / \\left( \\sum_{b=1}^{B} 1 / p(y_j \\mid x_j, w_b, M_i) \\right) \\qquad (9) $$\n\n4 NUMERICAL DEMONSTRATION\n\nIn a first experiment we tested the approach for a small 1-2-1 MLP target function with Gaussian noise $N(0, 0.05^2)$. We assumed the square loss function and a uniform input distribution $p(x)$ over $[-5, 5]$. Using the \"true\" architecture for the approximating model we started with a single randomly generated observation. We 
estimated the future loss by (4) for 100 different inputs and selected the input with smallest future loss as the next query. $B = 50$ parameter vectors were generated, requiring 200,000 Metropolis steps. Simultaneously we approximated the current loss criterion by (7). The left side of figure 1 shows the typical relation of both measures. In most situations the future loss is low in the same regions where the current loss (posterior standard deviation of mean prediction) is high. The queries are concentrated in areas of high variation and the estimated posterior mean approximates the target function quite well.\n\nFigure 1: Future loss exploration: predicted posterior mean, future loss and current loss for 12 observations (left), and root mean square error of prediction (right).\n\nIn the right part of figure 1 the RMSE of prediction averaged over 12 independent experiments is shown. After a few observations the RMSE drops sharply. In our example there is no marked difference between the prediction errors resulting from the future loss and the current loss criterion (also averaged over 12 experiments). Considering the substantial computing effort this favors the current loss criterion. The dots indicate the RMSE for randomly generated data (averaged over 8 experiments) using the same Bayesian prediction procedure. Because only few data points were located in the critical region of high variation the RMSE is much larger.\n\nIn the second experiment, a 2-3-1 MLP defined the target function $f(x, w_0)$, to which Gaussian noise of standard deviation 0.05 was added. $f(x, w_0)$ is shown in the left part of figure 2. We used five MLPs with 2-6 hidden units as candidate models $M_1, \\ldots$ 
, $M_5$ and generated $B = 45$ samples $W_{(B)}$ of the posterior $p(w \\mid D_{(n)}, M_i)$, where $D_{(n)}$ is the current data. We started with 30,000 Metropolis steps for small values of $n$ and increased this to 90,000 Metropolis steps for larger values of $n$. For a network with 6 hidden units and $n = 50$ observations, 10,000 Metropolis steps took about 30 seconds on a Sparc10 workstation. Next, we used equation (9) to compare the different models, and then used the optimal model to calculate the current loss (7) on a regular grid of 41 x 41 = 1681 query points $x$. Here we assumed the square loss function and a uniform input distribution $p(x)$ over $[-5, 5] \\times [-5, 5]$. We selected the query point with maximal current loss and determined the final query point with a hillclimbing algorithm. In this way we were rather sure to get close to the true global optimum.\n\nFigure 2: Current loss exploration: MLP target function and root mean square error.\n\nThe main result of the experiment is summarized in the right part of figure 2. It shows, averaged over 3 experiments, the root mean square error between the true mean value and the posterior mean $E(y \\mid x)$ on the grid of 1681 inputs in relation to the sample size. Three phases of the exploration can be distinguished (see figure 3). In the beginning a search is performed with many queries on the border of the input area. 
After about 20 observations the algorithm knows enough detail about the true function to concentrate on the relevant parts of the input space. This leads to a marked reduction of the mean square error. After 40 observations the systematic part of the true function has been captured nearly perfectly. In the last phase of the experiment the algorithm merely reduces the uncertainty caused by the random noise. In contrast, the data generated randomly does not have sufficient information on the details of $f(x, w)$, and therefore the error only gradually decreases. Because of space constraints we cannot report experiments with radial basis functions which led to similar results.\n\nFigure 3: Square root of current loss (upper row) and absolute deviation from true function (lower row) for 10, 25, and 40 observations (which are indicated by dots).\n\nAcknowledgements\n\nThis work is part of the joint project 'REFLEX' of the German Fed. Department of Science and Technology (BMFT), grant number 01 IN 111Aj4. We would like to thank Alexander Linden, Mark Ring, and Frank Weber for many fruitful discussions.\n\nReferences\n\n[Berger 80] Berger, J. (1980): Statistical Decision Theory, Foundations, Concepts, and Methods. Springer Verlag, New York.\n\n[Cohn 94] Cohn, D. (1994): Neural Network Exploration Using Optimal Experimental Design. In J. Cowan et al. (eds.): NIPS 5. Morgan Kaufmann, San Mateo.\n\n[Ford et al. 89] Ford, I., Titterington, D.M., Kitsos, C.P. (1989): Recent Advances in Nonlinear Design. Technometrics, 31, pp. 49-60.\n\n[Gelfand Dey 94] Gelfand, A.E., Dey, D.K. (1994): Bayesian Model Choice: Asymptotics and Exact Calculations. J. Royal Statistical Society B, 56, pp. 501-514.\n\n[Hornik Stinchcombe 89] Hornik, K., Stinchcombe, M. 
(1989): Multilayer Feedforward Networks are Universal Approximators. Neural Networks 2, pp. 359-366.\n\n[Kalos Whitlock 86] Kalos, M.H., Whitlock, P.A. (1986): Monte Carlo Methods. Wiley, New York.\n\n[MacKay 92] MacKay, D. (1992): Information-Based Objective Functions for Active Data Selection. Neural Computation 4, pp. 590-604.\n\n[Neal 93] Neal, R.M. (1993): Probabilistic Inference using Markov Chain Monte Carlo Methods. Tech. Report CRG-TR-93-1, Dep. of Computer Science, Univ. of Toronto.\n\n[Paa\u00df 91] Paa\u00df, G. (1991): Second Order Probabilities for Uncertain and Conflicting Evidence. In P.P. Bonissone et al. (eds.): Uncertainty in Artificial Intelligence 6. Elsevier, Amsterdam, pp. 447-456.\n\n[Plutowski White 93] Plutowski, M., White, H. (1993): Selecting Concise Training Sets from Clean Data. IEEE Tr. on Neural Networks, 4, pp. 305-318.\n\n[Pronzato Walter 92] Pronzato, L., Walter, E. (1992): Nonsequential Bayesian Experimental Design for Response Optimization. In V. Fedorov, W.G. M\u00fcller, I.N. Vuchkov (eds.): Model Oriented Data-Analysis. Physica Verlag, Heidelberg, pp. 89-102.\n\n[Sollich 94] Sollich, P. (1994): Query Construction, Entropy and Generalization in Neural Network Models. To appear in Physical Review E.\n\n[Sollich Saad 95] Sollich, P., Saad, D. (1995): Learning from Queries for Maximum Information Gain in Unlearnable Problems. This volume.\n\n[White Wooldridge 91] White, H., Wooldridge, J. (1991): Some Results for Sieve Estimation with Dependent Observations. In W. Barnett et al. (eds.): Nonparametric and Semiparametric Methods in Econometrics and Statistics, New York, Cambridge Univ. Press. 
\n\n\f", "award": [], "sourceid": 1000, "authors": [{"given_name": "Gerhard", "family_name": "Paass", "institution": null}, {"given_name": "J\u00f6rg", "family_name": "Kindermann", "institution": null}]}