{"title": "Supervised Learning of Probability Distributions by Neural Networks", "book": "Neural Information Processing Systems", "page_first": 52, "page_last": 61, "abstract": "", "full_text": "Supervised Learning of Probability Distributions by Neural Networks \n\nEric B. Baum \n\nJet Propulsion Laboratory, Pasadena CA 91109 \n\nFrank Wilczek† \n\nDepartment of Physics, Harvard University, Cambridge MA 02138 \n\nAbstract: \n\nWe propose that the back propagation algorithm for supervised learning can be generalized, put on a satisfactory conceptual footing, and very likely made more efficient by defining the values of the output and input neurons as probabilities and varying the synaptic weights in the gradient direction of the log likelihood, rather than the 'error'. \n\nIn the past thirty years many researchers have studied the question of supervised learning in 'neural'-like networks. Recently a learning algorithm called 'back propagation' [1-4] or the 'generalized delta-rule' has been applied to numerous problems including the mapping of text to phonemes [5], the diagnosis of illnesses [6], and the classification of sonar targets [7]. In these applications, it would often be natural to consider imperfect, or probabilistic, information. We believe that by considering supervised learning from this slightly larger perspective, one can not only place back propagation on a more rigorous and general basis, relating it to other well studied pattern recognition algorithms, but very likely improve its performance as well. \n\n† Permanent address: Institute for Theoretical Physics, University of California, Santa Barbara CA 93106 \n\n© American Institute of Physics 1988 \n\nThe problem of supervised learning is to model some mapping between input vectors and output vectors presented to us by some real world phenomenon. 
To be specific, consider the question of medical diagnosis. The input vector corresponds to the symptoms of the patient; the i-th component is defined to be 1 if symptom i is present and 0 if symptom i is absent. The output vector corresponds to the illnesses, so that its j-th component is 1 if the j-th illness is present and 0 otherwise. Given a data base consisting of a number of diagnosed cases, the goal is to construct (learn) a mapping which accounts for these examples and can be applied to diagnose new patients in a reliable way. One could hope, for instance, that such a learning algorithm might yield an expert system to simulate the performance of doctors. Little expert advice would be required for its design, which is advantageous both because experts' time is valuable and because experts often have extraordinary difficulty in describing how they make decisions. \n\nA feedforward neural network implements such a mapping between input vectors and output vectors. Such a network has a set of input nodes, one or several layers of intermediate nodes, and a layer of output nodes. The nodes are connected in a forward directed manner, so that the output of a node may be connected to the inputs of nodes in subsequent layers, but closed loops do not occur. See figure 1. The output of each node is assumed to be a bounded semilinear function of its inputs. That is, if V_j denotes the output of the j-th node and W_ij denotes the weight associated with the connection of the output of the j-th node to the input of the i-th, then the i-th neuron takes value V_i = g(Σ_j W_ij V_j), where g is a bounded, differentiable function called the activation function. g(x) = 1/(1 + e^{-x}), called the logistic function, is frequently used. 
Given a fixed set of weights {W_ij}, we set the input node values to equal some input vector, compute the value of the nodes layer by layer until we compute the output nodes, and so generate an output vector. \n\nFigure 1: A 5 layer network. Note bottleneck at layer 3. \n\nSuch networks have been studied because of analogies to neurobiology, because it may be easy to fabricate them in hardware, and because learning algorithms such as the Perceptron learning algorithm [8], Widrow-Hoff [9], and back propagation have been able to choose weights W_ij that solve interesting problems. \n\nGiven a set of input vectors s^μ, together with associated target values t_j^μ, back propagation attempts to adjust the weights so as to minimize the error E in achieving these target values, defined as \n\nE = Σ_μ E^μ = Σ_{μ,j} (t_j^μ - o_j^μ)^2    (1) \n\nwhere o_j^μ is the output of the j-th node when s^μ is presented as input. Back propagation starts with randomly chosen W_ij and then varies them in the gradient direction of E until a local minimum is obtained. Although only a locally optimal set of weights is obtained, in a number of experiments the neural net so generated has performed surprisingly well not only on the training set but on subsequent data [4-6]. This performance is probably the main reason for widespread interest in back propagation. \n\nIt seems to us natural, in the context of the medical diagnosis problem, the other real world problems to which back propagation has been applied, and indeed in any mapping problem where one desires to generalize from a limited and noisy set of examples, to interpret the output vector in probabilistic terms. Such an interpretation is standard in the literature on pattern classification [10]. Indeed, the examples might even be probabilistic themselves. That is to say, it might not be certain whether symptom i was present in case μ or not. 
\n\nLet s_i^μ represent the probability symptom i is present in case μ, and let t_j^μ represent the probability disease j occurred in case μ. Consider for the moment the case where the t_j^μ are 1 or 0, so that the cases are in fact fully diagnosed. Let f_i(s, θ) be our prediction of the probability of disease i given input vector s, where θ is some set of parameters determined by our learning algorithm. In the neural network case, the θ are the connection weights and f_i(s^μ, {W_ij}) = o_i^μ. \n\nNow, lacking a priori knowledge of a good θ, the best one can do is to choose the parameters θ to maximize the likelihood that the given set of examples should have occurred [10]. The formula for this likelihood, p, is immediate: \n\np = Π_{μ,j} f_j(s^μ, θ)^{t_j^μ} [1 - f_j(s^μ, θ)]^{1 - t_j^μ}    (2) \n\nor \n\nlog(p) = Σ_{μ,j} [t_j^μ log(f_j(s^μ, θ)) + (1 - t_j^μ) log(1 - f_j(s^μ, θ))]    (3) \n\nThe extension of equation (2), and thus equation (3), to the case where the t are probabilities, taking values in [0,1], is straightforward*1 and yields \n\nlog(p) = Σ_{μ,j} [t_j^μ log(f_j(s^μ, θ)) + (1 - t_j^μ) log(1 - f_j(s^μ, θ))]    (4) \n\nExpressions of this sort often arise in physics and information theory and are generally interpreted as an entropy [11]. \n\nWe may now vary the {θ} in the gradient direction of the entropy. The back propagation algorithm generalizes immediately from minimizing 'Error' or 'Energy' to maximizing entropy or log likelihood, or indeed any other function of the outputs and the inputs [12]. Of course it remains true that the gradient can be computed by back propagation with essentially the same number of computations as are required to compute the output of the network. \n\nA back propagation algorithm based on log-likelihood is not only more intuitively appealing than one based on an ad-hoc definition of error, but will make quite different and more accurate predictions as well. Consider e.g. training the net on an example which it already understands fairly well. 
Say t_j = 0 and f_j(s, θ) = ε, for ε small. Now, from eqn (1), ∂E/∂f_j = 2ε, so using 'Error' as a criterion the net learns very little from this example, whereas, using eqn (3), |∂log(p)/∂f_j| = 1/(1 - ε), so the net continues to learn and can in fact converge to predict probabilities near 0 or 1. Indeed, because back propagation using the standard 'Error' measure can not converge to generate outputs of 1 or 0, it has been customary in the literature [4] to round the target values so that a target of 1 would be presented in the learning algorithm as some ad hoc number such as .8, whereas a target of 0 would be presented as .2. \n\n*1 We may see this by constructing an equivalent larger set of examples with the t taking only values 0 or 1 with the appropriate frequency. Thus assume the t_j^μ are rational numbers with denominator d_j and numerator n_j, and let P = Π_{μ,j} d_j. What we mean by the set of examples {t^μ : μ = 1, ..., M} can be represented by considering a set of N = MP examples {t̃_j^ν} where, for each μ, t̃_j^ν = 0 for P(μ-1) < ν < Pμ and 1 < ν mod(d_j) < (d_j - n_j), and t̃_j^ν = 1 otherwise. Now applying equation (3) gives equation (4), up to an overall normalization. \n\nIn the context of our general discussion it is natural to ask whether using a feedforward network and varying the weights is in fact the most effective alternative. Anderson and Abrahams [13] have discussed this issue from a Bayesian viewpoint. From this point of view, fitting output to input using normal distributions and varying the means and covariance matrix may seem to be more logical. \n\nFeedforward networks do however have several advantages for complex problems. Experience with neural networks has shown the importance of including hidden units wherein the network can form an internal representation of the world. 
If one simply uses normal distributions, any hidden variables included will simply integrate out in calculating an output. It will thus be necessary to include at least third order correlations to implement useful hidden variables. Unfortunately, the number of possible third order correlations is very large, so that there may be practical obstacles to such an approach. Indeed, it is well known folklore in curve fitting and pattern classification that the number of parameters must be small compared to the size of the data set if any generalization to future cases is expected [10]. \n\nIn feedforward nets the question takes a different form. There can be bottlenecks to information flow. Specifically, if the net is constructed with an intermediate layer which is not bypassed by any connections (i.e. there are no connections from layers preceding to layers subsequent), and if furthermore the activation functions are chosen so that the values of each of the intermediate nodes tend towards either 1 or 0*2, then this layer serves as a bottleneck to information flow. No matter how many input nodes, output nodes, or free parameters there are in the net, the output will be constrained to take on no more than 2^l different patterns, where l is the number of nodes in the bottleneck layer. Thus if l is small, some sort of 'generalization' must occur even if the number of weights is large. One plausible reason for the success of back propagation in adequately solving tasks, in spite of the fact that it finds only local minima, is its ability to vary a large number of parameters. This freedom may allow back propagation to escape from many putative traps and to find an acceptable solution. \n\nA good expert system, say for medical diagnosis, should not only give a diagnosis based on the available information, but should be able to suggest, in questionable cases, which lab tests might be performed to clarify matters. 
Actually, back propagation inherently has such a capability. Back propagation involves calculation of ∂log(p)/∂w_ij. This information allows one to compute immediately ∂log(p)/∂s_i. Those input nodes for which this partial derivative is large correspond to important experiments. \n\nIn conclusion, we propose that back propagation can be generalized, put on a satisfactory conceptual footing, and very likely made more efficient, by defining the values of the output and input neurons as probabilities, and replacing the 'Error' by the log-likelihood. \n\n*2 Alternatively, when necessary this can be enforced by adding an energy term to the log-likelihood to constrain the parameter variation so that the neuronal values are near either 1 or 0. \n\nAcknowledgement: E. B. Baum was supported in part by DARPA through arrangement with NASA and by NSF grant DMB-840649, 802. F. Wilczek was supported in part by NSF grant PHY82-17853. \n\nReferences \n\n(1) Werbos, P., \"Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences\", Harvard University Dissertation (1974) \n(2) Parker, D. B., \"Learning Logic\", MIT Tech Report TR-47, Center for Computational Research in Economics and Management Science, MIT, 1985 \n(3) Le Cun, Y., Proceedings of Cognitiva 85, pp 599-604, Paris (1985) \n(4) Rumelhart, D. E., Hinton, G. E., Williams, R. J., \"Learning Internal Representations by Error Propagation\", in \"Parallel Distributed Processing\", vol 1, eds. Rumelhart, D. E., McClelland, J. L., MIT Press, Cambridge MA, (1986) \n(5) Sejnowski, T. J., Rosenberg, C. R., Complex Systems, v 1, pp 145-168 (1987) \n(6) LeCun, Y., Address at 1987 Snowbird Conference on Neural Networks \n(7) Gorman, P., Sejnowski, T. 
J., \"Learned Classification of Sonar Targets Using a Massively Parallel Network\", in \"Workshop on Neural Network Devices and Applications\", JPL D-4406, (1987) pp 224-237 \n(8) Rosenblatt, F., \"Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms\", Spartan Books, Washington DC (1962) \n(9) Widrow, B., Hoff, M. E., 1960 IRE WESCON Conv. Record, Part 4, 96-104 (1960) \n(10) Duda, R. O., Hart, P. E., \"Pattern Classification and Scene Analysis\", John Wiley and Sons, N.Y., (1973) \n(11) Guiasu, S., \"Information Theory with Applications\", McGraw Hill, NY, (1977) \n(12) Baum, E. B., \"Generalizing Back Propagation to Computation\", in \"Neural Networks for Computing\", AIP Conf. Proc. 151, Snowbird UT (1986) pp 47-53 \n(13) Anderson, C. H., Abrahams, E., \"The Bayes Connection\", Proceedings of the IEEE International Conference on Neural Networks, San Diego, (1987)", "award": [], "sourceid": 3, "authors": [{"given_name": "Eric", "family_name": "Baum", "institution": null}, {"given_name": "Frank", "family_name": "Wilczek", "institution": null}]}
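The forward pass the paper describes (each neuron taking value V_i = g(Σ_j W_ij V_j) with the logistic activation g) and the log-likelihood of equation (4) can be sketched in a few lines of Python. This is an illustrative sketch only; the tiny weight matrix, vectors, and function names are assumptions for demonstration, not taken from the paper:

```python
import math

def logistic(x):
    # g(x) = 1/(1 + e^{-x}), the activation function used in the text
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(weights, values):
    # One layer of the forward pass: V_i = g(sum_j W_ij * V_j)
    return [logistic(sum(w * v for w, v in zip(row, values))) for row in weights]

def log_likelihood(targets, outputs):
    # Equation (4) for one case: sum_j [t_j log f_j + (1 - t_j) log(1 - f_j)]
    return sum(t * math.log(f) + (1.0 - t) * math.log(1.0 - f)
               for t, f in zip(targets, outputs))

# Hypothetical 2-input, 2-output single-layer example with made-up weights.
W = [[1.0, -1.0], [0.5, 0.5]]
s = [1.0, 0.0]           # input vector (symptom probabilities)
o = layer_forward(W, s)  # output vector (predicted disease probabilities)
t = [1.0, 0.0]           # fully diagnosed targets
print(o, log_likelihood(t, o))
```

A real network would stack several such layers and ascend the gradient of the log-likelihood over all cases μ, but the quantities above are all that equations (1)-(4) refer to.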
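The paper's gradient comparison for a well-learned example (target t_j = 0, output f_j = ε) is easy to verify numerically. A minimal sketch, assuming a single output unit and the per-term gradients of equations (1) and (3); the function names are ours:

```python
def squared_error_grad(t, f):
    # dE/df for the squared error term E = (t - f)^2 of eqn (1)
    return 2.0 * (f - t)

def log_likelihood_grad(t, f):
    # d(log p)/df for the term t*log(f) + (1 - t)*log(1 - f) of eqn (3)
    return t / f - (1.0 - t) / (1.0 - f)

# A well-learned example: target t = 0, predicted probability f = eps near 0.
eps = 1e-3
print(squared_error_grad(0.0, eps))        # 2*eps: the 'Error' signal nearly vanishes
print(abs(log_likelihood_grad(0.0, eps)))  # 1/(1 - eps): stays of order one
```

As the text argues, the 'Error' gradient shrinks with ε while the log-likelihood gradient does not, so only the latter can drive the output all the way toward 0 or 1.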