{"title": "Bayesian Self-Organization", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Bayesian Self-Organization \n\nAlan L. Yuille \n\nDivision of Applied Sciences \n\nHarvard University \n\nCambridge, MA 02138 \n\nStelios M. Smirnakis \n\nLyman Laboratory of Physics \n\nHarvard University \n\nCambridge, MA 02138 \n\nLei Xu * \n\nDept. of Computer Science \n\nHSH ENG BLDG, Room 1006 \n\nThe Chinese University of Hong Kong \n\nShatin, NT \nHong Kong \n\nAbstract \n\nRecent work by Becker and Hinton (Becker and Hinton, 1992) \nshows a promising mechanism, based on maximizing mutual in(cid:173)\nformation assuming spatial coherence, by which a system can self(cid:173)\norganize itself to learn visual abilities such as binocular stereo. We \nintroduce a more general criterion, based on Bayesian probability \ntheory, and thereby demonstrate a connection to Bayesian theo(cid:173)\nries of visual perception and to other organization principles for \nearly vision (Atick and Redlich, 1990). Methods for implementa(cid:173)\ntion using variants of stochastic learning are described and, for the \nspecial case of linear filtering, we derive an analytic expression for \nthe output. \n\n1 \n\nIntroduction \n\nThe input intensity patterns received by the human visual system are typically \ncomplicated functions of the object surfaces and light sources in the world. It \n*Lei Xu was a research scholar in the Division of Applied Sciences at Harvard University \n\nwhile this work was performed. \n\n1001 \n\n\f1002 \n\nYuille, Smimakis, and Xu \n\nseems probable, however, that humans perceive the world in terms of surfaces and \nobjects (Nakayama and Shimojo, 1992). Thus the visual system must be able to \nextract information from the input intensities that is relatively independent of the \nactual intensity values. Such abilities may not be present at birth and hence must \nbe learned. 
It seems, for example, that binocular stereo develops at about the age of two to three months (Held, 1981). \n\nBecker and Hinton (Becker and Hinton, 1992) describe an interesting mechanism for self-organizing a system to achieve this. The basic idea is to assume spatial coherence of the structure to be extracted and to train a neural network by maximizing the mutual information between neurons with disjoint receptive fields. For binocular stereo, for example, the surface being viewed is assumed flat (see (Becker and Hinton, 1992) for generalizations of this assumption) and hence has spatially constant disparity. The intensity patterns, however, do not have any simple spatial behaviour. Adjusting the synaptic strengths of the network to maximize the mutual information between neurons with non-overlapping receptive fields, for an ensemble of images, causes the neurons to extract features that are spatially coherent, thereby obtaining the disparity [fig. 1]. \n\nFigure 1: In Becker and Hinton's initial scheme (Becker and Hinton, 1992), maximization of the mutual information I(a; b) between neurons with spatially disjoint receptive fields leads to disparity tuning, provided they train on spatially coherent patterns (i.e. those for which disparity changes slowly with spatial position). \n\nWorkers in computer vision face a similar problem of estimating the properties of objects in the world from intensity images. It is commonly stated that vision is ill-posed (Poggio et al, 1985) and that prior assumptions about the world are needed to obtain a unique perception. It is convenient to formulate such assumptions by the use of Bayes' theorem P(S|D) = P(D|S)P(S)/P(D). 
This relates the probability P(S|D) of the scene S given the data D to the prior probability of the scene P(S) and the imaging model P(D|S) (P(D) can be interpreted as a normalization constant). Thus a vision theorist (see (Clark and Yuille, 1990), for example) determines an imaging model P(D|S), picks a set of plausible prior assumptions about the world P(S) (such as natural constraints (Marr, 1982)), applies Bayes' theorem, and then picks an interpretation S* from some statistical estimator of P(S|D) (for example, the maximum a posteriori (MAP) estimator S* = arg max_S P(S|D)). \n\nAn advantage of the Bayesian approach is that, by nature of its probabilistic formulation, it can be readily related to learning with a teacher (Kersten et al, 1987). It is unclear, however, whether such a teacher will always be available. Moreover, from Becker and Hinton's work on self-organization, it seems that a teacher is not always necessary. This paper proposes a way of generalizing the self-organization approach, by starting from a Bayesian perspective, and thereby relating it to Bayesian theories of vision. The key idea is to force the activity distribution of the outputs to be close to a pre-specified prior distribution Pp(S). We argue that this approach is in the same spirit as (Becker and Hinton, 1992), because we can choose the prior distribution to enforce spatial coherence, but it is also more general since many other choices of the prior are possible. It also has some relation to the work performed by Atick and Redlich (Atick and Redlich, 1990) for modelling the early visual system. We will take the viewpoint that the prior Pp(S) is assumed known in advance by the visual system (perhaps by being specified genetically) and will act as a self-organizing principle. Later we will discuss ways in which this might be relaxed. 
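The MAP recipe above can be made concrete in a few lines. The sketch below uses hypothetical toy numbers of our own (three candidate scenes, a made-up prior and likelihood, none of them from the paper) purely to illustrate the step S* = arg max_S P(S|D).

```python
import numpy as np

# Hypothetical numbers for illustration (not from the paper): a scene variable
# S with three candidate interpretations, a prior P(S), and an imaging model
# P(D|S) evaluated at the observed data D.
scenes = ["flat", "slanted", "curved"]
prior = np.array([0.6, 0.3, 0.1])       # P(S), e.g. encoding a natural constraint
likelihood = np.array([0.2, 0.5, 0.3])  # P(D|S) for the observed D

posterior = likelihood * prior  # Bayes' theorem: P(S|D) = P(D|S)P(S)/P(D)
posterior /= posterior.sum()    # dividing by P(D) normalizes the posterior

s_map = scenes[int(np.argmax(posterior))]  # MAP estimate S* = arg max_S P(S|D)
print(s_map)
```

Note that the weak likelihood for "flat" is partly rescued by its strong prior; the prior only decides the interpretation when the data are ambiguous.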
\n\n2 Theory \n\nWe assume that the input D is a function of a signal L that the system wants to determine and a distractor N [fig. 2]. For example L might correspond to the disparities of a pair of binocular stereo images and N to the intensity patterns. The distribution of the inputs is PD(D) and the system assumes that the signal L has distribution Pp(L). \n\nLet the output of the system be S = G(D, γ) where G is a function of a set of parameters γ, to be determined. For example, the function G(D, γ) could be represented by a multi-layer perceptron with the γ's being the synaptic weights. By approximation theory, it can be shown that a large variety of neural networks can approximate any input-output function arbitrarily well given enough hidden nodes (Hornik et al, 1989). \n\nThe aim of self-organizing the network is to ensure that the parameters γ are chosen so that the outputs S are as close to the L as possible. We claim that this can be achieved by adjusting the parameters γ so as to make the derived distribution of the outputs PDD(S : γ) = ∫ δ(S − G(D, γ)) PD(D) [dD] as close as possible to Pp(S). This can be seen to be a consistency condition for a Bayesian theory, since from Bayes' formula we obtain the equation: \n\n∫ P(S|D) PD(D) [dD] = ∫ P(D|S) Pp(S) [dD] = Pp(S), \n\n(1) \n\nwhich is equivalent to our condition, provided we choose to identify P(S|D) with δ(S − G(D, γ)). \n\nTo make this more precise we must define a measure of similarity between the two distributions Pp(S) and PDD(S : γ). An attractive measure is the Kullback-Leibler distance (the entropy of PDD relative to Pp): \n\nKL(γ) = ∫ PDD(S : γ) log {PDD(S : γ) / Pp(S)} [dS]. 
\n\n(2) \n\nFigure 2: The parameters γ are adjusted to minimize the Kullback-Leibler distance between the prior distribution (Pp) of the true signal L and the derived distribution (PDD) of the network output S, where the input is D = F(L, N) and the output is S = G(D, γ). \n\nThis measure can be divided into two parts: (i) −∫ PDD(S : γ) log Pp(S) [dS] and (ii) ∫ PDD(S : γ) log PDD(S : γ) [dS]. The second term encourages variability of the output while the first term forces similarity to the prior distribution. \n\nSuppose that Pp(S) can be expressed as a Markov random field (i.e. the spatial distribution of Pp(S) has a local neighbourhood structure, as is commonly assumed in Bayesian models of vision). Then, by the Hammersley-Clifford theorem, we can write Pp(S) = e^{−βEp(S)}/Z where Ep(S) is an energy function with local connections (for example, Ep(S) = Σ_i (S_i − S_{i+1})²), β is an inverse temperature and Z is a normalization constant. \n\nThen the first term can be written (Yuille et al, 1992) as \n\n−∫ PDD(S : γ) log Pp(S) [dS] = β ⟨Ep(G(D, γ))⟩_D + log Z. \n\n(3) \n\nWe can ignore the log Z term since it is a constant (independent of γ). Minimizing the first term with respect to γ will therefore try to minimize the energy of the outputs averaged over the inputs, ⟨Ep(G(D, γ))⟩_D, which is highly desirable (since it has a close connection to the minimal energy principles in (Poggio et al, 1985; Clark and Yuille, 1990)). It is also important, however, to avoid the trivial solution G(D, γ) = 0 as well as solutions for which G(D, γ) is very small for most inputs. Fortunately these solutions are discouraged by the second term, ∫ PDD(S : γ) log PDD(S : γ) [dS], which corresponds to the negative entropy of the derived distribution of the network output. 
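The two competing terms of the Kullback-Leibler distance can be estimated from samples. The sketch below is our own illustration, not the authors' implementation: it assumes a toy scalar map S = G(D, γ) = γD, a standard Gaussian input distribution and a Gaussian prior Pp(S) ∝ e^{−S²/2}, and discretizes everything onto a common grid of bins.

```python
import numpy as np

rng = np.random.default_rng(0)

bins = np.linspace(-5.0, 5.0, 51)
centers = 0.5 * (bins[:-1] + bins[1:])
dx = centers[1] - centers[0]

# Prior P_p(S) ~ e^{-E_p(S)} with E_p(S) = S^2/2 (assumed Gaussian prior),
# discretized and normalized on the grid.
p_p = np.exp(-0.5 * centers**2)
p_p /= p_p.sum() * dx

def kl_terms(gamma, n_samples=100_000):
    """Histogram estimate of KL(P_DD || P_p) for the toy map S = gamma * D."""
    D = rng.normal(0.0, 1.0, n_samples)            # inputs drawn from P_D
    S = gamma * D                                  # network output G(D, gamma)
    p_dd, _ = np.histogram(S, bins, density=True)  # derived distribution P_DD
    mask = p_dd > 0
    term1 = -np.sum(p_dd[mask] * np.log(p_p[mask])) * dx  # prior-similarity term
    term2 = np.sum(p_dd[mask] * np.log(p_dd[mask])) * dx  # negative-entropy term
    return term1 + term2

# The distance is smallest when the output distribution matches the prior
# (gamma = 1) and grows when the output is too narrow or too broad.
kl_match, kl_narrow, kl_wide = kl_terms(1.0), kl_terms(0.3), kl_terms(3.0)
```

In this toy setting the trivial-solution effect is visible directly: shrinking γ towards zero lowers the prior-energy term but is penalized by the negative-entropy term.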
Thus, its minimization with respect to γ is a maximum entropy principle which will encourage variability in the outputs G(D, γ) and hence prevent the trivial solutions. \n\n3 Reformulating for Implementation. \n\nOur theory requires us to minimize the Kullback-Leibler distance, equation 2, with respect to γ. We now describe two ways in which this could be implemented using variants of stochastic learning. First observe that by substituting the form of the derived distribution into equation 2 and integrating out the S variable we obtain: \n\nKL(γ) = ∫ PD(D) log {PDD(G(D, γ) : γ) / Pp(G(D, γ))} [dD]. \n\n(4) \n\nAssuming a representative sample {Dμ : μ ∈ Λ} of inputs we can approximate KL(γ) by Σ_{μ∈Λ} log[PDD(G(Dμ, γ) : γ) / Pp(G(Dμ, γ))]. We can now, in principle, perform stochastic learning using backpropagation: pick inputs Dμ at random and update the weights γ using log[PDD(G(Dμ, γ) : γ) / Pp(G(Dμ, γ))] as the error function. \n\nTo do this, however, we need expressions for PDD(G(Dμ, γ) : γ) and its derivative with respect to γ. If the function G(D, γ) can be restricted to being 1-1 (increasing the dimensionality of the output space if necessary) then we can obtain (Yuille et al, 1992) analytic expressions PDD(G(D, γ) : γ) = PD(D)/|det(∂G/∂D)| and ∂ log PDD(G(D, γ) : γ)/∂γ = −(∂G/∂D)^{−1}(∂²G/∂D∂γ), where ^{−1} denotes the matrix inverse. Alternatively we can perform additional sampling to estimate PDD(G(D, γ) : γ) and ∂ log PDD(G(D, γ) : γ)/∂γ directly from their integral representations. (This second approach is similar to (Becker and Hinton, 1992), though they are only concerned with estimating the first and second moments of these distributions.) \n\n4 Connection to Becker and Hinton. \n\nThe Becker and Hinton method (Becker and Hinton, 1992) involves maximizing the mutual information between the outputs of two neuronal units S1, S2 [fig. 1]. 
This is given by: \n\nI(S1; S2) = H(S1) + H(S2) − H(S1, S2), \n\nwhere the first two terms correspond to maximizing the entropies of S1 and S2 while the last term forces S1 ≈ S2. \n\nBy contrast, our version tries to minimize the quantity: \n\n∫ PDD(S1, S2) log PDD(S1, S2) [dS1 dS2] − ∫ PDD(S1, S2) log Pp(S1, S2) [dS1 dS2]. \n\nIf we then ensure that Pp(S1, S2) = δ(S1 − S2) our second term will force S1 ≈ S2 and our first term will maximize the entropy of the joint distribution of S1, S2. We argue that this is effectively the same as (Becker and Hinton, 1992) since maximizing the joint entropy of S1, S2 with S1 constrained to equal S2 is equivalent to maximizing the individual entropies of S1 and S2 with the same constraint. \n\nTo be more concrete, we consider Becker and Hinton's implementation of the mutual information maximization principle in the case of units with continuous outputs. They assume that the outputs of units 1, 2 are Gaussian 1 and perform steepest descent to maximize the symmetrized form of the mutual information between S1 and S2: \n\n(1/2) log{V(S1)/V(N)} + (1/2) log{V(S2)/V(N)}, \n\n(5) \n\nwhere V(·) stands for variance over the set of inputs. They assume that the difference between the two outputs can be expressed as uncorrelated additive noise, S1 = S2 + N. We reformalize their criterion as maximizing EBH(V(S2), V(N)) where \n\nEBH(V(S2), V(N)) = log{V(S2) + V(N)} + log V(S2) − 2 log V(N). \n\n(6) \n\nFor our scheme we make similar assumptions about the distributions of S1 and S2. We see that ⟨log PDD(S1, S2)⟩ = −log{⟨S1²⟩⟨S2²⟩ − ⟨S1S2⟩²} = −log{V(S2)V(N)} (since ⟨S1S2⟩ = ⟨(S2 + N)S2⟩ = V(S2) and ⟨S1²⟩ = V(S2) + V(N)). Using the prior distribution Pp(S1, S2) ∝ e^{−r(S1−S2)²} our criterion corresponds to minimizing EYSX(V(S2), V(N)) where: \n\nEYSX(V(S2), V(N)) = −log V(S2) − log V(N) + rV(N). \n\n(7) \n\nIt is easy to see that maximizing EBH(V(S2), V(N)) will try to make V(S2) as large as possible and force V(N) to zero (recall that, by definition, V(N) ≥ 0). 
\nMinimizing our energy will try to make V(S2) as large as possible and will force V(N) to 1/r (recall that r appears as the inverse of the variance of a Gaussian prior distribution for S1 − S2, so making r large will force the prior distribution to approach δ(S1 − S2)). Thus, provided r is very large, our method will have the same effect as Becker and Hinton's. \n\n5 Application to Linear Filtering. \n\nWe now describe an analysis of these ideas for the case of linear filtering. Our approach will be contrasted with the traditional Wiener filter approach. \n\n1 We assume for simplicity that these Gaussians have zero mean. \n\nConsider a process of the form D(i) = Σ(i) + N(i), where D(i) denotes the input to the system, Σ(i) is the true signal which we would like to predict, and N(i) is the noise corrupting the signal. The resulting Wiener filter Aw(i) has Fourier transform Âw = Φ_ΣΣ/(Φ_ΣΣ + Φ_NN), where Φ_ΣΣ and Φ_NN are the power spectra of the signal and the noise respectively. \n\nBy contrast, let us extract a linear filter Ab by applying our criterion. In the case that the noise and signal are independent zero mean Gaussian distributions this filter can be calculated explicitly (Yuille et al, 1992). Its Fourier transform has squared magnitude given by |Âb|² = Φ_ΣΣ/(Φ_ΣΣ + Φ_NN). Thus our filter can be thought of as the square root of the Wiener filter. \n\nIt is important to realize that although our derivation assumed additive Gaussian noise our system would not need to make any assumptions about the noise distribution. Instead our system would merely need to assume that the filter was linear and would then automatically obtain the \"correct\" result for the additive Gaussian noise case. We conjecture that the system might detect non-Gaussian noise by finding it impossible to get zero Kullback-Leibler distance with the linear ansatz. 
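The relation between the two filters is easy to see numerically. The snippet below is a sketch with toy spectra of our own choosing (a 1/f² signal spectrum and flat white noise, neither taken from the paper): it builds the Wiener filter in the frequency domain and the square-root filter satisfying |Âb|² = Âw.

```python
import numpy as np

# Assumed toy power spectra (our illustration, not the paper's):
freqs = np.linspace(0.01, 1.0, 100)
phi_ss = 1.0 / freqs**2            # 1/f^2 signal power spectrum
phi_nn = np.full_like(freqs, 0.5)  # flat (white) noise power spectrum

A_w = phi_ss / (phi_ss + phi_nn)   # Wiener filter
A_b = np.sqrt(A_w)                 # |A_b|^2 = A_w: square root of the Wiener filter

# Both filters pass low frequencies (high SNR) and attenuate high ones,
# but the square-root filter attenuates less at every frequency.
assert np.all((A_w <= A_b) & (A_b <= 1.0))
```

Since 0 < Âw ≤ 1 at every frequency, taking the square root always brings the gain closer to 1, so the square-root filter is uniformly the gentler of the two.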
\n\n6 Conclusion \n\nThe goal of this paper was to introduce a Bayesian approach to self-organization using prior assumptions about the signal as an organizing principle. We argued that it is a natural generalization of the criterion of maximizing mutual information assuming spatial coherence (Becker and Hinton, 1992). Using our principle it should be possible to self-organize Bayesian theories of vision, assuming that the priors are known, the network is capable of representing the appropriate functions and the learning algorithm converges. There will also be problems if the probability distributions of the true signal and the distractor are too similar. \n\nIf the prior is not correct then it may be possible to detect this by evaluating the goodness of the Kullback-Leibler fit after learning 2. This suggests a strategy whereby the system increases the complexity of the priors until the Kullback-Leibler fit is sufficiently good (this is somewhat similar to an idea proposed by Mumford (Mumford, 1992)). This is related to the idea of competitive priors in vision (Clark and Yuille, 1990). One way to implement this would be for the prior probability itself to have a set of adjustable parameters that would enable it to adapt to different classes of scenes. We are currently (Yuille et al, 1992) investigating this idea and exploring its relationship to Hidden Markov Models. \n\nWays to implement the theory, using variants of stochastic learning, were described. We sketched the relation to Becker and Hinton. \n\nAs an illustration of our approach we derived the filter that our criterion would give for filtering out additive Gaussian noise (possibly the only analytically tractable case). This has a very interesting relation to the standard Wiener filter. 
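The use of a Kullback-Leibler distance as a diagnostic can be sketched concretely. The snippet below is our own hypothetical illustration (the function name and toy data are assumptions, not the authors' code): it estimates the distance between a discretized joint distribution P(x, y) and the product of its marginals P(x)P(y), which is near zero exactly when x and y are empirically independent.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_joint_vs_product(x, y, bins=10):
    """Histogram estimate of KL(P(x, y) || P(x)P(y)); near zero iff x and y
    are (empirically) independent."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y)
    mask = joint > 0                       # p log p terms vanish where p = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

x = rng.normal(size=50_000)
y_indep = rng.normal(size=50_000)          # independent of x
y_dep = x + 0.1 * rng.normal(size=50_000)  # strongly dependent on x

kl_dep = kl_joint_vs_product(x, y_dep)
kl_indep = kl_joint_vs_product(x, y_indep)
```

For the independent pair the estimate stays near zero (up to a small positive bias from finite sampling), while the dependent pair produces a distance that is orders of magnitude larger.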
\n\n2This is reminiscent of Barlow's suspicious coincidence detectors (Barlow, 1993), where \nwe might hope to determine if two variables x & yare independent or not by calculating \nthe Kullback-Leibler distance between the joint distribution P(x, y) and the product of \nthe individual distributions P( x) P(y). \n\n\f1008 \n\nYuille, Smirnakis, and Xu \n\nAcknowledgements \n\nWe would like to thank DARPA for an Air Force contract F49620-92-J-0466. Con(cid:173)\nversations with Dan Kersten and David Mumford were highly appreciated. \n\nReferences \n\nJ.J. Atick and A.N. Redlich. \nNeural Computation. Vol. 2, No.3, pp 308-320. Fall. 1990. \n\n\"Towards a Theory of Early Visual Processing\". \n\nH.B. Barlow. \"What is the Computational Goal of the Neocortex?\" To appear in: \nLarge scale neuronal theories of the brain. Ed. C. Koch. MIT Press. 1993. \n\nS. Becker and G.E. Hinton. \"Self-organizing neural network that discovers surfaces \nin random-dot stereograms\". Nature, Vol 355. pp 161-163. Jan. 1992. \n\nJ .J. Clark and A.L. Yuille. Data Fusion for Sensory Information Processing \nSystems. Kluwer Academic Press. Boston/Dordrecht/London. 1990. \n\nR. Held. \"Visual development in infants\". In The encyclopedia of neuroscience, \nvol. 2. Boston: Birkhauser. 1987. \n\nK. Hornik, S. Stinchocombe and H. White. \"Multilayer feed-forward networks are \nuniversal approximators\". Neural Networks 4, pp 251-257. 1991. \n\nD. Kersten, A.J. O'Toole, M.E. Sereno, D.C. Knill and J .A. Anderson. \"Associative \nlearning of scene parameters from images\". Optical Society of America, Vol. 26, \nNo. 23, pp 4999-5006. 1 December, 1987. \n\nD. Marr. Vision. W.H . Freeman and Company. San Francisco. 1982. \n\nD. Mumford. \nPreprint. Harvard University. 1992. \n\n\"Pattern Theory: a unifying perspective\". Dept. Mathematics \n\nK. Nakayama and S. Shimojo. \"Experiencing and Perceiving Visual Surfaces\". \nScience. Vol. 257, pp 1357-1363. 4 September. 1992. \n\nT. Poggio, V. 
Torre and C. Koch. \"Computational vision and regularization theory\". Nature, Vol. 317, pp 314-319. 1985. \n\nA.L. Yuille, S.M. Smirnakis and L. Xu. \"Bayesian Self-Organization\". Harvard Robotics Laboratory Technical Report. 1992. \n", "award": [], "sourceid": 809, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}, {"given_name": "Stelios", "family_name": "Smirnakis", "institution": null}, {"given_name": "Lei", "family_name": "Xu", "institution": null}]}