{"title": "Neural Networks for Model Matching and Perceptual Organization", "book": "Advances in Neural Information Processing Systems", "page_first": 618, "page_last": 625, "abstract": null, "full_text": "NEURAL NETWORKS FOR MODEL MATCHING AND PERCEPTUAL ORGANIZATION \n\nGene Gindi, EE Department, Yale University, New Haven, CT 06520 \nEric Mjolsness, CS Department, Yale University, New Haven, CT 06520 \nP. Anandan, CS Department, Yale University, New Haven, CT 06520 \n\nABSTRACT \n\nWe introduce an optimization approach for solving problems in computer vision that involve multiple levels of abstraction. Our objective functions include compositional and specialization hierarchies. We cast vision problems as inexact graph matching problems, formulate graph matching in terms of constrained optimization, and use analog neural networks to perform the optimization. The method is applicable to perceptual grouping and model matching. Preliminary experimental results are shown. \n\n1 Introduction \n\nThe minimization of objective functions is an attractive way to formulate and solve visual recognition problems. Such formulations are parsimonious, being expressible in several lines of algebra, and may be converted into artificial neural networks which perform the optimization. Advantages of such networks, including speed, parallelism, cheap analog computing, and biological plausibility, have been noted [Hopfield and Tank, 1985]. \nAccording to a common view of computational vision, recognition involves the construction of abstract descriptions of data governed by a data base of models. Abstractions serve as reduced descriptions of complex data useful for reasoning about the objects and events in the scene. The models indicate what objects and properties may be expected in the scene. 
The complexity of visual recognition demands that the models be organized into compositional hierarchies, which express object-part relationships, and specialization hierarchies, which express object-class relationships. In this paper, we describe a methodology for expressing model-based visual recognition as the constrained minimization of an objective function. Model-specific objective functions are used to govern the dynamic grouping of image elements into recognizable wholes. Neural networks are used to carry out the minimization. \n\n*This work was supported in part by AFOSR grant F49620-88-C-0025, by DARPA grant DAAA15-87-K-0001, and by ONR grant N00014-86-0310. \n\nPrevious work on optimization in vision has typically been restricted to computations occurring at a single level of abstraction and/or involving a single model [Barrow and Popplestone, 1971; Hummel and Zucker, 1983; Terzopoulos, 1986]. For example, surface interpolation schemes, even when they include discontinuities [Terzopoulos, 1986], do not include explicit models for physical objects whose surface characteristics determine the expected degree of smoothness. By contrast, heterogeneous and hierarchical model bases often occur in non-optimization approaches to visual recognition [Hanson and Riseman, 1986], including some which use neural networks [Ballard, 1986]. We attempt to obtain greater expressibility and efficiency by incorporating hierarchies of abstraction into the optimization paradigm. \n\n2 Casting Model Matching as Optimization \n\nWe consider a type of objective function which, when minimized by a neural network, is capable of expressing many of the ideas found in Frame systems in Artificial Intelligence [Minsky, 1975]. 
These \"Frameville\" objective functions [Mjolsness et al., 1988; Mjolsness et al., 1989] are particularly well suited to applications in model-based vision, with frames acting as few-parameter abstractions of visual objects or perceptual groupings thereof. Each frame contains real-valued parameters, pointers to other frames, and pointers to predefined models (e.g. models of objects in the world) which determine what portion of the objective function acts upon a given frame. \n\n2.1 Model Matching as Graph Matching \n\nModel matching involves finding a match between a set of frames, ultimately derived from visual data, and the predefined static models. A set of pointers represents object-part relationships between frames, and is encoded as a graph or sparse matrix called ina. That is, ina_{ij} = 0 unless frame j is \"in\" frame i as one of its parts, in which case ina_{ij} = 1 is a \"pointer\" from j to i. The expected object-part relationships between the corresponding models are encoded as a fixed graph or sparse matrix INA. A form of inexact graph matching is required: ina should follow INA as much as is consistent with the data. \nA sparse match matrix M (0 \\le M_{\\alpha i} \\le 1) of dynamic variables represents the correspondence between model \\alpha and frame i. To find the best match between the two graphs one can minimize a simple objective function for this match matrix, due to Hopfield [Hopfield, 1984] (see also [Feldman et al., 1988; Malsburg, 1986]), which just counts the number of consistent rectangles (see Figure 1a): \n\nE(M) = -\\sum_{\\alpha\\beta} \\sum_{ij} INA_{\\alpha\\beta} \\, ina_{ij} \\, M_{\\alpha i} M_{\\beta j}.   (1) \n\nThis expression may be understood as follows: for model \\alpha and frame i, the match value M_{\\alpha i} is to be increased if the neighbors of \\alpha (in the INA graph) match to the neighbors of i (in the ina graph). 
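The rectangle-counting objective in Equation 1 can be stated concretely. Below is a minimal NumPy sketch (our own illustration, not the authors' implementation; the function name and the dense-array representation are assumptions) that evaluates E(M) for given INA, ina, and M matrices:

```python
import numpy as np

def match_energy(INA, ina, M):
    """E(M) = -sum_{ab} sum_{ij} INA[a,b] * ina[i,j] * M[a,i] * M[b,j].

    INA: (models x models) static graph, INA[a,b] = 1 if model b is a part of model a.
    ina: (frames x frames) data graph, ina[i,j] = 1 if frame j is a part of frame i.
    M:   (models x frames) match matrix with entries in [0, 1].
    Each consistent rectangle (model-part pair matched to frame-part pair)
    lowers the energy by the product of the two match values.
    """
    return -np.einsum('ab,ij,ai,bj->', INA, ina, M, M)

# One model with one part, one frame with one part, matched by the identity:
# exactly one consistent rectangle is counted.
INA = np.zeros((2, 2)); INA[0, 1] = 1.0   # model 1 is a part of model 0
ina = np.zeros((2, 2)); ina[0, 1] = 1.0   # frame 1 is a part of frame 0
M = np.eye(2)                             # model 0 <-> frame 0, model 1 <-> frame 1
```

As the text notes, this energy alone is trivially minimized by an all-ones M, which is why the syntactic constraints of the next section are needed.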
\n\nFigure 1: (a) Example of the Frameville rectangle rule. Shows the rectangle relationship between frames (triangles) representing a wing of a plane, and the plane itself. Circles denote dynamic variables, ovals denote models, and triangles denote frames. For the plane and wing models, the first few parameters of a frame are interpreted as position, length, and orientation. (b) Frameville sibling competition among parts. The match variables along the shaded lines (M_{3,9} and M_{2,7}) are suppressed in favor of those along the solid lines (M_{2,9} and M_{3,7}). \n\nNote that E(M) as defined above can be trivially minimized by setting all the elements of the match matrix to unity. However, doing so would violate additional syntactic constraints of the form h(M) = 0 which are imposed on the optimization, either exactly [Platt and Barr, 1988] or as penalty terms \\frac{c}{2} h^2(M) added to the objective function [Hopfield and Tank, 1985]. Originally the syntactic constraints simply meant that each frame should match one model and vice versa, as in [Hopfield and Tank, 1985]. But in Frameville, a frame can match both a model and one of its specializations (described later), and a single model can match any number of instances or frames. In addition one can usually formulate constraints stating that if a model matches a frame then two distinct parts of the same model must match two distinct part frames, and vice versa. We have found the following formulation to be useful: \n\n\\sum_{\\alpha} INA_{\\alpha\\beta} M_{\\alpha i} - \\sum_{j} ina_{ij} M_{\\beta j} = 0, \\quad \\forall \\beta, i   (2) \n\\sum_{i} ina_{ij} M_{\\alpha i} - \\sum_{\\beta} INA_{\\alpha\\beta} M_{\\beta j} = 0, \\quad \\forall \\alpha, j   (3) \n\nwhere the first sum in each equation is necessary when several high-level models (or frames) share a part. 
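Constraints (2) and (3) can be scored as quadratic penalty residuals in the style of [Hopfield and Tank, 1985]. The sketch below (illustrative code with assumed dense-array shapes, not from the paper) accumulates (c/2) h^2 over all (beta, i) and (alpha, j):

```python
import numpy as np

def syntactic_penalty(INA, ina, M, c=1.0):
    """Quadratic penalty for the syntactic constraints, Eqs. (2)-(3)."""
    # Residual of Eq. (2), indexed by (beta, i):
    #   sum_a INA[a,b] M[a,i]  -  sum_j ina[i,j] M[b,j]
    r2 = INA.T @ M - M @ ina.T
    # Residual of Eq. (3), indexed by (alpha, j):
    #   sum_i ina[i,j] M[a,i]  -  sum_b INA[a,b] M[b,j]
    r3 = M @ ina - INA @ M
    # (c/2) h^2, summed over every constraint instance.
    return 0.5 * c * (np.sum(r2 ** 2) + np.sum(r3 ** 2))

# Same toy graphs as before: one model-part pair, one frame-part pair.
INA = np.zeros((2, 2)); INA[0, 1] = 1.0
ina = np.zeros((2, 2)); ina[0, 1] = 1.0
```

A match that satisfies both constraints contributes zero penalty, while the trivial all-ones match matrix, which minimizes E(M) alone, is penalized.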
(It turns out that the first sums can be forced to zero or one by other constraints.) The resulting competition is illustrated in Figure 1b. Another constraint is that M should be binary-valued, i.e., \n\nM_{\\alpha i}(1 - M_{\\alpha i}) = 0, \\quad \\forall \\alpha, i   (4) \n\nbut this constraint can also be handled by a special \"analog gain\" term in the objective function [Hopfield and Tank, 1985] together with a penalty term c \\sum_{\\alpha i} M_{\\alpha i}(1 - M_{\\alpha i}). \nIn Frameville, the ina graph actually becomes variable, and is determined by a dynamic grouping or \"perceptual organization\" process. These new variables require new constraints, starting with ina_{ij}(1 - ina_{ij}) = 0, and including many high-level constraints which we now formulate. \n\n2.2 Frames and Objective Functions \n\nFrames can be considered as bundles F_i of real-valued parameters F_{ip}, where p indexes the different parameters of a frame. For efficiency in computing complex arithmetic relationships, such as those involved in coordinate transformations, an analog representation of these parameters is used. A frame contains no information concerning its match criteria or control flow; instead, the match criteria are expressed as objective functions and the control flow is determined by the particular choice of a minimization technique. \nIn Figure 1a, in order for the rectangle (1,4,9,2) to be consistent, the parameters F_{4p} and F_{9p} should satisfy a criterion dictated by models 1 and 2, such as a restriction on the difference in angles appropriate for a mildly swept-back wing. Such a constraint results in the addition of the following term to the objective function: \n\n\\sum_{i,j,\\alpha,\\beta} INA_{\\alpha\\beta} \\, ina_{ij} \\, M_{\\alpha i} M_{\\beta j} H_{\\alpha\\beta}(F_i, F_j)   (5) \n\nwhere H_{\\alpha\\beta}(F_i, F_j) measures the deviation of the parameters of the data frames from that demanded by the models. The term H can express coordinate transformation arithmetic (e.g. H_{\\alpha\\beta}(x_i, x_j) = \\frac{1}{2}[x_i - x_j - \\Delta x_{\\alpha\\beta}]^2), and its action on a frame F_i 
\nis selectively controlled or \"gated\" by the M and ina variables. This is a fundamental extension of the distance metric paradigm in pattern recognition; because of the complexity of the visual world, we use an entire database of distance metrics H_{\\alpha\\beta}. \n\nFigure 2: Frameville specialization hierarchy. The plane model specializes along ISA links to a propeller plane or a jet plane, and correspondingly the wing model specializes to prop-wing or jet-wing. Sibling match variables M_{6,4} and M_{4,4} compete, as do M_{7,9} and M_{5,9}. The winner in these competitions is determined by the consistency of the appropriate rectangles, e.g. if the 4-4-9-5 rectangle is more consistent than the 6-4-9-7 rectangle, then the jet model is favored over the prop model. \n\nWe index the models (and, indirectly, the data base of H metrics) by introducing a static graph of pointers ISA_{\\alpha\\beta} to act as both a specialization hierarchy and a discrimination network for visual recognition. A frame may simultaneously match a model and just one of its specializations: \n\nM_{\\alpha i} - \\sum_{\\beta} ISA_{\\alpha\\beta} M_{\\beta i} = 0.   (6) \n\nAs a result, ISA siblings compete for matches to a given frame (see Figure 2); this competition allows the network to act as a discrimination tree. \nFrameville networks have great expressive power, but have a potentially serious problem with cost: for n data frames and m models there may be O(nm + n^2) neurons, widely interconnected but sparsely activated. The number of connections is at most the number of monomials in the polynomial objective function, namely n^2 m f, where f is the fan-out of the INA graph. One solution to the cost problem, used in the line grouping experiments reported in [Mjolsness et al., 1989], is to restrict the flexibility of the frame system by setting most M and ina neurons to zero permanently. 
The few remaining variables can form an efficient data structure such as a pyramid in vision. A more flexible solution might enforce the sparseness constraints on the M and ina neurons during minimization, as well as at the fixed point. Then large savings could result from using \"virtual\" neurons (and connections) which are created and destroyed dynamically. This and other cost-cutting methods are a subject of continuing research. \n\n3 Experimental Results \n\nWe describe here experiments involving the recognition of simple stick figures. (Other experiments involving the perceptual grouping of lines are reported in [Mjolsness et al., 1989].) The input data (Figure 3(a)) are line segments parameterized by location x, y and orientation \\theta, corresponding to frame parameters F_{jp} (p = 1, 2, 3). As seen in Figure 3(b), there are two high-level models, \"T\" and \"L\" junctions, each composed of three low-level segments. The task is to recognize instances of \"T\", \"L\", and their parts, in a translation-invariant manner. \nThe parameter check term H_{\\alpha\\beta} of Equation 5 achieves translation invariance by checking the location and orientation of a given part relative to a designated main part, and is given by: \n\nH_{\\alpha\\beta}(F_i, F_j) = \\sum_{p} (F_{ip} - F_{jp} - \\Delta^p_{\\alpha\\beta})^2   (7) \n\nHere F_{jp} and F_{ip} are the slots of a low-level segment frame and a high-level main part, respectively, and the quantity \\Delta^p_{\\alpha\\beta} is model information that stores coordinate differences. (Rotation invariance can also be formulated if a different parameterization is used.) It should be noted that absence of the main part does not preclude recognition of the high-level model. 
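Equation 7 is a per-parameter squared deviation, gated elsewhere in the objective by the M and ina variables. A small sketch (illustrative only; the function name and the (x, y, theta) slot order are assumptions based on the text):

```python
import numpy as np

def parameter_check(F_main, F_part, delta):
    """Eq. (7): H(F_i, F_j) = sum_p (F_ip - F_jp - Delta^p)^2.

    F_main: slots (x, y, theta) of the high-level main-part frame F_i.
    F_part: slots of a low-level segment frame F_j.
    delta:  model-stored coordinate differences Delta^p; subtracting the
            expected offset makes the check translation-invariant.
    """
    F_main = np.asarray(F_main, dtype=float)
    F_part = np.asarray(F_part, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return float(np.sum((F_main - F_part - delta) ** 2))
```

For example, a segment one unit to the left of the main part, with matching orientation, incurs zero cost when the model stores delta = (1, 0, 0); with delta = (0, 0, 0) the same pair is penalized.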
\nWe used the unconstrained optimization technique in [Hopfield and Tank, 1985] and achieved improved results by including terms demanding that at most one model match a given frame, and that at most one high-level frame include a given low-level frame as its part [Mjolsness et al., 1989]. \nFigure 3(c) shows results of attempts to recognize the junctions in Figure 3(a). When initialized to random values, the network becomes trapped in unfavorable local minima of the fifth-order objective function. (But with only a single high-level model in the database, the system recognizes a shape amid noise.) If, however, the network is given a \"hint\" in the form of an initial state with main parts and high-level matches set correctly, the network converges to the correct state. \nThere is a great deal of unexploited freedom in the design of the model base and its objective functions; there may be good design disciplines which avoid introducing spurious local minima. For example, it may be possible to use ISA and INA hierarchies to guide a network to the desired local minimum. \n\nFigure 3: (a) Input data consists of unit-length segments oriented horizontally or vertically. The task is translation-invariant recognition of three segments forming a \"T\" junction (e.g. 
sticks 1, 2, 3) or an \"L\" junction (e.g. sticks 5, 6, 7) amid extraneous noise sticks. (b) Structure of network. Models occur at two levels. INA links are shown for a \"T\". Each frame has three parameters: position x, y and orientation \\theta. Also shown are some match and ina links. The bold lines highlight a possible consistency rectangle. (c) Experimental result. The value of each dynamical variable is displayed as the relative area of the shaded portion of a circle. Matrix M_{\\beta j} indicates low-level matches and M_{\\alpha i} indicates high-level matches. Grouping of low-level to high-level frames is indicated by the ina matrix. The parameters of the high-level frames are displayed in the matrix F_{ip} of linear analog neurons. (The parameters of the low-level frames, held fixed, are not displayed.) The few neurons circumscribed by a square, corresponding to correct matches for the main parts of each model, are clamped to a value near unity. Shaded circles indicate the final correct state. \n\n4 Conclusion \n\nFrameville provides opportunities for integrating all levels of vision in a uniform notation which yields analog neural networks. Low-level models such as fixed convolution filters just require analog arithmetic for frame parameters, which is provided. High-level vision typically requires structural matching, also provided. Qualitatively different models may be integrated by specifying their interactions, H_{\\alpha\\beta}. \n\nAcknowledgements \n\nWe thank J. Utans, J. Ockerbloom and C. Garrett for the Frameville simulations. \n\nReferences \n\n[1] Dana Ballard. Cortical connections and parallel processing: structure and function. Behavioral and Brain Sciences, vol. 9:67-120, 1986. \n\n[2] Harry G. Barrow and R. J. Popplestone. Relational descriptions in picture processing. In D. Mitchie, editor, Machine Intelligence 6, Edinburgh University Press, 1971. \n\n[3] Jerome A. 
Feldman, Mark A. Fanty, and Nigel H. Goddard. Computing with structured neural networks. IEEE Computer, 91, March 1988. \n\n[4] Allen R. Hanson and E. M. Riseman. A methodology for the development of general knowledge-based vision systems. In M. A. Arbib and A. R. Hanson, editors, Vision, Brain, and Cooperative Computation, MIT Press, 1986. \n\n[5] J. J. Hopfield. Personal communication. October 1984. \n\n[6] J. J. Hopfield and D. W. Tank. 'Neural' computation of decisions in optimization problems. Biological Cybernetics, vol. 52:141-152, 1985. \n\n[7] Robert A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on PAMI, vol. PAMI-5:267-287, May 1983. \n\n[8] Marvin L. Minsky. A framework for representing knowledge. In P. H. Winston, editor, The Psychology of Computer Vision, McGraw-Hill, 1975. \n\n[9] Eric Mjolsness, Gene Gindi, and P. Anandan. Optimization in Model Matching and Perceptual Organization: A First Look. Technical Report YALEU/DCS/RR-634, Yale University, June 1988. \n\n[10] Eric Mjolsness, Gene Gindi, and P. Anandan. Optimization in Model Matching and Perceptual Organization. Neural Computation, to appear. \n\n[11] John C. Platt and Alan H. Barr. Constraint methods for flexible models. Computer Graphics, 22(4), August 1988. Proceedings of SIGGRAPH '88. \n\n[12] Demetri Terzopoulos. Regularization of inverse problems involving discontinuities. IEEE Transactions on PAMI, vol. PAMI-8:413-424, 1986. \n\n[13] Christoph von der Malsburg and Elie Bienenstock. Statistical coding and short-term synaptic plasticity: a scheme for knowledge representation in the brain. In Disordered Systems and Biological Organization, pages 247-252, Springer-Verlag, 1986. 
", "award": [], "sourceid": 108, "authors": [{"given_name": "Eric", "family_name": "Mjolsness", "institution": null}, {"given_name": "Gene", "family_name": "Gindi", "institution": null}, {"given_name": "P.", "family_name": "Anandan", "institution": null}]}