{"title": "Learning with Preknowledge: Clustering with Point and Graph Matching Distance Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 720, "abstract": null, "full_text": "Learning with Preknowledge: \n\nClustering with Point and Graph \n\nMatching Distance Measures \n\nSteven Gold!, Anand Rangarajan1 and Eric Mjolsness2 \n\nDepartment of Computer Science \n\nYale University \n\nNew Haven, CT 06520-8285 \n\nAbstract \n\nPrior constraints are imposed upon a learning problem in the form \nof distance measures. Prototypical 2-D point sets and graphs are \nlearned by clustering with point matching and graph matching dis(cid:173)\ntance measures. The point matching distance measure is approx. \ninvariant under affine transformations - translation, rotation, scale \nand shear - and permutations. It operates between noisy images \nwith missing and spurious points. The graph matching distance \nmeasure operates on weighted graphs and is invariant under per(cid:173)\nmutations. Learning is formulated as an optimization problem . \nLarge objectives so formulated ('\" million variables) are efficiently \nminimized using a combination of optimization techniques - alge(cid:173)\nbraic transformations, iterative projective scaling, clocked objec(cid:173)\ntives, and deterministic annealing. \n\n1 \n\nIntroduction \n\nWhile few biologists today would subscribe to Locke's description of the nascent \nmind as a tabula rasa, the nature of the inherent constraints - Kant's preknowl-\n\n1 E-mail address of authors: lastname-firstname@cs.yale.edu \n2Department of Computer Science and Engineering, University of California at San \n\nDiego (UCSD), La Jolla, CA 92093-0114. E-mail: emj@cs.ucsd.edu \n\n\f714 \n\nSteven Gold, Anand Rangarajan, Eric Mjolsness \n\nedge - that helps organize our perceptions remains much in doubt. 
Recently, the importance of such preknowledge for learning has been convincingly argued from a statistical framework [Geman et al., 1992]. Researchers have proposed that our brains may incorporate preknowledge in the form of distance measures [Shepard, 1989]. The neural network community has begun to explore this idea via tangent distance [Simard et al., 1993], model learning [Williams et al., 1993] and point matching distances [Gold et al., 1994]. However, only the point matching distances have been invariant under permutations. Here we extend that work by enhancing both the scope and function of those distance measures, significantly expanding the problem domains where learning may take place.

We learn objects consisting of noisy 2-D point sets or noisy weighted graphs by clustering with point matching and graph matching distance measures. The point matching measure is approximately invariant under permutations and affine transformations (separately decomposed into translation, rotation, scale and shear) and operates on point sets with missing or spurious points. The graph matching measure is invariant under permutations. These distance measures and others like them may be constructed using Bayesian inference on a probabilistic model of the visual domain. Such models introduce a carefully designed bias into our learning, which reduces its generality outside the problem domain but increases its ability to generalize within the problem domain. (From a statistical viewpoint, outside the problem domain it increases bias, while within the problem domain it decreases variance.) The resulting distance measures are similar to some of those hypothesized for cognition.

The distance measures and learning problem (clustering) are formulated as objective functions.
Fast minimization of these objectives is achieved by a combination of optimization techniques - algebraic transformations, iterative projective scaling, clocked objectives, and deterministic annealing. Combining these techniques significantly increases the size of problems which may be solved with recurrent network architectures [Rangarajan et al., 1994]. Even on single-cpu workstations, non-linear objectives with a million variables can routinely be minimized. With these methods we learn prototypical examples of 2-D point sets and graphs from randomly generated experimental data.

2 Distance Measures in Unsupervised Learning

2.1 An Affine Invariant Point Matching Distance Measure

The first distance measure quantifies the degree of dissimilarity between two unlabeled 2-D point images, irrespective of bounded affine transformations, i.e. differences in position, orientation, scale and shear. The two images may have different numbers of points. The measure is calculated with an objective that can be used to find correspondence and pose for unlabeled feature matching in vision. Given two sets of points $\{X_j\}$ and $\{Y_k\}$, one can minimize the following objective to find the affine transformation and permutation which best maps $Y$ onto $X$:

$$E_{pm}(m,t,A) = \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}\,\|X_j - t - A \cdot Y_k\|^2 + g(A) - \alpha\sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}$$

with constraints: $\forall j\ \sum_{k=1}^{K} m_{jk} \le 1$, $\forall k\ \sum_{j=1}^{J} m_{jk} \le 1$, $\forall jk\ m_{jk} \ge 0$.

$A$ is decomposed into scale, rotation, vertical shear and oblique shear components. $g(A)$ regularizes our affine transformation - bounding the scale and shear components. $m$ is a fuzzy correspondence matrix which matches points in one image with corresponding points in the other image.
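In outline, the objective above is a weighted sum of squared residuals minus a reward for match strength. The following sketch evaluates it for given $m$, $t$, $A$; the function name and the treatment of $g(A)$ as a precomputed scalar are illustrative choices, not taken from the paper.

```python
def epm(X, Y, m, t, A, alpha=1.0, gA=0.0):
    """Point matching objective E_pm: squared error of each pairing
    X_j vs. (t + A*Y_k), weighted by the fuzzy correspondence m_jk,
    plus the regularizer value gA, minus alpha per unit of match."""
    J, K = len(X), len(Y)
    total = gA
    for j in range(J):
        for k in range(K):
            # apply the 2x2 affine map A to point Y_k
            ay = (A[0][0] * Y[k][0] + A[0][1] * Y[k][1],
                  A[1][0] * Y[k][0] + A[1][1] * Y[k][1])
            dx = X[j][0] - t[0] - ay[0]
            dy = X[j][1] - t[1] - ay[1]
            total += m[j][k] * (dx * dx + dy * dy) - alpha * m[j][k]
    return total
```

With identical point sets, an identity transform and a perfect permutation matrix, the residuals vanish and the objective is simply $-\alpha J$.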
The inequality constraint on $m$ allows for null matches - that is, a given point in one image may match no corresponding point in the other image. The $\alpha$ term biases the objective towards matches.

Then given two sets of points $\{X_j\}$ and $\{Y_k\}$, the distance between them is defined as:

$$D(\{X_j\}, \{Y_k\}) = \min_{m,t,A}\big(E_{pm}(m,t,A) \mid \text{constraints on } m\big)$$

This measure is an example of a more general image distance measure derived in [Mjolsness, 1992]:

$$d(x, y) = \min_{T} d(x, T(y)) \in [0, \infty)$$

where $T$ is a set of transformation parameters introduced by a visual grammar. Using slack variables, and following the treatment in [Peterson and Soderberg, 1989; Yuille and Kosowsky, 1994], we employ Lagrange multipliers and an $x \log x$ barrier function to enforce the constraints with the following objective:

$$E_{pm}(m,t,A) = \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}\,\|X_j - t - A \cdot Y_k\|^2 + g(A) - \alpha\sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk} + \frac{1}{\beta}\sum_{j=1}^{J+1}\sum_{k=1}^{K+1} m_{jk}(\log m_{jk} - 1) + \sum_{j=1}^{J}\mu_j\Big(\sum_{k=1}^{K+1} m_{jk} - 1\Big) + \sum_{k=1}^{K}\nu_k\Big(\sum_{j=1}^{J+1} m_{jk} - 1\Big) \qquad (1)$$

In this objective we are looking for a saddle point. (1) is minimized with respect to $m$, $t$, and $A$, which are the correspondence matrix, translation, and affine transform, and is maximized with respect to $\mu$ and $\nu$, the Lagrange multipliers that enforce the row and column constraints for $m$.

The above can be used to define many different distance measures, since given the decomposition of $A$ it is trivial to construct measures which are invariant only under some subset of the transformations (such as rotation and translation). The regularization and $\alpha$ terms may also be individually adjusted in an appropriate fashion for a specific problem domain.

2.2 Weighted Graph Matching Distance Measures

The following distance measure quantifies the degree of dissimilarity between two unlabeled weighted graphs.
Given two graphs, represented by adjacency matrices $G_{ab}$ and $g_{ij}$, one can minimize the objective below to find the permutation which best maps $G$ onto $g$:

$$E_{gm}(m) = \sum_{a=1}^{A}\sum_{i=1}^{I}\Big(\sum_{b=1}^{B} G_{ab} m_{bi} - \sum_{j=1}^{J} m_{aj} g_{ji}\Big)^2$$

with constraints: $\forall a\ \sum_{i=1}^{I} m_{ai} = 1$, $\forall i\ \sum_{a=1}^{A} m_{ai} = 1$, $\forall ai\ m_{ai} \ge 0$. These constraints are enforced in the same fashion as in (1). An algebraic fixed-point transformation and self-amplification term further transform the objective to:

$$E_{gm}(m) = \sum_{a=1}^{A}\sum_{i=1}^{I}\Big(\mu_{ai}\Big(\sum_{b=1}^{B} G_{ab} m_{bi} - \sum_{j=1}^{J} m_{aj} g_{ji}\Big) - \frac{1}{2}\mu_{ai}^2 - \gamma\sigma_{ai} m_{ai} + \frac{\gamma}{2}\sigma_{ai}^2\Big) + \frac{1}{\beta}\sum_{a=1}^{A}\sum_{i=1}^{I} m_{ai}(\log m_{ai} - 1) + \sum_{a=1}^{A}\kappa_a\Big(\sum_{i=1}^{I} m_{ai} - 1\Big) + \sum_{i=1}^{I}\lambda_i\Big(\sum_{a=1}^{A} m_{ai} - 1\Big) \qquad (2)$$

In this objective we are also looking for a saddle point.

A second, functionally equivalent, graph matching objective is also used in the clustering problem:

$$E_{gm'}(m) = \sum_{a=1}^{A}\sum_{b=1}^{B}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ai} m_{bj} (G_{ab} - g_{ij})^2 \qquad (3)$$

with constraints: $\forall a\ \sum_{i=1}^{I} m_{ai} = 1$, $\forall i\ \sum_{a=1}^{A} m_{ai} = 1$, $\forall ai\ m_{ai} \ge 0$.

2.3 The Clustering Objective

The learning problem is formulated as follows: given a set of $I$ objects $\{X_i\}$, find a set of $A$ cluster centers $\{Y_a\}$ and match variables $\{M_{ia}\}$, defined as

$$M_{ia} = \begin{cases} 1 & \text{if } X_i \text{ is in } Y_a\text{'s cluster} \\ 0 & \text{otherwise,} \end{cases}$$

such that each object is in only one cluster, and the total distance of all the objects from their respective cluster centers is minimized. To find $\{Y_a\}$ and $\{M_{ia}\}$, minimize the cost function

$$E_{cluster}(Y, M) = \sum_{i=1}^{I}\sum_{a=1}^{A} M_{ia} D(X_i, Y_a)$$

with the constraint that $\forall i\ \sum_a M_{ia} = 1$, $\forall ia\ M_{ia} \ge 0$. $D(X_i, Y_a)$, the distance function, is a measure of dissimilarity between two objects.
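When the objects are graphs, the distance $D(X_i, Y_a)$ inside this cost comes from a graph matching objective; the simpler quadratic form (3) can be evaluated directly. A minimal sketch (the function name is ours, and no constraint handling is included):

```python
def egm_prime(G, g, m):
    """Graph matching objective (3): sum over all link pairs of
    m_ai * m_bj * (G_ab - g_ij)^2, i.e. the disagreement between
    the link weights of the two graphs under the fuzzy match m."""
    A, I = len(G), len(g)
    total = 0.0
    for a in range(A):
        for b in range(A):
            for i in range(I):
                for j in range(I):
                    d = G[a][b] - g[i][j]
                    total += m[a][i] * m[b][j] * d * d
    return total
```

If $m$ is a permutation matrix that maps $G$ exactly onto $g$, every weighted term vanishes and the distance is zero; mismatched link weights contribute their squared difference.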
The constraints on $M$ are enforced in a manner similar to that described for the distance measure, except that now only the rows of the matrix $M$ need to add to one, instead of both the rows and the columns.

$$E_{cluster}(Y, M) = \sum_{i=1}^{I}\sum_{a=1}^{A} M_{ia} D(X_i, Y_a) + \frac{1}{\beta}\sum_{i=1}^{I}\sum_{a=1}^{A} M_{ia}(\log M_{ia} - 1) + \sum_{i=1}^{I}\lambda_i\Big(\sum_{a=1}^{A} M_{ia} - 1\Big) \qquad (4)$$

Here, the objects are point sets or weighted graphs. If point sets, the distance measure $D(X_i, Y_a)$ is replaced by (1); if graphs, it is replaced by (2) or (3). Therefore, given a set of objects, $X$, we construct $E_{cluster}$ and upon finding the appropriate saddle point of that objective, we will have $Y$, their cluster centers, and $M$, their cluster memberships.

3 The Algorithm

The algorithm to minimize the clustering objectives consists of two loops - an inner loop to minimize the distance measure objective [either (1) or (2)] and an outer loop to minimize the clustering objective (4). Using coordinate descent in the outer loop results in dynamics similar to the EM algorithm [Jordan and Jacobs, 1994] for clustering. All variables occurring in the distance measure objective are held fixed during this phase. The inner loop uses coordinate ascent/descent, which results in repeated row and column projections for $m$. The minimization of $m$ and the distance measure variables [either $t$, $A$ of (1) or $\mu$, $\sigma$ of (2)] occurs in an incremental fashion; that is, their values are saved after each inner loop call from within the outer loop and are then used as initial values for the next call to the inner loop. This tracking of the values of the distance measure variables in the inner loop is essential to the efficiency of the algorithm since it greatly speeds up each inner loop optimization. Most coordinate ascent/descent phases are computed analytically, further speeding up the algorithm.
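The repeated row and column projections for $m$ amount to iterative projective scaling: alternately rescale rows and columns of a positive matrix until both constraints hold. The sketch below shows the square case only, omitting the slack row/column that handles null matches and the softmax step that produces the positive entries in the first place.

```python
def row_col_normalize(m, n_iter=50):
    """Alternating row and column normalization: repeatedly rescale
    each row of a positive matrix to sum to 1, then each column,
    driving the matrix toward a doubly stochastic one."""
    J, K = len(m), len(m[0])
    for _ in range(n_iter):
        for j in range(J):                      # project rows
            s = sum(m[j])
            m[j] = [v / s for v in m[j]]
        for k in range(K):                      # project columns
            s = sum(m[j][k] for j in range(J))
            for j in range(J):
                m[j][k] /= s
    return m
```

For a strictly positive square matrix this iteration converges to a doubly stochastic matrix, which is what the row and column constraints on $m$ require at a fixed $\beta$.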
Some local minima are avoided by deterministic annealing in both the outer and inner loops. The multi-phase dynamics may be described as a clocked objective. Let $\{D\}$ be the set of distance measure variables excluding $m$. The algorithm is as follows:

Initialize {D} to the equivalent of an identity transform, Y to random values
Begin Outer Loop
  Begin Inner Loop
    Initialize {D} with previous values
    Find m, {D} for each ia pair:
      Find m by softmax, projecting across j, then k, iteratively
      Find {D} by coordinate descent
  End Inner Loop
  Find M, Y using fixed values of m, {D} determined in inner loop:
    Find M by softmax, across i
    Find Y by coordinate descent
  Increase beta_M, beta_m
End Outer Loop

When analytic solutions are computed for Y, the outer loop takes a form similar to fuzzy ISODATA clustering, with annealing on the fuzziness parameter.

4 Methods and Experimental Results

Four series of experiments were run with randomly generated data to evaluate the learning algorithms. Point sets were clustered in the first three experiments and weighted graphs were clustered in the fourth. In each experiment a set of object models was randomly generated. Then from each object model a set of object instances was created by transforming the object model according to the problem domain assumed for that experiment. For example, an object represented by points in two-dimensional space was translated, rotated, scaled, sheared, and permuted to form a new point set. An object represented by a weighted graph was permuted. Noise was added to further distort the object. Parts of the object were deleted and spurious features (points) were added. In this manner, from a set of object models, a larger number of object instances were created.
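The instance-generation recipe for point sets can be sketched as follows (bounds as in the first experiment; deletion and spurious points are omitted, and all names are illustrative):

```python
import math
import random

def make_instance(model, t_max=0.5, rot_max=math.radians(27),
                  log_scale_max=abs(math.log(0.5)), noise_sd=0.04):
    """Build one object instance from a point-set model: a random
    translation, rotation and scale drawn uniformly within bounds,
    additive Gaussian noise, and a random permutation of the points."""
    tx = random.uniform(-t_max, t_max)
    ty = random.uniform(-t_max, t_max)
    th = random.uniform(-rot_max, rot_max)
    s = math.exp(random.uniform(-log_scale_max, log_scale_max))
    pts = [(tx + s * (math.cos(th) * x - math.sin(th) * y)
               + random.gauss(0, noise_sd),
            ty + s * (math.sin(th) * x + math.cos(th) * y)
               + random.gauss(0, noise_sd))
           for (x, y) in model]
    random.shuffle(pts)   # correspondence is unknown to the learner
    return pts
```

With all bounds and the noise set to zero, the instance is exactly the model up to a permutation, which is what the permutation-invariant distance measure is designed to absorb.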
Then with no knowledge of the original object models or cluster memberships, we clustered the object instances using the algorithms described above.

The results were evaluated by comparing the object prototypes (cluster centers) formed by each experimental run to the object models used to generate the object instances for that experiment. The distance measures used in the clustering were used for this comparison, i.e. to calculate the distance between the learned prototype and the original object. Note that this distance measure also incorporates the transformations used to create the object instances. The means and standard deviations of these distances were plotted (Figure 1) over hundreds of experiments, varying the object instance generation noise. The straight line appearing on each graph displays the effect of the noise only. It is the expected object model-object prototype distance if no transformations were applied, no features were deleted or added, and the cluster memberships of the object instances were known. It serves as an absolute lower bound on our learning algorithm. The noise was increased in each series of experiments until the curve flattened - that is, the object instances became so distorted by noise that no information about the original objects could be recovered by the algorithm.

In the first series of experiments (Figure 1a), point set objects were translated, rotated, scaled, and permuted. Initial object models were created by selecting points with a uniform distribution within a unit square. The transformations to create the object instances were selected with a uniform distribution within the following bounds - translation: ±.5, rotation: ±27°, log(scale): ±log(.5). 100 object instances were generated from 10 object models. All objects contained 20 points. The standard deviation of the Gaussian noise was varied by .02 from .02 to .16.
15 experiments were run at each noise level. The data point at each error bar represents 150 distances (15 experiments times 10 model-prototype distances per experiment).

In the second and third series of experiments (Figures 1b and 1c), point set objects were translated, rotated, scaled, sheared (obliquely and vertically), and permuted. Each object point had a 10% probability of being deleted and a 5% probability of generating a spurious point. The point sets and transformations were randomly generated as in the first experiment, except for these bounds - log(scale): ±log(.7), log(vertical shear): ±log(.7), and log(oblique shear): ±log(.7). In experiment 2, 64 object instances and 4 object models of 15 points each were used. In experiment 3, 256 object instances and 8 object models of 20 points each were used. Noise levels like those of experiment 1 were used, with 20 experiments run at each noise level in experiment 2 and 10 experiments run at each noise level in experiment 3.

In experiment 4 (Figure 1d), object models were represented by fully connected weighted graphs. The link weights in the initial object models were selected with a uniform distribution between 0 and 1.
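The graph instances for this experiment are built by node permutation plus noise on the link weights; a minimal sketch (the noise width and function name are our assumptions):

```python
import random

def make_graph_instance(G, noise_sd=0.05):
    """Build an experiment-4 style instance: relabel the nodes of the
    weighted adjacency matrix G with a random permutation p, so that
    inst[p[a]][p[b]] = G[a][b], and perturb each link weight."""
    n = len(G)
    p = list(range(n))
    random.shuffle(p)
    inst = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:  # no self-links in a fully connected model
                inst[p[a]][p[b]] = G[a][b] + random.uniform(-noise_sd,
                                                            noise_sd)
    return inst
```

With the noise set to zero, the instance is the model graph with its nodes relabeled: the multiset of link weights is unchanged, and only the permutation-invariant graph matching distance can recognize the two as identical.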
The objects were then randomly permuted to form the object instance, and uniform noise was added to the link weights. 64 object instances were generated from 4 object models consisting of 10-node graphs with 100 links. The standard deviation of the noise was varied by .01 from .01 to .12. 30 experiments were run at each noise level.

Figure 1 (four panels, each plotting model-prototype distance against the standard deviation of the noise): (a) 10 clusters, 100 point sets, 20 points each, scale, rotation, translation, 120 experiments; (b) 4 clusters, 64 point sets, 15 points each, affine, 10% deleted, 5% spurious, 140 experiments; (c) 8 clusters, 256 point sets, 20 points each, affine, 10% deleted, 5% spurious, 70 experiments; (d) 4 clusters, 64 graphs, 10 nodes each, 360 experiments.

In most experiments at low noise levels (≤ .06 for point sets, ≤ .03 for graphs), the object prototypes learned were very similar to the object models. Even at higher noise levels object prototypes similar to the object models are formed, though less consistently. Results from about 700 experiments are plotted. The objective for experiment 3 contained close to one million variables and converged in about 4 hours on an SGI Indigo workstation. The convergence times of the objectives of experiments 1, 2 and 4 were 120, 10 and 10 minutes respectively.

5 Conclusions

It has long been argued by many that learning in complex domains typically associated with human intelligence requires some type of prior structure or knowledge.
We have begun to develop a set of tools that will allow the incorporation of prior structure within learning. Our models incorporate many features needed in complex domains like vision - noise, missing and spurious features, non-rigid transformations. They can learn objects with inherent structure, like graphs. Many experiments have been run on experimentally generated data sets. Several directions for future research hold promise. One might be the learning of OCR data [Gold et al., 1995]. Second, a supervised learning stage could be added to our algorithms. Finally, the power of the distance measures can be enhanced to operate on attributed relational graphs with deleted nodes and links [Gold and Rangarajan, 1995].

Acknowledgements

ONR/DARPA: N00014-92-J-4048, AFOSR: F49620-92-J-0465 and Yale CTAN.

References

S. Geman, E. Bienenstock, and R. Doursat. (1992) Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.

S. Gold, E. Mjolsness and A. Rangarajan. (1994) Clustering with a domain-specific distance measure. In J. Cowan et al., (eds.), NIPS 6. Morgan Kaufmann.

S. Gold, C. P. Lu, A. Rangarajan, S. Pappu and E. Mjolsness. (1995) New algorithms for 2D and 3D point matching: pose estimation and correspondence. In G. Tesauro et al., (eds.), NIPS 7. San Francisco, CA: Morgan Kaufmann.

S. Gold and A. Rangarajan. (1995) A graduated assignment algorithm for graph matching. YALEU/DCS/TR-1062, Yale Univ., CS Dept.

M. I. Jordan and R. A. Jacobs. (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214.

E. Mjolsness. (1992) Visual grammars and their neural networks. SPIE Conference on the Science of Artificial Neural Networks, 1110:63-85.

C. Peterson and B. Soderberg. A new method for mapping optimization problems onto neural networks.
(1989) International Journal of Neural Systems, 1(1):3-22.

A. Rangarajan, S. Gold and E. Mjolsness. (1994) A novel optimizing network architecture with applications. YALEU/DCS/TR-1036, Yale Univ., CS Dept.

R. Shepard. (1989) Internal representation of universal regularities: A challenge for connectionism. In L. Nadel et al., (eds.), Neural Connections, Mental Computation. Cambridge, MA, London, England: Bradford/MIT Press.

P. Simard, Y. Le Cun, and J. Denker. (1993) Efficient pattern recognition using a transformation distance. In S. Hanson et al., (eds.), NIPS 5. San Mateo, CA: Morgan Kaufmann.

C. Williams, R. Zemel, and M. Mozer. (1993) Unsupervised learning of object models. AAAI Tech. Rep. FSS-93-04, Univ. of Toronto, CS Dept.

A. L. Yuille and J. J. Kosowsky. (1994) Statistical physics algorithms that converge. Neural Computation, 6:341-356.
", "award": [], "sourceid": 933, "authors": [{"given_name": "Steven", "family_name": "Gold", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}