{"title": "Improving Convergence in Hierarchical Matching Networks for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": "Improving Convergence in Hierarchical Matching Networks for Object Recognition\n\nJoachim Utans*\n\nGene Gindi\u2020\n\nDepartment of Electrical Engineering\n\nYale University\n\nP. O. Box 2157 Yale Station\n\nNew Haven, CT 06520\n\nAbstract\n\nWe are interested in the use of analog neural networks for recognizing visual objects. Objects are described by the set of parts they are composed of and their structural relationship. Structural models are stored in a database and the recognition problem reduces to matching data to models in a structurally consistent way. The object recognition problem is in general very difficult in that it involves coupled problems of grouping, segmentation and matching. We limit the problem here to the simultaneous labelling of the parts of a single object and the determination of analog parameters. This coupled problem reduces to a weighted match problem in which an optimizing neural network must minimize $E(M, p) = \\sum_{\\alpha i} M_{\\alpha i} W_{\\alpha i}(p)$, where the $\\{M_{\\alpha i}\\}$ are binary match variables for data parts $i$ to model parts $\\alpha$ and the $\\{W_{\\alpha i}(p)\\}$ are weights dependent on parameters $p$. In this work we show that by first solving for estimates $\\hat{p}$ without solving for $M_{\\alpha i}$, we may obtain good initial parameter estimates that yield better solutions for $M$ and $p$. 
\n\n*Current address: International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, utans@icsi.berkeley.edu\n\n\u2020Current address: SUNY Stony Brook, Department of Electrical Engineering, Stony Brook, NY 11784\n\nFigure 1: Stored Model for a 3-Level Compositional Hierarchy (compare Figure 3).\n\n1  Recognition via Stochastic Forward Models\n\nThe Frameville object recognition system introduced by Mjolsness et al. [5, 6, 1] makes use of a compositional hierarchy to represent stored models. The recognition problem is formulated as the minimization of an objective function. Mjolsness [3, 4] has proposed to derive the objective function describing the recognition problem in a principled way from a stochastic model that describes the objects the system is designed to recognize (a stochastic visual grammar). The description mirrors the data representation as a compositional hierarchy: at each stage the description of the object becomes more detailed as parts are added.\n\nThe stochastic model assigns a probability distribution at each stage of that process. Thus at each level of the hierarchy a more detailed description of parts in terms of their subparts is given by specifying a probability distribution for the coordinates of the subparts. Explicitly specifying these distributions allows for finer control over individual part descriptions than the rather general parameter error terms used before [1, 8]. The goal is to derive a joint probability distribution for an instance of an object and its parts as it appears in the scene. This gives the probability of observing such an object prior to the arrival of the data. 
Given an observed image, the recognition problem can be stated as a Bayesian inference problem that the neural network solves.\n\n1.1  3-Level Stochastic Model\n\nFor example, consider the model shown in Figures 1 and 3. The object and its parts are represented as line segments (sticks); the parameters are $p = (x, y, l, \\theta)^T$, with $x, y$ denoting position, $l$ the length of a stick and $\\theta$ its orientation. The model considers only a rigid translation of an object in the image.\n\nOnly one model is stored. From a central position $p = (x, y, l, \\theta)$, itself chosen from a uniform density, the $N_\\beta$ parts at the first level are placed. Their structural relationship is stored as coordinates $u_\\beta$ in an object-centered coordinate frame, i.e. relative to $p$. While placing the parts, Gaussian-distributed noise with mean 0 is added to the position coordinates to capture the notion of natural variation of the object's shape. The variance is coordinate specific, but we assume the same distribution for the $x$ and $y$ coordinates, $\\sigma_{\\beta x}^2$; $\\sigma_{\\beta l}^2$ is the variance for the length component and $\\sigma_{\\beta \\theta}^2$ for the relative angle. In addition, here we assume for simplicity that all parts are independently distributed. Each of the parts $\\beta$ is composed of subparts. For simplicity of notation, we assume that each part $\\beta$ is composed of the same number of subparts $N_m$ (note that the index $\\gamma$ in Figure 2 corresponds to the double index $\\beta m$, which keeps track of which part $\\beta$ the subpart $\\beta m$ belongs to on the model side, i.e. the index $\\beta m$ denotes the $m$th subpart of part $\\beta$). 
The next step models the unordering of parts in the image via a permutation matrix $M$, chosen with probability $P(M)$, by which their identity is lost. If this step were omitted, the recognition problem would reduce to the problem of estimating part parameters because the parts would already be labeled.\n\nFrom the grammar we compute the final joint probability distribution (all constant terms are collected in a constant $C$):\n\n$P(M, \\{p_{\\beta m}\\}, \\{p_\\beta\\}, p) =$\n\n1.2  Frameville Architecture for Part Labelling within a Single Object\n\nThe stochastic forward model for the part labelling problem with only a single object present in the scene translates into a reduced Frameville architecture as depicted in Figure 2. The compositional hierarchy parallels the steps in the stochastic model as parts are added at each level. Match variables appear only at the lowest level, corresponding to the permutation step of the grammar. Parts in the image must be matched to model parts, and parts found to belong to the stored object must be grouped together.\n\nThe single match neuron $M_{\\alpha i}$ at the highest level can be set to unity since we assume we know the object's identity and only a single object is present. Similarly, all terms $ina_{ij}$ from the first to the second level can be set to unity for the correct grouping since the grouping is known at this point from the forward model description. In addition, at the intermediate (second) level, we may set $M_{\\beta j} = 1$ for $\\beta = j$ and $M_{\\beta j} = 0$ otherwise with no loss of generality. These mid-level frames may be matched ahead of time, but their parameters must be computed from data. Introducing a part permutation at the intermediate levels is thus redundant. Given this, an additional simplification of the grouping variables $ina$ at the lowest (third) level is possible. 
Since parts are pre-matched at all but the lowest level, $ina_{jk}$ can be expressed in terms of the part match $M_{\\gamma k}$ as $ina_{jk} = M_{\\gamma k} INA_{\\gamma \\beta} M_{\\beta j}$, and explicitly representing $ina_{jk}$ as variables is not necessary.\n\nThe inputs to the system are the $\\{p_k\\}$; recognition involves finding the parameters $p$ and $\\{p_j\\}$ as well as the labelling of parts $M$.\n\nFigure 2: Frameville Architecture for the Stochastic Model. The 3-level grammar leads to a reduced \"Frameville\" style network architecture: a single model is stored on the model side and only one instance of the model is present in the input data. The ovals on the model side represent the object, its parts and subparts (compare Figure 1); the arcs INA represent their structural relationship. On the data side, the triangles represent parameter vectors (or frames) describing an instance of the object in the scene. At the lowest level the $p_k$ represent the input data; parameters at higher levels in the hierarchy must be computed by the network (represented as bold triangles). $ina$ represents the grouping of parts on the data side (see text). The horizontal lines represent assignments from frames on the data side to nodes on the model side. At the intermediate level, frames are prematched to the corresponding parts on the model side; match variables are necessary only at the lowest level (represented as bold lines with circles).\n\n
Thus, from Bayes' theorem,\n\n$$P(M, p, \\{p_j\\} \\mid \\{p_k\\}) = \\frac{P(\\{p_k\\} \\mid M, p, \\{p_j\\}) P(M, p, \\{p_j\\})}{P(\\{p_k\\})} \\propto P(M, p, \\{p_j\\}, \\{p_k\\}) \\qquad (2)$$\n\nand recognition reduces to finding the most probable values for $p$, $\\{p_j\\}$ and $M$ given the data:\n\n$$\\arg\\max_{M, p, \\{p_j\\}} P(M, p, \\{p_j\\}, \\{p_k\\}) \\qquad (3)$$\n\nSolving the inference problem involves finding the MAP estimate and is equivalent to minimizing the exponent in equation (1) with respect to $M$, $p$ and $\\{p_j\\}$.\n\n2  Bootstrap: Coarse Scale Hints to Initialize the Network\n\n2.1  Compositional Hierarchy and Scale Space\n\nIn some labelling approaches found in the vision literature, an object is first labelled at the coarse, low-resolution level and approximate parameters are found. In this top-down approach the information at the higher, more abstract levels is used to select initial values for the parts at the next lower level of abstraction. The segmentation and labelling at this next lower level is thus not done blindly; rather, it is strongly influenced contextually by the results at the level above.\n\nFigure 3: Compositional Hierarchy vs. Scale Space Hierarchy. A compositional hierarchy can represent a scale space hierarchy. At successive levels in the hierarchy, more and more detail is added to the object. (Axes: spatial scale, abstraction.)\n\nIn fact, in very general terms such a scheme was described by Marr and Nishihara [2]. 
\nThey advocate in essence  a hierarchical model base in which a shape is first  matched \nto  the  highest  levels,  and  defaults  in  terms  of relative object-based  parameters of \nparts at the next level are recalled from memory.  These defaults then serve as initial \nvalues in  an unspecified  segmentation algorithm that derives  part parameters;  this \nstep  is  repeated  recursively  until the lowest  level  is  reached. \n\nNote that the highest level of abstractions correspond to the coarsest levels of spatial \nscale.  There is nothing in the design of the model base that demands this, but invari(cid:173)\nably, elements at the top of a compositional hierarchy are of coarser scale since they \nmust  both  include  the  many  subparts  below,  and  summarize  this  inclusion  with \nrelatively  few  parameters.  Figure  3  illustrates  the  correspondence  between  these \nrepresentations. \nIn  this  sense,  the  compositional  hierarchy  as  applied  to  shapes \nincludes  a  notion of scale,  but  there  is  no  \"scale-space\"  operation  of intentionally \nblurring  data.  The  notion  of Scale  Space  as  utilized  here  thus  differs  from  the \napplication  of the  method  to  low-level  computations in  the  visual  domain  where \nauxiliary coarse  scale representations  are computed explicitly.  The object represen(cid:173)\ntations in the Frameville system as described earlier combines both, bottom-up and \ntop-down elements.  If the  top-down aspects  of the scheme  described  by  Marr  and \nNishihara  [2]  could  be incorporated  into the  Frameville architecture,  then  our  pre(cid:173)\nvious simulation results  [8]  suggest  that much better  performance can  be expected \nfrom  the neural network.  
Two problems must be addressed: (1) how do we obtain, from the observed raw data alone, a coarse estimate of the slot parameters at the highest level, and (2) given these crude estimates, how do we utilize them to recall default settings for the segmentation one level below?\n\nFigure 4: Bootstrap computation for a network from a 3-level grammar. Analog frame variables at the top and intermediate level are initialized from data by a bootstrap computation (bold lines indicate the flow of information).\n\n2.2  Initialization of Coarse Scale Parameters\n\nWe propose to aid convergence by supplying initial values for the analog variables $p$ and $\\{p_j\\}$; these must be computed from data without making use of the labelling. In general, it is not possible to solve for the analog parameters without knowledge of the correct permutation matrix $M$. However, for the purpose of obtaining an approximation $\\hat{p}$ one can derive a new objective function that does not depend on $M$ and the parameters $\\{p_j\\}$ by integrating over the $\\{p_j\\}$ and summing over all possible permutation matrices $M$:\n\n$$P(p, \\{p_k\\}) = \\sum_{\\{M\\} \\mid M \\text{ is a permutation}} \\int d\\{p_j\\} P(p, \\{p_j\\}, \\{p_k\\}, M) \\qquad (4)$$\n\nThis formulation leads to an Elastic Net type network [9, 7]. However, implementing a separate network for the bootstrap computations is expensive.\n\nHere we use a simpler computation in which the coarse scale parameters are estimated by computing sample averages, corresponding to finding the solution for the Elastic Net in the high temperature limit [7]. For the position $x$ we find, after integrating over the $\\{x_j\\}$,\n\n$$\\hat{x} = \\frac{1}{\\sum_{\\beta m} 1/(\\sigma_{\\beta x}^2 \\sigma_{\\beta m x}^2)} \\sum_{\\beta m} \\sum_k \\frac{M_{\\beta m k} x_k}{\\sigma_{\\beta x}^2 \\sigma_{\\beta m x}^2} - \\frac{1}{\\sum_\\beta 1/\\sigma_{\\beta x}^2} \\sum_\\beta \\frac{u_{\\beta x}}{\\sigma_{\\beta x}^2} \\qquad (5)$$\n\nand similarly for $y$. 
Since the assignment $M_{\\beta m k}$ of subparts $k$ on the data side to subparts $\\beta m$ on the model side is not known at this point, the first term in equation (5) cannot be evaluated. After approximating the actual variance with an average variance, these equations reduce to\n\n$$\\hat{x} = \\frac{1}{N_\\beta N_m} \\sum_k x_k - \\frac{1}{N_\\beta N_m} \\sum_{\\beta m} u_{\\beta m x} - \\frac{1}{N_\\beta} \\sum_\\beta u_{\\beta x} \\qquad (6)$$\n\nIn terms of the objective function this translates into assuming that the error terms for all parts are weighted equally. Since these weights would depend on the actual part match, this just corresponds to our ignorance regarding the identity of the parts. This approximation assumes that the variances do not differ by a large amount; otherwise the approximation $\\hat{p}$ will not be close to the true values. Since the model can be designed such that the part primitives used at the lowest level of the grammar are not highly specialized, as would be the case for abstractions at higher levels of the model, the approximation proved sufficient for the problems studied here.\n\nThe neural network can be used to perform the calculation. The Elastic Net formulation assigns approximately equal weights to all possible assignments at high temperatures. Thus, this behavior can be expressed in the original network with match variables by choosing $M_{\\beta m k} = 1/(N_\\beta N_m)$ for all $\\beta m$, $k$. This leads to the following two-pass bootstrap computation. Using this specific choice for $M$, only the analog variables need to be updated to compute the coarse scale estimates. The network with constant $M$ is just the neural network implementation for computing $\\hat{x}$ from equation (6). 
After these have converged, $\\hat{x}$ can be used to compute $x_j = \\hat{x} + u_\\beta$. Thus, the parameters for intermediate levels can be hypothesized from the coarse scale estimate $\\hat{x}$ by adding the known transformation (recall that for intermediate levels, the part identity is preserved and no permutation step takes place (see Figure 2)). Then the network is restarted with random values for the match variables to compute the correct labelling and the correct parameters.\n\n2.3  Simulation Results\n\nThe bootstrap procedure has been implemented for a 3-level hierarchical model. The model describes a \"gingerbread man\" as shown in Figure 3. The incorrect solutions observed did not, in the vast majority of cases, violate the permutation matrix constraint, i.e. the assignment was unique. However, even though the assignment is unique, parts were not always assigned correctly. Most commonly, the identity of neighboring parts was interchanged, in particular for cases with large variance.\n\nThe advantage of using the bootstrap initialization is clear from Figure 5. For the simulation, $\\sigma_1^2 = 2\\sigma_2^2$; the noise variance was identical for all parts. The network computed the solution reliably for large noise variances. In such cases the performance of the network without initialization deteriorates rapidly. Only one set of 10 experiments was used for the graph, but in all simulations performed, the network with initialization consistently outperformed the network without initialization. Figure 5 (right) shows the time, measured in the number of iterations, necessary for the network to converge; it is almost unaffected by the increase in the noise variance. This is because the initial values derived from data are still close to the final solution. 
While in some cases the random starting point happens to be close to the correct solution and the network without initialization converges rapidly, Figure 5 reflects the typical behavior and demonstrates the advantage of computing approximate initial values.\n\n[Figure 5 panels: Success Rate (left) and Convergence Speed (right); horizontal axis: $\\sigma_1^2 = 2\\sigma_2^2$.]\n\nFigure 5: Results Comparing the Network without and with Initialization (solid line).\nLeft: The success rate indicates the rate at which the network converged to the correct solutions. $\\sigma_1^2$ denotes the noise variance at the intermediate level of the model and $\\sigma_2^2$ the noise variance at the lowest level. Only one set of 10 experiments was used for the graph, but in all simulations performed, the network with initialization consistently outperformed the network without initialization.\nRight: The graph shows the average time it takes for the network to converge (as measured by the number of iterations), averaged over 10 experiments. Only simulations where the network converged to the correct solution are used to compute the average time for convergence. The stopping criterion used required all the match neurons to assume values $M_{ij} > 0.95$ or $M_{ij} < 0.05$. The error bars denote the standard deviation.\n\nAcknowledgements\n\nThis work was supported in part by AFOSR grant AFOSR 90-0224. We thank E. Mjolsness and A. Rangarajan for many helpful discussions.\n\nReferences\n\n[1] G. Gindi, E. Mjolsness, and P. Anandan. 
Neural networks for model based recognition. In Neural Networks: Concepts, Applications and Implementations, pages 144-173. Prentice-Hall, 1991.\n\n[2] David Marr. Vision. W. H. Freeman and Co., New York, 1982.\n\n[3] E. Mjolsness. Bayesian inference on visual grammars by neural nets that optimize. Technical Report YALEU-DCS-TR-854, Yale University, Dept. of Computer Science, 1991.\n\n[4] E. Mjolsness. Visual grammars and their neural nets. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.\n\n[5] Eric Mjolsness, Gene Gindi, and P. Anandan. Optimization in model matching and perceptual organization: A first look. Research report yaleu/dcs/rr-634, Yale University, Department of Computer Science, 1988.\n\n[6] Eric Mjolsness, Gene R. Gindi, and P. Anandan. Optimization in model matching and perceptual organization. Neural Computation, 1(2), 1989.\n\n[7] Joachim Utans. Neural Networks for Object Recognition within Compositional Hierarchies. PhD thesis, Department of Electrical Engineering, Yale University, New Haven, CT 06520, 1992.\n\n[8] Joachim Utans, Gene R. Gindi, Eric Mjolsness, and P. Anandan. Neural networks for object recognition within compositional hierarchies: Initial experiments. Technical report 8903, Yale University, Center for Systems Science, Department of Electrical Engineering, 1989.\n\n[9] A. L. Yuille. Generalized deformable models, statistical physics, and matching problems. Neural Computation, 2(2):1-24, 1990. 
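Illustrative note on Section 2.2. The bootstrap computation described there can be sketched in a few lines. The following is a minimal sketch, not the authors' network implementation: it evaluates the equal-weight estimate of equation (6) as plain sample averages and then hypothesizes the intermediate-level positions x_j = x + u_beta. The function names, the array layout and the toy numbers are assumptions made for this example only.

```python
import numpy as np

def bootstrap_position(x_data, u_part, u_sub):
    '''Equal-weight coarse estimate of the object position (equation (6)).

    x_data: (N_beta * N_m,) observed subpart coordinates, in any order.
    u_part: (N_beta,) model offsets of the parts relative to the object.
    u_sub:  (N_beta, N_m) model offsets of the subparts relative to their part.
    '''
    x_data = np.asarray(x_data, dtype=float)
    u_part = np.asarray(u_part, dtype=float)
    u_sub = np.asarray(u_sub, dtype=float)
    # Sample averages do not depend on the unknown permutation M.
    return x_data.mean() - u_sub.mean() - u_part.mean()

def hypothesize_parts(x_hat, u_part):
    '''Intermediate-level positions x_j = x_hat + u_beta.'''
    return x_hat + np.asarray(u_part, dtype=float)

# Toy, noise-free instance with hypothetical numbers: object placed at x = 5.
u_part = np.array([-2.0, 0.0, 2.0])
u_sub = np.array([[-0.5, 0.5], [-1.0, 1.0], [-0.25, 0.75]])
x_true = 5.0
x_sub = (x_true + u_part[:, None] + u_sub).ravel()

# Shuffle the data so that the part identity is lost, as in the grammar.
rng = np.random.default_rng(0)
x_data = rng.permutation(x_sub)

x_hat = bootstrap_position(x_data, u_part, u_sub)
x_parts = hypothesize_parts(x_hat, u_part)
```

In this noise-free toy case the estimate recovers the object position regardless of how the data are shuffled, which is the property the bootstrap relies on. With noisy data the estimate is only approximate and serves as an initial value for the full network.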
", "award": [], "sourceid": 661, "authors": [{"given_name": "Joachim", "family_name": "Utans", "institution": null}, {"given_name": "Gene", "family_name": "Gindi", "institution": null}]}