{"title": "Linear Operator for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 452, "page_last": 459, "abstract": null, "full_text": "Linear Operator for Object Recognition\n\nRonen Basri\n\nShimon Ullman*\n\nM.I.T. Artificial Intelligence Laboratory\nand Department of Brain and Cognitive Science\n545 Technology Square\nCambridge, MA 02139\n\n* Also, Weizmann Inst. of Science, Dept. of Applied Math., Rehovot 76100, Israel\n\nAbstract\n\nVisual object recognition involves the identification of images of 3-D objects seen from arbitrary viewpoints. We suggest an approach to object recognition in which a view is represented as a collection of points given by their location in the image. An object is modeled by a set of 2-D views together with the correspondence between the views. We show that any novel view of the object can be expressed as a linear combination of the stored views. Consequently, we build a linear operator that distinguishes between views of a specific object and views of other objects. This operator can be implemented using neural network architectures with relatively simple structures.\n\n1 Introduction\n\nVisual object recognition involves the identification of images of 3-D objects seen from arbitrary viewpoints. In particular, objects often appear in images from previously unseen viewpoints. In this paper we suggest an approach to object recognition in which rigid objects are recognized from arbitrary viewpoints. The method can be implemented using neural network architectures with relatively simple structures. 
\n\nIn our approach a view is represented as a collection of points given by their location in the image. An object is modeled by a small set of views together with the correspondence between these views. We show that any novel view of the object can be expressed as a linear combination of the stored views. Consequently, we build a linear operator that distinguishes views of a specific object from views of other objects. This operator can be implemented by a neural network.\n\nThe method has several advantages. First, it correctly handles rigid objects, but is not restricted to such objects. Second, there is no need in this scheme to explicitly recover and represent the 3-D structure of objects. Third, the computations involved are often simpler than in previous schemes.\n\n2 Previous Approaches\n\nObject recognition involves a comparison of a viewed image against object models stored in memory. Many existing approaches to object recognition accomplish this task by performing a template comparison between the image and each of the models, often after compensating for certain variations due to the different positions and orientations in which the object is observed. Such an approach is called alignment (Ullman, 1989), and a similar approach is used in (Fischler & Bolles 1981, Lowe 1985, Faugeras & Hebert 1986, Chien & Aggarwal 1987, Huttenlocher & Ullman 1987, Thompson & Mundy 1987).\n\nThe majority of alignment schemes use object-centered representations to model the objects. In these models the 3-D structure of the objects is explicitly represented. The acquisition of models in these schemes therefore requires a separate process to recover the 3-D structure of the objects.\n\nA number of recent studies use 2-D viewer-centered representations for object recognition. Abu-Mostafa & Psaltis (1987), for instance, developed a neural network that continuously collects and stores the observed views of objects. 
When a new view is observed it is recognized if it is sufficiently similar to one of the previously seen views. The system is very limited in its ability to recognize objects from novel views. It does not use information available from a collection of object views to extend the range of recognizable views beyond the range determined by each of the stored views separately.\n\nIn the scheme below we suggest a different kind of viewer-centered representation to model the objects. An object is modeled by a set of its observed images with the correspondence between points in the images. We show that only a small number of images is required to predict the appearance of the object from all possible viewpoints. These predictions are exact for rigid objects, but are not confined to such objects. We also suggest a neural network to implement the scheme.\n\nA similar representation was recently used by Poggio & Edelman (1990) to develop a network that recognizes objects using radial basis functions (RBFs). The approach presented here has several advantages over this approach. First, by using the linear combinations of the stored views rather than applying radial basis functions to them we obtain exact predictions for the novel appearances of objects rather than an approximation. Moreover, a smaller number of views is required in our scheme to predict the appearance of objects from all possible views. For example, when a rigid object that does not introduce self-occlusion (such as a wired object) is considered, predicting its appearance from all possible views requires only three views under the LC Scheme and about sixty views under the RBFs Scheme.\n\n3 The Linear Combinations (LC) Scheme\n\nIn this section we introduce the Linear Combinations (LC) Scheme. 
Additional details about the scheme can be found in (Ullman & Basri, 1991). Our approach is based on the following observation. For many continuous transformations of interest in recognition, such as 3-D rotation, translation, and scaling, every possible view of a transforming object can be expressed as a linear combination of other views of the object. In other words, the set of possible images of an object undergoing rigid 3-D transformations and scaling is embedded in a linear space, spanned by a small number of 2-D images.\n\nWe start by showing that any image of an object undergoing rigid transformations followed by an orthographic projection can be expressed as a linear combination of a small number of views. The coefficients of this combination may differ for the x- and y-coordinates. That is, the intermediate view of the object may be given by two linear combinations, one for the x-coordinates and the other for the y-coordinates. In addition, certain functional restrictions may hold among the different coefficients.\n\nWe represent an image by two coordinate vectors, one containing the x-values of the object's points and the other containing their y-values. In other words, an image $P$ is described by $x = (x_1, \\ldots, x_n)$ and $y = (y_1, \\ldots, y_n)$, where every $(x_i, y_i)$, $1 \\le i \\le n$, is an image point. The order of the points in these vectors is preserved in all the different views of the same object; namely, if $P$ and $P'$ are two views of the same object, then $(x_i, y_i) \\in P$ and $(x'_i, y'_i) \\in P'$ are in correspondence (or, in other words, they are the projections of the same object point).\n\nClaim: The set of coordinate vectors of an object obtained from all different viewpoints is embedded in a 4-D linear space. (A proof is given in Appendix A.) 
\n\nFollowing this claim we can represent the entire space of views of an object by a basis that consists of any four linearly independent vectors taken from the space. In particular, we can construct a basis using familiar views of the object. Two images supply four such vectors and therefore are often sufficient to span the space. By considering the linear combinations of the model vectors we can reproduce any possible view of the object.\n\nIt is important to note that the set of views of a rigid object does not occupy the entire linear 4-D space. Rather, the coefficients of the linear combinations reproducing valid images additionally obey two quadratic constraints. (See Appendix A.) In order to verify that an object undergoes a rigid transformation (as opposed to a general 3-D affine transformation) the model must consist of at least three snapshots of the object.\n\nMany 3-D rigid objects are bounded by smooth curved surfaces. The contours of such objects change their position on the object whenever the viewing position is changed. The linear combinations scheme can be extended to handle these objects as well. In these cases the scheme gives accurate approximations to the appearance of these objects (Ullman & Basri, 1991).\n\nThe linear combinations scheme assumes that the same object points are visible in the different views. When the views are sufficiently different, this will no longer hold, due to self-occlusion. To represent an object from all possible viewing directions (e.g., both \"front\" and \"back\"), a number of different models of this type will be required. 
This notion is similar to the use of different object aspects suggested by Koenderink & Van Doorn (1979). (Other aspects of occlusion are discussed in the next section.)\n\n4 Recognizing an Object Using the LC Scheme\n\nIn the previous section we have shown that the set of views of a rigid object is embedded in a linear space of a small dimension. In this section we define a linear operator that uses this property to recognize objects. We then show how this operator can be used in the recognition process.\n\nLet $p_1, \\ldots, p_k$ be the model views, and $p$ be a novel view of the same object. According to the previous section there exist coefficients $a_1, \\ldots, a_k$ such that $p = \\sum_{i=1}^{k} a_i p_i$. Suppose $L$ is a linear operator such that $L p_i = q$ for every $1 \\le i \\le k$ and some constant vector $q$; then $L$ transforms $p$ to $q$ (up to a scale factor): $L p = (\\sum_{i=1}^{k} a_i) q$. If in addition $L$ transforms vectors outside the space spanned by the model to vectors other than $q$, then $L$ distinguishes views of the object from views of other objects. The vector $q$ then serves as a \"name\" for the object. It can either be the zero vector, in which case $L$ transforms every novel view of the object to zero, or it can be a familiar view of the object, in which case $L$ has an associative property, namely, it takes a novel view of an object and transforms it to a familiar view. A constructive definition of $L$ is given in Appendix B.\n\nThe core of the recognition process we propose includes a neural network that implements the linear operator defined above. The input to this network is a coordinate vector created from the image, and the output is an indication whether the image is in fact an instance of the modeled object. 
The operator can be implemented by a simple, one-layer neural network with only feedforward connections, the type presented by Kohonen, Oja, & Lehtio (1981). It is interesting to note that this operator can be modified to recognize several models in parallel.\n\nTo apply this network to the image, the image should first be represented by its coordinate vectors. The construction of the coordinate vectors from the image can be implemented using cells with linear response properties, the type of cells encoding eye positions found by Zipser & Andersen (1988). The positions obtained should be ordered according to the correspondence of the image points with the model points. Establishing the correspondence is a difficult task and an obstacle to most existing recognition schemes. The phenomenon of apparent motion (Marr & Ullman 1981) suggests, however, that the human visual system is capable of handling this problem.\n\nIn many cases objects seen in the image are partially occluded. Sometimes some of the points also cannot be located reliably. To handle these cases the linear operator should be modified to exclude the missing points. The computation of the updated operator from the original one involves computing a pseudo-inverse. A method to compute the pseudo-inverse of a matrix in real time using neural networks has been suggested by Yeates (1991).\n\n5 Summary\n\nWe have presented a method for recognizing 3-D objects from 2-D images. In this method, an object model is represented by the linear combinations of several 2-D views of the object. It has been shown that for objects undergoing rigid transformations the set of possible images of a given object is embedded in a linear space spanned by a small number of views. 
Rigid transformations can be distinguished from more general linear transformations of the object by testing certain constraints placed upon the coefficients of the linear combinations. The method applies to objects with sharp as well as smooth boundaries.\n\nWe have proposed a linear operator to map the different views of the same object into a common representation, and we have presented a simple neural network that implements this operator. In addition, we have suggested a scheme to handle occlusions and unreliable measurements. One difficulty in this scheme is that it requires finding the correspondence between the image and the model views. This problem is left for future research.\n\nThe linear combinations scheme described above was implemented and applied to a number of objects. Figures 1 and 2 show the application of the linear combinations method to artificially created and real-life objects. The figures show a number of object models, their linear combinations, and the agreement between these linear combinations and actual images of the objects. Figure 3 shows the results of applying a linear operator with associative properties to artificial objects. It can be seen that whenever the operator is fed with a novel view of the object for which it was designed it returns a familiar view of the object.\n\nFigure 1: Top: three model pictures of a pyramid. Bottom: two of their linear combinations.\n\nAppendix A\n\nIn this appendix we prove that the coordinate vectors of images of a rigid object lie in a 4-D linear space. We also show that the coefficients of the linear combinations that produce valid images of the object additionally obey two quadratic constraints. 
\n\nLet $O$ be a set of object points, and let $x = (x_1, \\ldots, x_n)$, $y = (y_1, \\ldots, y_n)$, and $z = (z_1, \\ldots, z_n)$ such that $(x_i, y_i, z_i) \\in O$ for every $1 \\le i \\le n$. Let $P$ be a view of the object, and let $\\hat{x} = (\\hat{x}_1, \\ldots, \\hat{x}_n)$ and $\\hat{y} = (\\hat{y}_1, \\ldots, \\hat{y}_n)$ such that $(\\hat{x}_i, \\hat{y}_i)$ is the position of $(x_i, y_i, z_i)$ in $P$. We call $x$, $y$, and $z$ the coordinate vectors of $O$, and $\\hat{x}$ and $\\hat{y}$ the corresponding coordinate vectors in $P$. Assume $P$ is obtained from $O$ by applying a rotation matrix $R$, a scale factor $s$, and a translation vector $(t_x, t_y)$ followed by an orthographic projection.\n\nClaim: There exist coefficients $a_1, a_2, a_3, a_4$ and $b_1, b_2, b_3, b_4$ such that:\n\n$$\\hat{x} = a_1 x + a_2 y + a_3 z + a_4 \\mathbf{1}$$\n$$\\hat{y} = b_1 x + b_2 y + b_3 z + b_4 \\mathbf{1}$$\n\nwhere $\\mathbf{1} = (1, \\ldots, 1) \\in \\mathbb{R}^n$.\n\nProof: Simply by assigning:\n\n$$a_1 = s r_{11}, \\quad a_2 = s r_{12}, \\quad a_3 = s r_{13}, \\quad a_4 = t_x$$\n$$b_1 = s r_{21}, \\quad b_2 = s r_{22}, \\quad b_3 = s r_{23}, \\quad b_4 = t_y$$\n\nTherefore, $\\hat{x}, \\hat{y} \\in \\mathrm{span}\\{x, y, z, \\mathbf{1}\\}$ regardless of the viewpoint from which $\\hat{x}$ and $\\hat{y}$ are taken. Notice that the set of views of a rigid object does not occupy the entire linear 4-D space. Rather, the coefficients additionally obey two quadratic constraints:\n\n$$a_1^2 + a_2^2 + a_3^2 = b_1^2 + b_2^2 + b_3^2$$\n$$a_1 b_1 + a_2 b_2 + a_3 b_3 = 0$$\n\nFigure 2: Top: three model pictures of a VW car. Bottom: a linear combination of the three images (left), an actual edge image (middle), and the two images overlaid (right).\n\nFigure 3: Top: applying an associative pyramidal operator to a pyramid (left) returns a model view of the pyramid (right, compare with Figure 1 top left). Bottom: applying the same operator to a cube (left) returns an unfamiliar image (right).\n\nAppendix B\n\nA \"recognition matrix\" is defined as follows. 
Let $\\{p_1, \\ldots, p_k\\}$ be a set of $k$ linearly independent vectors representing the model pictures. Let $\\{p_{k+1}, \\ldots, p_n\\}$ be a set of vectors such that $\\{p_1, \\ldots, p_n\\}$ are all linearly independent. We define the following matrices:\n\n$$P = (p_1, \\ldots, p_k, p_{k+1}, \\ldots, p_n)$$\n$$Q = (q, \\ldots, q, p_{k+1}, \\ldots, p_n)$$\n\nWe require that:\n\n$$L P = Q$$\n\nTherefore:\n\n$$L = Q P^{-1}$$\n\nNote that since $P$ is composed of $n$ linearly independent vectors, the inverse matrix $P^{-1}$ exists; therefore $L$ can always be constructed.\n\nAcknowledgments\n\nWe wish to thank Yael Moses for commenting on the final version of this paper. This report describes research done at the Massachusetts Institute of Technology within the Artificial Intelligence Laboratory. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-85-K-0124. Ronen Basri is supported by the McDonnell-Pew and the Rothschild postdoctoral fellowships.\n\nReferences\n\nAbu-Mostafa, Y.S. & Psaltis, D., 1987. Optical neural computing. Scientific American, 256, 66-73.\n\nChien, C.H. & Aggarwal, J.K., 1987. Shape recognition from single silhouette. Proc. of ICCV Conf. (London), 481-490.\n\nFaugeras, O.D. & Hebert, M., 1986. The representation, recognition and location of 3-D objects. Int. J. Robotics Research, 5(3), 27-52.\n\nFischler, M.A. & Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395.\n\nHuttenlocher, D.P. & Ullman, S., 1987. Object recognition using alignment. Proc. of ICCV Conf. (London), 102-111.\n\nKoenderink, J.J. & Van Doorn, A.J., 1979. 
The internal representation of solid shape with respect to vision. Biol. Cybernetics, 32, 211-216.\n\nKohonen, T., Oja, E., & Lehtio, P., 1981. Storage and processing of information in distributed associative memory systems. In Hinton, G.E. & Anderson, J.A., Parallel Models of Associative Memory. Hillsdale, NJ: Lawrence Erlbaum Associates, 105-143.\n\nLowe, D.G., 1985. Perceptual Organization and Visual Recognition. Boston: Kluwer Academic Publishing.\n\nMarr, D. & Ullman, S., 1981. Directional selectivity and its use in early visual processing. Proc. R. Soc. Lond. B, 211, 151-180.\n\nPoggio, T. & Edelman, S., 1990. A network that learns to recognize three dimensional objects. Nature, 343, 263-266.\n\nThompson, D.W. & Mundy, J.L., 1987. Three dimensional model matching from an unconstrained viewpoint. Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, N.C., 208-220.\n\nUllman, S., 1989. Aligning pictorial descriptions: An approach to object recognition. Cognition, 32(3), 193-254. Also: 1986, A.I. Memo 931, The Artificial Intelligence Lab., M.I.T.\n\nUllman, S. & Basri, R., 1991. Recognition by Linear Combinations of Models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(10), 992-1006.\n\nYeates, M.C., 1991. A neural network for computing the pseudo-inverse of a matrix and application to Kalman filtering. Tech. Report, California Institute of Technology.\n\nZipser, D. & Andersen, R.A., 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679-684.\n", "award": [], "sourceid": 515, "authors": [{"given_name": "Ronen", "family_name": "Basri", "institution": null}, {"given_name": "Shimon", "family_name": "Ullman", "institution": null}]}