{"title": "Unsupervised Classification of 3D Objects from 2D Views", "book": "Advances in Neural Information Processing Systems", "page_first": 949, "page_last": 956, "abstract": null, "full_text": "Unsupervised Classification of 3D Objects from 2D Views \n\nSatoshi Suzuki  Hiroshi Ando \n\nATR Human Information Processing Research Laboratories \n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan \n\nsatoshi@hip.atr.co.jp, ando@hip.atr.co.jp \n\nAbstract \n\nThis paper presents an unsupervised learning scheme for categorizing \n3D objects from their 2D projected images. The scheme exploits an \nauto-associative network's ability to encode each view of a single object \ninto a representation that indicates its view direction. We propose two \nmodels that employ different classification mechanisms; the first model \nselects an auto-associative network whose recovered view best matches \nthe input view, and the second model is based on a modular architecture \nwhose additional network classifies the views by splitting the input \nspace nonlinearly. We demonstrate the effectiveness of the proposed \nclassification models through simulations using 3D wire-frame objects. \n\n1  INTRODUCTION \nThe human visual system can recognize various 3D (three-dimensional) objects from \ntheir 2D (two-dimensional) retinal images although the images vary significantly as the \nviewpoint changes. Recent computational models have explored how to learn to recognize \n3D objects from their projected views (Poggio & Edelman, 1990). Most existing models \nare, however, based on supervised learning, i.e., during training the teacher tells which \nobject each view belongs to. The model proposed by Weinshall et al. (1990) also requires \na signal that segregates different objects during training. 
This paper, on the other hand, \ndiscusses unsupervised aspects of 3D object recognition where the system discovers \ncategories by itself. \n\nThis paper presents an unsupervised classification scheme for categorizing 3D objects \nfrom their 2D views. The scheme consists of a mixture of 5-layer auto-associative networks, \neach of which identifies an object by non-linearly encoding the views into a representation \nthat describes the transformation of a rigid object. A mixture model with linear networks was \nalso studied by Williams et al. (1993) for classifying objects under affine transformations. \nWe propose two models that employ different classification mechanisms. The first model \nclassifies the given view by selecting an auto-associative network whose recovered view \nbest matches the input view. The second model is based on the modular architecture \nproposed by Jacobs et al. (1991) in which an additional 3-layer network classifies the \nviews by directly splitting the input space. The simulations using 3D wire-frame objects \ndemonstrate that both models effectively learn to classify each view as a 3D object. \nThis paper is organized as follows. Section 2 describes in detail the proposed models for \nunsupervised classification of 3D objects. Section 3 describes the simulation results using \n3D wire-frame objects. In these simulations, we test the performance of the proposed \nmodels and examine what internal representations are acquired in the hidden layers. \nFinally, Section 4 concludes this paper. \n\n2  THE NETWORK MODELS \nThis section describes an unsupervised scheme that classifies 2D views into 3D objects. \nWe initially examined classical unsupervised clustering schemes, such as the k-means \nmethod or the vector quantization method, to see whether such methods can solve this \nproblem (Duda & Hart, 1973). 
Through simulations using the wire-frame objects described \nin the next section, we found that these methods do not yield satisfactory performance. \nWe, therefore, propose a new unsupervised learning scheme for classifying 3D objects. \nThe proposed scheme exploits an auto-associative network for identifying an object. An \nauto-associative network finds an identity mapping through a bottleneck in the hidden \nlayer, i.e., the network approximates functions $F$ and $F^{-1}$ such that \n\n$$\\mathbb{R}^n \\xrightarrow{F} \\mathbb{R}^m \\xrightarrow{F^{-1}} \\mathbb{R}^n$$ \n\nwhere $m < n$. The network, thus, compresses the input into a \nlow dimensional representation by eliminating redundancy. If we use a five-layer perceptron \nnetwork, the network can perform nonlinear dimensionality reduction, which is a nonlinear \nanalogue of principal component analysis (Oja, 1991; DeMers & Cottrell, 1993). \nThe proposed classification scheme consists of a mixture of five-layer auto-associative \nnetworks which we call the identification networks, or the I-Nets. In the case where the \ninputs are the projected views of a rigid object, the minimum dimension that constrains \nthe input variation is the degree of freedom of the rigid object, which is six in the most \ngeneral case, three for rotation and three for translation. Thus, a single I-Net can compress \nthe views of an object into a representation whose dimension is its degree of freedom. \nThe proposed scheme categorizes each view of a number of 3D objects into its class \nthrough selecting an appropriate I-Net. We present the following two models for different \nselection and learning methods. \nModel I: The model I selects an I-Net whose output best fits the input (see Fig. 1). \nSpecifically, we assume a classifier whose output vector is given by the softmax function \nof a negative squared difference between the input and the output of the I-Nets, i.e., \n\n$$\\frac{\\exp[-\\|y^* - y_i\\|^2]}{\\sum_j \\exp[-\\|y^* - y_j\\|^2]} \\qquad (1)$$ \n\nwhere $y^*$ and $y_i$ denote the input and the output of the $i$th I-Net, respectively. Therefore, \nif only one of the I-Nets has an output that best matches the input, then the output value \nof the corresponding unit in the classifier becomes nearly one and the output values of the \nother units become nearly zero. For training the network, we maximize the following \nobjective function: \n\n$$\\ln \\frac{\\sum_i \\exp[-a \\|y^* - y_i\\|^2]}{\\sum_i \\exp[-\\|y^* - y_i\\|^2]} \\qquad (2)$$ \n\nwhere $a$ ($> 1$) denotes a constant. This function forces the output of at least one I-Net to \nfit the input, and it also forces the rest of the I-Nets to increase the error between the input \nand the output. Since it is difficult for a single I-Net to learn more than one object, we \nexpect that the network will eventually converge to the state where each I-Net identifies \nonly one object. \n\nFigure 1: Model I and Model II. Each I-Net (identification net) is a 5-layer auto-associative \nnetwork and the C-Net (classification net) is a 3-layer network. \n\nModel II: The model II, on the other hand, employs an additional network which we call \nthe classification network or the C-Net, as illustrated in Fig. 1. The C-Net classifies the \ngiven views by directly partitioning the input space. This type of modular architecture has \nbeen proposed by Jacobs et al. (1991) based on a stochastic model (see also Jordan & \nJacobs, 1992). 
In this architecture, the final output, $y$, is given by \n\n$$y = \\sum_i g_i y_i \\qquad (3)$$ \n\nwhere $y_i$ denotes the output of the $i$th I-Net, and $g_i$ is given by the softmax function \n\n$$g_i = \\frac{\\exp[s_i]}{\\sum_j \\exp[s_j]} \\qquad (4)$$ \n\nwhere $s_i$ is the weighted sum arriving at the $i$th output unit of the C-Net. \nFor the C-Net, we use a three-layer perceptron, since a simple perceptron with two layers \ndid not provide good performance for the objects used in our simulations (see Section \n3). The results suggest that classification of such objects is not a linearly separable \nproblem. Instead of using an MLP (multi-layer perceptron), we could use other types of \nnetworks for the C-Net, such as RBF (radial basis function) networks (Poggio & Edelman, 1990). \nWe maximize the objective function \n\n$$\\ln \\sum_i g_i \\sigma^{-1} \\exp[-\\|y^* - y_i\\|^2 / (2\\sigma^2)] \\qquad (5)$$ \n\nwhere $\\sigma^2$ is the variance. This function forces the C-Net to select only one I-Net, and at \nthe same time, the selected I-Net to encode and decode the input information. \nNote that the model I can be interpreted as a modified version of the model II, since \nmaximizing (2) is essentially equivalent to maximizing (5) if we replace $s_i$ of the C-Net \nin (4) with a negative squared difference between the input and the output of the $i$th \nI-Net, i.e., $s_i = -\\|y^* - y_i\\|^2$. Although the model I is a more direct classification method \nthat exploits auto-associative networks, it is interesting to examine what information can \nbe extracted from the input for classification in the model II (see Section 3.2). \n\n3  SIMULATIONS \nWe implemented the network models described in the previous section to evaluate their \nperformance. The 3D objects that we used for our simulations are 5-segment wire-frame \nobjects whose six vertices are randomly selected in a unit cube, as shown in Fig. 2 (a) \n(see also Poggio & Edelman, 1990). 
Various views of the objects are obtained by \northographically projecting the objects onto an image plane whose position covers a \nsphere around the object (see Fig. 2 (b)). The view position is defined by the two \nparameters, $\\theta$ and $\\phi$. In the simulations, we used the x, y image coordinates of the six \nvertices of three wire-frame objects as the inputs to the network. \nThe models contain three I-Nets, whose number is set equal to the number of the objects. \nThe number of units in the third layer of the five-layer I-Nets is set equal to the number \nof the view parameters, which is two in our simulations. We used twenty units in the \nsecond and fourth layers. To train the network efficiently, we initially limited the ranges \nof $\\theta$ and $\\phi$ to $\\pi/8$ and $\\pi/4$ and gradually increased the range until it covered the whole \nsphere. During the training, objects were randomly selected among the three and their \nviews were randomly selected within the view range. The steepest ascent method was \nused for maximizing the objective functions (2) and (5) in our simulations, but more \nefficient methods, such as the conjugate gradient method, can also be used. \n\nFigure 2: (a) 3D wire-frame objects. (b) Viewpoint defined by two parameters, $\\theta$ and $\\phi$. \n\n3.1 SIMULATIONS USING THE MODEL I \nThis section describes the simulation results using the model I. As described in Section 2, \nthe classifier of this model selects an I-Net that produces minimum error between the \noutput and the input. We test the classification performance of the model and examine \ninternal representations of the I-Nets after training the networks. The constant $a$ in the \nobjective function (2) was set to 50 during the training. \nFig. 
3 shows the output of the classifier plotted over the view directions when the views \nof an object are used for the inputs. The output value of one unit is almost equal to one over \nthe entire range of the view direction, and the outputs of the other two units are nearly \nzero. This indicates that the network effectively classifies a given view into an object \nregardless of the view directions. We obtained satisfactory results for classification if \nmore than five units are used in the second and fourth layers of the I-Nets. \nFig. 4 shows examples of the input views of an object and the views recovered by the \ncorresponding I-Net. The recovered views are very similar to the input views, \nindicating that each auto-associative I-Net can successfully compress and recover the \nviews of an object. In fact, as shown in Fig. 5, the squared error between the input and the \noutput of an I-Net is nearly zero for only one of the objects. This indicates that each I-Net \ncan be used for identifying an object for almost the entire view range. \n\nFigure 3: Outputs of the classifier in the model I (panels: Unit 1, Unit 2, Unit 3). The output value of the second unit is \nalmost equal to one over the full view range, and the outputs of the other two units are \nnearly zero for one of the 3D objects. \n\nFigure 4: Examples of the input and recovered views of an object. The recovered views are \nvery similar to the input views. \n\nWe further analyzed what information is encoded in the third layer of the I-Nets. Fig. 6 \n(a) illustrates the outputs of the third layer units plotted as a function of the view direction \n($\\theta$, $\\phi$) of an object. Fig. 6 (b) shows the view direction ($\\theta$, $\\phi$) plotted as a function of \nthe outputs of the third layer units. 
Both figures exhibit single-valued functions, i.e., the \nview direction of the object uniquely determines the outputs of the hidden units, and at \nthe same time the outputs of the hidden units uniquely determine the view direction. \nThus, each I-Net encodes a given view of an object into a representation that has a one-to-one \ncorrespondence with the view direction. This result is expected from the condition that \nthe dimension of the third layer is set equal to the degree of freedom of a rigid object. \n\nFigure 5: Error between the input view and the recovered view of an I-Net for each object (panels: Object 1, Object 2, Object 3). \nThe figures show that the I-Net recovers only the views of Object 3. \n\nFigure 6: (a) Outputs of the third layer units of an I-Net plotted over the view direction ($\\theta$, \n$\\phi$) of an object. (b) The view direction plotted over the outputs of the third layer units. \nFigure (b) was obtained by inversely replotting Figure (a). \n\n3.2 SIMULATIONS USING THE MODEL II \nIn this section, we show the simulation results using the model II. The C-Net in the model \nlearns to classify the views by splitting the input space nonlinearly. We examine internal \nrepresentations of the C-Net that lead to view invariant classification in its output. \n\nIn the simulations, we used the same three wire-frame objects used in the previous simulations. \nThe C-Net contains 20 units in the hidden layer. The parameter $\\sigma$ in the objective \nfunction (5) was set to 0.1. Fig. 7 (a) illustrates the values of an output unit in the C-Net \nfor an object. As in the case of the model I, the model correctly classified the views into \ntheir original object for almost the entire view range. Fig. 
7 (b) illustrates the outputs of two \nof the hidden units as examples, showing that each hidden unit has a limited view range \nwhere its output is nearly one. The C-Net, thus, combines these partially invariant \nrepresentations in the hidden layer to achieve full view invariance at the output layer. \nTo examine the generalization ability of the model, we limited the view range in the \ntraining period and tested the network using images covering the full view range. Fig. 8 \n(a) and (b) show the values of an output unit of the C-Net and the error of the corresponding \nI-Net plotted over the entire view range. The region surrounded by a rectangle indicates \nthe range of view directions where the training was done. The figures show that \ncorrect classification and small recovery error are not restricted to the training \nrange but extend beyond it, suggesting that the network exhibits a satisfactory \ncapability of generalization. We obtained similar generalization results for the model I as \nwell. We also trained the networks with a sparse set of views rather than using randomly \nselected views. The results show that classification is nearly perfect regardless of the \nviewpoints if we use at least 16 training views evenly spaced within the full view range. \n\nFigure 7: (a) Output values of an output unit of the C-Net when the views of an object are \ngiven (cf. Fig. 3). (b) Output values of two hidden units of the C-Net for the same object. \n\nFigure 8: (a) Output values of an output unit of the C-Net. (b) Errors between the input \nviews and the recovered views of the corresponding I-Net. The region surrounded by a \nrectangle indicates the view range where the training was done. 
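The relation between the two objectives noted in Section 2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the classifier output (1) and the objectives (2) and (5) in plain Python for made-up views, with s_i replaced by the negative squared error. Under that substitution, (5) works out to (2) with a = 1 + 1/(2 sigma^2), minus the constant ln sigma. All function names and numbers below are hypothetical and chosen only for illustration.

```python
import math

def squared_error(y_star, y):
    # Squared Euclidean distance between an input view and a recovered view.
    return sum((p - q) ** 2 for p, q in zip(y_star, y))

def log_sum_exp(values):
    # Numerically stable ln(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def classifier_output(errors):
    # Eq. (1): softmax of the negative squared errors of the I-Nets.
    lse = log_sum_exp([-e for e in errors])
    return [math.exp(-e - lse) for e in errors]

def objective_model_1(errors, a):
    # Eq. (2): ln( sum_i exp(-a e_i) / sum_i exp(-e_i) ).
    return log_sum_exp([-a * e for e in errors]) - log_sum_exp([-e for e in errors])

def objective_model_2(errors, s, sigma):
    # Eq. (5): ln sum_i g_i sigma^-1 exp(-e_i / (2 sigma^2)),
    # where g_i is the softmax of the C-Net sums s_i (Eq. 4).
    # Everything is kept in the log domain to avoid underflow.
    lse = log_sum_exp(s)
    terms = [si - lse - math.log(sigma) - e / (2.0 * sigma ** 2)
             for si, e in zip(s, errors)]
    return log_sum_exp(terms)

# Hypothetical data: one input view and the outputs of three I-Nets.
y_star = [0.2, 0.7, 0.1, 0.9]
recovered = [
    [0.9, 0.1, 0.8, 0.2],
    [0.21, 0.69, 0.1, 0.9],  # this I-Net reconstructs the input best
    [0.5, 0.5, 0.5, 0.5],
]
errors = [squared_error(y_star, y) for y in recovered]
outputs = classifier_output(errors)

# Assumed value of sigma for the check; a follows from the substitution.
sigma = 0.5
a = 1.0 + 1.0 / (2.0 * sigma ** 2)
lhs = objective_model_2(errors, [-e for e in errors], sigma)
rhs = objective_model_1(errors, a) - math.log(sigma)

print('classifier outputs:', outputs)
print('objective (5) with s_i = -e_i:', lhs)
print('objective (2) with a =', a, 'minus ln(sigma):', rhs)
```

With sigma = 0.1, the value used for the model II simulations, the same mapping gives a = 51, which is close to the a = 50 used for the model I; this follows from the formulas alone and is offered only as an observation.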
\n\n4  CONCLUSIONS \nWe have presented an unsupervised classification scheme that classifies 3D objects from \ntheir 2D views. The scheme consists of a mixture of non-linear auto-associative networks \neach of which identifies an object by encoding an input view into a representation that \nindicates its view direction. The simulations using 3D wire-frame objects demonstrated \nthat the scheme effectively clusters the given views into their original objects with no \nexplicit identification of the object classes being provided to the networks. We presented \ntwo models that utilize different classification mechanisms. In particular, the model I \nemploys a novel classification and learning strategy that forces only one network to \nreconstruct the input view, whereas the model II is based on a conventional modular \narchitecture which requires training of a separate classification network. Although we \nassumed in the simulations that feature points are already identified in each view and that \ntheir correspondence between the views is also established, the scheme does not, in \nprinciple, require the identification and correspondence of features, because the scheme is \nbased solely on the existence of non-linear mappings between a set of images of an object \nand its degree of freedom. Therefore, we are currently investigating how the proposed \nscheme can be used to classify real gray-level images of 3D objects. \n\nAcknowledgments \nWe would like to thank Mitsuo Kawato for extensive discussions and continuous \nencouragement, and Hiroaki Gomi and Yasuharu Koike for helpful comments. We are \nalso grateful to Tommy Poggio for insightful discussions. \n\nReferences \nDeMers, D. and Cottrell, G. (1993). Non-linear dimensionality reduction. In Hanson, S. \nJ., Cowan, J. D. & Giles, C. 
L., (eds), Advances in Neural Information Processing \nSystems 5. Morgan Kaufmann Publishers, San Mateo, CA. 580-587. \n\nDuda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John \n\nWiley & Sons, NY. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive mixtures \n\nof local experts. Neural Computation, 3, 79-87. \n\nJordan, M. I. and Jacobs, R. A. (1992). Hierarchies of adaptive experts. In Moody, J. E., \nHanson, S. J. & Lippmann, R. P., (eds), Advances in Neural Information Processing \nSystems 4. Morgan Kaufmann Publishers, San Mateo, CA. 985-992. \n\nOja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward \nneural networks. In Kohonen, T. et al. (eds), Artificial Neural Networks. Elsevier \nScience Publishers B.V., North-Holland. \n\nPoggio, T. and Edelman, S. (1990). A network that learns to recognize three-dimensional \n\nobjects. Nature, 343, 263. \n\nWeinshall, D., Edelman, S. and B\u00fclthoff, H. H. (1990). A self-organizing multiple-view \nrepresentation of 3D objects. In Touretzky, D. S., (ed.), Advances in Neural Information \nProcessing Systems 2. Morgan Kaufmann Publishers, San Mateo, CA. 274-281. \n\nWilliams, C. K. I., Zemel, R. S. and Mozer, M. C. (1993). Unsupervised learning of \nobject models. AAAI Fall 1993 Symposium on Machine Learning in Computer Vision. \n", "award": [], "sourceid": 910, "authors": [{"given_name": "Satoshi", "family_name": "Suzuki", "institution": null}, {"given_name": "Hiroshi", "family_name": "Ando", "institution": null}]}