{"title": "Learning in Compositional Hierarchies: Inducing the Structure of Objects from Data", "book": "Advances in Neural Information Processing Systems", "page_first": 285, "page_last": 292, "abstract": null, "full_text": "Learning in Compositional Hierarchies: \n\nInducing the Structure of Objects from Data \n\nJoachim Utans \n\nOregon Graduate Institute \n\nDepartment of Computer Science and Engineering \n\nP.O. Box 91000 \n\nPortland, OR 97291-1000 \n\nutans@cse.ogi.edu \n\nAbstract \n\nI propose an algorithm for learning hierarchical models for object recognition. The model architecture is a compositional hierarchy that represents part-whole relationships: parts are described in the local context of substructures of the object. The focus of this report is learning hierarchical models from data, i.e. inducing the structure of model prototypes from observed exemplars of an object. At each node in the hierarchy, a probability distribution governing its parameters must be learned. The connections between nodes reflect the structure of the object. The formulation of substructures is encouraged such that their parts become conditionally independent. The resulting model can be interpreted as a Bayesian Belief Network and is also in many respects similar to the stochastic visual grammar described by Mjolsness. \n\n1  INTRODUCTION \n\nModel-based object recognition solves the problem of invariant recognition by relying on stored prototypes at unit scale positioned at the origin of an object-centered coordinate system. Elastic matching techniques are used to find a correspondence between features of the stored model and the data and can also compute the parameters of the transformation the observed instance has undergone relative to the stored model. 
Examples are the TRAFFIC system (Zemel, Mozer and Hinton, 1990) and the Frameville system (Mjolsness, Gindi and Anandan, 1989; Gindi, Mjolsness and Anandan, 1991; Utans, 1992). Frameville stores models as compositional hierarchies and, by matching at each level in the hierarchy, reduces the combinatorics of the match. \n\nFigure 1: Example of a compositional hierarchy (node labels in the figure: Human, Arm, Lower Arm). The simple figure can be represented as a hierarchical composition of parts. The hierarchy can be represented as a graph (a tree in this case). Nodes represent parts and edges represent the structural relationship. Nodes at the bottom represent individual parts of the object; nodes at higher levels denote more complex substructures. The single node at the top of the tree represents the entire object. \n\nThe attractive feature of feed-forward neural networks for object recognition is the relative ease with which their parameters can be learned from training data. Multilayer feed-forward networks are typically trained on input/output pairs (supervised learning) and thus are tuned to recognize instances of objects as seen during training. Difficulties arise if the observed object appears at a different position in the input image, is scaled or rotated, or has been subject to distortions. Some of these problems can be overcome by suitable preprocessing or a judicious choice of features. Other possibilities are weight sharing (LeCun, Boser, Denker, Henderson, Howard, Hubbard and Jackel, 1989) or invariant distance measures (Simard, LeCun and Denker, 1993). 
\n\nFew attempts have been reported in the neural network literature to learn the prototype models for model-based recognition from data. For example, the Frameville system uses hand-designed models. However, models learned from data and reflecting the statistics of the data should be superior to the hand-designed models used previously. Segen (1988a; 1988b) reports an approach to learning structural descriptions where features are clustered into substructures using a Minimum Description Length (MDL) criterion to obtain a sparse representation. Saund (1993) has proposed an algorithm for constructing tree representations with multiple \"causes\" where observed data is accounted for by multiple substructures at higher levels in the hierarchy. Ueda and Suzuki (1993) have developed an algorithm for learning models from shape contours using multiscale convex/concave structure matching to find a prototype shape typical for exemplars from a given class. \n\n2  LEARNING COMPOSITIONAL HIERARCHIES \n\nThe algorithm described here merges parts by means of grouping variables to form substructures. The model architecture is a compositional hierarchy, i.e. a part-whole hierarchy (an example is shown in Figure 1). The nodes in the graph represent parts and substructures; the arcs describe the structure of the object. At each node a probability density for part parameters is stored. A prominent advocate of such models has been Marr (1982), and models of this type are used in the Frameville system (Mjolsness et al., 1989; Gindi et al., 1991; Utans, 1992). 
The nodes in the graph represent parts and substructures; the arcs describe the structure of the object. The arcs can be regarded as \"part-of\" or \"ina\" relationships (similar to the notion used in semantic networks). At each node a probability density for part parameters such as position, size and orientation is stored. \n\nFigure 2: Examples of different compositional hierarchies for the same object (the digit 9 for a seven-segment LED display). One model emphasizes the parallel lines making up the square in the top part of the figure while for another model angles are chosen as intermediate substructures. The example on the right shows a hierarchy that \"reuses\" parts. \n\nThe model represents a typical prototype object at unit scale in an object-centered coordinate system. Parameters of parts are specified relative to parameters of the parent node in the hierarchy. Substructures thus provide a local context for their parts and decouple their parts from other parts and substructures in the model. The advantages of this representation are sparseness, invariance with respect to viewpoint transformations and the ability to model local deformations. In addition, the model explicitly represents the structure of an object and emphasizes the importance of structure for recognition (Cooper, 1989). \n\nLearning requires estimating the parameters of the distributions at each node (the mean and variance in the case of Gaussians) and finding the structure of the model. The emphasis in this report is on learning structure from exemplars. The parameterization of substructures may differ from that of the parts at the lowest level; it may become more complex and require more parameters as the substructures themselves become more complex. 
The representation as a compositional hierarchy can avoid overfitting since at higher levels in the hierarchy more exemplars are available for parameter estimation due to the grouping of parts (Omohundro, 1991). \n\n2.1  Structure and Conditional Independence: Bayesian Networks \n\nIn what way should substructures be allocated? Figure 2 shows examples of different compositional hierarchies for the same object (the digit 9 for a seven-segment LED display). One model emphasizes the parallel lines making up the square in the top part of the figure while for another model angles are chosen as intermediate substructures. It is not clear which of these models to choose. \n\nFigure 3: Bayesian Networks and conditional independence (see text). \n\nFigure 4: The model architecture. Circles denote the grouping variables ina (here a possible valid model after learning is shown). \n\nThe important benefit of a hierarchical representation of structure is that parts belonging to different substructures become decoupled, i.e. they are assigned to a different local context. The problem of constructing structured descriptions of data that reflect this independence relationship has been studied previously in the field of Machine Learning (see (Pearl, 1988) for a comprehensive introduction). The resulting models are Bayesian Belief Networks. Central to the idea of Bayesian Networks is the assumption that objects can be regarded as being composed of components that only sparsely interact; the network captures the probabilistic dependency of these components. The network can be represented as an interaction graph augmented with conditional probabilities. The structure of the graph represents the dependence of variables, i.e. dependent variables are connected by an arc. The strength of the dependence is expressed as a forward conditional probability. 
The conditional independence is represented by the absence of an arc between two nodes and leads to the sparseness of the model. \n\nThe notion of conditional independence in the context studied here manifests itself as follows. By just observing two parts in the image, one must assume that they, i.e. their parameters such as position, are dependent and must be modeled using their joint distribution. However, if one knows that these two parts are grouped to form a substructure, then, given the parameters of the substructure, the parts become conditionally independent. Thus, the internal nodes representing the substructures summarize the interaction of their child nodes. The correlation between the child nodes is summarized in the parent node and what remains is, for example, independent noise in observed instances of the child nodes. \n\nThe probability of observing an instance can be calculated from the model by starting at the root node and multiplying with the conditional probabilities of nodes traversed until the leaf nodes are reached. For example, given the graph in Figure 3, the joint distribution can be factored as \n\n$$P(x_1, y_1, y_2, z_1, z_2, z_3, z_4) = P(x_1)\\,P(y_1|x_1)\\,P(y_2|x_1)\\,P(z_1|y_1)\\,P(z_2|y_1)\\,P(z_3|y_2)\\,P(z_4|y_2) \\qquad (1)$$ \n\n(note that the hidden nodes are treated just like the nodes corresponding to observable parts). \n\nNote that the stochastic visual grammar described by Mjolsness (1991) is equivalent to this model. The model used there is a stochastic forward (generative) model where each level of the compositional hierarchy corresponds to a stochastic production rule that generates nodes in the next lower level. The distribution of parameters at the next lower level is conditioned on the parameters of the parent node. 
Thus, the model obtained from constructing a Bayesian network is equivalent to the stochastic grammar if the network is constrained to a directed acyclic graph (DAG). \n\nIf all the nodes of the network correspond to observable events, techniques exist for finding the structure of the Bayesian Network and estimating its parameters (Pearl, 1988) (see also (Cooper and Herskovits, 1992)). However, for the hierarchical models considered here, only the nodes at the lowest layer (the leaves of the tree) correspond to observable instances of parts of the object in the training data. The learning algorithm must induce hidden, unobservable substructures. That is, it is assumed that the observables are \"caused\" by internal nodes not directly accessible. These are represented as nodes in the network just like the observables and their parameters must be estimated as well. See (Pearl, 1988) for an extensive discussion and examples of this idea. \n\nLearning Bayesian networks is a hard problem when the network contains hidden nodes, but a construction algorithm exists if it is known that the data is in fact tree-decomposable (Pearl, 1988). The method is based on computing the correlations $\\rho$ between child nodes and on constraints on the correlation coefficients dictated by a particular structure. The entire tree can be constructed recursively using this method. Here, the case of Normal-distributed real-valued random variables is of interest: \n\n$$p(x_1, \\ldots, x_n) = \\frac{1}{(2\\pi)^{n/2} \\sqrt{\\det \\Sigma}} \\exp\\left(-\\frac{1}{2} (x - \\mu)^T \\Sigma^{-1} (x - \\mu)\\right) \\qquad (2)$$ \n\nwhere $x = (x_1, x_2, \\ldots, x_n)$ with mean $\\mu = E\\{x\\}$ and covariance matrix $\\Sigma = E\\{(x - \\mu)(x - \\mu)^T\\}$. The method is based on a condition under which a set of random variables is star-decomposable. 
The question one asks is whether a set of $n$ random variables can be represented as the marginal distribution of $n + 1$ variables $x_1, \\ldots, x_n, w$ such that the $x_1, \\ldots, x_n$ are conditionally independent given $w$, i.e. \n\n$$p(x_1, \\ldots, x_n) = \\int p(x_1, \\ldots, x_n, w)\\,dw \\qquad (3)$$ \n\n$$p(x_1, \\ldots, x_n, w) = p(w) \\prod_{i=1}^{n} p(x_i|w) \\qquad (4)$$ \n\nIn the graph representation of the Bayesian Network, $w$ is the central node relating the $x_1, \\ldots, x_n$, hence the name star-decomposable. In the general case of $n$ variables this is hard to verify, but a result by Xu and Pearl (1987) is available for 3 variables: a necessary and sufficient condition for 3 random variables with a joint normal distribution to be star-decomposable is that the pairwise correlation coefficients satisfy the triangle inequality \n\n$$\\rho_{jk} \\geq \\rho_{ji} \\rho_{ik} \\quad \\text{with} \\quad \\rho_{ij} = \\frac{\\Sigma_{ij}}{\\sqrt{\\Sigma_{ii} \\Sigma_{jj}}} \\qquad (5)$$ \n\nfor all $i, j, k \\in \\{1, 2, 3\\}$ and $i \\neq j \\neq k$. Equality holds if node $w$ coincides with node $i$. For the lowest level of the hierarchy, nodes $j$ and $k$ represent parts and node $i = w$ represents the common substructure. \n\n2.2  An Objective Function for Grouping Parts \n\nThe algorithm proposed here is based on \"soft\" grouping by means of grouping variables ina where both the grouping variables and the parameter estimates are updated concurrently. The learning algorithms described in (Pearl, 1988) incrementally construct a Bayesian network and decisions made at early stages cannot be reversed. It is hoped that the method proposed here is more robust with regard to inaccuracies of the estimates. However, if the true distribution is not a star-decomposable normal distribution it can only be approximated. \n\nLet $\\mathrm{ina}_{ij}$ be a binary variable associated with the arc connecting node $i$ and node $j$; $\\mathrm{ina}_{ij} = 1$ if the arc is present in the network (ina is the adjacency matrix of the graph describing the structure of the model). 
The model architecture is restricted to a compositional hierarchy (a departure from the more general structure of a Bayesian Network), i.e. nodes are preassigned to levels of the hierarchy (see Figure 4). Based on the condition in equation (5), a cost function term for the grouping variables ina is \n\n$$E_\\rho = \\sum_{w, j, k \\neq j} \\mathrm{ina}_{wj}\\, \\mathrm{ina}_{wk} \\left(\\rho_{wj} \\rho_{wk} - \\rho_{jk}\\right)^2 \\qquad (6)$$ \n\nThe term penalizes the grouping of two part nodes to the same parent if the term in parentheses is large ($j$ and $k$ index part nodes, $w$ nodes at the next higher level in the hierarchy). The $\\mathrm{ina}_{wj}$ can be regarded as assignment variables that assign child nodes $j$ to parent nodes $w$. The parameters at each node and the assignment variables ina are estimated using an EM algorithm (Dempster, Laird and Rubin, 1977; Utans, 1993; Yuille, Stolorz and Utans, 1994). For the details of the implementation of grouping with match networks see (Mjolsness et al., 1989; Mjolsness, 1991; Gindi et al., 1991; Utans, 1992; Utans, 1994). \n\nAt each node for each parameter a probability distribution is stored. Nodes at the lowest level of the hierarchy represent parts in the input data. For the Gaussian distributions used here for all nodes, the parameters are the mean $\\mu$ and the variance $\\sigma^2$ and can be estimated from data. Each part node can potentially be grouped to any substructure at the next higher level in the hierarchy. The parameters of the distributions at this level are estimated from data as well but using the current value of the grouping variables $\\mathrm{ina}_{ij}$ to weight the contribution from each part node. Because each child node $j$ can have only one parent node, an additional constraint for a unique assignment is $\\sum_w \\mathrm{ina}_{wj} = 1$. \n\n3  AN EXAMPLE \n\nInitial simulations of the proposed algorithm were performed using a hierarchical model for dot clusters. 
The training data was generated using the three-level model shown in Figure 5. Each node is parameterized by its position $(x, y)$. The node at the top level represents the position of the entire dot cluster. At the intermediate level nodes represent subcluster centers. The leaf nodes at the lowest level represent individual dots that are output by the model and observed in the image. At each level $l + 1$, stored offsets $d_{ij}^{l+1}$ are added to the parent coordinates $\\mathbf{x}_i^l$ to obtain the coordinates of the child nodes. Then independent, zero-mean Gaussian distributed noise $\\epsilon$ is added: $\\mathbf{x}_j^{l+1} = \\mathbf{x}_i^l + d_{ij}^{l+1} + \\epsilon$. The training data consists of a vector of positions at the lowest level $\\{\\mathbf{x}_j\\}$ with $\\mathbf{x}_j = (x_j, y_j)$, $j = 1 \\ldots 9$ for each exemplar. \n\nThe identity of the parts in the training data is assumed known. In addition, the data consists of parts from a single object. For the simulations, the model architecture is restricted to a three-level hierarchy. Since at the top level a single node represents the entire object, only the grouping variables from the lowest to the intermediate level are unknown (the nodes at the intermediate level are implicitly grouped to the single node at the top level). In the current implementation the parameters of a parent node are defined as the average over the parameters of its child nodes: $\\mathbf{x}_i^l = \\frac{1}{N} \\sum_j \\mathrm{ina}_{ij}\\, \\mathbf{x}_j^{l+1}$. \n\nFor this problem the algorithm has recovered the structure of the model that generated the training data. Thus in this case it is possible to use the correlation coefficients to learn the structure of an object from noisy training exemplars. However, the algorithm does not recover the same parameter values $\\mathbf{x}$ used in the generative model at the intermediate layers. 
These cannot be uniquely specified due to the ambiguity between the parameters $\\mathbf{x}_i$ and offsets $d_{ij}$ (a different choice for $\\mathbf{x}_i$ leads to different values for $d_{ij}$). \n\nFigure 5: The model used to generate the training data (figure labels: Dots, Dot, Global Position, Cluster Center). The structure of the model is a three-level hierarchy. The model parameters are chosen such that the generated dot clusters spatially overlap. On the left, an example of an instance of a dot cluster generated from the model is shown (these constitute the training data). \n\n4  EXTENSIONS \n\nThe results of the initial experiments are encouraging but more research needs to be done before the algorithm can be applied to real data. For the example used here, the training data was generated by a hierarchical model. Thus the distribution of the training exemplars could, in principle, be learned exactly using the proposed model architecture. I plan to study the effect of approximating the distribution of real-world data by applying the method to the problem of learning models for handwritten digit recognition. \n\nThe model should be extended to include provisions to deal with missing data. Instead of being binary variables, $\\mathrm{ina}_{ij}$ could be the conditional probability that part $j$ is present in a typical instance of the object given that the parent node $i$ itself is present (similar to the dot deletion rule described in (Mjolsness, 1991)). These probabilities must also be estimated from data. 
Under this interpretation the $\\mathrm{ina}_{ij}$ are similar to the mixture coefficients in the mixture of experts model (Jordan and Jacobs, 1993). \n\nThe robustness of the algorithm can be improved when the desired locality of the model is explicitly favored via an additional constraint: \n\n$$E_{\\mathrm{local}} = \\lambda \\sum_{ijk} \\mathrm{ina}_{ij}\\, \\mathrm{ina}_{ik}\\, |\\mathbf{x}_j - \\mathbf{x}_k|^2$$ \n\nIn this sense, the toy problem shown here is unnecessarily difficult. Preliminary experiments indicate that including this term reduces the sensitivity to spurious correlations between parts that are far apart. \n\nAs described, the algorithm performs unsupervised grouping; learning the hierarchical model does not take into account the recognition performance obtained when using the model. While the problem of learning and representing models in a hierarchical form is interesting in its own right, the final criterion for judging the model in the context of a recognition problem should be recognition performance. The assumption is that the model should pick up substructures that are specific to a particular class of objects and maximally discriminate between objects belonging to other classes. For example, after an initial model is obtained that roughly captures the structure of the training data, it can be refined on-line during the recognition stage. \n\nAcknowledgements \n\nInitial work on this project was performed while the author was with the International Computer Science Institute, Berkeley, CA. At OGI, support was provided in part under grant ONR N00014-92-J-4062. Discussions with S. Knerr, E. Mjolsness and S. Omohundro were helpful in preparing this work. \n\nReferences \n\nCooper, G. F. and Herskovits, E. (1992), 'A Bayesian method for induction of probabilistic networks from data', Machine Learning 9, 309-347. \n\nCooper, P. R. 
(1989), Parallel Object Recognition from Structure (The Tinkertoy Project), PhD thesis, University of Rochester, Computer Science. Also Technical Report No. 301. \n\nDempster, A. P., Laird, N. M. and Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', J. Royal Statist. Soc. B 39, 1-39. \n\nGindi, G., Mjolsness, E. and Anandan, P. (1991), Neural networks for model based recognition, in 'Neural Networks: Concepts, Applications and Implementations', Prentice-Hall, pp. 144-173. \n\nJordan, M. I. and Jacobs, R. A. (1993), Hierarchical mixtures of experts and the EM algorithm, Technical Report 9301, MIT Computational Cognitive Science. \n\nLeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. (1989), 'Backpropagation applied to handwritten zip code recognition', Neural Computation 1, 541-551. \n\nMarr, D. (1982), Vision, W. H. Freeman and Co., New York. \n\nMjolsness, E. (1991), Bayesian inference on visual grammars by neural nets that optimize, Technical Report YALEU-DCS-TR-854, Yale University, Dept. of Computer Science. \n\nMjolsness, E., Gindi, G. R. and Anandan, P. (1989), 'Optimization in model matching and perceptual organization', Neural Computation 1(2). \n\nOmohundro, S. M. (1991), Bumptrees for efficient function, constraint, and classification learning, in R. Lippmann, J. Moody and D. Touretzky, eds, 'Advances in Neural Information Processing 3', Morgan Kaufmann Publishers, San Mateo, CA. \n\nPearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., San Mateo, CA. \n\nSaund, E. (1993), A multiple cause mixture model for unsupervised learning, Technical report, Xerox PARC, Palo Alto, CA. Preprint, submitted to Neural Computation. \n\nSegen, J. 
(1988a), Learning graph models of shape, in 'Proceedings of the 5th International Conference on Machine Learning'. \n\nSegen, J. (1988b), 'Learning structural description of shape', Machine Vision pp. 257-269. \n\nSimard, P., LeCun, Y. and Denker, J. (1993), Efficient pattern recognition using a new transformation distance, in S. J. Hanson, J. Cowan and L. Giles, eds, 'Advances in Neural Information Processing 5', Morgan Kaufmann Publishers, San Mateo, CA. \n\nUeda, N. and Suzuki, S. (1993), 'Learning visual models from shape contours using multiscale convex/concave structure matching', IEEE Transactions on Pattern Analysis and Machine Intelligence 15(4), 337-352. \n\nUtans, J. (1992), Neural Networks for Object Recognition within Compositional Hierarchies, PhD thesis, Department of Electrical Engineering, Yale University, New Haven, CT 06520. \n\nUtans, J. (1993), Mixture models and the EM algorithm for object recognition within compositional hierarchies. part 1: Recognition, Technical Report TR-93-004, International Computer Science Institute, 1947 Center St., Berkeley, CA 94708. \n\nUtans, J. (1994), 'Mixture models for learning and recognition in compositional hierarchies', in preparation. \n\nXu, L. and Pearl, J. (1987), Structuring causal tree models with continuous variables, in 'Proceedings of the 3rd Workshop on Uncertainty in AI', pp. 170-179. \n\nYuille, A., Stolorz, P. and Utans, J. (1994), 'Statistical physics, mixtures of distributions and the EM algorithm', to appear in Neural Computation. \n\nZemel, R. S., Mozer, M. C. and Hinton, G. E. (1990), Traffic: Recognizing objects using hierarchical reference frame transformations, in D. S. Touretzky, ed., 'Advances in Neural Information Processing 2', Morgan Kaufmann Publishers, San Mateo, CA. 
\n\n\f", "award": [], "sourceid": 840, "authors": [{"given_name": "Joachim", "family_name": "Utans", "institution": null}]}