{"title": "2D Observers for Human 3D Object Recognition?", "book": "Advances in Neural Information Processing Systems", "page_first": 829, "page_last": 835, "abstract": "", "full_text": "2D Observers for Human 3D Object Recognition? \n\nZili Liu \n\nNEC Research Institute \n\nDaniel Kersten \n\nUniversity of Minnesota \n\nAbstract \n\nConverging evidence has shown that human object recognition \ndepends on familiarity with the images of an object. Further, \nthe greater the similarity between objects, the stronger is the \ndependence on object appearance, and the more important \ntwo-dimensional (2D) image information becomes. These findings, \nhowever, do not rule out the use of 3D structural information in \nrecognition, and the degree to which 3D information is used in visual \nmemory is an important issue. Liu, Knill, & Kersten (1995) showed \nthat any model that is restricted to rotations in the image plane \nof independent 2D templates could not account for human \nperformance in discriminating novel object views. We now present results \nfrom models of generalized radial basis functions (GRBF), 2D \nnearest neighbor matching that allows 2D affine transformations, and \na Bayesian statistical estimator that integrates over all possible 2D \naffine transformations. The performance of the human observers \nrelative to each of the models is better for the novel views than \nfor the familiar template views, suggesting that humans generalize \nbetter to novel views from template views. The Bayesian \nestimator yields the optimal performance with 2D affine transformations \nand independent 2D templates. Therefore, models of 2D affine \nmatching operations with independent 2D templates are unlikely \nto account for human recognition performance. \n\n1 Introduction \n\nObject recognition is one of the most important functions in human vision. 
To \nunderstand human object recognition, it is essential to understand how objects are \nrepresented in human visual memory. A central component in object recognition \nis the matching of the stored object representation with that derived from the \nimage input. But the nature of the object representation has to be inferred from \nrecognition performance, by taking into account the contribution from the image \ninformation. When evaluating human performance, how can one separate the \ncontribution of the image information from that of the representation? Ideal \nobserver analysis provides a precise computational tool to answer this question. An \nideal observer's recognition performance is restricted only by the available image \ninformation and is otherwise optimal, in the sense of statistical decision theory, \nirrespective of how the model is implemented. A comparison of human to ideal \nperformance (often in terms of efficiency) serves to normalize performance with \nrespect to the image information for the task. We consider the problem of viewpoint \ndependence in human recognition. \n\nA recent debate in human object recognition has focused on the dependence of \nrecognition performance on viewpoint [1, 6]. Depending on the experimental conditions, \nan observer's ability to recognize a familiar object from novel viewpoints is impaired \nto varying degrees. A central assumption in the debate is that the viewpoint \ndependence of recognition performance is equivalent to the viewpoint dependence \nof the representation. In other words, the assumption is \nthat viewpoint dependent performance implies a viewpoint dependent \nrepresentation, and that viewpoint independent performance implies a viewpoint independent \nrepresentation. 
However, given that any recognition performance depends on the \ninput image information, which is necessarily viewpoint dependent, the viewpoint \ndependence of the performance is neither necessary nor sufficient for the viewpoint \ndependence of the representation. Image information has to be factored out first, \nand the ideal observer provides the means to do this. \n\nThe second aspect of an ideal observer is that it is implementation free. \nConsider the GRBF model [5], as compared with human object recognition (see \nbelow). The model stores a number of 2D templates $\\{T_i\\}$ of a 3D object O, \nand recognizes or rejects a stimulus image S by the following similarity measure: \n$\\sum_i c_i \\exp\\left(-\\|T_i - S\\|^2 / 2\\sigma^2\\right)$, where $c_i$ and $\\sigma$ are constants. The model's performance \nas a function of viewpoint parallels that of human observers. This observation has \nled to the conclusion that the human visual system may indeed, as does the model, \nuse 2D stored views with GRBF interpolation to recognize 3D objects [2]. Such a \nconclusion, however, overlooks implementational constraints in the model, because \nthe model's performance also depends on its implementation. Conceivably, a model \nwith some 3D information of the objects can also mimic human performance, so \nlong as it is appropriately implemented. There are typically too many possible \nmodels that can produce the same pattern of results. \n\nIn contrast, an ideal observer computes the optimal performance that is only limited \nby the stimulus information and the task. We can define constrained ideals that are \nalso limited by explicitly specified assumptions (e.g., a class of matching operations). \nSuch a model observer therefore yields the best possible performance among the \nclass of models with the same stimulus input and assumptions. 
\nIn this paper, \nwe are particularly interested in constrained ideal observers that are restricted in \nfunctionally significant aspects (e.g., a 2D ideal observer that stores independent \n2D templates and has access only to 2D affine transformations). The key idea is \nthat a constrained ideal observer is the best in its class. So if humans outperform \nthis ideal observer, they must have used more than what is available to the ideal. \nThe conclusion that follows is strong: not only does the constrained ideal fail to \naccount for human performance, but the whole class of its implementations is also \nfalsified. \n\nA crucial question in object recognition is the extent to which human observers \nmodel the geometric variation in images due to the projection of a 3D object onto a \n2D image. At one extreme, we have shown that any model that compares the image \nto independent views (even if we allow for 2D rigid transformations of the input \nimage) is insufficient to account for human performance. At the other extreme, it \nis unlikely that variation is modeled in terms of rigid transformation of a 3D object \ntemplate in memory. A possible intermediate solution is to match the input image \nto stored views, subject to 2D affine deformations. This is reasonable because 2D \naffine transformations approximate 3D variation over a limited range of viewpoint \nchange. \n\nIn this study, we test whether any model limited to the independent comparison \nof 2D views, but with 2D affine flexibility, is sufficient to account for viewpoint \ndependence in human recognition. In the following section, we first define our \nexperimental task, in which the computational models yield the provably best possible \nperformance under their specified conditions. We then review the 2D ideal observer \nand GRBF model derived in [4], and the 2D affine nearest neighbor model in [8]. 
\nOur principal theoretical result is a closed-form solution of a Bayesian 2D affine ideal \nobserver. We then compare human performance with the 2D affine ideal model, as \nwell as the other three models. In particular, if humans can classify novel views of \nan object better than the 2D affine ideal, then our human observers must have used \nmore information than that embodied by that ideal. \n\n2 The observers \n\nLet us first define the task. An observer looks at the 2D images of a 3D wire \nframe object from a number of viewpoints. These images will be called templates \n$\\{T_i\\}$. Then two distorted copies of the original 3D object are displayed. They \nare obtained by adding 3D Gaussian positional noise (i.i.d.) to the vertices of the \noriginal object. One distorted object is called the target, whose Gaussian noise has \na constant variance. The other is the distractor, whose noise has a larger variance \nthat can be adjusted to achieve a criterion level of performance. The two objects \nare displayed from the same viewpoint in parallel projection, which is either from \none of the template views, or a novel view due to 3D rotation. The task is to choose \nthe one that is more similar to the original object. The observer's performance is \nmeasured by the distractor variance (threshold) that gives rise to 75% correct performance. \nThe optimal strategy is to choose the stimulus S with a larger probability $p(O|S)$. \nFrom Bayes' rule, this is to choose the larger of $p(S|O)$. \nAssume that the models are restricted to 2D transformations of the image, and \ncannot reconstruct the 3D structure of the object from its independent templates \n$\\{T_i\\}$. Assume also that the prior probability $p(T_i)$ is constant. Let us represent S \nand $T_i$ by their (x, y) vertex coordinates: $(X\\ Y)^T$, where $X = (x^1, x^2, \\ldots, x^n)$, \n$Y = (y^1, y^2, \\ldots, y^n)$. 
We assume that the correspondence between S and $T_i$ is \nsolved up to a reflection ambiguity, which is equivalent to an additional template: \n$T_i^r = (X^r\\ Y^r)^T$, where $X^r = (x^n, \\ldots, x^2, x^1)$, $Y^r = (y^n, \\ldots, y^2, y^1)$. We still \ndenote the template set as $\\{T_i\\}$. Therefore, \n\n$$p(S|O) = \\sum_i p(S|T_i)\\, p(T_i).$$ \n\n(1) \n\nIn what follows, we will compute $p(S|T_i)\\,p(T_i)$, with the assumption that \n$S = \\mathcal{F}(T_i) + N(0, \\sigma I_{2n})$, where N is the Gaussian distribution, $I_{2n}$ the $2n \\times 2n$ identity \nmatrix, and $\\mathcal{F}$ a 2D transformation. For the 2D ideal observer, $\\mathcal{F}$ is a rigid 2D \nrotation. For the GRBF model, $\\mathcal{F}$ assigns a linear coefficient to each template \n$T_i$, in addition to a 2D rotation. For the 2D affine nearest neighbor model, $\\mathcal{F}$ \nrepresents the 2D affine transformation that minimizes $\\|S - T_i\\|^2$, after S and $T_i$ \nare normalized in size. For the 2D affine ideal observer, $\\mathcal{F}$ represents all possible \n2D affine transformations applicable to $T_i$. \n\n2.1 The 2D ideal observer \n\nThe templates are the original 2D images, their mirror reflections, and 2D rotations \n(in angle $\\phi$) in the image plane. Assuming that the stimulus S is generated by adding \nGaussian noise to a template, the probability $p(S|O)$ is an integral over all \ntemplates and their reflections and rotations. The detailed derivation for the 2D \nideal and the GRBF model can be found in [4]. \n\n$$\\sum_i p(S|T_i)\\, p(T_i) \\propto \\sum_i \\int d\\phi\\, \\exp\\left(-\\|S - T_i(\\phi)\\|^2 / 2\\sigma^2\\right).$$ \n\n(2) \n\n2.2 The GRBF model \n\nThe model has the same template set as the 2D ideal observer does. Its training \nrequires that $\\sum_i \\int_0^{2\\pi} d\\phi\\, c_i(\\phi)\\, N(\\|T_j - T_i(\\phi)\\|, \\sigma) = 1$, $j = 1, 2, \\ldots$, with which $\\{c_i\\}$ \ncan be obtained optimally using singular value decomposition. When a pair of new \nstimuli $\\{S\\}$ are presented, the optimal decision is to choose the one that is closer \nto the learned prototype, in other words, the one with a smaller value of \n\n$$\\left\\| 1 - \\sum_i \\int_0^{2\\pi} d\\phi\\, c_i(\\phi) \\exp\\left(-\\frac{\\|S - T_i(\\phi)\\|^2}{2\\sigma^2}\\right) \\right\\|.$$ 
\n\n(3) \n\n2.3 The 2D affine nearest neighbor model \n\nIt has been proved in [8] that, when T is allowed a 2D affine transformation and \nS and T are normalized in size ($S \\leftarrow S/\\|S\\|$, $T \\leftarrow T/\\|T\\|$), the smallest Euclidean \ndistance D(S, T) between S and T is given by \n\n$$D^2(S, T) = 1 - \\mathrm{tr}(S^+ S \\cdot T^T T)/\\|T\\|^2,$$ \n\n(4) \n\nwhere tr stands for trace, and $S^+ = S^T (S S^T)^{-1}$. The optimal strategy, therefore, \nis to choose the S that gives rise to the larger of $\\sum_i \\exp\\left(-D^2(S, T_i)/2\\sigma^2\\right)$, or the \nsmaller of $\\sum_i D^2(S, T_i)$. (Since no probability is defined in this model, both measures \nwill be used and the results from the better one will be reported.) \n\n2.4 The 2D affine ideal observer \n\nWe now calculate the Bayesian probability by assuming that the prior probability \ndistribution of the 2D affine transformation, which is applied to the template \n$T_i$, $A T_i + T_r = \\begin{pmatrix} a & b \\\\ c & d \\end{pmatrix} T_i + \\begin{pmatrix} t_x & \\cdots & t_x \\\\ t_y & \\cdots & t_y \\end{pmatrix}$, obeys a Gaussian distribution \n$N(X_0, \\gamma I_6)$, where $X_0$ is the identity transformation, $X_0^T = (a, b, c, d, t_x, t_y) = (1, 0, 0, 1, 0, 0)$. We have \n\n$$\\sum_i p(S|T_i) = \\sum_i \\int_{-\\infty}^{\\infty} dX\\, \\exp\\left(-\\|A T_i + T_r - S\\|^2 / 2\\sigma^2\\right)$$ \n\n(5) \n\n$$= \\sum_i C(n, \\sigma, \\gamma)\\, \\det{}^{-1}(Q_i')\\, \\exp\\left(\\mathrm{tr}\\left(K_i^T Q_i (Q_i')^{-1} Q_i K_i\\right) / 2\\sigma^2\\right),$$ \n\n(6) \n\nwhere $C(n, \\sigma, \\gamma)$ is a function of $n$, $\\sigma$, $\\gamma$; $Q' = Q + \\gamma^{-2} I_2$, and \n\n$$Q = \\begin{pmatrix} X_T \\cdot X_T & X_T \\cdot Y_T \\\\ Y_T \\cdot X_T & Y_T \\cdot Y_T \\end{pmatrix}, \\quad QK = \\begin{pmatrix} X_T \\cdot X_S & Y_T \\cdot X_S \\\\ X_T \\cdot Y_S & Y_T \\cdot Y_S \\end{pmatrix} + \\gamma^{-2} I_2.$$ \n\n(7) \n\nThe free parameters are $\\gamma$ and the number of 2D rotated copies for each $T_i$ (since \na 2D affine transformation implicitly includes 2D rotations, and since a specific \nprior probability distribution $N(X_0, \\gamma I_6)$ is assumed, both free parameters should \nbe explored together to search for the optimal results). \n\nFigure 1: Stimulus classes with increasing structural regularity: Balls, Irregular, \nSymmetric, and V-Shaped. There were three objects in each class in the experiment. 
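As a hedged illustration of the affine-invariant distance in Eq. (4) of Section 2.3, the sketch below evaluates the formula as printed, using NumPy. The function name affine_distance_sq, the 2 x n array layout, and the centring step that absorbs translation are assumptions made for illustration; they are not taken from [8] or from the original simulations.

```python
import numpy as np


def affine_distance_sq(S, T):
    # S, T: 2 x n arrays of vertex coordinates (row 0 = x, row 1 = y).
    S = np.asarray(S, dtype=float)
    T = np.asarray(T, dtype=float)
    # Assumption: centre both point sets so that translation is absorbed.
    S = S - S.mean(axis=1, keepdims=True)
    T = T - T.mean(axis=1, keepdims=True)
    # S^+ S is the n x n projector onto the row space of S.
    P = np.linalg.pinv(S) @ S
    # Eq. (4): D^2 = 1 - tr(S^+ S . T^T T) / ||T||^2
    return 1.0 - np.trace(P @ T.T @ T) / np.sum(T ** 2)


# Small usage example with random six-vertex shapes.
rng = np.random.default_rng(0)
S = rng.normal(size=(2, 6))
T = rng.normal(size=(2, 6))
A = np.array([[1.5, 0.3], [-0.2, 0.8]])   # an invertible 2 x 2 linear map
shift = np.array([[2.0], [-1.0]])         # a translation
d_random = affine_distance_sq(S, T)
d_affine_copy = affine_distance_sq(S, A @ S + shift)
```

In this sketch the value lies between 0 and 1, it is zero (up to rounding) when T is an exact affine image of S, and it does not change when S is replaced by an invertible affine image of itself.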
\n\n2.5 The human observers \n\nThree naive subjects were tested with four classes of objects: Balls, Irregular, \nSymmetric, and V-Shaped (Fig. 1). There were three objects in each class. For each \nobject, 11 template views were learned by rotating the object 60\u00b0/step, around \nthe X- and Y-axis, respectively. The 2D images were generated by orthographic \nprojection, and viewed monocularly. The viewing distance was 1.5 m. During the \ntest, the standard deviation of the Gaussian noise added to the target object was \n$\\sigma_t = 0.254$ cm. No feedback was provided. \n\nBecause the image information available to the humans was more than what was \navailable to the models (shading and occlusion in addition to the (x, y) positions of \nthe vertices), both learned and novel views were tested in a randomly interleaved \nfashion. Therefore, the strategy that humans used in the task for the learned and \nnovel views should be the same. The number of self-occlusions, which in \nprinciple provided relative depth information, was counted and was about equal in both \nlearned and novel view conditions. The shading information was also likely to be \nequal for the learned and novel views. Therefore, this additional information was \nabout equal for the learned and novel views, and should not affect the comparison \nof the performance (humans relative to a model) between learned and novel views. \nWe predict that if the humans used a 2D affine strategy, then their performance \nrelative to the 2D affine ideal observer should not be higher for the novel views than \nfor the learned views. One reason to use the four classes of objects with increasing \nstructural regularity is that structural regularity is a 3D property (e.g., 3D \nSymmetric vs. Irregular), which the 2D models cannot capture. 
The exception is the \nplanar V-Shaped objects, for which the 2D affine models completely capture 3D \nrotations, and are therefore the \"correct\" models. The V-Shaped objects were used in \nthe 2D affine case as a benchmark. If human performance increases with increasing \nstructural regularity of the objects, this would lend support to the hypothesis that \nhumans have used 3D information in the task. \n\n2.6 Measuring performance \n\nA staircase procedure [7] was used to track the observers' performance at 75% \ncorrect level for the learned and novel views, respectively. There were 120 trials \nfor the humans, and 2000 trials for each of the models. For the GRBF model, \nthe standard deviation of the Gaussian function was also sampled to search for \nthe best result for the novel views for each of the 12 objects, and the result for \nthe learned views was obtained accordingly. This resulted in a conservative test \nof the hypothesis of a GRBF model for human vision for the following reasons: \n(1) Since no feedback was provided in the human experiment and the learned and \nnovel views were randomly intermixed, it is not straightforward for the model to \nfind the best standard deviation for the novel views, particularly because the best \nstandard deviation for the novel views was not the same as that for the learned \nones. The performance for the novel views is therefore the upper limit of the \nmodel's performance. (2) The subjects' performance relative to the model will be \ndefined as statistical efficiency (see below). The above method will yield the lowest \npossible efficiency for the novel views, and a higher efficiency for the learned views, \nsince the best standard deviation for the novel views is different from that for the \nlearned views. 
Because our hypothesis depends on a higher statistical efficiency for \nthe novel views than for the learned views, this method will make such a putative \ndifference even smaller. Likewise, for the 2D affine ideal, the number of 2D rotated \ncopies of each template $T_i$ and the value $\\gamma$ were both extensively sampled, and the \nbest performance for the novel views was selected accordingly. The result for the \nlearned views corresponding to the same parameters was selected. This choice also \nmakes it a conservative hypothesis test. \n\n3 Results \n\n[Figure 2 panels: Learned Views, Novel Views. Legend: Human, 2D Ideal, GRBF, 2D Affine Nearest Neighbor, 2D Affine Ideal. Horizontal axis: Object Type.] \n\nFigure 2: The threshold standard deviation of the Gaussian noise, added to the \ndistractor in the test pair, that keeps an observer's performance at the 75% correct \nlevel, for the learned and novel views, respectively. The dotted line is the standard \ndeviation of the Gaussian noise added to the target in the test pair. \n\nFig. 2 shows the threshold performance. We use statistical efficiency E to \ncompare human to model performance. E is defined as the information used by \nhumans relative to the ideal observer [3]: $E = (d'_{\\mathrm{human}} / d'_{\\mathrm{ideal}})^2$, where $d'$ \nis the discrimination index. We have shown in [4] that, in our task, \n$E = \\left((\\sigma^{\\mathrm{ideal}}_{\\mathrm{distractor}})^2 - (\\sigma_{\\mathrm{target}})^2\\right) / \\left((\\sigma^{\\mathrm{human}}_{\\mathrm{distractor}})^2 - (\\sigma_{\\mathrm{target}})^2\\right)$, where $\\sigma$ is the \nthreshold. Fig. 3 shows the statistical efficiency of the human observers relative to each \nof the four models. \nWe note in Fig. 3 that the efficiencies for the novel views are higher than those for the \nlearned views (several of them even exceeded 100%), except for the planar V-Shaped \nobjects. 
We are particularly interested in the Irregular and Symmetric objects in \nthe 2D affine ideal case, in which the pairwise comparison between the learned \nand novel views across the six objects and three observers yielded a significant \ndifference (binomial, p < 0.05). This suggests that the 2D affine ideal observer \ncannot account for the human performance, because if the humans had used a 2D affine \ntemplate matching strategy, their relative performance for the novel views could not \nhave been better than for the learned views. We suggest therefore that 3D information was \nused by the human observers (e.g., 3D symmetry). This is supported in addition \nby the increasing efficiencies as the structural regularity increased from the Balls, \nIrregular, to Symmetric objects (except for the V-Shaped objects with 2D affine \nmodels). \n\n[Figure 3 panels: 2D Ideal, GRBF Model, 2D Affine Nearest Neighbor, 2D Affine Ideal. Legend: Learned, Novel. Horizontal axis: Object Type.] \n\nFigure 3: Statistical efficiencies of human observers relative to the 2D ideal observer, \nthe GRBF model, the 2D affine nearest neighbor model, and the 2D affine ideal \nobserver. \n\n4 Conclusions \n\nComputational models of visual cognition are subject to information-theoretic as \nwell as implementational constraints. When a model's performance mimics that of \nhuman observers, it is difficult to interpret which aspects of the model characterize \nthe human visual system. 
For example, human object recognition could be \nsimulated by both a GRBF model and a model with partial 3D information of the object. \nThe approach we advocate here is that, instead of trying to mimic human \nperformance by a computational model, one designs an implementation-free model for a \nspecific recognition task that yields the best possible performance under explicitly \nspecified computational constraints. This model provides a well-defined benchmark \nfor performance, and if human observers outperform it, we can conclude firmly that \nthe humans must have used better computational strategies than the model. We \nshowed that models of independent 2D templates with 2D linear operations cannot \naccount for human performance. This suggests that our human observers may have \nused the templates to reconstruct a representation of the object with some (possibly \ncrude) 3D structural information. \n\nReferences \n\n[1] Biederman I and Gerhardstein P C. Viewpoint dependent mechanisms in visual \nobject recognition: a critical analysis. J. Exp. Psych.: HPP, 21:1506-1514, 1995. \n\n[2] B\u00fclthoff H H and Edelman S. Psychophysical support for a 2D view interpolation \ntheory of object recognition. Proc. Natl. Acad. Sci., 89:60-64, 1992. \n\n[3] Fisher R A. Statistical Methods for Research Workers. Oliver and Boyd, \nEdinburgh, 1925. \n\n[4] Liu Z, Knill D C, and Kersten D. Object classification for human and ideal \nobservers. Vision Research, 35:549-568, 1995. \n\n[5] Poggio T and Edelman S. A network that learns to recognize three-dimensional \nobjects. Nature, 343:263-266, 1990. \n\n[6] Tarr M J and B\u00fclthoff H H. Is human object recognition better described \nby geon-structural-descriptions or by multiple-views? J. Exp. Psych.: HPP, \n21:1494-1505, 1995. \n\n[7] Watson A B and Pelli D G. QUEST: A Bayesian adaptive psychometric method. \nPerception and Psychophysics, 33:113-120, 1983. 
\n\n[8] Werman M and Weinshall D. Similarity and affine invariant distances between \n2D point sets. IEEE PAMI, 17:810-814, 1995. \n", "award": [], "sourceid": 1351, "authors": [{"given_name": "Zili", "family_name": "Liu", "institution": null}, {"given_name": "Daniel", "family_name": "Kersten", "institution": null}]}