{"title": "Qualitative structure from motion", "book": "Advances in Neural Information Processing Systems", "page_first": 356, "page_last": 362, "abstract": null, "full_text": "Qualitative structure from motion \n\nDaphna Weinshall \nCenter for Biological Information Processing \nMIT, E25-201, Cambridge MA 02139 \n\nAbstract \n\nExact structure from motion is an ill-posed computation and therefore very sensitive to noise. In this work I describe how a qualitative shape representation, based on the sign of the Gaussian curvature, can be computed directly from motion disparities, without the computation of an exact depth map or the directions of surface normals. I show that humans can judge the curvature sense of three points undergoing 3D motion from two, three and four views with a success rate significantly above chance. A simple RBF net has been trained to perform the same task. \n\n1 INTRODUCTION \n\nWhen a scene is recorded from two or more different positions in space, e.g. by a moving camera, objects are projected into disparate locations in each image. This disparity can be used to recover the three-dimensional structure of objects that is lost in the projection process. The computation of structure requires knowledge of the 3D motion parameters. Although these parameters can themselves be computed from the disparities, their computation presents a difficult problem that is mathematically ill-posed: small perturbations (or errors) in the data may cause large changes in the solution [9]. This brittleness, or sensitivity to noise, is a major factor limiting the applicability of a number of structure from motion algorithms in practical situations (Ullman, 1983). \n\nThe problem of brittleness of the structure from motion algorithms that use the minimal possible information may be attacked through two different approaches. 
One involves using more data, either in the space domain (more corresponding points in each image frame, Bruss & Horn, 1981), or in the time domain (more frames, Ullman, 1984). The other approach is to look for, instead of a general quantitative solution, a qualitative one that would still meet the main requirements of the task for which the computation is performed (e.g., object representation or navigation). This approach has been applied to navigation (e.g., Nelson & Aloimonos, 1988) and object recognition (e.g., Koenderink & van Doorn, 1976; Weinshall, 1989). \n\nUnder perspective projection, the knowledge of the positions of 7 corresponding points in two successive frames is the theoretical lower limit of information necessary to compute the 3D structure of an object that undergoes a general motion (Tsai & Huang, 1984). As mentioned above, acceptable performance of structure from motion algorithms on real, noisy images requires that a larger number of corresponding points be used. In contrast, the human visual system can extract 3D motion information using as few as 3 points in each of the two frames (Borjesson & von Hofsten, 1973). To what extent can object shape be recovered from such impoverished data? I have investigated this question experimentally (by studying the performance of human subjects) and theoretically (by analyzing the information available in the three-point moving stimuli). \n\n2 THEORETICAL SHORTCUTS \n\nThe goal of the structure from motion computation is to obtain the depth map of a moving object: the value of the depth coordinate at each point in the 2D image of the object. The depth map can be used subsequently to build a representation of the object, e.g., for purposes of recognition. 
One possible object representation is the description of an object as a collection of generic parts, where each part is described by a few parameters. Taking the qualitative approach to vision described in the introduction, the necessity of having a complete depth map for building useful generic representations can be questioned. Indeed, one such representation, a map of the sign of the Gaussian curvature of the object's surface, can be computed directly (and, possibly, more reliably) from motion disparities. The knowledge of the sign of the Gaussian curvature of the surface allows the classification of surface patches as elliptic (convex/concave), hyperbolic (saddle point), cylindrical, or planar. Furthermore, the boundaries between adjacent generic parts are located along lines of zero curvature (parabolic lines). \n\nThe basic result that allows the computation of the sign of the Gaussian curvature directly from motion disparities is the following theorem (see Weinshall, 1989 for details): \n\nTheorem 1 Let FOE denote the Focus Of Expansion - the location in the image towards (or away from) which the motion is directed. Pick three collinear points in one image and observe the pattern they form in a subsequent image. The sign of the curvature of these three points in the second image relative to the FOE is the same as the sign of the normal curvature of the 3D curve defined by these three points. \n\nThe sign of the Gaussian curvature at a given point can be found without knowing the direction of the normal to the surface, by computing the curvature sign of point triads in all directions around the point. The sign of the Gaussian curvature is determined by the number of sign reversals of the triad curvatures encountered around the given point. The exact location of the FOE is therefore not important. The sign operator described above has biological appeal, since the visual system can compute the deviation of three points from a straight line with precision in the hyperacuity range (that is, by an order of magnitude more accurately than allowed by the distance between adjacent photoreceptors in the retina). In addition, this feature must be important to the visual system, since it appears to be detected preattentively (in parallel over the entire visual field; see Fahle, 1990). \n\nFigure 1: Experiment 1: perception of curvature from three points in 3D translation. (a) Example concave and convex two-frame stimuli. Four naive subjects were shown two, three or four snapshots of the motion sequence. The subjects did not perceive the motion as translation. The total extent and the speed of the motion were identical in each condition. The three points were always collinear in the first frame. The back and forth motion sequence was repeated eight times, after which the subjects were required to decide on the sign of the curvature (see text). The mean performance, 62%, differed significantly from chance (t = 5.55, p < 0.0001). Furthermore, all subjects but one performed significantly above chance. (b) Proportion of correct responses vs. number of frames: the effect of the number of frames was not significant (χ² = 1.72, p = 0.42). Bars show ±1 standard error of the mean. \n\nIt is difficult to determine whether the visual system uses such a qualitative strategy to characterize shape features, since it is possible that complete structure is first recovered, from which the sign of the Gaussian curvature is then computed. 
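The triad sign test of theorem 1 and the reversal-count classification described above can be sketched as follows. This is a minimal illustration in Python; the function names and the discrete sampling of directions are my own, not taken from the paper.

```python
def cross2(u, v):
    # z-component of the 2D cross product of u and v
    return u[0] * v[1] - u[1] * v[0]

def triad_sign(p1, p2, p3, foe):
    # Sign of the curvature of the image triad (p1, p2, p3) relative to the FOE:
    # +1 if the middle point p2 bulges towards the FOE, -1 if it bulges away,
    # 0 if the three points are collinear. All arguments are (x, y) pairs.
    chord = (p3[0] - p1[0], p3[1] - p1[1])
    mid = (p2[0] - p1[0], p2[1] - p1[1])
    to_foe = (foe[0] - p1[0], foe[1] - p1[1])
    s_mid = cross2(chord, mid)
    s_foe = cross2(chord, to_foe)
    if s_mid == 0:
        return 0
    return 1 if (s_mid > 0) == (s_foe > 0) else -1

def gaussian_curvature_sign(triad_signs):
    # Classify a point from triad signs sampled in successive directions around
    # it: no sign reversal suggests an elliptic patch (K > 0), any reversal a
    # hyperbolic patch (K < 0); all-zero signs indicate a planar patch.
    nonzero = [s for s in triad_signs if s != 0]
    if not nonzero:
        return 0
    reversals = sum(a != b for a, b in zip(nonzero, nonzero[1:]))
    return 1 if reversals == 0 else -1
```

Triads that bulge the same way in every direction around a point yield no reversals, so the point is classified as elliptic; around a saddle the triad sign reverses, so the point is classified as hyperbolic. As the text notes, only the rough direction of the FOE matters, since the sign tests are insensitive to its exact location.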
In the following experiments I present subjects with impoverished data that is insufficient for exact structure from motion (3 points in 2 frames). If subjects can perform the task, they have to use some strategy different from exact depth recovery. \n\n3 EXPERIMENT 1 \n\nIn the first experiment four subjects were presented with 120 moving rigid configurations of three points. The number of distinct frames per configuration varied from 2 to 4. The motion was translation only. Subjects had to judge whether the three points were in a convex or a concave configuration, namely, whether the broken 3D line formed by the points was bent towards or away from the subject (figure 1a). The middle point was almost never the closest or the farthest one, so that relative depth was not sufficient for solving the problem. With only two frames the stimulus was ambiguous in that there was an infinity of rigid convex and concave 3D configurations of three points that could have given rise to the images presented. For these stimuli the correct answer is meaningless, and one important question is whether this inherent ambiguity affects the subjects' performance (as compared to their performance with 3 and 4 frames). \n\nThe subjects' performance in this experiment was significantly better than chance (figure 1b). The subjects were able to recover partial information on the shape of the stimulus even with 2 frames, despite the theoretical impossibility of a full structure from motion computation¹. Moreover, the number of frames presented in each trial had no significant effect on the error rate: the subjects performed just as well in the 2 frame trials as in the 3 and 4 frame trials (figure 1b). Had the subjects relied on the exact computation of structure from motion, one would expect a better performance with more frames (Ullman, 1984; Hildreth et al., 1989). 
\n\nOne possible account (reconstructional) of this result is that subjects realized that the motion of the stimuli consisted of pure 3D translation. Three points in two frames are in principle sufficient to verify that the motion is translational and to compute the translation parameters. The next experiment renders this account implausible by demonstrating that the subjects perform as well when the stimuli undergo general motion that includes rotation as well as translation. \n\nAnother possible (geometrical) account is that the human visual system incorporates the geometrical knowledge expressed by theorem 1, and uses this knowledge in ambiguous cases to select the more plausible answer. However, theorem 1 does not address the ambiguity of the stimulus that stems from the dependency of the result on the location of the Focus Of Expansion. If indeed some knowledge of this theorem is used in performing this task, the ambiguity has to be resolved by \"guessing\" the location of the FOE. The strategy consistent with human performance in the first experiment is assuming that the FOE lies in the general direction towards which the points in the image are moving. The next experiment is designed to check the use of this heuristic. \n\n4 EXPERIMENT 2 \n\nThis experiment was designed to clarify which of the two proposed explanations of the subjects' good performance in experiment 1 with only 2 frames is more plausible. \n\nFirst, to eliminate completely the cue to exact depth in a translational motion, the stimuli in experiment 2 underwent rotation as well as translation. The 3D motion was set up in such a manner that the projected 2D optical flow could not be interpreted as resulting from pure translational motion. 
\n\nSecond, if subjects do use an implicit knowledge of theorem 1, the accuracy of their performance should depend on the correctness of the heuristic used to estimate the location of the FOE as discussed in the previous section. This heuristic yields incorrect results for many instances of general 3D motion. In experiment 2, two types of 3-point 2-frame motion were used: one in which the estimation of the FOE using the above heuristic is correct, and one in which this estimation is wrong. If subjects rely on an implicit knowledge of theorem 1, their judgement should be mostly correct for the first type of motion, and mostly incorrect for the second type. \n\n¹I should note that all the subjects were surprised by their good performance. They felt that the stimulus was ambiguous and that they were mostly guessing. \n\nFigure 2: Experiment 2: three points in general motion. The same four subjects as in experiment 1 were shown two-frame sequences of back and forth motion that included 3D translation and rotation. The mean performance when the FOE heuristic (see text) was correct, 71%, was significantly above chance (t = 5.71, p < 0.0001). In comparison, the mean performance when the FOE heuristic was misleading, 26%, was significantly below chance (t = -4.90, p < 0.0001). The degree to which the motion could be mistakenly interpreted as pure translation was uncorrelated with performance (r = 0.04, F(1, 318) < 1). The performance in experiment 2 was similar to that in experiment 1 (the difference was not significant, χ² < 1). In other words, the performance was as good under general motion as under pure translation. \n\nFigure 2 describes the results of experiment 2. 
As in the first experiment, the subjects performed significantly above chance when the FOE estimation heuristic was correct. When the heuristic was misleading, they were as likely to be wrong as they were likely to be right in the correct heuristic condition. As predicted by the geometrical explanation of the first experiment, seeing general motion instead of pure translation did not seem to affect the performance. \n\n5 LEARNING WITH A NEURAL NETWORK \n\nComputation of qualitative structure from motion, outlined in section 2, can be supported by a biologically plausible architecture based on the application of a three-point hyperacuity operator, in parallel, in different directions around each point and over the entire visual field. Such a computation is particularly suitable to implementation by an artificial neural network. I have trained a Radial Basis Function (RBF) network (Moody & Darken, 1989; Poggio & Girosi, 1990) to identify the sign of Gaussian curvature of three moving points (represented by a coordinate vector of length 6). After a supervised learning phase in which the network was trained to produce the correct sign given examples of motion sequences, it consistently achieved a substantial success rate on novel inputs, for a wide range of parameters. Figure 3 shows the success rate (the percentage of correct answers) plotted against the number of examples used in the training phase. \n\nFigure 3: The correct performance rate of the RBF implementation vs. the number of examples in the training set. \n\n6 SUMMARY \n\nI have presented a qualitative approach to the problem of recovering object structure from motion information and discussed some of its computational, psychophysical and implementational aspects. 
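As a rough illustration of the kind of network used in section 5, here is a minimal RBF classifier in the style of Moody & Darken (1989): Gaussian units with centers drawn from the training set and linear output weights fit by least squares. The 6-dimensional inputs and the labelling rule below are synthetic stand-ins for illustration only, not the stimuli from the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(X, centers, width):
    # Gaussian activations of each input vector around each center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, y, n_centers, width):
    # Fixed centers sampled from the data; output weights by least squares
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    Phi = rbf_features(X, centers, width)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def predict(X, centers, w, width):
    # Predicted class is the sign of the network output
    return np.sign(rbf_features(X, centers, width) @ w)

# Synthetic stand-in task: classify the sign of a simple function of a
# 6-dimensional coordinate vector (hypothetical data, for illustration only)
X = rng.normal(size=(300, 6))
y = np.sign(X[:, 0] + X[:, 1])
centers, w = train_rbf(X[:200], y[:200], n_centers=40, width=2.0)
accuracy = (predict(X[200:], centers, w, width=2.0) == y[200:]).mean()
```

On this toy task the held-out accuracy is well above chance, qualitatively mirroring the trend in figure 3: performance on novel inputs improves as the number of training examples grows.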
The computation of qualitative shape, as represented by the sign of the Gaussian curvature, can be performed by a field of simple operators, in parallel over the entire image. The performance of a qualitative shape detection module, implemented by an artificial neural network, appears to be similar to the performance of human subjects in an identical task. \n\nAcknowledgements \n\nI thank H. Bülthoff, N. Cornelius, M. Dornay, S. Edelman, M. Fahle, S. Kirkpatrick, M. Ross and A. Shashua for their help. This research was done partly in the MIT AI Laboratory. It was supported by a Fairchild postdoctoral fellowship, and in part by grants from the Office of Naval Research (N00014-88-K-0164), from the National Science Foundation (IRI-8719394 and IRI-8657824), and a gift from the James S. McDonnell Foundation to Professor Ellen Hildreth. \n\nReferences \n\n[1] E. Borjesson and C. von Hofsten. Visual perception of motion in depth: application of a vector model to three-dot motion patterns. Perception and Psychophysics, 13:169-179, 1973. \n\n[2] A. Bruss and B. K. P. Horn. Passive navigation. Computer Vision, Graphics, and Image Processing, 21:3-20, 1983. \n\n[3] M. W. Fahle. Parallel, semi-parallel, and serial processing of visual hyperacuity. In Proc. SPIE Conf. on Electronic Imaging: Science and Technology, Santa Clara, CA, February 1990. To appear. \n\n[4] E. C. Hildreth, N. M. Grzywacz, E. H. Adelson, and V. K. Inada. The perceptual buildup of three-dimensional structure from motion. Perception & Psychophysics, 1989. In press. \n\n[5] J. J. Koenderink and A. J. van Doorn. Local structure of movement parallax of the plane. Journal of the Optical Society of America, 66:717-723, 1976. \n\n[6] J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1:281-289, 1989. \n\n[7] R. C. Nelson and J. Aloimonos. 
Using flow field divergence for obstacle avoidance: towards qualitative vision. In Proceedings of the 2nd International Conference on Computer Vision, pages 188-196, Tarpon Springs, FL, 1988. IEEE, Washington, DC. \n\n[8] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990. \n\n[9] T. Poggio and C. Koch. Ill-posed problems in early vision: from computational theory to analog networks. Proceedings of the Royal Society of London B, 226:303-323, 1985. \n\n[10] R. Y. Tsai and T. S. Huang. Uniqueness and estimation of three dimensional motion parameters of rigid objects with curved surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:13-27, 1984. \n\n[11] S. Ullman. Computational studies in the interpretation of structure and motion: summary and extension. In J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision. Academic Press, New York, 1983. \n\n[12] S. Ullman. Maximizing rigidity: the incremental recovery of 3D structure from rigid and rubbery motion. Perception, 13:255-274, 1984. \n\n[13] D. Weinshall. Direct computation of 3D shape and motion invariants. A.I. Memo No. 1131, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, May 1989.", "award": [], "sourceid": 317, "authors": [{"given_name": "Daphna", "family_name": "Weinshall", "institution": null}]}