{"title": "Stereopsis by a Neural Network Which Learns the Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 327, "page_last": 334, "abstract": null, "full_text": "Stereopsis by a Neural Network \nWhich Learns the Constraints \n\nAlireza Khotanzad and Ying-Wung Lee \nImage Processing and Analysis Laboratory \n\nElectrical Engineering Department \nSouthern Methodist University \n\nDallas, Texas 75275 \n\nAbstract \n\nThis paper presents a neural network (NN) approach to the problem of \nstereopsis. The correspondence problem (finding the correct matches \nbetween the pixels of the epipolar lines of the stereo pair from amongst all \nthe possible matches) is posed as a non-iterative many-to-one mapping . A \ntwo-layer feed forward NN architecture is developed to learn and code this \nnonlinear and complex mapping using the back-propagation learning rule \nand a training set. The important aspect of this technique is that none of \nthe typical constraints such as uniqueness and continuity are explicitly \nimposed. All the applicable constraints are learned and internally coded \nby the NN enabling it to be more flexible and more accurate than the \nexisting methods. The approach is successfully tested on several random(cid:173)\ndot stereograms. It is shown that the net can generalize its learned map(cid:173)\nping to cases outside its training set. Advantages over the Marr-Poggio \nAlgorithm are discussed and it is shown that the NN performance is supe(cid:173)\nrIOr. \n\n1 INTRODUCTION \n\nThree-dimensional image processing is an indispensable property for any advanced \ncomputer vision system. Depth perception is an integral part of 3-d processing. It \ninvolves computation of the relative distances of the points seen in the 2-d images \nto the imaging device. There are several methods to obtain depth information. A \ncommon technique is stereo imaging. 
It uses two cameras displaced by a known distance to generate two images of the same scene taken from these two different viewpoints. Distances to objects can be computed if corresponding points are identified in both frames. Corresponding points are two image points which correspond to the same object point in the 3-d space as seen by the left and the right cameras, respectively. Thus, solving the so-called \"correspondence problem\" is the essential stage of depth perception by stereo imaging.\n\nMany computational approaches to the correspondence problem have been studied in the past. An exhaustive review of such techniques is given in the survey article by Dhond and Aggarwal (1989). Common to all such techniques is the employment of some constraints to limit the computational requirements and also reduce the ambiguity. They usually consist of strict rules that are fixed a priori and are based on a rough model of the surface to be solved for. Unfortunately, psychophysical evidence of human stereopsis suggests that the appropriate constraints are more complex and more flexible than can be characterized by simple fixed rules.\n\nIn this paper, we suggest a novel approach to the stereo correspondence problem via neural networks (NN). The problem is cast into a mapping framework and subsequently solved by a NN which is especially suited to such tasks. An important aspect of this approach is that the appropriate constraints are automatically learned and generalized by the net, resulting in a flexible and more accurate model. The iterative algorithm developed by Marr and Poggio (1976) can be regarded as a crude neural network approach with no embedded learning. In fact, the initial stages of the proposed technique follow the same initial steps taken in that algorithm. 
However, the later stages of the two algorithms are quite distinct, with ours involving a learning process and non-iterative operation.\n\nThere have been other recent attempts to solve the correspondence problem by neural networks. Among these are O'Toole (1989), Qian and Sejnowski (1988), Sun et al. (1987), and Zhou and Chellappa (1988). These studies use different approaches and topologies from the one used in this paper.\n\n2 DESCRIPTION OF THE APPROACH\n\nThe proposed approach poses the correspondence problem as a mapping problem and uses a special kind of NN to learn this mapping. The only constraint that is explicitly imposed is the \"epipolar\" constraint. It states that the match of a point in row m of one of the two images can only be located in row m of the other image. This helps to reduce the computation by restricting the search area.\n\n2.1 CORRESPONDENCE PROBLEM AS A MAPPING PROBLEM\n\nThe initial phase of the procedure involves casting the correspondence problem as a many-to-one mapping problem. To explain the method, let us consider a very simple problem involving one row (epipolar line) of a stereo pair. Assume 6-pixel-wide rows and take the specific example of [001110] and [111010] as the left and right image rows, respectively. The task is to find the best possible match between these two strings, which in this case is [1110].\n\nThe process starts by forming an \"initial match matrix\". This matrix includes all possible matches between the pixels of the two rows. Fig. 1 illustrates this matrix for the considered example. Each 1 indicates a potential match. However, only a few of these matches are correct. Thus, the main task is to distinguish the correct matches, which are starred in the figure, from the false ones. 
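The construction of the initial match matrix can be sketched in a few lines of Python (an illustrative sketch, not part of the original algorithm description; it assumes the simple convention that any pair of equal-valued pixels is a potential match):

```python
# Illustrative sketch: build the 'initial match matrix' for the worked
# example with left row [001110] and right row [111010]. Entry (i, j) is 1
# when left pixel i and right pixel j have equal values (a potential match).
def initial_match_matrix(left, right):
    return [[1 if li == rj else 0 for rj in right] for li in left]

left = [0, 0, 1, 1, 1, 0]
right = [1, 1, 1, 0, 1, 0]
M = initial_match_matrix(left, right)
# Row for left pixel 0 (value 0): matches only where the right pixel is 0.
print(M[0])  # [0, 0, 0, 1, 0, 1]
```

Distinguishing the few correct entries of M from the many false ones is then the task that the remainder of the method addresses.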
To distinguish the correct matches from the false ones, Marr and Poggio (1976) imposed two constraints on the correspondences: (1) uniqueness - that there should be a one-to-one correspondence between features in the two eyes, and (2) smoothness - that surfaces should change smoothly in depth. The first constraint means that only one element of the match matrix may have a value of 1 along each horizontal and vertical direction. The second constraint translates into a tendency for the correct matches to spread along the 45\u00b0 directions. These constraints are implemented through weighted connections between match matrix elements. The uniqueness constraint is modeled by inhibitory (negative) weights along the horizontal/vertical directions. The smoothness constraint gives rise to excitatory (positive) weights along 45\u00b0 lines. The connections from the rest of the elements receive a zero (don't care) weight. Using fixed excitatory and inhibitory constants, they progressively eliminate false correspondences by applying an iterative algorithm.\n\nThe described row-wise matching does not consider the vertical dependency of pixels in 2-d images. To account for inter-row relationships, the procedure is extended by stacking up the initial match matrices of all the rows to generate a three-dimensional \"initial match volume\", as shown in Fig. 2. Application of the two mentioned constraints extends the 2-d excitatory region described above to a 45\u00b0-oriented plane in the volume, while the inhibitory region remains on the 2-d plane of the row-wise match. Since depth changes usually happen within a locality, instead of using the complete planes, a subregion of them around each element is selected. Fig. 3 shows an example of such a neighborhood. 
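The stacking step just described can be sketched as follows (an illustrative sketch, not the authors' code; it again assumes equal-valued pixels as potential matches, and the layout volume[row, left_column, right_column] is a hypothetical choice):

```python
# Illustrative sketch: stack the per-row (epipolar) match matrices of a
# stereo pair into a 3-d 'initial match volume'.
import numpy as np

def initial_match_volume(left_img, right_img):
    n_rows, n_cols = left_img.shape
    vol = np.zeros((n_rows, n_cols, n_cols), dtype=np.uint8)
    for m in range(n_rows):
        # Row-wise match matrix: 1 wherever the two pixel values agree.
        vol[m] = (left_img[m][:, None] == right_img[m][None, :])
    return vol

rng = np.random.default_rng(0)
left = rng.integers(0, 2, size=(32, 32))
right = np.roll(left, 2, axis=1)  # crude stand-in for a 2-pixel disparity
vol = initial_match_volume(left, right)
print(vol.shape)  # (32, 32, 32)
```

For a 32x32 random-dot pair this yields the 32x32x32 volume to which the neighborhood of Fig. 3 is applied around each element.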
Note that the considered excitatory region is a circular disc portion of the 45\u00b0 plane. The choice of the radius size (three in this case) is arbitrary and can be varied. A similar iterative technique is applied to the elements of the initial match volume in order to eliminate incompatible matches and retain the good ones.\n\nThere are several serious difficulties with the Marr-Poggio algorithm. First, there is no systematic method for selection of the best values of the excitatory/inhibitory weights. These parameters are usually selected by trial and error. Moreover, a set of weights that works well for one case does not necessarily yield good results for a different pair of images. In addition, utilization of constant weights has no analogy in biological vision systems. Another drawback regards the imposition of the two previously mentioned constraints, which are based on assumptions about the form of the underlying scene. However, psychophysical evidence suggests that the stereopsis constraints are more complex and more flexible than can be characterized by simple fixed rules.\n\nThe view that we take is that the described process can be posed as a mapping operation from the space of the \"initial match volume\" to the space of the \"true match volume\". Such a transformation can be considered as a one-shot (non-iterative) mapping from the initial matches to the final ones. This is a complex non-linear relationship which is very difficult to model by conventional methods. However, a neural net can learn it and, more importantly, generalize it.\n\n2.2 NEURAL NETWORK ARCHITECTURE\n\nThe described mapping is a function of the elements in the initial match volume. 
This can be expressed as:\n\nt(x1, x2, x3) = f( i(a, b, c) | (a, b, c) \u2208 S )\n\nwhere\n\nt(x1, x2, x3) = state of the node located at coordinate (x1, x2, x3) in the true match volume,\nf = the nonlinear mapping function,\ni(a, b, c) = state of the node located at coordinate (a, b, c) in the initial match volume,\nS = a set of three-dimensional coordinates including (x1, x2, x3) and those of its neighbors in a specified neighborhood.\n\nIn such a formulation, if f is known, the task is complete. A NN is capable of learning f through examining a set of examples involving initial matches and their corresponding true matches. The learned function will be coded in a distributed manner as the learned weights of the net.\n\nNote that this approach does not impose any constraints on the solution. No a priori excitatory/inhibitory assignments are made. Only a unified concept of a neighboring region, S, which influences the disparity computation is adopted. The influence of the elements in S on the solution is learned by the NN. This means that all the appropriate constraints are automatically learned.\n\nUnlike the Marr-Poggio approach, the NN formulation allows us to consider any shape or size for the neighborhood, S. Although in the discussions of the next sections we use a Marr-Poggio type neighborhood as shown in Fig. 3, there is no restriction on this. In this work we used this S in order to be able to compare our results with those of Marr-Poggio. In a previous study (Khotanzad & Lee (1990)) we used a standard fully connected multi-layer feed-forward NN to learn f. The main problem with that net is the ad hoc selection of the number of hidden nodes. In this study, we use another layered feed-forward neural net, termed a \"sparsely connected NN with augmented inputs\", which does not suffer from this problem. 
It consists of an input layer, an output layer, and one hidden layer. The hidden-layer nodes and the output node have a sigmoid non-linear transfer function. The inputs to this net consist of the state of the considered element in the initial match volume along with the states of those in its locality, as will be described. The response of the output node is the computed state, in the true match volume, of the considered element of the initial match volume. The number of hidden nodes is decided based on the shape and size of the selected neighborhood, S, as described in the example to follow. This net is not a fully connected net; each hidden node is connected to a subset of the inputs. Thus the term \"sparsely connected\" is used.\n\nTo illustrate the suggested net, let us use the S of Fig. 3. In this case, each element in the initial match volume is affected by 24 other elements, shown by circles and crosses in the figure. Our suggested network for such an S is shown in Fig. 4. It has 625 inputs, 25 hidden nodes and one output node. Each hidden node is only connected to one set of 25 input nodes. The 625 inputs consist of 25 sets of 25 elements of the initial match volume. Let us denote these sets by I1, I2, ..., I25 respectively. The first set of 25 inputs consists of the state of the element of the initial match volume whose final state is sought, along with those of its 24 neighbors. Let us denote this node and its neighbors by s1 and S1 = {s2, s3, ..., s25} respectively. Then I1 = {s1, S1}. The second set is composed of the same type of information for neighbor s2; in other words, I2 = {s2, S2}. I3, ..., I25 are made similarly. So in general\n\nIj = {sj, Sj},   j = 2, 3, ..., 25\n\nwhere Sj is the set of the 24 neighbors of sj. Note that there is a good degree of overlap among these 625 inputs. 
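A forward pass through such a net can be sketched as follows (an illustrative sketch with arbitrary untrained weights, not the authors' implementation; the weight ranges and helper names are hypothetical):

```python
# Illustrative sketch: forward pass of a sparsely connected net with
# 25 sets of 25 augmented inputs, 25 hidden nodes (one per set), and a
# single sigmoid output node.
import math
import random

N_SETS = 25    # one hidden node per input set
SET_SIZE = 25  # each set: an element of the volume plus its 24 neighbors

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# Each hidden node owns a private bias and 25 weights: sparse connectivity,
# 25 x 25 = 625 input connections in total rather than 625 x 25.
hidden_w = [[random.uniform(-0.1, 0.1) for _ in range(SET_SIZE + 1)]
            for _ in range(N_SETS)]
output_w = [random.uniform(-0.1, 0.1) for _ in range(N_SETS + 1)]

def forward(input_sets):
    # input_sets: 25 lists of 25 binary states from the initial match volume.
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], xs)))
              for w, xs in zip(hidden_w, input_sets)]
    return sigmoid(output_w[0] + sum(wi * hi
                                     for wi, hi in zip(output_w[1:], hidden)))

y = forward([[1] * SET_SIZE for _ in range(N_SETS)])
print(0.0 < y < 1.0)  # True
```

The output y would be thresholded to give the final state of the considered element; in the paper the weights are of course found by back-propagation rather than drawn at random.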
However, these redundant inputs are processed separately in the hidden layer, as explained below. Due to the structure of this input, it is referred to as an \"augmented input\".\n\nThe hidden layer consists of 25 nodes, each of which is connected to only one of the 25 sets of inputs through weights to be learned. Thus, each node of the hidden layer processes one of the 25 input sets. The effects of processing these 25 sets are then integrated at the single output node through the connection weights between the hidden nodes and the output node. The output node then computes the corresponding final state of the considered initial match element.\n\nTraining this net is equivalent to finding proper weights for all of its connections as well as the thresholds associated with the nodes. This is carried out by the back-propagation learning algorithm (Rumelhart et al. (1986)). Again note that all the weights used in this scheme are unknown and need to be computed through the learning procedure with the training set. Thus, the concept of a priori excitatory and inhibitory labeling is not used.\n\n3 EXPERIMENTAL STUDY\n\nThe performance of the proposed neural network approach is tested on several random-dot stereograms. A random-dot stereogram consists of a pair of similar structural images filled with randomly generated black and white dots, with some regions of one of the images shifted to either the left or right relative to the other image. When viewed through a stereoscope, a human can perceive the shifted structures as either floating upward or downward according to their relative disparities. Stereograms with 50% density (i.e. half black, half white) are used. Six 32x32 stereograms with varying disparities are used to teach the network. The actual disparity maps (floating surfaces) of these are shown in Fig. 5. 
Each stereogram contains three different depth levels (disparity regions) represented by different gray levels. Therefore, six three-dimensional initial match volumes and their six corresponding true match volumes comprise the training set for the NN. Each initial match volume and its corresponding true match volume contain 32^3 input-output pairs. Since six stereograms are considered, a total of 6 x 32^3 input-output pairs are available for training.\n\nThe performance of the trained net is tested on several random-dot stereograms. Fig. 5 shows the results for the same data the net is trained with. In addition, the performance was tested on other stereograms that are different from the training set. The considered differences include: the shape of the disparity regions, the size of the image, the disparity levels, and the addition of noise to one image of the pair. These cases are not presented here due to space limitations. We can report that all of them yielded very good results.\n\nIn Fig. 5, the results obtained using the Marr-Poggio algorithm are also shown for comparison. Even though an effort was made to find the best feedback parameters for Marr-Poggio through trial and error, the NN outperformed it in all cases in terms of the number of error pixels in the resulting disparity map.\n\n4 CONCLUSION\n\nIn this paper, a neural network approach to the problem of stereopsis was discussed. A multilayer feed-forward net was developed to learn the mapping that retains the correct matches between the pixels of the epipolar lines of the stereo pair from amongst all the possible matches. The only constraint that is explicitly imposed is the \"epipolar\" constraint. All the other appropriate constraints are learned by example and coded in the net in a distributed fashion. The net learns from examples of stereo pairs and their corresponding depth maps using the back-propagation learning rule. 
Performance was tested on several random-dot stereograms and it was shown that the learning generalizes to cases outside the training set. The net performance was also found to be superior to the Marr-Poggio algorithm.\n\nAcknowledgements\n\nThis work was supported in part by DARPA under Grant MDA-903-86-C-0182.\n\nReferences\n\nDhond, U. R. & Aggarwal, J. K. (1989), \"Structure from stereo - A review,\" IEEE Trans. SMC, vol. 19, pp. 1489-1510.\nDrumheller, M. & Poggio, T. (1986), \"On parallel stereo,\" Proc. IEEE Intl. Conf. on Robotics and Automation, vol. 3, pp. 1439-1448.\nKhotanzad, A. & Lee, Y. W. (1990), \"Depth perception by a neural network,\" IEEE Midcon/90 Conf. Record, Dallas, Texas, pp. 424-427, Sept. 11-13.\nMarr, D. & Poggio, T. (1976), \"Cooperative computation of stereo disparity,\" Science, 194, pp. 283-287.\nO'Toole, A. J. (1989), \"Structure from stereo by associative learning of the constraints,\" Perception, 18, pp. 767-782.\nPoggio, T. (1984), \"Vision by man and machine,\" Scientific American, vol. 250, pp. 106-116, April.\nQian, N. & Sejnowski, T. J. (1988), \"Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation,\" in Touretzky & Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School, pp. 435-444, Morgan Kaufmann Publishers.\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986), \"Learning internal representations by error propagation,\" in D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, MIT Press.\nSun, G. Z., Chen, H. H., & Lee, Y. C. (1987), \"Learning stereopsis with neural networks,\" Proc. IEEE First Intl. Conf. on Neural Networks, San Diego, CA, pp. 345-355, June.\nZhou, Y. T. 
and Chellappa, R. (1988), \"Stereo matching using a neural network,\" Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing, ICASSP-88, New York, pp. 940-943, April 11-14.\n\nFigure 1: The initial match matrix for the considered example. 1 represents a match. Correct matches are starred.\n\nFigure 2: Schematic of the initial match volume constructed by stacking up row match matrices.\n\nFigure 3: The neighborhood structure considered in the initial match volume. If used with Marr-Poggio, circles and crosses represent excitatory and inhibitory neighbors respectively.\n\nFigure 4: The sparsely connected NN with augmented inputs when the neighborhood of Fig. 3 is used.\n\nFigure 5: The results of disparity computation for six random-dot stereograms which are used to train the NN. The Marr-Poggio results are also shown.\n", "award": [], "sourceid": 335, "authors": [{"given_name": "Alireza", "family_name": "Khotanzad", "institution": null}, {"given_name": "Ying-Wung", "family_name": "Lee", "institution": null}]}