{"title": "Adaptive Elastic Models for Hand-Printed Character Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 512, "page_last": 519, "abstract": null, "full_text": "Adaptive Elastic Models for Hand-Printed Character Recognition \n\nGeoffrey E. Hinton, Christopher K. I. Williams and Michael D. Revow \n\nDepartment of Computer Science, University of Toronto \n\nToronto, Ontario, Canada M5S 1A4 \n\nAbstract \n\nHand-printed digits can be modeled as splines that are governed by about 8 control points. For each known digit, the control points have preferred \"home\" locations, and deformations of the digit are generated by moving the control points away from their home locations. Images of digits can be produced by placing Gaussian ink generators uniformly along the spline. Real images can be recognized by finding the digit model most likely to have generated the data. For each digit model we use an elastic matching algorithm to minimize an energy function that includes both the deformation energy of the digit model and the log probability that the model would generate the inked pixels in the image. The model with the lowest total energy wins. If a uniform noise process is included in the model of image generation, some of the inked pixels can be rejected as noise as a digit model is fitting a poorly segmented image. The digit models learn by modifying the home locations of the control points. \n\n1 Introduction \n\nGiven good bottom-up segmentation and normalization, feedforward neural networks are an efficient way to recognize digits in zip codes (le Cun et al., 1990). However, in some cases, 
it is not possible to correctly segment and normalize the digits without using knowledge of their shapes, so to achieve close to human performance on images of whole zip codes it will be necessary to use models of shapes to influence the segmentation and normalization of the digits. One way of doing this is to use a large cooperative network that simultaneously segments, normalizes and recognizes all of the digits in a zip code. A first step in this direction is to take a poorly segmented image of a single digit and to explain the image properly in terms of an appropriately normalized, deformed digit model plus noise. The ability of the model to reject some parts of the image as noise is the first step towards model-driven segmentation. \n\n2 Elastic models \n\nOne technique for recognizing a digit is to perform an elastic match with many different exemplars of each known digit-class and to pick the class of the nearest neighbor. Unfortunately this requires a large number of elastic matches, each of which is expensive. By using one elastic model to capture all the variations of a given digit we greatly reduce the number of elastic matches required. Burr (1981a, 1981b) has investigated several types of elastic model and elastic matching procedure. We describe a different kind of elastic model that is based on splines. Each elastic model contains parameters that define an ideal shape and also define a deformation energy for departures from this ideal. These parameters are initially set by hand but can be improved by learning. They are an efficient way to represent the many possible instances of a given digit. \n\nEach digit is modelled by a deformable spline whose shape is determined by the positions of 8 control points. 
Every point on the spline is a weighted average of four control points, with the weighting coefficients changing smoothly as we move along the spline.[1] To generate an ideal example of a digit we put the 8 control points at their home locations for that model. To deform the digit we move the control points away from their home locations. Currently we assume that, for each model, the control points have independent, radial Gaussian distributions about their home locations. So the negative log probability of a deformation (its energy) is proportional to the sum of the squares of the departures of the control points from their home locations. \n\nThe deformation energy function only penalizes shape deformations. Translation, rotation, dilation, elongation, and shear do not change the shape of an object, so we want the deformation energy to be invariant under these affine transformations. We achieve this by giving each model its own \"object-based frame\". Its deformation energy is computed relative to this frame, not in image coordinates. When we fit the model to data, we repeatedly recompute the best affine transformation between the object-based frame and the image (see section 4). The repeated recomputation of the affine transform during the model fit means that the shape of the digit is influencing the normalization. \n\nAlthough we will use our digit models for recognizing images, it helps to start by considering how we would use them for generating images. The generative model is an elaboration of the probabilistic interpretation of the elastic net given by Durbin, Szeliski & Yuille (1989). Given a particular spline, we space a number of \"beads\" uniformly along the spline. Each bead defines the center of a Gaussian ink generator. The number of beads on the spline and the variance of the ink generators can easily be changed without changing the spline itself. 
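As a concrete illustration of the generative model just described, the following sketch (our own illustrative code, not the authors'; names such as `bspline_point` and `generate_image` are hypothetical) evaluates a uniform cubic B-spline with doubled end control points, spaces beads uniformly in parameter along it, and samples ink points from the Gaussian beads; the uniform noise process is omitted.

```python
import numpy as np

def bspline_point(ctrl, t):
    # Evaluate a uniform cubic B-spline at parameter t in [0, 1].
    # The first and last control points are doubled, as in the paper,
    # so every curve point is a blend of four control points.
    P = np.vstack([ctrl[0], ctrl, ctrl[-1]])
    n = len(P) - 3                      # number of spline segments
    u = t * n
    i = min(int(u), n - 1)
    x = u - i
    # uniform cubic B-spline blending coefficients
    b = np.array([(1 - x)**3,
                  3*x**3 - 6*x**2 + 4,
                  -3*x**3 + 3*x**2 + 3*x + 1,
                  x**3]) / 6.0
    return b @ P[i:i + 4]

def generate_image(ctrl, n_beads=8, sigma=0.1, n_ink=200, seed=None):
    # Space beads uniformly (in parameter) along the spline and sample
    # ink points from the Gaussian generator centred on each bead.
    rng = np.random.default_rng(seed)
    beads = np.array([bspline_point(ctrl, t)
                      for t in np.linspace(0.0, 1.0, n_beads)])
    which = rng.integers(0, n_beads, size=n_ink)
    return beads[which] + rng.normal(0.0, sigma, size=(n_ink, 2))

# eight hypothetical control points tracing a wavy stroke
ctrl = np.array([[0., 0.], [0., 1.], [1., 1.], [1., 0.],
                 [2., 0.], [2., 1.], [3., 1.], [3., 0.]])
ink = generate_image(ctrl, seed=0)
```

Note that with doubled (rather than tripled) end points the curve starts near, but not exactly at, the first control point.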
\n\nTo generate a noisy image of a particular digit class, run the following procedure: \n\n\u2022 Pick an affine transformation from the model's intrinsic reference frame to the image frame (i.e. pick a size, position, orientation, slant and elongation for the digit). \n\n\u2022 Pick a deformation of the model (i.e. move the control points away from their home locations). The probability of picking a deformation is proportional to e^(-E_deform). \n\n\u2022 Repeat many times: \n\nEither (with probability \u03c0_noise) add a randomly positioned noise pixel \nOr pick a bead at random and generate a pixel from the Gaussian distribution defined by the bead. \n\n[1] In computing the weighting coefficients we use a cubic B-spline and we treat the first and last control points as if they were doubled. \n\n3 Recognizing isolated digits \n\nWe recognize an image by finding which model is most likely to have generated it. Each possible model is fitted to the image and the one that has the lowest cost fit is the winner. The cost of a fit is the negative log probability of generating the image given the model: \n\n-log \u222b_{I \u2208 model instances} P(I) P(image | I) dI (1) \n\nWe can approximate this by just considering the best fitting model instance[2] and ignoring the fact that the model should not generate ink where there is no ink in the image:[3] \n\nE = \u03bb E_deform - \u03a3_{inked pixels} log P(pixel | best model instance) (2) \n\nThe probability of an inked pixel is the sum of the probabilities of all the possible ways of generating it from the mixture of Gaussian beads or the uniform noise field: \n\nP(i) = \u03c0_noise / N + (\u03c0_model / B) \u03a3_b P_b(i) (3) \n\nwhere N is the total number of pixels, B is the number of beads, \u03c0 is a mixing proportion, and P_b(i) is the probability density of pixel i under Gaussian bead b. 
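The mixture probability of equation (3) can be sketched numerically as follows (a minimal sketch under our own naming assumptions; `pixel_logprob` is not from the paper):

```python
import numpy as np

def pixel_logprob(pixels, beads, sigma, pi_noise, n_total_pixels):
    # Log probability of each inked pixel under the bead-mixture-plus-
    # uniform-noise model: P(i) = pi_noise/N + (pi_model/B) * sum_b P_b(i)
    # pixels: (M, 2) inked pixel coordinates
    # beads:  (B, 2) Gaussian bead centres with shared variance sigma**2
    pi_model = 1.0 - pi_noise
    B = len(beads)
    # squared distances from every pixel to every bead, shape (M, B)
    d2 = ((pixels[:, None, :] - beads[None, :, :]) ** 2).sum(-1)
    # isotropic 2-D Gaussian density of each pixel under each bead
    pb = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    p = pi_noise / n_total_pixels + (pi_model / B) * pb.sum(axis=1)
    return np.log(p)
```

A pixel far from every bead falls back on the uniform noise term, which is what allows poorly explained ink to be written off as noise rather than dragging the model towards it.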
\n\n4 The search procedure for fitting a model to an image \n\nEvery Gaussian bead in a model has the same variance. When fitting data, we start with a big variance and gradually reduce it as in the elastic net algorithm of Durbin and Willshaw (1987). Each iteration of the elastic matching algorithm involves three steps: \n\n\u2022 Given the current locations of the Gaussians, compute the responsibility that each Gaussian has for each inked pixel. This is just the probability of generating the pixel from that Gaussian, normalized by the total probability of generating the pixel. \n\n\u2022 Assuming that the responsibilities remain fixed, as in the EM algorithm of Dempster, Laird and Rubin (1977), we invert a 16 x 16 matrix to find the image locations for the 8 control points at which the forces pulling the control points towards their home locations are balanced by the forces exerted on the control points by the inked pixels. These forces come via the forces that the inked pixels exert on the Gaussian beads. \n\n\u2022 Given the new image locations of the control points, we recompute the affine transformation from the object-based frame to the image frame. We choose the affine transformation that minimizes the sum of the squared distances, in object-based coordinates, between the control points and their home locations. The residual squared differences determine the deformation energy. \n\n[2] In effect, we are assuming that the integral in equation 1 can be approximated by the height of the highest peak, and so we are ignoring variations between models in the width of the peak or the number of peaks. \n\n[3] If the inked pixels are rare, poor models sin mainly by not inking those pixels that should be inked rather than by inking those pixels that should not be inked. \n\nSome stages in the fitting of a model to data are shown in Fig. 1. 
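The first of the three steps above, computing bead responsibilities, might look like this (our illustration only; the 16 x 16 linear solve for the control points and the affine re-fit are omitted):

```python
import numpy as np

def responsibilities(pixels, beads, sigma, pi_noise, n_total_pixels):
    # E-step of the elastic match: the responsibility of each Gaussian
    # bead (and of the uniform noise process) for each inked pixel, i.e.
    # its share of the total probability of generating that pixel.
    # Returns (M, B) bead responsibilities and (M,) noise responsibilities.
    pi_model = 1.0 - pi_noise
    B = len(beads)
    d2 = ((pixels[:, None, :] - beads[None, :, :]) ** 2).sum(-1)
    pb = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    joint = (pi_model / B) * pb                        # (M, B)
    noise = np.full(len(pixels), pi_noise / n_total_pixels)
    total = joint.sum(axis=1) + noise                  # (M,)
    return joint / total[:, None], noise / total
```

Each row of bead responsibilities plus its noise responsibility sums to one, which is what lets the noise process claim pixels during fitting.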
This search technique avoids nearly all local minima when fitting models to isolated digits. But if we get a high deformation energy in the best fitting model, we can try alternative starting configurations for the models. \n\n5 Learning the digit models \n\nWe can do discriminative learning by adjusting the home positions and variances of the control points to minimize the objective function \n\nC = - \u03a3_{training cases} log p(correct digit), where p(correct digit) = e^(-E_correct) / \u03a3_{all digits} e^(-E_digit) (4) \n\nFor a model parameter such as the x coordinate of the home location of one of the control points we need \u2202C/\u2202x in order to do gradient descent learning. Equation 4 allows us to express \u2202C/\u2202x in terms of \u2202E/\u2202x, but there is a subtle problem: Changing a parameter of an elastic model causes a simple change in the energy of the configuration that the model previously settled to, but the model no longer settles to that configuration. So it appears that we need to consider how the energy is affected by the change in the configuration. Fortunately, derivatives are simple at an energy minimum because small changes in the configuration make no change in the energy (to first order). Thus the inner loop settling leads to simple derivatives for the outer loop learning, as in the Boltzmann machine (Hinton, 1989). \n\n6 Results on the hand-filtered dataset \n\nWe are trying the scheme out on a relatively simple task - we have a model of a two and a model of a three, and we want the two model to win on \"two\" images, and the three model to win on \"three\" images. \n\nWe have tried many variations of the character models, the preprocessing, the initial affine transformations of the models, the annealing schedule for the variances, the mixing proportion of the noise, and the relative importance of deformation energy versus data-fit energy. \n\nFigure 1: The sequence (a) to (d) shows some stages of fitting a model 3 to some data. The grey circles represent the beads on the spline, and the radius of the circle represents the standard deviation of the Gaussian. (a) shows the initial configuration, with eight beads equally spaced along the spline. In (b) and (c) the variance is progressively decreased and the number of beads is increased. The final fit using 60 beads is shown in (d). We use about three iterations at each of five variances on our \"annealing schedule\". In this example, we used \u03c0_noise = 0.3, which makes it cheaper to explain the extraneous noise pixels and the flourishes on the ends of the 3 as noise rather than deforming the model to bring Gaussian beads close to these pixels. \n\nOur current best performance is 10 errors (1.6%) on a test set of 304 two's and 304 three's. We reject cases if the best-fitting model is highly deformed, but on this test set the deformation energy never reached the rejection criterion. The training set has 418 cases, and we have a validation set of 200 cases to tell us when to stop learning. Figure 2 shows the effect of learning on the models. The initial affine transform is defined by the minimal vertical rectangle around the data. \n\nFigure 2: The two and three models before and after learning. The control points are labelled 1 through 8. We used maximum likelihood learning in which each digit model is trained only on instances of that digit. 
After each pass through all those instances, the home location of each control point (in the object-based frame) is redefined to be the average location of the control point in the final fits of the model of the digit to the instances of the digit. Most of the improvement in performance occurred after the first pass, and after five updates of the home locations of the control points, performance on the validation set started to decrease. Similar results were obtained with discriminative training. We could also update the variance of each control point to be its variance in the final fits, though we did not adapt the variances in this simulation. \n\nThe images are preprocessed to eliminate variations due to stroke-width and paper and ink intensities. First, we use a standard local thresholding algorithm to make a binary decision for each pixel. Then we pick out the five largest connected components (hopefully digits). We put a box around each component, then thin all the data in the box. If we ourselves cannot recognize the resulting image we eliminate it from the data set. The training, validation and test data are all from the training portion of the United States Postal Service Handwritten ZIP Code Database (1987), which was made available by the USPS Office of Advanced Technology. \n\n7 Discussion \n\nBefore we tried using splines to model digits, we used models that consisted of a fixed number of Gaussian beads with elastic energy constraints operating between neighboring beads. To constrain the curvature we used energy terms that involved triples of beads. With this type of energy function, we had great difficulty using a single model to capture topologically different instances of a digit. For example, when the central loop of a 3 changes to a cusp and then to an open bend, the sign of the curvature reverses. 
With a spline model it is easy to model these topological variants by small changes in the relative vertical locations of the central two control points (see figure 2). This advantage of spline models is pointed out by Edelman, Ullman and Flash (1990), who use a different kind of spline that they fit to character data by directly locating candidate knot points in the image. \n\nSpline models also make it easy to increase the number of Gaussian beads as their variance is decreased. This coarse-to-fine strategy is much more efficient than using a large number of beads at all variances, but it is much harder to implement if the deformation energy explicitly depends on particular bead locations, since changing the number of beads then requires a new function for the deformation energy. \n\nIn determining where on the spline to place the Gaussian beads, we initially used a fixed set of blending coefficients for each bead. These coefficients are the weights used to specify the bead location as a weighted center of gravity of the locations of 4 control points. Unfortunately this yields too few beads in portions of a digit, such as the long tail of a 2, which are governed by just a few control points. Performance was much improved by spacing the beads uniformly along the curve. \n\nBy using spline models, we build in a lot of prior knowledge about what characters look like, so we can describe the shape of a character using only a small number of parameters (16 coordinates and 8 variances). This means that the learning is exploring a much smaller space than a conventional feed-forward network. Also, because the parameters are easy to interpret, we can start with fairly good initial models of the characters. So learning only requires a few updates of the parameters. 
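Spacing beads uniformly along the curve, as described above, can be approximated by sampling the spline densely, accumulating chord lengths, and inverting the resulting arc-length function; a sketch under these assumptions (the name `uniform_arclength_params` is ours):

```python
import numpy as np

def uniform_arclength_params(curve_fn, n_beads, n_dense=400):
    # Sample the curve densely, accumulate arc length, then invert the
    # arc-length function to get parameter values at equal spacing.
    # Avoids placing too few beads on long stretches (such as the tail
    # of a 2) that are governed by only a few control points.
    t = np.linspace(0.0, 1.0, n_dense)
    pts = np.array([curve_fn(ti) for ti in t])
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative length
    targets = np.linspace(0.0, s[-1], n_beads)
    return np.interp(targets, s, t)

# A curve whose speed varies with t: uniform parameter spacing would
# crowd the beads at one end, uniform arc-length spacing does not.
ts = uniform_arclength_params(lambda t: np.array([t**2, 0.0]), 5)
```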
\n\nObvious extensions of the deformation energy function include using elliptical Gaussians for the distributions of the control points, or using full covariance matrices for neighboring pairs of control points. Another obvious modification is to use elliptical rather than circular Gaussians for the beads. If strokes curve gently relative to their thickness, the distribution of ink can be modelled much better using elliptical Gaussians. However, an ellipse takes about twice as many operations to fit and is not helpful in regions of sharp curvature. Our simulations suggest that, on average, two circular beads are more flexible than one elliptical bead. \n\nCurrently we do not impose any penalty on extremely sheared or elongated affine transformations, though this would probably improve performance. Having an explicit representation of the affine transformation of each digit should prove very helpful for recognizing multiple digits, since it will allow us to impose a penalty on differences in the affine transformations of neighboring digits. \n\nPresegmented images of single digits contain many different kinds of noise that cannot be eliminated by simple bottom-up operations. These include descenders, underlines, and bits of other digits; corrections; dirt in recycled paper; smudges and misplaced postal franks. To really understand the image we probably need to model a wide variety of structured noise. We are currently experimenting with one simple way of incorporating noise models. After each digit model has been used to segment a noisy image into one digit instance plus noise, we try to fit more complicated noise models to the residual noise. A good fit greatly decreases the cost of that noise and hence improves this interpretation of the image. 
We intend to handle flourishes on the ends of characters in this way rather than using more elaborate digit models that include optional flourishes. \n\nOne of our main motivations in developing elastic models is the belief that a strong prior model should make learning easier, should reduce confident errors, and should allow top-down segmentation. Although we have shown that elastic spline models can be quite effective, we have not yet demonstrated that they are superior to feedforward nets, and there is a serious weakness of our approach: Elastic matching is slow. Fitting the models to the data takes much more computation than a feedforward net. So in the same number of cycles, a feedforward net can try many alternative bottom-up segmentations and normalizations and select the overall segmentation that leads to the most recognizable digit string. \n\nAcknowledgements \n\nThis research was funded by Apple and by the Ontario Information Technology Research Centre. We thank Allan Jepson and Richard Durbin for suggesting spline models. \n\nReferences \n\nBurr, D. J. (1981a). A dynamic model for image registration. Comput. Graphics Image Process., 15:102-112. \n\nBurr, D. J. (1981b). Elastic matching of line drawings. IEEE Trans. Pattern Analysis and Machine Intelligence, 3(6):708-713. \n\nDempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Proc. Roy. Stat. Soc., B-39:1-38. \n\nDurbin, R., Szeliski, R., and Yuille, A. L. (1989). An analysis of the elastic net approach to the travelling salesman problem. Neural Computation, 1:348-358. \n\nDurbin, R. and Willshaw, D. (1987). An analogue approach to the travelling salesman problem. Nature, 326:689-691. \n\nEdelman, S., Ullman, S., and Flash, T. (1990). Reading cursive handwriting by alignment of letter prototypes. Internat. Journal of Comput. Vision, 5(3):303-331. \n\nHinton, G. E. 
(1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1:143-150. \n\nle Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, pages 396-404. Morgan Kaufmann. \n", "award": [], "sourceid": 533, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Michael", "family_name": "Revow", "institution": null}]}