{"title": "Efficient Pattern Recognition Using a New Transformation Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 50, "page_last": 58, "abstract": null, "full_text": "Efficient Pattern Recognition Using a \n\nNew Transformation Distance \n\nPatrice Simard \n\nYann Le Cun \n\nJohn Denker \n\nAT&T Bell Laboratories, 101 Crawford Corner Road, Holmdel, NJ 07724 \n\nAbstract \n\nMemory-based classification algorithms such as radial basis func(cid:173)\ntions or K-nearest neighbors typically rely on simple distances (Eu(cid:173)\nclidean, dot product ... ), which are not particularly meaningful on \npattern vectors. More complex, better suited distance measures are \noften expensive and rather ad-hoc (elastic matching, deformable \ntemplates). We propose a new distance measure which (a) can be \nmade locally invariant to any set of transformations of the input \nand (b) can be computed efficiently. We tested the method on \nlarge handwritten character databases provided by the Post Office \nand the NIST. Using invariances with respect to translation, rota(cid:173)\ntion, scaling, shearing and line thickness, the method consistently \noutperformed all other systems tested on the same databases. \n\n1 \n\nINTRODUCTION \n\nDistance-based classification algorithms such as radial basis functions or K-nearest \nneighbors often rely on simple distances (such as Euclidean distance, Hamming \ndistance, etc.). As a result, they suffer from a very high sensitivity to simple \ntransformations of the input patterns that should leave the classification unchanged \n(e.g. translation or scaling for 2D images). This is illustrated in Fig. 1 where an \nunlabeled image of a \"9\" must be classified by finding the closest prototype image \nout of two images representing respectively a \"9\" and a \"4\". 
According to the Euclidean distance (sum of the squares of the pixel-to-pixel differences), the \"4\" is closer, even though the \"9\" is much more similar once it has been rotated and thickened. The result is an incorrect classification. The key idea is to construct a distance measure which is invariant with respect to some chosen transformations such as translation, rotation and others. The special case of linear transformations has been well studied in statistics and is sometimes referred to as Procrustes analysis (Sibson, 1978). It has been applied to on-line character recognition (Sinden and Wilfong, 1992). \n\nFigure 1: What is a good similarity measure? According to the Euclidean distance the pattern to be classified is more similar to prototype B. A better distance measure would find that prototype A is closer because it differs mainly by a rotation and a thickness transformation, two transformations which should leave the classification invariant. \n\nThis paper considers the more general case of non-linear transformations such as geometric transformations of gray-level images. Remember that even a simple image translation corresponds to a highly non-linear transformation in the high-dimensional pixel space^1. In previous work (Simard et al., 1992b), we showed how a neural network could be trained to be invariant with respect to selected transformations of the input. We now apply similar ideas to distance-based classifiers. \nWhen a pattern P is transformed (e.g. rotated) with a transformation s that depends on one parameter a (e.g. the angle of the rotation), the set of all the transformed patterns S_P = {x | there exists a such that x = s(a, P)} is a one-dimensional curve in the vector space of the inputs (see Fig. 2). 
In certain cases, such as rotations of digitized images, this curve must be made continuous using smoothing techniques (see (Simard et al., 1992b)). When the set of transformations is parameterized by n parameters a_i (rotation, translation, scaling, etc.), S_P is a manifold of at most n dimensions. The patterns in S_P that are obtained through small transformations of P, i.e. the part of S_P that is close to P, can be approximated by a plane tangent to the manifold S_P at the point P. Small transformations of P can be obtained by adding to P a linear combination of vectors that span the tangent plane (tangent vectors). The images at the bottom of Fig. 2 were obtained by that procedure. Tangent vectors for a transformation s can easily be computed by finite difference (evaluating ∂s(a, P)/∂a); more details can be found in (Simard et al., 1992b; Simard et al., 1992a). \nAs we mentioned earlier, the Euclidean distance between two patterns P and E is in general not appropriate because it is sensitive to irrelevant transformations of P and of E. In contrast, the distance 𝒟(E, P), defined to be the minimal distance between the two manifolds S_P and S_E, is truly invariant with respect to the transformations used to generate S_P and S_E. Unfortunately, these manifolds have no analytic expression in general, and finding the distance between them is a hard optimization problem with multiple local minima. Besides, true invariance is not necessarily desirable, since a rotation of a \"6\" into a \"9\" does not preserve the correct classification. \n\n^1 If the image of a \"3\" is translated vertically upward, the middle top pixel will oscillate from black to white three times. \n\nFigure 2: Top: Small rotations of an original digitized image of the digit \"3\". Middle: Representation of the effect of the rotation in pixel space (if there were only 3 pixels). Bottom: Images obtained by moving along the tangent to the transformation curve for the same original digitized image P by adding various amounts (a) of the tangent vector (T.V.). \n\nOur approach consists of approximating the non-linear manifolds S_P and S_E by linear surfaces and computing the distance D(E, P) defined to be the minimum distance between them. This solves three problems at once: 1) linear manifolds have simple analytical expressions which can be easily computed and stored, 2) finding the minimum distance between linear manifolds is a simple least-squares problem which can be solved efficiently and, 3) this distance is locally invariant but not globally invariant. Thus the distance between a \"6\" and a slightly rotated \"6\" is small, but the distance between a \"6\" and a \"9\" is large. The different distances between P and E are represented schematically in Fig. 3. \nThe figure represents two patterns P and E in 3-dimensional space. The manifolds generated by s are represented by one-dimensional curves going through E and P respectively. The linear approximations to the manifolds are represented by lines tangent to the curves at E and P. These lines do not intersect in 3 dimensions, and the shortest distance between them (uniquely defined) is D(E, P). The distance between the two non-linear transformation curves, 𝒟(E, P), is also shown on the figure. \nAn efficient implementation of the tangent distance D(E, P) will be given in the next section. \n\nFigure 3: Illustration of the Euclidean distance and the tangent distance between P and E. 
Although the tangent distance can be applied to any kind of patterns represented as vectors, we have concentrated our efforts on applications to image recognition. Comparison of tangent distance with the best known competing method will be described. Finally we will discuss possible variations on the tangent distance and how it can be generalized to problems other than pattern recognition. \n\n2 IMPLEMENTATION \n\nIn this section we describe formally the computation of the tangent distance. Let the function s, which maps (a, u) to s(a, u), be a differentiable transformation of the input space, depending on a vector a of parameters and verifying s(0, u) = u. \nIf u is a 2-dimensional image for instance, s(a, u) could be a rotation of u by the angle a. If we are interested in all transformations of images which conserve distances (isometries), s(a, u) would be a rotation by a_r followed by a translation by a_x, a_y of the image u. In this case a = (a_r, a_x, a_y) is a vector of parameters of dimension 3. In general, a = (a_0, ..., a_{m-1}) is of dimension m. \nSince s is differentiable, the set S_u = {x | there exists a for which x = s(a, u)} is a differentiable manifold which can be approximated to the first order by a hyperplane T_u. This hyperplane is tangent to S_u at u and is generated by the columns of the matrix \n\nL_u = ∂s(a, u)/∂a |_{a=0} = [∂s(a, u)/∂a_0, ..., ∂s(a, u)/∂a_{m-1}] |_{a=0}   (1) \n\nwhich are vectors tangent to the manifold. If E and P are two patterns to be compared, the respective tangent planes T_E and T_P can be used to define a new distance D between these two patterns. The tangent distance D(E, P) between E and P is defined by \n\nD(E, P) = min_{x in T_E, y in T_P} ||x - y||^2   (2) \n\nThe equations of the tangent planes T_E and T_P are given by: \n\nE'(a_E) = E + L_E a_E   (3) \nP'(a_P) = P + L_P a_P   (4) \n\nwhere L_E and L_P are the matrices containing the tangent vectors (see Eq. 1) and the vectors a_E and a_P are the coordinates of E' and P' in the corresponding tangent planes. The quantities L_E and L_P are attributes of the patterns, so in many cases they can be precomputed and stored. \nComputing the tangent distance \n\nD(E, P) = min_{a_E, a_P} ||E'(a_E) - P'(a_P)||^2   (5) \n\namounts to solving a linear least-squares problem. The optimality condition is that the partial derivatives of D(E, P) with respect to a_P and a_E should be zero: \n\n∂D(E, P)/∂a_E = 2 (E'(a_E) - P'(a_P))^T L_E = 0   (6) \n∂D(E, P)/∂a_P = 2 (P'(a_P) - E'(a_E))^T L_P = 0   (7) \n\nSubstituting E' and P' by their expressions yields the following linear system of equations, which we must solve for a_P and a_E: \n\nL_E^T (E - P - L_P a_P + L_E a_E) = 0   (8) \nL_P^T (E - P - L_P a_P + L_E a_E) = 0   (9) \n\nThe solution of this system is \n\n(L_PE L_EE^{-1} L_E^T - L_P^T)(E - P) = (L_PE L_EE^{-1} L_EP - L_PP) a_P   (10) \n(L_EP L_PP^{-1} L_P^T - L_E^T)(E - P) = (L_EE - L_EP L_PP^{-1} L_PE) a_E   (11) \n\nwhere L_EE = L_E^T L_E, L_PE = L_P^T L_E, L_EP = L_E^T L_P and L_PP = L_P^T L_P. LU decompositions of L_EE and L_PP can be precomputed. The most expensive part in solving this system is evaluating L_EP (L_PE can be obtained by transposing L_EP). It requires m_E x m_P dot products, where m_E is the number of tangent vectors for E and m_P is the number of tangent vectors for P. Once L_EP has been computed, a_P and a_E can be computed by solving two (small) linear systems of respectively m_E and m_P equations. The tangent distance is obtained by computing ||E'(a_E) - P'(a_P)|| using the values of a_P and a_E in equations 3 and 4. If n is the length of vector E (or P), the algorithm described above requires roughly n(m_E+1)(m_P+1) + 3(m_E^3 + m_P^3) multiply-adds. Approximations to the tangent distance can be computed more efficiently. \n\n3 RESULTS \n\nBefore giving the results of handwritten digit recognition experiments, we would like to demonstrate the property of \"local invariance\" of tangent distance. 
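The least-squares problem of Section 2 can also be solved in one step with a generic routine, by stacking the two tangent bases into a single matrix instead of forming the paired normal equations (10)-(11). The sketch below is our own illustration (function and variable names are ours, and numpy's lstsq replaces the precomputed LU decompositions of the text):

```python
import numpy as np

def tangent_distance(E, P, LE, LP):
    """Tangent distance D(E, P) = min ||(E + LE aE) - (P + LP aP)||.

    E, P  : flattened patterns, shape (n,)
    LE, LP: tangent-vector matrices, shape (n, mE) and (n, mP)
    """
    # Minimise ||(E - P) + LE aE - LP aP|| over x = [aE; aP]:
    # a least-squares problem A x ~ (P - E) with A = [LE, -LP].
    A = np.hstack([LE, -LP])
    x, *_ = np.linalg.lstsq(A, P - E, rcond=None)
    residual = (E - P) + A @ x      # = E'(aE) - P'(aP) at the optimum
    return np.linalg.norm(residual)

# demo: two patterns that differ only by a small shift along a tangent vector
t = np.array([[1.0], [0.0], [0.0], [0.0]])   # single tangent vector (n=4, m=1)
P = np.array([0.0, 1.0, 2.0, 3.0])
E = P + 0.5 * t[:, 0]
d_euclid = np.linalg.norm(E - P)             # 0.5
d_tangent = tangent_distance(E, P, t, t)     # ~0: the shift is explained away
```

The stacked formulation gives the same minimum as equations (10)-(11) but is numerically safer for small m_E, m_P; the normal-equation route of the text pays off when L_EE and L_PP can be precomputed per pattern.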
A 16 by 16 pixel image similar to the \"3\" in Fig. 2 was translated by various amounts. The tangent distance (using the tangent vector corresponding to horizontal translations) and the Euclidean distance between the original image and its translated version were measured as a function of the size k (in pixels) of the translation. The result is plotted in Fig. 4. It is clear that the Euclidean distance starts increasing linearly with k, while the tangent distance remains very small for translations as large as two pixels. This indicates that, while Euclidean distance is not invariant to translation, tangent distance is locally invariant. The extent of the invariance can be increased by smoothing the original image, but significant features may be blurred away, leading to confusion errors. The figure is not symmetric for large translations because the translated image is truncated to the 16 by 16 pixel field of the original image. In the following experiments, smoothing was done by convolution with a Gaussian of standard deviation σ = 0.75. This value, which was estimated visually, turned out to be nearly optimal (but not critical). \n\nFigure 4: Euclidean and tangent distances between a 16x16 handwritten digit image and its translated version as a function of the amount of translation measured in pixels. \n\n3.1 Handwritten Digit Recognition \n\nExperiments were conducted to evaluate the performance of tangent distance for handwritten digit recognition. 
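The local-invariance behaviour plotted in Fig. 4 is easy to reproduce on a toy one-dimensional signal. The sketch below is our own miniature version, not the paper's setup: a sampled Gaussian bump plays the role of the smoothed image, the translation tangent vector is built by central finite difference on the shift parameter, and a simplified one-sided tangent distance (projection onto a single tangent vector) stands in for the full two-sided distance:

```python
import numpy as np

def bump(center, n=32, sigma=2.0):
    # smooth 1-D "image": a Gaussian bump sampled on an integer grid
    x = np.arange(n)
    return np.exp(-(x - center) ** 2 / (2 * sigma ** 2))

def one_sided_tangent_distance(E, P, t):
    # min_a ||E - (P + a t)||: project the difference onto the tangent vector
    r = E - P
    a = np.dot(t, r) / np.dot(t, t)
    return np.linalg.norm(r - a * t)

P = bump(15.0)
# translation tangent vector at P, by central finite difference on the shift
eps = 0.1
t = (bump(15.0 + eps) - bump(15.0 - eps)) / (2 * eps)

results = {}
for shift in (0.5, 1.0, 2.0):
    E = bump(15.0 + shift)
    d_euc = np.linalg.norm(E - P)
    d_tan = one_sided_tangent_distance(E, P, t)
    results[shift] = (d_euc, d_tan)
```

As in Fig. 4, the Euclidean distance grows roughly linearly with the shift while the tangent distance grows only with its square, so it stays small for sub-pixel and one-pixel translations.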
An interesting characteristic of digit images is that we can readily identify a set of local transformations which do not affect the identity of the character, while covering a large portion of the set of possible instances of the character. Seven such image transformations were identified: X and Y translations, rotation, scaling, two hyperbolic transformations (which can generate shearing and squeezing), and line thickening or thinning. The first six transformations were chosen to span the set of all possible linear coordinate transforms in the image plane (nevertheless, they correspond to highly non-linear transforms in pixel space). Additional transformations have been tried with less success. \nThe simplest possible use of tangent distance is in a nearest-neighbor classifier. A set of prototypes is selected from a training set, and stored in memory. When a test pattern is to be classified, the K nearest prototypes (in terms of tangent distance) are found, and the pattern is given the class that has the majority among the neighbors. In our applications, the size of the prototype set is in the neighborhood of 10,000. In principle, classifying a pattern would require computing 10,000 tangent distances, leading to excessive classification times, despite the efficiency of the tangent distance computation. Fortunately, two patterns that are very far apart in terms of Euclidean distance are likely to be far apart in terms of tangent distance. Therefore we can use Euclidean distance as a \"prefilter\", and eliminate prototypes that are unlikely to be among the nearest neighbors. 
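A minimal sketch of such a prefiltered classifier (our own code and naming; the per-prototype distance here is a simplified one-sided projection onto the prototype's tangent plane rather than the full two-sided tangent distance):

```python
import numpy as np

def classify(x, protos, labels, tangent_bases, k=1, n_candidates=100):
    """Euclidean prefilter, then a k-NN vote using a one-sided tangent distance."""
    d2 = np.sum((protos - x) ** 2, axis=1)        # cheap Euclidean distances
    cand = np.argsort(d2)[:n_candidates]          # keep only the closest candidates
    d_tan = []
    for i in cand:                                # expensive distance on survivors
        L = tangent_bases[i]                      # (n, m) tangent vectors of prototype i
        r = x - protos[i]
        a, *_ = np.linalg.lstsq(L, r, rcond=None)
        d_tan.append(np.linalg.norm(r - L @ a))   # residual after removing transforms
    nearest = cand[np.argsort(d_tan)[:k]]         # majority label among the k closest
    return int(np.argmax(np.bincount(labels[nearest])))

# toy demo: prototype 0 is invariant along y, prototype 1 along x
protos = np.array([[0.0, 0.0], [0.0, 10.0]])
labels = np.array([0, 1])
bases = [np.array([[0.0], [1.0]]), np.array([[1.0], [0.0]])]
x = np.array([1.0, 6.0])
pred = classify(x, protos, labels, bases, k=1, n_candidates=2)
```

In the toy demo the raw Euclidean nearest neighbor is prototype 1, but once each prototype's invariance direction is factored out, prototype 0 is far closer, so the vote flips.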
We used the following 4-step classification procedure: 1) the Euclidean distance is computed between the test pattern and all the prototypes, 2) the closest 100 prototypes are selected, 3) the tangent distance between these 100 prototypes and the test pattern is computed, and 4) the most represented label among the K closest prototypes is output. This procedure is two orders of magnitude faster than computing all 10,000 tangent distances, and yields the same performance. \n\nFigure 5: Comparison of the error rate of tangent nearest neighbors and other methods on two handwritten digit databases. \n\nUS Postal Service database: In the first experiment, the database consisted of 16 by 16 pixel size-normalized images of handwritten digits, coming from US mail envelopes. The entire training set of 9709 examples was used as the prototype set. The test set contained 2007 patterns. The best performance was obtained with the \"one nearest neighbor\" rule. The results are plotted in Fig. 5. The error rate of the method is 2.6%. Two members of our group labeled the test set by hand with an error rate of 2.5% (using one of their labelings as the truth to test the other also yielded a 2.5% error rate). This is a good indicator of the level of difficulty of this task^2. The performance of our best neural network (Le Cun et al., 1990) was 3.3%. The performance of one nearest neighbor with the Euclidean distance was 5.9%. These results show that tangent distance performs substantially better than both standard K-nearest neighbors and neural networks. \nNIST database: The second experiment was a competition organized by the National Institute of Standards and Technology. 
The object of the competition was to classify a test set of 59,000 handwritten digits, given a training set of 223,000 patterns. A total of 45 algorithms were submitted from 26 companies from 7 different countries. Since the training set was so big, a very simple procedure was used to select about 12,000 patterns as prototypes. The procedure consists of creating a new database (empty at the beginning), and classifying each pattern of the large database using the new database as a prototype set. Each time an error is made, the pattern is added to the new database. More than one pass may have to be made before the new database is stable. Since this filtering process would take too long with 223,000 prototypes, we split the large database into 22 smaller databases of 10,000 patterns each, filtered those (to about 550 patterns each) and concatenated the result, yielding a database of roughly 12,000 patterns. This procedure has many drawbacks, and in particular, it is very good at picking up mislabeled characters in the training set. To counteract this unfortunate effect, a 3-nearest-neighbors procedure was used with tangent distance. The organizers decided to collect the training set and the test set among two very different populations (census bureau workers for the training set, high-school students for the test set); we therefore report results on the official NIST test set (named the \"hard test set\"), and on a subset of the official training set, which we kept aside for test purposes (the \"easy test set\"). The results are shown in Fig. 5. The performance is much worse on the hard test set since its distribution was very different from that of the training set. \n\n^2 This is an extremely difficult test set. Procedures that achieve less than 0.5% error on other handwritten digit tasks barely achieve less than 4% on this one. 
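The filtering procedure just described can be sketched in a few lines (our own minimal version, with plain Euclidean 1-NN as the inner classifier; the paper additionally splits the 223,000 patterns into 22 chunks before filtering, and uses 3 neighbors with tangent distance at test time):

```python
import numpy as np

def condense(X, y):
    """Build a prototype set: add a pattern only when the current set misclassifies it."""
    keep = [0]                              # seed the new database with the first pattern
    changed = True
    while changed:                          # more than one pass may be needed
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep[int(np.argmin(d))]] != y[i]:
                keep.append(i)              # error made: add to the new database
                changed = True
    return keep

# toy demo: a tight cluster of class 0 and a far pair of class 1
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1]])
y = np.array([0, 0, 0, 1, 1])
kept = condense(X, y)
```

As the text notes, this kind of condensing is aggressive: redundant patterns are discarded, but any mislabeled pattern is misclassified by construction and therefore always retained.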
Out of the 25 participants who used the NIST training database, tangent distance finished first. The overall winner did not use the training set provided by NIST (he used a much larger proprietary training set), and therefore was not affected by the different distributions of the training set and the test set. \n\n4 DISCUSSION \n\nThe tangent distance algorithm described in the implementation section can be improved/adjusted in at least four different ways: 1) approximating the tangent distance for better speed, 2) modifying the tangent distance itself, 3) changing the set of transformations/tangent vectors, and 4) using the tangent distance with classification algorithms other than K-nearest neighbors, perhaps in combination, to minimize the number of prototypes. We will discuss each of these aspects in turn. \nApproximation: The distance between two hyperplanes T_E and T_P going through P and E can be approximated by computing the projection P_E(P) of P onto T_E and the projection P_P(E) of E onto T_P. The distance ||P_E(P) - P_P(E)|| can be computed in O(n(m_E + m_P)) multiply-adds and is a fairly good approximation of D(E, P). This approximation can be improved at very low cost by computing the closest points between the lines defined by (E, P_E(P)) and (P, P_P(E)). This approximation was used with no loss of performance to reduce the number of computed tangent distances from 100 to 20 (this involves an additional \"prefilter\"). In the case of images, another time-saving idea is to compute tangent distance on progressively smaller sets of progressively higher resolution images. \nChanging the distance: One may worry that the tangent planes of E and P may be parallel and be very close in a very distant region (a bad side effect of the linear approximation). This effect can be limited by imposing constraints of the form ||a_E|| < K_E and ||a_P|| < K_P. This constraint was implemented but did not yield better results. 
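The projection-based approximation described under \"Approximation\" can be sketched as follows (our own code and naming; np.linalg.lstsq is used for brevity, whereas the stated O(n(m_E + m_P)) cost assumes the small Gram matrices L^T L are precomputed per pattern):

```python
import numpy as np

def project_onto_plane(x, base, L):
    # orthogonal projection of x onto the affine plane {base + L a}
    a, *_ = np.linalg.lstsq(L, x - base, rcond=None)
    return base + L @ a

def approx_tangent_distance(E, P, LE, LP):
    # ||P_E(P) - P_P(E)||: distance between the projection of P onto the
    # tangent plane at E and the projection of E onto the tangent plane at P
    return np.linalg.norm(project_onto_plane(P, E, LE)
                          - project_onto_plane(E, P, LP))

# toy demo: T_E is the x-axis, T_P is a line parallel to the y-axis at height 1;
# the two planes are exactly distance 1 apart
E = np.zeros(3)
P = np.array([0.0, 0.0, 1.0])
LE = np.array([[1.0], [0.0], [0.0]])
LP = np.array([[0.0], [1.0], [0.0]])
d_approx = approx_tangent_distance(E, P, LE, LP)
```

When the two tangent planes are far from parallel, as in the demo, the two projections land very close to the true pair of closest points; the refinement mentioned in the text (closest points between the lines (E, P_E(P)) and (P, P_P(E))) covers the remaining cases.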
The reason is that tangent planes are mostly orthogonal in high-dimensional space, and the norms ||a_E|| and ||a_P|| are already small. \nThe tangent distance can be normalized by dividing it by the norm of the vectors. This improves the results slightly because it offsets side effects introduced by some transformations such as scaling. Indeed, if scaling is a transformation of interest, there is a potential danger of finding the minimum distance between two images after they have been scaled down to a single point. The linear approximation of the scaling transformation does not reach this extreme, but still yields a slight degradation of the performance. The error rate reported on the USPS database can be improved to 2.4% using this normalization (which was not tried on NIST). \nTangent distance can be viewed as one iteration of a Newton-type algorithm which finds the points of minimum distance on the true transformation manifolds. The vectors a_E and a_P are the coordinates of the two closest points in the respective tangent spaces, but they can also be interpreted for real (non-linear) transformations. If a component of a_E is the amount of the translation tangent vector that must be added to E to make it as close as possible to P, we can compute the true translation of image E by that many pixels. In other words, E'(a_E) and P'(a_P) are projected onto close points of S_E and S_P. This involves a resampling but can be done efficiently. Once this new image has been computed, the corresponding tangent vectors can be computed for this new image and the process can be repeated. Eventually this will converge to a local minimum of the distance between the two transformation manifolds of P and E. The tangent distance needs to be normalized for this iteration process to work. \nA priori knowledge: The a priori knowledge used for tangent vectors depends greatly on the application. 
For character recognition, thickness was one of the most important transformations, reducing the error rate from 3.3% to 2.6%. Such a transformation would be meaningless in, say, speech or face recognition. Other transformations such as local rubber-sheet deformations may be interesting for character recognition. Transformations can be known a priori or learned from the data. \nOther algorithms, reducing the number of prototypes: Tangent distance is a general method that can be applied to problems other than image recognition, with classification methods other than K-nearest neighbors. Many distance-based classification schemes could be used in conjunction with tangent distance, among them LVQ (Kohonen, 1984) and radial basis functions. Since all the operators involved in the tangent distance are differentiable, it is possible to compute the partial derivative of the tangent distance (between an object and a prototype) with respect to the tangent vectors, or with respect to the prototype. Therefore the tangent distance operators can be inserted in gradient-descent-based adaptive machines (of which LVQ and RBF are particular cases). The main advantage of learning the prototypes or the tangent vectors is that fewer prototypes may be needed to reach the same (or superior) level of performance as, say, regular K-nearest neighbors. \nIn conclusion, tangent distance can greatly improve many distance-based algorithms. We have used tangent distance in the simple K-nearest-neighbor algorithm and outperformed all existing techniques on standard classification tasks. This surprising success is probably due to the fact that a priori knowledge can be very effectively expressed in the form of tangent vectors. Fortunately, many algorithms are based on computing distances and can be adapted to express a priori knowledge in a similar fashion. 
Promising candidates include Parzen windows, learning vector quantization and radial basis functions. \n\nReferences \n\nKohonen, T. (1984). Self-Organization and Associative Memory. Springer Series in Information Sciences, volume 8. Springer-Verlag. \nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2 (NIPS*89), Denver, CO. Morgan Kaufmann. \nSibson, R. (1978). Studies in the Robustness of Multidimensional Scaling: Procrustes Statistics. J. R. Statist. Soc., 40:234-238. \nSimard, P. Y., Le Cun, Y., Denker, J., and Victorri, B. (1992a). An Efficient Method for Learning Invariances in Adaptive Classifiers. In International Conference on Pattern Recognition, volume 2, pages 651-655, The Hague, Netherlands. \nSimard, P. Y., Victorri, B., Le Cun, Y., and Denker, J. (1992b). Tangent Prop - a formalism for specifying selected invariances in an adaptive network. In Neural Information Processing Systems, volume 4, pages 895-903, San Mateo, CA. \nSinden, F. and Wilfong, G. (1992). On-line Recognition of Handwritten Symbols. Technical Report 11228-910930-02IM, AT&T Bell Laboratories. \n", "award": [], "sourceid": 656, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}]}