{"title": "A Rapid Graph-based Method for Arbitrary Transformation-Invariant Pattern Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": null, "full_text": "A Rapid Graph-based Method for Arbitrary Transformation-Invariant Pattern Classification \n\nAlessandro Sperduti \nDipartimento di Informatica \nUniversità di Pisa \nCorso Italia 40 \n56125 Pisa, ITALY \nperso@di.unipi.it \n\nDavid G. Stork \nMachine Learning and Perception Group \nRicoh California Research Center \n2882 Sand Hill Road # 115 \nMenlo Park, CA USA 94025-7022 \nstork@crc.ricoh.com \n\nAbstract \n\nWe present a graph-based method for rapid, accurate search through prototypes for transformation-invariant pattern classification. Our method has in theory the same recognition accuracy as other recent methods based on \"tangent distance\" [Simard et al., 1994], since it uses the same categorization rule. Nevertheless ours is significantly faster during classification because far fewer tangent distances need be computed. Crucial to the success of our system are 1) a novel graph architecture in which transformation constraints and geometric relationships among prototypes are encoded during learning, and 2) an improved graph search criterion, used during classification. These architectural insights are applicable to a wide range of problem domains. Here we demonstrate that on a handwriting recognition task, a basic implementation of our system requires less than half the computation of the Euclidean sorting method. \n\n1 INTRODUCTION \n\nIn recent years, the crucial issue of incorporating invariances into networks for pattern recognition has received increased attention, most especially due to the work of Simard and his colleagues. 
To a regular hierarchical backpropagation network Simard et al. [1992] added a Jacobian network, which ensured that directional derivatives were also learned. Such derivatives represented directions in feature space corresponding to the invariances of interest, such as rotation, translation, scaling and even line thinning. On small training sets for a function approximation problem, this hybrid network showed performance superior to that of a highly tuned backpropagation network taken alone; however, there was negligible improvement on large sets. In order to find a simpler method applicable to real-world problems, Simard, Le Cun & Denker [1993] later used a variation of the nearest neighbor algorithm, one incorporating \"tangent distance\" (T-distance or D_T) as the classification metric: the smallest Euclidean distance between patterns after the optimal transformation. In this way, state-of-the-art accuracy was achieved on an isolated handwritten character task, though at quite high computational complexity, owing to the inefficient search and the large number of Euclidean and tangent distances that had to be calculated. \n\nWhereas Simard, Hastie & Saeckinger [1994] have recently sought to reduce this complexity by means of pre-clustering stored prototypes, we here take a different approach, one in which a (graph) data structure formed during learning contains information about transformations and geometrical relations among prototypes. Nevertheless, it should be noted that our method can be applied to a reduced (clustered) training set such as they formed, yielding yet faster recognition. Simard [1994] recently introduced a hierarchical structure of successively lower resolution patterns, which speeds search only if a minority of patterns are classified more accurately by using the tangent metric than by other metrics. 
In contrast, our method shows significant improvement even if the majority or all of the patterns are most accurately classified using the tangent distance. \n\nOther methods seeking fast invariant classification include Wilensky and Manukian's scheme [1994]. While quite rapid during recall, it is more properly considered distortion (rather than coherent transformation) invariant. Moreover, some transformations such as line thinning cannot be naturally incorporated into their scheme. Finally, it appears as if their scheme scales poorly (compared to tangent metric methods) as the number of invariances is increased. \n\nIt seems somewhat futile to try to improve significantly upon the recognition accuracy of the tangent metric approach: for databases such as NIST isolated handwritten characters, Simard et al. [1993] reported accuracies matching that of humans! Nevertheless, there remains much that can be done to increase the computational efficiency during recall. This is the problem we address. \n\n2 TRANSFORMATION INVARIANCE \n\nIn broad overview, during learning our method constructs a labelled graph data structure in which each node represents a stored prototype (labelled by its category) as given by a training set, linked by arcs representing the T-distance between them. Search through this graph (for classification) takes advantage of the graph structure and an improved search criterion. To understand the underlying computations, we must first consider tangent space. \n\nFigure 1: Geometry of tangent space. Here, a three-dimensional feature space contains the \"current\" prototype, Pc, and the subspace consisting of all patterns obtainable by performing continuous transformations of it (shaded). Two candidate prototypes and a test pattern, T, as well as their projections onto the T-space of Pc are shown. 
The insert (above) shows the progression of search through the corresponding portion of the recognition graph. The goal is to rapidly find the prototype closest to T (in the T-distance sense), and our algorithm (guided by the minimum angle θ_j in the tangent space) finds that P2 is closer to T than are either P1 or Pc (see text). \n\nFigure 1 illustrates the geometry of tangent space and the relationships among the fundamental entities in our trained system. A labelled (\"current\") trained pattern is represented by Pc, and the (shaded) surface corresponds to patterns arising under continuous transformations of Pc. Such transformations might include rotation, translation, scaling, line thinning, etc. Following Simard et al. [1993], we approximate this surface in the vicinity of Pc by a subspace, the tangent space or T-space of Pc, which is spanned by \"tangent\" vectors, whose directions are determined by infinitesimally transforming the prototype Pc. The figure shows an orthonormal basis {TV_a, TV_b}, which helps to speed search during classification, as we shall see. A test pattern T and two other (candidate) prototypes as well as their projections onto the T-space of Pc are shown. \n\n3 THE ALGORITHMS \n\nOur overall approach includes constructing a graph (during learning), and searching it (for classification). 
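Before turning to the algorithms, the one-sided T-distance of Section 2 can be sketched numerically: tangent vectors are obtained by finite differences of small transformations, orthonormalized, and the distance is the residual of the test point off the resulting tangent plane. This is a minimal numpy illustration with hypothetical names, not the authors' implementation.

```python
import numpy as np

def tangent_basis(proto, transforms, eps=1e-3):
    """Finite-difference tangent vectors at `proto`, orthonormalized.
    Each element of `transforms` maps (pattern, small parameter) to a
    slightly transformed pattern."""
    V = np.stack([(t(proto, eps) - proto) / eps for t in transforms], axis=1)
    Q, _ = np.linalg.qr(V)      # orthonormal basis of the T-space (m x t)
    return Q

def one_sided_t_distance(proto, Q, x):
    """Distance from x to the affine tangent plane {proto + Q @ a}."""
    d = x - proto
    d_parallel = Q @ (Q.T @ d)  # component of d lying in the T-space
    return np.linalg.norm(d - d_parallel)
```

Any point lying in the tangent plane of a prototype has one-sided T-distance zero from it, even when its Euclidean distance is large; this is exactly the invariance being exploited.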
The graph is constructed by the following algorithm: \n\nGraph construction \n\nInitialize N = # patterns; k = # nearest neighbors; t = # invariant transformations \nBegin Loop For each prototype Pi (i = 1, ..., N) \n\u2022 Compute a t-dimensional orthonormal basis for the T-space of Pi \n\u2022 Compute the (\"one-sided\") T-distance of each of the N - 1 prototypes Pj (j ≠ i) using Pi's T-space \n\u2022 Represent P⊥_j (the projection of Pj onto the T-space of Pi) in the tangent orthonormal frame of Pi \n\u2022 Connect Pi to each of its k T-nearest neighbors, storing their associated normalized projections P̂⊥_j \nEnd Loop \n\nDuring classification, our algorithm permits rapid search through prototypes. Thus in Figure 1, starting at Pc we seek to find another prototype (here, P2) that is closer to the test point T. After P2 is so chosen, it becomes the current pattern, and the search is extended using its T-space. Graph search ends when the closest prototype to T is found (i.e., closest in a T-distance sense). \n\nWe let D*_T denote the current minimum tangent distance. 
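The construction loop above can be sketched as follows; this is a hypothetical reimplementation (the function name and data layout are our own, not the authors' code), linking each prototype to its k T-nearest neighbors and caching the normalized tangent-frame projections used later by the search criterion.

```python
import numpy as np

def build_graph(protos, bases, k):
    """Link each prototype to its k nearest neighbors under the one-sided
    T-distance measured in that prototype's own tangent space.

    protos: array (N, m) of stored patterns.
    bases:  list of (m, t) orthonormal tangent bases, one per prototype.
    Returns an adjacency list; entry i holds (j, normalized tangent-frame
    projection of Pj) for the k T-nearest neighbors of Pi."""
    N = len(protos)
    graph = []
    for i in range(N):
        Q = bases[i]
        scored = []
        for j in range(N):
            if j == i:
                continue
            d = protos[j] - protos[i]
            coords = Q.T @ d                  # Pj in Pi's tangent frame
            resid = d - Q @ coords            # off-plane component
            unit = coords / (np.linalg.norm(coords) + 1e-12)
            scored.append((np.linalg.norm(resid), j, unit))
        scored.sort(key=lambda r: r[0])       # one-sided T-distance order
        graph.append([(j, unit) for _, j, unit in scored[:k]])
    return graph
```

Storing the normalized projections at construction time is what lets the classification-time criterion be evaluated with simple inner products.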
Our search algorithm is: \n\nGraph search \n\nInput Test pattern T \nInitialize \n\u2022 Choose initial candidate prototype, P0 \n\u2022 Set Pc ← P0 \n\u2022 Set D*_T ← D_T(Pc, T), i.e., the T-distance of T from Pc \nDo \n\u2022 For each prototype Pj connected to Pc compute cos(θ_j) = (T⊥ · P̂⊥_j) / |T⊥|, where T⊥ is the projection of T onto the T-space of Pc and P̂⊥_j is the stored normalized projection of Pj \n\u2022 Sort these prototypes by increasing values of θ_j and put them into a candidate list \n\u2022 Pick Pj from the top of the candidate list \n\u2022 In the T-space of Pj, compute D_T(Pj, T) \nIf D_T(Pj, T) < D*_T then Pc ← Pj and D*_T ← D_T(Pj, T); otherwise mark Pj as a \"failure\" (F), and pick the next prototype from the candidate list \nUntil Candidate list empty \nReturn D*_T or the category label of the optimum prototype found \n\nFigure 2: The search through the \"2\" category graph for the T-nearest stored prototype to the test pattern is shown (N = 720 and k = 15 nearest neighbors); the T-distances along the search path are D_T = 4.91, 3.70, 3.61, 3.03, 2.94. The number of T-distance calculations is equal to the number of nodes visited plus the number of failures (marked F); i.e., in the case shown 5 + 26 = 31. The backward search step attempt is thwarted because the middle node has already been visited (marked M). Notice in the prototypes how the search is first a downward shift, then a counter-clockwise rotation: a mere four steps through the graph. \n\nFigure 2 illustrates search through a network of \"2\" prototypes. Note how the T-distance of the test pattern decreases, and that with only four steps through the graph the optimal prototype is found. \n\nThere are several ways in which our search technique can be incorporated into a classifier. One is to store all prototypes, regardless of class, in a single large graph and perform the search; the test pattern is classified by the label of the optimal prototype found. 
Another is to employ separate graphs, one for each category, and search through them (possibly in parallel); the test pattern is classified by the minimum T-distance prototype found. The choice of method depends upon hardware limitations, performance speed requirements, etc. Figure 3 illustrates such a search through a \"2\" category graph for the closest prototype to a test pattern \"5.\" We report below results using a single graph per category, however. \n\n3.1 Computational complexity \n\nIf a graph contains N prototypes with k pointers (arcs) each, and if the patterns are of dimension m, then the storage requirement is O(N((t + 1)·m² + kt)). The time complexity of training depends upon details of orthonormalization, sorting, etc., and is of little interest anyway. Construction is more than an order of magnitude faster than neural network training on similar problems; for instance, construction of a graph for N = 720 prototypes and k = 100 nearest neighbors takes less than 20 minutes on a Sparc 10. \n\nFigure 3: The search through a \"2\" category graph given a \"5\" test pattern; the T-distances along the search path are D_T = 5.10, 5.09, 5.01, 4.93, 4.90. Note how the search first tries to find a prototype that matches the upper arc of the \"5,\" and then one possessing skew or rotation. For this test pattern, the minimum T-distance found for the \"5\" category (3.62) is smaller than the one found for the \"2\" category shown here (4.22), and indeed for any other category. Thus the test pattern is correctly classified as a \"5.\" \n\nThe crucial quantity of interest is the time complexity for search. This is, of course, problem related, and depends upon the number of categories, transformations and prototypes and their statistical properties (see next Section). 
Worst-case analyses (e.g., it is theoretically conceivable that nearly all prototypes must be visited) are irrelevant to practice. \n\nWe used a slightly non-obvious search criterion at each step, the function cos(θ_j), as shown in Figure 1. Not only could this criterion be calculated very efficiently in our orthonormal basis (by using simple inner products), but it actually led to a slightly more accurate search than Euclidean distance in the T-space, perhaps the most natural choice of criterion. The angle θ_j seems to guide the \"flow\" of the search along transformation directions toward the test point. \n\n4 Simulations and results \n\nWe explored the search capabilities of our system on the binary handwritten digit database of Guyon et al. [1991]. We needed to scale all patterns by a linear factor (0.833) to ensure that rotated versions did not go outside the 16 x 16 pixel grid. As required in all T-space methods, the patterns must be continuous valued (i.e., here grayscale); this was achieved by convolution with a spatially symmetric Gaussian having σ = 0.55 pixels. We had 720 training examples in each of ten digit categories; the test set consisted of 1320 test patterns formed by transforming independent prototypes in all meaningful combinations of the t = 6 transformations (four spatial directions and two rotation senses). \n\nWe compared the Euclidean sorting method of Simard et al. [1993] to our graph-based method using the same data and transformations, over the full range of relevant computational complexities. Figure 4 summarizes our results. For our method, the computational complexity is adjusted by the number of neighbors inspected, k. For their Euclidean sorting method, it is adjusted by the percentage of Euclidean nearest neighbors that were then inspected for T-distance. We were quite careful to employ as many computational tricks and shortcuts on both methods as we could think of. Our results fairly reflect the full computational complexity, which was dominated by tangent and Euclidean distance calculations. \n\nFigure 4: Comparison of graph-based (heavy lines) and standard Euclidean sorting searches (thin lines), plotting search accuracy and average search error against computational complexity (equivalent number of T-distance calculations, 0 to 400). Search accuracy is the percentage of optimal prototypes found on the full test set of 1320 patterns in a single category (solid lines). The average search error is the per-pattern difference between the global optimum T-distance and the one actually found, averaged over the non-optimal prototypes found through the search (dashed lines). Note especially that for the same computational complexity, our method has the same average error, but that this average is taken over a much smaller number of (non-optimal) prototypes. For a given criterion search accuracy, our method requires significantly less computation. For instance, if 90% of the prototypes must be found for a requisite categorization accuracy (a typical value for asymptotically high recognition accuracy), our graph-based method requires less than half the computation of the Euclidean sorting method. \n\nWe note parenthetically that many of the recognition errors for both methods could be explained by the fact that we did not include the transformation of line thinning (solely because we lacked the preprocessing capabilities); the overall accuracy of both methods will increase when this invariance is also included. 
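The cos(θ_j)-guided graph search of Section 3 can be sketched as follows. This is a hypothetical reimplementation under our own assumptions (function names, data layout, and the omission of the paper's visited-node marking are ours): at each step the current node's neighbors are examined in order of decreasing cosine, and the search moves whenever a neighbor improves the best T-distance found so far.

```python
import numpy as np

def t_distance(proto, Q, x):
    """One-sided tangent distance from x to the affine T-space of `proto`
    (Q: orthonormal tangent basis, shape (m, t))."""
    d = x - proto
    return np.linalg.norm(d - Q @ (Q.T @ d))

def graph_search(T, protos, bases, graph, start=0):
    """Greedy walk over the prototype graph, ordering each node's neighbors
    by the angle between the tangent-frame projections of T and Pj."""
    pc = start
    d_star = t_distance(protos[pc], bases[pc], T)
    improved = True
    while improved:
        improved = False
        Q = bases[pc]
        t_proj = Q.T @ (T - protos[pc])       # T in the current tangent frame
        def angle_key(j):
            p_proj = Q.T @ (protos[j] - protos[pc])
            denom = np.linalg.norm(t_proj) * np.linalg.norm(p_proj) + 1e-12
            return -(t_proj @ p_proj) / denom  # small angle = large cosine first
        for j in sorted(graph[pc], key=angle_key):
            d = t_distance(protos[j], bases[j], T)
            if d < d_star:                     # success: move to Pj and restart
                pc, d_star, improved = j, d, True
                break
            # otherwise Pj is a "failure" (F); try the next candidate
    return pc, d_star
```

Because D*_T strictly decreases on every move, the walk terminates; on a fully connected graph it provably ends at the globally T-nearest prototype, while on a sparse k-neighbor graph it is the fast approximate search studied in the paper.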
\n\n5 CONCLUSIONS AND FUTURE WORK \n\nWe have demonstrated a graph-based method using tangent distance that permits search through prototypes significantly faster than the most popular current approach. Although not shown above, ours is also superior to other tree-based methods, such as k-d trees, which are less accurate. Since our primary concern was reducing the computational complexity of search (while matching Simard et al.'s accuracy), we have not optimized over preprocessing steps, such as the Gaussian kernel width or transformation set. We note again that our method can be applied to reduced training sets, for instance ones pruned by the method of Simard, Hastie & Saeckinger [1994]. Simard's [1994] recent method, in which low-resolution versions of training patterns are organized into a hierarchical data structure so as to reduce the number of multiply-accumulates required during search, is in some sense \"orthogonal\" to ours. Our graph-based method will work with his low-resolution images too, and thus these two methods can be unified into a hybrid system. \n\nPerhaps most importantly, our work suggests a number of research avenues. We used just a single (\"central\") prototype P0 to start search; presumably having several candidate starting points would be faster. Our general method may admit gradient descent learning of parameters of the search criterion. For instance, we can imagine scaling the different tangent basis vectors according to their relevance in guiding correct searches as determined using a validation set. Finally, our approach may admit elegant parallel implementations for real-world applications. \n\nAcknowledgements \n\nThis work was begun during a visit by Dr. Sperduti to Ricoh CRC. We thank I. Guyon for the use of her database of handwritten digits and Dr. K. V. Prasad for assistance in image processing. 
\n\nReferences \n\nI. Guyon, P. Albrecht, Y. Le Cun, J. Denker & W. Hubbard. (1991) \"Comparing different neural network architectures for classifying handwritten digits,\" Proc. of the Inter. Joint Conference on Neural Networks, vol. II, pp. 127-132, IEEE Press. \n\nP. Simard. (1994) \"Efficient computation of complex distance metrics using hierarchical filtering,\" in J. D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural Information Processing Systems 6, Morgan Kaufmann, pp. 168-175. \n\nP. Simard, B. Victorri, Y. Le Cun & J. Denker. (1992) \"Tangent Prop - A formalism for specifying selected invariances in an adaptive network,\" in J. E. Moody, S. J. Hanson and R. P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, pp. 895-903. \n\nP. Y. Simard, Y. Le Cun & J. Denker. (1993) \"Efficient Pattern Recognition Using a New Transformation Distance,\" in S. J. Hanson, J. D. Cowan and C. L. Giles (eds.), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 50-58. \n\nP. Y. Simard, T. Hastie & E. Saeckinger. (1994) \"Learning Prototype Models for Tangent Distance,\" Neural Networks for Computing, Snowbird, UT (April, 1994). \n\nG. D. Wilensky & N. Manukian. (1994) \"Nearest Neighbor Networks: New Neural Architectures for Distortion-Insensitive Image Recognition,\" Neural Networks for Computing, Snowbird, UT (April, 1994). \n", "award": [], "sourceid": 1012, "authors": [{"given_name": "Alessandro", "family_name": "Sperduti", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}]}