{"title": "Multiresolution Tangent Distance for Affine-invariant Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 843, "page_last": 849, "abstract": null, "full_text": "Multiresolution Tangent Distance for \n\nAffine-invariant Classification \n\nNuno Vasconcelos \n\nAndrew Lippman \n\nMIT Media Laboratory, 20 Ames St, E15-320M, \n\nCambridge, MA 02139, {nuno,lip }@media.mit.edu \n\nAbstract \n\nThe ability  to  rely  on similarity  metrics  invariant to  image transforma(cid:173)\ntions is  an important issue for image classification tasks such as face or \ncharacter recognition. We analyze an invariant metric that has performed \nwell for the latter - the tangent distance - and study its limitations when \napplied to regular images, showing that the most significant among these \n(convergence to local minima) can be drastically reduced by computing \nthe distance in a multiresolution setting. This leads to the multi resolution \ntangent distance,  which exhibits  significantly  higher invariance  to  im(cid:173)\nage transformations, and can be easily combined with robust estimation \nprocedures. \n\n1  Introduction \n\nImage classification algorithms often rely  on distance metrics  which are too  sensitive to \nvariations in the imaging environment or set up (e.g. the Euclidean and Hamming distances), \nor on metrics which, even though less sensitive to these variations, are application specific \nor too expensive from a computational point of view (e.g.  deformable templates). \n\nA solution to this problem, combining invariance to image transformations with computa(cid:173)\ntional simplicity and general purpose applicability was  introduced by  Simard et al  in  [7]. \nThe key idea is that, when subject to spatial transformations, images describe manifolds in a \nhigh dimensional space, and an invariant metric should measure the distance between those \nmanifolds instead of the distance between other properties of (or features extracted from) \nthe images themselves.  Because these manifolds are complex, minimizing the distance be(cid:173)\ntween them is a difficult optimization problem which can, nevertheless, be made tractable \nby considering the minimization of the distance between the tangents to the manifolds -the \ntangent distance (TO) - instead of that between the manifolds themselves. While it has led \nto impressive results for the problem of character recognition [8] , the linear approximation \ninherent to the TO is too stringent for regular images, leading to invariance over only a very \nnarrow range of transformations. \n\n\f844 \n\nN.  Vasconcelos and A.  Lippman \n\nIn  this  work  we  embed  the  distance  computation  in  a  multi resolution  framework  [3], \nleading to  the multiresolution tangent distance  (MRTD).  Multiresolution decompositions \nare common in the vision literature and have been known to  improve the  performance of \nimage registration  algorithms  by  extending  the  range  over  which  linear  approximations \nIn  particular,  the  MRTD  has  several  appealing  properties:  1)  maintains \nhold  [5,  1]. 
\nthe  general purpose nature of the TD;  2) can  be  easily  combined with robust estimation \nprocedures, exhibiting invariance to  moderate non-linear image variations (such as caused \nby  slight  variations  in  shape or occlusions);  3)  is  amenable  to  computationally efficient \nscreening  techniques where bad matches  are discarded at low resolutions;  and 4) can  be \ncombined  with  several  types  of classifiers.  Face recognition  experiments show  that  the \nMRTD  exhibits a  significantly  extended invariance to  image transformations, originating \nimprovements in recognition accuracy as high as 38%, for the hardest problems considered. \n\n2  The tangent distance \n\nConsider the  manifold described by  all  the possible  linear transformations that a  pattern \nlex) may be subject to \n\nTp  [lex)]  =  1('ljJ(x, p)), \n\n(1) \nwhere  x  are  the  spatial  coordinates  over  which  the  pattern  is  defined,  p  is  the  set  of \nparameters which define the transformation, and 'ljJ  is a function typically linear on p, but \nnot necessarily linear on x.  Given two patterns M(x) and N(x), the distance between the \nassociated manifolds - manifold distance (MD) - is \n\nT(M, N) =  min IITq[M(x)] - Tp[N(x)]W. \n\np,q \n\n(2) \n\nFor simplicity,  we consider a version of the distance in  which only one of the patterns is \nsubject to a transformation, i.e. \n\nT(M, N) = min IIM(x) - Tp[N(x)]lf, \n\np \n\n(3) \n\nbut all results can be extended to the two-sided distance.  Using the fact that \n\\7p Tp[N(x)] = \\7pN('ljJ(x, p)) = \\7p '\u00a2(x, p)\\7xN('\u00a2(x, p)), \n\n(4) \nwhere  \\7pTp  is  the  gradient of Tp  with  respect  to  p,  Tp[N(x)]  can,  for  small  p,  be \napproximated by a first order Taylor expansion around the identity transformation \n\nTp[N(x)] =  N(x) + (p - If\\7p 'ljJ(x,p)\\7x N(x). \n\nThis is equivalent to approximating the manifold by a tangent hyper-plane, and leads to the \nTD. Substituting this expression in equation 3, setting the gradient with respect to p  to zero, \nand solving for p  leads to \n\np ~ [~'VP;6(X' P ) 'Vx N(x) 'V); N(X)'V~;6(x, P)]-' ~ D(x)'Vp;6(x, P l'VxN(x) + I, \n(5) \nwhere D(x)  = M(x) - N(x).  Given  this  optimal p, the TD between the  two  patterns \nis  computed using  equations  I  and  3.  The  main  limitation  of this  formulation  is  that  it \nrelies on a first-order Taylor series approximation, which is valid only over a small range \nof variation in the parameter vector p . \n\n2.1  Manifold distance via Newton's method \n\nThe minimization of the MD of equation 3 can also be performed through Newton's method, \nwhich consists of the iteration \n\npn+1  = pn _  0:  [\\7~ T/p=pn] -I \\7p Tlp=pn \n\n(6) \n\n\fMultiresolution Tangent Distancefor Affine-invariant Classification \n\n845 \n\nwhere \\7p /  and  \\7~ /  are, respectively, the gradient and  Hessian of the cost function of \nequation 3 with respect to the parameter p, \n\n\\7p/ =  2 L [M(x) - Tp[N(x)]) V'pTp[N(x)] \n\nx \n\nV'~ /  =  2 L  [-V'pTp[N(x)] \\7~Tp[N(x)] +  [M(x) - N(x)] V'~Tp[N(x)]] . \n\nx \n\nDisregarding the term which contains second-order derivatives (V'~Tp[N(x)]), choosing \npO  = I  and  Q:  = 1,  using  4,  and  substituting  in  6  leads  to  equation  5. \nthe  TO \ncorresponds to  a single iteration of the minimization of the MD by a simplified version of \nNewton's method, where sec!ond-orderderivatives are disregarded. 
3 The multiresolution tangent distance

The iterative minimization of equation 6 suffers from two major drawbacks [2]: 1) it may require a significant number of iterations for convergence, and 2) it can easily get trapped in local minima. Both these limitations can be, at least partially, avoided by embedding the computation of the MD in a multiresolution framework, leading to the multiresolution manifold distance (MRMD). For its computation, the patterns to classify are first subject to a multiresolution decomposition, and the MD is then iteratively computed for each layer, using the estimate obtained from the layer above as a starting point,

$$p_l^{n+1} = p_l^n + \alpha \left[\sum_x \nabla_p \psi(x, p)\, \nabla_x N_l'(x)\, \nabla_x^T N_l'(x)\, \nabla_p^T \psi(x, p)\right]^{-1} \sum_x D_l(x)\, \nabla_p \psi(x, p)\, \nabla_x N_l'(x), \qquad (7)$$

where N_l'(x) = N_l(\psi(x, p_l^n)) and D_l(x) = M_l(x) - T_{p_l^n}[N_l(x)]. If only one iteration is allowed at each image resolution, the MRMD becomes the multiresolution extension of the TD, i.e. the multiresolution tangent distance (MRTD).

To illustrate the benefits of minimization over different scales, consider the signal $J(t) = \sum_{k=1}^{K} \sin(\omega_k t)$ and the manifold generated by all its possible translations, $J'(t, d) = J(t + d)$. Figure 1 depicts the multiresolution Gaussian decomposition of J(t), together with the Euclidean distance to the points on the manifold as a function of the translation associated with each of them (d). Notice that as the resolution increases, the distance function has more local minima, and the range of translations over which an initial guess is guaranteed to lead to convergence to the global minimum (at d = 0) is smaller. I.e., at higher resolutions, a better initial estimate is necessary to obtain the same performance from the minimization algorithm.

Notice also that, since the function to minimize is very smooth at the lowest resolutions, the minimization will require few iterations at these resolutions if a procedure such as Newton's method is employed. Furthermore, since the minimum at one resolution is a good guess for the minimum at the next resolution, the computational effort required to reach that minimum will also be small. Finally, since a minimum at low resolutions is based on coarse, or global, information about the function or patterns to be classified, it is likely to be the global minimum of at least a significant region of the parameter space, if not the true global minimum.

Figure 1: Top: Three scales of the multiresolution decomposition of J(t). Bottom: Euclidean distance vs. translation for each scale. Resolution decreases from left to right.
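The coarse-to-fine procedure can be sketched as follows, again for the translation-only model and assuming a Gaussian pyramid built by blurring and subsampling; with iters_per_level = 1 the loop computes the MRTD, and with more iterations per level the MRMD. All names are illustrative, and translation_step is the single Gauss-Newton update from the previous sketch, applied after warping N by the current estimate.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def gaussian_pyramid(img, levels):
    """Blur and subsample; return the stack with the coarsest level first."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr[::-1]

def translation_step(M, N, p):
    """One Gauss-Newton update of a 2-D translation p = (dx, dy)."""
    Np = shift(np.asarray(N, dtype=float), (-p[1], -p[0]), order=1)  # N'(x) = N(x + p)
    gy, gx = np.gradient(Np)
    g = np.stack([gx.ravel(), gy.ravel()], axis=1)
    d = (np.asarray(M, dtype=float) - Np).ravel()
    return np.linalg.solve(g.T @ g + 1e-9 * np.eye(2), g.T @ d)  # tiny ridge for stability

def mrmd(M, N, levels=4, iters_per_level=5):
    """Multiresolution manifold distance (translations only);
    iters_per_level=1 gives the MRTD."""
    p = np.zeros(2)                                  # start at the identity
    pyramids = zip(gaussian_pyramid(M, levels), gaussian_pyramid(N, levels))
    for l, (Ml, Nl) in enumerate(pyramids):
        if l > 0:
            p *= 2.0        # pass the estimate down: double the translations
        for _ in range(iters_per_level):
            p = p + translation_step(Ml, Nl, p)
    Np = shift(np.asarray(N, dtype=float), (-p[1], -p[0]), order=1)
    return np.sum((np.asarray(M, dtype=float) - Np) ** 2), p
```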
4 Affine-invariant classification

There are many linear transformations which can be used in equation 1. In this work, we consider manifolds generated by affine transformations,

$$\psi(x, p) = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{bmatrix} p = \Phi(x)\, p, \qquad (8)$$

where p is the vector of parameters which characterize the transformation. Taking the gradient of equation 8 with respect to p, \nabla_p \psi(x, p) = \Phi(x)^T, using equation 4, and substituting in equation 7,

$$p_l^{n+1} = p_l^n + \alpha \left[\sum_x \Phi(x)^T \nabla_x N'(x)\, \nabla_x^T N'(x)\, \Phi(x)\right]^{-1} \sum_x D'(x)\, \Phi(x)^T \nabla_x N'(x), \qquad (9)$$

where N'(x) = N(\psi(x, p_l)) and D'(x) = M(x) - N'(x). For a given level l of the multiresolution decomposition, the iterative process of equation 9 can be summarized as follows (see the sketch after this section).

1. Compute N'(x) by warping the pattern to classify, N(x), according to the best current estimate of p, and compute its spatial gradient \nabla_x N'(x).
2. Update the estimate of p_l according to equation 9.
3. Stop if converged; otherwise, go to 1.

Once the final p_l is obtained, it is passed to the multiresolution level below (by doubling the translation parameters), where it is used as the initial estimate. Given the values of p_i which minimize the MD between a pattern to classify and a set of prototypes in the database, a K-nearest neighbor classifier is used to find the pattern's class.
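A sketch of this per-level iteration in NumPy/SciPy follows. We assume the parameter ordering p = (a, b, t_x, c, d, t_y), matching the rows of \Phi(x) in equation 8; scipy.ndimage.affine_transform indexes images in (row, col) = (y, x) order, hence the axis swaps, and the small ridge added to the normal equations is our own numerical safeguard, not part of the paper's formulation.

```python
import numpy as np
from scipy.ndimage import affine_transform

def affine_refine(M, N, p=None, n_iter=10, alpha=1.0):
    """Iterative affine estimation (equation 9) at a single pyramid level.
    p = (a, b, tx, c, d, ty) parameterizes psi(x, p) = Phi(x) p."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    h, w = N.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)
    if p is None:
        p = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])   # identity transform
    for _ in range(n_iter):
        # Step 1: warp N by the current estimate, N'(x) = N(psi(x, p)).
        # affine_transform works in (row, col) order, so swap x and y.
        A_rc = np.array([[p[4], p[3]], [p[1], p[0]]])
        Np = affine_transform(N, A_rc, offset=(p[5], p[2]), order=1)
        gy, gx = np.gradient(Np)
        gx, gy = gx.ravel(), gy.ravel()
        D = (M - Np).ravel()
        # Each row of J is Phi(x)^T grad_x N'(x), a 6-vector per pixel.
        J = np.stack([x * gx, y * gx, gx, x * gy, y * gy, gy], axis=1)
        # Step 2: solve the normal equations of equation 9 (tiny ridge
        # added for numerical stability) and update p additively.
        dp = np.linalg.solve(J.T @ J + 1e-8 * np.eye(6), J.T @ D)
        p = p + alpha * dp
    return p
```

Between pyramid levels, only the translation entries (t_x, t_y) of the returned p would be doubled before being passed down, as described above.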
\n\n6.1  Affine invariance of the tangent distance \n\nStarting from a single view of a reference face,  we created an  artificial dataset composed \nby  441  affine  transformations of it.  These transformations consisted of combinations of \nall  rotations in  the range from  - 30 to  30 degrees with  increments of 3 degrees,  with  all \nscaling transformations in the range from 70% to  130% with increments of 3%.  The faces \nassociated with the extremes of the scaling/rotation space are represented on the left portion \nof figure 2. \n\nOn  the  right  of figure  2  are  the  distance  surfaces  obtained  by  measuring  the  distance \nassociated  with  several  metrics  at  each of the  points  in  the  scaling/rotation  space.  Five \nmetrics were considered in this experiment:  the Euclidean distance (ED), the TD, the MD \ncomputed through Newton's method, the MRMD, and the MRTD. \n\nWhile the TD exhibits some invariance to  rotation and scaling, this invariance is  restricted \nto  a  small  range  of the  parameter  space  and  performance  only  slightly  better  than  the \nobtained with the ED. The performance of the  MD  computed through Newton's method \nis dramatically superior, but still inferior to  those achieved with the MRTD (which is  very \nclose to zero over the entire parameter space considered in this experiment), and the MRMD. \nThe performance of the MRTD is in fact impressive given that it involves a computational \nincrease of less than 50% with respect to the TD, while each iteration of Newton's method \nrequires an  increase of 100%,  and  several  iterations are  typically  necessary  to  attain  the \nminimum MD. \n\n\fN.  Vasconcelos and A.  Lippman \n\n848 \n\n-30 \n\n!i \n~  0 \n-0 \na: \n\n0.7 \n\n1.3 \n\nScaling \n\nFigure 2:  Invariance of the tangent distance.  In the right, the surfaces shown correspond to ED, TO, \nMO through Newton's method, MRTO, and MRMO. This ordering corresponds to that of the nesting \nof the surfaces, i.e.  the ED is the cup-shaped surface in the center, while the MRMO is the flat surface \nwhich  is approximately zero everywhere. \n\n6.2  Face recognition \n\nTo evaluate the performance of the multiresolution tangent distance on a real classification \ntask,  we conducted a  series of face  recognition experiments, using  the Olivetti  Research \nLaboratories (ORL) face database.  This database is composed by 400 images of 40 subjects, \n10 images per subject,  and contains variations in  pose,  light conditions, expressions and \nfacial features, but small variability in terms of scaling, rotation, or translation.  To correct \nthis limitation we created three artificial datasets by  applying to each image three random \naffine transformations drawn from three multivariate normal distributions centered on the \nidentity  transformation  with  different  covariances.  A  small  sample  of the  faces  in  the \ndatabase  is  presented in  figure  3,  together with its  transformed version  under  the  set  of \ntransformations of higher variability. \n\nFigure 3:  Left: sample of the ORL face database. Right: transformed version. \n\nWe  next designed three experiments with increasing degree of difficulty.  In the first,  we \nselected the  first  view of each subject as  the  test  set,  using  the  remaining nine  views  as \ntraining data.  In the second, the first five faces were used as test data while the remaining \nfive  were  used for training.  
6 Experimental results

In this section, we report on experiments carried out to evaluate the performance of the MD classifier. The first set of experiments was designed to test the invariance of the TD to affine transformations of the input. The second set was designed to evaluate the improvement obtained under the multiresolution framework.

6.1 Affine invariance of the tangent distance

Starting from a single view of a reference face, we created an artificial dataset composed of 441 affine transformations of it. These transformations consisted of combinations of all rotations in the range from -30 to 30 degrees with increments of 3 degrees, with all scaling transformations in the range from 70% to 130% with increments of 3%. The faces associated with the extremes of the scaling/rotation space are represented on the left portion of figure 2.

On the right of figure 2 are the distance surfaces obtained by measuring the distance associated with several metrics at each of the points in the scaling/rotation space. Five metrics were considered in this experiment: the Euclidean distance (ED), the TD, the MD computed through Newton's method, the MRMD, and the MRTD.

While the TD exhibits some invariance to rotation and scaling, this invariance is restricted to a small range of the parameter space, and its performance is only slightly better than that obtained with the ED. The performance of the MD computed through Newton's method is dramatically superior, but still inferior to those achieved with the MRTD (which is very close to zero over the entire parameter space considered in this experiment) and the MRMD. The performance of the MRTD is in fact impressive given that it involves a computational increase of less than 50% with respect to the TD, while each iteration of Newton's method requires an increase of 100%, and several iterations are typically necessary to attain the minimum MD.

Figure 2: Invariance of the tangent distance. On the right, the surfaces shown correspond to the ED, TD, MD through Newton's method, MRTD, and MRMD. This ordering corresponds to that of the nesting of the surfaces, i.e. the ED is the cup-shaped surface in the center, while the MRMD is the flat surface which is approximately zero everywhere.

6.2 Face recognition

To evaluate the performance of the multiresolution tangent distance on a real classification task, we conducted a series of face recognition experiments using the Olivetti Research Laboratories (ORL) face database. This database is composed of 400 images of 40 subjects, 10 images per subject, and contains variations in pose, lighting conditions, expressions and facial features, but small variability in terms of scaling, rotation, or translation. To correct this limitation we created three artificial datasets by applying to each image three random affine transformations drawn from three multivariate normal distributions centered on the identity transformation with different covariances. A small sample of the faces in the database is presented in figure 3, together with its transformed version under the set of transformations of higher variability.

Figure 3: Left: sample of the ORL face database. Right: transformed version.

We next designed three experiments with increasing degree of difficulty. In the first, we selected the first view of each subject as the test set, using the remaining nine views as training data. In the second, the first five faces were used as test data while the remaining five were used for training. Finally, in the third experiment, we reversed the roles of the datasets used in the first. The recognition accuracy for each of these experiments and each of the datasets is reported in figure 4 for the ED, the TD, the MRTD, and a robust version of this distance (RMRTD) with $\rho(x) = \frac{1}{2}x^2$ if $|x| \le \sigma T$ and $\rho(x) = \frac{1}{2}(\sigma T)^2$ otherwise, where T is a threshold (set to 2.0 in our experiments), and \sigma a robust version of the error standard deviation defined as $\sigma = \mathrm{median}\,|e_i - \mathrm{median}(e_i)| / 0.6745$.

Several conclusions can be taken from this figure. First, it can be seen that the MRTD provides a significantly higher invariance to linear transformations than the ED or the TD, increasing the recognition accuracy by as much as 37.8% on the hardest datasets. In fact, for the easier tasks of experiments one and two, the performance of the multiresolution classifier is almost constant and always above the level of 90% accuracy. It is only for the harder experiment that the invariance of the MRTD classifier starts to break down. But even in this case, the degradation is graceful - the recognition accuracy only drops below 75% for considerable values of rotation and scaling (dataset D3).

On the other hand, the ED and the single-resolution TD break down even for the easier tasks, and fail dramatically when the hardest task is performed on the more difficult datasets. Furthermore, their performance does not degrade gracefully; they seem to be more invariant when the training set has five views than when it is composed of nine faces of each subject in the database.

Figure 4: Recognition accuracy.
From left to right: results from the first, second, and third experiments. Datasets are ordered by degree of variability: D0 is the ORL database; D3 is subject to the affine transformations of greatest amplitude.

Acknowledgments

We would like to thank Federico Girosi for first bringing the tangent distance to our attention, and for several stimulating discussions on the topic.

References

[1] P. Anandan, J. Bergen, K. Hanna, and R. Hingorani. Hierarchical Model-Based Motion Estimation. In M. Sezan and R. Lagendijk, editors, Motion Analysis and Image Sequence Processing, chapter 1. Kluwer Academic Press, 1993.

[2] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.

[3] P. Burt and E. Adelson. The Laplacian Pyramid as a Compact Image Code. IEEE Trans. on Communications, 31:532-540, 1983.

[4] P. Huber. Robust Statistics. John Wiley, 1981.

[5] B. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proc. DARPA Image Understanding Workshop, 1981.

[6] P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. John Wiley, 1987.

[7] P. Simard, Y. Le Cun, and J. Denker. Efficient Pattern Recognition Using a New Transformation Distance. In Proc. Neural Information Processing Systems, Denver, USA, 1994.

[8] P. Simard, Y. Le Cun, and J. Denker. Memory-based Character Recognition Using a Transformation Invariant Metric. In Int. Conference on Pattern Recognition, Jerusalem, Israel, 1994.