{"title": "The Canonical Distortion Measure in Feature Space and 1-NN Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 245, "page_last": 251, "abstract": null, "full_text": "The Canonical Distortion Measure in Feature \n\nSpace and I-NN Classification \n\nJonathan Baxter*and Peter Bartlett \nDepartment of Systems Engineering \n\nAustralian National University \n\nCanberra 0200, Australia \n\n{jon,bartlett}@syseng.anu.edu.au \n\nAbstract \n\nWe  prove  that  the  Canonical  Distortion Measure  (CDM)  [2,  3]  is  the \noptimal distance measure to use for  I  nearest-neighbour (l-NN) classifi(cid:173)\ncation, and show that it reduces to squared Euclidean distance in feature \nspace  for function classes that can be expressed  as  linear combinations \nof a  fixed  set  of features.  PAC-like  bounds  are  given  on the  sample(cid:173)\ncomplexity  required to  learn the CDM.  An experiment  is presented  in \nwhich  a  neural network CDM was  learnt for a Japanese  OCR environ(cid:173)\nment and then used to do  I-NN classification. \n\n1 \n\nINTRODUCTION \n\nLet X  be an input space,  P  a distribution on X, F  a class of functions mapping X  into Y \n(called the \"environment\"), Q a distribution on F  and (J'  a function (J':  Y  X  Y  -t [0 , .\"1]. \nThe Canonical Distortion Measure (CDM) between two inputs x, Xl  is defined to be: \n\np(x, Xl)  = L (J'(f(x) , f(x l)) dQ(f). \n\n(1) \n\nThroughout this paper we  will be considering real-valued functions  and  squared loss,  so \nY  =  ~ and  (J'(y,  yl)  :=  (y  - yl)2.  The  CDM  was  introduced  in  [2,  3],  where  it  was \nanalysed  primarily from  a  vector  quantization perspective.  In  particular,  the  CDM  was \nproved to be the optimal distortion measure  to use in vector quantization, in the sense of \nproducing the  best  approximations to  the  functions  in  the  environment F.  In  [3]  some \nexperimental results were also presented (in a toy domain) showing how the CDM may be \nlearnt. \n\nThe purpose of this paper is to  investigate the utility of the CDM as  a classification tool. \nIn Section 2 we show how the CDM for a class of functions possessing a common feature \n\n*The first author was supported in part by EPSRC grants #K70366 and #K70373 \n\n\f246 \n\n1.  Baxter and P.  Bartlett \n\nset reduces,  via a change of variables,  to  squared Euclidean distance in feature  space.  A \nlemma is then given showing that the  CDM is the optimal distance measure to use for  1-\nnearest-neighbour (l-NN) classification.  Thus, for functions possessing a common feature \nset, optimall-NN classification is achieved by using squared Euclidean distance in feature \nspace. \n\nIn general the CDM will be unknown, so in Section 4 we present a technique for learning \nthe  CDM by minimizing squared loss,  and  give  PAC-like bounds on the  sample-size  re(cid:173)\nquired for good generalisation. In Section 5 we present some experimental results in which \na  set of features  was  learnt  for  a  machine-printed Japanese  OCR environment,  and  then \nsquared Euclidean distance was used to do  I-NN classification in feature space.  The exper(cid:173)\niments provide strong empirical support for the theoretical results in a difficult real-world \napplication. \n\n2  THE CDM IN FEATURE SPACE \n\nSuppose each f  E  F  can be  expressed as  a  linear combination of a fixed  set  of features \n~ :=  (\u00a2l, ... , \u00a2k).  
3 1-NN CLASSIFICATION AND THE CDM

Suppose the environment F consists of classifiers, i.e. {0, 1}-valued functions. Let f be some function in F and z := (x_1, f(x_1)), ..., (x_n, f(x_n)) a training set of examples of f. In 1-NN classification the classification of a novel x is computed as f(x^*), where x^* = argmin_{x_i} d(x, x_i), i.e. the classification of x is the classification of the nearest training point to x under some distance measure d. If both f and x are chosen at random, the expected misclassification error of the 1-NN scheme using d and the training points x := (x_1, ..., x_n) is

    er(x, d) := E_F E_X [f(x) - f(x^*)]^2,    (3)

where x^* is the nearest neighbour to x from {x_1, ..., x_n}. The following lemma is now immediate from the definitions.

Lemma 1. For all sequences x = (x_1, ..., x_n), er(x, d) is minimized if d is the CDM \rho.

Remarks. Lemma 1 combined with the results of the last section shows that for function classes possessing a common feature set, optimal 1-NN classification is achieved by using squared Euclidean distance in feature space. In Section 5 some experimental results on Japanese OCR are presented supporting this conclusion.

The optimality of the CDM for 1-NN classification may not be stable to small perturbations. That is, if we learn an approximation g to \rho, then even if E_{X \times X} (g(x, x') - \rho(x, x'))^2 is small, it may not be the case that the 1-NN classification error using g is also small. However, one can show that stability is maintained for classifier environments in which positive examples of different functions do not overlap significantly (as is the case for the Japanese OCR environment of Section 5, face recognition environments, speech recognition environments and so on). We are currently investigating the general conditions under which stability is maintained.
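Lemma 1 says that among all distance measures d, the CDM minimizes the expected 1-NN error (3). As a concrete reference point, a generic 1-NN rule with a pluggable distance looks like the sketch below (ours, purely illustrative; the feature map phi is an assumed input, and under the assumptions of Section 2 the CDM itself is recovered by squared Euclidean distance between suitably transformed feature vectors):

    import numpy as np

    def one_nn_label(x, train_xs, train_ys, dist):
        # label a novel x with the label of its nearest training point under dist
        i_star = min(range(len(train_xs)), key=lambda i: dist(x, train_xs[i]))
        return train_ys[i_star]

    def feature_space_cdm(phi):
        # under the assumptions of Section 2, the CDM is squared Euclidean
        # distance between (suitably transformed) feature vectors
        return lambda x, xp: float(np.sum((phi(x) - phi(xp)) ** 2))

    # usage: one_nn_label(x, train_xs, train_ys, feature_space_cdm(phi))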
4 LEARNING THE CDM

For most environments encountered in practice (e.g. speech recognition or image recognition), \rho will be unknown. In this section it is shown how \rho may be estimated or learnt using function approximation techniques (e.g. feedforward neural networks).

4.1 SAMPLING THE ENVIRONMENT

To learn the CDM \rho, the learner is provided with a class of functions (e.g. neural networks) G, where each g \in G maps X \times X \to [0, M]. The goal of the learner is to find a g such that the error between g and the CDM \rho is small. For the sake of argument this error will be measured by the expected squared loss:

    er_\rho(g) := E_{X \times X} [g(x, x') - \rho(x, x')]^2,    (4)

where the expectation is with respect to P^2.

Ordinarily the learner would be provided with training data in the form (x, x', \rho(x, x')) and would use this data to minimize an empirical version of (4). However, \rho is unknown, so to generate data of this form \rho must be estimated for each training pair x, x'. Hence to generate training sets for learning the CDM, both the distribution Q over the environment F and the distribution P over the input space X must be sampled. So let f := (f_1, ..., f_m) be m i.i.d. samples from F according to Q and let x := (x_1, ..., x_n) be n i.i.d. samples from X according to P. For any pair x_i, x_j an estimate of \rho(x_i, x_j) is given by

    \hat\rho(x_i, x_j) := \frac{1}{m} \sum_{k=1}^m \sigma(f_k(x_i), f_k(x_j)).    (5)

This gives n(n-1)/2 training triples,

    {(x_i, x_j, \hat\rho(x_i, x_j)): 1 \le i < j \le n},

which can be used as data to generate an empirical estimate of er_\rho(g):

    \hat{er}_{x,f}(g) := \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} [g(x_i, x_j) - \hat\rho(x_i, x_j)]^2.    (6)

Only n(n-1)/2 of the possible n^2 training triples are used because the functions g \in G are assumed to already be symmetric and to satisfy g(x, x) = 0 for all x (if this is not the case then set g'(x, x') := (g(x, x') + g(x', x))/2 for x \neq x' and g'(x, x) = 0, and use G' := {g': g \in G} instead).

In [3] an experiment was presented in which G was a neural network class and (6) was minimized directly by gradient descent. In Section 5 we present an alternative technique in which a set of features is first learnt for the environment and then an estimate of \rho in feature space is constructed explicitly.
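A direct implementation of the sampling scheme in (5) and (6) is straightforward. The sketch below is ours and purely illustrative; the sampled functions, the input sample and the candidate g are assumed inputs. It builds the n(n-1)/2 training triples from m sampled functions and evaluates the empirical error of a candidate g:

    import itertools
    import numpy as np

    def cdm_training_triples(fs, xs, sigma=lambda y, yp: (y - yp) ** 2):
        # Eq. (5): rho_hat(x_i, x_j) = (1/m) sum_k sigma(f_k(x_i), f_k(x_j))
        vals = np.array([[f(x) for x in xs] for f in fs])        # shape (m, n)
        triples = []
        for i, j in itertools.combinations(range(len(xs)), 2):   # n(n-1)/2 pairs, i < j
            rho_hat = float(np.mean(sigma(vals[:, i], vals[:, j])))
            triples.append((xs[i], xs[j], rho_hat))
        return triples

    def empirical_error(g, triples):
        # Eq. (6): empirical estimate of er_rho(g) over the training triples
        return float(np.mean([(g(x, xp) - r) ** 2 for x, xp, r in triples]))

Minimizing empirical_error over g in G by gradient descent is the approach taken in [3]; Section 5 instead constructs g explicitly from learnt features.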
4.2 UNIFORM CONVERGENCE

We wish to ensure good generalisation from a g minimizing \hat{er}_{x,f}, in the sense that (for small \epsilon, \delta),

    Pr { x, f: \sup_{g \in G} |\hat{er}_{x,f}(g) - er_\rho(g)| > \epsilon } < \delta.

The following theorem shows that this occurs if both the number of functions m and the number of input samples n are sufficiently large. Some exotic (but nonetheless benign) measurability restrictions have been ignored in the statement of the theorem. In the statement of the theorem, N(\epsilon, G) denotes the size of the smallest \epsilon-cover of G under the L^1(P^2) norm, where {g_1, ..., g_N} is an \epsilon-cover of G if for all g \in G there exists g_i such that ||g_i - g|| \le \epsilon.

Theorem 2. Assume the range of the functions in the environment F is no more than [-\sqrt{B}/2, \sqrt{B}/2] and the range of the functions in the class G (used to approximate the CDM) is no more than [0, \sqrt{B}]. For all \epsilon > 0 and 0 < \delta \le 1, if

    m \ge \frac{32 B^4}{\epsilon^2} \log \frac{4}{\delta}    (7)

and

    n \ge \frac{512 B^2}{\epsilon^2} \left( \log N(\epsilon, G) + \log \frac{512 B^2}{\epsilon^2} + \log \frac{8}{\delta} \right),    (8)

then

    Pr { x, f: \sup_{g \in G} |\hat{er}_{x,f}(g) - er_\rho(g)| > \epsilon } \le \delta.    (9)

Proof. For each g \in G, define

    er_x(g) := \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} [g(x_i, x_j) - \rho(x_i, x_j)]^2.    (10)

If for any x = (x_1, ..., x_n),

    Pr { f: \sup_{g \in G} |\hat{er}_{x,f}(g) - er_x(g)| > \epsilon/2 } \le \delta/2    (11)

and

    Pr { x: \sup_{g \in G} |er_x(g) - er_\rho(g)| > \epsilon/2 } \le \delta/2,    (12)

then by the triangle inequality (9) will hold. We treat (11) and (12) separately.

Equation (11). To simplify the notation let g_{ij}, \rho_{ij} and \hat\rho_{ij} denote g(x_i, x_j), \rho(x_i, x_j) and \hat\rho(x_i, x_j) respectively. Now,

    |\hat{er}_{x,f}(g) - er_x(g)| = \frac{2}{n(n-1)} \left| \sum_{1 \le i < j \le n} (g_{ij} - \hat\rho_{ij})^2 - \sum_{1 \le i < j \le n} (g_{ij} - \rho_{ij})^2 \right|
        = \frac{2}{n(n-1)} \left| \sum_{1 \le i < j \le n} (\rho_{ij} - \hat\rho_{ij})(2 g_{ij} - \rho_{ij} - \hat\rho_{ij}) \right|
        \le \frac{4B}{n(n-1)} \left| \sum_{1 \le i < j \le n} (\rho_{ij} - \hat\rho_{ij}) \right|
        = \left| E_{f \sim Q} \chi(f) - \frac{1}{m} \sum_{k=1}^m \chi(f_k) \right|,

where \chi: F \to [0, 4B^2] is defined by

    \chi(f) := \frac{4B}{n(n-1)} \sum_{1 \le i < j \le n} \sigma(f(x_i), f(x_j)).

Thus,

    Pr { f: \sup_{g \in G} |\hat{er}_{x,f}(g) - er_x(g)| > \epsilon/2 } \le Pr { f: \left| E_{f \sim Q} \chi(f) - \frac{1}{m} \sum_{k=1}^m \chi(f_k) \right| > \epsilon/2 },

which is \le 2 \exp(-m \epsilon^2 / (32 B^4)) by Hoeffding's inequality. Setting this less than \delta/2 gives the bound on m in Theorem 2.

Equation (12). Without loss of generality, suppose that n is even. The trick here is to split the sum over all pairs (x_i, x_j) (with i < j) appearing in the definition of er_x(g) into a double sum:

    er_x(g) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} [g(x_i, x_j) - \rho(x_i, x_j)]^2
            = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{2}{n} \sum_{j=1}^{n/2} [g(x_{\sigma_i(j)}, x_{\sigma'_i(j)}) - \rho(x_{\sigma_i(j)}, x_{\sigma'_i(j)})]^2,

where for each i = 1, ..., n-1, \sigma_i and \sigma'_i are permutations on {1, ..., n} such that {\sigma_i(1), ..., \sigma_i(n/2)} \cap {\sigma'_i(1), ..., \sigma'_i(n/2)} is empty. That there exist permutations with this property such that the sum can be broken up in this way can be proven easily by induction. Now, conditional on each \sigma_i, the n/2 pairs X_i := {(x_{\sigma_i(j)}, x_{\sigma'_i(j)}): j = 1, ..., n/2} are an i.i.d. sample from X \times X according to P^2. So by standard results from real-valued function learning with squared loss [4]:

    Pr { X_i: \sup_{g \in G} \left| \frac{2}{n} \sum_{j=1}^{n/2} [g(x_{\sigma_i(j)}, x_{\sigma'_i(j)}) - \rho(x_{\sigma_i(j)}, x_{\sigma'_i(j)})]^2 - er_\rho(g) \right| > \frac{\epsilon}{2} } \le 4 N(\epsilon, G) \exp\left( -\frac{n \epsilon^2}{512 B^2} \right).

Hence, by the union bound,

    Pr { x: \sup_{g \in G} |er_x(g) - er_\rho(g)| > \epsilon/2 } \le 4 (n-1) N(\epsilon, G) \exp\left( -\frac{n \epsilon^2}{512 B^2} \right).

Setting n as in the statement of the theorem ensures this is less than \delta/2.  □

Remark. The bound on m (the number of functions that need to be sampled from the environment) is independent of the complexity of the class G. This should be contrasted with related bias learning (or, equivalently, learning-to-learn) results [1] in which the number of functions does depend on the complexity. The heuristic explanation for this is that here we are only learning a distance function on the input space (the CDM), whereas in bias learning we are learning an entire hypothesis space that is appropriate for the environment. However, we shall see in the next section how for certain classes of problems the CDM can also be used to learn the functions in the environment. Hence in these cases learning the CDM is a more effective method of learning to learn.
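The pair-splitting step in the proof of (12) asserts, by induction, that the n(n-1)/2 pairs can be arranged into n-1 groups of n/2 mutually disjoint pairs. One explicit construction is the standard round-robin "circle method"; the sketch below is ours, not from the paper, and simply builds and checks such a splitting for even n:

    def disjoint_pair_rounds(n):
        # split all n(n-1)/2 pairs of {0,...,n-1} (n even) into n-1 rounds,
        # each round consisting of n/2 mutually disjoint pairs ("circle method")
        assert n % 2 == 0
        others = list(range(1, n))
        rounds = []
        for _ in range(n - 1):
            pairs = [(0, others[0])]
            pairs += [(others[1 + j], others[-1 - j]) for j in range((n - 2) // 2)]
            rounds.append(pairs)
            others = others[1:] + others[:1]          # rotate the non-fixed indices
        return rounds

    # sanity check: every pair occurs exactly once, and each round is a perfect matching
    rounds = disjoint_pair_rounds(8)
    flat = [tuple(sorted(p)) for r in rounds for p in r]
    assert len(flat) == len(set(flat)) == 8 * 7 // 2
    assert all(len({i for p in r for i in p}) == 8 for r in rounds)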
5 EXPERIMENT: JAPANESE OCR

To verify the optimality of the CDM for 1-NN classification, and also to show how it can be learnt in a non-trivial domain (only a toy example was given in [3]), the CDM was learnt for a Japanese OCR environment. Specifically, there were 3018 functions f in the environment F, each one a classifier for a different Kanji character. A database containing 90,918 segmented, machine-printed Kanji characters scanned from various sources was purchased from the CEDAR group at the State University of New York, Buffalo. The quality of the images ranged from clean to very degraded (see http://www.cedar.buffalo.edu/Databases/JOcR/).

The main reason for choosing Japanese OCR rather than English OCR as a test-bed was the large number of distinct characters in Japanese. Recall from Theorem 2 that to get good generalisation from a learnt CDM, sufficiently many functions must be sampled from the environment. If the environment consisted only of English characters then it is likely that "sufficiently many" characters would mean all characters, and so it would be impossible to test the learnt CDM on novel characters not seen in training.

Instead of learning the CDM directly by minimizing (6), it was learnt implicitly by first learning a set of neural network features for the functions in the environment. The features were learnt using the method outlined in [1], which essentially involves learning a set of classifiers with a common final hidden layer. The features were learnt on 400 of the 3018 classifiers in the environment, using 90% of the data in training and 10% in testing. Each resulting classifier was a linear combination of the neural network features. The average error of the classifiers was 2.85% on the test set (which is an accurate estimate as there were 9092 test examples).

Recall from Section 2 that if all f \in F can be expressed as f = w \cdot \Phi for a fixed feature set \Phi, then the CDM reduces to \rho(x, x') = (\Phi(x) - \Phi(x')) W (\Phi(x) - \Phi(x'))', where W = \int w'w \, dQ(w). The result of the learning procedure above is a set of features \hat\Phi and 400 weight vectors w_1, ..., w_{400}, such that for each of the character classifiers f_i used in training, f_i \approx w_i \cdot \hat\Phi. Thus, g(x, x') := (\hat\Phi(x) - \hat\Phi(x')) \hat{W} (\hat\Phi(x) - \hat\Phi(x'))' is an empirical estimate of the true CDM, where \hat{W} := \sum_{i=1}^{400} w_i' w_i. With a linear change of variable \hat\Phi \to \hat\Phi \sqrt{\hat{W}}, g becomes g(x, x') = ||\hat\Phi(x) - \hat\Phi(x')||^2. This g was used to do 1-NN classification on the test examples in two different experiments.
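The construction just described translates directly into code. The sketch below is ours and illustrative only; the learnt feature map phi_hat and the weight vectors stand in for the outputs of the feature-learning stage and are assumed inputs here. It forms W_hat, absorbs it into the features via a matrix square root, and does 1-NN classification with the resulting squared Euclidean distance:

    import numpy as np

    def build_feature_space_cdm(phi_hat, weight_vectors):
        # W_hat = sum_i w_i' w_i (k x k); absorb it into the features so that
        # g(x, x') = ||phi_tilde(x) - phi_tilde(x')||^2
        W_hat = sum(np.outer(w, w) for w in weight_vectors)
        eigvals, eigvecs = np.linalg.eigh(W_hat)
        sqrt_W = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T

        def phi_tilde(x):
            return sqrt_W @ phi_hat(x)

        def g(x, xp):
            d = phi_tilde(x) - phi_tilde(xp)
            return float(d @ d)

        return g, phi_tilde

    def one_nn_in_feature_space(x, train_xs, train_ys, phi_tilde):
        # map the training set and the query into feature space, then return the
        # label of the nearest training point under squared Euclidean distance
        feats = np.stack([phi_tilde(xi) for xi in train_xs])
        dists = np.sum((feats - phi_tilde(x)) ** 2, axis=1)
        return train_ys[int(np.argmin(dists))]

Using the symmetric square root (rather than a Cholesky factor) keeps the change of variable well defined even when W_hat is rank-deficient, since W_hat is only required to be positive semi-definite.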
In the first experiment, all testing and training examples that were not examples of one of the 400 training characters were lumped into an extra category for the purpose of classification. All test examples were then given the label of their nearest neighbour in the training set under g (i.e. initially all training examples were mapped into feature space to give {\hat\Phi(x_1), ..., \hat\Phi(x_n)}; then each test example x was mapped into feature space and assigned the same label as argmin_{x_i} ||\hat\Phi(x) - \hat\Phi(x_i)||^2). The total misclassification error was 2.2%, which can be directly compared with the 2.85% misclassification error of the original classifiers. The CDM does better because it uses the training data explicitly and the information stored in the network to make a comparison, whereas the classifiers only use the information in the network. The learnt CDM was also used to do k-NN classification with k > 1. However, this afforded no improvement. For example, the error of the 3-NN classifier was 2.54% and the error of the 20-NN classifier was 3.99%. This provides an indication that the CDM may not be the optimal distortion measure to use if k-NN classification (k > 1) is the aim.

In the second experiment g was again used to do 1-NN classification on the test set, but this time all 3018 characters were distinguished. So in this case the learnt CDM was being asked to distinguish between 2618 characters that were treated as a single character when it was being trained. The misclassification error was a surprisingly low 7.5%. This compares favourably with the 4.8% error achieved on the same data by the CEDAR group, using a carefully selected feature set and a hand-tailored nearest-neighbour routine [5]. In our case the distance measure was learnt from raw-data input, and has not been the subject of any optimization or tweaking.

As a final, more qualitative assessment, the learnt CDM was used to compute the distance between every pair of testing examples, and then the distance between each pair of characters (an individual character being represented by a number of testing examples) was computed by averaging the distances between their constituent examples. The nearest neighbours of each character were then calculated. With this measure, every character turned out to be its own nearest neighbour, and in many cases the next-nearest neighbours bore a strong subjective similarity to the original. Some representative examples are shown in Figure 1.

Figure 1: Six Kanji characters (first character in each row) and examples of their four nearest neighbours (remaining four characters in each row).

6 CONCLUSION

We have shown that the Canonical Distortion Measure (CDM) is the optimal distortion measure for 1-NN classification, and that for environments in which all the functions can be expressed as a linear combination of a fixed set of features, the CDM is squared Euclidean distance in feature space. A technique for learning the CDM was presented and PAC-like bounds on the sample complexity required for good generalisation were proved.

Experimental results were presented in which the CDM for a Japanese OCR environment was learnt by first learning a common set of features for a subset of the character classifiers in the environment. The learnt CDM was then used as a distance measure in 1-NN classification, and performed remarkably well, both on the characters used to train it and on entirely novel characters.

References

[1] Jonathan Baxter. Learning Internal Representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pages 311-320. ACM Press, 1995.

[2] Jonathan Baxter. The Canonical Metric for Vector Quantisation. NeuroColt Technical Report 047, Royal Holloway College, University of London, July 1995.

[3] Jonathan Baxter. The Canonical Distortion Measure for Vector Quantization and Function Approximation. In Proceedings of the Fourteenth International Conference on Machine Learning, July 1997. To appear.

[4] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 1997.
[5] S. N. Srihari, T. Hong, and Z. Shi. Cherry Blossom: A System for Reading Unconstrained Handwritten Page Images. In Symposium on Document Image Understanding Technology (SDIUT), 1997.
", "award": [], "sourceid": 1357, "authors": [{"given_name": "Jonathan", "family_name": "Baxter", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}