{"title": "Unsupervised Classifiers, Mutual Information and 'Phantom Targets", "book": "Advances in Neural Information Processing Systems", "page_first": 1096, "page_last": 1101, "abstract": null, "full_text": "Unsupervised Classifiers, Mutual Information and 'Phantom Targets'\n\nJohn S. Bridle\nAnthony J.R. Heading\nDefence Research Agency\nSt. Andrew's Road, Malvern\nWorcs. WR14 3PS, U.K.\n\nDavid J.C. MacKay\nCalifornia Institute of Technology 139-74\nPasadena CA 91125 U.S.A.\n\nAbstract\n\nWe derive criteria for training adaptive classifier networks to perform unsupervised data analysis. The first criterion turns a simple Gaussian classifier into a simple Gaussian mixture analyser. The second criterion, which is much more generally applicable, is based on mutual information. It simplifies to an intuitively reasonable difference between two entropy functions, one encouraging 'decisiveness,' the other 'fairness' to the alternative interpretations of the input. This 'firm but fair' criterion can be applied to any network that produces probability-type outputs, but it does not necessarily lead to useful behavior.\n\n1  Unsupervised Classification\n\nOne of the main distinctions made in discussing neural network architectures, and pattern analysis algorithms generally, is between supervised and unsupervised data analysis. We should therefore be interested in any method of building bridges between techniques in these two categories. For instance, it is possible to use an unsupervised system such as a Boltzmann machine to learn the joint distribution of inputs and a teacher's classification labels. The particular type of bridge we seek is a method of taking a supervised pattern classifier and turning it into an unsupervised data analyser.
That is, we are interested in methods of \"bootstrapping\" classifiers.\n\nConsider a classifier system. Its input is a vector $x$, and the output is a probability vector $y(x)$. (That is, the elements of $y$ are positive and sum to 1.) The elements of $y$, ($y_i(x)$, $i = 1 ... N_c$), are to be taken as the probabilities that $x$ should be assigned to each of $N_c$ classes. (Note that our definition of classifier does not include a decision process.)\n\nTo enforce the conditions we require for the output values, we recommend using a generalised logistic (normalised exponential, or SoftMax) output stage. We call the unnormalised log probabilities of the classes $a_i$, and the softmax performs:\n\n$$y_i = e^{a_i}/Z \\quad \\text{with} \\quad Z = \\sum_j e^{a_j} \\qquad (1)$$\n\nNormally the parameters of such a system would be adjusted using a training set comprising examples of inputs and corresponding classes, $\\{(x_i, c_i)\\}$. We assume that the system includes means to convert derivatives of a training criterion with respect to the outputs into a form suitable for adjusting the values of the parameters, for instance by \"backpropagation\".\n\nImagine however that we have unlabelled data, $x_m$, $m = 1, ..., N_{ts}$, and wish to use it to 'improve' the classifier. We could think of this as self-supervised learning, to hone an already good system on lots of easily-obtained unlabelled real-world data, or to adapt to a slowly changing environment, or as a way of turning a classifier into some sort of cluster analyser. (Just what kind depends on details of the classifier itself.) The ideal method would be theoretically well-founded, general-purpose (independent of the details of the classifier), and computationally tractable.
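A minimal sketch of the SoftMax output stage of equation (1), written in plain Python for illustration. The function name softmax and the max-subtraction step are assumptions of this example and are not part of the original text; the subtraction is a common numerical safeguard that cancels in the ratio and so leaves the outputs unchanged.

```python
import math

def softmax(activations):
    # Subtracting the maximum cancels in the ratio e^a_i / Z, so the
    # outputs are unchanged, but it avoids overflow in exp.
    a_max = max(activations)
    exps = [math.exp(a - a_max) for a in activations]
    z = sum(exps)  # normalising factor Z of equation (1)
    return [e / z for e in exps]
```

The returned list is positive and sums to 1, so it can be read as the class probability vector y described above.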
\n\nOne well known approach to unsupervised data analysis is to minimise a reconstruction error: for linear projections and squared Euclidean distance this leads to principal components analysis, while reference-point based classifiers lead to vector quantizer design methods, such as the LBG algorithm. Variants on VQ, such as Kohonen's feature maps, can be motivated by requiring robustness to distortions in the code space. Reconstruction error is only available as a training criterion if reconstruction is defined: in general we are only given class label probabilities.\n\n2  A Data Likelihood Criterion\n\nFor the special case of a Gaussian clustering of an unlabelled data set, it was demonstrated in [1] that gradient ascent on the likelihood of the data has an appealing interpretation in terms of backpropagation in an equivalent unit-Gaussian classifier network: for each input $x$ presented to the network, the output $y$ is doubled to give 'phantom targets' $t = 2y$; when the derivatives of the log likelihood criterion $J = -\\sum_i t_i \\log y_i$ relative to these targets are propagated back through the network, it turns out that the resulting gradient is identical to the gradient of the likelihood of the data given a Gaussian mixture model.\n\nFor the unit-Gaussian classifier, the activations $a_i$ in (1) are\n\n$$a_i = -|x - w_i|^2, \\qquad (2)$$\n\nso the outputs of the network are\n\n$$y_i = P(\\text{class} = i \\mid x, w) \\qquad (3)$$\n\nwhere we assume the inputs are drawn from equi-probable unit-Gaussian distributions with the mean of the distribution of the $i$th class equal to $w_i$.\n\nThis result was only derived in a limited context, and it was speculated that it might be generalisable to arbitrary classification models. The above phantom target
rule has been re-derived for a larger class of networks [4], but the conditions for strict applicability are quite severe. Briefly, there should be exponential density functions for each class, and the normalizing factors for these densities should be independent of the parameters. Thus Gaussians with fixed covariance matrices are acceptable, but variable covariances are not, and neither are linear transformations preceding the Gaussians.\n\nThe next section introduces a new objective function which is independent of details of the classifier.\n\n3  Mutual Information Criterion\n\nIntuitively, an unsupervised adaptive classifier is doing a plausible job if its outputs usually give a fairly clear indication of the class of an input vector, and if there is also an even distribution of input patterns between the classes. We could label these desiderata 'decisive' and 'fair' respectively. Note that it is trivial to achieve either of them alone. For a poorly regularised model it may also be trivial to achieve both.\n\nThere are several ways to proceed. We could devise ad-hoc measures corresponding to our notions of decisiveness and fairness, or we could consider particular types of classifier and their unsupervised equivalents, seeking a general way of turning one into the other. Our approach is to return to the general idea that the class predictions should retain as much information about the input values as possible. We use a measure of the information about $x$ which is conveyed by the output distribution, i.e. the mutual information between the inputs and the outputs. We interpret the outputs $y$ as a probability distribution over a discrete random variable $c$ (the class label), thus $y = p(c|x)$.
The mutual information between $x$ and $c$ is\n\n$$I(c; x) = \\iint dc\\, dx\\; p(c, x) \\log \\frac{p(c, x)}{p(c)\\, p(x)} \\qquad (4)$$\n\n$$= \\int dx\\; p(x) \\int dc\\; p(c|x) \\log \\frac{p(c|x)}{p(c)} \\qquad (5)$$\n\n$$= \\int dx\\; p(x) \\int dc\\; p(c|x) \\log \\frac{p(c|x)}{\\int dx\\; p(x)\\, p(c|x)} \\qquad (6)$$\n\nThe elements of this expression are separately recognizable:\n$\\int dx\\, p(x)(\\cdot)$ is equivalent to an average over a training set, $\\frac{1}{N_{ts}} \\sum_{ts} (\\cdot)$;\n$p(c|x)$ is simply the network output $y_c$;\n$\\int dc\\, (\\cdot)$ is a sum over the class labels and corresponding network outputs.\nHence:\n\n$$I(c; x) = \\frac{1}{N_{ts}} \\sum_{ts} \\sum_{i=1}^{N_c} y_i \\log \\frac{y_i}{\\bar{y}_i} \\qquad (7)$$\n\n$$= -\\sum_{i=1}^{N_c} \\bar{y}_i \\log \\bar{y}_i + \\frac{1}{N_{ts}} \\sum_{ts} \\sum_{i=1}^{N_c} y_i \\log y_i \\qquad (8)$$\n\n$$= \\mathcal{H}(\\bar{y}) - \\overline{\\mathcal{H}(y)} \\qquad (9)$$\n\nwhere $\\bar{y}_i = \\frac{1}{N_{ts}} \\sum_{ts} y_i$ is the average of output $i$ over the training set.\n\nThe objective function $I$ is the difference between the entropy of the average of the outputs, and the average of the entropy of the outputs, where both averages are over the training set. $\\mathcal{H}(\\bar{y})$ has its maximum value when the average activities of the separate outputs are equal: this is 'fairness'. $\\overline{\\mathcal{H}(y)}$ has its minimum value when one output is full on and the rest are off for every training case: this is 'firmness'.\n\nWe now evaluate $I$ for the training set and take the gradient of $I$.\n\n4  Gradient descent\n\nTo use this criterion with back-propagation network training, we need its derivatives with respect to the network outputs.\n\n$$\\frac{\\partial I(c; x)}{\\partial y_i} = \\frac{\\partial}{\\partial y_i} \\left[ -\\sum_{j=1}^{N_c} \\bar{y}_j \\log \\bar{y}_j + \\frac{1}{N_{ts}} \\sum_{ts} \\sum_{j=1}^{N_c} y_j \\log y_j \\right] \\qquad (10)$$\n\n$$= -\\frac{1}{N_{ts}} (\\log \\bar{y}_i + 1) + \\frac{1}{N_{ts}} (\\log y_i + 1) \\qquad (11)$$\n\n$$= \\frac{1}{N_{ts}} \\log \\frac{y_i}{\\bar{y}_i} \\qquad (12)$$\n\nThe resulting expression is quite simple, but note that the presence of a $\\bar{y}_i$ term means that two passes through the training set are required: the first to calculate the average output node activations, and the second to back-propagate the derivatives.
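As a minimal sketch of the criterion and its two-pass gradient (an illustration in plain Python, not the authors' implementation), the code below evaluates I as the entropy of the mean output minus the mean output entropy, following equation (9), and returns the derivative log(y_i / ybar_i) / N_ts obtained by differentiating that expression term by term while treating the outputs as free variables. The function names and the list-of-lists layout (one probability vector per training case) are assumptions made for this example.

```python
import math

def mean_outputs(outputs):
    # First pass over the training set: average each output over all cases.
    n_ts = len(outputs)
    n_c = len(outputs[0])
    return [sum(y[i] for y in outputs) / n_ts for i in range(n_c)]

def entropy(p):
    # Shannon entropy in nats; terms with p_i = 0 contribute zero.
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

def mutual_information(outputs):
    # I = H(mean output) - mean of H(output), as in equation (9).
    y_bar = mean_outputs(outputs)
    mean_entropy = sum(entropy(y) for y in outputs) / len(outputs)
    return entropy(y_bar) - mean_entropy

def mutual_information_grad(outputs):
    # Second pass: dI/dy_i for each training case. Requires every y_i > 0,
    # which holds for softmax outputs.
    n_ts = len(outputs)
    y_bar = mean_outputs(outputs)
    return [[math.log(y_i / b_i) / n_ts for y_i, b_i in zip(y, y_bar)]
            for y in outputs]
```

For completely hard outputs spread evenly over the classes, mutual_information reaches log N_c; when every training case produces the same output vector it is zero. In a network, the values from mutual_information_grad would be passed to backpropagation as the derivatives with respect to the outputs.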
\n\n5  Illustrations\n\nFigure 1 shows $I$ (divided by its maximum possible value, $\\log N_c$) for a run of a particular unit-Gaussian classifier network. The 30 data points are drawn from a 2-d isotropic Gaussian. Figure 2 shows the fairness and firmness criteria separately. (The upper curve is 'fairness', $\\mathcal{H}(\\bar{y})/\\log N_c$, and the lower curve is 'firmness', $(1 - \\overline{\\mathcal{H}(y)}/\\log N_c)$.)\n\n[Figures 1 to 4. Figure 1: The M.I. criterion, plotted against Iteration (0 to 100). Figure 2: Firm and Fair separately, plotted against Iteration (0 to 100). Figure 3: Tracks of reference points. Figure 4: Decision Regions.]\n\nThe ten reference points had starting values drawn from the same distribution as the data. Figure 3 shows their movement during training. From initial positions within the data cluster, they move outwards into a circle around the data. The resulting classification regions are shown in Figure 4. (The grey level is proportional to the value of the maximum response at each point, and since the outputs are positive and normalised this value drops to 0.5 or less at the decision boundaries.) We observe that the space is being partitioned into regions with roughly equal numbers of points. It might be surprising at first that the reference points do not end up near the data. However, it is only the transformation from data $x$ to outputs $y$ that is being trained, and the reference points are just parameters of that transformation. As the reference points move further away from one another the decision boundaries grow firmer.
In this example the fairness criterion happens to decrease in favour of the firmness, and this usually happens. We could consider different weightings of the two components of the criterion.\n\n6  Comments\n\nThe usefulness of this objective function will depend very much on the form of classifier that it is applied to. For a poorly regularised classifier, maximisation of the criterion alone will not necessarily lead to good solutions to unsupervised classification; it could be maximised by any implausible classification of the input that is completely hard (i.e. the output vector always has one 1 and all the other outputs 0), and that chops the training set into regions containing similar numbers of training points; such a solution would be one of many global maxima, regardless of whether it chopped the data into natural classes.\n\nThe meaning of a 'natural' partition in this context is, of course, rather ill-defined. Simple models often do not have the capacity to break a pattern space into highly contorted regions; the decision boundaries shown in the figure below are an example of a model producing a reasonable result as a consequence of its inherent simplicity. When we use more complex models, however, we must ensure that we find simpler solutions in preference to more complex ones. Thus this criterion encourages us to pursue objective techniques for regularising classification networks [2, 3]; such techniques are probably long overdue.\n\nCopyright \u00a9 Controller HMSO London 1992\n\nReferences\n\n[1] J.S. Bridle (1988).
The phantom target cluster network: a peculiar relative of (unsupervised) maximum likelihood stochastic modelling and (supervised) error backpropagation. RSRE Research Note SP4: 66, DRA Malvern UK.\n\n[2] D.J.C. MacKay (1991). Bayesian interpolation. Submitted to Neural Computation.\n\n[3] D.J.C. MacKay (1991). A practical Bayesian framework for backprop networks. Submitted to Neural Computation.\n\n[4] J.S. Bridle and S.J. Cox (1991). Recnorm: Simultaneous normalisation and classification applied to speech recognition. In Advances in Neural Information Processing Systems 3. Morgan Kaufmann.\n\n[5] J.S. Bridle (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2. Morgan Kaufmann.\n", "award": [], "sourceid": 440, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}, {"given_name": "Anthony", "family_name": "Heading", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}]}