{"title": "Unsupervised and Supervised Clustering: The Mutual Information between Parameters and Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 232, "page_last": 238, "abstract": null, "full_text": "Unsupervised and  supervised clustering: \n\nthe mutual information  between \n\nparameters and  observations \n\nDidier Herschkowitz \n\nJean-Pierre  Nadal \n\nLaboratoire de  Physique Statistique de  l'E.N.S .* \n\nEcole  Normale Superieure \n\n24, rue  Lhomond - 75231  Paris cedex  05,  France \n\nherschko@lps.ens.fr \n\nnadal@lps.ens.fr \nhttp://www.lps.ens.frrrisc/rescomp \n\nAbstract \n\nRecent  works  in  parameter  estimation  and  neural  coding  have \ndemonstrated that optimal performance are related to the  mutual \ninformation between parameters and data.  We  consider the mutual \ninformation in  the case where  the dependency in the parameter (a \nvector  8)  of  the  conditional  p.d.f.  of each  observation  (a  vector \n0, is  through  the  scalar  product  8.~ only.  We  derive  bounds  and \nasymptotic behaviour for  the mutual information and compare with \nresults  obtained on  the same model with the\" replica technique\" . \n\n1 \n\nINTRODUCTION \n\nIn  this  contribution  we  consider  an  unsupervised  clustering  task.  Recent  results \non  neural  coding and  parameter estimation  (supervised  and unsupervised  learning \ntasks)  show  that  the  mutual  information  between  data  and  parameters  (equiva(cid:173)\nlently between neural activities and stimulus)  is  a relevant tool for  deriving optimal \nperformances (Clarke and Barron, 1990; Nadal and Parga, 1994;  Opper and Kinzel, \n1995; Haussler and Opper, 1995; Opper and Haussler, 1995; Rissanen,  1996;  BruneI \nand Nadal  1998). \n\nLaboratory  associated  with  C.N.R.S.  (U.R.A.  1306) ,  ENS,  and Universities  Paris  VI \n\nand Paris  VII. \n\n\fMutual Information between Parameters and Observations \n\n233 \n\nWith  this  tool  we  analyze  a  particular  case  which  has  been  studied  extensively \nwith the \"replica technique\"  in  the framework of statistical mechanics  (Watkin and \nNadal, 1994; Reimann and Van den  Broeck,  1996;  Buhot and Gordon, 1998).  After \nintroducing  the  model  in  the  next  section,  we  consider  the  mutual  information \nbetween  the  patterns  and  the  parameter.  We  derive  a  bound  on  it  which  is  of \ninterest  for  not  too  large  p.  We  show  how  the  \"free  energy\"  associated  to  Gibbs \nlearning is  related  to the  mutual  information.  We  then  compare  the  exact  results \nwith  replica  calculations.  We  show  that  the  asymptotic  behaviour  (p  > >  N)  of \nthe  mutual information is  in  agreement with the exact result which  is  known to be \nrelated to the  Fish er  information (Clarke and Barron, 1990; Rissanen , 1996;  Brunei \nand  Nadal  1998).  However  for  moderate  values  of a  = pIN, we  can eliminate false \nsolutions  of the replica calculation.  Finally,  we  give  bounds  related  to  the  mutual \ninformation between the parameter and its estimators, and discuss common features \nof parameter estimation and neural coding. \n\n2  THE MODEL \n\nWe  consider  the  problem  where  a  direction  0  (a unit  vector)  of dimension  N  has \nto  be  found  based  on  the  observation  of p  patterns.  The  probability  distribution \nof the  patterns is  uniform  except  in  the  unknown  symmetry-breaking direction  O. 
Various instances of this problem have been studied recently within the statistical mechanics framework, making use of the replica technique (Watkin and Nadal, 1994; Reimann and Van den Broeck, 1996; Buhot and Gordon, 1998). More specifically, it is assumed that a set of patterns D = {ξ^μ}, μ = 1,...,p, is generated by p independent samplings from a non-uniform probability distribution P(ξ|θ), where θ = {θ_1,...,θ_N} represents the symmetry-breaking orientation. The probability is written in the form

P(ξ|θ) = (2π)^(-N/2) exp( -ξ²/2 - V(λ) )    (1)

where N is the dimension of the space, λ = θ·ξ is the overlap and V(λ) characterizes the structure of the data in the breaking direction. As justified within the Bayesian and statistical physics frameworks, one has to consider a prior distribution on the parameter space, ρ(θ), e.g. the uniform distribution on the sphere.

The mutual information I(D;θ) between the data and θ is defined by

I(D;θ) = ∫ dθ ρ(θ) ∫ dD P(D|θ) ln[ P(D|θ) / P(D) ]    (2)

It can be rewritten as

I(D;θ)/N = -α <V(λ)> - (1/N) <<ln Z>>    (3)

where

Z = ∫ dθ ρ(θ) exp( -Σ_{μ=1..p} V(λ^μ) )    (4)

In the statistical physics literature -ln Z is a \"free energy\". The double brackets << .. >> stand for the average over the pattern distribution, and < .. > is the average over the resulting overlap distribution. We will consider properties valid for any N and any p, others valid for p >> N, and the replica calculations, which are valid for N and p large at any given value of α = p/N.

3 LINEAR BOUND

The mutual information, a positive quantity, cannot grow faster than linearly in the amount of data, p. We derive the simple linear bound

I(D;θ) ≤ -p <V(λ)>    (5)

We prove the inequality for the case <λ> = 0. The extension to the case <λ> ≠ 0 is straightforward. The mutual information can be written as I = H(D) - H(D|θ). The calculation of H(D|θ) is straightforward:

H(D|θ) = (pN/2) ln(2πe) + (p/2)(<λ²> - 1) + p <V>    (6)

Now, the entropy of the data, H(D) = -∫ dD P(D) ln P(D), is less than or equal to the entropy of a Gaussian distribution with the same variance. We thus calculate the covariance matrix of the data,

E_θ[ << ξ_i^μ ξ_j^ν >> ] = δ_μν ( δ_ij + (<λ²> - 1) E_θ[θ_i θ_j] )    (7)

where E_θ[..] denotes the average over the parameter distribution. We then have

H(D) ≤ (pN/2) ln(2πe) + (p/2) Σ_{i=1..N} ln( 1 + (<λ²> - 1) γ_i )    (8)

where the γ_i are the eigenvalues of the matrix E_θ[θ_i θ_j]. Using Σ_i θ_i² = 1 (θ is a unit vector, so Σ_i γ_i = 1) and the property ln(1+x) ≤ x, we obtain

H(D) ≤ (pN/2) ln(2πe) + (p/2)(<λ²> - 1)    (9)

Putting (9) and (6) together, we find the inequality (5). From this and (3) it also follows that

p <V> ≤ -<<ln Z>> ≤ 0    (10)

4 REPLICA CALCULATIONS

In the limit N → ∞ with α finite, the free energy becomes self-averaging, that is, equal to its average, and its calculation can be performed by the standard replica technique. This calculation is the same as the calculations related to Gibbs learning done in (Reimann and Van den Broeck, 1996; Buhot and Gordon, 1998), but the interpretation of the order parameters is different.
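As a hedged numerical illustration of the right-hand side of the linear bound (5): with the normalisation of eq. (1), exp(-V(λ)) = √(2π) exp(λ²/2) P(λ), so -<V(λ)> is the Kullback-Leibler divergence between the overlap density P(λ) and a standard Gaussian. The short script below (an illustration under these assumptions, not part of the original analysis) evaluates it for the two-cluster density introduced in the next section (eq. (11)), using the illustrative values ρ = 1.2 and σ = 0.5.

import numpy as np

# Evaluate -<V(lambda)>, the per-pattern slope in the linear bound (5),
# as the KL divergence between P(lambda) and a standard Gaussian.
rho, sigma = 1.2, 0.5                              # assumed cluster parameters
lam = np.linspace(-8.0, 8.0, 20001)
P = (np.exp(-(lam - rho)**2 / (2 * sigma**2))
     + np.exp(-(lam + rho)**2 / (2 * sigma**2))) / (2 * sigma * np.sqrt(2 * np.pi))
gauss = np.exp(-lam**2 / 2) / np.sqrt(2 * np.pi)
minus_mean_V = np.trapz(P * np.log(P / gauss), lam)
print('linear bound per pattern, -<V> =', minus_mean_V)
# so that I(D;theta)/N <= alpha * minus_mean_V for alpha = p/N  (eq. (5))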
Assuming replica symmetry, we reproduce in fig. 2 results from (Buhot and Gordon, 1998) for the behaviour with α of Q, which is the typical overlap between two directions compatible with the data. The overlap distribution P(λ) was chosen so as to get patterns distributed according to two clusters along the symmetry-breaking direction:

P(λ) = 1/(2σ√(2π)) Σ_{ε=±1} exp( -(λ - ερ)²/(2σ²) )    (11)

In fig. 2 and fig. 1 we show the corresponding behaviour of the average free energy and of the mutual information.

4.1 Discussion

Up to α_1, Q = 0 and the mutual information is in a purely linear phase, I(D;θ)/N = -α <V(λ)>. This corresponds to a regime where the data have no correlations. For α ≥ α_1, the replica calculation admits up to three different solutions. In view of the fact that the mutual information can never decrease with α and that the average free energy cannot be positive, it follows that only two behaviours are acceptable. In the first, Q leaves the solution Q = 0 at α_1 and follows the lower branch until α_3, where it jumps to the upper branch. This is the stable way. The second possibility is that Q = 0 until α_2, where it jumps directly to the upper branch. In (Buhot and Gordon, 1998) it has been suggested that one can reach the upper branch well before α_3. Here we have thus shown that this is only possible from α_2 on. There remains also the possibility of a replica symmetry breaking phase in this range of α.

In the limit α → ∞ the replica calculation gives for the behaviour of the mutual information

I(D;θ) ≃ (N/2) ln( α <(dV(λ)/dλ)²> )    (12)

The r.h.s. can be shown to be equal to half the logarithm of the determinant of the Fisher information matrix, which is the exact asymptotic behaviour (Clarke and Barron, 1990; Brunel and Nadal, 1998). It can be shown that this behaviour for p >> N implies that the best possible estimator based on the data will saturate the Cramer-Rao bound (see e.g. Blahut, 1988). It has already been noted that the asymptotic performance in estimating the direction, as computed by the replica technique, saturates this bound (Van den Broeck, 1997). What we have checked here is that this manifests itself in the behaviour of the mutual information for large α.

4.2 Bounds for specific estimators

Given the data D, one wants to find an estimate J of the parameter. The amount of information I(D;θ) limits the performance of the estimator. Indeed, one has I(J;θ) ≤ I(D;θ). This basic relationship allows one to derive interesting bounds based on the choice of particular estimators. We consider first Gibbs learning, which consists in sampling a direction J from the 'a posteriori' probability P(J|D) = P(D|J)ρ(J)/P(D). In this particular case, the differential entropies of the estimator J and of the parameter θ are equal, H(J) = H(θ). If 1 - Q_g² is the variance of the Gibbs estimator one gets, for a Gaussian prior on θ, the relations

-(N/2) ln(1 - Q_g²) ≤ I_Gibbs(J;θ) ≤ I(D;θ)    (13)

These relations, together with the linear bound (5), allow one to bound the order parameter Q_g for small α, where this bound is of interest.
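A short sketch of how these statements can be evaluated numerically, assuming relation (13) as written above, the two-cluster density of eq. (11) with the illustrative values ρ = 1.2 and σ = 0.5, and variable names chosen only for this example: combining (13) with the linear bound (5) gives Q_g ≤ √(1 - exp(2α<V>)) for small α, while eq. (12) gives the large-α behaviour of I/N.

import numpy as np

# (i) small-alpha bound on Q_g from relation (13) combined with the linear
#     bound (5); (ii) large-alpha asymptotics of eq. (12) for I(D;theta)/N.
rho, sigma = 1.2, 0.5                              # assumed cluster parameters
lam = np.linspace(-8.0, 8.0, 20001)
P = (np.exp(-(lam - rho)**2 / (2 * sigma**2))
     + np.exp(-(lam + rho)**2 / (2 * sigma**2))) / (2 * sigma * np.sqrt(2 * np.pi))
V = -np.log(np.sqrt(2 * np.pi) * P) - lam**2 / 2   # V(lambda) read off from eq. (1)
dV = np.gradient(V, lam)
mean_V = np.trapz(P * V, lam)                      # <V(lambda)>, a negative number
fisher = np.trapz(P * dV**2, lam)                  # <(dV/dlambda)^2>
for alpha in (0.5, 1.0, 5.0, 20.0):
    qg_bound = np.sqrt(1.0 - np.exp(2.0 * alpha * mean_V))
    print(f'alpha={alpha:5.1f}  Q_g <= {qg_bound:.3f}  I/N ~ {0.5 * np.log(alpha * fisher):.3f}')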
The Bayes estimator consists in taking for J the center of mass of the 'a posteriori' probability. In the limit α → ∞, this distribution becomes Gaussian, centered at its most probable value. We can thus assume P_Bayes(J|θ) to be Gaussian with mean Q_b θ and variance 1 - Q_b²; then the first inequality in (13) (with Q_g replaced by Q_b and Gibbs by Bayes) becomes an equality. Then, using the Cramer-Rao bound on the variance of the estimator, that is (1 - Q_b²)/Q_b² ≥ ( α <(dV/dλ)²> )^(-1), one can bound the mutual information for the Bayes estimator:

I_Bayes(J;θ) ≤ (N/2) ln( 1 + α <(dV(λ)/dλ)²> )    (14)

These different quantities are shown in fig. 1.

5 CONCLUSION

We have studied the mutual information between data and parameter in a problem of unsupervised clustering: we derived bounds and asymptotic behaviour, and compared these results with replica calculations. Most of the results concerning the behaviour of the mutual information, observed for this particular clustering task, are \"universal\", in that they will be qualitatively the same for any problem which can be formulated as either a parameter estimation task or a neural coding/signal processing task. In particular, there is a linear regime for a small enough amount of data (number of coding cells), up to a maximal value related to the VC dimension of the system. For large data size, the behaviour is logarithmic - that is, I ~ ln p (Nadal and Parga, 1994; Opper and Haussler, 1995) or I ~ (1/2) ln p (Clarke and Barron, 1990; Opper and Haussler, 1995; Brunel and Nadal, 1998), depending on the smoothness of the model. A more detailed review with more such universal features, exact bounds and relations between unsupervised and supervised learning will be presented elsewhere (Nadal and Herschkowitz, to appear in Phys. Rev. E).

Acknowledgements

We thank Arnaud Buhot and Mirta Gordon for stimulating discussions. This work has been partly supported by the French contract DGA 96 2557 A/DSP.

References

[B88] R. E. Blahut. Principles and Practice of Information Theory. Addison-Wesley, Cambridge, MA, 1988.

[BG98] A. Buhot and M. Gordon. Phys. Rev. E, 57(3):3326-3333, 1998.

[BN98] N. Brunel and J.-P. Nadal. Neural Computation, to appear, 1998.

[CB90] B. S. Clarke and A. R. Barron. IEEE Trans. on Information Theory, 36(3):453-471, 1990.

[HO95] D. Haussler and M. Opper. General bounds on the mutual information between a parameter and n conditionally independent observations. In VIIIth Ann. Workshop on Computational Learning Theory (COLT'95), pages 402-411, Santa Cruz, 1995 (ACM, New York).

[OH95] M. Opper and D. Haussler. Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75:3772-3775, 1995.

[NP94a] J.-P. Nadal and N. Parga. Duality between learning machines: a bridge between supervised and unsupervised learning. Neural Computation, 6:489-506, 1994.

[OK95] M. Opper and W. Kinzel. In E. Domany, J. L. van Hemmen and K. Schulten, editors, Physics of Neural Networks, pages 151-. Springer, 1995.

[Ris96] J. Rissanen. IEEE Trans. on Information Theory, 42(1):40-47, 1996.

[RVdB96] P. Reimann and C. Van den Broeck. Phys. Rev. E, 53(4):3989-3998, 1996.
[VdB98] C. Van den Broeck. In Proceedings of the TANC workshop, Hong Kong, May 26-28, 1997.

[WN94] T. Watkin and J.-P. Nadal. J. Phys. A: Math. and Gen., 27:1899-1915, 1994.

Figure 1: Dashed line is the linear bound on the mutual information I(D;θ)/N. The latter, calculated with the replica technique, saturates the bound for α ≤ α_1, and is the (lower) solid line for α > α_1. The special structure of fig. 2 is not visible here due to the graph scale. The curve -(1/2) ln(1 - Q_g²) is a lower bound on the mutual information between the Gibbs estimator and θ (it would be equal to this bound if the conditional probability distribution of the estimator were Gaussian with mean Q_g θ and variance 1 - Q_g²). Shown also is the analogous curve -(1/2) ln(1 - Q_b²) for the Bayes estimator. In the limit α → ∞ these two Gaussian curves and the replica information all converge toward the exact asymptotic behaviour, which can be expressed as (1/2) ln(1 + α <(dV(λ)/dλ)²>) (upper solid line). This latter expression is, for any p, an upper bound for the two Gaussian curves.

Figure 2: In the lower panel, the optimal learning curve Q_b(α) for ρ = 1.2 and σ = 0.5, as computed in (Buhot and Gordon, 1998) under the replica symmetric ansatz, together with the Cramer-Rao bound on this quantity. In the upper panel, the average free energy -<<ln Z>>/N. All the part above zero has to be rejected. α_1 = 2.10, α_2 = 2.515 and α_3 = 2.527.", "award": [], "sourceid": 1625, "authors": [{"given_name": "Didier", "family_name": "Herschkowitz", "institution": null}, {"given_name": "Jean-Pierre", "family_name": "Nadal", "institution": null}]}