{"title": "Learning from Dyadic Data", "book": "Advances in Neural Information Processing Systems", "page_first": 466, "page_last": 472, "abstract": null, "full_text": "Learning from Dyadic Data \n\nThomas Hofmann\u00b7, Jan Puzicha+, Michael I. Jordan\u00b7 \n\n\u00b7 Center for Biological and Computational Learning, M.I.T., Cambridge, MA, {hofmann, jordan}@ai.mit.edu \n\n+ Institut f\u00fcr Informatik III, Universit\u00e4t Bonn, Germany, jan@cs.uni-bonn.de \n\nAbstract \n\nDyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains. \n\n1  Introduction \n\nOver the past decade learning from data has become a highly active field of research distributed over many disciplines like pattern recognition, neural computation, statistics, machine learning, and data mining. Most domain-independent learning architectures as well as the underlying theories of learning have focused on a feature-based data representation by vectors in a Euclidean space. For this restricted case substantial progress has been achieved. 
However, a variety of important problems do not fit into this setting, and far fewer advances have been made for data types based on different representations. \n\nIn this paper, we present a general framework for unsupervised learning from dyadic data. The notion dyadic refers to a domain with two (abstract) sets of objects, $\\mathcal{X} = \\{x_1, \\ldots, x_N\\}$ and $\\mathcal{Y} = \\{y_1, \\ldots, y_M\\}$, in which observations $\\mathcal{S}$ are made for dyads $(x_i, y_k)$. In the simplest case, on which we focus, an elementary observation consists just of $(x_i, y_k)$ itself, i.e., a co-occurrence of $x_i$ and $y_k$, while other cases may also provide a scalar value $w_{ik}$ (strength of preference or association). Some exemplary application areas are: (i) Computational linguistics with the corpus-based statistical analysis of word co-occurrences with applications in language modeling, word clustering, word sense disambiguation, and thesaurus construction. (ii) Text-based information retrieval, where $\\mathcal{X}$ may correspond to a document collection, $\\mathcal{Y}$ to keywords, and $(x_i, y_k)$ would represent the occurrence of a term $y_k$ in a document $x_i$. (iii) Modeling of preference and consumption behavior by identifying $\\mathcal{X}$ with individuals and $\\mathcal{Y}$ with objects or stimuli as in collaborative filtering. (iv) Computer vision, in particular in the context of image segmentation, where $\\mathcal{X}$ corresponds to image locations, $\\mathcal{Y}$ to discretized or categorical feature values, and a dyad $(x_i, y_k)$ represents a feature $y_k$ observed at a particular location $x_i$. 
\n\n2  Mixture Models for Dyadic Data \n\nAcross different domains there are at least two tasks which play a fundamental role in unsupervised learning from dyadic data: (i) probabilistic modeling, i.e., learning a joint or conditional probability model over $\\mathcal{X} \\times \\mathcal{Y}$, and (ii) structure discovery, e.g., identifying clusters and data hierarchies. The key problem in probabilistic modeling is data sparseness: How can probabilities for rarely observed or even unobserved co-occurrences be reliably estimated? As an answer we propose a model-based approach and formulate latent class or mixture models. The latter have the further advantage of offering a unifying method for probabilistic modeling and structure discovery. There are at least three (four, if both variants in II are counted) different ways of defining latent class models: \n\nI. The most direct way is to introduce an (unobserved) mapping $c : \\mathcal{X} \\times \\mathcal{Y} \\to \\{c_1, \\ldots, c_K\\}$ that partitions $\\mathcal{X} \\times \\mathcal{Y}$ into $K$ classes. This type of model is called aspect-based and the pre-image $c^{-1}(c_\\alpha)$ is referred to as an aspect. \n\nII. Alternatively, a class can be defined as a subset of one of the spaces $\\mathcal{X}$ (or $\\mathcal{Y}$ by symmetry, yielding a different model), i.e., $c : \\mathcal{X} \\to \\{c_1, \\ldots, c_K\\}$, which induces a unique partitioning on $\\mathcal{X} \\times \\mathcal{Y}$ by $c(x_i, y_k) \\equiv c(x_i)$. This model is referred to as one-sided clustering and $c^{-1}(c_\\alpha) \\subseteq \\mathcal{X}$ is called a cluster. \n\nIII. If latent classes are defined for both sets, $c : \\mathcal{X} \\to \\{c^x_1, \\ldots, c^x_K\\}$ and $c : \\mathcal{Y} \\to \\{c^y_1, \\ldots, c^y_L\\}$, respectively, this induces a mapping $c$ which is a $K \\cdot L$-partitioning of $\\mathcal{X} \\times \\mathcal{Y}$. This model is called two-sided clustering. 
\n\n2.1  Aspect Model for Dyadic Data \n\nIn order to specify an aspect model we make the assumption that all co-occurrences in the sample set $\\mathcal{S}$ are i.i.d. and that $x_i$ and $y_k$ are conditionally independent given the class. With parameters $P(x_i|c_\\alpha)$, $P(y_k|c_\\alpha)$ for the class-conditional distributions and prior probabilities $P(c_\\alpha)$ the complete data probability can be written as \n\n$$P(\\mathcal{S}, c) = \\prod_{i,k} [P(c_{ik}) P(x_i|c_{ik}) P(y_k|c_{ik})]^{n(x_i, y_k)}, \\quad (1)$$ \n\nwhere $n(x_i, y_k)$ are the empirical counts for dyads in $\\mathcal{S}$ and $c_{ik} \\equiv c(x_i, y_k)$. By summing over the latent variables $c$ the usual mixture formulation is obtained \n\n$$P(\\mathcal{S}) = \\prod_{i,k} P(x_i, y_k)^{n(x_i, y_k)}, \\quad \\text{where} \\quad P(x_i, y_k) = \\sum_\\alpha P(c_\\alpha) P(x_i|c_\\alpha) P(y_k|c_\\alpha). \\quad (2)$$ \n\nFollowing the standard Expectation Maximization approach for maximum likelihood estimation [Dempster et al., 1977], the E-step equations for the class posterior probabilities are given by$^1$ \n\n$$P\\{c_{ik} = c_\\alpha\\} \\propto P(c_\\alpha) P(x_i|c_\\alpha) P(y_k|c_\\alpha). \\quad (3)$$ \n\n$^1$ In the case of multiple observations of dyads it has been assumed that each observation may have a different latent class. If only one latent class variable is introduced for each dyad, slightly different equations are obtained. \n\nFigure 1: Some aspects of the Bible (bigrams). For each aspect the figure lists $P(c_\\alpha)$ and the words with maximal $P(x_i|c_\\alpha)$ and maximal $P(y_k|c_\\alpha)$. \n\nIt is straightforward to derive the M-step re-estimation formulae \n\n$$P(c_\\alpha) \\propto \\sum_{i,k} n(x_i, y_k) P\\{c_{ik} = c_\\alpha\\}, \\quad P(x_i|c_\\alpha) \\propto \\sum_k n(x_i, y_k) P\\{c_{ik} = c_\\alpha\\}, \\quad (4)$$ \n\nand an analogous equation for $P(y_k|c_\\alpha)$. By re-parameterization the aspect model can also be characterized by a cross-entropy criterion. Moreover, formal equivalence to the aggregate Markov model, independently proposed for language modeling in [Saul, Pereira, 1997], has been established (cf. [Hofmann, Puzicha, 1998] for details). \n\n2.2  One-Sided Clustering Model \n\nThe complete data model proposed for the one-sided clustering model is \n\n$$P(\\mathcal{S}, c) = P(c) P(\\mathcal{S}|c) = \\left( \\prod_i P(c(x_i)) \\right) \\left( \\prod_{i,k} [P(x_i) P(y_k|c(x_i))]^{n(x_i, y_k)} \\right), \\quad (5)$$ \n\nwhere we have made the assumption that observations $(x_i, y_k)$ for a particular $x_i$ are conditionally independent given $c(x_i)$. This effectively defines the mixture \n\n$$P(\\mathcal{S}) = \\prod_i P(\\mathcal{S}_i), \\quad P(\\mathcal{S}_i) = \\sum_\\alpha P(c_\\alpha) \\prod_k [P(x_i) P(y_k|c_\\alpha)]^{n(x_i, y_k)}, \\quad (6)$$ \n\nwhere $\\mathcal{S}_i$ are all observations involving $x_i$. Notice that co-occurrences in $\\mathcal{S}_i$ are not independent (as they are in the aspect model), but get coupled by the (shared) latent variable $c(x_i)$. 
As before, it is straightforward to derive an EM algorithm with update equations \n\n$$P\\{c(x_i) = c_\\alpha\\} \\propto P(c_\\alpha) \\prod_k P(y_k|c_\\alpha)^{n(x_i, y_k)}, \\quad P(y_k|c_\\alpha) \\propto \\sum_i n(x_i, y_k) P\\{c(x_i) = c_\\alpha\\} \\quad (7)$$ \n\nand $P(c_\\alpha) \\propto \\sum_i P\\{c(x_i) = c_\\alpha\\}$, $P(x_i) \\propto \\sum_j n(x_i, y_j)$. The one-sided clustering model is similar to the distributional clustering model [Pereira et al., 1993]; however, there are two important differences: (i) the number of likelihood contributions in (7) scales with the number of observations, a fact which follows from Bayes' rule, and (ii) mixing proportions are missing in the original distributional clustering model. The one-sided clustering model corresponds to an unsupervised version of the naive Bayes classifier, if we interpret $\\mathcal{Y}$ as a feature space for objects $x_i \\in \\mathcal{X}$. There are also ways to weaken the conditional independence assumption, e.g., by utilizing a mixture of tree dependency models [Meila, Jordan, 1998]. \n\nFigure 2: Exemplary segmentation results on Aerial by one-sided clustering. \n\n2.3  Two-Sided Clustering Model \n\nThe latent variable structure of the two-sided clustering model significantly reduces the degrees of freedom in the specification of the class conditional distribution. We propose the following complete data model \n\n$$P(\\mathcal{S}, c) = \\prod_{i,k} P(c(x_i)) P(c(y_k)) [P(x_i) P(y_k) \\pi_{c(x_i), c(y_k)}]^{n(x_i, y_k)}, \\quad (8)$$ \n\nwhere $\\pi_{c^x, c^y}$ are cluster association parameters. In this model the latent variables in the $\\mathcal{X}$ and $\\mathcal{Y}$ space are coupled by the $\\pi$-parameters. Therefore, there exists no simple mixture model representation for $P(\\mathcal{S})$. Skipping some of the technical details (cf. 
[Hofmann, Puzicha, 1998]) we obtain $P(x_i) \\propto \\sum_k n(x_i, y_k)$, $P(y_k) \\propto \\sum_i n(x_i, y_k)$ and the M-step equations \n\n$$\\pi_{c^x_\\alpha, c^y_\\gamma} = \\frac{\\sum_{i,k} n(x_i, y_k) P\\{c(x_i) = c^x_\\alpha \\wedge c(y_k) = c^y_\\gamma\\}}{[\\sum_i P\\{c(x_i) = c^x_\\alpha\\} \\sum_k n(x_i, y_k)] [\\sum_k P\\{c(y_k) = c^y_\\gamma\\} \\sum_i n(x_i, y_k)]} \\quad (9)$$ \n\nas well as $P(c^x_\\alpha) \\propto \\sum_i P\\{c(x_i) = c^x_\\alpha\\}$ and $P(c^y_\\gamma) \\propto \\sum_k P\\{c(y_k) = c^y_\\gamma\\}$. To preserve tractability for the remaining problem of computing the posterior probabilities in the E-step, we apply a factorial approximation (mean field approximation), i.e., $P\\{c(x_i) = c^x_\\alpha \\wedge c(y_k) = c^y_\\gamma\\} \\approx P\\{c(x_i) = c^x_\\alpha\\} P\\{c(y_k) = c^y_\\gamma\\}$. This results in the following coupled approximation equations for the marginal posterior probabilities \n\n$$P\\{c(x_i) = c^x_\\alpha\\} \\propto P(c^x_\\alpha) \\exp\\left[ \\sum_k n(x_i, y_k) \\sum_\\gamma P\\{c(y_k) = c^y_\\gamma\\} \\log \\pi_{c^x_\\alpha, c^y_\\gamma} \\right] \\quad (10)$$ \n\nand a similar equation for $P\\{c(y_k) = c^y_\\gamma\\}$. The resulting approximate EM algorithm performs updates according to the sequence ($c^x$-post., $\\pi$, $c^y$-post., $\\pi$). Intuitively the (probabilistic) clustering in one set is optimized in alternation for a given clustering in the other space and vice versa. The two-sided clustering model can also be shown to maximize a mutual information criterion [Hofmann, Puzicha, 1998]. \n\n2.4  Discussion: Aspects and Clusters \n\nTo better understand the differences between the presented models it is instructive to systematically compare the conditional probabilities $P(c_\\alpha|x_i)$ and $P(c_\\alpha|y_k)$: \n\nAspect Model: $P(c_\\alpha|x_i) = P(x_i|c_\\alpha) P(c_\\alpha) / P(x_i)$, $P(c_\\alpha|y_k) = P(y_k|c_\\alpha) P(c_\\alpha) / P(y_k)$ \nOne-sided X Clustering: $P(c_\\alpha|x_i) = P\\{c(x_i) = c_\\alpha\\}$, $P(c_\\alpha|y_k) = P(y_k|c_\\alpha) P(c_\\alpha) / P(y_k)$ \nOne-sided Y Clustering: $P(c_\\alpha|x_i) = P(x_i|c_\\alpha) P(c_\\alpha) / P(x_i)$, $P(c_\\alpha|y_k) = P\\{c(y_k) = c_\\alpha\\}$ \nTwo-sided Clustering: $P(c_\\alpha|x_i) = P\\{c(x_i) = c^x_\\alpha\\}$, $P(c_\\alpha|y_k) = P\\{c(y_k) = c^y_\\alpha\\}$ 
\n\nAs can be seen from the above table, probabilities $P(c_\\alpha|x_i)$ and $P(c_\\alpha|y_k)$ correspond to posterior probabilities of latent variables if clusters are defined in the $\\mathcal{X}$- and $\\mathcal{Y}$-space, respectively. Otherwise, they are computed from model parameters. This is a crucial difference as, for example, the posterior probabilities approach Boolean values in the infinite data limit and $P(y_k|x_i) = \\sum_\\alpha P\\{c(x_i) = c_\\alpha\\} P(y_k|c_\\alpha)$ converges to one of the class-conditional distributions. Yet, in the aspect model, $P(y_k|x_i) = \\sum_\\alpha P(c_\\alpha|x_i) P(y_k|c_\\alpha)$ and $P(c_\\alpha|x_i) \\propto P(c_\\alpha) P(x_i|c_\\alpha)$ typically do not peak more sharply with an increasing number of observations. In the aspect model, conditionals $P(y_k|x_i)$ are inherently a weighted sum of the 'prototypical' distributions $P(y_k|c_\\alpha)$. Cluster models in turn ultimately look for the 'best' class-conditional and weights are only indirectly induced by the posterior uncertainty. \n\nFigure 3: Two-sided clustering of LOB: $\\pi$ matrix and most probable words. \n\n3  The Cluster-Abstraction Model \n\nThe models discussed in Section 2 all define a non-hierarchical, 'flat' latent class structure. However, for structure discovery it is important to find hierarchical data organizations. 
There are well-known architectures like the Hierarchical Mixtures of Experts [Jordan, Jacobs, 1994] which fit hierarchical models. Yet, in the case of dyadic data there is an alternative possibility to define a hierarchical model. The Cluster-Abstraction Model (CAM) is a clustering model (e.g., in $\\mathcal{X}$) where the conditionals $P(y_k|c_\\alpha)$ are themselves $x_i$-specific aspect mixtures, $P(y_k|c_\\alpha, x_i) = \\sum_\\nu P(y_k|a_\\nu) P(a_\\nu|c_\\alpha, x_i)$ with a latent aspect mapping $a$. To obtain a hierarchical organization, clusters $c_\\alpha$ are identified with the terminal nodes of a hierarchy (e.g., a complete binary tree) and aspects $a_\\nu$ with inner and terminal nodes. As a compatibility constraint it is imposed that $P(a_\\nu|c_\\alpha, x_i) = 0$ whenever the node corresponding to $a_\\nu$ is not on the path to the terminal node $c_\\alpha$. Intuitively, conditioned on a 'horizontal' clustering $c$ all observations $(x_i, y_k) \\in \\mathcal{S}_i$ for a particular $x_i$ have to be generated from one of the 'vertical' abstraction levels on the path to $c(x_i)$. Since different clusters share aspects according to their topological relation, this favors a meaningful hierarchical organization of clusters. Moreover, aspects at inner nodes do not simply represent averages over clusters in their subtree as they are forced to explicitly represent what is common to all subsequent clusters. \n\nSkipping the technical details, the E-step is given by \n\n$$P\\{a(x_i, y_k) = a_\\nu | c(x_i) = c_\\alpha\\} \\propto P(a_\\nu|c_\\alpha, x_i) P(y_k|a_\\nu) \\quad (11)$$ \n$$P\\{c(x_i) = c_\\alpha\\} \\propto P(c_\\alpha) \\prod_k \\sum_\\nu [P(a_\\nu|c_\\alpha, x_i) P(y_k|a_\\nu)]^{n(x_i, y_k)} \\quad (12)$$ \n\nand the M-step formulae are $P(y_k|a_\\nu) \\propto \\sum_i P\\{a(x_i, y_k) = a_\\nu\\} n(x_i, y_k)$, $P(c_\\alpha) \\propto \\sum_i P\\{c(x_i) = c_\\alpha\\}$, and $P(a_\\nu|c_\\alpha, x_i) \\propto \\sum_k P\\{a(x_i, y_k) = a_\\nu | c(x_i) = c_\\alpha\\} n(x_i, y_k)$. \n\nFigure 4: Parts of the top levels of a hierarchical clustering solution for the Neural document collection; aspects are represented by their 5 most probable word stems. \n\n4  Annealed Expectation Maximization \n\nAnnealed EM is a generalization of EM based on the idea of deterministic annealing [Rose et al., 1990] that has been successfully applied as a heuristic optimization technique to many clustering and mixture problems. Annealing reduces the sensitivity to local maxima, but, even more importantly in this context, it may also improve the generalization performance compared to maximum likelihood estimation.$^2$ The key idea in annealed EM is to introduce an (inverse temperature) parameter $\\beta$, and to replace the negative (averaged) complete data log-likelihood by a substitute known as the free energy (both are in fact equivalent at $\\beta = 1$). This effectively results in a simple modification of the E-step by taking the likelihood contribution in Bayes' rule to the power of $\\beta$. In order to determine the optimal value for $\\beta$ we used an additional validation set in a cross validation procedure. \n\n5  Results and Conclusions \n\nIn our experiments we have utilized the following real-world data sets: (i) Cranfield: a standard test collection from information retrieval ($N = 1400$, $M = 4898$), (ii) Penn: adjective-noun co-occurrences from the Penn Treebank corpus ($N = 6931$, $M = 4995$) and the LOB corpus ($N = 5448$, $M = 6052$), (iii) Neural: a document collection with abstracts of journal papers on neural networks ($N = 1278$, $M = 6065$), (iv) Bible: word bigrams from the Bible edition of the Gutenberg project ($N = M = 12858$), (v) Aerial: textured aerial images for segmentation ($N = 128 \\times 128$, $M = 192$). In Fig. 
1 we have visualized an aspect model fitted to the Bible bigram data. Notice that although $\\mathcal{X} = \\mathcal{Y}$ the role of the preceding and the subsequent words in bigrams is quite different. Segmentation results obtained on Aerial applying the one-sided clustering model are depicted in Fig. 2. A multi-scale Gabor filter bank (3 octaves, 4 orientations) was utilized as an image representation (cf. [Hofmann et al., 1998]). In Fig. 3 a two-sided clustering solution of LOB is shown. Fig. 4 shows the top levels of the hierarchy found by the Cluster-Abstraction Model in Neural. The inner node distributions provide resolution-specific descriptors for the documents in the corresponding subtree which can be utilized, e.g., in interactive browsing for information retrieval. Fig. 5 shows typical test set perplexity curves of the annealed EM algorithm for the aspect and clustering model ($\\mathcal{P} = e^{-l}$ where $l$ is the per-observation log-likelihood). At $\\beta = 1$ (standard EM) overfitting is clearly visible, an effect that vanishes with decreasing $\\beta$. Annealed learning also performs better than standard EM with early stopping. Tab. 1 systematically summarizes perplexity results for different models and data sets. \n\n$^2$ Moreover, the tree topology for the CAM is heuristically grown via phase transitions. \n\nFigure 5: Perplexity curves for annealed EM (aspect (a), (b) and one-sided clustering model (c)) on the Bible and Cran data. \n\nTable 1: Perplexity results for different models on the Cran (predicting words conditioned on documents) and Penn data (predicting nouns conditioned on adjectives). \n\nCran \nK | Aspect $\\beta$ | Aspect $\\mathcal{P}$ | X-cluster $\\beta$ | X-cluster $\\mathcal{P}$ | CAM $\\beta$ | CAM $\\mathcal{P}$ | X/Y-cluster $\\beta$ | X/Y-cluster $\\mathcal{P}$ \n1 | - | 685 | | | | | | \n8 | 0.88 | 482 | 0.09 | 527 | 0.18 | 511 | 0.67 | 615 \n16 | 0.72 | 255 | 0.07 | 302 | 0.10 | 268 | 0.51 | 335 \n32 | 0.83 | 386 | 0.07 | 452 | 0.12 | 438 | 0.53 | 506 \n64 | 0.79 | 360 | 0.06 | 527 | 0.11 | 422 | 0.48 | 477 \n128 | 0.78 | 353 | 0.04 | 663 | 0.10 | 410 | 0.45 | 462 \n\nPenn \nK | Aspect $\\beta$ | Aspect $\\mathcal{P}$ | X-cluster $\\beta$ | X-cluster $\\mathcal{P}$ | CAM $\\beta$ | CAM $\\mathcal{P}$ | X/Y-cluster $\\beta$ | X/Y-cluster $\\mathcal{P}$ \n1 | - | 639 | | | | | | \n8 | 0.73 | 312 | 0.08 | 352 | 0.13 | 322 | 0.55 | 394 \n16 | 0.72 | 255 | 0.07 | 302 | 0.10 | 268 | 0.51 | 335 \n32 | 0.71 | 205 | 0.07 | 254 | 0.08 | 226 | 0.46 | 286 \n64 | 0.69 | 182 | 0.07 | 223 | 0.07 | 204 | 0.44 | 272 \n128 | 0.68 | 166 | 0.06 | 231 | 0.06 | 179 | 0.40 | 241 \n\nIn conclusion, mixture models for dyadic data have shown a broad application potential. Annealing yields a substantial improvement in generalization performance compared to standard EM, in particular for clustering models, and also outperforms complexity control via $K$. In terms of perplexity, the aspect model has the best performance. Detailed performance studies and comparisons with other state-of-the-art techniques will appear in forthcoming papers. 
\n\nReferences \n\n[Dempster et al., 1977] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39, 1-38. \n\n[Hofmann, Puzicha, 1998] Hofmann, T., Puzicha, J. (1998). Statistical models for co-occurrence data. Tech. rept. Artificial Intelligence Laboratory Memo 1625, M.I.T. \n\n[Hofmann et al., 1998] Hofmann, T., Puzicha, J., Buhmann, J.M. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 803-818. \n\n[Jordan, Jacobs, 1994] Jordan, M.I., Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181-214. \n\n[Meila, Jordan, 1998] Meila, M., Jordan, M.I. (1998). Estimating Dependency Structure as a Hidden Variable. In: Advances in Neural Information Processing Systems 10. \n\n[Pereira et al., 1993] Pereira, F.C.N., Tishby, N.Z., Lee, L. (1993). Distributional clustering of English words. Pages 189-190 of: Proceedings of the ACL. \n\n[Rose et al., 1990] Rose, K., Gurewitz, E., Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8), 945-948. \n\n[Saul, Pereira, 1997] Saul, L., Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. In: Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing. \n", "award": [], "sourceid": 1503, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": null}, {"given_name": "Jan", "family_name": "Puzicha", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}