{"title": "Agglomerative Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 623, "abstract": null, "full_text": "Agglomerative Information Bottleneck \n\nNoam Slonim \n\nNaftali Tishby* \n\nInstitute of Computer Science and \n\nCenter for  Neural  Computation \n\nThe Hebrew  University \nJerusalem, 91904 Israel \n\nemail:  {noamm.tishby}(Qcs.huji.ac.il \n\nAbstract \n\nWe  introduce a novel distributional clustering algorithm that max(cid:173)\nimizes  the  mutual  information  per  cluster  between  data and  giv(cid:173)\nen  categories.  This  algorithm  can  be considered  as  a  bottom  up \nhard  version  of  the  recently  introduced  \"Information  Bottleneck \nMethod\".  The algorithm is  compared with the top-down soft  ver(cid:173)\nsion  of the  information  bottleneck  method  and  a  relationship  be(cid:173)\ntween the hard and soft results is established.  We  demonstrate the \nalgorithm on the 20 Newsgroups data set.  For a subset of two news(cid:173)\ngroups  we  achieve compression  by  3  orders  of magnitudes loosing \nonly  10%  of the original mutual information. \n\n1 \n\nIntroduction \n\nThe problem of self-organization of the members of a set X  based on the similarity \nof the conditional distributions of the members of another set, Y, {p(Ylx)}, was first \nintroduced in  [8]  and was  termed  \"distributional clustering\" . \n\nThis  question  was  recently  shown  in  [9]  to  be  a  special  case of a  much  more  fun(cid:173)\ndamental  problem:  What  are  the  features  of the  variable  X  that  are  relevant  for \nthe  prediction  of another,  relevance,  variable  Y?  This general problem was shown \nto have  a  natural  in~ormation theoretic formulation:  Find  a  compressed  represen(cid:173)\ntation  of the  variable  X,  denoted X,  such  that  the  mutual information  between X \nand  Y,  I (X; Y),  is  as  high  as  possible,  under  a  constraint  on  the  mutual  infor(cid:173)\nmation  between  X  and X,  I (X; X).  Surprisingly,  this  variational  problem  yields \nan  exact  self-consistent  equations  for  the  conditional  distributions  p(ylx),  p(xlx), \nand p(x).  This constrained information optimization problem was called in  [9]  The \nInformation  Bottleneck Method. \n\nThe original approach to the solution of the resulting equations, used already in [8], \nwas based on an analogy with the  \"deterministic annealing\"  approach to clustering \n(see  [7]).  This is  a top-down hierarchical algorithm that starts from  a single cluster \nand  undergoes  a  cascade of cluster  splits  which  are determined  stochastically  (as \nphase transitions)  into a  \"soft\"  (fuzzy)  tree of clusters. \nIn  this  paper  we  propose  an  alternative  approach  to  the  information  bottleneck \n\n\f618 \n\nN  Slonim and N  Tishby \n\nproblem, based on a greedy bottom-up merging.  It has several advantages over the \ntop-down method.  It is  fully  deterministic,  yielding  (initially)  \"hard clusters\",  for \nany desired number of clusters.  It gives higher mutual information per-cluster than \nthe  deterministic  annealing  algorithm  and  it  can  be  considered  as  the  hard  (zero \ntemperature) limit of deterministic annealing, for any prescribed number of clusters. 
Furthermore, using the bottleneck self-consistent equations one can "soften" the resulting hard clusters and recover the deterministic annealing solutions without the need to identify the cluster splits, which is rather tricky. The main disadvantage of this method is computational, since it starts from the limit of one cluster per member of the set X.

1.1 The information bottleneck method

The mutual information between the random variables X and Y is the symmetric functional of their joint distribution,

$$ I(X;Y) = \sum_{x \in X,\, y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = \sum_{x \in X,\, y \in Y} p(x)\,p(y|x)\,\log\frac{p(y|x)}{p(y)} . \qquad (1) $$

The objective of the information bottleneck method is to extract a compact representation of the variable X, denoted here by $\tilde{X}$, with minimal loss of mutual information to another, relevance, variable Y. More specifically, we want to find a (possibly stochastic) map, $p(\tilde{x}|x)$, that minimizes the (lossy) coding length of X via $\tilde{X}$, $I(X;\tilde{X})$, under a constraint on the mutual information to the relevance variable, $I(\tilde{X};Y)$. In other words, we want to find an efficient representation of the variable X, $\tilde{X}$, such that the predictions of Y from X through $\tilde{X}$ will be as close as possible to the direct prediction of Y from X.

As shown in [9], by introducing a positive Lagrange multiplier $\beta$ to enforce the mutual information constraint, the problem amounts to minimization of the Lagrangian

$$ \mathcal{L}[p(\tilde{x}|x)] = I(X;\tilde{X}) - \beta\, I(\tilde{X};Y) \qquad (2) $$

with respect to $p(\tilde{x}|x)$, subject to the Markov condition $\tilde{X} \rightarrow X \rightarrow Y$ and normalization.

This minimization yields directly the following self-consistent equations for the map $p(\tilde{x}|x)$, as well as for $p(y|\tilde{x})$ and $p(\tilde{x})$:

$$ \begin{cases} p(\tilde{x}|x) = \dfrac{p(\tilde{x})}{Z(\beta,x)}\,\exp\big(-\beta\, D_{KL}[\,p(y|x)\,\|\,p(y|\tilde{x})\,]\big) \\[4pt] p(y|\tilde{x}) = \sum_{x} p(y|x)\,p(x|\tilde{x}) \\[2pt] p(\tilde{x}) = \sum_{x} p(\tilde{x}|x)\,p(x) \end{cases} \qquad (3) $$

where $Z(\beta,x)$ is a normalization function. The functional $D_{KL}[p\|q] \equiv \sum_{y} p(y)\log\frac{p(y)}{q(y)}$ is the Kullback-Leibler divergence [3], which emerges here from the variational principle. These equations can be solved by iterations that are proved to converge for any finite value of $\beta$ (see [9]). The Lagrange multiplier $\beta$ has the natural interpretation of inverse temperature, which suggests deterministic annealing [7] to explore the hierarchy of solutions in $\tilde{X}$, an approach taken already in [8].

The variational principle, Eq. (2), also determines the shape of the annealing process, since by changing $\beta$ the mutual informations $I_X \equiv I(X;\tilde{X})$ and $I_Y \equiv I(Y;\tilde{X})$ vary such that

$$ \frac{\delta I_Y}{\delta I_X} = \beta^{-1} . \qquad (4) $$

Thus the optimal curve, which is analogous to the rate distortion function in information theory [3], follows a strictly concave curve in the $(I_X, I_Y)$ plane, called the information plane. Deterministic annealing, at a fixed number of clusters, follows such a concave curve as well, but this curve is suboptimal beyond a certain critical value of $\beta$.

Another interpretation of the bottleneck principle comes from the relation between the mutual information and the Bayes classification error. This error is bounded above and below (see [6]) by an important information theoretic measure of the class conditional distributions $p(x|y_i)$, called the Jensen-Shannon divergence. This measure plays an important role in our context.

The Jensen-Shannon divergence of M class distributions, $p_i(x)$, each with a prior $\pi_i$, $1 \le i \le M$, is defined as [6, 4]

$$ JS_{\Pi}[p_1, p_2, \ldots, p_M] \equiv H\Big[\sum_{i=1}^{M} \pi_i\, p_i(x)\Big] - \sum_{i=1}^{M} \pi_i\, H[p_i(x)] , \qquad (5) $$

where $H[p(x)]$ is Shannon's entropy, $H[p(x)] = -\sum_{x} p(x)\log p(x)$. The concavity of the entropy and Jensen's inequality guarantee the non-negativity of the JS-divergence.
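For concreteness, all of the quantities above can be computed directly from an empirical joint distribution. The following short Python/NumPy sketch is ours and illustrative only (the function and variable names are not part of the paper): it evaluates Eq. (1), the KL divergence, and the JS divergence of Eq. (5) for distributions stored as arrays.

    import numpy as np

    def mutual_information(pxy):
        # I(X;Y) of Eq. (1), for a joint distribution given as a 2-D array pxy[x, y].
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        mask = pxy > 0
        return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

    def kl(p, q):
        # D_KL[p || q]; assumes q > 0 wherever p > 0.
        mask = p > 0
        return float((p[mask] * np.log(p[mask] / q[mask])).sum())

    def entropy(p):
        # Shannon entropy H[p], in nats.
        mask = p > 0
        return float(-(p[mask] * np.log(p[mask])).sum())

    def js(priors, dists):
        # JS divergence of Eq. (5): entropy of the mixture minus the mixture of entropies.
        priors = np.asarray(priors, dtype=float)
        dists = np.asarray(dists, dtype=float)   # shape (M, |Y|), one distribution per row
        mixture = priors @ dists
        return entropy(mixture) - float(priors @ np.array([entropy(p) for p in dists]))

For two distributions p1, p2 with priors pi1, pi2, js([pi1, pi2], [p1, p2]) is the $JS_{\Pi_2}$ quantity that reappears below as the cost of merging two clusters.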
1.2 The hard clustering limit

For any finite cardinality of the representation, $|\tilde{X}| \equiv m$, the limit $\beta \rightarrow \infty$ of Eqs. (3) induces a hard partition of X into m disjoint subsets. In this limit each member $x \in X$ belongs only to the subset $\tilde{x} \in \tilde{X}$ for which $D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]$ is smallest, and the probabilistic map $p(\tilde{x}|x)$ obtains the limit values 0 and 1 only.

In this paper we focus on a bottom-up agglomerative algorithm for generating "good" hard partitions of X. We denote an m-partition of X, i.e. $\tilde{X}$ with cardinality m, also by $Z_m = \{z_1, z_2, \ldots, z_m\}$, in which case $p(\tilde{x}) = p(z_i)$. We say that $Z_m$ is an optimal m-partition (not necessarily unique) of X if for every other m-partition of X, $Z'_m$, we have $I(Z_m;Y) \ge I(Z'_m;Y)$. Starting from the trivial N-partition, with $N = |X|$, we seek a sequence of merges into coarser and coarser partitions that are as close as possible to optimal.

It is easy to verify that in the $\beta \rightarrow \infty$ limit Eqs. (3) for the m-partition distributions simplify as follows. Let $\tilde{x} \equiv z = \{x_1, x_2, \ldots, x_{|z|}\}$, $x_i \in X$, denote a specific component (i.e. cluster) of the partition $Z_m$; then

$$ \begin{cases} p(z|x) = \begin{cases} 1 & \text{if } x \in z \\ 0 & \text{otherwise} \end{cases} & \forall x \in X \\[6pt] p(y|z) = \dfrac{1}{p(z)} \sum_{i=1}^{|z|} p(x_i, y) & \forall y \in Y \\[6pt] p(z) = \sum_{i=1}^{|z|} p(x_i) \end{cases} \qquad (6) $$

Using these distributions one can easily evaluate the mutual information between $Z_m$ and Y, $I(Z_m;Y)$, and between $Z_m$ and X, $I(Z_m;X)$, using Eq. (1).

Once any hard partition, or hard clustering, is obtained, one can apply "reverse annealing" and "soften" the clusters by decreasing $\beta$ in the self-consistent equations, Eqs. (3). Using this procedure we in fact recover the stochastic map, $p(\tilde{x}|x)$, from the hard partition without the need to identify the cluster splits. We demonstrate this reverse deterministic annealing procedure in the last section.
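To illustrate Eq. (6), a hard m-partition can be represented simply as a list of index sets, one per cluster; the cluster distributions and $I(Z_m;Y)$ then follow directly from the empirical joint distribution. A minimal sketch, with the same caveat that the names and representation are our own choices:

    import numpy as np

    def partition_distributions(pxy, clusters):
        # Eq. (6): clusters is a list of index lists, one per component z of Z_m.
        # Returns p(z) (length m) and the coarsened joint p(z, y) (shape m x |Y|).
        pzy = np.stack([pxy[idx].sum(axis=0) for idx in clusters])
        pz = pzy.sum(axis=1)
        return pz, pzy

    def info_zy(pxy, clusters):
        # I(Z_m; Y), evaluated with Eq. (1) on the coarsened joint p(z, y).
        pz, pzy = partition_distributions(pxy, clusters)
        py = pxy.sum(axis=0)
        mask = pzy > 0
        return float((pzy[mask] * np.log(pzy[mask] / np.outer(pz, py)[mask])).sum())

For the trivial N-partition, clusters = [[i] for i in range(N)], and info_zy simply recovers $I(X;Y)$.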
1.3 Relation to other work

A similar agglomerative procedure, without the information theoretic framework and analysis, was recently used in [1] for text categorization on the 20 newsgroups corpus. Another approach that stems from the distributional clustering algorithm was given in [5] for clustering dyadic data. An earlier application of mutual information to semantic clustering of words was given in [2].

2 The agglomerative information bottleneck algorithm

The algorithm starts with the trivial partition into $N = |X|$ clusters or components, each containing exactly one element of X. At each step we merge several components of the current partition into a single new component in a way that locally minimizes the loss of mutual information $I(\tilde{X};Y) = I(Z_m;Y)$.

Let $Z_m$ be the current m-partition of X and $Z_{\bar{m}}$ denote the new $\bar{m}$-partition of X after the merge of several components of $Z_m$. Obviously, $\bar{m} < m$. Let $\{z_1, z_2, \ldots, z_k\} \subseteq Z_m$ denote the set of components to be merged, and $\bar{z}_k \in Z_{\bar{m}}$ the new component that is generated by the merge, so $\bar{m} = m - k + 1$.

To evaluate the reduction in the mutual information $I(Z_m;Y)$ due to this merge one needs the distributions that define the new $\bar{m}$-partition, which are determined as follows. For every $z \in Z_{\bar{m}}$, $z \neq \bar{z}_k$, its probability distributions $(p(z), p(y|z), p(z|x))$ remain equal to its distributions in $Z_m$. For the new component, $\bar{z}_k \in Z_{\bar{m}}$, we define

$$ \begin{cases} p(\bar{z}_k) = \sum_{i=1}^{k} p(z_i) \\[4pt] p(y|\bar{z}_k) = \dfrac{1}{p(\bar{z}_k)} \sum_{i=1}^{k} p(z_i, y) & \forall y \in Y \\[6pt] p(\bar{z}_k|x) = \begin{cases} 1 & \text{if } x \in z_i \text{ for some } 1 \le i \le k \\ 0 & \text{otherwise} \end{cases} & \forall x \in X \end{cases} \qquad (7) $$

It is easy to verify that $Z_{\bar{m}}$ is indeed a valid $\bar{m}$-partition with proper probability distributions.

Using the same notations, for every merge we define the additional quantities:

• The merge prior distribution: $\Pi_k \equiv (\pi_1, \pi_2, \ldots, \pi_k)$, where $\pi_i$ is the prior probability of $z_i$ in the merged subset, i.e. $\pi_i \equiv \frac{p(z_i)}{p(\bar{z}_k)}$.

• The Y-information decrease: the decrease in the mutual information $I(\tilde{X};Y)$ due to a single merge, $\delta I_Y(z_1, \ldots, z_k) \equiv I(Z_m;Y) - I(Z_{\bar{m}};Y)$.

• The X-information decrease: the decrease in the mutual information $I(\tilde{X};X)$ due to a single merge, $\delta I_X(z_1, z_2, \ldots, z_k) \equiv I(Z_m;X) - I(Z_{\bar{m}};X)$.

Our algorithm is a greedy procedure: in each step we perform "the best possible merge", i.e. merge the components $\{z_1, \ldots, z_k\}$ of the current m-partition which minimize $\delta I_Y(z_1, \ldots, z_k)$. Since $\delta I_Y(z_1, \ldots, z_k)$ can only increase with k (corollary 2), for a greedy procedure it is enough to check only the possible merges of pairs of components of the current m-partition. Another advantage of merging only pairs is that in this way we go through all the possible cardinalities of $Z = \tilde{X}$, from N to 1.

For a given m-partition $Z_m = \{z_1, z_2, \ldots, z_m\}$ there are $\frac{m(m-1)}{2}$ possible pairs to merge. To find "the best possible merge" one must evaluate the reduction of information $\delta I_Y(z_i, z_j) = I(Z_m;Y) - I(Z_{m-1};Y)$ for every pair in $Z_m$, which is $O(m \cdot |Y|)$ operations per pair. However, using proposition 1 we know that $\delta I_Y(z_i, z_j) = (p(z_i) + p(z_j)) \cdot JS_{\Pi_2}[p(y|z_i), p(y|z_j)]$, so the reduction in the mutual information due to the merge of $z_i$ and $z_j$ can be evaluated directly (looking only at this pair) in $O(|Y|)$ operations, a reduction by a factor of m in time complexity (for every merge).
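To make this local cost concrete, here is a small sketch (ours, with names of our own choosing) of $\delta I_Y$ for a single pair, computed exactly from the expression above given the two cluster weights and their conditional distributions over Y:

    import numpy as np

    def entropy(p):
        mask = p > 0
        return float(-(p[mask] * np.log(p[mask])).sum())

    def merge_cost(pz_i, pz_j, py_given_i, py_given_j):
        # delta I_Y(z_i, z_j) = (p(z_i) + p(z_j)) * JS_{Pi_2}[p(y|z_i), p(y|z_j)]
        w = pz_i + pz_j
        pi_i, pi_j = pz_i / w, pz_j / w                   # the merge prior Pi_2
        mixture = pi_i * py_given_i + pi_j * py_given_j   # p(y|.) of the merged cluster, Eq. (7)
        js = entropy(mixture) - pi_i * entropy(py_given_i) - pi_j * entropy(py_given_j)
        return w * js

The complete greedy procedure, which maintains these pairwise costs while merging, is summarized in the pseudo-code of Figure 1 below.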
Input: Empirical probability matrix $p(x,y)$, $N = |X|$, $M = |Y|$.
Output: $Z_m$: an m-partition of X into m clusters, for every $1 \le m \le N$.

Initialization:
• Construct $Z \equiv X$:
  - For $i = 1 \ldots N$
    * $z_i = \{x_i\}$
    * $p(z_i) = p(x_i)$
    * $p(y|z_i) = p(y|x_i)$ for every $y \in Y$
    * $p(z_i|x_j) = 1$ if $j = i$ and 0 otherwise
  - $Z = \{z_1, \ldots, z_N\}$
• For every $i, j = 1 \ldots N$, $i < j$, calculate
  $d_{i,j} = (p(z_i) + p(z_j)) \cdot JS_{\Pi_2}[p(y|z_i), p(y|z_j)]$
  (every $d_{i,j}$ points to the corresponding couple in Z).

Loop:
• For $t = 1 \ldots (N-1)$
  - Find $\{\alpha, \beta\} = \arg\min_{i,j} \{d_{i,j}\}$
    (if there are several minima, choose arbitrarily between them).
  - Merge $\{z_\alpha, z_\beta\} \Rightarrow \bar{z}$:
    * $p(\bar{z}) = p(z_\alpha) + p(z_\beta)$
    * $p(y|\bar{z}) = \frac{1}{p(\bar{z})}\,(p(z_\alpha, y) + p(z_\beta, y))$ for every $y \in Y$
    * $p(\bar{z}|x) = 1$ if $x \in z_\alpha \cup z_\beta$ and 0 otherwise, for every $x \in X$
  - Update $Z = \{Z - \{z_\alpha, z_\beta\}\} \cup \{\bar{z}\}$
    (Z is now a new (N-t)-partition of X with N-t clusters).
  - Update the $d_{i,j}$ costs and pointers w.r.t. $\bar{z}$
    (only for couples that contained $z_\alpha$ or $z_\beta$).
• End For

Figure 1: Pseudo-code of the algorithm.
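For readers who prefer running code, the following Python/NumPy sketch follows Figure 1 fairly closely. It is our own illustration, not the authors' implementation: it keeps a full matrix of pairwise costs instead of pointers, assumes every row of the input has positive mass, and makes no attempt at large-scale efficiency. The entropy and pair-cost helpers repeat the earlier sketches so that the block runs on its own.

    import numpy as np

    def entropy(p):
        mask = p > 0
        return float(-(p[mask] * np.log(p[mask])).sum())

    def pair_cost(pz, pzy, i, j):
        # d_{i,j} = (p(z_i) + p(z_j)) * JS_{Pi_2}[p(y|z_i), p(y|z_j)], as in Figure 1.
        w = pz[i] + pz[j]
        mix = (pzy[i] + pzy[j]) / w
        js = entropy(mix) - (pz[i] / w) * entropy(pzy[i] / pz[i]) \
                          - (pz[j] / w) * entropy(pzy[j] / pz[j])
        return w * js

    def agglomerative_ib(pxy):
        # pxy: empirical joint distribution p(x, y) as an (N, M) array with positive row sums.
        # Returns the merge history as (cluster_a, cluster_b, cost) triples, where each
        # cluster is a frozenset of original indices of X.
        n = pxy.shape[0]
        clusters = {i: frozenset([i]) for i in range(n)}          # the trivial N-partition
        pz = pxy.sum(axis=1).astype(float)                        # p(z_i) = p(x_i)
        pzy = pxy.astype(float).copy()                            # p(z_i, y) = p(x_i, y)
        cost = np.full((n, n), np.inf)                            # upper-triangular pair costs
        for i in range(n):
            for j in range(i + 1, n):
                cost[i, j] = pair_cost(pz, pzy, i, j)
        history = []
        for _ in range(n - 1):
            i, j = np.unravel_index(np.argmin(cost), cost.shape)  # the cheapest pair (i < j)
            history.append((clusters[i], clusters[j], cost[i, j]))
            clusters[i] = clusters[i] | clusters.pop(j)           # merge z_j into z_i, Eq. (7)
            pz[i] += pz[j]
            pzy[i] += pzy[j]
            cost[j, :] = np.inf                                   # deactivate z_j
            cost[:, j] = np.inf
            for k in clusters:                                    # refresh costs that involve z_i
                if k != i:
                    a, b = (i, k) if i < k else (k, i)
                    cost[a, b] = pair_cost(pz, pzy, i, k)
        return history

Replaying the returned merge history yields every m-partition $Z_m$ for $m = N, \ldots, 1$; evaluating $I(Z_m;Y)$ along the way gives curves like those in Figure 2.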
3 Discussion

The algorithm is non-parametric; it is a simple greedy procedure that depends only on the input empirical joint distribution of X and Y. The output of the algorithm is the hierarchy of all m-partitions $Z_m$ of X for $m = N, (N-1), \ldots, 2, 1$. Moreover, unlike most other clustering heuristics, it has a built-in measure of efficiency even for sub-optimal solutions, namely the mutual information $I(Z_m;Y)$, which bounds the Bayes classification error. The quality measure of the obtained $Z_m$ partition is the fraction of the mutual information between X and Y that $Z_m$ captures. This is given by the curve $\frac{I(Z_m;Y)}{I(X;Y)}$ vs. $m = |Z_m|$. We found that empirically this curve was concave. If this is always true, the decrease in the mutual information at every step, given by $\delta(m) \equiv I(Z_m;Y) - I(Z_{m-1};Y)$, can only increase with decreasing m. Therefore, if at some point $\delta(m)$ becomes relatively high, it is an indication that we have reached a value of m with a "meaningful" partition or clusters. Further merging results in substantial loss of information, and thus in a significant reduction of the performance of the clusters as features. However, since the computational cost of the final (low m) part of the procedure is very low, we can just as well complete the merging down to a single cluster.

Figure 2: On the left, the results of the agglomerative algorithm are shown in the "information plane", normalized $I(Z;Y)$ vs. normalized $I(Z;X)$, for the NG1000 dataset. They are compared to the soft version of the information bottleneck via "reverse annealing" for $|Z| = 2, 5, 10, 20, 100$ (the smooth curves on the left). For $|Z| = 20, 100$ the annealing curve is connected to the starting point by a dotted line. In this plane the hard algorithm is clearly inferior to the soft one. On the right-hand side: $I(Z_m;Y)$ of the agglomerative algorithm is plotted vs. the cardinality of the partition m for three subsets of the newsgroup dataset. To compare the performance over the different data cardinalities we normalize $I(Z_m;Y)$ by the value of $I(Z_{50};Y)$, thus forcing all three curves to start (and end) at the same points. The predictive information on the newsgroup for NG1000 and NG100 is very similar, while for the dichotomy dataset, 2ng, a much better prediction is possible at the same $|Z|$, as can be expected for dichotomies. The inset presents the full curve of the normalized $I(Z;Y)$ vs. $|Z|$ for the NG100 data for comparison. In this plane the hard partitions are superior to the soft ones.

4 Application

To evaluate the ideas and the algorithm we apply it to several subsets of the 20Newsgroups dataset, collected by Ken Lang, consisting of 20,000 articles evenly distributed among 20 UseNet discussion groups (see [1]). We replaced every digit by one special character and every non-alphanumeric character by another. Following this pre-processing, the first dataset contained the 530 strings that appeared more than 1000 times in the data. This dataset is referred to as NG1000. Similarly, all the strings that appeared more than 100 times constitute the NG100 dataset, which contains 5148 different strings. To evaluate also a dichotomy dataset we used a corpus consisting of only two discussion groups out of the 20Newsgroups with similar topics: alt.atheism and talk.religion.misc. Using the same pre-processing, and removing strings that occur fewer than 10 times, the resulting "lexicon" contained 5765 different strings. We refer to this dataset as 2ng.

We plot the results of our algorithm on these three data sets in two different planes. First, the normalized information $\frac{I(Z;Y)}{I(X;Y)}$ vs. the size of the partition of X (number of clusters), $|Z|$. The greedy procedure directly tries to maximize $I(Z;Y)$ for a given $|Z|$, as can be seen from the strong concavity of these curves (figure 2, right). Indeed the procedure is able to maintain a high percentage of the relevant mutual information of the original data, while reducing the dimensionality of the "features", $|Z|$, by several orders of magnitude.

On the right-hand side of figure 2 we present a comparison between the efficiency of the procedure for the three datasets. The two-class data, consisting of 5765 different strings, is compressed by two orders of magnitude, into 50 clusters, almost without losing any of the mutual information about the newsgroups (the decrease in $I(X;Y)$ is about 0.1%). Compression by three orders of magnitude, into 6 clusters, maintains about 90% of the original mutual information.

Similar results, though less striking, are obtained when Y contains all 20 newsgroups. The NG100 dataset was compressed from 5148 strings to 515 clusters, keeping 86% of the mutual information, and into 50 clusters, keeping about 70% of the information. About the same compression efficiency was obtained for the NG1000 dataset.

The relationship between the soft and hard clustering is demonstrated in the information plane, i.e., the normalized mutual information values $\frac{I(Z;Y)}{I(X;Y)}$ vs. $\frac{I(Z;X)}{H(X)}$. In this plane, the soft procedure is optimal since it is a direct maximization of $I(Z;Y)$ while constraining $I(Z;X)$. While the hard partition is suboptimal in this plane, as confirmed empirically, it provides an excellent starting point for reverse annealing.
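Reverse annealing itself amounts to iterating the self-consistent Eqs. (3) while gradually lowering $\beta$, initialized from the hard map of a given m-partition. The rough sketch below is ours; in particular the smoothing of the initial hard map, the fixed number of iterations, and the fact that the $\beta$ schedule is left to the caller are arbitrary choices made for illustration, not taken from the paper, and no attempt is made at numerical robustness for very large $\beta$.

    import numpy as np

    def ib_iterate(pxy, q_zx, beta, n_iter=200):
        # Self-consistent Eqs. (3). pxy[x, y] is the joint p(x, y); q_zx[x, z] is p(z|x).
        px = pxy.sum(axis=1)
        py_x = pxy / px[:, None]                               # p(y|x)
        for _ in range(n_iter):
            pz = q_zx.T @ px                                   # p(z)   = sum_x p(z|x) p(x)
            py_z = (q_zx * px[:, None]).T @ py_x / pz[:, None] # p(y|z) = sum_x p(y|x) p(x|z)
            py_z = np.clip(py_z, 1e-12, None)                  # numerical guard, not part of the method
            ratio = np.where(py_x[:, None, :] > 0,
                             py_x[:, None, :] / py_z[None, :, :], 1.0)
            dkl = (py_x[:, None, :] * np.log(ratio)).sum(axis=2)  # D_KL[p(y|x) || p(y|z)] for all (x, z)
            q_zx = pz[None, :] * np.exp(-beta * dkl)
            q_zx /= q_zx.sum(axis=1, keepdims=True)            # the normalization Z(beta, x)
        return q_zx

    def reverse_anneal(pxy, hard_assignment, m, betas):
        # "Heat" a hard m-partition by iterating Eqs. (3) while decreasing beta.
        # hard_assignment[x] is the cluster index of x in the hard partition.
        n = pxy.shape[0]
        q_zx = np.full((n, m), 1e-6)                           # small smoothing so no cluster is empty
        q_zx[np.arange(n), hard_assignment] = 1.0
        q_zx /= q_zx.sum(axis=1, keepdims=True)
        solutions = {}
        for beta in sorted(betas, reverse=True):               # e.g. betas = np.logspace(2, -1, 30)
            q_zx = ib_iterate(pxy, q_zx, beta)
            solutions[beta] = q_zx.copy()
        return solutions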
In figure 2 we present the results of the agglomerative procedure for NG1000 in the information plane, together with the reverse annealing for different values of $|Z|$. As predicted by the theory, the annealing curves merge at various critical values of $\beta$ into the globally optimal curve, which corresponds to the "rate distortion function" for the information bottleneck problem. With the reverse annealing ("heating") procedure there is no need to identify the cluster splits as required in the original annealing ("cooling") procedure. As can be seen, the "phase diagram" is much better recovered by this procedure, suggesting a combination of agglomerative clustering and reverse annealing as the ultimate algorithm for this problem.

References

[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In ACM SIGIR 98, 1998.

[2] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. DellaPietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

[4] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing Systems (NIPS'97), 1998.

[5] T. Hofmann, J. Puzicha, and M. Jordan. Learning from dyadic data. In Advances in Neural Information Processing Systems (NIPS'98), 1999.

[6] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.

[7] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998.

[8] F. C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993.

[9] N. Tishby, W. Bialek, and F. C. Pereira. The information bottleneck method: Extracting relevant information from concurrent data. Unpublished manuscript, NEC Research Institute TR, 1998.
", "award": [], "sourceid": 1651, "authors": [{"given_name": "Noam", "family_name": "Slonim", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}