{"title": "Geometric Clustering Using the Information Bottleneck Method", "book": "Advances in Neural Information Processing Systems", "page_first": 1165, "page_last": 1172, "abstract": "", "full_text": "Geometric Clustering using the Information\n\nBottleneck method\n\nSusanne Still\n\nDepartment of Physics\n\nPrinceton Unversity, Princeton, NJ 08544\n\nsusanna@princeton.edu\n\nL\u00b4eon Bottou\n\nNEC Laboratories America\n\nWilliam Bialek\n\nDepartment of Physics\n\nPrinceton Unversity, Princeton, NJ 08544\n\n4 Independence Way, Princeton, NJ 08540\n\nwbialek@princeton.edu\n\nleon@bottou.org\n\nAbstract\n\nWe argue that K\u2013means and deterministic annealing algorithms for geo-\nmetric clustering can be derived from the more general Information Bot-\ntleneck approach. If we cluster the identities of data points to preserve\ninformation about their location, the set of optimal solutions is massively\ndegenerate. But if we treat the equations that de\ufb01ne the optimal solution\nas an iterative algorithm, then a set of \u201csmooth\u201d initial conditions selects\nsolutions with the desired geometrical properties. In addition to concep-\ntual uni\ufb01cation, we argue that this approach can be more ef\ufb01cient and\nrobust than classic algorithms.\n\n1\n\nIntroduction\n\nClustering is one of the most widespread methods of data analysis and embodies strong\nintuitions about the world: Many different acoustic waveforms stand for the same word,\nmany different images correspond to the same object, etc.. At a colloquial level, clustering\ngroups data points so that points within a cluster are more similar to one another than\nto points in different clusters. To achieve this, one has to assign data points to clusters\nand determine how many clusters to use. 
(Dis)similarity among data points might, in the simplest example, be measured with the Euclidean norm, and then we could ask for a clustering of the points¹ {x_i}, i = 1, 2, ..., N, such that the mean square distance among points within the clusters is minimized,\n\n(1/N_c) Σ_{c=1}^{N_c} (1/n_c) Σ_{i,j ∈ c} |x_i − x_j|²,   (1)\n\nwhere there are N_c clusters and n_c points are assigned to cluster c. Widely used iterative reallocation algorithms such as K-means [5, 8] provide an approximate solution to the problem of minimizing this quantity. Several alternative cost functions have been proposed (see e.g. [5]), and some use analogies with physical systems [3, 7]. However, this approach does not give a principled answer to how many clusters should be used. One often introduces and optimizes another criterion to find the optimal number of clusters, leading to a variety of \u201cstopping rules\u201d for the clustering process [5]. Alternatively, cross-validation methods can be used [11] or, if the underlying distribution is assumed to have a certain shape (mixture models), the number of clusters can be found, e.g. by using the BIC [4].\n\n¹Notation: all boldfaced variables in this paper denote vectors.\n\nA different view of clustering is provided by information theory. Clustering is viewed as lossy data compression: the identity of individual points (~ log2 N bits) is replaced by the identity of the cluster to which they are assigned (~ log2 N_c bits ≪ log2 N bits). Each cluster is associated with a representative point x_c, and what we lose in the compression are the deviations of the individual x_{i∈c} from the representative x_c. One way to formalize this trade-off between data compression and error is rate-distortion theory [10], which again requires us to specify a function d(x_i, x_c) that measures the magnitude of our error in replacing x_i by x_c. 
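The reallocation idea behind K-means can be made concrete in a few lines. This is a minimal sketch, not code from the paper: the function name and the deterministic farthest-first initialization are our own choices, and in practice a library implementation would normally be used.

```python
import numpy as np

def kmeans(x, n_clusters, n_iter=50):
    """Minimal Lloyd-style K-means reducing within-cluster squared distance."""
    x = np.asarray(x, dtype=float)
    # deterministic farthest-first initialization (our choice, for
    # reproducibility): first center is x[0], each further center is
    # the point farthest from all centers chosen so far
    centers = [x[0]]
    for _ in range(n_clusters - 1):
        d2 = np.min([np.square(x - c).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # hard assignment: each point goes to its nearest center
        d2 = np.square(x[:, None, :] - centers[None, :, :]).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # update: each center becomes the mean of its assigned points
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels, centers
```

Each of the two alternating steps can only decrease the summed squared distance of points to their assigned centers, so the iteration settles into a fixed point, though possibly only a local optimum.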
The trade-off between the coding cost and the distortion defines a one-parameter family of optimization problems, and this parameter can be identified with temperature through an analogy with statistical mechanics [9]. As we lower the temperature there are phase transitions to solutions with more and more distinct clusters, and if we fix the number of clusters and vary the temperature we find a smooth variation from \u201csoft\u201d (probabilistic) to \u201chard\u201d (deterministic) clustering. For distortion functions d(x, x') ∝ (x − x')², a deterministic annealing approach to solving the variational problem converges to the K-means algorithm in the limit of zero temperature [9].\n\nA more general information-theoretic approach to clustering, the Information Bottleneck method [13], explicitly implements the idea that our analysis of the data typically is motivated by our interest in some derived quantity (e.g., words from sounds) and that we should preserve this relevant information rather than trying to guess at what metric in the space of our data will achieve the proper feature selection. We imagine that each point x_i occurs together with a corresponding variable v_i, and that v is really the object of interest.² Rather than trying to select the important features of similarity among different points x_i, we cluster in x space to compress our description of these points while preserving as much information as possible about v, and again this defines a one-parameter family of optimization problems. In this formulation there is no need to define a similarity (or distortion) measure; this measure arises from the optimization principle itself. Furthermore, this framework allows us to find the optimal number of clusters for a finite data set using perturbation theory [12]. 
The Information Bottleneck principle thus allows a full solution of the clustering problem.\n\nThe Information Bottleneck approach is attractive precisely because the generality of information theory frees us from a need to specify in advance what it means for data points to be similar: two points can be clustered together if this merger does not lose too much information about the relevant variable v. More precisely, because mutual information is invariant to any invertible transformation of the variables, approaches which are built entirely from such information-theoretic quantities are independent of any arbitrary assumptions about what it means for two points to be close in the data space. This is especially attractive if we want the same information-theoretic principles to apply both to the analysis of, for example, raw acoustic waveforms and to the sequences of words for which these sounds might stand [2]. On the other hand, it is not clear how to incorporate a geometric intuition into the Information Bottleneck approach.\n\nA natural and purely information-theoretic formulation of geometric clustering might ask that we cluster the points, compressing the data index i ∈ [1, N] into a smaller set of cluster indices c ∈ [1, N_c], so that we preserve as much information as possible about the locations of the points, i.e. location x becomes the relevant variable. Because mutual information is a geometric invariant, however, such a problem has an infinitely degenerate set of solutions. We emphasize that this degeneracy is a matter of principle, and not a failing of any approximate algorithm for solving the optimization problem. What we propose here is to lift this degeneracy by choosing the initial conditions for an iterative algorithm which solves the Information Bottleneck equations.\n\n²v does not have to live in the same space as the data x_i.\n\n
In effect our choice of initial conditions expresses a notion of smoothness or geometry in the space of the {x_i}, and once this is done the dynamics of the iterative algorithm lead to a finite set of fixed points. For a broad range of temperatures in the Information Bottleneck problem the solutions we find in this way are precisely those which would be found by a K-means algorithm, while at a critical temperature we recover the deterministic annealing approach to rate-distortion theory. In addition to the conceptual attraction of connecting these very different approaches to clustering in a single information-theoretic framework, we argue that our approach may have some advantages of robustness.\n\n2 Derivation of K-means from the Information Bottleneck method\n\nWe use the Information Bottleneck method to solve the geometric clustering problem and compress the data indices i into cluster indices c in a lossy way, keeping as much information about the location x in the compression as possible. The variational principle is then\n\nmax_{p(c|i)} [I(x; c) − λ I(c; i)],   (2)\n\nwhere λ is a Lagrange parameter which regulates the trade-off between compression and preservation of relevant information. Following [13], we assume that p(x|i, c) = p(x|i), i.e. the distribution of locations for a datum, if the index of the datum is known, does not depend explicitly on how we cluster. Then p(x|c) is given by the Markov condition\n\np(x|c) = (1/p(c)) Σ_i p(x|i) p(c|i) p(i).   (3)\n\nFor simplicity, let us discretize the space that the data live in, assume that it is a finite domain, and estimate the probability distribution p(x) by a normalized histogram. Then the data we observe determine\n\np(x|i) = δ_{x,x_i},   (4)\n\nwhere δ_{x,x_i} is the Kronecker delta, which is 1 if x = x_i and zero otherwise. 
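For intuition, the functional in eq. (2) can be evaluated directly for any soft assignment p(c|i). The sketch below is ours (hypothetical function name); it assumes p(i) = 1/N, p(x|i) = δ_{x,x_i}, and distinct data points, in which case x and i are in bijection, I(x; c) = I(c; i), and the objective collapses to (1 − λ) I(c; i). This makes the degeneracy noted above concrete: the actual locations x_i never enter.

```python
import numpy as np

def bottleneck_objective(p_c_given_i, lam):
    """Evaluate I(x;c) - lam*I(c;i) of eq. (2) for a soft assignment p(c|i)
    (rows index data points i, columns clusters c), assuming p(i) = 1/N and
    p(x|i) = delta_{x,x_i} with distinct points, so I(x;c) = I(c;i)."""
    p_ci = np.asarray(p_c_given_i, dtype=float)
    N = p_ci.shape[0]
    joint = p_ci / N                       # joint p(c,i) = p(c|i) p(i)
    p_c = joint.sum(axis=0)                # marginal p(c)
    p_i = np.full(N, 1.0 / N)              # marginal p(i)
    nz = joint > 0                         # skip zero entries (0 log 0 = 0)
    ratio = joint[nz] / np.outer(p_i, p_c)[nz]
    info = float((joint[nz] * np.log2(ratio)).sum())   # I(c;i) in bits
    return (1.0 - lam) * info              # equals I(x;c) - lam*I(c;i) here
```

Any relabeling or geometry-ignoring permutation of a hard assignment gives the same value, which is the degeneracy the initial conditions must lift.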
The probability of indices is, of course, p(i) = 1/N.\n\nThe optimal assignment rule follows from the variational principle (2) and is given by\n\np(c|i) = [p(c)/Z(i, λ)] exp[(1/λ) Σ_x p(x|i) log2 p(x|c)],   (5)\n\nwhere Z(i, λ) ensures normalization. This equation has to be solved self-consistently together with eq. (3) and p(c) = Σ_i p(c|i)/N. These are the Information Bottleneck equations, and they can be solved iteratively [13]. Denoting by p_n the probability distribution after the n-th iteration, the iterative algorithm is given by\n\np_n(c|i) = [p_{n−1}(c)/Z_n(i, λ)] exp[(1/λ) Σ_x p(x|i) log2 p_{n−1}(x|c)],   (6)\n\np_n(x|c) = [1/(N p_{n−1}(c))] Σ_i p(x|i) p_n(c|i),   (7)\n\np_n(c) = (1/N) Σ_i p_n(c|i).   (8)\n\nLet d(x, x') be a distance measure on the data space. We choose N_c cluster centers x_c^{(0)} at random and initialize\n\np_0(x|c) = [1/Z_0(c, λ)] exp[−(1/s) d(x, x_c^{(0)})],   (9)\n\nwhere Z_0(c, λ) is a normalization constant and s > 0 is some arbitrary length scale; the reason for introducing s will become apparent in the following treatment. After each iteration, we determine the cluster centers x_c^{(n)}, n ≥ 1, according to (compare [9])\n\n0 = Σ_x p_n(x|c) ∂d(x, x_c^{(n)})/∂x_c^{(n)},   (10)\n\nwhich for the squared distance reduces to\n\nx_c^{(n)} = Σ_x x p_n(x|c).   (11)\n\nWe furthermore initialize p_0(c) = 1/N_c, where N_c is the number of clusters. Now define the index c*_i such that it denotes the cluster with cluster center closest to the datum x_i (in the n-th iteration):\n\nc*_i := argmin_c d(x_i, x_c^{(n)}).   (12)\n\nProposition: If 0 < λ < 1, and if the cluster indexed by c*_i is non-empty, then for n → ∞\n\np(c|i) = δ_{c,c*_i}.   (13)\n\nProof: From (7) and (4) we know that p_n(x|c) ∝ Σ_i δ_{x,x_i} p_n(c|i)/p_{n−1}(c), and from (6) we have\n\np_n(c|i)/p_{n−1}(c) ∝ exp[(1/λ) Σ_x p(x|i) log2 p_{n−1}(x|c)],   (14)\n\nand hence p_n(x|c) ∝ (p_{n−1}(x|c))^{1/λ}. Substituting (9), we have p_1(x|c) ∝ exp[−(1/(sλ)) d(x, x_c^{(0)})]. The cluster centers x_c^{(n)} are updated in each iteration and therefore we have after n iterations:\n\np_n(x|c) ∝ exp[−(1/(sλ^n)) d(x, x_c^{(n−1)})],   (15)\n\nwhere the proportionality constant has to ensure normalization of the probability measure. Use (14) and (15) to find that\n\np_n(c|i) ∝ p_{n−1}(c) exp[−(1/(sλ^n)) d(x_i, x_c^{(n−1)})],   (16)\n\nand again the proportionality constant has to ensure normalization. We can now write the probability that a data point is assigned to the cluster nearest to it:\n\np_n(c*_i|i) = (1 + [1/p_{n−1}(c*_i)] Σ_{c≠c*_i} p_{n−1}(c) exp[−(1/(sλ^n)) (d(x_i, x_c^{(n−1)}) − d(x_i, x_{c*_i}^{(n−1)}))])^{−1}.   (17)\n\nBy definition d(x_i, x_c^{(n−1)}) − d(x_i, x_{c*_i}^{(n−1)}) > 0 for all c ≠ c*_i, and thus for n → ∞, exp[−(1/(sλ^n)) (d(x_i, x_c^{(n−1)}) − d(x_i, x_{c*_i}^{(n−1)}))] → 0, and for clusters that do not have zero occupancy, i.e. for which p_{n−1}(c*_i) > 0, we have p(c*_i|i) → 1. Finally, because of normalization, p(c ≠ c*_i|i) must be zero. 
∎\n\nFrom eq. (13) it follows with equations (4), (7) and (11) that for n → ∞\n\nx_c = (1/n_c) Σ_i x_i δ_{c,c*_i},   (18)\n\nwhere n_c = Σ_i δ_{c,c*_i}. This means that for the squared distance measure, this algorithm produces the familiar K-means solution: we get a hard clustering assignment (13) where each datum i is assigned to the cluster c*_i with the nearest center. Cluster centers are updated according to eq. (18) as the average of all the points that have been assigned to that cluster. For some problems, the squared distance might be inappropriate, and the update rule for computing the cluster centers depends on the particular distance function (see eq. 10).\n\nExample. We consider the squared Euclidean distance, d(x, x') = |x − x'|²/2. With this distance measure, eq. (15) tells us that the (Gaussian) distribution p(x|c) contracts around the cluster center x_c as the number of iterations increases. The x_c's are, of course, recomputed in every iteration, following eq. (11).\n\nWe create a synthetic data set by drawing 2500 data points i.i.d. from four two-dimensional Gaussian distributions with different means and the same variance. Figure 1 shows the result of numerical iteration of the equations (14) and (16) (ensuring proper normalization) as well as (8) and (11), with λ = 0.5 and s = 0.5. The algorithm converges to a stable solution after n = 14 iterations.\n\nThis algorithm is less sensitive to initial conditions than the regular K-means algorithm. We measure the goodness of the classification by evaluating how much relevant information I(x; c) the solution captures. In the case we are looking at, the relevant information reduces to the entropy H[p(c)] of the distribution p(c) at the solution³. We used 1000 different random initial conditions for the cluster centers and for each, we iterated eqs. (8), (11), (14) and (16) on the data in Fig. 1. 
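The numerical iteration used in this example can be sketched as follows. This is our minimal reading of eqs. (16), (8) and (11) for the squared Euclidean distance, not the authors' code: it uses a deterministic farthest-first initialization in place of the paper's random centers, and works in log space for numerical stability.

```python
import numpy as np

def ib_geometric_clustering(x, n_clusters, lam=0.5, s=0.5, n_iter=14):
    """Iterate eqs. (16), (8) and (11) for d(x, x') = |x - x'|^2 / 2.
    The effective temperature s * lam**n shrinks each iteration, so the
    soft assignments harden toward the K-means fixed point."""
    x = np.asarray(x, dtype=float)
    # deterministic farthest-first initial centers (stand-in for the
    # paper's random initialization, kept here for reproducibility)
    centers = [x[0]]
    for _ in range(n_clusters - 1):
        d2 = np.min([np.square(x - c).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.array(centers)
    p_c = np.full(n_clusters, 1.0 / n_clusters)       # p_0(c) = 1/N_c
    for n in range(1, n_iter + 1):
        d = 0.5 * np.square(x[:, None, :] - centers[None, :, :]).sum(axis=-1)
        # eq. (16): p_n(c|i) proportional to p_{n-1}(c) exp[-d/(s lam^n)],
        # computed in log space to avoid underflow at low temperature
        log_w = np.log(np.maximum(p_c, 1e-300)) - d / (s * lam ** n)
        log_w -= log_w.max(axis=1, keepdims=True)
        p_ci = np.exp(log_w)
        p_ci /= p_ci.sum(axis=1, keepdims=True)
        p_c = p_ci.mean(axis=0)                       # eq. (8)
        # eq. (11): centers are the p(c|i)-weighted means of the data
        w = p_ci.sum(axis=0)
        nonempty = w > 0
        centers[nonempty] = (p_ci.T @ x)[nonempty] / w[nonempty, None]
    return p_ci, centers
```

Because the effective temperature s·λ^n falls geometrically, the assignments p_n(c|i) become essentially hard after a handful of iterations, as the proposition above predicts.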
We found two different values for H[p(c)] at the solution, indicating that there are at least two local maxima in I(x; c). Figure 2 shows the fraction of the initial conditions that converged to the global maximum. This number depends on the parameters s and λ. For d(x, x') = |x − x'|²/2, the initial distribution p_0(x|c) ∝ exp[−|x − x_c^{(0)}|²/2s] is Gaussian with variance s. A larger variance s makes the algorithm less sensitive to the initial location of the cluster centers. Figure 2 shows that, for large values of s, we obtain a solution that corresponds to the global maximum of I(x; c) for 100% of the initial conditions. Here, we fixed λ at reasonably small values to ensure fast convergence (λ ∈ {0.05, 0.1, 0.2}).\n\n³I(x; c) = H[p(c)] + Σ_x p(x) Σ_c p(c|x) log2 p(c|x). Deterministic assignments: p(c|i) = δ_{c,c*_i}. Data points are located at one particular position: p(x|i) = δ_{x,x_i}. We thus have p(c|x) = (1/p(x)) Σ_i p(c|i) p(x|i) p(i) = Σ_i δ_{x,x_i} δ_{c,c*_i} = δ_{c,c*_x}, where c*_x = argmin_c d(x, x_c). Then Σ_c p(c|x) log2 p(c|x) = 0 and hence I(x; c) = H[p(c)].\n\nFigure 1: 2500 data points drawn i.i.d. from four Gaussian distributions with different means and the same variance. Data points assigned to the same cluster are plotted with the same symbol. The dotted traces indicate movements of the cluster centers (black stars) from their initial positions in the lower left corner of the graph to their final positions close to the means of the Gaussian distributions (black circles) after 14 iterations.\n\nFor these λ values, the number of iterations till convergence lies between 10 and 20 (for 0.5 < s < 500). 
As we increase λ there is a (noisy) trend to more iterations. In comparison, we did the same test using regular K-means [8] and obtained a globally optimal solution from only 75.8% of the initial cluster locations.\n\nTo see how this algorithm performs on data in a higher-dimensional space, we draw 2500 points from 4 twenty-dimensional Gaussians with variance 0.3 along each dimension. The typical Euclidean distances between the means are around 7. We tested the robustness to initial center locations in the same way as we did for the two-dimensional data. Despite the high signal-to-noise ratio, the regular K-means algorithm [8], run on this data, finds a globally optimal solution for only 37.8% of the initial center locations, presumably because the data is relatively scarce and therefore the objective function is relatively rough. We found that our algorithm converged to the global optimum for between 78.0% and 81.0% of the initial center locations for large enough values of s (1000 < s < 10000) and λ = 0.1.\n\n3 Discussion\n\nConnection to deterministic annealing. For λ = 1, we obtain the solution\n\np_n(c|i) ∝ exp[−(1/s) d(x_i, x_c^{(n−1)})],   (19)\n\nwhere the proportionality constant ensures normalization. This equation, together with eq. (11), recovers the equations derived from rate-distortion theory in [9] (for the squared distance), only here the length scale s appears in the position of the annealing temperature T in [9]. We call this parameter the annealing temperature, because [9] suggests the following deterministic annealing scheme: start with large T; fix the x_c's and compute the optimal assignment rule according to eq. (19), then fix the assignment rule and compute the x_c's according to eq. (11), and repeat these two steps until convergence. 
Then lower the temperature and repeat the procedure. There is no general rule that tells us how slow the annealing has to be. In contrast, the algorithm we have derived here for λ < 1 suggests starting with a very large initial temperature, given by sλ, by making s very large, and then lowering the temperature rapidly by making λ reasonably small. In contrast to the deterministic annealing scheme, we do not iterate the equations for the optimal assignment rule and cluster centers till convergence before we lower the temperature; instead the temperature is lowered by a factor of λ after each iteration. This produces an algorithm that converges rapidly while finding a globally optimal solution with high probability.\n\nFigure 2: Robustness of the algorithm to initial center positions as a function of the initial variance s. 1000 different random initial positions were used to obtain clustering solutions on the data shown in Fig. 1. Displayed is, as a function of the initial variance s, the percentage of initial center positions that converge to a global maximum of the objective function, for λ = 0.05, 0.1 and 0.2. In comparison, regular K-means [8] converges to the global optimum for only 75.8% of the initial center positions. The parameter λ is kept fixed at reasonably small values (indicated in the plot) to ensure fast convergence (between 10 and 20 iterations).\n\nFor λ = 1, we furthermore find from eq. (15) that p_n(x|c) ∝ exp[−(1/s) d(x, x_c^{(n−1)})], and for d(x, x') = |x − x'|²/2, the clusters are simply Gaussians.\n\nFor λ > 1, we obtain a useless solution for n → ∞ that assigns all the data to one cluster.\n\nOptimal number of clusters. One of the advancements that the approach we have laid out here should bring is that it should now be possible to extend our earlier results on finding the optimal number of clusters [12] to the problem of geometric clustering. We have to leave the details for a future paper, but essentially we would argue that as we observe a finite number of data points, we make an error in estimating the distribution that underlies the generation of these data points. This mis-estimate leads to a systematic error in evaluating the relevant information. We have computed this error using perturbation theory [12]. For deterministic assignments (as we have in the hard K-means solution), we know that a correction of the error introduces a penalty in the objective function for using more clusters, and this allows us to find the optimal number of clusters. Since our result says that the penalty depends on the number of bins that we use to estimate the distribution underlying the data [12], we either have to know the resolution with which to look at our data, or estimate this resolution from the size of the data set, as e.g. in [1, 6]. A combination of these insights should tell us how to determine, for geometrical clustering, the number of clusters that is optimal for a finite data set.\n\n4 Conclusion\n\nWe have shown that it is possible to cast geometrical clustering into the general, information-theoretic framework provided by the Information Bottleneck method. 
More precisely, we cluster the data keeping information about location, and we have shown that the degeneracy of optimal solutions, which arises from the fact that the mutual information is invariant to any invertible transformation of the variables, can be lifted by the correct choice of the initial conditions for the iterative algorithm which solves the Information Bottleneck equations. We have shown that for a large range of values of the Lagrange multiplier λ (which regulates the trade-off between compression and preservation of relevant information), we obtain an algorithm that converges to a hard-clustering K-means solution. We have found some indication that this algorithm might be more robust to initial center locations than regular K-means. Our results also suggest an annealing scheme which might prove to be faster than the deterministic annealing approach to geometrical clustering known from rate-distortion theory [9]. We recover the latter for λ = 1. Our results shed new light on the connection between the relatively novel Information Bottleneck method and earlier approaches to clustering, particularly the well-established K-means algorithm.\n\nAcknowledgments\n\nWe thank G. Atwal and N. Slonim for interesting discussions. S. Still acknowledges support from the German Research Foundation (DFG), grant no. Sti197.\n\nReferences\n\n[1] W. Bialek, C. G. Callan and S. P. Strong, Phys. Rev. Lett. 77 (1996) 4693-4697, http://arxiv.org/abs/cond-mat/9607180\n\n[2] W. Bialek, in Physics of Bio-Molecules and Cells; École d'été de physique théorique Les Houches Session LXXV, Eds.: H. Flyvbjerg, F. Jülicher, P. Ormos and F. David (2001) Springer-Verlag, pp. 485-577, http://arxiv.org/abs/physics/0205030\n\n[3] M. Blatt, S. Wiseman and E. Domany, Phys. Rev. Lett. 76 (1996) 3251-3254, http://arxiv.org/abs/cond-mat/9702072\n\n[4] C. Fraley and A. Raftery, J. Am. Stat. Assoc. 
97 (2002) 611-631.\n\n[5] A. D. Gordon, Classification (1999) Chapman and Hall/CRC Press, London.\n\n[6] P. Hall and E. J. Hannan, Biometrika 75, 4 (1988) 705-714.\n\n[7] D. Horn and A. Gottlieb, Phys. Rev. Lett. 88 (2002) 018702, extended version: http://arxiv.org/abs/physics/0107063\n\n[8] J. MacQueen, in Proc. 5th Berkeley Symp. Math. Statistics and Probability, Eds.: L. M. Le Cam and J. Neyman (1967) University of California Press, pp. 281-297 (Vol. I)\n\n[9] K. Rose, E. Gurewitz and G. C. Fox, Phys. Rev. Lett. 65 (1990) 945; and: K. Rose, Proceedings of the IEEE 86, 11 (1998) pp. 2210-2239.\n\n[10] C. E. Shannon, Bell System Tech. J. 27 (1948) pp. 379-423, 623-656. See also: C. Shannon and W. Weaver, The Mathematical Theory of Communication (1963) University of Illinois Press.\n\n[11] P. Smyth, Statistics and Computing 10, 1 (2000) 63-72.\n\n[12] S. Still and W. Bialek (2003, submitted), available at http://arxiv.org/abs/physics/0303011\n\n[13] N. Tishby, F. Pereira and W. Bialek, in Proc. 37th Annual Allerton Conf., Eds.: B. Hajek and R. S. Sreenivas (1999) University of Illinois, http://arxiv.org/abs/physics/0004057\n", "award": [], "sourceid": 2361, "authors": [{"given_name": "Susanne", "family_name": "Still", "institution": null}, {"given_name": "William", "family_name": "Bialek", "institution": null}, {"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}]}