{"title": "Maximum Likelihood and the Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 358, "abstract": null, "full_text": "Maximum Likelihood and the Information\n\nBottleneck\n\nNoam Slonim Yair Weiss\n\nSchool of Computer Science & Engineering,\nHebrew University, Jerusalem 91904, Israel\n\n noamm,yweiss\n\n@cs.huji.ac.il\n\nAbstract\n\n\u0002\u0004\u0003\u0006\u0005\b\u0007\n\t\f\u000b\nthat de\ufb01nes partitions over the values of\n\nThe information bottleneck (IB) method is an information-theoretic formulation\n, this method constructs\nfor clustering problems. Given a joint distribution\na new variable\nthat are informative\n. Maximum likelihood (ML) of mixture models is a standard statistical\nabout\napproach to clustering problems. In this paper, we ask: how are the two methods\nrelated ? We de\ufb01ne a simple mapping between the IB problem and the ML prob-\nlem for the multinomial mixture model. We show that under this mapping the\nproblems are strongly related. In fact, for uniform input distribution over\nor\nfor large sample size, the problems are mathematically equivalent. Speci\ufb01cally,\nin these cases, every \ufb01xed point of the IB-functional de\ufb01nes a \ufb01xed point of the\n(log) likelihood and vice versa. Moreover, the values of the functionals at the\n\ufb01xed points are equal under simple transformations. As a result, in these cases,\nevery algorithm that solves one of the problems, induces a solution for the other.\n\n1 Introduction\n\nUnsupervised clustering is a central paradigm in data analysis. Given a set of objects\n,\none would like to \ufb01nd a partition\nwhich optimizes some score function. Tishby\net al. [1] proposed a principled information-theoretic approach to this problem. In this\n,\napproach, given the joint distribution\n(see [2] for a detailed discussion).\nwhich preserves as much information as possible about\n\n, one looks for a compact representation of\n\n\u0011\u0013\u0012\u0014\u0010\u0016\u0015\n\u0017\u0018\u0012\u001a\u0019\u001c\u001b\u001e\u001d\u001f\u0015\n\nand\n\n!\b\u0012\u001a\u0010#\"\u001e $\u0015\n23-\f45\u0017\u0018\u0012\u001a\u0019%\u0015\u0006\u00176\u0012\u0014\u001d87\n\n, between the random variables\n2BA\n+DC\n\u0019%\u0015:9<;.=?>3@\n2FC\n>E@\n\nis given by [3]\nThe mutual information,\n. In [1] it is argued that both the compactness\n!%\u0012\u0014\u0010#\"& \u0013\u0015('*),+.-./10\nof the representation and the preserved relevant information are naturally measured by mu-\ntual information, hence the above principle can be formulated as a trade-off between these\nquantities. Speci\ufb01cally, Tishby et al. [1] suggested to introduce a compressed representa-\ntion\n. The compactness of the representation is then determined\n, is measured by the fraction of informa-\nby\n. The IB problem can be stated as \ufb01nding a\ntion they capture about\n(stochastic) mapping\nis min-\nKL'M!%\u0012\u0014\u0011?\"\u001e\u0010\u0016\u0015ONQP\u001c!%\u0012\u0014\u0011?\"& \u0013\u0015\nG\u001f\u0012\u0014HF7\nimized, where\nis a positive Lagrange multiplier that determines the trade-off between\ncompression and precision. It was shown in [1] that this problem has an exact optimal\n(formal) solution without any assumption about the origin of the joint distribution\n.\n\u0017\u0018\u0012\u0014\u0019\u0018\u001bI\u001dR\u0015\nThe standard statistical approach to clustering is mixture modeling. 
We assume the measurements for each x come from one of |T| possible statistical sources, each with its own parameters θ_t (e.g., a mean and covariance in Gaussian mixtures). Clustering corresponds to first finding the maximum likelihood estimates of θ_t and then using these parameters to calculate the posterior probability that the measurements at x were generated by each source. These posterior probabilities define a "soft" clustering of the X values.

While both approaches try to solve the same problem, the viewpoints are quite different. In the information-theoretic approach no assumption is made regarding how the data was generated, but we assume that the joint distribution p(x,y) is known exactly. In the maximum-likelihood approach we assume a specific generative model for the data and assume we have samples n(x,y), not the true probability.

In spite of these conceptual differences we show that under a proper choice of the generative model, these two problems are strongly related. Specifically, we use the multinomial mixture model (a.k.a. the one-sided [4] or the asymmetric clustering model [5]), and provide a simple "mapping" between the concepts of one problem and those of the other. Using this mapping we show that in general, searching for a solution of one problem induces a search in the solution space of the other. Furthermore, for uniform input distribution p(x) or for large sample sizes, we show that the problems are mathematically equivalent. Hence, in these cases, any algorithm which solves one problem induces a solution for the other.

2 Short review of the IB method

In the IB framework, one is given as input a joint distribution p(x,y). Given this distribution, a compressed representation T of X is introduced through the stochastic mapping q(t|x). The goal is to find q(t|x) such that the IB-functional, L_{IB} = I(T;X) - β I(T;Y), is minimized for a given value of β.

The joint distribution over X, Y and T is defined through the IB Markovian independence relation, T <-> X <-> Y; that is, q(t|x,y) = q(t|x). Specifically, every choice of q(t|x) defines a specific joint probability q(x,y,t) = p(x,y) q(t|x). Therefore, the distributions q(t) and q(y|t) involved in calculating the IB-functional are given by

    q(t) = \sum_x p(x)\, q(t|x), \qquad q(y|t) = \frac{1}{q(t)} \sum_x p(x,y)\, q(t|x) .     (1)

In principle every choice of q(t|x) is possible, but as shown in [1], if q(t) and q(y|t) are given, the choice that minimizes L_{IB} is defined through

    q(t|x) = \frac{q(t)}{Z(x,\beta)} \exp\!\big( -\beta\, D_{KL}[\, p(y|x) \,\|\, q(y|t) \,] \big) ,     (2)

where Z(x,β) is the normalization (partition) function and D_{KL}[p||q] = \sum_y p(y) \log \frac{p(y)}{q(y)} is the Kullback-Leibler divergence. Iterating over this equation and the IB-step defined in Eq. (1) defines an iterative algorithm that is guaranteed to converge to a (local) fixed point of L_{IB} [1].
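To make the iterative scheme concrete, the following is a minimal sketch of these two update steps in Python/NumPy. It is an illustration written for this review, not code from the paper; the function name iterative_ib, the random initialization, and the fixed iteration count are our own choices, and the input p_xy is assumed to be a dense joint-distribution array whose rows all have positive mass.

import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=100, seed=0):
    # Sketch of the iterative IB algorithm: alternate the IB-step (Eq. 1) and Eq. (2).
    rng = np.random.default_rng(seed)
    p_xy = p_xy / p_xy.sum()                      # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1)                        # marginal p(x), assumed positive
    p_y_given_x = p_xy / p_x[:, None]             # rows hold p(y | x)
    q_t_given_x = rng.dirichlet(np.ones(n_clusters), size=p_xy.shape[0])
    for _ in range(n_iter):
        # IB-step (Eq. 1): cluster prior q(t) and cluster centroids q(y | t)
        q_t = p_x @ q_t_given_x
        q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]
        # Eq. (2): reassign every x with weight exp(-beta * KL[p(y|x) || q(y|t)])
        log_ratio = np.log(p_y_given_x[:, None, :] + 1e-12) - np.log(q_y_given_t[None, :, :] + 1e-12)
        d_kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        q_t_given_x = q_t[None, :] * np.exp(-beta * d_kl)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x, q_t, q_y_given_t

A fixed number of iterations stands in for a convergence test; in practice one would monitor the IB-functional itself.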
3 Short review of ML for mixture models

In a multinomial mixture model, we assume that Y takes on discrete values and that we sample it from a multinomial distribution θ(y|t(x)), where t(x) denotes x's label. In the one-sided clustering model [4] [5] we further assume that there can be multiple observations corresponding to a single x, but that they are all sampled from the same multinomial distribution. This model can be described through the following generative process:

- For each x, choose a unique label, or topic, t(x) by sampling from π(t).
- For i = 1, ..., N:
  - choose x_i by sampling from λ(x);
  - choose y_i by sampling from θ(y|t(x_i)), and increase n(x_i, y_i) by one.

Here n(x,y) is a count matrix recording how many times each pair (x,y) was observed, with n(x) = Σ_y n(x,y) and N = Σ_{x,y} n(x,y).
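The generative process above is easy to simulate directly, which helps make the role of the three parameter sets concrete. The sketch below is ours (the helper name sample_counts and the argument layout are illustrative assumptions, not the paper's notation): pi is the label prior π(t), theta holds the multinomials θ(y|t) as rows, and lam is the distribution λ(x) over objects.

import numpy as np

def sample_counts(pi, theta, lam, n_samples, seed=0):
    # Simulate the one-sided multinomial mixture: one hidden label per value of x.
    rng = np.random.default_rng(seed)
    n_x, n_clusters = len(lam), len(pi)
    n_y = theta.shape[1]
    labels = rng.choice(n_clusters, size=n_x, p=pi)     # t(x) ~ pi(t), one label per x
    counts = np.zeros((n_x, n_y), dtype=int)            # the count matrix n(x, y)
    for _ in range(n_samples):
        x = rng.choice(n_x, p=lam)                      # x_i ~ lambda(x)
        y = rng.choice(n_y, p=theta[labels[x]])         # y_i ~ theta(y | t(x_i))
        counts[x, y] += 1
    return counts, labels

Each object x receives exactly one hidden label, while its n(x) observations are spread over the Y values according to that label's multinomial.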
Let t = (t(x))_{x in X} denote the random vector that defines the (typically hidden) labels for all x. The complete likelihood is given by

    L_c(\pi, \theta, \lambda) = \prod_x \Big[\, \pi(t(x))\; \lambda(x)^{n(x)} \prod_y \theta(y|t(x))^{\,n(x,y)} \Big] ,

and the (true) likelihood is defined through summing over all the possible choices of the hidden labels,

    L(\pi, \theta, \lambda) = \sum_{\mathbf{t}} \prod_x \Big[\, \pi(t(x))\; \lambda(x)^{n(x)} \prod_y \theta(y|t(x))^{\,n(x,y)} \Big] .

Given n(x,y), the goal of ML estimation is to find an assignment for the parameters such that the likelihood is (at least locally) maximized. Since it is easy to show that the ML estimate for λ is just the empirical counts n(x)/N, we further focus only on estimating π and θ.

A standard algorithm for this purpose is the EM algorithm [6]. Informally, in the E-step we replace the missing value of t(x) by its distribution, which we denote by q(t|x); in the M-step we use that distribution to reestimate π and θ. Using standard derivations it is easy to verify that in our context the E-step is given by

    q(t|x) = \frac{1}{Z(x)}\, \pi(t) \prod_y \theta(y|t)^{\,n(x,y)} ,

where Z(x) is a normalization factor, and the M-step is simply given by

    \pi(t) = \frac{1}{|X|} \sum_x q(t|x) , \qquad \theta(y|t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_{x,y'} n(x,y')\, q(t|x)} .

Iterating over these EM steps is guaranteed to converge to a local fixed point of the likelihood. Moreover, every fixed point of the likelihood defines a fixed point of this algorithm.

An alternative derivation [7] is to define the free energy functional

    F(q, \pi, \theta) = -\sum_{x,t} q(t|x) \Big[ \log \pi(t) + \sum_y n(x,y) \log \theta(y|t) \Big] + \sum_{x,t} q(t|x) \log q(t|x) .

The E-step then involves minimizing F with respect to q(t|x), while the M-step minimizes it with respect to π and θ. Since this functional is bounded (under mild conditions), the EM algorithm will converge to a local fixed point of F which corresponds to a fixed point of the likelihood. At these fixed points, F becomes identical to the negative log-likelihood, F = -log L, up to an additive constant that does not depend on π and θ.
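For completeness, here is a matching EM sketch for this model (again an illustrative reconstruction rather than the authors' code; the log-domain E-step and the small constants are only for numerical stability). counts is the matrix n(x,y), and the returned q is the posterior q(t|x).

import numpy as np

def em_multinomial_mixture(counts, n_clusters, n_iter=100, seed=0):
    # EM for the one-sided multinomial mixture (the E-step / M-step of Section 3).
    rng = np.random.default_rng(seed)
    n_x, n_y = counts.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)                 # prior pi(t)
    theta = rng.dirichlet(np.ones(n_y), size=n_clusters)       # rows hold theta(y | t)
    for _ in range(n_iter):
        # E-step: q(t|x) proportional to pi(t) * prod_y theta(y|t)^n(x,y)
        log_q = np.log(pi)[None, :] + counts @ np.log(theta.T + 1e-12)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi(t) and theta(y|t)
        pi = q.sum(axis=0) / n_x
        theta = q.T @ counts
        theta /= theta.sum(axis=1, keepdims=True)
    return q, pi, theta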
4 The ML <=> IB mapping

As already mentioned, the IB problem and the ML problem stem from different motivations and involve different "settings". Hence, it is not entirely clear what the purpose of "mapping" between these problems is. Here, we define this mapping to achieve two goals. The first is theoretically motivated: using the mapping we show some mathematical equivalence between both problems. The second is practically motivated: we show that algorithms designed for one problem are (in some cases) suitable for solving the other.

A natural mapping would be to identify each distribution with its corresponding one. However, this direct mapping is problematic. Assume that we are mapping from ML to IB. If we directly map q(t|x), π(t) and θ(y|t) to q(t|x), q(t) and q(y|t), respectively, obviously there is no guarantee that the IB Markovian independence relation will hold once we complete the mapping. Specifically, using this relation to extract q(t) through Eq. (1) will in general result in a different prior over T. However, we notice that once we have defined p(x,y) and q(t|x), the other distributions can be extracted by performing the IB-step defined in Eq. (1). Moreover, as already shown in [1], performing this step can only improve (decrease) the corresponding IB-functional. A similar phenomenon is present once we map from IB to ML. Although in principle there are no "consistency" problems by mapping directly, we know that once we have defined n(x,y) and q(t|x), we can extract π and θ by a simple M-step. This step, by definition, will only improve the likelihood, which is our goal in this setting.
The only remaining issue is to define a corresponding component in the ML setting for the trade-off parameter β. As we will show in the next section, the natural choice for this purpose is the sample size, N.

Therefore, to summarize, we define the ML => IB mapping by

    p(x,y) \leftarrow \frac{n(x,y)}{N}, \qquad q(t|x) \leftarrow q(t|x), \qquad \beta \leftarrow c\,N ,

and the IB => ML mapping by

    n(x,y) \leftarrow N\, p(x,y), \qquad q(t|x) \leftarrow q(t|x), \qquad N \leftarrow \frac{\beta}{c} ,

where c is a positive (scaling) constant, and the mapping is completed by performing an IB-step or an M-step, according to the mapping direction. When the distributions are consistent, this is equivalent to a direct mapping of each distribution to its corresponding one. Notice that under this mapping, every search in the solution space of the IB problem induces a search in the solution space of the ML problem, and vice versa (see Figure 2).

Observation 4.1 When p(x) is uniformly distributed (i.e., the counts n(x) are constant in x), the IB-step defined in Eq. (1) and the M-step are mathematically equivalent.

Observation 4.2 When p(x) is uniformly distributed and β = c N with c = 1/|X|, the EM algorithm is equivalent to the iterative IB algorithm under the ML <=> IB mapping.

This observation is a direct result of the equivalence of the IB-step and the M-step for a uniform prior over X. Additionally, we notice that in this case n(x) = N/|X| is constant and p(y|x) = n(x,y)/n(x), hence the E-step and Eq. (2) are also equivalent: the exponent in the E-step equals n(x) \sum_y p(y|x) \log θ(y|t), the exponent in Eq. (2) equals -β D_{KL}[p(y|x)||q(y|t)], and the two differ only by a term independent of t, provided

    \beta = n(x) = \frac{N}{|X|} \equiv r .

It is important to emphasize, though, that this equivalence holds only for a specific choice of β. While clearly the IB iterative algorithm (and problem) are meaningful for any value of β, there is no such freedom in the ML setting: the exponential factor in EM must be n(x).
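Under the assumptions just stated, the two directions of the mapping are mechanical. A possible rendering is sketched below (function names and the default choice c = 1/|X| are our own illustrative assumptions); ml_to_ib completes the mapping with the IB-step of Eq. (1), and ib_to_ml completes it with an M-step, as described in the text.

import numpy as np

def ml_to_ib(counts, q_t_given_x, c=None):
    # Map an ML solution (counts n(x,y), posterior q(t|x)) into the IB setting.
    N = counts.sum()
    p_xy = counts / N                                   # p(x,y) <- n(x,y)/N
    p_x = p_xy.sum(axis=1)
    beta = (1.0 / counts.shape[0]) * N if c is None else c * N
    # complete the mapping with the IB-step (Eq. 1)
    q_t = p_x @ q_t_given_x
    q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]
    return p_xy, q_t_given_x, q_t, q_y_given_t, beta

def ib_to_ml(p_xy, q_t_given_x, beta, c=None):
    # Map an IB solution into ML parameters via a sample size N and an M-step.
    c = 1.0 / p_xy.shape[0] if c is None else c
    N = beta / c                                        # N <- beta / c
    counts = N * p_xy                                   # (expected) counts n(x,y)
    pi = q_t_given_x.sum(axis=0) / q_t_given_x.shape[0]
    theta = q_t_given_x.T @ counts
    theta /= theta.sum(axis=1, keepdims=True)
    return counts, q_t_given_x, pi, theta, N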
5 Comparing ML and IB

Claim 5.1 When p(x) is uniformly distributed, i.e. n(x) ≡ r for every x, all the fixed points of the likelihood L are mapped to all the fixed points of the IB-functional L_{IB} with β = r, and vice versa. Moreover, at the fixed points the values of the two functionals agree up to a simple linear transformation,

    F \;=\; -\log L \;\cong\; |X| \,\big( L_{IB} + \beta\, H(Y) \big) ,

where H(Y) is the entropy of the marginal p(y) and the equality holds up to additive constants that do not depend on the clustering.

Corollary 5.2 When p(x) is uniformly distributed, every algorithm which finds a fixed point of L induces a fixed point of L_{IB} with β = r, and vice versa. When the algorithm finds several fixed points, the solution that maximizes L is mapped to the one that minimizes L_{IB}.

Proof: We prove the direction from ML to IB; the opposite direction is similar. We assume that we are given observations n(x,y), where n(x) ≡ r is constant, and parameters π, θ that define a fixed point of the likelihood L. As a result, this is also a fixed point of the EM algorithm (where q(t|x) is defined through an E-step). Using Observation 4.2 it follows that this fixed point is mapped to a fixed point of the IB-functional L_{IB} with β = r.

Since at the fixed point F = -log L (up to an additive constant), it is enough to show the relationship between F and L_{IB}. Rewriting F we get

    F = -\sum_{x,y,t} n(x,y)\, q(t|x) \log \theta(y|t) + \sum_{x,t} q(t|x) \log \frac{q(t|x)}{\pi(t)} .

Using the ML => IB mapping and Observation 4.1 we get

    F = -N \sum_{x,y,t} p(x,y)\, q(t|x) \log q(y|t) + \sum_{x,t} q(t|x) \log \frac{q(t|x)}{q(t)} .

Using the IB Markovian independence relation, \sum_x p(x,y) q(t|x) = q(y,t), the first term equals N\,H(Y|T); since p(x) = 1/|X|, the second term equals |X|\,I(T;X). Therefore, with β = r = N/|X|,

    \frac{F}{|X|} = I(T;X) + \beta\, H(Y|T) .

Subtracting the constant β H(Y) from both sides gives

    \frac{F}{|X|} - \beta\, H(Y) = I(T;X) - \beta\, I(T;Y) = L_{IB} ,

as required. We emphasize again that this equivalence is for a specific value of β.

Corollary 5.3 When p(x) is uniformly distributed and β = r, every algorithm decreases F iff it decreases L_{IB}.

This corollary is a direct result of the above proof, which showed the equivalence of the free energy of the model and the IB-functional (up to linear transformations).

The previous claims dealt with the special case of a uniform prior over X. The following claims provide similar results for the general case, when N (or β) is large enough.

Claim 5.4 For N -> infinity (or β -> infinity), all the fixed points of L are mapped to all the fixed points of L_{IB}, and vice versa. Moreover, at the fixed points the values of the functionals are again related through the same linear transformation, up to terms that are negligible for large N.

Corollary 5.5 For N -> infinity (or β -> infinity), every algorithm which finds a fixed point of L induces a fixed point of L_{IB}, and vice versa. When the algorithm finds several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_{IB}.

(A similar result was recently obtained independently in [8] for the special case of "hard" clustering. It is also important to keep in mind that in many clustering applications, a uniform prior over X is "forced" during pre-processing to avoid undesirable bias; in particular this was done in several previous applications of the IB method, see [2] for details.)
Figure 1: Progress of the IB-functional and the (mapped) free energy for different values of β and N, while running iIB and EM. Panels: small β (iIB), small N (EM), large β (iIB), large N (EM).

Proof: Again, we prove only the direction from ML to IB, as the opposite direction is similar. We are given n(x,y) and parameters π, θ that define a fixed point of L, where now n(x), and hence p(x), is not necessarily uniform. Using the E-step we extract q(t|x), ending up with a fixed point of the EM algorithm. We notice that as n(x) grows for every x, q(t|x) becomes deterministic:

    q(t|x) \to \begin{cases} 1 & t = \arg\max_{t'} \big[ \log \pi(t') + \sum_y n(x,y) \log \theta(y|t') \big] \\ 0 & \text{otherwise.} \end{cases}

After completing the ML => IB mapping (including the IB-step), we obtain q(t) and q(y|t), but p(x) = n(x)/N need not be uniform. If we now try to update q(t|x) through Eq. (2), then, since β = c N is large, q(t|x) remains deterministic. Specifically,

    q(t|x) \to \begin{cases} 1 & t = \arg\min_{t'} D_{KL}[\, p(y|x) \,\|\, q(y|t') \,] \\ 0 & \text{otherwise,} \end{cases}

which is equal to its previous value, since for large n(x) the dominant term in the E-step exponent is n(x) \sum_y p(y|x) \log \theta(y|t), whose maximizer coincides with the minimizer of the KL divergence. Therefore, we are at a fixed point of the IB iterative algorithm, and hence at a fixed point of the IB-functional L_{IB}, as required.
To relate the values of the functionals, we notice again that at the fixed point F = -log L up to additive constants. As in the proof of Claim 5.1, rewriting F under the mapping gives

    F = -N \sum_{x,y,t} p(x,y)\, q(t|x) \log q(y|t) + \sum_{x,t} q(t|x) \log \frac{q(t|x)}{\pi(t)} = N\, H(Y|T) + O(|X|) ,

where the second term is bounded because q(t|x) is deterministic. Using β = c N and similar algebra as above, we find that c F and L_{IB} + β H(Y) coincide up to terms that become negligible for large N (or β), as required.

Corollary 5.6 When N (or β) is large enough, every algorithm decreases F iff it decreases L_{IB}.

How large must N (or β) be? We address this question through numeric simulations. Yet, roughly speaking, we notice that the value of N for which the above claims (approximately) hold is related to the "amount of uniformity" in p(x). Specifically, a crucial step in the above proof assumed that each n(x) is large enough such that q(t|x) becomes deterministic. Clearly, when p(x) is less uniform, achieving this situation requires larger N values.

6 Simulations

We performed several different simulations using different IB and ML algorithms. Due to the lack of space, only one example is reported below.
Figure 2: In general, ML (for mixture models) and IB operate in different solution spaces. Nonetheless, a sequence of probabilities that is obtained through some optimization routine (e.g., EM) in the "ML space" can be mapped to a sequence of probabilities in the "IB space", and vice versa. The main result of this paper is that under some conditions these two sequences are completely equivalent.

In this example we used a subset of the 20-Newsgroups corpus [9], consisting of documents randomly chosen from several different discussion groups. Denoting the documents by X and the words by Y, after pre-processing [10] we obtain the count matrix n(x,y). Since our main goal was to check the differences between IB and ML for different values of N (or β), we further produced a second dataset, in which we randomly kept only a fraction of the word occurrences of every document x, ending up with a much smaller sample size.

For both datasets we clustered the documents using both EM and the iterative IB (iIB) algorithm, where we took β = c N with c = 1/|X|. For each algorithm we used the ML <=> IB mapping to calculate both functionals during the process (e.g., for iIB, after each iteration we mapped from IB to ML, including the M-step, and calculated F). We repeated this procedure for several different initializations for each dataset.

In these runs we found that usually both algorithms improved both functionals monotonically. Comparing the functionals during the process, we see that for the smaller sample size the differences are indeed more evident (Figure 1). Comparing the final values of the functionals after enough iterations to reach convergence, we see that in some of the runs iIB converged to a smaller value of F than EM, and in some runs EM converged to a smaller value of L_{IB} than iIB. Thus, occasionally, iIB finds a better ML solution or EM finds a better IB solution. This phenomenon was much more common for the large sample size case.
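The bookkeeping needed for such an experiment is small once the mapping is in place. The two helpers below are our own illustrative sketch, not the code used for the reported runs; they evaluate the IB-functional and the free energy so that either quantity can be tracked while running iIB or EM, with the small constants added only to guard the logarithms.

import numpy as np

def ib_functional(p_xy, q_t_given_x, beta):
    # L_IB = I(T;X) - beta * I(T;Y) under the IB Markov relation q(x,y,t) = p(x,y) q(t|x).
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    q_t = p_x @ q_t_given_x
    q_xt = p_x[:, None] * q_t_given_x                          # q(x, t)
    i_tx = (q_xt * np.log((q_t_given_x + 1e-300) / (q_t[None, :] + 1e-300))).sum()
    q_yt = p_xy.T @ q_t_given_x                                # q(y, t)
    i_ty = (q_yt * np.log((q_yt + 1e-300) / (p_y[:, None] * q_t[None, :] + 1e-300))).sum()
    return i_tx - beta * i_ty

def free_energy(counts, q, pi, theta):
    # F from Section 3, up to parameter-independent constants.
    energy = -(q * (np.log(pi + 1e-300)[None, :] + counts @ np.log(theta.T + 1e-300))).sum()
    entropy = -(q * np.log(q + 1e-300)).sum()
    return energy - entropy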
7 Discussion

While we have shown that the ML and IB approaches are equivalent under certain conditions, it is important to keep in mind the different assumptions both approaches make regarding the joint distribution over X, Y and T. The mixture model (1) assumes that Y is independent of X given T, and (2) assumes that p(y|x) is one of a small number, |T|, of possible conditional distributions. For this reason, the marginal probability over X and Y induced by the model (i.e., \sum_t q(x,y,t)) is usually different from the empirical distribution n(x,y)/N. Indeed, an alternative view of ML estimation is as minimizing D_{KL}[\, n(x,y)/N \,\|\, \sum_t q(x,y,t) \,].

On the other hand, in the IB framework, q(x,y,t) is defined through the IB Markovian independence relation, q(x,y,t) = p(x,y) q(t|x). Therefore, the solution space is the family of distributions for which this relation holds and for which the marginal distribution over X and Y is consistent with the input. Interestingly, it is possible to give an alternative formulation for the IB problem which also involves KL minimization [11]. In this formulation the IB problem is related to minimizing D_{KL}[\, q(x,y,t) \,\|\, \tilde{q}(x,y,t) \,], where \tilde{q} denotes the family of distributions for which the mixture model assumption holds, i.e., T separates X from Y. (The KL with respect to a family is defined as the minimum over all the members of the family. Therefore, here both arguments of the KL are changing during the process, and the distributions involved in the minimization are over all three random variables.)

In this sense, we may say that while solving the IB problem, one tries to minimize the KL with respect to an "ideal" world, in which T separates X from Y. On the other hand, while solving the ML problem, one assumes an "ideal" world, and tries to minimize the KL with respect to the given marginal distribution n(x,y)/N. Our theoretical analysis shows that under the ML <=> IB mapping, these two procedures are in some cases equivalent (see Figure 2).

Once we are able to map between ML and IB, it should be interesting to try and adopt additional concepts from one approach to the other. In the following we provide two such examples. In the IB framework, the quality of a given solution is measured through the fraction of preserved relevant information, I(T;Y)/I(X;Y) [1]. This measure provides a theoretical upper bound, which can be used for purposes of model selection and more. Using the ML => IB mapping, we can now adopt this measure for the ML estimation problem (for large enough N). In the other direction, the exponential factor n(x) in the EM E-step in general depends on x, while its analogous component in the IB framework, β, obviously does not. Nonetheless, in principle it is possible to reformulate the IB problem with an x-dependent trade-off β(x), without changing the form of the optimal solution. We leave this issue for future research.
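As an illustration of the first example, the fraction of preserved relevant information for a mapped ML solution can be computed directly from n(x,y) and the posterior q(t|x). The sketch below is ours (the function names are illustrative, not the paper's); it builds q(t,y) through the IB-step of Eq. (1) and normalizes by I(X;Y).

import numpy as np

def mutual_information(p_ab):
    # I(A;B) for a joint distribution given as a 2-D array.
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    ratio = np.divide(p_ab, p_a * p_b, out=np.ones_like(p_ab), where=p_ab > 0)
    return float((p_ab * np.log(ratio)).sum())

def preserved_information(counts, q_t_given_x):
    # I(T;Y) / I(X;Y) after mapping an ML solution into the IB setting.
    p_xy = counts / counts.sum()
    q_ty = q_t_given_x.T @ p_xy          # q(t, y) = sum_x p(x,y) q(t|x)
    return mutual_information(q_ty) / mutual_information(p_xy)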
We have shown that for the multinomial mixture model, ML and IB are equivalent in some cases. It is worth noting that in principle, by choosing a different generative model, one may find further equivalences. Additionally, the IB method was recently extended to the multivariate case, where a new family of IB-like variational problems was presented and solved [11]. A natural question is to look for further generative models that can be mapped to these multivariate IB problems, and we are working in this direction.

Acknowledgments

Insightful discussions with Nir Friedman, Naftali Tishby and Gal Elidan are greatly appreciated.

References

[1] N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation, 1999.
[2] N. Slonim. The Information Bottleneck: theory and applications. Ph.D. thesis, The Hebrew University, 2002.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[4] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In Proc. of NIPS-11, 1998.
[5] J. Puzicha, T. Hofmann, and J. M. Buhmann. Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters 20(9), 899-909, 1999.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977.
[7] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (editor), Learning in Graphical Models, pp. 355-368, 1998.
[8] L. Hermes, T. Zöller, and J. M. Buhmann. Parametric distributional clustering for image segmentation. In Proc. of European Conference on Computer Vision (ECCV), 2002.
[9] K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, 1995.
[10] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proc. of SIGIR-25, 2002.
[11] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate Information Bottleneck. In Proc. of UAI-17, 2001.
", "award": [], "sourceid": 2214, "authors": [{"given_name": "Noam", "family_name": "Slonim", "institution": null}, {"given_name": "Yair", "family_name": "Weiss", "institution": null}]}