{"title": "Information Bottleneck for Gaussian Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1213, "page_last": 1220, "abstract": "", "full_text": "Information Bottleneck for Gaussian Variables\n\nGal Chechik*, Amir Globerson*, Naftali Tishby, Yair Weiss\n\n{ggal,gamir,tishby,yweiss}@cs.huji.ac.il\n\nSchool of Computer Science and Engineering and\nThe Interdisciplinary Center for Neural Computation\nThe Hebrew University of Jerusalem, 91904, Israel\n\n* Both authors contributed equally\n\nAbstract\n\nThe problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum.\n\n1 Introduction\n\nExtracting relevant aspects of complex data is a fundamental task in machine learning and statistics. 
The problem is often that the data contains many structures, which make it difficult to define which of them are relevant and which are not in an unsupervised manner. For example, speech signals may be characterized by their volume level, pitch, or content; pictures can be ranked by their luminosity level, color saturation or importance with regard to some task.\n\nThis problem was principally addressed by the information bottleneck (IB) approach [1]. Given the joint distribution of a \"source\" variable X and another \"relevance\" variable Y, IB operates to compress X, while preserving information about Y. The variable Y thus implicitly defines what is relevant in X and what is not. Formally, this is cast as the following variational problem\n\nmin_{p(t|x)} L : L ≡ I(X; T) − βI(T; Y)   (1)\n\nwhere T represents the compression of X via the conditional distributions p(t|x), while the information that T maintains on Y is captured by p(y|t). The positive parameter β determines the tradeoff between compression and preserved relevant information, as the Lagrange multiplier for the constrained optimization problem min_{p(t|x)} I(X; T) − β(I(T; Y) − const).\n\nThe information bottleneck approach has been applied so far mainly to categorical variables, with a discrete T that represents (soft) clusters of X. It has proved useful for a range of applications from document clustering to gene expression analysis (see [2] for review and references). However, its general information theoretic formulation is restricted neither in terms of the variables X and Y, nor in the compression variable T. 
It can be naturally extended to nominal and continuous variables, as well as to dimension reduction techniques rather than clustering. This is the goal of the current paper.\n\nThe general treatment of IB for continuous T yields the same set of self-consistent equations obtained already in [1]. But rather than solving them for the distributions p(t|x), p(t) and p(y|t) using the generalized Blahut-Arimoto algorithm as proposed there, one can turn them into two coupled eigenvector problems for the logarithmic functional derivatives δ log p(x|t)/δt and δ log p(y|t)/δt, respectively. Solving these equations, in general, turns out to be a rather difficult challenge. As in many other cases, however, the problem turns out to be analytically tractable when X and Y are jointly multivariate Gaussian variables, as shown in this paper.\n\nThe optimal compression in the Gaussian Information Bottleneck (GIB) is defined in terms of the compression-relevance tradeoff, determined through the parameter β. It turns out to be a noisy linear projection to a subspace whose dimension is determined by the tradeoff parameter β. The subspaces are spanned by the basis vectors obtained in the well known Canonical Correlation Analysis (CCA) [3] method, but the exact nature of the projection is determined in a unique way via the tradeoff parameter β. Specifically, as β increases, additional dimensions are added to the projection variable T, through a series of critical points (structural phase transitions), while at the same time the relative magnitude of each basis vector is rescaled. This process continues until all the relevant information about Y is captured in T. This demonstrates how the IB formalism provides a continuous measure of model complexity in information theoretic terms.\n\nThe idea of maximization of relevant information was also pursued in the Imax framework [4, 5]. 
In that setting, there are two feed-forward networks with inputs X_a, X_b and output neurons Y_a, Y_b. The output neuron Y_a serves to define relevance to the output of the neighboring network Y_b. Formally, the goal is to tune the incoming weights of both output neurons, such that their mutual information I(Y_a; Y_b) is maximized. An important difference between Imax and the IB setting is that in the Imax setting, I(Y_a; Y_b) is invariant to scaling and translation of the Y's, since the compression achieved in the mapping X_a → Y_a is not modeled explicitly. In contrast, the IB framework aims to characterize the dependence of the solution on the explicit compression term I(T; X), which is a scale sensitive measure when the transformation is noisy. This view of a compressed representation T of the inputs X is useful when dealing with neural systems that are stochastic in nature, limited in their response amplitudes, and thus constrained to finite I(T; X).\n\n2 Gaussian Information Bottleneck\n\nWe now formalize the problem of Information Bottleneck for Gaussian variables. Let (X, Y) be two jointly Gaussian variables of dimensions n_x, n_y and denote by Σ_x, Σ_y the covariance matrices of X, Y and by Σ_{xy} their cross-covariance matrix¹.\n\nThe goal of GIB is to compress the variable X via a stochastic transformation into another variable T ∈ R^{n_x}, while preserving information about Y. With Gaussian X and Y, the optimal T is also jointly Gaussian with X and Y. The intuition is that only second order correlations exist in the joint distribution p(X, Y), so that distributions of T with higher order moments do not carry additional information. This can be rigorously shown using an application of the entropy power inequality as in [6], and will be published elsewhere. Note that we do not explicitly limit the dimension of T, since we will show that the effective dimension is determined by the value of β. 
Since every two random variables X, T with a jointly Gaussian distribution can be presented as T = AX + ξ, where ξ ~ N(0, Σ_ξ) is a Gaussian variable independent of X, we formalize the problem as the minimization\n\nmin_{A, Σ_ξ} L ≡ I(X; T) − βI(T; Y)   (2)\n\nover the noisy linear transformations parametrized by the transformation A and the noise covariance Σ_ξ. T is normally distributed, T ~ N(0, Σ_t), with Σ_t = AΣ_xA^T + Σ_ξ.\n\n3 The optimal projection\n\nA main result of this paper is the characterization of the optimal A, Σ_ξ as a function of β.\n\nTheorem 3.1 The optimal projection T = AX + ξ for a given tradeoff parameter β is given by Σ_ξ = I_x and\n\nA = { [0^T; ...; 0^T]                          0 ≤ β ≤ β^c_1\n      [α_1 v_1^T; 0^T; ...; 0^T]               β^c_1 ≤ β ≤ β^c_2\n      [α_1 v_1^T; α_2 v_2^T; 0^T; ...; 0^T]    β^c_2 ≤ β ≤ β^c_3\n      ...                                            (3)\n\nwhere v_1^T, v_2^T, ..., v_{n_x}^T are left eigenvectors of Σ_{x|y} Σ_x^{-1} sorted by their corresponding ascending eigenvalues λ_1, λ_2, ..., λ_{n_x}; β^c_i = 1/(1 − λ_i) are critical β values; α_i are coefficients defined by α_i ≡ sqrt((β(1 − λ_i) − 1)/(λ_i r_i)); r_i ≡ v_i^T Σ_x v_i; 0^T is an n_x dimensional row vector of zeros; and semicolons separate rows in the matrix A.\n\nThis theorem asserts that the optimal projection consists of eigenvectors of Σ_{x|y} Σ_x^{-1}, combined in an interesting manner: For β values that are smaller than the smallest critical point β^c_1, compression is more important than any information preservation and the optimal solution is the degenerate one A ≡ 0. 
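The closed-form solution of Theorem 3.1 is straightforward to compute numerically. The following sketch (the function name, signature, and all helper names are ours, not from the paper) builds A from the covariance matrices for a given β:

```python
import numpy as np

def gib_projection(Sx, Sy, Sxy, beta):
    """Sketch of the Theorem 3.1 projection (hypothetical helper, our naming).

    Returns A such that T = A X + xi, xi ~ N(0, I), is the optimal
    compressed representation at tradeoff parameter beta.
    """
    nx = Sx.shape[0]
    # Conditional covariance via the Schur complement: Sigma_{x|y}.
    Sx_given_y = Sx - Sxy @ np.linalg.solve(Sy, Sxy.T)
    M = Sx_given_y @ np.linalg.inv(Sx)
    # Left eigenvectors of M are right eigenvectors of M^T; the spectrum is
    # real, since M is similar to a symmetric matrix.
    lam, V = np.linalg.eig(M.T)
    lam, V = lam.real, V.real
    order = np.argsort(lam)                   # sort eigenvalues ascending
    lam, V = lam[order], V[:, order]
    A = np.zeros((nx, nx))
    for i in range(nx):
        li, vi = lam[i], V[:, i]
        # Eigenvector i is active only past its critical value
        # beta^c_i = 1 / (1 - lambda_i).
        if 0.0 < li < 1.0 and beta * (1.0 - li) > 1.0:
            ri = vi @ Sx @ vi
            A[i] = np.sqrt((beta * (1.0 - li) - 1.0) / (li * ri)) * vi
    return A
```

On the 2D example of Figure 1 (Σ_{xy} = [0.1, 0.2], Σ_x = I_2, so λ_1 = 0.95 and β^c_1 = 20), this returns the all-zero matrix for β = 15 and a single nonzero row for β = 100, as the theorem predicts.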
As β is increased, it goes through a series of critical points β^c_i, at each of which another eigenvector of Σ_{x|y} Σ_x^{-1} is added to A. Even though the rank of A increases at each of these transition points, it changes smoothly as a function of β, since at the critical point β^c_i the coefficient α_i vanishes. Thus β parameterizes a \"continuous rank\" of the projection.\n\nTo illustrate the form of the solution, we plot the landscape of the target function L together with the solution in a simple problem where X ∈ R² and Y ∈ R. In this case A has a single non-zero row, thus A can be thought of as a row vector of length 2 that projects X to a scalar, A : X → R, T ∈ R.\n\n¹For simplicity we assume that Σ_x, Σ_y are full rank; otherwise X, Y can be reduced to the proper dimensionality.\n\nFigure 1. L as a function of all possible projections A, for A : R² → R, obtained numerically from Eq. 4. Dark-red: low L values; light-yellow: large L values. Σ_{xy} = [0.1 0.2], Σ_x = I_2. A. For β = 15, the optimal solution is the degenerate solution A ≡ 0. B. For β = 100, the eigenvector of Σ_{x|y} Σ_x^{-1}, with a norm according to Theorem 3.1 (superimposed), is optimal.\n\nFigure 1 shows the target function L as a function of the projection A. In this example, λ_1 = 0.95, thus β^c_1 = 20. Therefore, for β = 15 (Figure 1A) the zero solution is optimal, but for β = 100 > β^c_1 (Figure 1B) the corresponding eigenvector is a feasible solution, and the target function manifold contains two mirror minima. 
As β increases from 0 to ∞, these two minima, starting as a single unified minimum at zero, split at β^c_1, and then diverge apart to ∞.\n\nWe now turn to prove Theorem 3.1². We start by rewriting L using the formula for the entropy of a d dimensional Gaussian variable, h(X) = (1/2) log((2πe)^d |Σ_x|), where |·| denotes a determinant. Using the Schur complement formula to calculate the covariance of the conditional variable T|Y we have Σ_{t|y} = Σ_t − Σ_{ty} Σ_y^{-1} Σ_{yt} = A Σ_{x|y} A^T + Σ_ξ, and the target function (up to a factor of 2) can be written as\n\nL(A, Σ_ξ) = (1 − β) log |AΣ_xA^T + Σ_ξ| − log |Σ_ξ| + β log |AΣ_{x|y}A^T + Σ_ξ|   (4)\n\nAlthough L is a function of both the noise Σ_ξ and the projection A, it can be easily shown that for every pair (A, Σ_ξ) there is another projection Ã = √(D^{-1}) V A, where Σ_ξ = V D V^T, such that L(Ã, I) = L(A, Σ_ξ)³. This allows us to simplify the calculations by replacing the noise covariance matrix Σ_ξ with the identity matrix.\n\nTo identify the minimum of L we now differentiate L w.r.t. the projection A using the algebraic identity δ/δA log |ACA^T| = (ACA^T)^{-1} 2AC, which holds for any symmetric matrix C. Equating this derivative to zero and rearranging, we obtain necessary conditions for an internal minimum of L\n\n(β − 1)/β [(AΣ_{x|y}A^T + I_d)(AΣ_xA^T + I_d)^{-1}] A = A [Σ_{x|y} Σ_x^{-1}]   (5)\n\nEquation 5 shows that the multiplication of Σ_{x|y} Σ_x^{-1} by A must reside in the span of the rows of A. This means that A should be spanned by up to d eigenvectors of Σ_{x|y} Σ_x^{-1}. 
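The noise-whitening step above (trading a general Σ_ξ for the identity) can be checked numerically. A minimal sketch with our own helper names and a placeholder conditional covariance; with numpy's eigh convention Σ_ξ = V diag(d) V^T, the transformed projection reads D^{-1/2} V^T A:

```python
import numpy as np

rng = np.random.default_rng(0)

def L_target(A, S_xi, Sx, Sxgy, beta):
    """Eq. (4), up to the factor of 2 (helper names are ours)."""
    ld = lambda M: np.linalg.slogdet(M)[1]   # log-determinant of a PD matrix
    return ((1 - beta) * ld(A @ Sx @ A.T + S_xi)
            - ld(S_xi)
            + beta * ld(A @ Sxgy @ A.T + S_xi))

# Random 3d example: a valid Sigma_x, a stand-in Sigma_{x|y}, full-rank noise.
B = rng.standard_normal((3, 3)); Sx = B @ B.T + np.eye(3)
Sxgy = 0.5 * Sx                              # placeholder conditional covariance
C = rng.standard_normal((3, 3)); S_xi = C @ C.T + np.eye(3)
A = rng.standard_normal((3, 3))

# Whiten the noise: with S_xi = V diag(d) V^T, the projection
# A_tilde = D^{-1/2} V^T A gives the same target with identity noise.
d, V = np.linalg.eigh(S_xi)
A_tilde = np.diag(d ** -0.5) @ V.T @ A

val_orig = L_target(A, S_xi, Sx, Sxgy, 10.0)
val_white = L_target(A_tilde, np.eye(3), Sx, Sxgy, 10.0)
print(np.isclose(val_orig, val_white))
```

The equality follows because each determinant term picks up the same factor det(D), which cancels against the −log |Σ_ξ| term.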
We can therefore represent the projection A as a mixture A = W V, where the rows of V are left normalized eigenvectors of Σ_{x|y} Σ_x^{-1} and W is a mixing matrix that weights these eigenvectors. In the remainder of this section we characterize the nature of the mixing matrix W.\n\nLemma 3.2 The optimal mixing matrix W is a diagonal matrix of the form\n\nW = diag( √((β(1 − λ_1) − 1)/(λ_1 r_1)), ..., √((β(1 − λ_k) − 1)/(λ_k r_k)), 0, ..., 0 )   (6)\n\nwhere {v_1^T, ..., v_k^T} and {λ_1, ..., λ_k} are the k ≤ n_x eigenvectors and eigenvalues of Σ_{x|y} Σ_x^{-1} with β^c_1, ..., β^c_k ≤ β.\n\n²Further details of the proofs can be found in a technical report [7].\n³Although this theorem holds only for full rank Σ_ξ, it does not limit the generality of the discussion, since low rank matrices yield infinite values of L and are therefore suboptimal.\n\nProof: We write V Σ_{x|y} Σ_x^{-1} = DV, where D is a diagonal matrix whose elements are the corresponding eigenvalues, and denote by R the diagonal matrix whose ith element is r_i = v_i^T Σ_x v_i. When k = n_x, we substitute A = W V into Equation 5, and use the fact that W is full rank to obtain\n\nW^T W = [β(I − D) − I](DR)^{-1}   (7)\n\nWhile this does not uniquely characterize W, we note that if we substitute A into the target function in Equation 4, and use properties of the eigenvalues, we have\n\nL = (1 − β) ∑_{i=1}^{n} log(||w_i||² r_i + 1) + β ∑_{i=1}^{n} log(||w_i||² r_i λ_i + 1)   (8)\n\nwhere ||w_i||² is the ith element of the diagonal of W^T W, i.e. the squared norm of the ith column of W. This shows that L depends only on the norms of the columns of W, and all matrices W that satisfy (7) yield the same target function. 
We can therefore choose to take W to be the diagonal matrix which is the square root of (7)\n\nW = √([β(I − D) − I](DR)^{-1})   (9)\n\nTo prove the case of k < n_x, consider a matrix W that is a k×k matrix padded with zeros, so that it mixes only the first k eigenvectors. In this case, a calculation similar to that above gives a solution A which has n_x − k zero rows. To complete the proof, it remains to be shown that the above solutions capture all extremum points. This point is detailed in [7] due to space considerations.\n\nWe have thus characterized the set of all minima of L, and turn to identify which of them achieve the global minimum.\n\nCorollary 3.3 The global minimum of L is obtained with all λ_i satisfying β > β^c_i.\n\nProof: Substituting the optimal W of Equation 9 into Equation 8 yields L = ∑_{i=1}^{k} [(β − 1) log λ_i + log(1 − λ_i)] + f(β). Since 0 ≤ λ_i ≤ 1 and β ≥ 1/(1 − λ_i), L is minimized by taking all the eigenvalues that satisfy β > 1/(1 − λ_i).\n\nTaken together, these observations prove that for a given value of β, the optimal projection is obtained by taking all the eigenvectors whose eigenvalues λ_i satisfy β ≥ 1/(1 − λ_i), and setting their norm according to A = W V. This completes the proof of Theorem 3.1.\n\n4 The GIB Information Curve\n\nThe information bottleneck is targeted at characterizing the tradeoff between information preservation (accuracy of relevant predictions) and compression. 
Interestingly, much of the structure of the problem is reflected in the information curve, namely the maximal value of relevant preserved information (accuracy), I(T; Y), as a function of the complexity of the representation of X, measured by I(T; X). This curve is related to the rate-distortion function in lossy source coding, as well as to the achievability limit in channel coding with side-information [8]. It is shown to be concave in general [9], but its precise functional form depends on the joint distribution and can reveal properties of the hidden structure of the variables.\n\nFigure 2. GIB information curve obtained with four eigenvalues λ_i = 0.1, 0.5, 0.7, 0.9. The information at the critical points is designated by circles. For comparison, information curves calculated with a smaller number of eigenvectors are also depicted (all curves calculated for β < 1000). The slope of the curve at each point is the corresponding β^{-1}. The tangent at zero, with slope β^{-1} = 1 − λ_1, is superimposed on the information curve.\n\nAnalytic forms for the information curve are known only for very special cases, such as Bernoulli variables and some intriguing self-similar distributions. 
The analytic characterization of the Gaussian IB problem allows us to obtain a closed form expression for the information curve in terms of the relevant eigenvalues.\n\nTo this end, we substitute the optimal projection A(β) into I(T; X) and I(T; Y) and isolate I_β(T; Y) as a function of I_β(T; X)\n\nI_β(T; Y) = I_β(T; X) − (n_I/2) log( ∏_{i=1}^{n_I} (1 − λ_i)^{1/n_I} + e^{2I_β(T;X)/n_I} ∏_{i=1}^{n_I} λ_i^{1/n_I} )   (10)\n\nwhere the products are over the first n_I eigenvalues, since these obey the critical β condition, with c_{n_I} ≤ I_β(T; X) ≤ c_{n_I+1} and c_{n_I} = (1/2) ∑_{i=1}^{n_I−1} log( (λ_{n_I} (1 − λ_i)) / (λ_i (1 − λ_{n_I})) ).\n\nThe GIB curve, illustrated in Figure 2, is continuous and smooth, but is built of several segments, since as I(T; X) increases additional eigenvectors are used in the projection. The derivative of the curve is given by β^{-1}, which can easily be shown to be continuous and decreasing, yielding that the GIB information curve is concave everywhere. At each value of I(T; X) the curve is therefore bounded by a tangent with slope β^{-1}(I(T; X)). Generally in IB, the data processing inequality yields an upper bound on the slope at the origin, β^{-1}(0) < 1; in GIB we obtain a tighter bound: β^{-1}(0) < 1 − λ_1. The asymptotic slope of the curve is always zero, as β → ∞, reflecting the law of diminishing returns: adding more bits to the description of X does not provide more accuracy about Y. 
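As a numerical check (our own sketch; helper names are hypothetical), the curve for the four-eigenvalue spectrum of Figure 2 can be traced by computing I(T; X) and I(T; Y) directly from the optimal projection and comparing against the closed form of Eq. (10):

```python
import numpy as np

lam = np.array([0.1, 0.5, 0.7, 0.9])  # eigenvalues of Sigma_{x|y} Sigma_x^{-1}

def curve_point(beta, lam):
    """I(T;X), I(T;Y) at tradeoff beta, from the optimal projection
    (alpha_i^2 r_i + 1 = (beta-1)(1-lambda_i)/lambda_i for active modes,
    and alpha_i^2 r_i lambda_i + 1 = beta (1 - lambda_i))."""
    active = lam[beta * (1.0 - lam) > 1.0]   # modes past their critical beta
    if active.size == 0:
        return 0.0, 0.0
    itx = 0.5 * np.sum(np.log((beta - 1) * (1 - active) / active))
    ity = itx - 0.5 * np.sum(np.log(beta * (1 - active)))
    return itx, ity

def eq10(itx, lam, n_i):
    """Closed form of Eq. (10) on the segment with n_I active eigenvalues."""
    l = lam[:n_i]
    g1 = np.prod((1 - l) ** (1.0 / n_i))     # geometric mean of (1 - lambda)
    gl = np.prod(l ** (1.0 / n_i))           # geometric mean of lambda
    return itx - (n_i / 2.0) * np.log(g1 + np.exp(2 * itx / n_i) * gl)

# beta = 5 sits between the 3rd and 4th critical values (3.33 and 10),
# so exactly three modes are active on this segment.
itx, ity = curve_point(5.0, lam)
print(np.isclose(ity, eq10(itx, lam, 3)))
```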
This interesting relation between the information curve and the spectral properties of the covariance matrices raises intriguing questions for special cases where more can be said about this spectrum, such as for patterns in neural-network learning problems.\n\n5 Relation To Other Works\n\n5.1 Canonical Correlation Analysis and Imax\n\nThe GIB projection derived above uses weighted eigenvectors of the matrix Σ_{x|y} Σ_x^{-1} = I_x − Σ_{xy} Σ_y^{-1} Σ_{yx} Σ_x^{-1}. The same eigenvectors are also used in Canonical Correlation Analysis (CCA) [3], a statistical method that finds linear relations between two variables. CCA aims to find sets of basis vectors for the two variables that maximize the correlation coefficient between the projections of the variables on the basis vectors. The CCA bases are the eigenvectors of the matrices Σ_x^{-1} Σ_{xy} Σ_y^{-1} Σ_{yx} and Σ_y^{-1} Σ_{yx} Σ_x^{-1} Σ_{xy}, and the square roots of their corresponding eigenvalues are termed canonical correlation coefficients. CCA was also shown to be a special case of continuous Imax [4, 5].\n\nAlthough GIB and CCA involve the spectral analysis of the same matrices, they have some inherent differences. First of all, GIB characterizes not only the eigenvectors but also their norm, in a way that depends on the trade-off parameter β. Since CCA depends on the correlation coefficient between the compressed (projected) versions of X and Y, which is a normalized measure of correlation, it is invariant to a rescaling of the projection vectors. In contrast, for any value of β, GIB will choose one particular rescaling, given by equation (3).\n\nWhile CCA is symmetric (in the sense that both X and Y are projected), IB is non-symmetric and only the X variable is compressed. 
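The shared eigenbasis is easy to verify numerically: since Σ_{x|y} Σ_x^{-1} = I_x − Σ_{xy} Σ_y^{-1} Σ_{yx} Σ_x^{-1}, transposing shows that the left eigenvectors used by GIB are exactly the CCA eigenvectors, with λ_GIB = 1 − ρ². A small sketch (the construction and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
# Random jointly Gaussian covariance, split into blocks for X (3d) and Y (2d).
B = rng.standard_normal((5, 5))
S = B @ B.T + 5 * np.eye(5)
Sx, Sy = S[:3, :3], S[3:, 3:]
Sxy = S[:3, 3:]
Syx = Sxy.T

M_gib = (Sx - Sxy @ np.linalg.solve(Sy, Syx)) @ np.linalg.inv(Sx)
M_cca = np.linalg.inv(Sx) @ Sxy @ np.linalg.solve(Sy, Syx)

# Left eigenvectors of M_gib are right eigenvectors of M_gib^T, and
# M_gib^T = I - M_cca, so the two methods share one eigenbasis, with
# lambda_GIB = 1 - rho^2 (rho = canonical correlation coefficient).
print(np.allclose(M_gib.T, np.eye(3) - M_cca))
```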
It is therefore interesting that both GIB and CCA use the same eigenvectors for the projection of X.\n\n5.2 Multiterminal information theory\n\nThe Information Bottleneck formalism was recently shown [9] to be closely related to the problem of source coding with side information [8]. In the latter, two discrete variables X, Y are encoded separately at rates R_x, R_y, and the aim is to use them to perfectly reconstruct Y. The bounds on the achievable rates in this case were found in [8] and can be obtained from the IB information curve.\n\nWhen considering continuous variables, lossless compression at finite rates is no longer possible. Thus, mutual information for continuous variables is no longer interpretable in terms of encoding bits, but rather serves as an optimal measure of information between variables. The IB formalism, although coinciding with coding theorems in the discrete case, is more general in the sense that it reflects the tradeoff between compression and information preservation, and is not concerned with exact reconstruction. Such reconstruction can be considered by introducing distortion measures as in [6], but it is not relevant for the question of finding representations which capture the information between the variables.\n\n6 Discussion\n\nWe applied the information bottleneck method to continuous jointly Gaussian variables X and Y, with a continuous representation of the compressed variable T. We derived an analytic optimal solution, as well as a general algorithm for this problem (GIB), which is based solely on the spectral properties of the covariance matrices in the problem. The solution for GIB, characterized in terms of the tradeoff parameter β between compression and preserved relevant information, consists of eigenvectors of the matrix Σ_{x|y} Σ_x^{-1}, added continuously as weaker compression and more complex models are allowed. 
We provide an analytic characterization of the information curve, which relates the spectrum to the relevant information in an intriguing manner. Besides its clean analytic structure, GIB offers a new way for analyzing empirical multivariate data when only its correlation matrices can be estimated. It thus extends, and provides new information theoretic insight into, the classical Canonical Correlation Analysis method.\n\nThe IB optima are known to obey three self-consistent equations that can be used in an iterative algorithm guaranteed to converge to a local optimum [1]. In GIB, these iterations over the conditional distributions p(t|x), p(t) and p(y|t) can be transformed into iterations over the projection parameter A. In this case, the iterative IB algorithm turns into repeated projections on the matrix Σ_{x|y} Σ_x^{-1}, as used in power methods for eigenvector calculation. The parameter β determines the scaling of the vectors, such that some of the eigenvectors decay to zero, while the others converge to their value defined in Theorem 3.1.\n\nWhen handling real world data, the relevance variable Y often contains multiple structures that are correlated to X, although many of them are actually irrelevant. The information bottleneck with side information (IBSI) [10] alleviates this problem using side information in the form of an irrelevance variable Y− about which information is removed. IBSI thus aims to minimize L = I(X; T) − β(I(T; Y+) − γI(T; Y−)). This functional can be analyzed in the case of Gaussian variables (GIBSI: Gaussian IB with side information), in a similar way to the analysis of GIB presented above. This results in a generalized eigenvalue problem involving the covariance matrices Σ_{x|y+} and Σ_{x|y−}. 
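The power-method view can be illustrated with a generic sketch (ours, not the paper's exact iteration): iterating with I − (Σ_{x|y} Σ_x^{-1})^T converges to the eigenvector with the smallest eigenvalue λ_1, i.e. the first direction GIB retains.

```python
import numpy as np

# Generic power iteration on I - M^T (our simplified stand-in for the GIB
# iteration), using the 2D example of Figure 1: Sigma_xy = [0.1, 0.2], Sx = I.
Sx = np.eye(2)
Sxy = np.array([[0.1], [0.2]])
Sy = np.array([[1.0]])

M = (Sx - Sxy @ np.linalg.solve(Sy, Sxy.T)) @ np.linalg.inv(Sx)
w = np.array([1.0, 0.0])
for _ in range(200):
    w = (np.eye(2) - M.T) @ w     # amplify the direction with largest 1 - lambda
    w /= np.linalg.norm(w)        # renormalize to keep the iterate finite

lam1 = w @ M.T @ w                # Rayleigh quotient recovers lambda_1
print(round(lam1, 3))             # 0.95 for this example (cf. Figure 1)
```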
The detailed solution of this problem as a function of the tradeoff parameters remains to be investigated.\n\nFor categorical variables, the IB framework can be shown to be closely related to maximum-likelihood in a latent variable model [11]. It would be interesting to see whether the GIB-CCA equivalence can be extended to give a more general understanding of the relation between IB and statistical latent variable models.\n\nThe extension of IB to continuous variables reveals a common principle behind regularized unsupervised learning methods ranging from clustering to CCA. It remains an interesting challenge to obtain practical algorithms in the IB framework for dimension reduction (continuous T) without the Gaussian assumption, for example by kernelizing [12] or adding non-linearities to the projections (as in [13]).\n\nReferences\n\n[1] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication and Computation, 1999.\n\n[2] N. Slonim. Information Bottleneck theory and applications. PhD thesis, Hebrew University of Jerusalem, 2003.\n\n[3] H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139-142, 1935.\n\n[4] S. Becker and G.E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161-163, 1992.\n\n[5] S. Becker. Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, pages 7-31, 1996.\n\n[6] T. Berger and R. Zamir. A semi-continuous version of the Berger-Yeung problem. IEEE Transactions on Information Theory, pages 1520-1526, 1999.\n\n[7] G. Chechik and A. Globerson. Information bottleneck and linear projections of Gaussian processes. Technical Report 4, Hebrew University, May 2003.\n\n[8] A.D. Wyner. On source coding with side information at the decoder. 
IEEE Transactions on Information Theory, IT-21:294-300, 1975.\n\n[9] R. Gilad-Bachrach, A. Navot, and N. Tishby. An information theoretic tradeoff between complexity and accuracy. In Proceedings of COLT, Washington, 2003.\n\n[10] G. Chechik and N. Tishby. Extracting relevant structures with side information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2002.\n\n[11] N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2002.\n\n[12] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, A. Smola, and K. Muller. Invariant feature extraction and classification in kernel spaces. In S.A. Solla, T.K. Leen, and K.R. Muller, editors, Advances in Neural Information Processing Systems 12, 2000.\n\n[13] A.J. Bell and T.J. Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.\n", "award": [], "sourceid": 2457, "authors": [{"given_name": "Gal", "family_name": "Chechik", "institution": null}, {"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}, {"given_name": "Yair", "family_name": "Weiss", "institution": null}]}