{"title": "Kernelized Infomax Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": null, "full_text": "Kernelized Infomax Clustering\n\nFelix V. Agakov\n\nEdinburgh University\n\nEdinburgh EH1 2QL, U.K.\n\nfelixa@inf.ed.ac.uk\n\nDavid Barber\n\nIDIAP Research Institute\n\nCH-1920 Martigny Switzerland\n\ndavid.barber@idiap.ch\n\nAbstract\n\nWe propose a simple information-theoretic approach to soft clus-\ntering based on maximizing the mutual information I(x, y) between\nthe unknown cluster labels y and the training patterns x with re-\nspect to parameters of speci\ufb01cally constrained encoding distribu-\ntions. The constraints are chosen such that patterns are likely to\nbe clustered similarly if they lie close to speci\ufb01c unknown vectors\nin the feature space. The method may be conveniently applied to\nlearning the optimal a\ufb03nity matrix, which corresponds to learn-\ning parameters of the kernelized encoder. The procedure does not\nrequire computations of eigenvalues of the Gram matrices, which\nmakes it potentially attractive for clustering large data sets.\n\n1 Introduction\n\nLet x \u2208 R|x| be a visible pattern, and y \u2208 {y1, . . . , y|y|} its discrete unknown cluster\nlabel. Rather than learning a density model of the observations, our goal here will\nbe to learn a mapping x \u2192 y from the observations to the latent codes (cluster\nlabels) by optimizing a formal measure of coding e\ufb03ciency. Good codes y should\nbe in some way informative about the underlying high-dimensional source vectors x,\nso that the useful information contained in the sources is not lost. The fundamental\nmeasure in this context is the mutual information\n\nI(x, y) def= H(x) \u2212 H(x|y) \u2261 H(y) \u2212 H(y|x),\n\n(1)\n\nwhich indicates the decrease in uncertainty about the pattern x due to the knowl-\nedge of the underlying cluster label y (e.g. Cover and Thomas (1991)). 
Here\nH(y) ≡ -⟨log p(y)⟩_p(y) and H(y|x) ≡ -⟨log p(y|x)⟩_p(x,y) are the marginal and conditional entropies respectively, and the brackets ⟨...⟩_p denote averages over p. In our case the encoder model is defined as\n\np(x, y) ∝ Σ_{m=1}^{M} δ(x - x^(m)) p(y|x),   (2)\n\nwhere {x^(m) | m = 1, ..., M} is the set of training patterns.\n\nOur goal is to maximize (1) with respect to the parameters of a constrained encoding distribution p(y|x). In contrast to most applications of the infomax principle (Linsker (1988)) in stochastic channels (e.g. Brunel and Nadal (1998); Fisher and Principe (1998); Torkkola and Campbell (2000)), optimization of the objective (1) is computationally tractable, since the cardinality of the code space |y| (the number of clusters) will typically be low. Had the code space been high-dimensional, computation of I(x, y) would have required evaluating the generally intractable entropy of the mixture H(y), and approximations would have been needed (e.g. Barber and Agakov (2003); Agakov and Barber (2006)).\n\nMaximization of the mutual information with respect to the parameters of the encoder model effectively defines a discriminative unsupervised optimization framework, in which the model is parameterized similarly to a conditionally trained classifier but the cluster allocations are generally unknown. Training such models p(y|x) by maximizing the likelihood p(x) would be meaningless, as the cluster variables would marginalize out; this further motivates our information-theoretic approach. In this way we may extract soft cluster allocations directly from the training set, with no additional information about class labels, relevance patterns, etc. 
required. This is an important difference from other clustering techniques that make recourse to information theory, which consider different channels and generally require additional information about relevance or irrelevance variables (cf. Tishby et al. (1999); Chechik and Tishby (2002); Dhillon and Guan (2003)).\n\nOur infomax approach stands in contrast with probabilistic methods based on likelihood maximization. There the task of finding an optimal cluster allocation y for an observed pattern x may be viewed as an inference problem in generative models y → x, where the probability of the data p(x) = Σ_y p(y) p(x|y) is defined as a mixture of |y| processes. The key idea of fitting such models to data is to find a constrained probability distribution p(x) which would be likely to generate the visible patterns {x^(1), ..., x^(M)} (this is commonly achieved by maximizing the marginal likelihood for deterministic parameters of the constrained distribution). The unknown cluster y corresponding to each pattern x may then be assigned according to the posterior p(y|x) ∝ p(y) p(x|y). Such generative approaches are well known, but suffer from the constraint that p(x|y) must be a correctly normalised distribution in x. In high dimensions |x| this usually restricts the class of generative distributions to (mixtures of) Gaussians whose mean depends (linearly or non-linearly) on the latent cluster y. Typically data will lie on low-dimensional curved manifolds embedded in the high-dimensional x-space; if we are restricted to mixtures of Gaussians to model such a curved manifold, a very large number of mixture components will typically be required. 
No such restrictions apply in the infomax case so that the mappings p(y|x)\nmay be very complex, subject only to sensible clustering constraints.\n\n2 Clustering in Nonlinear Encoder Models\n\nArguably, there are at least two requirements which a meaningful cluster allocation\nprocedure should satisfy. Firstly, clusters should be, in some sense, locally smooth.\nFor example, each pair of source vectors should have a high probability of being\nassigned to the same cluster if the vectors satisfy speci\ufb01c geometric constraints.\nSecondly, we may wish to avoid assigning unique cluster labels to outliers (or other\nconstrained regions in the data space), so that under-represented regions in the\ndata space are not over-represented in the code space. Note that degenerate cluster\nallocations are generally suboptimal under the objective (1), as they would lead to\na reduction in the marginal entropy H(y). On the other hand, it is intuitive that\nmaximization of the mutual information I(x, y) favors hard assignments of cluster\nlabels to equiprobable data regions, as this would result in the growth in H(y) and\nreduction in H(y|x).\n\n\f2.1 Learning Optimal Parameters\n\nLocal smoothness and \u201csoftness\u201d of the clusters may be enforced by imposing ap-\npropriate constraints on p(y|x). A simple choice of the encoder is\n\np(yj|x(i)) \u221d exp{\u2212kx(i) \u2212 wjk2/sj + bj},\n\n(3)\n\nwhere the cluster centers wj \u2208 R|x|, the dispersions sj, and the biases bj are the\nencoder parameters to be learned. Clearly, under the encoding distribution (3)\npatterns x lying close to speci\ufb01c centers wj in the data space will tend to be clustered\nsimilarly. In principle, we could consider other choices of p(y|x); however (3) will\nprove to be particularly convenient for the kernelized extensions.\n\nLearning the optimal cluster allocations corresponds to maximizing (1) with respect\nto the encoder parameters (3). 
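To make the objective concrete before stating its gradients, here is a minimal numpy sketch (an editorial illustration, not code from the paper; all function names are our own) of the encoder (3), the objective (1), and the weighting coefficients (6). It uses the fact that the average of p(y|x) under the empirical distribution is simply the marginal p(y):

```python
import numpy as np

def responsibilities(X, W, s, b):
    # Encoder (3): p(y_j | x^(m)) proportional to exp(-||x^(m) - w_j||^2 / s_j + b_j)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # (M, J) squared distances
    logits = -d2 / s[None, :] + b[None, :]
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def mutual_information(P):
    # Objective (1): I(x, y) = H(y) - H(y|x) under the empirical distribution
    py = P.mean(axis=0)
    H_y = -(py * np.log(py + 1e-300)).sum()
    H_y_given_x = -(P * np.log(P + 1e-300)).sum(axis=1).mean()
    return H_y - H_y_given_x

def alphas(P):
    # Coefficients (6); under the empirical distribution, <p(y|x)> is the marginal p(y)
    py = P.mean(axis=0)
    log_ratio = np.log(P + 1e-300) - np.log(py + 1e-300)[None, :]
    kl = (P * log_ratio).sum(axis=1, keepdims=True)       # KL(p(y|x^(m)) || p(y))
    return log_ratio - kl

def grad_b(P):
    # Bias gradient: dI/db_j = (1/M) sum_m p(y_j | x^(m)) alpha_j^(m)
    return (P * alphas(P)).mean(axis=0)
```

The bias gradient is exact (it carries no dropped constant pre-factors), so it can be checked against central finite differences of `mutual_information`, a cheap sanity test for any implementation of the updates.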
The gradients are given by\n\n∂I(x, y)/∂w_j = (1/M) Σ_{m=1}^{M} p(y_j|x^(m)) α_j^(m) (x^(m) - w_j)/s_j,   (4)\n\n∂I(x, y)/∂s_j = (1/M) Σ_{m=1}^{M} p(y_j|x^(m)) α_j^(m) ‖x^(m) - w_j‖²/(2 s_j²).   (5)\n\nAnalogously, we get ∂I(x, y)/∂b_j = (1/M) Σ_{m=1}^{M} p(y_j|x^(m)) α_j^(m).\n\nExpressions (4) and (5) have the form of the weighted EM updates for isotropic Gaussian mixtures, with the weighting coefficients α_j^(m) defined as\n\nα_j^(m) := α_j(x^(m)) := log [p(y_j|x^(m))/p(y_j)] - KL(p(y|x^(m)) ‖ ⟨p(y|x)⟩_p̃(x)),   (6)\n\nwhere KL denotes the Kullback-Leibler divergence (e.g. Cover and Thomas (1991)) and p̃(x) ∝ Σ_m δ(x - x^(m)) is the empirical distribution. Clearly, if α_j^(m) is kept fixed for all m = 1, ..., M and j = 1, ..., |y|, the gradients (4) are identical to those obtained by maximizing the log-likelihood of a Gaussian mixture model (up to irrelevant constant pre-factors). Generally, however, the coefficients α_j^(m) will be functions of w_l, s_l, and b_l for all cluster labels l = 1, ..., |y|.\n\nIn practice, we may impose a simple construction ensuring that s_j > 0, for example by assuming that s_j = exp{s̃_j} where s̃_j ∈ R. In this case, we may re-express the gradients for the variances as ∂I(x, y)/∂s̃_j = s_j ∂I(x, y)/∂s_j. Expressions (4) and (5) may then be used to perform gradient ascent on I(x, y) for w_j, s̃_j, and b_j, where j = 1, ..., |y|. After training, the optimal cluster allocations may be assigned according to the encoding distribution p(y|x).\n\n2.2 Infomax Clustering with Kernelized Encoder Models\n\nWe now extend (3) by considering a kernelized parameterization of a nonlinear encoder. 
Let us assume that the source patterns x(i), x(j) have a high probability\nof being assigned to the same cluster if they lie close to a speci\ufb01c cluster center in\nsome feature space. One choice of the encoder distribution for this case is\n\np(yj|x(i)) \u221d exp{\u2212k\u03c6(x(i)) \u2212 wjk2/sj + bj},\n\n(7)\n\nwhere \u03c6(x(i)) \u2208 R|\u03c6| is the feature vector corresponding to the source pattern x(i),\nand wj \u2208 R|\u03c6| is the (unknown) cluster center in the feature space. The feature\nspace may be very high- or even in\ufb01nite-dimensional.\n\n\fSince each cluster center wi \u2208 R|\u03c6| lives in the same space as the projected sources\n\u03c6(x(i)), it is representable in the basis of the projections as\n\nwj =\n\nM\n\nX\n\nm=1\n\n\u03b1mj\u03c6(x(m)) + w\u22a5\nj ,\n\n(8)\n\nwhere \u02dcw\u22a5\ni \u2208 R|\u03c6| is orthogonal to the span of \u03c6(x1), . . . , \u03c6(xM ), and {\u03b1mj} is a set\nof coe\ufb03cients (here j and m index |y| codes and M patterns respectively). Then\nwe may transform the encoder distribution (7) to\n\np(yj|x(m)) \u221d expn\u2212\u00b3Kmm \u2212 2kT (x(m))aj + aT\n\nj Kaj + cj\u00b4 /sjo\n\ndef= exp{\u2212fj(x(m))},\n\n(9)\n\nwhere k(x(m)) corresponds to the mth column (or row) of the Gram matrix\ndef= {Kij} def= {\u03c6(x(i))T \u03c6(x(j))} \u2208 RM \u00d7M , aj \u2208 RM is the jth column of the\nK\nj \u2212 sjbj. With-\nmatrix of the coe\ufb03cients A\nout loss of generality, we may assume that c = {cj} \u2208 R|y| is a free unconstrained\nparameter. Additionally, we will ensure positivity of the dispersions sj by consid-\nering a construction constraint sj = exp{\u02dcsj}, where \u02dcsj \u2208 R.\n\ndef= {amj} \u2208 RM \u00d7|y|, and cj = (w\u22a5\n\nj )T w\u22a5\n\nLearning Optimal Parameters\n\nFirst we will assume that the Gram matrix K \u2208 RM \u00d7M is \ufb01xed and known (which\ne\ufb00ectively corresponds to considering a \ufb01xed a\ufb03nity matrix, see e.g. Dhillon et al.\n(2004)). 
Objective (1) should be optimized with respect to the log-dispersions s̃_j ≡ log(s_j), the biases c_j, and the coordinates A ∈ R^{M×|y|} in the space spanned by the feature vectors {φ(x^(i)) | i = 1, ..., M}. From (9) we get\n\n∂I(x, y)/∂a_j = (1/s_j) ⟨p(y_j|x) (k(x) - K a_j) α_j(x)⟩_p̃(x) ∈ R^M,   (10)\n\n∂I(x, y)/∂s̃_j = (1/(2 s_j)) ⟨p(y_j|x) f_j(x) α_j(x)⟩_p̃(x),   (11)\n\nwhere p̃(x) ∝ Σ_{m=1}^{M} δ(x - x^(m)) is the empirical distribution. Analogously, we obtain\n\n∂I(x, y)/∂c_j = ⟨p(y_j|x) α_j(x)⟩_p̃(x),   (12)\n\nwhere the coefficients α_j(x) are given by (6). For a known Gram matrix K ∈ R^{M×M}, the gradients ∂I/∂a_j, ∂I/∂s̃_j, and ∂I/∂c_j given by expressions (10)-(12) may be used in numerical optimization for the model parameters. Note that the matrix multiplication in (10) is performed once for each a_j, so that the complexity of computing the gradient is O(M²|y|) per iteration. We also note that one could potentially optimize (1) by applying the iterative Arimoto-Blahut algorithm for maximizing the channel capacity (see e.g. Cover and Thomas (1991)). However, for any given constrained encoder it is generally difficult to derive closed-form updates for the parameters of p(y|x), which motivates a numerical optimization.\n\nLearning Optimal Kernels\n\nSince we presume that explicit computations in R^{|φ|} are expensive, we cannot compute the Gram matrix by trivially applying its definition K = {φ(x^(i))^T φ(x^(j))}. Instead, we may interpret scalar products in feature spaces as kernel functions\n\nφ(x^(i))^T φ(x^(j)) = K_Θ(x^(i), x^(j)), for all x^(i), x^(j) ∈ R^{|x|},   (13)\n\nwhere K_Θ : R^{|x|} × R^{|x|} → R satisfies Mercer's kernel properties (e.g. Scholkopf and Smola (2002)). 
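The kernelized encoder (9) depends on the data only through the Gram matrix, so the responsibilities can be computed without ever forming φ(x) explicitly. The following sketch (our own illustration, with hypothetical function names) evaluates (9); for the linear kernel K = X Xᵀ it must reproduce the explicit feature-space encoder with centers w_j = Σ_m a_mj x^(m), which gives a simple consistency check:

```python
import numpy as np

def kernel_responsibilities(K, A, s, c):
    # Kernelized encoder (9): p(y_j | x^(m)) proportional to exp(-f_j(x^(m))), with
    # f_j(x^(m)) = (K_mm - 2 k(x^(m))^T a_j + a_j^T K a_j + c_j) / s_j
    quad = np.einsum('mj,mn,nj->j', A, K, A)              # a_j^T K a_j for each j
    F = (np.diag(K)[:, None] - 2.0 * (K @ A) + quad[None, :] + c[None, :]) / s[None, :]
    logits = -F
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```

The dominant cost is the product K @ A, i.e. O(M²|y|) per evaluation, in line with the gradient complexity quoted above; no eigendecomposition or inversion of K is needed.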
We may now apply our unsupervised framework to implicitly learn the optimal nonlinear features by optimizing I(x, y) with respect to the parameters Θ of the kernel function K_Θ. After some algebraic manipulations, we get\n\nM ∂I(x, y)/∂Θ = Σ_{m=1}^{M} KL(p(y|x^(m)) ‖ p(y)) Σ_{k=1}^{|y|} [∂f_k(x^(m))/∂Θ] p(y_k|x^(m)) - Σ_{m=1}^{M} Σ_{j=1}^{|y|} [∂f_j(x^(m))/∂Θ] p(y_j|x^(m)) log [p(y_j|x^(m))/p(y_j)],   (14)\n\nwhere f_k(x^(m)) is given by (9). The computational complexity of the updates for Θ is O(M²|y|), where M is the number of training patterns and |y| is the number of clusters (which is assumed to be small). Note that in contrast to spectral methods (see e.g. Shi and Malik (2000); Ng et al. (2001)), neither the objective (1) nor its gradients require inversion of the Gram matrix K ∈ R^{M×M} or computation of its eigenvalue decomposition.\n\nIn the special case of the radial basis function (RBF) kernels\n\nK_β(x^(i), x^(j)) = exp{-β ‖x^(i) - x^(j)‖²},   (15)\n\nthe gradients of the encoder potentials are simply given by\n\n∂f_j(x^(m))/∂β = (1/s_j) (a_j^T K̃ a_j - 2 k̃^T(x^(m)) a_j),   (16)\n\nwhere K̃ := {K̃_ij} := {∂K_β(x^(i), x^(j))/∂β} = {-‖x^(i) - x^(j)‖² K_β(x^(i), x^(j))}, which vanishes on the diagonal (where x^(i) = x^(j)), and k̃(x^(m)) is the m-th column of K̃. By substituting (16) into the general expression (14), we obtain the gradient of the mutual information with respect to the RBF kernel parameters.\n\n3 Demonstrations\n\nWe have empirically compared our kernelized information-theoretic clustering approach with Gaussian mixture, k-means, feature-space k-means, non-kernelized information-theoretic clustering (see Section 2.1), and a multi-class spectral clustering method optimizing the normalized cuts. We illustrate the methods on datasets that are particularly easy to visualize. 
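Since K̃ is simply the elementwise β-derivative of the RBF Gram matrix, the full gradient of I(x, y) with respect to β follows by chaining (16) through (14). A hedged numpy sketch (function names are ours; it assumes the Section 2.2 construction with c as a free parameter), which can be validated against finite differences of I(x, y) in β:

```python
import numpy as np

def rbf_gram(X, beta):
    # RBF kernel (15): K_ij = exp(-beta ||x^(i) - x^(j)||^2)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * D2), D2

def encoder(K, A, s, c):
    # Responsibilities and potentials f_j(x^(m)) from (9)
    quad = np.einsum('mj,mn,nj->j', A, K, A)
    F = (np.diag(K)[:, None] - 2.0 * (K @ A) + quad[None, :] + c[None, :]) / s[None, :]
    logits = -F - (-F).max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True), F

def mutual_information(P):
    py = P.mean(axis=0)
    return -(py * np.log(py + 1e-300)).sum() + (P * np.log(P + 1e-300)).sum(axis=1).mean()

def dI_dbeta(X, beta, A, s, c):
    # Combine (14) and (16): dI/dbeta = -(1/M) sum_{m,j} p(y_j|x^(m)) alpha_j^(m) df_j/dbeta
    K, D2 = rbf_gram(X, beta)
    P, _ = encoder(K, A, s, c)
    py = P.mean(axis=0)
    log_ratio = np.log(P + 1e-300) - np.log(py + 1e-300)[None, :]
    alpha = log_ratio - (P * log_ratio).sum(axis=1, keepdims=True)   # coefficients (6)
    Kdot = -D2 * K                               # dK/dbeta; zero on the diagonal
    dF = (np.einsum('mj,mn,nj->j', A, Kdot, A)[None, :] - 2.0 * (Kdot @ A)) / s[None, :]
    # note diag(K) = 1 for every beta, so its beta-derivative contributes nothing
    return -(P * alpha * dF).sum() / X.shape[0]
```

Because (14) and (16) are exact for the softmax encoder (no dropped constants), the analytic gradient should agree with a central finite-difference estimate to high precision.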
Figure 1 shows a typical application of the methods to the spiral data, where x_1(t) = t cos(t)/4 and x_2(t) = t sin(t)/4 are the coordinates of x ∈ R^{|x|}, |x| = 2, and t ∈ [0, 10π/3]. The kernel parameter β of the RBF-kernelized encoding distribution was initialized at β_0 = 2.5 and learned according to (16). The initial settings of the coefficients A ∈ R^{M×|y|} in the feature space were sampled from A_ij ∼ N(0, 0.1). The log-variances s̃_1, ..., s̃_|y| were initialized at zero. The encoder parameters A and {s̃_j | j = 1, ..., |y|} (along with the RBF kernel parameter β) were optimized by applying scaled conjugate gradients. We found that Gaussian mixtures trained by maximizing the likelihood usually resulted in highly stochastic cluster allocations; additionally, they led to a large variation in cluster sizes. The Gaussian mixtures were initialized using k-means; other choices usually led to worse performance. We also see that k-means effectively breaks down, as the similarly clustered points lie close to each other in R² (according to the L2-norm), but the allocated clusters are not locally smooth in t. 
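For reference, the spiral dataset used in this experiment can be generated as follows (a sketch; the uniform spacing of t is our assumption, since the paper does not specify how t is sampled):

```python
import numpy as np

def spiral_data(M=70):
    # Spiral from Section 3: x1(t) = t cos(t)/4, x2(t) = t sin(t)/4, t in [0, 10*pi/3]
    t = np.linspace(0.0, 10.0 * np.pi / 3.0, M)
    X = np.stack([t * np.cos(t) / 4.0, t * np.sin(t) / 4.0], axis=1)
    return X, t
```

The radius of each point is t/4, so the data trace a locally smooth one-dimensional manifold in R², which is exactly the structure that the L2-based k-means assignment fails to respect.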
On the other hand, our method with the RBF-kernelized encoders typically led to locally smooth cluster allocations.\n\n[Figure 1 appears here.]\n\nFigure 1: Cluster allocations (top) and the corresponding responsibilities (bottom) p(y_j|x^(m)) for |x| = 2, |y| = 3, M = 70 (the patterns are sorted to indicate local smoothness in the phase parameter). Left: Gaussian mixtures; middle: k-means; right: information maximization for the (RBF-)kernelized encoder (the learned parameter β ≈ 0.825). Light, medium, and dark-gray squares show the cluster colors corresponding to deterministic cluster allocations. The color intensity of each training point x^(m) is the average of the pure cluster intensities, weighted by the responsibilities p(y_j|x^(m)). 
Nearly indistinguishable dark colors of the Gaussian mixture\nclustering indicate soft cluster assignments.\n\nFigure 2 shows typical results for spatially translated letters with |x| = 2, M =\n150, and |y| = 2 (or |y| = 3), where we compare Gaussian mixture, feature-space\nk-means, the spectral method of Ng et al. (2001), and our information-theoretic\nclustering method. The initializations followed the same procedure as the previous\nexperiment. The results produced by our kernelized infomax method were generally\nstable under di\ufb00erent initializations, provided that \u03b20 was not too large or too small.\nIn contrast to Gaussian mixture, spectral, and feature-space k-means clustering,\nthe clusters produced by kernelized infomax for the cases considered are arguably\nmore anthropomorphically appealing. Note that feature-space k-means, as well\nas the spectral method, presume that the kernel matrix K \u2208 RM \u00d7M is \ufb01xed and\nknown (in the latter case, the Gram matrix de\ufb01nes the edge weights of the graph).\nFor illustration purposes, we show the results for the \ufb01xed Gram matrices with\nkernel parameters \u03b2 set to the initial values \u03b20 = 1 or the learned values \u03b2 \u2248\n0.604 of the kernelized infomax method for |y| = 2. One may potentially improve\nthe performance of these methods by running the algorithms several times (with\ndi\ufb00erent kernel parameters \u03b2), and choosing \u03b2 which results in tightest clusters\n(Ng et al. (2001)). We were indeed able to apply the spectral method to obtain\nclusters for TA and T (for \u03b2 \u2248 1.1). 
While being useful in some situations, the procedure generally requires multiple runs. In contrast, the kernelized infomax method typically resulted in meaningful cluster allocations (TT and A) after a single run of the algorithm (see Figure 2 (c)), with the results qualitatively consistent under a variety of initializations.\n\nAdditionally, we note that in situations where we used simpler encoder models (see expression (3)) or did not adapt the parameters of the kernel functions, the extracted clusters were often more intuitive than those produced by rival methods, but inferior\n\n[Figure 2 appears here.]\n\nFigure 2: Learning cluster allocations for |y| = 2 and |y| = 3. Where appropriate, the stars show the cluster centers. 
(a) two-component Gaussian mixture trained\nby the EM algorithm; (b) feature-space k-means with \u03b2 = 1.0 and \u03b2 \u2248 0.604 (the\nonly pattern clustered di\ufb00erently (under identical initializations) is shown by \u229a);\n(c) kernelized infomax clustering for |y| = 2 (the inverse variance \u03b2 of the RBF\nkernel varied from \u03b20 = 1 (at the initialization) to \u03b2 \u2248 0.604 after convergence);\n(d) spectral clustering for |y| = 2 and \u03b2 \u2248 0.604; (e) kernelized infomax clustering\nfor |y| = 3 with a \ufb01xed Gram matrix; (f ) kernelized infomax clustering for |y| = 3\nstarted at \u03b20 = 1 and reaching \u03b2 \u2248 0.579 after convergence.\n\nto the ones produced by (7) with the optimal learned \u03b2. Our results suggest that by\nlearning kernel parameters we may often obtain higher values of the objective I(x, y),\nas well as more appealing cluster labeling (e.g. for the examples shown on Figure 2\n(e), (f) we get I(x, y) \u2248 1.03 and I(x, y) \u2248 1.10 respectively). Undoubtedly, a careful\nchoice of the kernel function could potentially lead to an even better visualization\nof the locally smooth, non-degenerate structure.\n\n4 Discussion\n\nThe proposed information-theoretic clustering framework is fundamentally di\ufb00er-\nent from the generative latent variable clustering approaches. Instead of explicitly\nparameterizing the data-generating process, we impose constraints on the encoder\ndistributions, transforming the clustering problem to learning optimal discrete en-\ncodings of the unlabeled data. Many possible parameterizations of such distribu-\ntions may potentially be considered. Here we discussed one such choice, which\nimplicitly utilizes projections of the data to high-dimensional feature spaces.\n\nOur method suggests a formal information-theoretic procedure for learning optimal\ncluster allocations. 
One potential disadvantage of the method is a potentially large\nnumber of local optima; however, our empirical results suggest that the method is\nstable under di\ufb00erent initializations, provided that the initial variances are su\ufb03-\nciently large. Moreover, the results suggest that in the cases considered the method\n\n\ffavorably compares with the common generative clustering techniques, k-means,\nfeature-space k-means, and the variants of the method which do not use nonlinear-\nities or do not learn parameters of kernel functions.\n\nA number of interesting interpretations of clustering approaches in feature spaces\nare possible. Recently, it has been shown (Bach and Jordan (2003); Dhillon et al.\n(2004)) that spectral clustering methods optimizing normalized cuts (Shi and Malik\n(2000); Ng et al. (2001)) may be viewed as a form of weighted feature-space k-means,\nfor a speci\ufb01c \ufb01xed similarity matrix. We are currently relating our method to the\ncommon spectral clustering approaches and a form of annealed weighted feature-\nspace k-means. We stress, however, that our information-maximizing framework\nsuggests a principled way of learning optimal similarity matrices by adapting param-\neters of the kernel functions. Additionally, our method does not require computa-\ntions of eigenvalues of the similarity matrix, which may be particularly bene\ufb01cial for\nlarge datasets. Finally, we expect that the proper information-theoretic interpreta-\ntion of the encoder framework may facilitate extensions of the information-theoretic\nclustering method to richer families of encoder distributions.\n\nReferences\n\nAgakov, F. V. and Barber, D. (2006). Auxiliary Variational Information Maximization\nfor Dimensionality Reduction. In Proceedings of the PASCAL Workshop on Subspace,\nLatent Structure and Feature Selection Techniques. Springer. To appear.\n\nBach, F. R. and Jordan, M. I. (2003). Learning spectral clustering. In NIPS. MIT Press.\n\nBarber, D. 
and Agakov, F. V. (2003). The IM Algorithm: A Variational Approach to Information Maximization. In NIPS. MIT Press.\n\nBrunel, N. and Nadal, J.-P. (1998). Mutual Information, Fisher Information and Population Coding. Neural Computation, 10:1731-1757.\n\nChechik, G. and Tishby, N. (2002). Extracting relevant structures with side information. In NIPS, volume 15. MIT Press.\n\nCover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, NY.\n\nDhillon, I. S. and Guan, Y. (2003). Information Theoretic Clustering of Sparse Co-Occurrence Data. In Proceedings of the 3rd IEEE International Conf. on Data Mining.\n\nDhillon, I. S., Guan, Y., and Kulis, B. (2004). Kernel k-means, Spectral Clustering and Normalized Cuts. In KDD. ACM.\n\nFisher, J. W. and Principe, J. C. (1998). A methodology for information theoretic feature extraction. In Proc. of the IEEE International Joint Conference on Neural Networks.\n\nLinsker, R. (1988). Towards an Organizing Principle for a Layered Perceptual Network. In Advances in Neural Information Processing Systems. American Institute of Physics.\n\nNg, A. Y., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS, volume 14. MIT Press.\n\nScholkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press.\n\nShi, J. and Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905.\n\nTishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing. Kluwer Academic Publishers.\n\nTorkkola, K. and Campbell, W. M. (2000). Mutual Information in Learning Feature Transformations. In ICML. 
Morgan Kaufmann.\n\n\f", "award": [], "sourceid": 2934, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Felix", "family_name": "Agakov", "institution": null}]}