{"title": "Clustering with Bregman Divergences: an Asymptotic Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2351, "page_last": 2359, "abstract": "Clustering, in particular $k$-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of $k$-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of the clusters $k$ is large. We establish quantization rates and describe the limiting distribution of the centers as $k\\to \\infty$, extending well-known results for $k$-means clustering.", "full_text": "Clustering with Bregman Divergences: an Asymptotic\n\nAnalysis\n\nChaoyue Liu, Mikhail Belkin\n\nDepartment of Computer Science & Engineering\n\nThe Ohio State University\n\nliu.2656@osu.edu, mbelkin@cse.ohio-state.edu\n\nAbstract\n\nClustering, in particular k-means clustering, is a central topic in data analysis.\nClustering with Bregman divergences is a recently proposed generalization of\nk-means clustering which has already been widely used in applications. In this\npaper we analyze theoretical properties of Bregman clustering when the number\nof the clusters k is large. We establish quantization rates and describe the limiting\ndistribution of the centers as k \u2192 \u221e, extending well-known results for k-means\nclustering.\n\n1\n\nIntroduction\n\nClustering and the closely related problem of vector quantization are fundamental problems in\nmachine learning and data mining. The aim is to partition similar points into \"clusters\" in order to\norganize or compress the data. In many clustering methods these clusters are represented by their\ncenters or centroids. The set of these centers is often called \u201cthe codebook\" in the vector quantization\nliterature. 
In this setting the goal of clustering is to find an optimal codebook, i.e., a set of centers which minimizes a clustering loss function, also known as the quantization error.\nThere is a vast literature on clustering and vector quantization, see, e.g., [8, 10, 12]. One of the particularly important types of clustering and, arguably, of data analysis methods of any type, is k-means clustering [16], which aims to minimize the loss function based on the squared Euclidean distance. This is typically performed using Lloyd\u2019s algorithm [15], an iterative optimization technique. Lloyd\u2019s algorithm is simple, easy to implement, and guaranteed to converge in a finite number of steps. There is an extensive literature on various aspects and properties of k-means clustering, including applications and theoretical analysis [2, 13, 23]. An important type of analysis is the asymptotic analysis, which studies the setting when the number of centers is large. This situation (n \u226b k \u226b 0) arises in many applications related to data compression, as well as in algorithms such as soft k-means features used in computer vision and other applications, where the number of centers k is quite large but significantly less than the number of data points n. This situation also arises in k-means feature-based methods which have seen significant success in computer vision, e.g., [6].\nThe quantization loss for k-means clustering in the setting k \u2192 \u221e is well-known (see [5, 9, 20]). A less well-known fact, shown in [9, 18], is that the discrete set of centers also converges to a measure closely related to the underlying probability distribution. 
This fact can be used to reinterpret k-means feature-based methods in terms of a density-dependent kernel [21].\nMore recently, it has been realized that the properties of squared Euclidean distance which make Lloyd\u2019s algorithm for k-means clustering so simple and efficient are shared by a class of similarity measures based on Bregman divergences. In an influential paper [3] the authors introduced clustering based on Bregman divergences, which generalized k-means clustering to that setting and produced a corresponding generalized version of Lloyd\u2019s algorithm. That work has led to a new line of research on clustering, including results on multitask Bregman clustering [24], agglomerative Bregman clustering [22], and many others.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThere has also been some theoretical analysis of Bregman clustering, including [7], which proves the existence of an optimal quantizer and establishes convergence and quantization-loss bounds in the limit of the data size n \u2192 \u221e for fixed k.\nIn this paper we set out to investigate asymptotic properties of Bregman clustering as the number of centers increases. We provide explicit asymptotic rates for the quantization error of Bregman clustering, as well as the continuous measure which is the limit of the center distribution. Our results generalize the well-known results for k-means clustering. We believe that these results will be useful for a better understanding of Bregman-divergence-based clustering algorithms and for algorithm design.\n\n2 Preliminaries and Existing Work\n\n2.1 k-means clustering and its asymptotic analysis\n\nk-means clustering is one of the most popular and well-studied clustering problems in data analysis. Suppose we are given a dataset D = {x_i}_{i=1}^n \u2282 R^d, containing n observations of an R^d-valued random variable X. 
k-means clustering aims to find a set of points (centroids) \u03b1 = {a_j}_{j=1}^k \u2282 R^d, with |\u03b1| = k initially fixed, that minimizes the squared Euclidean loss function\n\nL(\u03b1) = (1/n) \u2211_j min_{a\u2208\u03b1} \u2016x_j \u2212 a\u2016_2^2.   (1)\n\nFinding the global minimum of the loss function is an NP-hard problem [1, 17]. However, Lloyd\u2019s algorithm [15] is a simple and elegant method to obtain a locally optimal clustering of the data, corresponding to a local minimum of the loss function. A key reason for the practical utility of Lloyd\u2019s k-means algorithm is the following property of squared Euclidean distance: the arithmetic mean of a set of points is the unique minimizer of the loss for a single center:\n\n(1/n) \u2211_{i=1}^n x_i = arg min_{s\u2208R^d} (1/n) \u2211_{i=1}^n \u2016x_i \u2212 s\u2016_2^2.   (2)\n\nIt turns out that this property holds in far greater generality, as we will discuss below.\nAsymptotic analysis of Euclidean quantization:\nIn an asymptotic quantization problem, we focus on the limiting case of k \u2192 \u221e, where the size of the dataset n \u226b k. In this paper we will assume n = \u221e, i.e., that the probability distribution with density P is given. This setting is in line with the analysis in [9].\nCorrespondingly, a density measure P is defined as follows: for a set A \u2286 R^d, P(A) = \u222b_A P d\u03bb_d, where \u03bb_d is the Lebesgue measure on R^d. 
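The mean-as-minimizer property in Eq.(2) is exactly what each Lloyd iteration exploits: assign points to the nearest center, then replace each center by its cluster mean. A minimal NumPy sketch (our illustration, not the authors' code; the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations for the squared Euclidean loss of Eq.(1)."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points (an arbitrary choice).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: by Eq.(2), the cluster mean minimizes the loss
        # for a single center; empty clusters keep their old center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), d2.min(axis=1).mean()
```

Each iteration can only decrease the loss, which is why the algorithm converges to a local minimum in finitely many steps.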
We also have P = dP/d\u03bb_d.\nThe classical asymptotic results for Euclidean quantization are provided in a more general setting, for an arbitrary power of the distance in Eq.(1): Euclidean quantization of order r (1 \u2264 r < \u221e), with loss\n\nL(\u03b1) = E_P [ min_{a\u2208\u03b1} \u2016X \u2212 a\u2016_2^r ].   (3)\n\nNote that Lloyd\u2019s algorithm is only applicable to the standard case with r = 2.\nThe output of the k-means algorithm includes the locations of the centroids, which then imply the partition and the corresponding loss. For large k we are interested in: (1) the asymptotic quantization error, and (2) the distribution of centroids.\nAsymptotic quantization error. The asymptotic quantization error for k-means clustering has been analyzed in detail in [5, 14, 20]. S. Graf and H. Luschgy [9] show that as k \u2192 \u221e, the r-th quantization error decreases at a rate of k^{\u2212r/d}. Furthermore, the coefficient of the term k^{\u2212r/d} is of the form\n\nQ_r(P) = Q_r([0, 1]^d) \u2016P\u2016_{d/(d+r)},   (4)\n\nwhere Q_r([0, 1]^d), a constant independent of P, is geometrically interpreted as the asymptotic Euclidean quantization error for the uniform distribution on the d-dimensional unit cube [0, 1]^d. Here \u2016\u00b7\u2016_{d/(d+r)} is the L_{d/(d+r)} norm of a function: \u2016f\u2016_{d/(d+r)} = (\u222b f^{d/(d+r)} d\u03bb_d)^{(d+r)/d}.\n\n\fLocational distribution of centroids. A less well-known fact is that the locations of the optimal centroid configuration of k-means converge to a limit distribution closely related to the underlying density [9, 18]. Specifically, given a discrete set of centroids \u03b1_k, construct the corresponding discrete measure\n\nP_k = (1/k) \u2211_{j=1}^k \u03b4_{a_j},   (5)\n\nwhere \u03b4_a is the Dirac measure centered at a. For an open set A \u2286 R^d, P_k(A) is the ratio of the number of centroids k_A located within A to the total number of centroids k, namely P_k(A) = k_A/k. 
We say that a continuous measure \u02dcP is the limit distribution of centroids if {P_k} converges weakly to \u02dcP, specifically\n\n\u2200A \u2286 R^d, lim_{k\u2192\u221e} P_k(A) = \u02dcP(A).   (6)\n\nS. Graf and H. Luschgy [9] gave an explicit expression for this continuous limit distribution of centroids:\n\n\u02dcP_r = \u02dcP_r \u03bb_d,   \u02dcP_r = N \u00b7 P^{d/(d+r)},   (7)\n\nwhere \u03bb_d is the Lebesgue measure on R^d, P is the density of the probability distribution, and N is the normalization constant ensuring that \u02dcP_r integrates to 1.\n\n2.2 Bregman divergences and Bregman Clustering\n\nIn this section we briefly review the basics of Bregman divergences and the Bregman clustering algorithm.\nThe Bregman divergence, first proposed in 1967 by L. M. Bregman [4], measures dissimilarity between two points in a space. The formal definition is as follows:\nDefinition 1 (Bregman Divergence). Let \u03c6 be a function strictly convex on a convex set \u2126 \u2286 R^d and differentiable on the relative interior of \u2126. The Bregman divergence D_\u03c6 : \u2126 \u00d7 \u2126 \u2192 R with respect to \u03c6 is defined as\n\nD_\u03c6(p, q) = \u03c6(p) \u2212 \u03c6(q) \u2212 \u27e8p \u2212 q, \u2207\u03c6(q)\u27e9,   (8)\n\nwhere \u27e8\u00b7,\u00b7\u27e9 is the inner product in R^d. \u2126 is the domain of the Bregman divergence.\nNote that Bregman divergences are not necessarily true metrics. 
In general, they do satisfy the basic properties of non-negativity and identity of indiscernibles, but may not respect the triangle inequality or symmetry.\nExamples: Some popular examples of Bregman divergences include:\n\nSquared Euclidean distance: D_EU(p, q) = \u2016p \u2212 q\u2016_2^2,   (\u03c6_EU(z) = \u2016z\u2016^2)\nMahalanobis distance: D_MH(p, q) = (p \u2212 q)^T A (p \u2212 q),   A \u2208 R^{d\u00d7d}\nKullback-Leibler divergence: KL(p\u2016q) = \u2211_i p_i ln(p_i/q_i) \u2212 \u2211_i (p_i \u2212 q_i),   (\u03c6_KL(z) = \u2211_i (z_i ln z_i \u2212 z_i),  z_i > 0)\nItakura-Saito divergence: D_IS(p\u2016q) = \u2211_i (p_i/q_i \u2212 ln(p_i/q_i) \u2212 1),   (\u03c6_IS(z) = \u2212\u2211_i ln z_i)\nNorm-like divergence: D_NL(p\u2016q) = \u2211_i (p_i^\u03b1 + (\u03b1 \u2212 1)q_i^\u03b1 \u2212 \u03b1 p_i q_i^{\u03b1\u22121}),   (\u03c6_NL(z) = \u2211_i z_i^\u03b1,  z_i > 0, \u03b1 \u2265 2)   (9)\n\nDomains of Bregman divergences: \u2126_EU = \u2126_MH = R^d, and \u2126_KL = \u2126_IS = \u2126_NL = R^d_+.\nAlternative expression: the quadratic form. Suppose that \u03c6 \u2208 C^2(\u2126), which holds for the most popularly used Bregman divergences. Note that \u03c6(q) + \u27e8p \u2212 q, \u2207\u03c6(q)\u27e9 is simply the first two terms of the Taylor expansion of \u03c6 at q. Thus, Bregman divergences are nothing but the difference between a function and its linear approximation. By Lagrange\u2019s form of the remainder term, there exists \u03be with \u03be_i \u2208 [min(p_i, q_i), max(p_i, q_i)] (i.e., \u03be is in the smallest d-dimensional axis-parallel cube that contains p and q) such that\n\nD_\u03c6(p, q) = (1/2) (p \u2212 q)^T \u2207^2\u03c6(\u03be) (p \u2212 q),   (10)\n\n\fwhere \u2207^2\u03c6(\u03be) denotes the Hessian matrix of \u03c6 at \u03be.\nThis form is more compact and will be convenient for further analysis, but at the expense of introducing an unknown point \u03be. 
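Definition 1 is directly computable once the generator \u03c6 and its gradient are supplied. A short Python sketch (our illustration; the generators and gradients below are the standard ones for \u03c6_EU and \u03c6_KL from Eq.(9)):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <p - q, grad phi(q)>  (Definition 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return phi(p) - phi(q) - np.dot(p - q, grad_phi(q))

# Generator of squared Euclidean distance and its gradient.
phi_eu, grad_eu = lambda z: np.dot(z, z), lambda z: 2.0 * z
# Negative Shannon entropy, the generator of (generalized) KL divergence.
phi_kl, grad_kl = lambda z: np.sum(z * np.log(z) - z), lambda z: np.log(z)
```

For `phi_eu` this recovers \u2016p \u2212 q\u2016_2^2 exactly, and one can check numerically that D_\u03c6 \u2265 0 while D_\u03c6(p, q) \u2260 D_\u03c6(q, p) in general.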
We will use this form in later discussions.\nThe mean as the minimizer. As shown in A. Banerjee et al. [3], the property Eq.(2) still holds if the squared Euclidean distance is substituted by a general Bregman divergence:\n\n(1/n) \u2211_{i=1}^n x_i = arg min_{s\u2208\u2126} (1/n) \u2211_{i=1}^n D_\u03c6(x_i, s).   (11)\n\nThis allows Lloyd\u2019s method to be generalized to arbitrary Bregman clustering problems, where the new loss function is defined as\n\nL(\u03b1) = (1/n) \u2211_i min_{a\u2208\u03b1} D_\u03c6(x_i, a).   (12)\n\nThis modified version of k-means, called the Bregman hard clustering algorithm (see Algorithm 1 in [3]), results in a locally optimal quantization as well.\n\n3 Asymptotic Analysis of Bregman Quantization\n\nWe do not distinguish between the terminology of Bregman quantization and Bregman clustering. In this section, we analyze the asymptotics of Bregman quantization, allowing a power of Bregman divergences in the loss function. We show expressions for the quantization error and the limiting distribution of centers.\nWe start with the following:\nDefinition 2 (k-th quantization error for P of order r). Suppose a variable X takes values in \u2126 \u2286 R^d following a density P, where \u2126 is the d-dimensional domain of the Bregman divergence D_\u03c6. The k-th quantization error for P of order r (1/2 \u2264 r < \u221e) associated with D_\u03c6 is defined as\n\nV_{k,r,\u03c6}(P) = inf_{\u03b1\u2282R^d, |\u03b1|=k} E_P [ min_{a\u2208\u03b1} D^r_\u03c6(X, a) ],   (13)\n\nwhere \u03b1 \u2282 R^d is a set of representatives of clusters, corresponding to a certain partition, or quantization, of R^d or the support of P, and E_P[\u00b7] means taking the expectation over P.\nRemark: (a) The set \u03b1* that reaches the infimum is called the k-optimal set of centers for P of order r with respect to D_\u03c6. 
(b) In this setting, Bregman quantization of order r corresponds to Euclidean quantization of order 2r, because of Eq.(10).\n\n3.1 Asymptotic Bregman quantization error\nWe are interested in the asymptotic case, where k \u2192 \u221e.\nFirst note that the quantization error asymptotically approaches zero, as every point x in the support of the distribution is arbitrarily close to a centroid with respect to the Bregman divergence when k is large enough.\nIntuition on the convergence rate. We start by providing an informal intuition for the convergence rate. Assume P has a compact support with a finite volume. Suppose each cluster is a (Bregman) Voronoi cell with typical size \u03b5. Since the total volume of the support does not change, the volume of one cell should be inversely proportional to the number of clusters, \u03b5^d \u223c 1/k. On the other hand, because of Eq.(10), the Bregman divergence between two points in one cell is of the order of the square of the cell size, D_\u03c6(X, a) \u223c \u03b5^2. That implies\n\nV_{k,r,\u03c6}(P) \u223c k^{\u22122r/d} asymptotically.   (14)\n\nWe will now focus on making this intuition precise and on deriving an expression for the coefficient of the leading term k^{\u22122r/d} in the quantization error. For now we will keep the assumption that P has compact support, and remove it later on. We only describe the method and display the important results in the following. Please see detailed proofs of these results in the Appendix.\nWe first mention a few useful facts:\n\n\fLemma 1. In the limit of k \u2192 \u221e, each interior point x in the support of P is assigned to an arbitrarily close centroid in the optimal Bregman quantization setting.\nLemma 2. 
If the support of P is convex, \u03c6 is strictly convex on the support, and \u2207^2\u03c6 is uniformly continuous on the support, then (a): lim_{k\u2192\u221e} k^{2r/d} V_{k,r,\u03c6}(P) exists in (0,\u221e), denoted as Q_{r,\u03c6}(P), and (b):\n\nQ_{r,\u03c6}(P) = lim_{k\u2192\u221e} k^{2r/d} inf_{\u03b1(|\u03b1|=k)} E_P [ min_{a\u2208\u03b1} ( (1/2) (X \u2212 a)^T \u2207^2\u03c6(a) (X \u2212 a) )^r ].   (15)\n\nRemark: 1. Since Q_{r,\u03c6}(P) is finite, part (a) of Lemma 2 proves our intuition on the convergence rate, Eq.(14). 2. In Eq.(15), it does not matter whether \u2207^2\u03c6 is evaluated at a, at x, or even at any point between x and a, as long as \u2207^2\u03c6 has finite values at that point.\nCoefficient of the Bregman quantization error. We evaluate the coefficient of the quantization error Q_{r,\u03c6}(P), based on Eq.(15). What makes this analysis challenging is that, unlike the Euclidean case, the general Bregman error does not satisfy translational invariance and scaling properties. For example, Lemma 3.2 in [9] does not hold for general Bregman divergences. We follow this approach: First, dice the support of P into infinitesimal cubes {A_l} with edges parallel to the axes, where l is the index of the cell. In each cell, we approximate the Hessian by a constant matrix \u2207^2\u03c6(z_l), where z_l is a fixed point located in the cell. Therefore, evaluating the Bregman quantization error within each cell reduces to a Euclidean quantization problem, with the existing result Eq.(4). Summing up appropriately over the cubes then gives the total quantization error.\nWe start from Eq.(15), and introduce the following notation: denote s_l = P(A_l), the conditional density on A_l as P(\u00b7|A_l), \u03b1_l = \u03b1 \u2229 A_l the set of centroids located in A_l, k_l = |\u03b1_l| the size of \u03b1_l, and the ratio v_l = k_l/k. 
Following the above intuition and noting that P = \u2211_l P(A_l) P(\u00b7|A_l), Q_{r,\u03c6}(P) is approximated by\n\nQ_{r,\u03c6}(P, {v_l}) \u223c \u2211_l s_l v_l^{\u22122r/d} Q_{r,Mh,l}(P(\u00b7|A_l)),   (16)\n\nQ_{r,Mh,l}(P(\u00b7|A_l)) = lim_{k_l\u2192\u221e} k_l^{2r/d} inf_{\u03b1_l(|\u03b1_l|=k_l)} E_{P(\u00b7|A_l)} [ min_{a\u2208\u03b1_l} ( (1/2) (X \u2212 a)^T \u2207^2\u03c6(z_l) (X \u2212 a) )^r ],   (17)\n\nwhere Q_{r,Mh,l}(P(\u00b7|A_l)) is the coefficient of the asymptotic Mahalanobis quantization error with Mahalanobis matrix \u2207^2\u03c6(z_l), evaluated on A_l with density P(\u00b7|A_l). It can be shown that the approximation error of Q_{r,\u03c6}(P) converges to zero in the limits of k \u2192 \u221e and then the cell size to zero.\nIn each cell A_l, P(\u00b7|A_l) is further approximated by the uniform density U(A_l) = 1/V_l, and the Hessian \u2207^2\u03c6(z_l), as a constant, is absorbed by performing a coordinate transformation. Then Q_{r,Mh,l}(U(A_l)) reduces to a squared Euclidean quantization error. By applying Eq.(4), we show that\n\nQ_{r,Mh,l}(U(A_l)) = (1/2^r) Q_{2r}([0, 1]^d) \u03b4^{2r} [det \u2207^2\u03c6(z_l)]^{r/d},   (18)\n\nwhere \u03b4 is the size of the cube, and Q_{2r}([0, 1]^d) is again the constant in Eq.(4).\nCombining Eq.(17) and Eq.(18), Q_{r,\u03c6}(P) is approximated by\n\nQ_{r,\u03c6}(P, {v_l}) \u223c (1/2^r) Q_{2r}([0, 1]^d) \u03b4^{2r} \u2211_l s_l v_l^{\u22122r/d} [det \u2207^2\u03c6(z_l)]^{r/d}.   (19)\n\nThe proportion v_l of centroids within A_l is still undetermined. The following lemma provides an optimal configuration of {v_l} that minimizes Q_{r,\u03c6}(P, {v_l}):\nLemma 3. Let B = {(v_1, ..., v_L) \u2208 (0,\u221e)^L : \u2211_{l=1}^L v_l = 1}, and define\n\nv*_l = s_l^{d/(d+2r)} [det \u2207^2\u03c6(z_l)]^{r/(d+2r)} / \u2211_l s_l^{d/(d+2r)} [det \u2207^2\u03c6(z_l)]^{r/(d+2r)};   (20)\n\nthen for the function\n\nF(v_1, ..., v_L) = \u2211_{l=1}^L s_l v_l^{\u22122r/d} [det \u2207^2\u03c6(z_l)]^{r/d},   (21)\n\n\fF(v*_1, ..., v*_L) = min_{(v_1,...,v_L)\u2208B} F(v_1, ..., v_L) = ( \u2211_l s_l^{d/(d+2r)} [det \u2207^2\u03c6(z_l)]^{r/(d+2r)} )^{(d+2r)/d}.   (22)\n\nLemma 3 finds the optimal configuration of {v_l} in Eq.(19). Recalling that the quantization error is defined as the infimum over all possible configurations, we have our main result:\nTheorem 1. Suppose E\u2016X\u2016^{2r+\u03b5} < \u221e for some \u03b5 > 0 and \u2207^2\u03c6 is uniformly continuous on the support of P. Then\n\nQ_{r,\u03c6}(P) = (1/2^r) Q_{2r}([0, 1]^d) \u2016(det \u2207^2\u03c6)^{r/d} P\u2016_{d/(d+2r)}.   (23)\n\nRemark: 1. In the Euclidean quantization case, where \u03c6(z) = \u2016z\u2016^2, Eq.(23) reduces to Eq.(4), noting that \u2207^2\u03c6 = 2I. Bregman quantization, which is more general than Euclidean quantization, has a result that is quite similar to Eq.(4), differing by a det \u2207^2\u03c6-related term.\n\n3.2 The Limit Distribution of Centroids\n\nSimilar to Euclidean clustering, Bregman clustering also outputs k discrete cluster centroids, which can be interpreted as a discrete measure. 
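Before turning to the limit, it is worth seeing the Bregman hard clustering algorithm of Section 2.2 concretely: assignments use the chosen divergence D_\u03c6, while the update is the plain mean, by Eq.(11). A minimal sketch for the generalized KL divergence on positive data (our illustration, not Algorithm 1 of [3] verbatim; the iteration count and initialization are arbitrary):

```python
import numpy as np

def kl_div(x, c):
    """Generalized KL divergence from Eq.(9), for positive vectors."""
    return np.sum(x * np.log(x / c) - (x - c), axis=-1)

def bregman_hard_clustering(X, k, div=kl_div, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment: minimize D_phi(x_i, a_j); the argument order matters,
        # since Bregman divergences are generally asymmetric.
        d = div(X[:, None, :], centers[None, :, :])
        labels = d.argmin(axis=1)
        # Update: by Eq.(11) the plain mean is optimal for ANY D_phi.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

Swapping `div` for another Bregman divergence changes only the assignment step; the mean update stays the same, which is the point of Eq.(11).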
Below we show that in the limit this discrete measure coincides with a continuous measure defined in terms of the probability density P.\nDefine P_{r,\u03c6} to be the integrand function in Eq.(23) (with a normalization factor N),\n\nP_{r,\u03c6} = N \u00b7 (det \u2207^2\u03c6)^{r/(d+2r)} P^{d/(d+2r)}.   (24)\n\nThe following theorem claims that P_{r,\u03c6} is exactly the continuous distribution we are looking for:\nTheorem 2. Suppose P is absolutely continuous with respect to the Lebesgue measure \u03bb_d. Let \u03b1_k be an asymptotically k-optimal set of centers for P of order r based on D_\u03c6. Define the measure P_{r,\u03c6} := P_{r,\u03c6} \u03bb_d; then\n\n(1/k) \u2211_{a\u2208\u03b1_k} \u03b4_a \u2192 P_{r,\u03c6} (weakly).   (25)\n\nRemark: As before, P_{r,\u03c6} denotes the measure while P_{r,\u03c6} is the corresponding density function. The proof of the theorem can be found in the appendix.\nExample 1: Clustering with squared Euclidean distance (Graf and Luschgy [9]). Squared Euclidean distance is an instance of Bregman divergence, with \u03c6(z) = \u2211_i z_i^2. Graf and Luschgy proved that the asymptotic centroid distribution is\n\nP_{r,EU}(z) \u223c P^{d/(d+2r)}(z).   (26)\n\nExample 2: Clustering with Mahalanobis distance. Mahalanobis distance is another instance of Bregman divergence, with \u03c6(z) = z^T A z, A \u2208 R^{d\u00d7d}. The Hessian \u2207^2\u03c6 = 2A is a constant matrix, so the asymptotic centroid distribution is the same as that of squared Euclidean distance:\n\nP_{r,Mh}(z) \u223c P^{d/(d+2r)}(z).   (27)\n\nExample 3: Clustering with Kullback-Leibler divergence. The convex function used to define the Kullback-Leibler divergence is the negative Shannon entropy defined on the domain \u2126 \u2286 R^d_+,\n\n\u03c6_KL(z) = \u2211_i (z_i ln z_i \u2212 z_i),   (28)\n\nwith component index i. Then the Hessian matrix is\n\n\u2207^2\u03c6_KL(z) = diag(1/z_1, 1/z_2, ..., 1/z_d).   (29)\n\n\fAccording to Eq. 
(24), the centroid density distribution function is\n\nP_{r,KL}(z) \u223c P^{d/(d+2r)}(z) ( \u220f_i z_i )^{\u2212r/(d+2r)}.   (30)\n\nExample 4: Clustering with Itakura-Saito divergence. The Itakura-Saito divergence uses the Burg entropy as \u03c6,\n\n\u03c6_IS(z) = \u2212\u2211_i ln z_i,   z \u2208 R^d_+,   (31)\n\nwith component index i. Then the Hessian matrix is\n\n\u2207^2\u03c6_IS(z) = diag(1/z_1^2, 1/z_2^2, ..., 1/z_d^2).   (32)\n\nAccording to Eq. (24), the centroid density distribution function is\n\nP_{r,IS}(z) \u223c P^{d/(d+2r)}(z) ( \u220f_i z_i^2 )^{\u2212r/(d+2r)}.   (33)\n\nExample 5: Clustering with Norm-like divergence. The convex function is \u03c6(z) = \u2211_i z_i^\u03b1, z \u2208 R^d_+, with power \u03b1 \u2265 2. A simple calculation shows that the divergence reduces to the squared Euclidean distance when \u03b1 = 2. However, the divergence is no longer Euclidean-like as soon as \u03b1 > 2:\n\nD_NL(p, q) = \u2211_i ( p_i^\u03b1 + (\u03b1 \u2212 1) q_i^\u03b1 \u2212 \u03b1 p_i q_i^{\u03b1\u22121} ).   (34)\n\nWith some calculation, we have\n\nP_{r,NL}(z) \u223c P^{d/(d+2r)}(z) ( \u220f_i z_i )^{(\u03b1\u22122)r/(d+2r)}.   (35)\n\nRemark: It is easy to see that Kullback-Leibler and Itakura-Saito quantization tend to move centroids closer to the axes, while norm-like quantization, when \u03b1 > 2, does the opposite, moving centroids away from the axes.\n\n4 Experiments\n\nIn this section, we verify our results, especially the centroid location distribution Eq.(24), using the Bregman hard clustering algorithm.\nRecall that our results are obtained in a limiting case, where we first take the size of the dataset n \u2192 \u221e and then the number of clusters k \u2192 \u221e. However, the size of real data is finite, and it is also not practical to apply Bregman clustering algorithms in the asymptotic case. 
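For the diagonal Hessians in Examples 3\u20135, the z-dependence of Eq.(24) is a single power of \u220f_i z_i, so the quoted exponents can be sanity-checked mechanically. A small script (our addition, not code from the paper; `hess_power` p means the Hessian is diag(c \u00b7 z_i^p) for some constant c, which is absorbed into the normalization N):

```python
from fractions import Fraction

def z_exponent(d, r, hess_power):
    """Exponent of each z_i in Eq.(24): det(Hessian) ~ (prod z_i)**hess_power,
    so (det Hess)^(r/(d+2r)) contributes hess_power * r / (d + 2r)."""
    return Fraction(hess_power * r, d + 2 * r)

# Example 3 (KL):            Hessian diag(1/z_i)               -> hess_power = -1
# Example 4 (Itakura-Saito): Hessian diag(1/z_i**2)            -> hess_power = -2
# Example 5 (norm-like):     Hessian diag(a(a-1) z_i**(a-2))   -> hess_power = alpha - 2
```

For d = 1, r = 1 this gives z^{-1/3} for KL and z^{1/3} for the norm-like divergence with \u03b1 = 3, matching the one-dimensional Case 1 below, and the \u00b11/4 exponents of the two-dimensional Case 2.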
In this section, we simply sample data points from a given distribution, with the dataset size large enough compared to k to avoid early stopping of Bregman clustering. In addition, we only verify r = 1 cases here, since the Bregman clustering algorithm, which utilizes Lloyd\u2019s method, cannot address Bregman quantization problems with r \u2260 1.\n\nFigure 1: First row: predicted distribution functions of centroids by Eq.(36,37,38) (panels: squared Euclidean, Kullback-Leibler, norm-like (\u03b1 = 3)); second row: experimental histograms of centroid locations, obtained by applying the corresponding Bregman hard clustering algorithms.\n\n\fCase 1 (1-dimensional): Suppose the density P is uniform over [0, 1]. We set the number of clusters k = 81, and apply different versions of the Bregman hard clustering algorithm on this sample: standard k-means, Kullback-Leibler clustering, and norm-like clustering. According to Eq.(26), Eq.(30) and Eq.(35), the theoretical predictions of the centroid location distribution functions in this case are:\n\nP_{1,EU}(z) = 1,   z \u2208 [0, 1];   (36)\nP_{1,KL}(z) \u223c z^{\u22121/3},   z \u2208 (0, 1];   (37)\nP_{1,NL}(z) \u223c z^{1/3},   z \u2208 [0, 1];   (38)\n\nand P(z) = 0 elsewhere.\nFigure 1 shows, in the first row, the theoretical prediction of the distribution of centroids, and in the second row, experimental histograms of centroid locations for the different Bregman quantization problems.\nCase 2 (2-dimensional): The density P = U([0, 1]^2). Set k = 81 and apply the same three Bregman clustering algorithms as in case 1. 
Theoretical predictions of the distribution of centroids for this case by Eq.(26), Eq.(30) and Eq.(35) are as follows, also shown in Figure 2:\n\nP_{1,EU}(z) = 1,   z = (z_1, z_2) \u2208 [0, 1]^2;   (39)\nP_{1,KL}(z) \u223c (z_1 z_2)^{\u22121/4},   z = (z_1, z_2) \u2208 (0, 1]^2;   (40)\nP_{1,NL}(z) \u223c (z_1 z_2)^{1/4},   z = (z_1, z_2) \u2208 [0, 1]^2;   (41)\n\nand P(z) = 0 elsewhere.\nFigure 2, in the first row, shows a visualization of the centroid locations generated by the experiments. For comparison, the second row of Figure 2 presents 3-d plots of the theoretical predictions of the distribution of centroids. In each of the 3-d plots, the function is plotted over the square [0, 1]^2, with the leftmost corner corresponding to the point (0, 0).\nIt is easy to see that squared Euclidean quantization, in this case, results in a uniform distribution of centroids, that Kullback-Leibler quantization tends to attract centroids towards the axes, and that norm-like quantization tends to repel centroids away from the axes.\n\nFigure 2: Experimental results and theoretical predictions of the centroid distribution for Case 2 (panels: squared Euclidean, Kullback-Leibler, norm-like (\u03b1 = 3)). In each of the 3-d plots, the function is plotted over the square [0, 1]^2, with the leftmost corner corresponding to the point (0, 0) and the rightmost corner to the point (1, 1).\n\n5 Conclusion\n\nIn this paper, we analyzed asymptotic Bregman quantization problems for general Bregman divergences. We obtained explicit expressions for both the leading order of the asymptotic quantization error and the locational distribution of centroids, both of which extend the classical results for k-means quantization. We showed how our results apply to commonly used Bregman divergences, and gave some experimental verification. We hope these results will provide guidance and insight for further theoretical analysis of Bregman clustering, such as Bregman soft clustering and other related methods [3, 11], as well as for practical algorithm design and applications. 
Our results can also lead to a better understanding of the existing seeding strategies for Bregman clustering [19] and to new seeding methods.\n\nAcknowledgement\nWe thank the National Science Foundation for financial support and Brian Kulis for discussions.\n\n\fReferences\n[1] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245\u2013248, 2009.\n[2] K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. IPPS/SPDP Workshop on High Performance Data Mining, 1998.\n[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n[4] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200\u2013217, 1967.\n[5] J. A. Bucklew and G. L. Wise. Multidimensional asymptotic quantization theory with r-th power distortion measures. Information Theory, IEEE Transactions on, 28(2):239\u2013247, 1982.\n[6] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561\u2013580. Springer, 2012.\n[7] A. Fischer. Quantization and clustering with Bregman divergences. Journal of Multivariate Analysis, 101(9):2207\u20132221, 2010.\n[8] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression, volume 159. Springer Science & Business Media, 2012.\n[9] S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2000.\n[10] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. 
ACM computing surveys (CSUR),\n\n31(3):264\u2013323, 1999.\n\n[11] K. Jiang, B. Kulis, and M. I. Jordan. Small-variance asymptotics for exponential family Dirichlet process\n\nmixture models. In Advances in Neural Information Processing Systems, pages 3158\u20133166, 2012.\n\n[12] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344.\n\nJohn Wiley & Sons, 2009.\n\n[13] K. Krishna and M. N. Murty. Genetic k-means algorithm. Systems, Man, and Cybernetics, Part B:\n\nCybernetics, IEEE Transactions on, 29(3):433\u2013439, 1999.\n\n[14] T. Linder. On asymptotically optimal companding quantization. Problems of Control and Information\n\nTheory, 20(6):475\u2013484, 1991.\n\n[15] S. P. Lloyd. Least squares quantization in PCM. Info. Theory, IEEE Transactions on, 28(2):129\u2013137, 1982.\n\n[16] J. MacQueen. Some methods for classi\ufb01cation and analysis of multivariate observations. In Proceedings\nof the \ufb01fth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281\u2013297.\nOakland, CA, USA., 1967.\n\n[17] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. In WALCOM:\n\nAlgorithms and Computation, pages 274\u2013285. Springer, 2009.\n\n[18] D. E. McClure. Nonlinear segmented function approximation and analysis of line patterns. Quarterly of\n\nApplied Mathematics, 33(1):1\u201337, 1975.\n\n[19] R. Nock, P. Luosto, and J. Kivinen. Mixed Bregman clustering with approximation guarantees. In Joint\n\nECML and KDD, pages 154\u2013169. Springer, 2008.\n\n[20] P. Panter and W. Dite. Quantization distortion in pulse-count modulation with nonuniform spacing of\n\nlevels. Proceedings of the IRE, 39(1):44\u201348, 1951.\n\n[21] Q. Que and M. Belkin. Back to the future: Radial basis function networks revisited. In Proceedings of the\n\n19th International Conference on Arti\ufb01cial Intelligence and Statistics, pages 1375\u20131383, 2016.\n\n[22] M. 
Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. arXiv preprint arXiv:1206.6446, 2012.\n\n[23] K. Wagstaff, C. Cardie, S. Rogers, S. Schr\u00f6dl, et al. Constrained k-means clustering with background\n\nknowledge. In ICML, volume 1, pages 577\u2013584, 2001.\n\n[24] J. Zhang and C. Zhang. Multitask Bregman clustering. Neurocomputing, 74(10):1720\u20131734, 2011.\n\n9\n\n\f", "award": [], "sourceid": 1227, "authors": [{"given_name": "Chaoyue", "family_name": "Liu", "institution": "The Ohio State University"}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": "Ohio State University"}]}