{"title": "Mixture Modeling by Affinity Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 386, "abstract": "", "full_text": "Mixture Modeling by Af\ufb01nity Propagation\n\nBrendan J. Frey and Delbert Dueck\n\nUniversity of Toronto\n\nSoftware and demonstrations available at www.psi.toronto.edu\n\nAbstract\n\nClustering is a fundamental problem in machine learning and has been\napproached in many ways. Two general and quite different approaches\ninclude iteratively \ufb01tting a mixture model (e.g., using EM) and linking to-\ngether pairs of training cases that have high af\ufb01nity (e.g., using spectral\nmethods). Pair-wise clustering algorithms need not compute suf\ufb01cient\nstatistics and avoid poor solutions by directly placing similar examples\nin the same cluster. However, many applications require that each cluster\nof data be accurately described by a prototype or model, so af\ufb01nity-based\nclustering \u2013 and its bene\ufb01ts \u2013 cannot be directly realized. We describe a\ntechnique called \u201caf\ufb01nity propagation\u201d, which combines the advantages\nof both approaches. The method learns a mixture model of the data by\nrecursively propagating af\ufb01nity messages. We demonstrate af\ufb01nity prop-\nagation on the problems of clustering image patches for image segmen-\ntation and learning mixtures of gene expression models from microar-\nray data. 
We find that affinity propagation obtains better solutions than mixtures of Gaussians, the K-medoids algorithm, spectral clustering and hierarchical clustering, and that it can both find a pre-specified number of clusters and automatically determine the number of clusters. Interestingly, affinity propagation can be viewed as belief propagation in a graphical model that accounts for pairwise training case likelihood functions and the identification of cluster centers.

1 Introduction
Many machine learning tasks involve clustering data using a mixture model, so that the data in each cluster is accurately described by a probability model from a pre-defined, possibly parameterized, set of models [1]. For example, words can be grouped according to common usage across a reference set of documents, and segments of speech spectrograms can be grouped according to similar speaker and phonetic unit. As researchers increasingly confront more challenging and realistic problems, the appropriate class-conditional models become more sophisticated and much more difficult to optimize.

By marginalizing over hidden variables, we can still view many hierarchical learning problems as mixture modeling, but the class-conditional models become complicated and nonlinear. While such class-conditional models may more accurately describe the problem at hand, the optimization of the mixture model often becomes much more difficult. Exact computation of the data likelihoods may not be feasible, and neither may exact computation of the sufficient statistics needed to update parameterized models. 
Further, the complexity of the model and the approximations used for the likelihoods and the sufficient statistics often produce an optimization surface with a large number of poor local minima.

A different approach to clustering ignores the notion of a class-conditional model, and links together pairs of data points that have high affinity. The affinity or similarity (a real number in [0, 1]) between two training cases gives a direct indication of whether they should be in the same cluster. Hierarchical clustering and its Bayesian variants [2] form a popular family of affinity-based clustering techniques, whereby a binary tree is constructed greedily from the leaves to the root, by recursively linking together pairs of training cases with high affinity. Another popular method uses a spectral decomposition of the normalized affinity matrix [4]. Viewing affinities as transition probabilities in a random walk on data points, modes of the affinity matrix correspond to clusters of points that are isolated in the walk [3, 5].

We describe a new method that, for the first time to our knowledge, combines the advantages of model-based clustering and affinity-based clustering. Unlike previous techniques that construct and learn probability models of transitions between data points [6, 7], our technique learns a probability model of the data itself. Like affinity-based clustering, our algorithm directly examines pairs of nearby training cases to help ascertain whether or not they should be in the same cluster. However, like model-based clustering, our technique uses a probability model that describes the data as a mixture of class-conditional distributions. 
Our method, called “affinity propagation”, can be viewed as the sum-product algorithm or the max-product algorithm in a graphical model describing the mixture model.

2 A greedy algorithm: K-medoids
The first step in obtaining the benefit of pair-wise training case comparisons is to replace the parameters of the mixture model with pointers into the training data. A similar representation is used in K-medians clustering or K-medoids clustering, where the goal is to identify K training cases, or exemplars, as cluster centers. Exact learning is known to be NP-hard (c.f. [8]), but a hard-decision algorithm can be used to find approximate solutions. While the algorithm makes greedy hard decisions for the cluster centers, it is a useful intermediate step in introducing affinity propagation.

For training cases x1, . . . , xN, suppose the likelihood of training case xi given that training case xk is its cluster center is P(xi | xi in xk) (e.g., a Gaussian likelihood would have the form e^−(xi−xk)²/2σ² / √(2πσ²)). Given the training data, this likelihood depends only on i and k, so we denote it by Lik. Lii is set to the Bayesian prior probability that xi is a cluster center. Initially, K training cases are chosen as exemplars, e.g., at random. Denote the current set of cluster center indices by K and the index of the current cluster center for xi by si. K-medoids iterates between assigning training cases to exemplars (E step), and choosing a training case as the new exemplar for each cluster (M step). Assuming for simplicity that the mixing proportions are equal and denoting the responsibility likelihood ratio by rik = P(xi | xi in xk)/P(xi | xi not in xk)¹, the updates are

E step
For i = 1, . . . , N:
  For k ∈ K: rik ← Lik / (Σ_{j:j≠k} Lij)
  si ← argmax_{k∈K} rik

Greedy M step
For k ∈ K: Replace k in K with argmax_{j:sj=k} (Π_{i:si=k} Lij)

This algorithm nicely replaces parameter-to-training case comparisons with pair-wise training case comparisons. However, in the greedy M step, specific training cases are chosen as exemplars. By not searching over all possible combinations of exemplars, the algorithm will frequently find poor local minima. We now introduce an algorithm that does approximately search over all possible combinations of exemplars.

¹Note that using the traditional definition of responsibility, rik ← Lik/(Σj Lij), will give the same decisions as using the likelihood ratio.

3 Affinity propagation
The responsibilities in the greedy K-medoids algorithm can be viewed as messages that are sent from training cases to potential exemplars, providing soft evidence of the preference for each training case to be in each exemplar. To avoid making hard decisions for the cluster centers, we introduce messages called “availabilities”. Availabilities are sent from exemplars to training cases and provide soft evidence of the preference for each exemplar to be available as a center for each training case.

Responsibilities are computed using likelihoods and availabilities, and availabilities are computed using responsibilities, recursively. We refer to both responsibilities and availabilities as affinities, and we refer to the message-passing scheme as affinity propagation. Here, we explain the update rules; in the next section, we show that affinity propagation can be derived as the sum-product algorithm in a graphical model describing the mixture model. Denote the availability sent from candidate exemplar xk to training case xi by aki. Initially, these messages are set equal, e.g., aki = 1 for all i and k. Then, the affinity propagation update rules are recursively applied:

Responsibility updates
  rik ← Lik / (Σ_{j:j≠k} aij Lij)

Availability updates
  akk ← Π_{j:j≠k} (1 + rjk) − 1
  aki ← 1 / ( (1/rkk) Π_{j:j≠k,j≠i} (1 + rjk)^−1 + 1 − Π_{j:j≠k,j≠i} (1 + rjk)^−1 )

The first update rule is quite similar to the update used in EM, except the likelihoods used to normalize the responsibilities are modulated by the availabilities of the competing exemplars. In this rule, the responsibility of a training case xi as its own cluster center, rii, is high if no other exemplars are highly available to xi and if xi has high probability under the Bayesian prior, Lii.

The second update rule also has an intuitive explanation. The availability of a training case xk as its own exemplar, akk, is high if at least one other training case places high responsibility on xk being an exemplar. The availability of xk as an exemplar for xi, aki, is high if the self-responsibility rkk is high (1/rkk − 1 approaches −1), but is decreased if other training cases compete in using xk as an exemplar (the term 1/rkk − 1 is scaled down if rjk is large for some other training case xj).

Messages may be propagated in parallel or sequentially. In our implementation, each candidate exemplar absorbs and emits affinities in parallel, and the centers are ordered according to the sum of their likelihoods, i.e., Σi Lik. Direct implementation of the above propagation rules gives an N²-time algorithm, but affinities need only be propagated between i and k if Lik > 0. In practice, likelihoods below some threshold can be set to zero, leading to a sparse graph on which affinities are propagated.

Affinity propagation accounts for a Bayesian prior pdf on the exemplars and is able to automatically search over the appropriate number of exemplars. (Note that the number of exemplars is not pre-specified in the above updates.) 
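A minimal NumPy sketch may help make the update rules above concrete. This is our own illustration, not the authors' implementation: it assumes a dense, strictly positive likelihood matrix L (with the exemplar priors on the diagonal), runs a fixed number of parallel sweeps, and reads off each case's exemplar as the candidate with the largest responsibility.

```python
import numpy as np

def affinity_propagation(L, n_iter=200):
    # Sketch of the multiplicative updates above (not the authors' code).
    # L[i, k]: likelihood of case i given candidate exemplar k; L[i, i]: prior.
    # A[i, k] stores the availability a_ki sent from candidate k to case i.
    N = L.shape[0]
    A = np.ones((N, N))  # availabilities initialized to 1
    for _ in range(n_iter):
        # Responsibilities: r_ik = L_ik / sum_{j != k} a_ij L_ij
        M = A * L
        R = L / (M.sum(axis=1, keepdims=True) - M)  # row sum minus the j = k term
        newA = np.empty_like(A)
        for k in range(N):
            p = 1.0 + R[:, k]
            p[k] = 1.0                   # exclude j = k from the products
            prod_all = p.prod()
            newA[k, k] = prod_all - 1.0  # a_kk = prod_{j != k} (1 + r_jk) - 1
            for i in range(N):
                if i == k:
                    continue
                inv_q = p[i] / prod_all  # = prod_{j != k, j != i} (1 + r_jk)^-1
                newA[i, k] = 1.0 / ((1.0 / R[k, k]) * inv_q + 1.0 - inv_q)
        A = newA
    # Each case picks the candidate with the largest responsibility; cases
    # with s_i = i act as the exemplars.
    return R.argmax(axis=1)
```

On well-separated toy data the resulting assignments respect the group structure; in practice the likelihoods would be thresholded to a sparse graph, as described above.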
In applications where a particular number of clusters is desired, the update rule for the responsibilities (in particular, the self-responsibilities rkk, which determine the availabilities of the exemplars) can be modified, as described in the next section. Later, we describe applications where K is pre-specified and where K is automatically selected by affinity propagation.

The affinity propagation update rules can be derived as an instance of the sum-product (“loopy BP”) algorithm in a graphical model.

Figure 1: Affinity propagation can be viewed as belief propagation in this factor graph.

Using si to denote the index of the exemplar for xi, the product of the likelihoods of the training cases and the priors on the exemplars is Π_{i=1}^N L_{i si}. (If si = i, xi is an exemplar with a priori pdf Lii.) The set of hidden variables s1, . . . , sN completely specifies the mixture model, but not all configurations of these variables are allowed: si = k (xi in cluster xk) implies sk = k (xk is an exemplar), and sk = k (xk is an exemplar) implies si = k for some i ≠ k (some other training case is in cluster xk). The global indicator function for the satisfaction of these constraints can be written Π_{k=1}^N fk(s1, . . . , sN), where fk is the constraint for candidate cluster xk:

fk(s1, . . . , sN) = 0 if sk = k and si ≠ k for all i ≠ k;
                    0 if sk ≠ k and si = k for some i ≠ k;
                    1 otherwise.

Thus, the joint distribution of the mixture model and data factorizes as follows:

P = Π_{i=1}^N L_{i si} · Π_{k=1}^N fk(s1, . . . , sN).

The factor graph [10] in Fig. 1 describes this factorization. Each black box corresponds to a term in the factorization, and it is connected to the variables on which the term depends.

While exact inference in this factor graph is NP-hard, approximate inference algorithms can be used to infer the s variables. 
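As a concrete check of the constraint, the following small Python sketch (our own illustration, with hypothetical helper names) evaluates fk and the joint P for a candidate configuration s; any configuration violating the exemplar-consistency constraints receives probability zero.

```python
import numpy as np

def f_k(s, k):
    # Constraint for candidate cluster x_k: s_k = k (k is an exemplar)
    # exactly when some other case i has s_i = k.
    others = any(s[i] == k for i in range(len(s)) if i != k)
    if s[k] == k and not others:
        return 0.0
    if s[k] != k and others:
        return 0.0
    return 1.0

def joint(L, s):
    # P = prod_i L_{i, s_i} * prod_k f_k(s_1, ..., s_N)
    p = np.prod([L[i, s[i]] for i in range(len(s))])
    return p * np.prod([f_k(s, k) for k in range(len(s))])
```

For example, with three cases the configuration s = (1, 1, 1) is valid (case 1 is an exemplar serving the others), while s = (1, 2, 3) makes every case a memberless exemplar and gets probability zero.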
It is straightforward to show that the updates for affinity propagation correspond to the message updates for the sum-product algorithm or loopy belief propagation (see [10] for a tutorial). The responsibilities correspond to messages sent from the s's to the f's, while the availabilities correspond to messages sent from the f's to the s's. If the goal is to find K exemplars, an additional constraint g(s1, . . . , sN) = [K = Σ_{k=1}^N [sk = k]] can be included, where [ ] indicates Iverson's notation ([true] = 1 and [false] = 0). Messages can be propagated through this function in linear time, by implementing it as a Markov chain that accumulates exemplar counts.

Max-product affinity propagation. Max-product affinity propagation can be derived as an instance of the max-product algorithm, instead of the sum-product algorithm. The update equations for the affinities are modified and maximizations are used instead of summations. An advantage of max-product affinity propagation is that the algorithm is invariant to multiplicative constants in the log-likelihoods.

4 Image segmentation
A sensible model-based approach to image segmentation is to imagine that each patch in the image originates from one of a small number of prototype texture patches. The main difficulty is that in addition to standard additive or multiplicative pixel-level noise, another prevailing form of noise is due to transformations of the image features, and in particular translations.

Pair-wise affinity-based techniques, and in particular spectral clustering, have been employed with some success [4, 9], with the main disadvantage being that without an underlying model there is no sound basis for selecting good class representatives. Having a model with class representatives enables efficient synthesis (generation) of patches, and classification of test patches – requiring only K comparisons (to class centers) rather than N comparisons (to training cases).

Figure 2: Segmentation of non-aligned gray-scale characters. Patches clustered by affinity propagation and K-medoids are colored according to classification (centers shown below solutions). Affinity propagation achieves a near-best score compared to 1000 runs of K-medoids.

We present results for segmenting two image types. First, as a toy example, we segment an image containing many noisy examples of the letters 'N' 'I' 'P' and 'S' (see Fig. 2). The original image is gray-scale with resolution 216 × 240 and intensities ranging from 0 (background color, white) to 1 (foreground color, black). Each training case xi is a 24 × 24 image patch and xi^m is the mth pixel in the patch. To account for translations, we include a hidden 2-D translation variable T. The match between patch xi and patch xk is measured by Σm xi^m · f^m(xk, T), where f(xk, T) is the patch obtained by applying a 2-D translation T plus cropping to patch xk, and f^m is the mth pixel in the translated, cropped patch. This metric is used in the likelihood function:

Lik ∝ ΣT p(T) e^{β(Σm xi^m·f^m(xk,T))/x̄i} ≈ e^{β maxT(Σm xi^m·f^m(xk,T))/x̄i},

where x̄i = (1/24²) Σm xi^m is used to normalize the match by the amount of ink in xi. β controls how strictly xi should match xk to have high likelihood. Max-product affinity propagation is independent of the choice of β, and for sum-product affinity propagation we quite arbitrarily chose β = 1. The exemplar priors Lkk were set to median_{i,k≠i} Lik.

We cut the image in Fig. 
2 into a 9 × 10 grid of non-overlapping 24 × 24 patches, computed the pair-wise likelihoods, and clustered them into K = 4 classes using the greedy EM algorithm (randomly chosen initial exemplars) and affinity propagation. (Max-product and sum-product affinity propagation yielded identical results.) We then took a much larger set of overlapping patches, classified them into the 4 categories, and then colored each pixel in the image according to the most frequent class for the pixel. The results are shown in Fig. 2. While affinity propagation is deterministic, the EM algorithm depends on initialization. So, we ran the EM algorithm 1000 times and in Fig. 2 we plot the cumulative distribution of the log P scores obtained by EM. The score for affinity propagation is also shown, and achieves near-best performance (98th percentile).

We next analyzed the more natural 192 × 192 image shown in Fig. 3. Since there is no natural background color, we use mean-squared pixel differences in HSV color space to measure similarity between the 24 × 24 patches:

Lik ∝ e^{−β minT Σ_{m∈W} (xi^m − f^m(xk,T))²},

where W is the set of indices corresponding to a 16 × 16 window centered in the patch and f^m(xk, T) is the same as above. As before, we arbitrarily set β = 1 and Lkk to median_{i,k≠i} Lik.

Figure 3: Segmentation results for several methods applied to a natural image. For methods other than affinity propagation, many parameter settings were tried and the best segmentation selected. The histograms show the percentile in score achieved by affinity propagation compared to 1000 runs of greedy EM, for different random training sets.

We cut the image in Fig. 3 into an 8 × 8 grid of non-overlapping 24 × 24 patches and clustered them into K = 6 classes using affinity propagation (both forms), greedy EM in our model, spectral clustering (using a normalized L-matrix based on a set of 29 × 29 overlapping patches), and mixtures of Gaussians². For greedy EM, the affinity propagation algorithms, and mixtures of Gaussians, we then chose all possible 24 × 24 overlapping patches and calculated their likelihoods given each of the 6 cluster centers, classifying each patch according to its maximum likelihood.

Fig. 3 shows the segmentations for the various methods, where the central pixel of each patch is colored according to its class. Again, affinity propagation achieves a solution that is near-best compared to one thousand runs of greedy EM.

5 Learning mixtures of gene models
Currently, an important problem in genomics research is the discovery of genes and gene variants that are expressed as messenger RNAs (mRNAs) in normal tissues. In a recent study [11], we used DNA-based techniques to identify 837,251 possible exons (“putative exons”) in the mouse genome. For each putative exon, we used an Agilent microarray probe to measure the amount of corresponding mRNA that was present in each of 12 mouse tissues. Each 12-D vector, called an “expression profile”, can be viewed as a feature vector indicating the putative exon's function. By grouping together feature vectors for nearby probes, we can detect genes and variations of genes. Here, we compare affinity propagation with hierarchical clustering, which was previously used to find gene structures [12].

Fig. 4a shows a normalized subset of the data and gives three examples of groups of nearby

²For spectral clustering, we tried β = 0.5, 1 and 2, and for each of these tried clustering using 6, 8, 10, 12 and 14 eigenvectors. 
We then visually picked the best segmentation (β = 1, 10 eigenvectors). The eigenvector features were clustered using EM in a mixture of Gaussians and, out of 10 trials, the solution with highest likelihood was selected. For mixtures of Gaussians applied directly to the image patches, we picked the model with highest likelihood in 10 trials.

Figure 4: (a) A normalized subset of 837,251 tissue expression profiles – mRNA level versus tissue – for putative exons from the mouse genome (most profiles are much noisier than these). (b) The true exon detection rate (in known genes) versus the false discovery rate, for affinity propagation and hierarchical clustering.

feature vectors that are similar enough to provide evidence of gene units. The actual data is generally much noisier, and includes multiplicative noise (exon probe sensitivity can vary by two orders of magnitude), correlated additive noise (a probe can cross-hybridize in a tissue-independent manner to background mRNA sources), and spurious additive noise (due to a noisy measurement procedure and biological effects such as alternative splicing). To account for noise, false putative exons, and the distance between exons in the same gene, we used the following likelihood function:

Lij = λ e^{−λ|i−j|} ( q·p0(xi) + (1−q) ∫ p(y, z, σ) e^{−(1/2σ²) Σ_{m=1}^{12} (xi^m − (y·xj^m + z))²} / √(2πσ²)^{12} dy dz dσ )
    ≈ λ e^{−λ|i−j|} ( q·p0(xi) + (1−q) max_{y,z,σ} p(y, z, σ) e^{−(1/2σ²) Σ_{m=1}^{12} (xi^m − (y·xj^m + z))²} / √(2πσ²)^{12} ),

where xi^m is the expression level for the mth tissue in the ith probe (in genomic order). We found that in this application, the maximum is a sufficiently good approximation to the integral. 
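The max approximation can be illustrated with a rough Python sketch (our own, not the authors' code) that evaluates Lij by grid search over (y, z, σ). The grid resolutions and the constant stand-in for p0(xi) are arbitrary assumptions; the paper learns p0 as a mixture of Gaussians, and the uniform density p(y, z, σ) is dropped here as a constant factor.

```python
import numpy as np

def gene_likelihood(xi, xj, dist, lam=0.05, q=0.7, p0=1e-3):
    # Sketch of L_ij above, with the integral replaced by a max over a
    # coarse (y, z, sigma) grid; p0 is a constant stand-in for p0(x_i).
    mu = max(xi.max(), xj.max())
    best = 0.0
    for y in np.linspace(0.025, 40, 50):          # multiplicative gain
        for z in np.linspace(-mu, mu, 21):        # additive offset
            resid = np.sum((xi - (y * xj + z)) ** 2)
            for s in np.linspace(0.05 * mu, mu, 20):  # noise scale sigma
                g = np.exp(-resid / (2 * s * s)) / np.sqrt(2 * np.pi * s * s) ** 12
                best = max(best, g)
    return lam * np.exp(-lam * dist) * (q * p0 + (1 - q) * best)
```

A probe pair related by a gain-and-offset transformation should score far higher than an unrelated pair at the same genomic distance.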
The distribution over the distance between probes in the same gene, |i − j|, is assumed to be geometric with parameter λ. p0(xi) is a background distribution that accounts for false putative exons, and q is the probability of a false putative exon within a gene. We assumed y, z and σ are independent and uniformly distributed³. The Bayesian prior probability that xk is an exemplar is set to θ · p0(xk), where θ is a control knob used to vary the sensitivity of the system.

Because of the term λe^{−λ|i−j|} and the additional assumption that genes on the same strand do not overlap, it is not necessary to propagate affinities between all 837,251² pairs of training cases. We assume Lij = 0 for |i − j| > 100, in which case it is not necessary to propagate affinities between xi and xj. The assumption that genes do not overlap implies that if si = k, then sj = k for j ∈ {min(i, k), . . . , max(i, k)}. It turns out that this constraint causes the dependence structure in the update equations for the affinities to reduce to a chain, so affinities need only be propagated forward and backward along the genome. After affinity propagation is used to automatically select the number of mixture components and identify the mixture centers and the probes that belong to them (genes), each probe xi is labeled as an exon or a non-exon depending on which of the two terms in the above likelihood function (q · p0(xi) or the large term to its right) is larger.

³Based on the experimental procedure and a set of previously-annotated genes (RefSeq), we estimated λ = 0.05, q = 0.7, y ∈ [.025, 40], z ∈ [−µ, µ] and σ ∈ (0, µ], where µ = max_{i,m} xi^m. We used a mixture of Gaussians for p0(xi), which was learned from the entire training set.

Fig. 
4b shows the fraction of exons in known genes detected by affinity propagation versus the false detection rate. The curve is obtained by varying the sensitivity parameter, θ. The false detection rate was estimated by randomly permuting the order of the probes in the training set, and applying affinity propagation. Even for quite low false discovery rates, affinity propagation identifies over one third of the known exons. Using a variety of metrics, including the above metric, we also used hierarchical clustering to detect exons. The performance of hierarchical clustering using the metric with highest sensitivity is also shown. Affinity propagation has significantly higher sensitivity, e.g., achieving a five-fold increase in true detection rate at a false detection rate of 0.4%.

6 Computational efficiency
The following table compares the MATLAB execution times of our implementations of the methods we compared on the problems we studied. For methods that first compute a likelihood or affinity matrix, we give the timing of this computation first. Techniques denoted by “*” were run many times to obtain the shown results, but the given time is for a single run.

        Affinity Prop    K-medoids*      Spec Clust*    MOG EM*   Hierarch Clust
NIPS    12.9 s + 2.0 s   12.9 s + .2 s   -              -         -
Dog     12.0 s + 1.5 s   12.0 s + 0.1 s  12.0 s + 29 s  3.3 s     -
Genes   16 m + 43 m      -               -              -         16 m + 28 m

7 Summary
An advantage of affinity propagation is that the update rules are deterministic, quite simple, and can be derived as an instance of the sum-product algorithm in a factor graph. 
Using challenging applications, we showed that affinity propagation obtains better solutions (in terms of percentile log-likelihood, visual quality of image segmentation and sensitivity-to-specificity) than other techniques, including K-medoids, spectral clustering, Gaussian mixture modeling and hierarchical clustering.

To our knowledge, affinity propagation is the first algorithm to combine advantages of pair-wise clustering methods that make use of bottom-up evidence and model-based methods that seek to fit top-down global models to the data.

References
[1] CM Bishop. Neural Networks for Pattern Recognition. Oxford University Press, NY, 1995.
[2] KA Heller, Z Ghahramani. Bayesian hierarchical clustering. ICML, 2005.
[3] M Meila, J Shi. Learning segmentation by random walks. NIPS 14, 2001.
[4] J Shi, J Malik. Normalized cuts and image segmentation. Proc CVPR, 731-737, 1997.
[5] A Ng, M Jordan, Y Weiss. On spectral clustering: Analysis and an algorithm. NIPS 14, 2001.
[6] N Shental, A Zomet, T Hertz, Y Weiss. Pairwise clustering and graphical models. NIPS 16, 2003.
[7] R Rosales, BJ Frey. Learning generative models of affinity matrices. Proc UAI, 2003.
[8] M Charikar, S Guha, A Tardos, DB Shmoys. A constant-factor approximation algorithm for the k-median problem. J Comp and Sys Sci, 65:1, 129-149, 2002.
[9] J Malik et al. Contour and texture analysis for image segmentation. IJCV 43:1, 2001.
[10] FR Kschischang, BJ Frey, H-A Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans Info Theory 47:2, 498-519, 2001.
[11] BJ Frey, QD Morris, M Robinson, TR Hughes. Finding novel transcripts in high-resolution genome-wide microarray data using the GenRate model. Proc RECOMB 2005, 2005.
[12] DD Shoemaker et al. Experimental annotation of the human genome using microarray technology. 
Nature 409, 922-927, 2001.
", "award": [], "sourceid": 2870, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Delbert", "family_name": "Dueck", "institution": null}]}