{"title": "Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3230, "page_last": 3238, "abstract": "We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model\u2019s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model\u2019s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective.", "full_text": "Latent Coincidence Analysis: A Hidden\n\nVariable Model for Distance Metric Learning\n\nMatthew Der and Lawrence K. Saul\n\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\n{mfder,saul}@cs.ucsd.edu\n\nAbstract\n\nWe describe a latent variable model for supervised dimensionality reduction\nand distance metric learning. The model discovers linear projections of high\ndimensional data that shrink the distance between similarly labeled inputs\nand expand the distance between di\ufb00erently labeled ones. The model\u2019s\ncontinuous latent variables locate pairs of examples in a latent space of\nlower dimensionality. 
The model di\ufb00ers signi\ufb01cantly from classical factor\nanalysis in that the posterior distribution over these latent variables is not\nalways multivariate Gaussian. Nevertheless we show that inference is com-\npletely tractable and derive an Expectation-Maximization (EM) algorithm\nfor parameter estimation. We also compare the model to other approaches\nin distance metric learning. The model\u2019s main advantage is its simplicity:\nat each iteration of the EM algorithm, the distance metric is re-estimated\nby solving an unconstrained least-squares problem. Experiments show that\nthese simple updates are highly e\ufb00ective.\n\n1 Introduction\n\nIn this paper we propose a simple but new model to learn informative linear projections\nof multivariate data. Our approach is rooted in the tradition of latent variable modeling,\na popular methodology for discovering low dimensional structure in high dimensional data.\nTwo well-known examples of latent variable models are factor analyzers (FAs), which recover\nsubspaces of high variance [1], and Gaussian mixture models (GMMs), which reveal clusters\nof high density [2]. Here we describe a model that we call latent coincidence analysis (LCA).\nThe goal of LCA is to discover a latent space in which metric distances re\ufb02ect meaningful\nnotions of similarity and di\ufb00erence.\n\nWe apply LCA to two problems in distance metric learning, where the goal is to improve\nthe performance of a classi\ufb01er\u2014typically, a k-nearest neighbor (kNN) classi\ufb01er [3]\u2014by a\nlinear transformation of its input space. 
Several previous methods have been proposed for this problem, including neighborhood component analysis (NCA) [4], large margin nearest neighbor classification (LMNN) [5], and information-theoretic metric learning (ITML) [6]. These methods—all of them successful, all of them addressing the same problem—beg the obvious question: why yet another?

One answer is suggested by the different lineages of previous approaches. NCA was conceived as a supervised counterpart to stochastic neighbor embedding [7], an unsupervised method for dimensionality reduction. LMNN was conceived as a kNN variant of support vector machines [8]. ITML evolved from earlier work in Bregman optimizations—that of minimizing the LogDet divergence subject to linear constraints [9]. Perhaps it is due to these different lineages that none of these methods completely dominates the others. They all offer improvements in kNN classification, yet arguably their larger worth stems from the related work they have inspired in other areas of pattern recognition. Distance metric learning is a fundamental problem, and the more solutions we have, the better equipped we are to solve its myriad variations.

Figure 1: Bayesian network for latent coincidence analysis. The inputs x, x′ ∈ ℝ^d are mapped into Gaussian latent variables z, z′ ∈ ℝ^p whose statistics are parameterized by the linear transformation W ∈ ℝ^{p×d} and noise level σ. Coincidence in the latent space at length scale κ is detected by the binary variable y ∈ {0, 1}. Observed nodes are shaded.

It is in this spirit that we revisit the problem of distance metric learning in the venerable tradition of latent variable modeling. 
We believe that LCA, like factor analysis and Gaussian mixture modeling, is the simplest latent variable model that can be imagined for its purpose. In particular, the inference in LCA (though not purely Gaussian) is tractable, and the distance metric is re-estimated at each iteration of its EM algorithm by a simple least-squares update. This update has stronger guarantees of convergence than the gradient-based methods in NCA; it also sidesteps the large number of linear inequality constraints that appear in the optimizations for LMNN and ITML. For all these reasons, we believe that LCA deserves to be widely known.

2 Model

We begin by describing the probabilistic model for LCA. Fig. 1 shows the model's representation as a Bayesian network. There are three observed variables: the inputs x, x′ ∈ ℝ^d, which we always imagine to be observed in pairs, and the binary label y ∈ {0, 1}, which indicates if the inputs map (or are desired to be mapped) to nearby locations in a latent space of equal or reduced dimensionality p ≤ d. These locations are in turn represented by the Gaussian latent variables z, z′ ∈ ℝ^p.

Each node in the Bayesian network is conditionally dependent on its parents. The conditional distributions P(z|x) and P(z′|x′) are parameterized by a linear transformation W ∈ ℝ^{p×d} (from the input space to the latent space) and a noise level σ². They take the simple Gaussian form:

P(z|x) = (2πσ²)^{−p/2} exp( −‖z − Wx‖² / (2σ²) ),   (1)
P(z′|x′) = (2πσ²)^{−p/2} exp( −‖z′ − Wx′‖² / (2σ²) ).   (2)

Finally, the binary label y ∈ {0, 1} is used to detect the coincidence of the variables z, z′ in the latent space. 
In particular, y follows a Bernoulli distribution with mean value:

P(y = 1|z, z′) = exp( −‖z − z′‖² / (2κ²) ).   (3)

Eq. (3) states that y = 1 with certainty if z and z′ coincide at the exact same point in the latent space; otherwise, the probability in eq. (3) falls off exponentially with their squared distance. The length scale κ in eq. (3) governs the rate of this exponential decay.

2.1 Inference

Inference in this model requires averaging over the Gaussian latent variables z, z′. The required integrals take the form of simple Gaussian convolutions. For example:

P(y = 1|x, x′) = ∫ dz dz′ P(y = 1|z, z′) P(z|x) P(z′|x′)   (4)
              = ( κ² / (κ² + 2σ²) )^{p/2} exp( −‖W(x − x′)‖² / (2(κ² + 2σ²)) ).   (5)

Note that this marginal probability is invariant to uniform re-scalings of the model parameters W, σ, and κ; we will return to this observation later. For inputs (x, x′), we denote the relative likelihood, or odds, of the event y = 1 by

ν(x, x′) = P(y = 1|x, x′) / P(y = 0|x, x′).   (6)

As we shall see, the odds appear in the calculations for many useful forms of inference. Note that the odds ν(x, x′) has a complicated nonlinear dependence on the inputs (x, x′); the numerator in eq. (6) is Gaussian, but the denominator (equal to one minus the numerator) is not.

Of special importance for learning (as discussed in section 2.2) are the statistics of the posterior distribution P(z, z′|x, x′, y). 
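As a concrete illustration, the marginal probability of eq. (5) and the odds of eq. (6) take only a few lines of NumPy. This is a minimal sketch; the function names are ours, not part of the paper:

```python
import numpy as np

def prob_coincidence(W, sigma2, kappa2, x, xp):
    """Marginal probability P(y=1|x,x') from eq. (5)."""
    p = W.shape[0]
    scale = kappa2 + 2.0 * sigma2
    diff = W @ (x - xp)
    return (kappa2 / scale) ** (p / 2.0) * np.exp(-diff @ diff / (2.0 * scale))

def odds(W, sigma2, kappa2, x, xp):
    """Odds nu(x,x') = P(y=1|x,x') / P(y=0|x,x') from eq. (6)."""
    q = prob_coincidence(W, sigma2, kappa2, x, xp)
    return q / (1.0 - q)
```

For example, with W equal to the 2×2 identity, σ² = 1/2, κ² = 1, and x = x′, eq. (5) reduces to (κ²/(κ²+2σ²))^{p/2} = 1/2.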
We obtain this distribution using Bayes rule:

P(z, z′|x, x′, y) = P(y|z, z′) P(z|x) P(z′|x′) / P(y|x, x′).   (7)

We note that the prior distribution P(z, z′|x, x′) is multivariate Gaussian, as is the posterior distribution P(z, z′|x, x′, y = 1) for positively labeled pairs of examples. However, this is not true of the posterior distribution P(z, z′|x, x′, y = 0) for negatively labeled pairs. In this respect, the model differs from classical factor analysis and other canonical models with Gaussian latent variables (e.g., Kalman filters).

Despite the above wrinkle, it remains straightforward to compute the low-order moments of the distribution in eq. (7) for both positively (y = 1) and negatively (y = 0) labeled pairs¹ of examples. In particular, for the posterior means, we obtain:

E[z|x, x′, y = 0] = W[ x − ( νσ² / (κ² + 2σ²) ) (x′ − x) ],   (8)
E[z|x, x′, y = 1] = W[ x + ( σ² / (κ² + 2σ²) ) (x′ − x) ],   (9)

where the coefficient ν in eq. (8) is shorthand for the odds ν(x, x′) in eq. (6). Note how the posterior means E[z|x, x′, y] in eqs. (8–9) differ from the prior mean

E[z|x, x′] = Wx.   (10)

Analogous results hold for the prior and posterior means of the latent variable z′. Intuitively, these calculations show that the expected values of z and z′ move toward each other if the observed label indicates a coincidence (y = 1) and away from each other if not (y = 0).

For learning it is also necessary to compute second-order statistics of the posterior distribution. 
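The posterior means of eqs. (8)–(9) can be sketched directly in NumPy. This is our own illustration (the function name and the choice to pass the odds ν as an argument are assumptions, not the paper's API):

```python
import numpy as np

def posterior_mean_z(W, sigma2, kappa2, x, xp, y, nu):
    """Posterior mean E[z|x,x',y] from eqs. (8)-(9).

    nu is the odds nu(x,x') from eq. (6), needed only when y = 0."""
    if y == 1:
        coeff = sigma2 / (kappa2 + 2.0 * sigma2)        # pull toward x' (eq. 9)
    else:
        coeff = -nu * sigma2 / (kappa2 + 2.0 * sigma2)  # push away from x' (eq. 8)
    return W @ (x + coeff * (xp - x))
```

Setting coeff = 0 recovers the prior mean Wx of eq. (10).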
For the posterior variances, straightforward calculations give:

E[ ‖z − z̄‖² | x, x′, y = 0 ] = pσ² [ 1 + νσ² / (κ² + 2σ²) ],   (11)
E[ ‖z − z̄‖² | x, x′, y = 1 ] = pσ² [ 1 − σ² / (κ² + 2σ²) ],   (12)

where z̄ in these expressions denotes the posterior means in eqs. (8–9), and again the coefficient ν is shorthand for the odds ν(x, x′) in eq. (6). Note how the posterior variances in eqs. (11–12) differ from the prior variance

E[ ‖z − Wx‖² | x, x′ ] = pσ².   (13)

Intuitively, we see that the posterior variance shrinks if the observed label indicates a coincidence (y = 1) and grows if not (y = 0). The expressions for the posterior variance of the latent variable z′ are identical due to the model's symmetry.

¹For the latter, the statistics can be expressed as the differences of Gaussian integrals.

2.2 Learning

Next we consider how to learn the linear projection W, the noise level σ², and the length scale κ² from data. We assume that the data comes in the form of paired inputs x, x′, together with binary judgments y ∈ {0, 1} of similarity or difference. In particular, from a training set {(xi, x′i, yi)}, i = 1, …, N, of N such examples, we wish to learn the parameters that maximize the conditional log-likelihood

L(W, σ², κ²) = Σ_{i=1}^N log P(yi|xi, x′i)   (14)

of observed coincidences (yi = 1) and non-coincidences (yi = 0). 
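Since each term P(yi|xi, x′i) follows from eq. (5), the objective in eq. (14) can be evaluated in closed form. A vectorized NumPy sketch (our own naming; rows of X and Xp hold the paired inputs):

```python
import numpy as np

def log_likelihood(W, sigma2, kappa2, X, Xp, Y):
    """Conditional log-likelihood of eq. (14), with P(y=1|x,x') from eq. (5)."""
    p = W.shape[0]
    scale = kappa2 + 2.0 * sigma2
    D = (X - Xp) @ W.T                      # N x p projected differences W(x - x')
    q = (kappa2 / scale) ** (p / 2.0) * np.exp(-np.sum(D**2, axis=1) / (2.0 * scale))
    return np.sum(np.where(Y == 1, np.log(q), np.log(1.0 - q)))
```

For a single coincident pair with x = x′, W the 2×2 identity, σ² = 1/2, and κ² = 1, this returns log(1/2).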
We say that the data is incomplete or partially observed in the sense that the examples do not specify target values for the latent variables z, z′; instead, such target values must be inferred from the model's posterior distribution.

Given values for the model parameters W, σ², and κ², we can compute the right hand side of eq. (14) from the result in eq. (5). However, the parameters that maximize eq. (14) cannot be computed in closed form. In the absence of an analytical solution, we avail ourselves of the EM algorithm, an iterative procedure for maximum likelihood estimation in latent variable models [10]. The EM algorithm consists of two steps, an E-step which computes statistics of the posterior distribution in eq. (7), and an M-step which uses these statistics to re-estimate the model parameters. The two steps are iterated until convergence.

Intuitively, the EM algorithm uses the posterior means in eqs. (8–9) to "fill in" the missing values of the latent variables z, z′. As shorthand, let

z̄i = E[z|xi, x′i, yi],   (15)
z̄′i = E[z′|xi, x′i, yi]   (16)

denote these posterior means for the ith example in the training set, as computed from the results in eqs. (8–9). The M-step of the EM algorithm updates the linear transformation W by minimizing the sum of squared errors

E(W) = (1/2) Σ_{i=1}^N [ ‖z̄i − Wxi‖² + ‖z̄′i − Wx′i‖² ],   (17)

where the expected values z̄i, z̄′i are computed with respect to the current model parameters (and thus treated as constants in the above minimization). Minimizing the sum of squared errors in eq. (17) gives the update rule:

W ← [ Σ_{i=1}^N ( z̄i xi⊤ + z̄′i x′i⊤ ) ] [ Σ_{i=1}^N ( xi xi⊤ + x′i x′i⊤ ) ]⁻¹,   (18)

where the product in eq. (18) is understood as a vector-matrix multiplication.

            MNIST    BBC  Classic4  Isolet  Letters    Seg    Bal   Iris
# train     60000   1558     4257     6238    14000    210    438    105
# test      10000    667     1419     1559     6000   2100    187     45
# classes      10      5        4       26       26      7      3      3
# feat. (D)   784   9635     5896      617       16     19      4      4
# inputs (d)  164    200      200      172       16     19      4      4
LCA dim (p)    40      4       32       40       16     18      4      4
Euclidean    2.83  30.10     7.89     8.98     4.73  13.71  18.82   5.33
PCA          2.12  11.90     9.74     8.60     4.75  13.71  18.07   5.33
LMNN         1.34   3.40     3.19     3.40     2.58   8.57   8.98   5.33
LCA          1.61   3.41     3.54     3.72     2.93   8.57   4.06   2.22

Table 1: Summary of data sets for multiway classification. For efficiency we projected data sets of high dimensionality D down to their leading d principal components. MNIST [11] is a data set of handwritten digits; we deslanted the images to reduce variability. BBC [12] and Classic4 [13] are text corpora with labeled topics. The last five data sets are from the UCI repository [14]. The bottom four rows compare test error percentage using 3-NN classification. For data sets without dedicated test sets, we averaged across multiple random 70/30 splits.

The EM update for the noise level σ² takes an equally simple form. As shorthand, let

ε²i = E[ ‖z − z̄‖² | xi, x′i, yi ]   (19)

denote the posterior variance of z for the ith example in the training set, as computed from the results in eqs. (11–12). Then the EM update for the noise level is given by:

σ² ← (1/(pN)) [ min_W E(W) + Σ_{i=1}^N ε²i ].   (20)

The minimum of E(W) in this update is computed by substituting the right hand side of eq. (18) into eq. 
(17).

The EM updates for W and σ² have the desirable property that they converge monotonically to a stationary point of the log-likelihood: that is, at each iteration, they are guaranteed to increase the right hand side of eq. (14) except at points in the parameter space with vanishing gradient. A full derivation of the EM algorithm is omitted for brevity.

We have already noted that the log-likelihood in eq. (14) is invariant to uniform rescaling of the model parameters W, σ, and κ. Thus without loss of generality we can set κ² = 1 in the simplest setting of the model, as described above. It does become necessary to estimate the parameter κ², however, in slightly extended formulations of the model, as we consider in section 3.2. Unlike the parameters W and σ², the parameter κ² does not have a simple update rule for its re-estimation by EM. When necessary, however, this parameter can be re-estimated by a simple line search. This approach also preserves the property of monotonic convergence.

3 Applications

We explore two applications of LCA in which its linear transformation is used to preprocess the data for different models of multiway classification. We assume that the original data consists of labeled examples {(xi, ci)} of inputs xi ∈ ℝ^d and their class labels ci ∈ {1, 2, . . . , c}. For each application, we show how to instantiate LCA by creating a particular data set of labeled pairs, where the labels indicate whether the examples in each pair should be mapped closer together (y = 1) or farther apart (y = 0) in LCA's latent space of dimensionality p ≤ d. In the first application, we use LCA to improve a parametric model of classification; in the second application, a nonparametric one. 
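Before describing the applications, we note that one full EM iteration of section 2.2 fits in a short NumPy routine. The sketch below is our own summary (function name and array layout are assumptions; κ² is fixed at 1 as in the simplest setting): the E-step computes the odds of eq. (6), the posterior means of eqs. (8)–(9), and the posterior variances of eqs. (11)–(12); the M-step applies eqs. (18) and (20):

```python
import numpy as np

def em_step(W, sigma2, X, Xp, Y, kappa2=1.0):
    """One EM iteration for LCA. Rows of X, Xp are paired inputs; Y holds labels."""
    N, p = X.shape[0], W.shape[0]
    scale = kappa2 + 2.0 * sigma2
    # E-step: odds (eq. 6) via the marginal probability of eq. (5)
    D = (X - Xp) @ W.T
    q = (kappa2 / scale) ** (p / 2.0) * np.exp(-np.sum(D**2, axis=1) / (2.0 * scale))
    nu = q / (1.0 - q)
    # Posterior means (eqs. 8-9) and their symmetric counterparts for z'
    c = np.where(Y == 1, sigma2 / scale, -nu * sigma2 / scale)[:, None]
    Zbar = (X + c * (Xp - X)) @ W.T
    Zpbar = (Xp + c * (X - Xp)) @ W.T
    # Posterior variances (eqs. 11-12), i.e. eps_i^2 of eq. (19)
    eps2 = p * sigma2 * np.where(Y == 1, 1.0 - sigma2 / scale, 1.0 + nu * sigma2 / scale)
    # M-step for W: unconstrained least squares, eq. (18)
    A = Zbar.T @ X + Zpbar.T @ Xp
    B = X.T @ X + Xp.T @ Xp
    W_new = A @ np.linalg.inv(B)
    # M-step for sigma^2: eq. (20), with E(W) evaluated at the new W
    R = np.concatenate([Zbar - X @ W_new.T, Zpbar - Xp @ W_new.T])
    sigma2_new = (0.5 * np.sum(R**2) + np.sum(eps2)) / (p * N)
    return W_new, sigma2_new
```

Iterating em_step until the log-likelihood of eq. (14) stops improving gives the monotone convergence described above.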
The data sets in our experiments are summarized in Table 1.

3.1 Gaussian mixture modeling

Gaussian mixture models (GMMs) offer perhaps the simplest parametric model of multiway classification. In the most straightforward application of GMMs, the labeled examples in each class c are modeled by a single multivariate Gaussian distribution with mean μc and covariance matrix Σc. Classification is also simple: for each unlabeled example, we use Bayes rule to compute the class with the highest posterior probability.

Even in these simplest of GMMs, however, challenges arise when the data is very high dimensional: it may be prohibitively expensive to estimate or store the covariance matrix for each class of the data. Two simple remedies are: (i) to reduce the input's dimensionality using principal component analysis (PCA) or linear discriminant analysis (LDA) [15], or (ii) to model each multivariate Gaussian distribution using factor analysis. In the latter, we learn distributions of the form:

P(x|c) ∼ N( μc, Ψc + Λc Λc⊤ ),   (21)

where the diagonal matrix Ψc ∈ ℝ^{d×d} and loading matrix Λc ∈ ℝ^{d×p} are the model parameters of the factor analyzer belonging to the cth class. Factor analysis can be formulated as a latent variable model, and its parameters estimated by an EM algorithm [1].

GMMs are generative models trained by maximum likelihood estimation. In this section, we explore how LCA may yield classifiers of similar form but higher accuracy. To do so, we learn one model of LCA for each class of the data. 
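For reference, the factor-analysis baseline of eq. (21) only requires evaluating a Gaussian log-density with a low-rank-plus-diagonal covariance. A minimal sketch, assuming the per-class parameters (μc, Ψc, Λc) have already been fit; the function name is ours:

```python
import numpy as np

def fa_log_density(x, mu, Psi_diag, Lam):
    """Log N(x; mu, Psi + Lam Lam^T), the class-conditional density of eq. (21)."""
    d = mu.shape[0]
    Sigma = np.diag(Psi_diag) + Lam @ Lam.T   # factor-analysis covariance
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))
```

Bayes-rule classification picks the class c maximizing this log-density plus the log class prior.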
In particular, we use LCA to project each example xi into a lower dimensional space where we hope for two properties: (i) that it is closer to the mean of projected examples from the same class ci, and (ii) that it is farther from the mean of projected examples from other classes c ≠ ci.

More specifically, we instantiate the model of LCA for each class as follows. Let μc ∈ ℝ^d denote the mean of the labeled examples in class c. Then we create a training set of labeled pairs {(μc, xi, yic)} over all examples xi, where yic = 1 if ci = c and yic = 0 if ci ≠ c. From this training set, we use the EM algorithm in section 2.2 to learn a (class-specific) linear projection Wc and variance σc². Finally, to classify an unlabeled example x, we compute the probabilities:

P(yc = 1|x) = ( 1 / (1 + 2σc²) )^{p/2} exp( −‖Wc(x − μc)‖² / (2(1 + 2σc²)) ).   (22)

We label the example x by the class c that maximizes the probability in eq. (22). As we shall see, this decision rule for LCA often makes different predictions than Bayes rule in maximum likelihood GMMs. Conveniently, we can train the LCA models for different classes in parallel.

We evaluated the performance of LCA in this setting on the first four data sets in Table 1. Over a range of reduced dimensionalities p < d, we compared the classification accuracy of three approaches: (i) GMMs with full covariance matrices after projecting the data down to p dimensions with PCA or LDA, (ii) GMMs with p-dimensional factor analyzers, and (iii) p-dimensional models of LCA. Fig. 2 shows that LCA generally outperforms these other methods; also, its largest gains occur in the regime of very aggressive dimensionality reduction p ≪ d. To highlight the results in this regime, Fig. 
3 contrasts the p = 2 dimensional representations of the data discovered by PCA and LCA. Here it is visually apparent that LCA leads to much better separation of the examples in different classes.

Figure 2: Comparison of dimensionality reduction by principal components analysis (PCA), factor analysis (FA), linear discriminant analysis (LDA), and latent coincidence analysis (LCA). The plots show test set error versus dimensionality p.

Figure 3: Comparison of two-dimensional (p = 2) representations of data discovered by PCA and LCA. The examples are color-coded by class label.

3.2 Distance metric learning

We can also apply LCA to learn a distance metric that improves kNN classification [4, 5, 6]. Our approach draws heavily on the ideas of LMNN [5], though it differs in its execution. In LMNN, each training example has k target neighbors, typically chosen as the k nearest neighbors in Euclidean space with the same class label. LMNN learns a metric to shrink the distances between examples and target neighbors while preserving (or increasing) the distances between examples from different classes. Errors in kNN classification tend to occur when differently labeled examples are closer together than pairs of target neighbors. Thus LMNN seeks to minimize the number of differently labeled examples that invade the perimeters established by target neighbors. These examples are known as impostors.

In LCA, we can view the matrix W⊤W as a Mahalanobis distance metric for kNN classification. The starting point of LCA is to create a training set of pairs of examples. Among these pairs, we wish the similarly labeled examples to coincide (y = 1) and the differently labeled examples to diverge (y = 0). For the former, it is natural to choose all pairs of examples and their target neighbors. For the latter, it is natural to choose all pairs of differently labeled examples. 
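This naive pair construction (before the pruning described next) can be sketched as follows; the function name and index-triple representation are our own:

```python
import numpy as np

def make_pairs(X, labels, k=3):
    """Create (i, j, y) index triples: y=1 pairs each example with its k target
    neighbors (nearest same-class points in Euclidean distance); y=0 pairs it
    with every differently labeled example."""
    N = X.shape[0]
    pairs = []
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # all squared distances
    for i in range(N):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
        targets = same[np.argsort(d2[i, same])[:k]]
        pairs += [(i, j, 1) for j in targets]
        pairs += [(i, j, 0) for j in np.flatnonzero(labels != labels[i])]
    return pairs
```

With c classes of m examples each, this yields the ckm similar and c(c − 1)m² dissimilar pairs counted in the text, which is exactly why the pruning strategies below are needed at scale.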
Concretely, if there are c classes, each with m examples, then this approach creates a training set of ckm pairs of similarly labeled examples (with y = 1) and c(c − 1)m² pairs of differently labeled examples (with y = 0).

Unfortunately it is clear that this approach does not scale well with the number of examples. We therefore adopt two pruning strategies in our implementation of LCA. First, we do not include training examples without impostors. Second, among the pairs of differently labeled examples, we only include each example with its current or previous impostors. A complication of this approach is that every few iterations we must check to see if any example has new impostors. If so, we add the example, its target neighbors, and impostors into the training set. This strategy was used in all the experiments described below. Our short-cuts are similar in spirit to the optimizations in LMNN [5] as well as more general cutting plane strategies of constrained optimization [16].

The use of LCA for kNN classification also benefits from a slight but crucial extension of the model in Fig. 1. Recall that the parameter κ² determines the length scale at which projected examples are judged to coincide in the latent space. For kNN classification, we extend the model in Fig. 1 to learn a local parameter κ² for each input in the training set. These local parameters κ² are needed to account for the fact that different inputs may reside at very different distances from their target neighbors. In the graphical model of Fig. 
1, this extension amounts to drawing an additional plate that encloses the parameter κ² and the model's random variables, but not the parameters W and σ².

Note that the σ² and κ² parameters of LCA, though important to estimate, are not ultimately used for kNN classification. In particular, after a model is trained, we simply perform kNN classification using the Mahalanobis distance metric parameterized by the linear transformation W.

Figure 4: Comparison of dimensionality reduction by principal components analysis (PCA) and latent coincidence analysis (LCA). The plots show kNN classification error (training dotted, test solid) versus dimensionality p.

Figure 5: Comparison of kNN classification by PCA, LMNN, and LCA. We set k = 3 for all experiments. Training error is computed using leave-one-out cross-validation. The values of p used for LCA are given in Table 1.

For comparison, we measure kNN classification accuracy using Euclidean distance, PCA, and LMNN. We report all three along with LCA in Table 1, but we focus on PCA in Fig. 4 to illustrate the effect of dimensionality reduction. LCA consistently outperforms PCA across all dimensionalities. Additionally, we hold out a validation set to search for an optimal dimensionality p. In Fig. 5, we compare LCA to PCA and LMNN. Again, LCA is clearly superior to PCA and generally achieves comparable performance to LMNN. 
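The resulting classifier is plain kNN under the metric W⊤W, which is the same as Euclidean kNN on the projected points Wx. A minimal sketch (our own naming, not the paper's code):

```python
import numpy as np

def knn_predict(W, X_train, y_train, X_test, k=3):
    """kNN classification with the Mahalanobis metric W^T W: equivalent to
    Euclidean distance between the projected points Wx."""
    Z_train, Z_test = X_train @ W.T, X_test @ W.T
    preds = []
    for z in Z_test:
        idx = np.argsort(np.sum((Z_train - z) ** 2, axis=1))[:k]   # k nearest
        vals, counts = np.unique(y_train[idx], return_counts=True)
        preds.append(vals[np.argmax(counts)])                      # majority vote
    return np.array(preds)
```

Because only W enters this rule, the estimated σ² and κ² can be discarded after training, as noted above.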
Advantageously, we often obtain our best result with LCA using a lower dimensionality p < d.

4 Discussion

In this paper we have introduced Latent Coincidence Analysis (LCA), a latent variable model for learning linear projections that map similar inputs closer together and different inputs farther apart. Inference in LCA is entirely tractable, and we use an EM algorithm to learn maximum likelihood estimates of the model parameters. Our approach values simplicity, but not at the expense of efficacy. On mixture modeling and distance metric learning tasks, LCA performs competitively across a range of reduced dimensionalities.

There are many directions for future work. One challenge that we observed was slow convergence of the EM algorithm, an issue that may be ameliorated by the gradient or second-order methods proposed in [17]. To handle larger data sets, we plan to explore online strategies for distance metric learning [18], possibly based on Bayesian [19] or confidence-weighted updates [20]. Finally, we will explore hybrid strategies between the mixture modeling in section 3.1 and kNN classification in section 3.2, where multiple (but not all) examples in each class are used as "anchors" for distance-based classification. All these directions should open the door to implementations on larger scales [21] than we have considered here.

References

[1] D. B. Rubin and D. T. Thayer. EM algorithms for ML factor analysis. Psychometrika, 47:69–76, 1982.

[2] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.

[3] T. Cover and P. Hart. 
Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, pages 21–27, 1967.

[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520, Cambridge, MA, 2005. MIT Press.

[5] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.

[6] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, Corvallis, Oregon, USA, June 2007.

[7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 833–840. MIT Press, Cambridge, MA, 2003.

[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[9] B. Kulis, M. A. Sustik, and I. S. Dhillon. Learning low-rank kernel matrices. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML-06), 2006.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–37, 1977.

[11] http://yann.lecun.com/exdb/mnist/.

[12] http://mlg.ucd.ie/datasets/bbc.html.

[13] http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/.

[14] http://archive.ics.uci.edu/ml/datasets.html.

[15] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[17] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. 
In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 509–516, 2003.

[18] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 94–101, Banff, Canada, 2004.

[19] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997.

[20] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 264–271. Omnipress, 2008.

[21] G. Chechik, U. Shalit, V. Sharma, and S. Bengio. An online algorithm for large scale image similarity learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 306–314. 2009.", "award": [], "sourceid": 1482, "authors": [{"given_name": "Matthew", "family_name": "Der", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}]}