{"title": "Similarity Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1511, "page_last": 1519, "abstract": "Measuring similarity is crucial to many learning tasks. It is also a richer and broader notion than what most metric learning algorithms can model. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components' local similarity values.", "full_text": "Similarity Component Analysis\n\nSoravit Changpinyo\u2217\nDept. of Computer Science\nU. of Southern California\nLos Angeles, CA 90089\nschangpi@usc.edu\n\nKuan Liu\u2217\n\nDept. of Computer Science\nU. of Southern California\nLos Angeles, CA 90089\n\nkuanl@usc.edu\n\nFei Sha\n\nDept. of Computer Science\nU. of Southern California\nLos Angeles, CA 90089\n\nfeisha@usc.edu\n\nAbstract\n\nMeasuring similarity is crucial to many learning tasks. 
To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components' local similarity values.

1 Introduction

Learning how to measure similarity (or dissimilarity) is a fundamental problem in machine learning. Arguably, if we had the right measure, we would be able to achieve perfect classification or clustering of data. If we parameterize the desired dissimilarity measure in the form of a metric function, the resulting learning problem is often referred to as metric learning. In the last few years, researchers have invented a plethora of such algorithms [18, 5, 11, 13, 17, 9].
Those algorithms have been successfully applied to a wide range of application domains.

*Equal contributions

Figure 1: Similarity Component Analysis and its application to the example of CENTAUR, MAN and HORSE. SCA has K latent components which give rise to local similarity values s_k conditioned on a pair of data x_m and x_n. The model's output s is a combination of all local values through an OR model (straightforward to extend to a noisy-OR model). Θ_k is the parameter vector for p(s_k | x_m, x_n). See text for details.

However, the notion of (dis)similarity is much richer than what a metric is able to capture. Consider the classical example of CENTAUR, MAN and HORSE. MAN is similar to CENTAUR and CENTAUR is similar to HORSE. Metric learning algorithms that model the two similarities well would need to assign small distances to those two pairs. On the other hand, the algorithms will also need to strenuously battle against assigning a small distance between MAN and HORSE, forced by the triangle inequality, so as to avoid the fallacy that MAN is similar to HORSE too! This example (and others [12]) thus illustrates important properties of (dis)similarity, such as non-transitivity and violation of the triangle inequality, that metric learning has not adequately addressed.

Representing objects as points in high-dimensional feature spaces, most metric learning algorithms assume that the same set of features contributes indistinguishably to assessing similarity. In particular, the popular Mahalanobis metric weights each feature (and their interactions) additively when calculating distances. In contrast, similarity can arise from a complex aggregation of comparing data instances on multiple subsets of features, to which we refer as latent components. For instance, there are multiple reasons for us to rate two songs as similar: being written by the same composers, being performed by the same band, or belonging to the same genre.
For an arbitrary pair of songs, we can rate the similarity between them based on one of the many components, or an arbitrary subset of components, while ignoring the rest. Note that, in the learning setting, we observe only the aggregated results of those comparisons; which components are used is latent.

Multi-component similarity also exists in other types of data. Consider a social network where the network structure (i.e., links) is a superposition of multiple networks in which people are connected for various organizational reasons: school, profession, or hobby. It is thus unrealistic to assume that the links exist due to a single cause. More appropriately, social networks are "multiplex" [6, 15].

In this paper, we propose Similarity Component Analysis (SCA) to model the richer similarity relationships beyond what current metric learning algorithms can offer. SCA is a Bayesian network, illustrated in Fig. 1. The similarity (node s) is modeled as a probabilistic combination of multiple latent components. Each latent component (s_k) assigns a local similarity value to whether or not two objects are similar, inferring from only an unknown subset of features. The (local) similarity values of those latent components are aggregated with a (noisy-)OR model. Intuitively, two objects are likely to be similar if they are considered to be similar by at least one component. Two objects are likely to be dissimilar if none of the components voices support.

We derive an EM-based algorithm for fitting the model with data annotated with similarity relationships. The algorithm infers the intermediate similarity values of latent components and identifies the parameters for the (noisy-)OR model, as well as each latent component's conditional distribution, by maximizing the likelihood of the training data.

We validate SCA on several learning tasks.
On synthetic data where the ground-truth is available, we confirm SCA's ability to discover the latent components and their corresponding subsets of features. On a multiway classification task, we contrast SCA to state-of-the-art metric learning algorithms and demonstrate SCA's superior performance in classifying data samples. Finally, we use SCA to model the network link structures among research articles published in the NIPS proceedings. We show that SCA achieves the best link prediction accuracy among competing algorithms. We also conduct extensive analysis of how the learned latent components effectively represent link structures.

In section 2, we describe the SCA model and its inference and learning algorithms. We report our empirical findings in section 3. We discuss related work in section 4 and conclude in section 5.

2 Approach

We start by describing in detail Similarity Component Analysis (SCA), a Bayesian network for modeling similarity between two objects. We then describe the inference procedure and learning algorithm for fitting the model parameters with similarity-annotated data.

2.1 Probabilistic model of similarity

In what follows, let (u, v, s) denote a pair of D-dimensional data points u, v ∈ R^D and their associated value of similarity s ∈ {DISSIMILAR, SIMILAR} or {0, 1} accordingly. We are interested in modeling the process of assigning s to these two data points. To this end, we propose Similarity Component Analysis (SCA) to model the conditional distribution p(s|u, v), illustrated in Fig. 1. In SCA, we assume that p(s|u, v) is a mixture of the multiple latent components' local similarity values.
Each latent component evaluates its similarity value independently, using only a subset of the D features. Intuitively, there are multiple reasons for annotating whether or not two data instances are similar, and each reason focuses locally on one aspect of the data, restricting itself to examining only a different subset of features.

Latent components  Formally, let u[k] denote the subset of features of u corresponding to the k-th latent component, where [k] ⊂ {1, 2, ..., D}. The similarity assessment s_k of this component alone is determined by the distance between u[k] and v[k]:

    d_k = (u - v)^T M_k (u - v)    (1)

where M_k ⪰ 0 is a D × D positive semidefinite matrix, used to measure the distance more flexibly than the standard Euclidean metric. We restrict M_k to be sparse; in particular, only the corresponding [k]-th rows and columns are non-zero. Note that in principle [k] needs to be inferred from data, which is generally hard. Nonetheless, we have found that, empirically, even without explicitly constraining M_k we often obtain a sparse solution.

The distance d_k is transformed into the probability of the Bernoulli variable s_k according to

    P(s_k = 1 | u, v) = (1 + e^{-b_k}) [1 - σ(d_k - b_k)]    (2)

where σ(·) is the sigmoid function σ(t) = (1 + e^{-t})^{-1} and b_k is a bias term. Intuitively, when the (biased) distance (d_k - b_k) is large, s_k is less likely to be 1 and the two data points are regarded as less similar. Note that the constraint that M_k be positive semidefinite is important, as it keeps this probability bounded above by 1.

Combining local similarities  Assume that there are K latent components. How can we combine all the local similarity assessments? In this work, we use an OR-gate.
Namely,

    P(s = 1 | s_1, s_2, ..., s_K) = 1 - \prod_{k=1}^{K} I[s_k = 0]    (3)

Thus, the two data points are similar (s = 1) if at least one of the aspects deems so, i.e., s_k = 1 for some k. The OR-model can be extended to the noisy-OR model [14]. To this end, we model the non-deterministic effect of each component on the final similarity value:

    P(s = 1 | s_k = 1) = τ_k = 1 - θ_k,    P(s = 1 | s_k = 0) = 0    (4)

In essence, the uncertainty comes from the probability θ_k of failing (a false negative) to identify the similarity if we are only allowed to consider one component at a time. If we can consider all components at the same time, this failure probability is reduced. The noisy-OR model captures precisely this notion:

    P(s = 1 | s_1, s_2, ..., s_K) = 1 - \prod_{k=1}^{K} θ_k^{I[s_k = 1]}    (5)

where the more components with s_k = 1, the lower the false-negative rate after combination. Note that the noisy-OR model reduces to the OR-model eq. (3) when θ_k = 0 for all k.

Similarity model  Our desired model for the conditional probability p(s|u, v) is obtained by marginalizing over all possible configurations of the latent components s = {s_1, s_2, ..., s_K}:

    P(s = 0 | u, v) = \sum_s P(s = 0 | s) \prod_k P(s_k | u, v) = \sum_s \prod_k θ_k^{I[s_k = 1]} P(s_k | u, v)
                    = \prod_k [θ_k p_k + 1 - p_k] = \prod_k [1 - τ_k p_k]    (6)

where p_k = p(s_k = 1 | u, v) is a shorthand for eq. (2). Note that despite the exponential number of configurations for s, the marginalized probability is tractable. For the OR-model, where θ_k = 0, the conditional probability simplifies to P(s = 0 | u, v) = \prod_k [1 - p_k].

2.2 Inference and learning

Given an annotated training dataset D = {(x_m, x_n, s_mn)}, we learn the parameters, which include all the positive semidefinite matrices M_k, the biases b_k and the false-negative rates θ_k (if noisy-OR is used), by maximizing the likelihood of D. Note that we assume K is known throughout this work. We develop an EM-style algorithm to find a local optimum of the likelihood.

Posterior  The posteriors over the hidden variables are computationally tractable:

    q_k = P(s_k = 1 | u, v, s = 0) = p_k θ_k \prod_{l ≠ k} [1 - τ_l p_l] / P(s = 0 | u, v)
    r_k = P(s_k = 1 | u, v, s = 1) = p_k (1 - θ_k \prod_{l ≠ k} [1 - τ_l p_l]) / P(s = 1 | u, v)    (7)

For the OR-model eq. (3), these posteriors simplify further, as all θ_k = 0.

Note that these posteriors are sufficient to learn the parameters M_k and b_k. To learn the parameters θ_k, however, we need to compute the expected likelihood with respect to the posterior P(s | u, v, s). While this posterior is tractable, the expectation of the likelihood is not, and variational inference is needed [10]. We omit the derivation for brevity. In what follows, we focus on learning M_k and b_k.

For the k-th component, the relevant term in the expected log-likelihood, given the posteriors, from a single similarity assessment s on (u, v), is

    J_k = q_k^{1-s} r_k^{s} log P(s_k = 1 | u, v) + (1 - q_k^{1-s} r_k^{s}) log(1 - P(s_k = 1 | u, v))    (8)

Learning the parameters  Note that J_k is not jointly convex in b_k and M_k. Thus, we optimize them alternately. Concretely, fixing M_k, we optimize over b_k via grid search.
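Concretely, the forward model of eqs. (2) and (6) and the posteriors of eq. (7) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function and variable names are ours:

```python
import numpy as np

def local_similarity(u, v, M, b):
    # p_k = P(s_k = 1 | u, v) of eq. (2): (1 + e^{-b}) [1 - sigma(d_k - b)],
    # with d_k the metric-based distance of eq. (1).
    d = (u - v) @ M @ (u - v)
    return (1.0 + np.exp(-b)) * (1.0 - 1.0 / (1.0 + np.exp(-(d - b))))

def noisy_or_posteriors(p, theta):
    # Given p = [p_1, ..., p_K] and false-negative rates theta, return
    # P(s = 1 | u, v) from eq. (6) and the posteriors q, r of eq. (7).
    # Setting theta = 0 recovers the plain OR model of eq. (3).
    tau = 1.0 - theta
    factors = 1.0 - tau * p
    p_s0 = np.prod(factors)                      # P(s = 0 | u, v), eq. (6)
    loo = p_s0 / factors                         # prod over l != k ("leave one out")
    q = p * theta * loo / p_s0                   # P(s_k = 1 | u, v, s = 0)
    r = p * (1.0 - theta * loo) / (1.0 - p_s0)   # P(s_k = 1 | u, v, s = 1)
    return 1.0 - p_s0, q, r
```

Two sanity checks follow directly from the model: at zero distance, eq. (2) gives probability exactly 1 for any bias, and under the OR model (theta = 0) the posterior q vanishes, since an observed s = 0 rules out any component having fired.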
Fixing b_k, maximizing J_k with respect to M_k is a convex optimization problem, as J_k is a concave function of M_k given the linear dependency of the distance eq. (1) on this parameter.

We use the method of projected gradient ascent. Essentially, we take a gradient ascent step to update M_k iteratively. If the update violates the positive semidefinite constraint, we project back to the feasible region by setting all negative eigenvalues of M_k to zero. Alternatively, we have found that reparameterizing J_k with M_k = L_k^T L_k is more computationally advantageous, as L_k is unconstrained. We use L-BFGS to optimize with respect to L_k and obtain faster convergence and better objective function values. (While this procedure only guarantees local optima, we observe no significant detrimental effect of arriving at those solutions.) We give the exact form of the gradients with respect to M_k and L_k in the Suppl. Material.

2.3 Extensions

Variants of local similarity models  The choice of the logistic-like function eq. (2) for modeling the local similarity of the latent components is orthogonal to how those similarities are combined in eq. (3) or eq. (5). Thus, it is relatively straightforward to replace eq. (2) with a more suitable one. For instance, in some of our empirical studies, we have constrained M_k to be a diagonal matrix with nonnegative diagonal elements. This is especially useful when the feature dimensionality is extremely high. We view this flexibility as a modeling advantage.

Disjoint components  We could also explicitly express our desideratum that latent components focus on non-overlapping features. To this end, we penalize the likelihood of the data with the following regularizer to promote disjoint components:

    R({M_k}) = \sum_{k ≠ k'} diag(M_k)^T diag(M_{k'})    (9)

where diag(·) extracts the diagonal elements of the matrix.
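A direct transcription of this regularizer follows; this is a sketch under our assumption that the sum in eq. (9) runs over pairs of distinct components k ≠ k' (each pair counted once):

```python
import numpy as np

def disjoint_regularizer(metrics):
    # R({M_k}) of eq. (9): inner products between the metrics' diagonals.
    # Since PSD matrices have nonnegative diagonals, R >= 0, and R = 0
    # exactly when the components select non-overlapping feature subsets.
    R = 0.0
    for k in range(len(metrics)):
        for kp in range(k + 1, len(metrics)):
            R += float(np.diag(metrics[k]) @ np.diag(metrics[kp]))
    return R
```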
As the metrics are constrained to be positive semidefinite, the inner product attains its minimum of zero when the diagonal elements, which are nonnegative, are orthogonal to each other. This introduces zero elements on the diagonals of the metrics, which in turn deselects the corresponding feature dimensions, because the corresponding rows and columns of those elements are necessarily zero due to the positive semidefinite constraints. Thus, metrics that have orthogonal diagonal vectors will use non-overlapping subsets of features.

(a) Disjoint ground-truth metrics    (b) Overlapping ground-truth metrics

Figure 2: On synthetic datasets, SCA successfully identifies the sparse structures and (non)overlapping patterns of the ground-truth metrics. See text for details. Best viewed in color.

3 Experimental results

We validate the effectiveness of SCA in modeling similarity relationships on three tasks. In section 3.1, we apply SCA to synthetic datasets where the ground-truth is available, to confirm SCA's ability to correctly identify the underlying parameters. In section 3.2, we apply SCA to a multiway classification task of recognizing images of handwritten digits, where similarity is equated with having the same class label. SCA attains classification accuracy superior to state-of-the-art metric learning algorithms. In section 3.3, we apply SCA to a link prediction problem for a network of scientific articles. On this task, SCA outperforms competing methods significantly, too.

Our baseline algorithms for modeling similarity are information-theoretic metric learning (ITML) [5] and large margin nearest neighbor (LMNN) [18].
Both methods are discriminative approaches in which a metric is optimized to reduce the distances between data points from the same label class (or similar data instances) and increase the distances between data points from different classes (or dissimilar data instances). When possible, we also contrast to multiple-metric LMNN (MM-LMNN) [18], a variant of LMNN where multiple metrics are learned from data.

3.1 Synthetic data

Data  We generate a synthetic dataset according to the graphical model in Fig. 1. Specifically, the feature dimensionality is D = 30 and the number of latent components is K = 5. For each component k, the corresponding metric M_k is a D × D sparse positive semidefinite matrix where only the elements in a 6 × 6 block on the diagonal are nonzero. Moreover, for different k, these block matrices do not overlap in row and column indices. In short, these metrics mimic the setup where each component focuses on its own 1/K-th of the total features, disjoint from the others. The first row of Fig. 2(a) illustrates these 5 matrices, where the black background color indicates zero elements. The values of the nonzero elements are randomly generated, as long as they maintain the positive semidefiniteness of the metrics. We set the bias terms b_k to zero for all components. We sample N = 500 data points randomly from R^D. We select a random pair and compute their similarity according to eq. (6), thresholding at 0.5 to yield a binary label s ∈ {0, 1}. We select randomly 74850 pairs for training, 24950 for development, and 24950 for testing.

Method  We use the OR-model eq. (3) to combine latent components. We evaluate the results of SCA on two aspects: how well we can recover the ground-truth metrics (and biases) and how well we can use the parameters to predict similarities on the test set.

Results  The second row of Fig.
2(a) contrasts the learned metrics to the ground-truth (the first row). Clearly, these two sets of metrics have almost identical shapes and sparse structures. Note that for this experiment, we did not use the disjoint regularizer (described in section 2.3) to promote sparsity and disjointness in the learned metrics. Yet, the SCA model is still able to identify those structures. For the biases, SCA identifies them as being close to zero (details are omitted for brevity).

Table 1: Similarity prediction accuracies and standard errors (%) on the synthetic dataset

  BASELINES              SCA
  ITML       LMNN        K=1        K=3        K=5        K=7        K=10       K=20
  72.7±0.0   71.3±0.2    72.8±0.0   82.1±0.1   91.5±0.1   91.7±0.1   91.8±0.1   90.2±0.4

Table 2: Misclassification rates (%) on the MNIST recognition task

        BASELINES                        SCA
  D     EUC.   ITML    LMNN   MM-LMNN    K=1          K=5          K=10
  25    21.6   15.1    20.6   20.2       17.7 ± 0.9   16.0 ± 1.5   14.5 ± 0.6
  50    18.7   13.35   16.5   13.6       13.8 ± 0.3   12.0 ± 1.1   11.4 ± 0.6
  100   18.1   11.85   13.4   9.9        12.1 ± 0.1   10.8 ± 0.6   11.1 ± 0.3

Table 1 contrasts the prediction accuracies of SCA with those of competing methods. Note that ITML, LMNN and SCA with K = 1 perform similarly. However, when the number of latent components increases, SCA outperforms the other approaches by a large margin.
Also note that when the number of latent components exceeds the ground-truth K = 5, SCA reaches a plateau before eventually overfitting.

In real-world data, "true metrics" may overlap; that is, it is possible that different components of similarity rely on overlapping sets of features. To examine SCA's effectiveness in this scenario, we create another synthetic dataset where the true metrics heavily overlap, illustrated in the first row of Fig. 2(b). Nonetheless, SCA is able to identify the metrics correctly, as seen in the second row.

3.2 Multiway classification

For this task, we use the MNIST dataset, which consists of 10 classes of handwritten digit images. We use PCA to reduce the original dimensionality from 784 to D = 25, 50 and 100, respectively. We use 4200 examples for training, 1800 for development and 2000 for testing.

The data come in the format (x_n, y_n), where y_n is the class label. We convert them into the format (x_m, x_n, s_mn) that SCA expects. Specifically, for every training data point, we select its 15 nearest neighbors among samples in the same class and form 15 similar relationships. For dissimilar relationships, we select its 80 nearest neighbors among samples from the remaining classes. For testing, the label y of x is determined by

    y = \arg\max_c s_c = \arg\max_c \sum_{x' ∈ B_c(x)} P(s = 1 | x, x')    (10)

where s_c is the similarity score for the c-th class, computed as the sum over B_c(x), the 5 largest similarity values to samples in that class. In Table 2, we show classification error rates for different values of D. For K > 1, SCA clearly outperforms the single-metric baselines. In addition, SCA performs well compared to MM-LMNN, achieving far better accuracy for small D.

3.3 Link prediction

We evaluate SCA on the task of link prediction in a "social" network of scientific articles.
We aim to demonstrate SCA's power to model similarity/dissimilarity in "multiplex" real-world network data. In particular, we are interested not only in link prediction accuracies, but also in the insights about the data that we gain from analyzing the identified latent components.

Table 3: Link prediction accuracies and their standard errors (%) on a network of scientific papers

  Feature       BASELINES                         SCA-DIAG                  SCA
  type      SVM         ITML        LMNN        K=1         K*            K=1         K*
  BoW       73.3±0.0    -           -           64.8±0.1    87.0±1.2      -           -
  ToW       75.3±0.0    -           -           67.0±0.0    88.1±1.4      -           -
  ToP       71.2±0.0    81.1±0.1    80.7±0.1    62.6±0.0    81.0±0.8      81.0±0.0    87.6±1.0

Setup  We use the NIPS 0-12 dataset [1] to construct the aforementioned network. The dataset contains papers from the NIPS conferences between 1987 and 1999. The papers are organized into 9 sections (topics) (cf. Suppl. Material). We sample randomly 80 papers per section and use them to construct the network. Each paper is a vertex, and two papers are connected with an edge and deemed similar if both belong to the same section.

We experiment with three representations of the papers: (1) Bag-of-words (BoW) uses normalized occurrences (frequencies) of words in the documents. As a preprocessing step, we remove "rare" words that appear fewer than 75 times or more than 240 times. Those words are either too specialized (and thus generalize poorly) or merely functional words. After the removal, we obtain 1067 words. (2) Topic (ToP) uses the documents' topic vectors (mixture weights of topics) after fitting the corpus
(3) Topic-words (ToW) is essentially BoW except that we retain only 1036\nfrequent words used by the topics of the LDA model (top 40 words per topic).\nMethods We compare the proposed SCA extensively to several competing methods for link pre-\ndiction. For BoW and ToW represented data, we compare SCA with diagonal metrics (SCA-DIAG,\ncf. section 2.3) to Support Vector Machines (SVM) and logistic regression (LOGIT) to avoid high\ncomputational costs associated with learning high-dimensional matrices (the feature dimensionality\nD \u2248 1000). To apply SVM/LOGIT, we treat the link prediction as a binary classi\ufb01cation problem\nwhere the input is the absolute difference in feature values between the two data points.\nFor 50-dimensional ToP represented data, we compare SCA (SCA) and SCA-DIAG to SVM/LOGIT,\ninformation-theoretical metric learning (ITML), and large margin nearest neighbor (LMNN).\nNote that while LMNN was originally designed for nearest-neighbor based classi\ufb01cation, it can be\nadapted to use similarity information to learn a global metric to compute the distance between any\npair of data points. We learn such a metric and threshold on the distance to render a decision on\nwhether two data points are similar or not (i.e., whether there is a link between them). On the other\nend, multiple-metric LMNN, while often having better classi\ufb01cation performance, cannot be used\nfor similarity and link prediction as it does not provide a principled way of computing distances\nbetween two arbitrary data points when there are multiple (local) metrics.\nLink or not? In Table 3, we report link prediction accuracies, which are averaged over several runs\nof randomly generated 70/30 splits of the data. SVM and LOGIT perform nearly identically so we\nreport only SVM. 
For both SCA and SCA-DIAG, we report results when a single component is used as well as when the optimal number of components is used (under columns K*).

Both SCA-DIAG and SCA outperform the other methods by a significant margin, especially when the number of latent components is greater than 1 (K* ranges from 3 to 13, depending on the method and the feature type). The only exception is SCA-DIAG with one component (K = 1), which is an overly restrictive model, as the diagonal metrics constrain features to be combined additively. This restriction is overcome by using a larger number of components.

Edge component analysis  Why does learning latent components in SCA achieve superior link prediction accuracies? The (noisy-)OR model used by SCA is naturally inclined to favor "positive" opinions: a pair of samples is regarded as similar as long as there is one latent component that strongly believes so. This implies that a latent component can be tuned to a specific group of samples if those samples rely on common feature characteristics to be similar.

Fig. 3(a) confirms our intuition. The plot displays in relative strength (darker being stronger) how much each latent component believes a pair of articles from the same section should be similar. Concretely, after fitting a 9-component SCA (to documents in ToP features), we consider edges connecting articles in the same section and compute the average local similarity values assigned by each component. We observe two interesting sparse patterns: for each section, there is a dominant latent component that strongly supports the fact that articles from that section should be similar (e.g., for section 1, the dominant one is the 9-th component). Moreover, each latent component often strongly "voices up" for one section; the exception is the second component, which seems to support both sections 3 and 4.
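The per-section averages behind Fig. 3(a) are simple means of the inferred local similarity values. A sketch under our own (hypothetical) array layout:

```python
import numpy as np

def section_signatures(local_sims, edge_sections, num_sections):
    # local_sims: (E, K) array of local similarity values p_k for E
    # within-section edges; edge_sections: length-E array of section ids.
    # Returns a (num_sections, K) matrix whose row c is section c's
    # "signature": the average component-wise similarity over its edges.
    K = local_sims.shape[1]
    sig = np.zeros((num_sections, K))
    for c in range(num_sections):
        sig[c] = local_sims[np.asarray(edge_sections) == c].mean(axis=0)
    return sig
```

The sparse pattern described above corresponds to each row of this matrix being dominated by one column.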
Nonetheless, the general picture is that each section has a signature in terms of how similarity values are distributed across latent components.

This notion is further illustrated, in greater detail, in Fig. 3(b). While Fig. 3(a) depicts an averaged signature for each section, the scatterplot displays 2D embeddings, computed with the t-SNE algorithm, of each individual edge's signature: the 9-dimensional similarity values inferred with the 9 latent components. The embeddings are very well organized into 9 clusters, colored by section ID.

(a) Averaged component-wise similarity values of edges within each section  (b) Embedding of links, represented with component-wise similarity values  (c) Embedding of network nodes (documents), represented in LDA's topics

Figure 3: Edge component analysis. Representing network links with local similarity values reveals interesting structures, such as a nearly one-to-one correspondence between latent components and sections, as well as clusters. However, representing articles in LDA's topics does not reveal useful clustering structures from which links can be inferred. See text for details. Best viewed in color.

In contrast, embedding documents using their topic representations does not reveal clear clustering structures from which network links can be inferred. This is shown in Fig. 3(c), where each dot corresponds to a document and the low-dimensional coordinates are computed using t-SNE (symmetrized KL divergence between topics is used as the distance measure).
We observe that while topics themselves do not reveal intrinsic (network) structures, latent components are able to do so by applying highly specialized metrics to measure local similarities and yield characteristic signatures.

We also study whether the lack of an edge between a pair of dissimilar documents from different sections gives rise to characteristic signatures from the latent components. In summary, we do not observe such telltale signatures for those pairs. Detailed results are in the Suppl. Material.

4 Related Work

Our model learns multiple metrics, one for each latent component. However, the similarity (or associated dissimilarity) from our model is definitely non-metric, due to the complex combination. This stands in stark contrast to most metric learning algorithms [19, 8, 7, 18, 5, 11, 13, 17, 9].

[12] gives an information-theoretic definition of (non-metric) similarity as long as there is a probabilistic model for the data. Our approach of SCA focuses on the relationships between data points, not the data themselves. [16] proposes visualization techniques for non-metric similarity data.

Our work is reminiscent of probabilistic modeling of overlapping communities in social networks, such as the mixed membership stochastic blockmodels [3]. The key difference is that those works model vertices with a mixture of latent components (communities), whereas we model the interactions between vertices with a mixture of latent components. [2] studies a social network whose edge set is the union of multiple edge sets in hidden similarity spaces. Our work explicitly models the probabilistic process of combining latent components with a (noisy-)OR gate.

5 Conclusion

We propose Similarity Component Analysis (SCA) for probabilistic modeling of similarity relationships between pairwise data instances.
The key ingredient of SCA is to model similarity as a complex combination of multiple latent components, each giving rise to a local similarity value. SCA attains significantly better accuracies than existing methods on both classification and link prediction tasks.

Acknowledgements We thank the reviewers for extensive discussion and references on the topics of similarity and learning similarity. We plan to include them, as well as other suggested experiments, in a longer version of this paper. This research is supported by a USC Annenberg Graduate Fellowship (S.C.) and by IARPA via DoD/ARL contract # W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

References
[1] NIPS0-12 dataset. http://www.stats.ox.ac.uk/~teh/data.html.
[2] I. Abraham, S. Chechik, D. Kempe, and A. Slivkins. Low-distortion Inference of Latent Similarities from a Multiplex Social Network. CoRR, abs/1202.0922, 2012.
[3] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 9:1981–2014, June 2008.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic Metric Learning. In ICML, 2007.
[6] S. E. Fienberg, M. M. Meyer, and S. S. Wasserman. Statistical Analysis of Multiple Sociometric Relations. Journal of the American Statistical Association, 80(389):51–67, March 1985.
[7] A. Globerson and S. Roweis. Metric Learning by Collapsing Classes. In NIPS, 2005.
[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood Components Analysis. In NIPS, 2004.
[9] S. Hauberg, O. Freifeld, and M. Black. A Geometric Take on Metric Learning. In NIPS, 2012.
[10] T. S. Jaakkola and M. I. Jordan. Variational Probabilistic Inference and the QMR-DT Network. Journal of Artificial Intelligence Research, 10(1):291–322, May 1999.
[11] P. Jain, B. Kulis, I. Dhillon, and K. Grauman. Online Metric Learning and Fast Similarity Search. In NIPS, 2008.
[12] D. Lin. An Information-Theoretic Definition of Similarity. In ICML, 1998.
[13] S. Parameswaran and K. Weinberger. Large Margin Multi-Task Metric Learning. In NIPS, 2010.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[15] M. Szell, R. Lambiotte, and S. Thurner. Multirelational Organization of Large-scale Social Networks in an Online World. Proceedings of the National Academy of Sciences, 2010.
[16] L. van der Maaten and G. Hinton. Visualizing Non-Metric Similarities in Multiple Maps. Machine Learning, 33:33–55, 2012.
[17] J. Wang, A. Woznica, and A. Kalousis. Parametric Local Metric Learning for Nearest Neighbor Classification. In NIPS, 2012.
[18] K. Q. Weinberger and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244, 2009.
[19] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance Metric Learning, with Application to Clustering with Side-information. In NIPS, 2002.