{"title": "What Makes Objects Similar: A Unified Multi-Metric Learning Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1235, "page_last": 1243, "abstract": "Linkages are essentially determined by similarity measures that may be derived from multiple perspectives. For example, spatial linkages are usually generated based on localities of heterogeneous data, whereas semantic linkages can come from various properties, such as different physical meanings behind social relations. Many existing metric learning models focus on spatial linkages, but leave the rich semantic factors unconsidered. Similarities based on these models are usually overdetermined on linkages. We propose a Unified Multi-Metric Learning (UM2L) framework to exploit multiple types of metrics. In UM2L, a type of combination operator is introduced for distance characterization from multiple perspectives, and thus can introduce flexibilities for representing and utilizing both spatial and semantic linkages. Besides, we propose a uniform solver for UM2L which is guaranteed to converge. Extensive experiments on diverse applications exhibit the superior classification performance and comprehensibility of UM2L. Visualization results also validate its ability on physical meanings discovery.", "full_text": "What Makes Objects Similar:\n\nA Uni\ufb01ed Multi-Metric Learning Approach\n\nHan-Jia Ye\n\nYuan Jiang\n\nZhi-Hua Zhou\n\nDe-Chuan Zhan\nNational Key Laboratory for Novel Software Technology,\n\nXue-Min Si\n\nNanjing University, Nanjing, 210023, China\n\n{yehj,zhandc,sixm,jiangy,zhouzh}@lamda.nju.edu.cn\n\nAbstract\n\nLinkages are essentially determined by similarity measures that may be derived\nfrom multiple perspectives. For example, spatial linkages are usually generated\nbased on localities of heterogeneous data, whereas semantic linkages can come\nfrom various properties, such as different physical meanings behind social rela-\ntions. 
Many existing metric learning models focus on spatial linkages but leave the rich semantic factors unconsidered. Similarities based on these models are usually overdetermined on linkages. We propose a Unified Multi-Metric Learning (UM2L) framework to exploit multiple types of metrics. In UM2L, a type of combination operator is introduced to characterize distances from multiple perspectives, and thus introduces flexibility for representing and utilizing both spatial and semantic linkages. Besides, we propose a uniform solver for UM2L which is guaranteed to converge. Extensive experiments on diverse applications exhibit the superior classification performance and comprehensibility of UM2L. Visualization results also validate its ability to discover physical meanings.

1 Introduction

Similarities measure the closeness of connections between objects and are usually reflected by distances. Distance Metric Learning (DML) aims to learn an appropriate metric that can figure out the underlying linkages or connections, and thus can greatly improve the performance of similarity-based classifiers such as kNN.
Objects are linked with each other for different reasons. Global DML methods consider a deterministic single metric which measures similarities between all object pairs. Recently, investigations on local DML have considered locality-specific approaches, and consequently multiple metrics are learned. These metrics are either in charge of different spatial areas [15, 20] or responsible for each specific instance [7, 22]. Both global and local DML methods emphasize the linkage constraints (including must-link and cannot-link) in localities with univocal semantic meaning, e.g., the side information of class. However, there can be many different reasons for two instances to be similar in real-world applications [3, 9].
Linkages between objects can carry multiple latent semantics.
For example, in a social network, friendship linkages may be based on different hobbies of users. Although a user has many friends, the common hobbies shared with each of them can differ; as a consequence, one can be friends with others for different reasons. Another concrete example: articles on “A. Feature Learning” are closely related to both “B. Feature Selection” and “C. Subspace Models”, yet their connections differ in semantics. The linkage between A and B emphasizes “picking up some helpful features”, while the common semantic between A and C is about “extracting subspaces” or “feature transformation”. These phenomena clearly indicate ambiguities rather than a single meaning in linkage generation.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Hence, the distance/similarity measurements are overdetermined in these applications. As a consequence, a new type of multi-metric learner which can describe the ambiguous linkages is desired.
In this paper, we propose a Unified Multi-Metric Learning (UM2L) approach which integrates the consideration of linking semantic ambiguities and localities in one framework. In the training process, more than one metric is learned to measure distances between instances, and each of them reflects a type of inherent spatial or semantic property of objects. During the test, UM2L can automatically pick up or integrate these measurements, since semantically/spatially similar data points have small distances while dissimilar ones are pulled away from each other; such a mechanism enables adaptation to the environment to some degree, which is important for the development of learnwares [25].
Furthermore, the proposed framework can be easily adapted to different types of ambiguous circumstances: by specifying the mechanism of metric integration, various types of linkages in applications can be considered; by incorporating sparse constraints, UM2L also yields good visualization results reflecting physical meanings of latent linkages between objects; besides, by limiting the number of metrics or specifying the regularizer, the approach degenerates to some popular DML methods, such as MMLMNN [20]. Benefitting from an alternating strategy and stochastic techniques, the general framework can be optimized steadily and efficiently.
Our main contributions are: (I) a Unified Multi-Metric Learning framework considering both data localities and ambiguous semantic linkages; (II) a flexible framework adaptable to different tasks; (III) unified and efficient optimization solutions, with superior and interpretable results.
The rest of this paper starts with some notation. Then the UM2L framework is presented in detail, followed by a review of related work. The last parts are experiments and the conclusion.

2 The Unified Multi-Metric Framework

Generally speaking, the supervision information for Distance Metric Learning (DML) is formed as pairwise constraints or triplet sets. We restrict our discussion to the latter, $\mathcal{T} = \{x_t, y_t, z_t\}_{t=1}^T$, since it provides more local information. In each triplet, the target instance $y_t$ is more similar to $x_t$ than the impostor $z_t$, and $\{x_t, y_t, z_t\} \subset \mathbb{R}^d$. $\mathcal{S}_d$ and $\mathcal{S}_d^+$ are the sets of symmetric and positive semi-definite (PSD) matrices of size $d \times d$, respectively. $I$ is the identity matrix. The matrix Frobenius norm is $\|M\|_F = \sqrt{\mathrm{Tr}(M^\top M)}$. Let $m^i$ and $m_j$ denote the $i$-th row and $j$-th column of matrix $M$ respectively, and the $\ell_{2,1}$-norm is $\|M\|_{2,1} = \sum_{i=1}^d \|m^i\|_2$. The operator $[\cdot]_+ = \max(\cdot, 0)$ preserves the non-negative part of the input value. DML aims at learning a metric $M \in \mathcal{S}_d^+$ that makes similar instances have small distances to each other and dissimilar ones far apart. The (squared) Mahalanobis distance between the pair $(x_t, y_t)$ with metric $M$ can be denoted as:

$\mathrm{Dis}^2_M(x_t, y_t) = (x_t - y_t)^\top M (x_t - y_t) = \mathrm{Tr}(M A^t_{xy}).$   (1)

$A^t_{xy} = (x_t - y_t)(x_t - y_t)^\top \in \mathcal{S}_d^+$ is the outer product of the difference between instances $x_t$ and $y_t$. The distance in Eq. 1 assumes that there is a single type of relationship between object features, which uses univocal linkages between objects.
Multi-metric learning takes data heterogeneities into consideration. However, both the single metric learned by global DML and the multiple metrics learned with local methods focus on exploiting locality information, i.e., constraints or metrics are closely related to the localities. In particular, local DML approaches mainly aim at learning a set of multiple metrics, one for each local area. In this paper, a general multi-metric configuration is investigated to deal with linkage ambiguities from both semantic and locality perspectives. We denote the set of $K$ multiple metrics to be learned as $\mathcal{M}_K = \{M_1, M_2, \ldots, M_K\}$ with $\{M_k\}_{k=1}^K \subset \mathcal{S}_d^+$. The similarity score between a pair of instances based on $M_k$ can, w.l.o.g., be set as the negative distance, i.e., $f_{M_k}(x_t, y_t) = -\mathrm{Dis}^2_{M_k}(x_t, y_t)$. In the multi-metric scenario, consequently, there will be a set of multiple similarity scores $f_{\mathcal{M}_K} = \{f_{M_k}\}_{k=1}^K$. Each metric/score in the set reflects a particular semantic or spatial view of the data. The overall similarity score is $f^v(x_t, y_t) = \kappa^v(f_{\mathcal{M}_K}(x_t, y_t))$, $v \in \{1, 2\}$, where $\kappa^v(\cdot)$ is a functional operator closely related to concrete applications, which maps the set of similarity scores w.r.t.
all metrics to a single value. With these discussions, the Unified Multi-Metric Learning (UM2L) framework can be written as:

$\min_{\mathcal{M}_K} \frac{1}{T} \sum_{t=1}^T \ell\big(f^1(x_t, y_t) - f^2(x_t, z_t)\big) + \lambda \sum_{k=1}^K \Omega_k(M_k).$   (2)

The overall inter-instance similarities $f^1$ and $f^2$ are based on operators $\kappa^1$ and $\kappa^2$ respectively. $\ell(\cdot)$ is a convex loss function which encourages $(x_t, y_t)$ to have a larger overall similarity score than $(x_t, z_t)$. Note that although inter-instance similarities are defined on different metrics in $\mathcal{M}_K$, the convex loss function $\ell(\cdot)$ acts as a bridge and makes the similarities measured by different metrics comparable, as in [20]. The fact that triplet restrictions are provided without specifying concrete measurements makes it reasonable to use flexible $\kappa$s. For instance, in a social network, similar nodes only share some common interests (features) rather than consistently possessing all interests. Tendencies toward different types of hobbies can be reflected by various metrics. Therefore, the similarity scores may be calculated with different measurements, and the operator $\kappa^v$ takes charge of “selecting” or “integrating” the right base metric for measuring similarities. The choices of loss functions and $\kappa$s are substantial issues in this framework and will be described later. The convex regularizer $\Omega_k(M_k)$ can impose prior or structural information on the base metric $M_k$. $\lambda \geq 0$ is a balance parameter.

2.1 Choices for $\kappa$

UM2L takes both spatial and ambiguous semantic linkages into account based on the configuration of $\kappa$, which integrates or selects base metrics. As an integrator, in applications where locality-related multiple metrics are needed, $\kappa$ can be an RBF-like function which decreases as the distance increases.
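As an illustration of Eq. 2 (a minimal sketch, not the paper's released implementation: the smooth-hinge loss, Frobenius regularizer, and default max/max operators here are illustrative choices):

```python
import numpy as np

def mahalanobis_sq(M, x, y):
    """Squared Mahalanobis distance Dis^2_M(x, y) = (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

def smooth_hinge(a):
    """Smooth hinge: 1/2 - a for a <= 0, (1 - a)^2 / 2 for 0 < a < 1, 0 for a >= 1."""
    if a >= 1.0:
        return 0.0
    if a <= 0.0:
        return 0.5 - a
    return 0.5 * (1.0 - a) ** 2

def um2l_objective(metrics, triplets, kappa1=max, kappa2=max, lam=0.1):
    """Eq. (2): average triplet loss plus regularization over all base metrics."""
    loss = 0.0
    for (x, y, z) in triplets:
        f_sim = [-mahalanobis_sq(M, x, y) for M in metrics]  # f_{M_k}(x, y)
        f_dis = [-mahalanobis_sq(M, x, z) for M in metrics]  # f_{M_k}(x, z)
        loss += smooth_hinge(kappa1(f_sim) - kappa2(f_dis))
    reg = sum(np.sum(M * M) for M in metrics)  # ||M_k||_F^2 regularizer
    return loss / len(triplets) + lam * reg
```

Passing different `kappa1`/`kappa2` callables reproduces the similarity types discussed next.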
The locality determines the impact of each metric. When $\kappa$ acts as a selector, UM2L should automatically assign triplets to one of the metrics which can best explain instance similarity/dissimilarity. Besides, from the perspective of the loss function $\ell(\cdot)$, the selected $f$s form a comparable set of similarity measurements [17, 20]. In this case, we may implement the operator $\kappa$ by choosing the most remarkable base metric, i.e., the one making the pair of instances $x_t$ and $y_t$ most similar. The advantages of the selection mechanism are twofold. First, it reduces the impact of the initial triplet construction in localities [19]; second, it stresses the most evident semantic and reflects the consideration of ambiguous semantics in linkage construction. Choices of $\kappa$s heavily depend on concrete applications. $\kappa$ is actually a combiner and can take inspiration from ensemble methods [24]. Here, we mainly consider 4 different types of linkage based on various sets of $\kappa$s as follows.
Apical Dominance Similarity (ADS): named after the phenomenon in the auxanology of plants, where the most important term dominates the evaluation. In this case, $\kappa^1 = \kappa^2 = \max(\cdot)$, i.e., the maximum similarity among all similarities calculated with $\mathcal{M}_K$ on the similar pair $(x_t, y_t)$ should be larger than the maximum similarity of $(x_t, z_t)$. This corresponds to similar pairs being close to each other under at least one measurement, while dissimilar pairs are disconnected by all measurements. This type of linkage generation often occurs in social network applications, e.g., nodes are linked together for a portion of similar orientations, while nodes are unlinked because there are no common interests. By explicitly modeling each node in a social network as an instance, each of the base metrics $\{M_k\}_{k=1}^K$ can represent parts of the semantics in linkages.
Then the dissimilar pair in a triplet, e.g., a non-friendship relationship, should have small similarity scores over all of $\mathcal{M}_K$, while for the similar pair there should be at least one base similarity score with a high value, which reflects their common interests [3, 11].
One Vote Similarity (OVS): indicates the existence of a potential key metric in $\mathcal{M}_K$, i.e., either the similar or the dissimilar pair is judged by at least one key metric respectively, while the remaining metrics with other semantic meanings are ignored. In this case, $\kappa^1 = \max(\cdot)$ and $\kappa^2 = \min(\cdot)$. This type of similarity should usually be applied as an “interpreter” in domains such as image and video, which come with complicated semantics. The learned metrics reveal different latent concepts in objects. Note that simply applying OVS in UM2L with an inappropriate regularizer $\Omega$ will lead to a trivial solution, i.e., $M_k = 0$, which satisfies all similar-pair restrictions yet has no generalization ability. Therefore, we need to set $\Omega_k(M_k) = \|M_k - I\|_F^2$ or restrict the trace of $M_k$ to equal 1.
Rank Grouping Similarity (RGS): groups the pairs and makes the similar pairs rank higher than dissimilar ones. This is the most rigorous similarity and we also refer to it as One-Vote Veto Similarity (OV2S). In this case, $\kappa^1 = \min(\cdot)$ while $\kappa^2 = \max(\cdot)$, which regards a pair as dissimilar even when only one metric denies the linkage. This case is usually applied to applications where latent multiple views exist and different views are measured by different metrics in $\mathcal{M}_K$. In these applications, it is obviously required that all potential views reach consistency, and a weak conflict detected by one metric should also be punished by the RGS (OV2S) loss.
Average Case Similarity (ACS): treats all metrics in $\mathcal{M}_K$ equally, i.e., $\kappa^1 = \kappa^2 = \sum(\cdot)$. This is the general case when there is no prior knowledge about the application.
There are many derivatives of similarity where $\kappa^v$ is configured as $\min(\cdot)$, $\max(\cdot)$ and $\sum(\cdot)$. Furthermore, $\kappa^v$ can in fact take richer forms, and we leave the discussion of choosing different $\kappa$s to Section 3. Besides, multiple choices of the regularizer $\Omega_k(\cdot)$ can be made in the framework. As in most DML methods [14], $\Omega_k(M_k)$ can be set to $\|M_k\|_F^2$. Yet it can also incorporate more structural information, e.g., we can configure $\Omega_k(M_k) = \|M_k\|_{2,1}$, where the row/column sparsity filters influential features for composing linkages in a network; or $\Omega_k(M_k) = \mathrm{Tr}(M_k)$, which guarantees the low-rank property for all metrics. Due to the high applicability of the proposed framework, we name it UM2L (Unified Multi-Metric Learning).

2.2 General Solutions for UM2L

UM2L can be solved by alternating between the metrics $\mathcal{M}_K$ and the affiliation portion of each instance, when $\kappa$ is a piecewise linear operator such as $\max(\cdot)$ or $\min(\cdot)$. For example, in the case of ADS, the metric used to measure the similarity of the pair $(x_t, y_t)$ is decided by $k^t_{v,*} = \arg\max_k f_{M_k}(x_t, y_t)$, which is the index of the metric $M_k$ that has the largest similarity value over the pair. Once the dominating key metric of each instance is found, the whole optimization problem is convex w.r.t. each $M_k$, which can be easily optimized.
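The four linkage types differ only in which order statistics $\kappa^1$ and $\kappa^2$ apply, and the assignment step above is a simple argmax; a minimal sketch (helper names are illustrative, not from the paper):

```python
import numpy as np

# (kappa1, kappa2) pairs for the four similarity types of Section 2.1:
# ADS: max/max, OVS: max/min, RGS (OV2S): min/max, ACS: sum/sum.
KAPPAS = {
    "ADS": (max, max),
    "OVS": (max, min),
    "RGS": (min, max),
    "ACS": (sum, sum),
}

def assign_metric(metrics, x, y):
    """ADS-style affiliation: k* = argmax_k f_{M_k}(x, y),
    i.e. the base metric under which (x, y) looks most similar."""
    scores = [-(x - y) @ M @ (x - y) for M in metrics]
    return int(np.argmax(scores))
```

With the assignment fixed, each sub-problem in the alternating scheme is convex in its $M_k$.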
On account of the convexity of each sub-problem in the alternating approach, the whole objective is ensured to decrease over iterations and thus converges eventually. Notably, when dealing with a single triplet in a stochastic approach, convergence can be guaranteed as well, as stated in Theorem 1, which will be introduced later.
In the batch case, to facilitate the discussion, we can implement $\ell(\cdot)$ as the smooth hinge loss, i.e., $\ell(x) = [\frac{1}{2} - x]_+$ if $x \geq 1$ or $x \leq 0$, and $\ell(x) = \frac{1}{2}(1 - x)^2$ otherwise. If the trace norm $\Omega_k(M_k) = \mathrm{Tr}(M_k)$ is used, $\mathcal{M}_K$ can be solved with the accelerated projected gradient descent method. If the whole objective in Eq. 2 is denoted as $\mathcal{L}_{\mathcal{M}_K}$, the gradient w.r.t. one metric $M_k$ can be computed as:

$\frac{\partial \mathcal{L}_{\mathcal{M}_K}}{\partial M_k} = \frac{1}{T} \sum_{t \in \hat{\mathcal{T}}_k} \frac{\partial \ell\big(\mathrm{Tr}(M_{k^t_{2,*}} A^t_{xz}) - \mathrm{Tr}(M_{k^t_{1,*}} A^t_{xy})\big)}{\partial M_k} + \lambda I = \frac{1}{T} \sum_{t \in \hat{\mathcal{T}}_k} \nabla^t_{M_k}(a_t) + \lambda I,$   (3)

where the first part is a sum of gradients over the triplet subset $\hat{\mathcal{T}}_k$ whose membership indexes contain $k$, i.e., $\hat{\mathcal{T}}_k = \{t \mid k = k^t_{1,*} \text{ or } k = k^t_{2,*}\}$. The separated gradient $\nabla^t_{M_k}(a_t)$, with $a_t = \mathrm{Tr}(M_{k^t_{2,*}} A^t_{xz}) - \mathrm{Tr}(M_{k^t_{1,*}} A^t_{xy})$, for triplet $t \in \hat{\mathcal{T}}_k$ is:

$\nabla^t_{M_k}(a_t) = \begin{cases} 0 & \text{if } a_t \geq 1 \\ \delta(k = k^t_{1,*}) A^t_{xy} - \delta(k = k^t_{2,*}) A^t_{xz} & \text{if } a_t \leq 0 \\ \delta(k = k^t_{1,*})(1 - a_t) A^t_{xy} - \delta(k = k^t_{2,*})(1 - a_t) A^t_{xz} & \text{otherwise} \end{cases}$

$\delta(\cdot)$ is the Kronecker delta function, which contributes to the computation of the gradient when $\kappa^v$ is optimized by $M_k$. After accelerated gradient descent, a projection step is conducted to maintain the PSD property of each solution. If structured sparsity is stressed, the $\ell_{2,1}$-norm is used as a regularizer, i.e., $\Omega_k(M_k) = \|M_k\|_{2,1}$. FISTA [2] can be used to optimize the non-smooth regularizer efficiently: after a gradient descent step of size $\gamma$ on the smooth loss to get an intermediate solution $V_k = M_k - \gamma \frac{1}{T} \sum_{t \in \hat{\mathcal{T}}_k} \nabla^t_{M_k}(a_t)$, the following proximal sub-problem is solved to obtain a further update:

$M'_k = \arg\min_{M \in \mathcal{S}_d} \frac{1}{2}\|M - V_k\|_F^2 + \lambda \|M\|_{2,1}.$   (4)

The PSD property of $M_k$ can be ensured by a projection in each iteration, or can often be preserved by a last-step projection [14]. Hence, in Eq. 4, only the symmetry constraint on $M_k$ is imposed. Since the $\ell_{2,1}$-norm considers only a one-sided (row-wise) property of a matrix, Lim et al. [12] use an iterative symmetric projection to get a solution, which has a heavy computational cost in some cases. In a reweighted way, the proximal subproblem can be tackled efficiently by the following lemma.
Lemma 1 The proximal problem in Eq. 4 can be solved by alternately updating the diagonal matrices $D_1$ and $D_2$ and the symmetric matrix $M$:

$\Big\{D_{1,ii} = \frac{1}{2\|m^i\|_2},\; D_{2,ii} = \frac{1}{2\|m_i\|_2}\Big\}_{i=1}^d; \quad \mathrm{vec}(M) = \Big(I \otimes \big(I + \frac{\lambda}{2} D_1\big) + \frac{\lambda}{2} D_2 \otimes I\Big)^{-1} \mathrm{vec}(V_k),$

where $\mathrm{vec}(\cdot)$ is the vector form of a matrix and $\otimes$ denotes the Kronecker product. Due to the diagonal property of each term, the update of $M$ can be further simplified.¹

¹Detailed derivation and efficiency comparison are in the supplementary material.

The update of $M$ in Lemma 1 takes the row-wise and column-wise $\ell_2$-norms into consideration simultaneously, and usually converges in about 5∼10 iterations.
The batch solution for UM2L can benefit from the acceleration strategy [2]. The computational cost of a full gradient, however, sometimes becomes the dominant expense owing to the huge number of triplets.
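A minimal numerical sketch of a reweighted scheme in the spirit of Lemma 1 (the scaling constants here are my own and may differ from the paper's exact formulation; it targets the proximal problem of Eq. 4 over symmetric matrices):

```python
import numpy as np

def prox_l21_sym(V, lam, n_iter=10, eps=1e-8):
    """Approximate argmin_M 0.5*||M - V||_F^2 + lam*||M||_{2,1} over symmetric M
    by alternating between diagonal reweighting matrices and a linear solve."""
    d = V.shape[0]
    M = V.copy()
    I = np.eye(d)
    for _ in range(n_iter):
        row = np.linalg.norm(M, axis=1)  # ||m^i||_2 (rows)
        col = np.linalg.norm(M, axis=0)  # ||m_j||_2 (columns)
        D1 = np.diag(1.0 / (2.0 * np.maximum(row, eps)))
        D2 = np.diag(1.0 / (2.0 * np.maximum(col, eps)))
        # Solve M + lam*(D1 @ M + M @ D2) = V via the column-major vec identity
        # vec(A X B) = (B^T kron A) vec(X).
        A = np.kron(I, I + lam * D1) + np.kron(lam * D2, I)
        M = np.linalg.solve(A, V.flatten(order="F")).reshape(d, d, order="F")
        M = 0.5 * (M + M.T)  # keep the symmetry constraint
    return M
```

For a diagonal input the fixed point reduces to row-wise soft-thresholding, which matches the prox of the $\ell_{2,1}$-norm.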
Inspired by [6], we propose a stochastic solution, which manipulates one triplet in each iteration. In the $s$-th iteration, we sample a triplet $(x_s, y_s, z_s)$ uniformly and update the current solution set $\mathcal{M}^s_K = \{M^s_k\}_{k=1}^K$. The whole objective of the $s$-th iteration with $\mathcal{M}^s_K$ is:

$\mathcal{L}^s_{\mathcal{M}^s_K} = \ell\big(f^1(x_s, y_s) - f^2(x_s, z_s)\big) + \lambda \sum_{k=1}^K \Omega_k(M^s_k).$   (5)

Similar to the proximal gradient solution, after a (sub-)gradient descent step on the loss function in Eq. 5, the proximal operator can be utilized to update the base metrics $\{M^s_k\}_{k=1}^K$. The stochastic strategy is guaranteed to converge theoretically. Denoting $\mathcal{M}^*_K = (M^*_1, \ldots, M^*_K) \in \arg\min \sum_{s=1}^S \mathcal{L}^s(M^s_1, \ldots, M^s_K)$ as the optimal solution, given $S$ iterations in total, we have:

Theorem 1 Suppose in the UM2L framework the loss $\ell(\cdot)$ is convex and the selection operator $\kappa^v$ is in piecewise linear form. If each training instance satisfies $\|x\|_2 \leq 1$, the sub-gradient set of $\Omega_k(\cdot)$ is bounded by $R$, i.e., $\|\partial \Omega_k(M_k)\|_F^2 \leq R^2$, and the sub-gradient of the loss $\ell(\cdot)$ is bounded by $C$. When for each base metric² $\|M_k - M^*_k\|_F \leq D$, it holds that:³

$\sum_{s=1}^S \mathcal{L}^s_{\mathcal{M}^s_K} - \mathcal{L}^s_{\mathcal{M}^*_K} \leq (2GD + B)\sqrt{S}$

with $G^2 = \max(C^2, R^2)$ and $B = (\frac{D^2}{2} + 8G^2)$. Given the hinge loss, $C^2 = 16$.

3 Related Work and Discussions

Global DML approaches are devoted to finding a single metric for all instances [5, 20], while local DML approaches further take spatial data heterogeneities into consideration. Recently, different types of local metric approaches have been proposed, either assigning a cluster-specific metric to each instance based on locality [20] or constructing local metrics generatively [13] or discriminatively [15, 18]. Furthermore, instance-specific metric learning methods [7, 22] push the locality properties of linkages to the extreme and gain improved classification performance. However, these DML methods, either global or local, take a univocal semantic from the label, namely, the side information.
Richness of semantics has been noticed and exploited by machine learning researchers [3, 11]. In the DML community, PSD [9] and SCA [4] were proposed. PSD works as collective classification, which is less related to UM2L. SCA, a multi-metric learning method based on pairwise constraints, focuses on learning metrics under one specific type of ambiguity, i.e., linkages with competitive semantic meanings. UM2L is a more general multi-metric learning framework which considers triplet constraints and various kinds of ambiguous linkages from both locality and semantic views.
UM2L maintains good compatibility and can degenerate to several state-of-the-art DML methods. For example, by considering a univocal semantic ($K = 1$), we obtain the global metric learning model used in [14]. If we further choose the hinge loss and set the regularizer $\Omega(M) = \mathrm{Tr}(MB)$ with $B$ an intra-class similar-pair covariance matrix, UM2L degrades to LMNN [20]. With the trace norm on $M$, [10] is recovered. For multi-metric approaches, if we set $\kappa^v$ as the indicator of the class of the second instance in a similar or dissimilar pair, UM2L can be transformed to MMLMNN [20].

4 Experiments on Different Types of Applications

Due to the different choices of $\kappa$ in UM2L, we test the framework in diverse real applications, namely social linkage/feature pattern discovery, classification, physical semantic meaning distinguishing, and visualization of multi-view semantic detection.
To simplify the discussion, we use the alternating batch solver, the smooth hinge loss, and the regularizer $\Omega_k(M_k) = \|M_k\|_{2,1}$ unless stated otherwise. Triplets are constructed with 3 targets and 10 impostors using Euclidean nearest neighbors.

²This condition generally holds according to the norm regularizer in the objective function.
³Detailed proof can be found in the supplementary material.

4.1 Comparisons on Social Linkage/Feature Pattern Discovering

The ADS configuration is designed for social linkage and pattern discovery. To validate the effectiveness of UM2L_ADS, we test it on social network data and synthetic data to show its grouping ability on linkages and features, respectively.
Social linkages come from 6 real-world Facebook network datasets from [11]. Given the friendship circles of an ego user and users' binary features, the goal of ego-user linkage discovery is to utilize the overall linkage and figure out how users are grouped. We form instances by taking the absolute value of differences between the features of the ego and the others. After circles with < 5 nodes are removed, K is configured as the number of remaining circles. Pairwise distances are computed by each metric in $\mathcal{M}_K$, and a threshold is tuned on the training set to filter out irrelevant users. Thus, users with different common hobbies are grouped together. MAC detects group assignments based on binary features [8]; SCA constructs user linkages in a probabilistic way; and EGO [11] can directly output user circles. KMeans (KM) and Spectral Clustering (SC) directly group users based on their features without using linkages. Performance is measured by the Balanced Error Rate (BER) [11], the lower the better. Results are listed in Table 1, which shows UM2L_ADS performs the best on most datasets.

Table 1: BER of the linkage discovery comparisons on Facebook datasets: UM2L_ADS vs.
others

BER (lower is better)   KM     SC     MAC    SCA    EGO    UM2L
Facebook_348            .669   .669   .730   .847   .426   .405
Facebook_414            .721   .721   .699   .870   .449   .420
Facebook_686            .637   .637   .681   .772   .446   .391
Facebook_698            .661   .661   .640   .729   .392   .420
Facebook_1684           .807   .807   .767   .844   .491   .465
Facebook_3980           .708   .708   .541   .667   .538   .402

Table 2: BER of the feature pattern discovery comparisons on synthetic datasets: UM2L_ADS vs. others

BER (lower is better)   KM     SC     SCA    EGO    UM2L
syn1                    .382   .382   .467   .392   .355
syn2                    .564   .564   .428   .399   .323
ad                      .670   .670   .583   .400   .381
ccd                     .244   .244   .225   .250   .071
my_movie                .370   .370   .347   .249   .155
reuters                 .704   .704   .609   .400   .398

Similarly, we test the feature pattern discovery ability of UM2L_ADS on 4 transformed multi-view datasets. For each dataset, we first extract the principal components of each view, and construct sub-linkage candidates between instances with random thresholds on each single view. Thus, these candidates vary among different views. After that, the overall linkage is further generated from these candidates using an “or” operation. With the features on each view and the overall linkage, the goal of feature pattern discovery is to reveal the responsible features for each sub-linkage. Zero-valued rows/columns of the learned metrics indicate irrelevant features in the corresponding group. Syn1 and syn2 are purely synthetic datasets with features sampled from Uniform, Beta, Binomial, Gamma and Normal distributions with different parameters. BER results are listed in Table 2 and UM2L_ADS achieves the best on all datasets.
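For reference, the BER criterion of Tables 1 and 2 averages the error rates on members and non-members of a group; a minimal sketch of the per-group computation (the matching between predicted and ground-truth groups follows [11] and is omitted here):

```python
def balanced_error_rate(true_set, pred_set, universe):
    """BER = 0.5 * (fraction of true members missed
                  + fraction of non-members wrongly included)."""
    pos = len(true_set)
    neg = len(universe) - pos
    fn = len(true_set - pred_set)   # missed members
    fp = len(pred_set - true_set)   # wrongly included non-members
    return 0.5 * (fn / pos + fp / neg)
```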
These assessments indicate that UM2L_ADS can figure out reasonable linkages or patterns hidden behind observations, even better than domain-specific methods.

4.2 Comparisons on Classification Performance

To test classification generalization performance, our framework is compared with 8 state-of-the-art metric learning methods on 10 benchmark datasets and 8 large-scale datasets (results on the 8 large-scale datasets are in the supplementary material). In detail: global DML methods ITML [5], LMNN [20] and EIG [21]; local and instance-specific DML methods PLML [18], SCML (local version) [15], MMLMNN [20], ISD [22] and SCA [4].
In UM2L, distance values from different metrics are comparable. Therefore, in the test phase, we first compute the 3 nearest neighbors of a test instance $\tilde{x}$ using each base metric $M_k$. Then the $3 \times K$ distance values are collected adaptively and the smallest 3 (the 3 instances with the highest similarity scores) form the neighbor candidates. Majority voting over them is used for prediction.
Evaluations on classification are repeated 30 times. In each trial, 70% of the instances are used for training and the remainder for testing. Cross-validation is employed for parameter tuning. Generalization errors (mean±std.) based on 3NN are listed in Table 3, where Euclidean distance results (EUCLID) are also listed as a baseline. Considering the multi-semantic description ability of ADS and the rigorous restrictions of RGS, UM2L_ADS and UM2L_RGS are implemented in this comparison. The number of metrics K is configured as the number of classes. Table 3 clearly shows that UM2L_ADS and UM2L_RGS perform well on most datasets. In particular, UM2L_RGS achieves the best results on more datasets according to t-tests, which can be attributed to the rigorous restrictions of RGS.

Table 3: Comparisons of classification performance (test errors, mean ± std.) based on 3NN. UM2L_ADS and UM2L_RGS are compared.
The best (lowest) mean error on each dataset is marked with *. The last two rows list the Win/Tie/Lose counts of UM2LADS and UM2LRGS against the other methods over all datasets, with t-tests at the 95% significance level.

Dataset   UM2LADS     UM2LRGS     EUCLID      ITML        LMNN        EIG         PLML        SCML        MMLMNN      ISD         SCA
Autompg   .201±.034*  .225±.031   .265±.048   .253±.026   .256±.032   .288±.033   .286±.037   .292±.032   .259±.037   .266±.031   .260±.036
Clean1    .070±.018*  .086±.020   .098±.027   .100±.027   .097±.022   .143±.023   .306±.072   .141±.024   .084±.021   .127±.021   .139±.023
German    .281±.019   .284±.030   .280±.016*  .302±.021   .289±.019   .297±.017   .292±.023   .288±.021   .292±.021   .284±.014   .296±.021
Glass     .312±.043   .293±.047*  .389±.050   .328±.054   .296±.047   .334±.050   .529±.053   .311±.038   .315±.049   .314±.050   .307±.042
Hayes-r   .276±.044*  .307±.068   .436±.201   .296±.053   .282±.062   .378±.093   .379±.068   .342±.080   .314±.072   .289±.067   .398±.046
Heart-s   .190±.035   .194±.063   .365±.127   .205±.040   .191±.037   .192±.036   .203±.039   .186±.032*  .200±.026   .189±.034   .190±.030
House-v   .051±.015   .048±.013*  .121±.240   .066±.019   .055±.017   .072±.024   .174±.075   .063±.023   .061±.017   .080±.024   .083±.025
Liver-d   .363±.045   .342±.047*  .361±.055   .371±.042   .372±.045   .364±.042   .408±.011   .377±.052   .373±.045   .380±.037   .384±.040
Segment   .023±.038*  .029±.034   .041±.031   .041±.008   .036±.006   .063±.009   .324±.043   .050±.012   .039±.006   .059±.016   .050±.007
Sonar     .136±.032   .132±.036*  .171±.048   .193±.045   .157±.038   .182±.038   .220±.040   .174±.039   .145±.032   .159±.042   .168±.036

UM2LADS vs. others (W/T/L)   -   -   8/2/0   6/4/0   8/2/0   6/4/0   5/5/0   6/4/0   7/3/0   4/6/0   7/3/0
UM2LRGS vs. others (W/T/L)   -   -   6/4/0   8/2/0   8/2/0   8/2/0   8/2/0   7/3/0   8/2/0   5/5/0   9/1/0

Figure 1: Word clouds generated from the results of the compared DML methods. Panels: (a) LMNN; (b, c) PLML 1-2; (d-f) MMLMNN 1-3; (g-l) UM2L 1-6. The size of a word depends on the importance weight of that word (feature). The weight is calculated by decomposing each metric Mk = LkLk⊤ and taking the ℓ2-norm of each row of Lk, where each row corresponds to a specific word. Each subplot gives the word cloud for one base metric learned by a DML approach.

4.3 Comparisons of Latent Semantic Discovery

UM2L is proposed for DML with both localities and semantic linkages considered. Hence, to investigate its ability of latent semantic discovery, two assessments in real applications are performed, i.e., Academic Paper Linkages Explanation (APLE) and Image Weak Label Discovering (IWLD).

In APLE, data are collected from 2012-2015 ICML papers, which can be connected with each other by more than one topic, yet only the session ID is captured to form explicit linkages. Three main session directions are selected in this assessment, i.e., "feature learning", "online learning", and "deep learning"; no sub-fields or additional labels/topics are provided. Simple TF-IDF is used to extract features, forming a corpus of 220 papers and 1622 words in total. Aiming at finding the hidden linkages together with their causes, both UM2LADS and UM2LOVS are invoked.
To avoid trivial solutions, the regularizer for each metric is configured as Ωk(Mk) = ‖Mk − I‖²_F for UM2LOVS. All feature (word) weights and correlations can be read off from the learned metrics: with the decomposition Mk = LkLk⊤, the ℓ2-norm of each row of Lk can be regarded as the weight of the corresponding feature (word). These weights are demonstrated in the word clouds in Fig. 1, where the font size reflects the weight of each word. Due to page limits, the full evaluations are presented in the supplementary material.

Fig. 1 shows the results of LMNN [20] (a), PLML [18] (b, c), MMLMNN [20] (d, e, f), and UM2LOVS (g-l) with K = 6. As a global method, LMNN returns a single subplot. The metric learned by LMNN may have discriminative ability, but its word weights cannot distinguish the subfields within the 3 selected domains. The multi-metric approaches PLML and MMLMNN provide more than one base metric and consequently multiple word clouds, but the words presented in their subplots carry no legible physical semantic meanings. In particular, PLML outputs multiple metrics that are similar to each other (tending toward a global learner's behavior) and focus only on the first part of the alphabet, while MMLMNN by default learns a number of base metrics equal to the number of classes. In contrast, the results of UM2LOVS clearly reveal all 3 fields. For the session "online learning", it discovers different sub-fields such as "online convex optimization" (g and h) and the "online (multi-)armed bandit problem" (j); for the session "feature learning", it finds "feature score" (i) and "PCA projection" (l); and for "deep learning", the word cloud returns popular words like "network layer", "autoencoder", and "layer" (k).

Figure 2: Results of visual semantic discovery on images. Panels: (a) (sea, mountains); (b) (mountains, sea); (c) (sea, sunset). The first annotation in each bracket is the provided weak label; the second is one of the latent semantic labels discovered by UM2L.

Figure 3: Subspaces discovered by UM2LADS (a: ADS subspace 1, b: ADS subspace 2) and UM2LRGS (c: RGS subspace). Instances possess 2 semantic properties, i.e., color and shape. Blue dotted lines give the decision boundaries.

Besides APLE, the second application concerns weak label discovering in images from [23], where the most obvious label of each image is used for triplet constraint generation. UM2LOVS can obtain multiple metrics, each of which carries a certain visual semantic. By computing similarities under the different metrics, latent semantics can be discovered: if we assume that images connected with high similarities share the same label, missing labels can be completed as in Fig. 2. More weak label results can be found in the supplementary material.

4.4 Investigations of Latent Multi-View Detection

Another direct application of UM2L is hidden multi-view detection, where data can be described by multiple views from different channels, yet the feature partitions are not clearly provided [16]. Such multi-view data are consistent with the assumptions of the ADS and RGS configurations: ADS emphasizes the existence of relevant views and aims at decomposing helpful aspects or views, while RGS requires full accordance among views. A trace-norm regularizer is used in this part to obtain low-dimensional projections.
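This low-dimensional projection step can be sketched in NumPy. It is a minimal illustration under the assumption that each learned base metric Mk is a symmetric positive semi-definite matrix; the helper names below are ours, not part of the UM2L code:

```python
import numpy as np

def top2_subspace(M):
    """Orthogonal basis (d x 2): the eigenvectors of the symmetric
    metric M associated with its two largest eigenvalues."""
    vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return vecs[:, [-1, -2]]        # leading eigenvector first

def project_2d(X, M):
    """Project the rows of X (n x d) onto the 2-D subspace of metric M
    for visualization."""
    return X @ top2_subspace(M)

# Toy example: a metric concentrated on the first two coordinates keeps
# them and (almost) discards the third.
M = np.diag([3.0, 2.0, 0.1])
Y = project_2d(np.eye(3), M)  # shape (3, 2)
```

Plotting the rows of Y, one scatter plot per base metric, produces the kind of per-metric 2-D visualization shown in Fig. 3.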
The UM2L framework also facilitates the understanding of data by decomposing each base metric into a low-dimensional subspace: for each base metric Mk, the matrix Lk ∈ Rd×2 of eigenvectors corresponding to the 2 largest eigenvalues is picked as an orthogonal basis.

The hidden multi-view data [1] are composed of 200 instances, and each instance has two hidden views, namely color and shape. We perform UM2LADS and UM2LRGS on this dataset with K = 2. Results of other methods such as SCA can be found in the supplementary material. Fig. 3 (a) and (b) give the 2-D visualization results obtained by plotting the projected instances in the subspaces corresponding to metrics M1 and M2 of UM2LADS. They clearly show that M1 captures the semantic view of color, while M2 reflects the meaning of shape. For UM2LRGS, the visualization result for one of the obtained metrics is shown in Fig. 3 (c). Clearly, both UM2LADS and UM2LRGS capture the two different semantic views hidden in the data; moreover, since UM2LRGS requires more accordance among views, it can capture these physical meanings with a single metric.

5 Conclusion

In this paper, we propose the Unified Multi-Metric Learning (UM2L) framework, which can exploit side information from multiple aspects, such as locality and semantic linkage constraints. Notably, both types of constraints can be absorbed into the multi-metric loss functions through a flexible function operator κ in UM2L. By implementing κ in different forms, UM2L can be used for local metric learning in classification, latent semantic linkage discovery, etc., or can degenerate to state-of-the-art DML approaches. The regularizer in UM2L is also flexible for different purposes. UM2L can be solved by various optimization techniques, such as proximal gradient and accelerated stochastic approaches, and a theoretical guarantee of convergence is proved.
Experiments show the superiority of UM2L in classification performance and hidden semantic discovery. Automatic determination of the number of base metrics is an interesting direction for future work.

Acknowledgements This research was supported by NSFC (61273301, 61333014), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Tencent Fund.

References

[1] E. Amid and A. Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In ICML, pages 1472–1480, Lille, France, 2015.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIIMS, 2(1):183–202, 2009.

[3] D. Chakrabarti, S. Funiak, J. Chang, and S. Macskassy. Joint inference of multiple label types in large networks. In ICML, pages 874–882, Beijing, China, 2014.

[4] S. Changpinyo, K. Liu, and F. Sha. Similarity component analysis. In NIPS, pages 1511–1519. MIT Press, Cambridge, MA., 2013.

[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, Corvalis, OR., 2007.

[6] J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. JMLR, 10:2899–2934, 2009.

[7] E. Fetaya and S. Ullman. Learning local invariant Mahalanobis distances. In ICML, pages 162–168, Paris, France, 2015.

[8] M. Frank, A. P. Streich, D. Basin, and J. M. Buhmann. Multi-assignment clustering for boolean data. JMLR, 13:459–489, 2012.

[9] J.-H. Hu, D.-C. Zhan, X. Wu, Y. Jiang, and Z.-H. Zhou. Pairwised specific distance learning from physical linkages. TKDD, 9(3):Article 20, 2015.

[10] K. Huang, Y. Ying, and C. Campbell. GSML: A unified framework for sparse metric learning. In ICDM, pages 189–198, Miami, FL., 2009.

[11] J. Leskovec and J. Mcauley. Learning to discover social circles in ego networks.
In NIPS, pages 539–547. MIT Press, Cambridge, MA., 2012.

[12] D. Lim, G. Lanckriet, and B. McFee. Robust structural metric learning. In ICML, pages 615–623, Atlanta, GA., 2013.

[13] Y.-K. Noh, B.-T. Zhang, and D. Lee. Generative local metric learning for nearest neighbor classification. In NIPS, pages 1822–1830. MIT Press, Cambridge, MA., 2010.

[14] Q. Qian, R. Jin, S. Zhu, and Y. Lin. Fine-grained visual categorization via multi-stage metric learning. In CVPR, pages 3716–3724, Boston, MA., 2015.

[15] Y. Shi, A. Bellet, and F. Sha. Sparse compositional metric learning. In AAAI, pages 2078–2084, Quebec, Canada, 2014.

[16] W. Wang and Z.-H. Zhou. A new analysis of co-training. In ICML, pages 1135–1142, Haifa, Israel, 2010.

[17] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, and Z. Tu. Unsupervised metric fusion by cross diffusion. In CVPR, pages 2997–3004, Providence, RI., 2012.

[18] J. Wang, A. Kalousis, and A. Woznica. Parametric local metric learning for nearest neighbor classification. In NIPS, pages 1601–1609. MIT Press, Cambridge, MA., 2012.

[19] J. Wang, A. Woznica, and A. Kalousis. Learning neighborhoods for metric learning. In ECML/PKDD, pages 223–236, Bristol, UK, 2012.

[20] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.

[21] Y. Ying and P. Li. Distance metric learning with eigenvalue optimization. JMLR, 13:1–26, 2012.

[22] D.-C. Zhan, M. Li, Y.-F. Li, and Z.-H. Zhou. Learning instance specific distances using metric propagation. In ICML, pages 1225–1232, Montreal, Canada, 2009.

[23] M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.

[24] Z.-H. Zhou. Ensemble methods: foundations and algorithms.
Chapman & Hall/CRC, Boca Raton, FL., 2012.

[25] Z.-H. Zhou. Learnware: On the future of machine learning. Frontiers of Computer Science, 10(4):589–590, 2016.