{"title": "An Online Algorithm for Large Scale Image Similarity Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 306, "page_last": 314, "abstract": "Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity may not scale to large datasets with high dimensionality, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. 
This suggests an approach for learning a metric from data that is larger by an order of magnitude than was handled before.", "full_text": "An Online Algorithm for Large Scale Image Similarity Learning\n\nGal Chechik\nGoogle\nMountain View, CA\ngal@google.com\n\nVarun Sharma\nGoogle\nBengalooru, Karnataka, India\nvasharma@google.com\n\nUri Shalit\nICNC, The Hebrew University\nIsrael\nuri.shalit@mail.huji.ac.il\n\nSamy Bengio\nGoogle\nMountain View, CA\nbengio@google.com\n\nAbstract\n\nLearning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands at the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. 
This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.\n\n1 Introduction\n\nLearning a pairwise similarity measure from data is a fundamental task in machine learning. Pair distances underlie classification methods like nearest neighbors and kernel machines, and similarity learning has important applications for "query-by-example" in information retrieval. For instance, a user may wish to find images that are similar to (but not identical copies of) an image she has; a user watching an online video may wish to find additional videos about the same subject. In all these cases, we are interested in finding a semantically-related sample, based on the visual content of an image, in an enormous search space. Learning a relatedness function from examples could be a useful tool for such tasks.\n\nA large number of previous studies of similarity learning have focused on metric learning, as in the case of a positive semidefinite matrix that defines a Mahalanobis distance [19]. However, similarity learning algorithms are often evaluated in the context of ranking [16, 5]. When the amount of training data available is very small, adding positivity constraints to enforce metric properties is useful for reducing overfitting and improving generalization. However, when sufficient data is available, as in many modern applications, adding positive semi-definiteness constraints is very costly, and their benefit in terms of generalization may be limited. With this view, we take here an approach that avoids imposing positivity or symmetry constraints on the learned similarity measure.\n\nSome similarity learning algorithms assume that the available training data contains real-valued pairwise similarities or distances. Here we focus on a weaker supervision signal: the relative similarity of different pairs [4]. 
This signal is also easier to obtain: here we extract similarity information from pairs of images that share a common label or are retrieved in response to a common text query in an image search engine.\n\nThe current paper presents an approach for learning semantic similarity that scales to datasets up to two orders of magnitude larger than those of current published approaches. Three components are combined to make this approach fast and scalable. First, our approach uses an unconstrained bilinear similarity: given two images p_1 and p_2, we measure similarity through a bilinear form p_1^T W p_2, where the matrix W is not required to be positive, or even symmetric. Second, we use a sparse representation of the images, which allows similarities to be computed very quickly. Finally, the training algorithm that we developed, OASIS (Online Algorithm for Scalable Image Similarity learning), is an online dual approach based on the passive-aggressive algorithm [2]. It minimizes a large margin target function based on the hinge loss, and converges to high quality similarity measures after being presented with only a small fraction of the training pairs.\n\nWe find that OASIS is both fast and accurate at a wide range of scales: for a standard benchmark with thousands of images, it achieves better or comparable results than existing state-of-the-art methods, with computation times that are shorter by an order of magnitude. For web-scale datasets, OASIS can be trained on more than two million images within three days on a single CPU. On this large scale dataset, human evaluations of the OASIS learned similarity show that 35% of the ten nearest neighbors of a given image are semantically relevant to that image.\n\n2 Learning Relative Similarity\n\nWe consider the problem of learning a pairwise similarity function S, given supervision on the relative similarity between two pairs of images. 
The algorithm is designed to scale well with the number of samples and the number of features, by using fast online updates and a sparse representation.\n\nFormally, we are given a set of images P, where each image is represented as a vector p ∈ R^d. We assume that we have access to an oracle that, given a query image p_i ∈ P, can locate two other images, p_i^+ ∈ P and p_i^- ∈ P, such that p_i^+ is more relevant to p_i than p_i^-. Formally, we could write that relevance(p_i, p_i^+) > relevance(p_i, p_i^-). However, unlike methods that assume that a numerical value of the similarity is available, relevance(p_i, p_j) ∈ R, we use this weaker form of supervision, and only assume that some pairs of images can be ranked by their relevance to a query image p_i. The relevance measure could reflect that the relevant image p_i^+ belongs to the same class of images as the query image, or reflect any other semantic property of the images.\n\nOur goal is to learn a similarity function S_W(p_i, p_j), parameterized by W, that assigns higher similarity scores to the pairs of more relevant images (with a safety margin):\n\nS(p_i, p_i^+) > S(p_i, p_i^-) + 1 ,   for all p_i, p_i^+, p_i^- ∈ P .   (1)\n\nIn this paper, we consider a parametric similarity function that has a bilinear form,\n\nS_W(p_i, p_j) ≡ p_i^T W p_j   (2)\n\nwith W ∈ R^{d×d}. Importantly, if the image vectors p_i ∈ R^d are sparse, namely, the number of non-zero entries k_i ≡ ||p_i||_0 is small, k_i ≪ d, then the value of the score defined in Eq. (2) can be computed very efficiently even when d is large. Specifically, S_W can be computed with complexity O(k_i k_j), regardless of the dimensionality d.\n\nTo learn a scoring function that obeys the constraints in Eq. (1), we define a global loss L_W that accumulates hinge losses over all possible triplets in the training set, L_W ≡ Σ_{(p_i, p_i^+, p_i^-) ∈ P^3} l_W(p_i, p_i^+, p_i^-), with the loss for a single triplet being l_W(p_i, p_i^+, p_i^-) ≡ max(0, 1 - S_W(p_i, p_i^+) + S_W(p_i, p_i^-)).\n\nTo minimize the global loss L_W, we propose an algorithm that is based on the Passive-Aggressive family of algorithms [2]. First, W is initialized to the identity matrix, W_0 = I_{d×d}. Then, the algorithm iteratively draws a random triplet (p_i, p_i^+, p_i^-) and solves the following convex problem with a soft margin:\n\nW_i = argmin_W (1/2)||W - W_{i-1}||_Fro^2 + Cξ   s.t.   l_W(p_i, p_i^+, p_i^-) ≤ ξ  and  ξ ≥ 0   (3)\n\nwhere ||·||_Fro is the Frobenius norm (point-wise L2 norm). At the i-th iteration, W_i is updated to optimize a trade-off between staying close to the previous parameters W_{i-1} and minimizing the loss on the current triplet l_W(p_i, p_i^+, p_i^-). The aggressiveness parameter C controls this trade-off.\n\nTo solve the problem in Eq. (3) we follow the derivation in [2]. When l_W(p_i, p_i^+, p_i^-) = 0, it is clear that W_i = W_{i-1} satisfies Eq. (3) directly. Otherwise, we define the Lagrangian\n\nL(W, τ, ξ, λ) = (1/2)||W - W_{i-1}||_Fro^2 + Cξ + τ(1 - ξ - p_i^T W(p_i^+ - p_i^-)) - λξ   (4)\n\nwhere τ ≥ 0 and λ ≥ 0 are the Lagrange multipliers. The optimal solution is obtained when the gradient vanishes, ∂L(W, τ, ξ, λ)/∂W = W - W_{i-1} - τV_i = 0, where V_i is the gradient matrix at the current step, V_i = ∂l_W/∂W = [p_i^1(p_i^+ - p_i^-), . . . , p_i^d(p_i^+ - p_i^-)]^T (here p_i^k denotes the k-th entry of p_i). 
When image vectors are sparse, the gradient V_i is also sparse, hence the update step costs only O(||p_i||_0 × (||p_i^+||_0 + ||p_i^-||_0)), where the L0 norm ||x||_0 is the number of nonzero values in x. Differentiating the Lagrangian with respect to ξ we obtain ∂L(W, τ, ξ, λ)/∂ξ = C - τ - λ = 0, which, knowing that λ ≥ 0, means that τ ≤ C. Plugging back into the Lagrangian in Eq. (4), we obtain L(τ) = -(1/2)τ^2 ||V_i||^2 + τ(1 - p_i^T W_{i-1}(p_i^+ - p_i^-)). Finally, taking the derivative of this second Lagrangian with respect to τ and using τ ≤ C, we obtain\n\nW_i = W_{i-1} + τV_i ,   τ = min{ C , l_{W_{i-1}}(p_i, p_i^+, p_i^-) / ||V_i||^2 } .   (5)\n\nThe optimal update for the new W therefore has the form of a gradient descent step with a step size τ that can be computed exactly. Applying this algorithm to classification tasks was shown to yield a small cumulative online loss, and selecting the best W_i during training using a hold-out validation set was shown to achieve good generalization [2].\n\nIt should be emphasized that OASIS is not guaranteed to learn a parameter matrix that is positive, or even symmetric. We study variants of OASIS that enforce symmetry or positivity in Sec. 4.3.\n\n3 Related Work\n\nLearning similarity using relative relevance has been intensively studied, and a few recent approaches aim to address learning at a large scale. For small-scale data, there are two main groups of similarity learning approaches. The first approach, learning Mahalanobis distances, can be viewed as learning a linear projection of the data into another space (often of lower dimensionality), where a Euclidean distance is defined among pairs of objects. 
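To make this connection concrete (our illustration, not from the paper): a Mahalanobis distance with PSD matrix M = L^T L is exactly a squared Euclidean distance after the linear projection x ↦ Lx:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
L = rng.random((2, d))       # learned linear projection (often lower-dimensional)
M = L.T @ L                  # the induced PSD Mahalanobis matrix

x, y = rng.random((2, d))
diff = x - y
d_mahal = diff @ M @ diff                 # (x - y)^T M (x - y)
d_euclid = np.sum((L @ x - L @ y) ** 2)   # squared Euclidean distance after projection
assert np.isclose(d_mahal, d_euclid)      # the two are identical
```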
Such approaches include Fisher's Linear Discriminant Analysis (LDA), relevant component analysis (RCA) [1], supervised global metric learning [18], large margin nearest neighbor (LMNN) [16], and metric learning by collapsing classes (MCML) [5]. Other constraints, like sparseness, are sometimes induced over the learned metric [14]. See also the review in [19] for more details.\n\nThe second family of approaches, learning kernels, is used to improve the performance of kernel-based classifiers. Learning a full kernel matrix in a non-parametric way is prohibitive except for very small data sets. As an alternative, several studies suggested learning a weighted sum of pre-defined kernels [11], where the weights are learned from data. In some applications this was shown to be inferior to uniform weighting of the kernels [12]. The work in [4] further learns a weighting over local distance functions for every image in the training set. Non-linear image similarity learning was also studied in the context of dimensionality reduction, as in [8].\n\nFinally, Jain et al. [9] (based on Davis et al. [3]) aim to learn metrics in an online setting. This work is among the closest to OASIS: it learns online a linear model of a [dis-]similarity function between documents (images); the main difference is that Jain et al. [9] try to learn a true distance, imposing positive definiteness constraints, which makes the algorithm more complex and more constrained. We argue in this paper that in the large scale regime, imposing these constraints throughout could be detrimental.\n\nQuery image | Top 5 relevant images retrieved by OASIS\n\nTable 1: OASIS: Successful cases from the web dataset. The relevant text queries for each image are shown beneath the image (not used in training).\n\nLearning a semantic similarity function between images was also studied in [13]. 
There, semantic similarity is learned by representing each image by the posterior probability distribution over a predefined set of semantic tags, and then computing the distance between two images as the distance between the two underlying posterior distributions. The representation size of each image therefore grows with the number of semantic classes.\n\n4 Experiments\n\nWe tested OASIS on two datasets spanning a wide range of scales. First, we tested its scalability on 2.7 million images collected from the web. Then, to quantitatively compare the precision of OASIS with other, small-scale metric-learning methods, we tested OASIS using Caltech-256, a standard machine vision benchmark.\n\nImage representation. We use a sparse representation based on bags of visual words [6]. These features were systematically tested and found to outperform other features in related tasks, but the details of the visual representation are outside the focus of this paper. Broadly speaking, features are extracted by dividing each image into overlapping square blocks, representing each block by edge and color histograms, and finding the nearest block in a predefined set (dictionary) of d = 10,000 vectors of such features. An image is thus represented by the number of times each dictionary visual word was present in it, yielding vectors in R^d with an average of 70 non-zero values.\n\nEvaluation protocol. We evaluated the performance of all algorithms using precision-at-top-k, a standard ranking precision measure based on nearest neighbors. For each query image in the test set, all other test images were ranked according to their similarity to the query image; the number of same-class images among the top k images (the k nearest neighbors) is computed, and then averaged across test images. 
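A minimal sketch of this protocol (the function and variable names are ours):

```python
import numpy as np

def precision_at_top_k(S, labels, k):
    """Mean precision-at-top-k over all queries.

    S[i, j] is the similarity of test image j to query image i;
    labels[i] is the class of image i. For each query, all *other*
    test images are ranked by similarity, and the fraction of
    same-class images among the k nearest neighbors is averaged
    across queries.
    """
    n = len(labels)
    precisions = []
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                 # exclude the query itself
        top_k = np.argsort(-sims)[:k]     # indices of the k nearest neighbors
        precisions.append(np.mean(labels[top_k] == labels[i]))
    return float(np.mean(precisions))

# toy example: two classes, within-class similarities dominate
S = np.array([[9., 5., 1., 1.],
              [5., 9., 1., 1.],
              [1., 1., 9., 5.],
              [1., 1., 5., 9.]])
labels = np.array([0, 0, 1, 1])
precision_at_top_k(S, labels, k=1)   # each query's nearest neighbor is same-class
```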
We also calculated the mean average precision (mAP), a measure that is widely used in the information retrieval community.\n\n4.1 Web-Scale Experiment\n\nWe first tested OASIS on a set of 2.7 million images scraped from the Google image search engine. We collected a set of ~150K anonymized text queries, and for each of these queries, we had access to a set of relevant images. To compute an image-image relevance measure, we first obtained measures of relevance between images and text queries. This was achieved by collecting anonymized clicks over images collected from the set of text queries. We used these query-image click counts C(query, image) to compute the (unnormalized) probability that two images are co-queried, Relevance(image, image) = C^T C. The relevance matrix was then thresholded to keep only the top 1 percent of values. We trained OASIS on a training set of 2.3 million images, and tested performance on 0.4 million images. The number of training iterations (each corresponding to sampling one triplet) was selected using a second validation set of around 20,000 images, over which the performance saturated after 160 million iterations. Overall, training took a total of ~4000 minutes on a single CPU of a standard modern machine.\n\nTable 1 shows the top five images as ranked by OASIS on two examples of query-images in the test set. In these examples, OASIS captures similarity that goes beyond visual appearance: most top ranked images are about the same concept as the query image, even though that concept was never provided in textual form, and is inferred in the viewer's mind ("dog", "snow"). 
This shows that learning similarity across co-queried images can indeed capture the semantics of queries even if the queries are not explicitly used during training.\n\nTo obtain a quantitative evaluation of the ranking obtained by OASIS, we created an evaluation benchmark by asking human evaluators to mark whether a set of candidate images were semantically relevant to a set of 25 popular image queries. For each query image, evaluators were presented with the top-10 images ranked by OASIS, mixed with 10 random images. Given the relevance rankings from 30 evaluators, we computed the precision of each OASIS rank as the fraction of people that marked each image as relevant to the query image. On average across all queries and evaluators, OASIS rankings yielded a precision of ~40% at the top 10 ranked images.\n\nAs an estimate of an "upper bound" on the difficulty of the task, we also computed the precision obtained by human evaluators: for every evaluator, we used the rankings of all other evaluators as ground truth to compute that evaluator's precision. As with the ranks of OASIS, we computed the fraction of evaluators that marked an image as relevant, and repeated this separately for every query and human evaluator, providing a measure of "coherence" per query. Fig. 1(a) shows the mean precision obtained by OASIS and by human evaluators for every query in our data. For some queries OASIS achieves precision that is very close to that of the mean human evaluator. 
In many cases OASIS achieves precision that is as good as or better than that of some evaluators.\n\n[Figure 1: two panels. (a) Per-query precision of human evaluators and of OASIS, queries sorted by precision. (b) Runtime (log scale, 9 sec to ~190 days) vs. number of images (log scale, 600 to 2.3M) for fast LMNN (MNIST, 10 categories), its projected second-order polynomial extrapolation, and OASIS (web data).]\n\nFigure 1: (a) Precision of OASIS and human evaluators, per query, using rankings of all (remaining) human evaluators as ground truth. (b) Comparison of the runtime of OASIS and fast-LMNN [17] over a wide range of scales. LMNN results (on MNIST data) are faster than OASIS results on subsets of the web data. However, LMNN scales quadratically with the number of samples, hence is three times slower on 60K images, and may be infeasible for handling 2.3 million images.\n\nWe further studied how the runtime of OASIS scales with the size of the training set. Figure 1(b) shows that the runtime of OASIS, as found by early stopping on a separate validation set, grows linearly with the training set size. We compare this to the fastest result we found in the literature, based on a fast implementation of LMNN [17]. The LMNN algorithm scales quadratically with the number of objects, although their experiments with MNIST data show that the active set of constraints grows linearly. 
This could be because MNIST has only 10 classes.\n\n[Figure 2: three panels, (a) 10 classes, (b) 20 classes, (c) 50 classes (without MCML); each plots precision against the number of neighbors for OASIS, MCML, LEGO, LMNN, the Euclidean baseline and random chance.]\n\nFigure 2: Comparison of the performance of OASIS, LMNN, MCML, LEGO and the Euclidean metric in feature space. Each curve shows the precision at top k as a function of k neighbors. The results are averaged across 5 train/test partitions (40 training images, 25 test images per class); error bars are the standard error of the means (s.e.m.); the black dashed line denotes chance performance.\n\n4.2 Caltech256 Dataset\n\nTo compare OASIS with small-scale methods we used the Caltech256 dataset [7], containing images collected from Google image search and from PicSearch.com. Images were assigned to 257 categories and evaluated by humans in order to ensure image quality and relevance. After pre-processing the images and filtering out images that were too small, we were left with 29,461 images in 256 categories. To allow comparisons with methods that were not optimized for sparse representation, we also reduced the block vocabulary size d from 10,000 to 1,000.\n\nWe compared OASIS with the following metric learning methods. (1) Euclidean - the standard Euclidean distance in feature space (equivalent to using the identity matrix W = I_{d×d}). (2) MCML [5] - learning a Mahalanobis distance such that same-class samples are mapped to the same point, formulated as a convex problem. 
(3) LMNN [16] - learning a Mahalanobis distance that aims to have the k nearest neighbors of a given sample belong to the same class, while separating different-class samples by a large margin. As a preprocessing phase, images were projected to a basis of the principal components (PCA) of the data, with no dimensionality reduction. (4) LEGO [9] - online learning of a Mahalanobis distance using a Log-Det regularized per-instance loss, which is guaranteed to yield a positive semidefinite matrix. We used a variant of LEGO that, like OASIS, learns from relative distances.1\n\nWe tested all methods on subsets of classes taken from the Caltech256 repository. For OASIS, images from the same class were treated as similar. Each subset was built such that it included semantically diverse categories, controlled for classification difficulty. We tested sets containing 10, 20 and 50 classes, each spanning the range of difficulties.\n\nWe used two levels of 5-fold cross validation, one to train the model, and a second to select the hyperparameters of each method (early stopping time for OASIS; the ω parameter for LMNN, ω ∈ {0.125, 0.25, 0.5}; and the regularization parameter η for LEGO, η ∈ {0.02, 0.08, 0.32}). Results reported below were obtained by selecting the best value of the hyperparameter and then training again on the full training set (40 images per class).\n\nFigure 2 compares the precision obtained with OASIS with that of the four competing approaches. OASIS achieved consistently superior results throughout the full range of k (number of neighbors) tested, and on all four sets studied. LMNN performance on the training set was often high, suggesting that it overfits the training set, as was also observed sometimes by [16].\n\nTable 2 shows the total CPU time in minutes for training all compared algorithms, on four subsets of classes of sizes 10, 20, 50 and 249. 
Data is not given when the runtime was longer than 5 days or performance was worse than the Euclidean baseline. For the purpose of a fair comparison, we tested two implementations of OASIS: the first was fully implemented in Matlab; the second had the core loop of the algorithm implemented in C and called from Matlab. All other methods used code supplied by the authors, implemented in Matlab with core parts implemented in C. Due to compatibility issues, fast-LMNN was run on a different machine, and the given times are rescaled to the same time scale as all other algorithms. LEGO is fully implemented in Matlab. All other code was compiled (mex) to C. The C implementation of OASIS is significantly faster, since Matlab does not use the potential speedup gained by sparse images.\n\n1We have also experimented with the methods of [18], which we found to be too slow, and with RCA [1], whose precision was lower than that of other methods. These results are not included in the evaluations below.\n\nTable 2: Runtime (minutes) on a standard CPU of all compared methods\n\nnum classes | OASIS (Matlab) | OASIS (Matlab+C) | MCML (Matlab+C) | LEGO (Matlab) | LMNN (Matlab+C) | fastLMNN (Matlab+C)\n10 | 42 ± 15 | 0.12 ± .03 | 1835 ± 210 | 143 ± 44 | 337 ± 169 | 247 ± 209\n20 | 45 ± 8 | 0.15 ± .02 | 7425 ± 106 | 533 ± 49 | 631 ± 40 | 365 ± 62\n50 | 25 ± 2 | 1.60 ± .04 | - | 711 ± 28 | 960 ± 80 | 2109 ± 67\n249 | 485 ± 113 | 1.13 ± .15 | - | - | - | -\n\nOASIS is significantly faster, with a runtime that is shorter by orders of magnitude than that of MCML even on small sets, and about one order of magnitude shorter than that of LMNN. The run time of OASIS and LEGO was measured until the point of early stopping. OASIS memory requirements grow quadratically with the size of the dictionary. 
For a large dictionary of 10K, the parameter matrix takes 100M floats, or 0.4 gigabytes of memory.\n\n[Figure 3: (a) precision vs. number of neighbors for OASIS, Proj-Oasis, Online-Proj-Oasis, Dissim-Oasis, the Euclidean baseline and random chance; (b) mean average precision (roughly 0.18-0.22) over 250K learning steps, for projection every 5,000 steps, every 50,000 steps, and only after training completes.]\n\nFigure 3: (a) Comparing symmetric variants of OASIS on the 20-class subset; similar results were obtained with other sets. (b) mAP along training for three PSD projection schemes.\n\n4.3 Symmetry and positivity\n\nThe similarity matrix W learned by OASIS is not guaranteed to be positive or even symmetric. Some applications, like ranking images by semantic relevance to a given image query, are known to be non-symmetric when based on human judgement [15]. However, in some applications symmetry or positivity constraints reflect prior knowledge that may help in avoiding overfitting. We now discuss variants of OASIS that learn symmetric or positive matrices.\n\n4.3.1 Symmetric similarities\n\nA simple approach to enforcing symmetry is to project the OASIS model W onto the set of symmetric matrices, W' = sym(W) = (1/2)(W^T + W). Projection can be done after each update (denoted Online-Proj-Oasis) or after learning is completed (Proj-Oasis). Alternatively, the asymmetric score function S_W(p_i, p_j) in l_W can be replaced with a symmetric score,\n\nS'_W(p_i, p_j) ≡ -(p_i - p_j)^T W (p_i - p_j) ,   (6)\n\nwhich is used to derive an OASIS-like algorithm (which we name Dissim-Oasis). The optimal update for this loss has a symmetric gradient, V'_i = (p_i - p_i^+)(p_i - p_i^+)^T - (p_i - p_i^-)(p_i - p_i^-)^T. Therefore, if W_0 is initialized with a symmetric matrix (e.g., the identity), all W_i are guaranteed to remain symmetric. Dissim-Oasis is closely related to LMNN [16]. This can be seen by casting the batch objective of LMNN into an online setup, which has the form err(W) = -ω · S'_W(p_i, p_i^+) + (1 - ω) · l'_W(p_i, p_i^+, p_i^-). This online version of LMNN becomes equivalent to Dissim-Oasis for ω = 0.\n\nFigure 3(a) compares the precision of the different symmetric variants with the original OASIS. All symmetric variants performed slightly worse than, or equal to, the original asymmetric OASIS. The precision of Proj-Oasis was equivalent to that of OASIS, most likely because asymmetric OASIS actually converged to an almost-symmetric model (as measured by a symmetry index ρ(W) = ||sym(W)||^2 / ||W||^2 = 0.94).\n\n4.3.2 Positive similarity\n\nMost similarity learning approaches focus on learning metrics. In the context of OASIS, when W is positive semi-definite (PSD), it defines a Mahalanobis distance over the images. The matrix square root of W, A^T A = W, can then be used to project the data into a new space, in which the Euclidean distance is equivalent to the W distance in the original space.\n\nWe experimented with positive variants of OASIS, in which we repeatedly projected the learned model onto the set of PSD matrices, once every t iterations. Projection is done by taking the eigen-decomposition W = V · D · V^T, where V is the eigenvector matrix and D is the diagonal eigenvalue matrix limited to positive eigenvalues. Figure 3(b) traces precision on the test set throughout learning for various values of t.\n\nThe effect of positive projections is complex. 
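For reference, the projection step just described (eigen-decompose, keep only the non-negative spectrum) can be sketched as a small numpy helper (a hypothetical helper with our naming; we symmetrize first, since the eigen-decomposition of a possibly asymmetric W is complex-valued, a detail the text above leaves implicit):

```python
import numpy as np

def project_psd(W):
    """Project W onto the PSD cone: eigen-decompose W = V D V^T and
    clip negative eigenvalues to zero. The input is symmetrized first
    (eigh requires a symmetric matrix, and the symmetric part carries
    the quadratic form)."""
    W_sym = 0.5 * (W + W.T)
    eigvals, eigvecs = np.linalg.eigh(W_sym)
    eigvals = np.clip(eigvals, 0.0, None)   # keep only the non-negative spectrum
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# toy usage: one positive and one negative eigen-direction
W = np.array([[2.0, 0.0],
              [0.0, -1.0]])
W_psd = project_psd(W)   # the negative direction is zeroed out
```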
First, continuously projecting at every step helps to reduce overfitting, as can be observed by the slower decline of the blue curve (upper smooth curve) compared to the orange curve (lowest curve). However, when projection is performed only after many steps (instead of continuously), the projected model actually outperforms the continuous-projection model (upper jittery curve). The likely reason for this effect is that estimating the positive sub-space is very noisy when based on only a few samples. Indeed, accurate estimation of the negative subspace is known to be a hard problem, in that the estimation error of eigenvalues near zero is relatively large. We found this effect to be so strong that the optimal projection strategy is to avoid projection throughout learning completely. Instead, projecting onto the PSD cone after learning (namely, after a model was chosen using early stopping) provided the best performance in our experiments.\n\nAn interesting alternative for obtaining a PSD matrix was explored by [10, 9]. Using the LogDet divergence between two matrices, D_ld(X, Y) = tr(XY^{-1}) - log det(XY^{-1}) - d, ensures that, given an initial PSD matrix, all subsequent matrices will be PSD as well. It would be interesting to test the effect of using LogDet regularization in the OASIS setup.\n\n5 Discussion\n\nWe have presented OASIS, a scalable algorithm for learning image similarity that captures both semantic and visual aspects of image similarity. Three key factors contribute to the scalability of OASIS. First, using a large margin online approach allows training to converge even after seeing only a small fraction of the potential pairs. Second, the objective function of OASIS does not require the similarity measure to be a metric during training, although it appears to converge to a near-symmetric solution, whose positive projection is a good metric. 
Finally, we use a sparse representation of low-level features, which allows similarity scores to be computed very efficiently.

OASIS learns a class-independent model: it is not aware of which queries or categories were shared by two similar images. As such, it is more limited in its descriptive power, and it is likely that class-dependent similarity models could improve precision. On the other hand, class-independent models can generalize to classes that were not observed during training, as in transfer learning. Large-scale similarity learning, applied to images from a large variety of classes, could therefore be a useful tool for addressing real-world problems with a large number of classes.

This paper focused on the training part of metric learning. To use the learned metric for ranking, an efficient procedure for scoring a large set of images is needed. Techniques based on locality-sensitive hashing could be used to speed up evaluation, but this is outside the scope of this paper.

References

[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In Proc. of the 20th International Conference on Machine Learning (ICML), pages 11–18, 2003.

[2] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7:551–585, 2006.

[3] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML 24, pages 209–216, 2007.

[4] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision, pages 1–8, 2007.

[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. NIPS, 18:451, 2006.

[6] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries.
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 30(8):1371–1384, 2008.

[7] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, CalTech, 2007.

[8] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2006.

[9] P. Jain, B. Kulis, I. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In NIPS, volume 22, 2008.

[10] B. Kulis, M. A. Sustik, and I. S. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10:341–376, 2009.

[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.

[12] W. S. Noble. Multi-kernel learning for biology. In NIPS Workshop on Kernel Learning, 2008.

[13] N. Rasiwasia and N. Vasconcelos. A study of query by semantic example. In 3rd International Workshop on Semantic Learning and Applications in Multimedia, 2008.

[14] R. Rosales and G. Fung. Learning sparse metrics via linear programming. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–373. ACM, New York, NY, USA, 2006.

[15] A. Tversky. Features of similarity. Psychological Review, 84(4):327–352, 1977.

[16] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. NIPS, 18:1473, 2006.

[17] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML 25, pages 1160–1167, 2008.

[18] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K.
Obermayer, editors, NIPS 15, pages 521–528, Cambridge, MA, 2003. MIT Press.

[19] L. Yang. Distance metric learning: A comprehensive survey. Technical report, Michigan State Univ., 2006.