{"title": "Local Similarity-Aware Deep Feature Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 1262, "page_last": 1270, "abstract": "Existing deep embedding methods in vision tasks are capable of learning a compact Euclidean space from images, where Euclidean distances correspond to a similarity metric. To make learning more effective and efficient, hard sample mining is usually employed, with samples identified through computing the Euclidean feature distance. However, the global Euclidean distance cannot faithfully characterize the true feature similarity in a complex visual feature space, where the intraclass distance in a high-density region may be larger than the interclass distance in low-density regions. In this paper, we introduce a Position-Dependent Deep Metric (PDDM) unit, which is capable of learning a similarity metric adaptive to local feature structure. The metric can be used to select genuinely hard samples in a local neighborhood to guide the deep embedding learning in an online and robust manner. The new layer is appealing in that it is pluggable to any convolutional networks and is trained end-to-end. Our local similarity-aware feature embedding not only demonstrates faster convergence and boosted performance on two complex image retrieval datasets, its large margin nature also leads to superior generalization results under the large and open set scenarios of transfer learning and zero-shot learning on ImageNet 2010 and ImageNet-10K datasets.", "full_text": "Local Similarity-Aware Deep Feature Embedding\n\nChen Huang\n\nChen Change Loy\n\nXiaoou Tang\n\nDepartment of Information Engineering, The Chinese University of Hong Kong\n\n{chuang,ccloy,xtang}@ie.cuhk.edu.hk\n\nAbstract\n\nExisting deep embedding methods in vision tasks are capable of learning a compact\nEuclidean space from images, where Euclidean distances correspond to a similarity\nmetric. 
To make learning more effective and efficient, hard sample mining is usually employed, with samples identified through computing the Euclidean feature distance. However, the global Euclidean distance cannot faithfully characterize the true feature similarity in a complex visual feature space, where the intraclass distance in a high-density region may be larger than the interclass distance in low-density regions. In this paper, we introduce a Position-Dependent Deep Metric (PDDM) unit, which is capable of learning a similarity metric adaptive to local feature structure. The metric can be used to select genuinely hard samples in a local neighborhood to guide the deep embedding learning in an online and robust manner. The new layer is appealing in that it is pluggable into any convolutional network and is trained end-to-end. Our local similarity-aware feature embedding not only demonstrates faster convergence and boosted performance on two complex image retrieval datasets, but its large-margin nature also leads to superior generalization results under the large and open set scenarios of transfer learning and zero-shot learning on the ImageNet 2010 and ImageNet-10K datasets.

1 Introduction

Deep embedding methods aim at learning a compact feature embedding f(x) ∈ R^d from an image x using a deep convolutional neural network (CNN). They have been increasingly adopted in a variety of vision tasks such as product visual search [1, 14, 29, 33] and face verification [13, 27]. 
The embedding objective is usually in a Euclidean sense: the Euclidean distance Di,j = ‖f(xi) − f(xj)‖2 between two feature vectors should preserve their semantic relationship, encoded pairwise (by a contrastive loss [1]), in triplets [27, 33], or through even higher-order relationships (e.g., by a lifted structured loss [29]).

It is widely observed that an effective data sampling strategy is crucial to ensure the quality and learning efficiency of deep embedding, as there are often many more easy examples than meaningful hard ones. Selecting overly easy samples can in practice lead to slow convergence and poor performance, since many of them satisfy the constraint well and give nearly zero loss, without exerting any effect on parameter update during back-propagation [3]. Hence hard example mining [7] becomes an indispensable step in state-of-the-art deep embedding methods. These methods usually choose hard samples by computing the convenient Euclidean distance in the embedding space. For instance, in [27, 29], hard negatives with small Euclidean distances are found online in a mini-batch. An exception is [33], where an online reservoir importance sampling scheme is proposed to sample discriminative triplets by relevance scores. Nevertheless, these scores are computed offline with different hand-crafted features and distance metrics, which is suboptimal.

We question the effectiveness of using a single, global Euclidean distance metric for finding hard samples, especially for real-world vision tasks that exhibit complex feature variations due to pose, lighting, and appearance. 
As shown in the fine-grained bird image retrieval example in Figure 1(a), the diversity of feature patterns learned for each class throughout the feature space can easily lead to a larger intraclass Euclidean distance than the interclass distance. Such a heterogeneous feature distribution yields a highly overlapped similarity score distribution for the positive and negative pairs, as shown in the left chart of Figure 1(b). We observed a similar phenomenon for the global Mahalanobis metric [11, 12, 21, 35, 36] in our experiments. It is not difficult to see that using a single, global metric would easily mislead the hard sample mining. To circumvent this issue, Cui et al. [3] resorted to human intervention for harvesting genuinely hard samples.

Mitigating the aforementioned issue demands an improved metric that is adaptive to the local feature structure. In this study we wish to learn a locally adaptive similarity metric online, which will be exploited to search for high-quality hard samples in a local neighborhood to facilitate more effective deep embedding learning. 

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: (a) 2-D feature embedding (by t-SNE [19]) of the CUB-200-2011 [32] test set. The intraclass distance can be larger than the interclass distance under the global Euclidean metric, which can mislead the hard sample mining and consequently the deep embedding learning. We propose a PDDM unit that incorporates the absolute position (i.e., the feature mean, denoted by the red triangle) to adapt the metric to the local feature structure. (b) The overlapped similarity distribution under the Euclidean metric (similarity scores are transformed from distances by a Sigmoid-like function) vs. the well-separated distribution under PDDM. (c) PDDM-guided hard sample mining and embedding learning.

Our key challenges lie in the formulation of a new layer and loss function that jointly consider similarity metric learning, hard sample selection, and deep embedding learning. Existing studies [13, 14, 27, 29, 33] only consider the latter two objectives, not all three together. To this end, we propose a new Position-Dependent Deep Metric (PDDM) unit for similarity metric learning. It is readily pluggable into an existing deep embedding CNN and trained with it end-to-end. We formulate the PDDM such that it learns a locally adaptive metric (unlike the global Euclidean metric), through a non-linear regression on both the absolute feature difference and the feature mean (which encodes absolute position) of a data pair. As depicted in the right chart of Figure 1(b), the proposed metric yields a similarity score distribution that is more distinguishable than that of the conventional Euclidean metric. As shown in Figure 1(c), hard samples are mined from the resulting similarity score space and used to optimize the feature embedding space in a seamless manner. The similarity metric learning in PDDM and the embedding learning in the associated CNN are jointly optimized using a novel large-margin double-header hinge loss.

Image retrieval experiments on two challenging real-world vision datasets, CUB-200-2011 [32] and CARS196 [15], show that our local similarity-aware feature embedding significantly outperforms state-of-the-art deep embedding methods that lack the online metric learning and the associated hard sample mining scheme. Moreover, the proposed approach incurs a far lower computational cost and encourages faster convergence than structured embedding methods (e.g., [29]), which need to compute a fully connected dense matrix of pairwise distances in a mini-batch. We further demonstrate that our learned embedding is generalizable to new classes in large open set scenarios. 
This is validated in the transfer learning and zero-shot learning (using the ImageNet hierarchy as auxiliary knowledge) tasks on the ImageNet 2010 and ImageNet-10K [5] datasets.

Figure 2: (a) The overall network architecture. All CNNs have shared architectures and parameters. (b) The PDDM unit.

2 Related work

Hard sample mining in deep learning: Hard sample mining is a popular technique used in computer vision for training robust classifiers. The method aims at progressively augmenting a training set with false positives of the model learned so far. It is the core of many successful vision solutions, e.g., pedestrian detection [4, 7]. In a similar spirit, contemporary deep embedding methods [27, 29] choose hard samples in a mini-batch by computing the Euclidean distance in the embedding space. For instance, Schroff et al. [27] selected online the semi-hard negative samples with relatively small Euclidean distances. Wang et al. [33] proposed an online reservoir importance sampling algorithm to sample triplets by relevance scores, which are computed offline with different distance metrics. Similar studies on image descriptor learning [28] and unsupervised feature learning [34] also select hard samples according to the Euclidean distance-based losses in their respective CNNs. 
We argue in this paper that the global Euclidean distance is a suboptimal similarity metric for hard sample mining, and propose a locally adaptive metric for better mining.

Metric learning: An effective similarity metric is at the core of hard sample mining. Euclidean distance is the simplest similarity metric, and it is widely used by current deep embedding methods, where Euclidean feature distances directly correspond to similarity. Similarities can be encoded pairwise with a contrastive loss [1] or in more flexible triplets [27, 33]. Song et al. [29] extended this to even higher-order similarity constraints by lifting the pairwise distances within a mini-batch to the dense matrix of pairwise distances. Beyond the Euclidean metric, one can turn to the parametric Mahalanobis metric instead. Representative works [12, 36] minimize the Mahalanobis distance between positive sample pairs while maximizing the distance between negative pairs. Alternatives directly optimize the Mahalanobis metric for nearest neighbor classification via Neighbourhood Component Analysis (NCA) [11], Large Margin Nearest Neighbor (LMNN) [35], or Nearest Class Mean (NCM) [21]. However, the common drawback of the Mahalanobis and Euclidean metrics is that they are both global and far from ideal in the presence of a heterogeneous feature distribution (see Figure 1(a)). An intuitive remedy would be to learn multiple metrics [9], which would however be computationally expensive. Xiong et al. [37] proposed a single adaptive metric using absolute position information in random forest classifiers. Our approach shares a similar intuition, but incorporates the position information through a deep CNN in a more principled way, and can jointly learn similarity-aware deep features instead of using hand-crafted ones as in [37].

3 Local similarity-aware deep embedding

Let X = {(xi, yi)} be an imagery dataset, where yi is the class label of image xi. 
Our goal is to jointly learn a deep feature embedding f(x) from image x into a feature space R^d, and a similarity metric Si,j = S(f(xi), f(xj)) ∈ R^1, such that the metric can robustly select hard samples online to learn a discriminative, similarity-aware feature embedding. Ideally, the learned features (f(xi), f(xj)) from the set of positive pairs P = {(i, j) | yi = yj} should be close to each other with a large similarity score Si,j, while the learned features from the set of negative pairs N = {(i, j) | yi ≠ yj} should be pushed far apart with a small similarity score. Importantly, this relationship should hold independent of the (heterogeneous) feature distribution in R^d, where a global metric Si,j can fail. To adapt Si,j to the latent structure of the feature embeddings, we propose a Position-Dependent Deep Metric (PDDM) unit that can be trained end-to-end, see Figure 2(b).

The overall network architecture is shown in Figure 2(a). First, we use PDDM to compute similarity scores for the mini-batch samples during a particular forward pass. The scores are used to select one hard quadruplet from the local sets of positive pairs P̂ ⊂ P and negative pairs N̂ ⊂ N in the batch. Then each sample in the hard quadruplet is separately fed into four identical CNNs with shared parameters W to extract d-dimensional features. Finally, a discriminative double-header hinge loss is applied to both the similarity scores and the feature embeddings. This enables us to jointly optimize the two such that they benefit each other. We provide the details in the following.

3.1 PDDM learning and hard sample mining

Given a feature pair (fW(xi), fW(xj)) extracted from images xi and xj by an embedding function fW(·) parameterized by W, we wish to obtain an ideal similarity score yi,j = 1 if (i, j) ∈ P, and yi,j = 0 if (i, j) ∈ N. Hence, we seek the optimal similarity metric S*(·,·) from an appropriate function space H, together with the optimal feature embedding parameters W*:

(S*(·,·), W*) = argmin_{S(·,·)∈H, W} (1 / |P ∪ N|) Σ_{(i,j)∈P∪N} l(S(fW(xi), fW(xj)), yi,j),   (1)

where l(·) is some loss function. We omit the parameters W of f(·) in the following for brevity.

Adapting to local feature structure. 
The standard Euclidean or Mahalanobis metric can be seen as a special form of the function S(·,·) that is based solely on the feature difference vector u = |f(xi) − f(xj)| or its linearly transformed version. These metrics are suboptimal in a heterogeneous embedding space, and thus can easily fail in the search for genuinely hard samples. On the contrary, the proposed PDDM leverages the absolute feature position to adapt the metric throughout the embedding space. Specifically, inspired by [37], apart from the feature difference vector u, we additionally incorporate the feature mean vector v = (f(xi) + f(xj))/2 to encode the absolute position. Unlike [37], we formulate a principled learnable similarity metric from u and v in our CNN.

Formally, as shown in Figure 2(b), we first normalize the features f(xi) and f(xj) onto the unit hypersphere, i.e., ‖f(x)‖2 = 1, in order to maintain feature comparability. Such normalized features are used to compute the relative and absolute positions encoded in u and v, each followed by a fully connected layer, an elementwise ReLU nonlinearity σ(ξ) = max(0, ξ), and again ℓ2-normalization r(x) = x/‖x‖2. To treat u and v differently, the fully connected layers applied to them are not shared, parameterized by Wu ∈ R^{d×d}, bu ∈ R^d and Wv ∈ R^{d×d}, bv ∈ R^d, respectively. The nonlinearities ensure that the model is not trivially equivalent to a mapping from f(xi) and f(xj) themselves. We then concatenate the mapped u′ and v′ vectors and pass them through another fully connected layer, parameterized by Wc ∈ R^{2d×d}, bc ∈ R^d, and the ReLU function, and finally map to a score Si,j = S(f(xi), f(xj)) ∈ R^1 via Ws ∈ R^{d×1}, bs ∈ R^1. 
To summarize:

u = |f(xi) − f(xj)|,    v = (f(xi) + f(xj))/2,
u′ = r(σ(Wu u + bu)),    v′ = r(σ(Wv v + bv)),
c = σ(Wc [u′; v′] + bc),    Si,j = Ws c + bs.   (2)

In this way, we transform the search for the similarity metric function S(·,·) into the joint learning of CNN parameters (Wu, Wv, Wc, Ws, bu, bv, bc, bs for the PDDM unit, and W for the feature embeddings). These parameters collectively define a flexible nonlinear regression function for the similarity score.

Double-header hinge loss. To optimize all these CNN parameters, we could choose a standard regression loss function l(·), e.g., the logistic regression loss. Alternatively, we could cast the problem as binary classification, as in [37]. In both cases, however, the CNN is prone to overfitting, because the supervisory binary similarity labels yi,j ∈ {0, 1} tend to independently push the scores towards two single points, while in practice the similarity scores of positive and negative pairs live on a 1-D manifold following some distribution pattern on heterogeneous data, as illustrated in Figure 1(b). This motivates us to design a loss function l(·) that separates the similarity distributions, instead of acting in an independent pointwise way that is noise-sensitive. One intuitive option is to impose the Fisher criterion [20] on the similarity scores, i.e., maximizing the ratio between the interclass and intraclass scatters of the scores. In our 1-D case, this reduces to maximizing (μP − μN)² / (VarP + VarN), where μ and Var are the mean and variance of each score distribution. Unfortunately, the optimality of Fisher-like criteria relies on the assumption that the data of each class is Gaussian-distributed, which is clearly not satisfied in our case. 
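To make the regression in Eq. (2) concrete, here is a minimal NumPy sketch of the PDDM forward pass for a single feature pair. The weight shapes follow the text (Wu, Wv ∈ R^{d×d}, Wc ∈ R^{2d×d}, Ws ∈ R^{d×1}), but the random initialization and all function names are illustrative only; in the paper these parameters are learned end-to-end together with the embedding CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # embedding dimension used in the paper's retrieval experiments

# Illustrative, randomly initialized PDDM parameters (learned end-to-end in the paper).
Wu, bu = 0.01 * rng.standard_normal((d, d)), np.zeros(d)
Wv, bv = 0.01 * rng.standard_normal((d, d)), np.zeros(d)
Wc, bc = 0.01 * rng.standard_normal((2 * d, d)), np.zeros(d)
Ws, bs = 0.01 * rng.standard_normal((d, 1)), np.zeros(1)

def r(x):
    """l2-normalization r(x) = x / ||x||_2."""
    return x / (np.linalg.norm(x) + 1e-12)

def relu(x):
    return np.maximum(0.0, x)

def pddm_score(fi, fj):
    """Similarity score S_ij of Eq. (2) for one feature pair."""
    fi, fj = r(fi), r(fj)                  # features live on the unit hypersphere
    u = np.abs(fi - fj)                    # relative position: feature difference
    v = (fi + fj) / 2.0                    # absolute position: feature mean
    u_p = r(relu(u @ Wu + bu))             # unshared FC + ReLU + l2-norm branches
    v_p = r(relu(v @ Wv + bv))
    c = relu(np.concatenate([u_p, v_p]) @ Wc + bc)
    return (c @ Ws + bs).item()            # scalar similarity score

score = pddm_score(rng.standard_normal(d), rng.standard_normal(d))
```

Note that the score is symmetric in its two arguments by construction, since u takes the absolute difference and v the mean of the pair.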
Moreover, a high cost of O(m²) is entailed to compute the Fisher loss in a mini-batch of m samples, since all pairwise distances are needed.

Consequently, we propose a faster-to-compute loss function that approximately maximizes the margin between the positive and negative similarity distributions without making any assumption about the distributions' shapes or patterns. Specifically, we retrieve one hard quadruplet from a random batch during each forward pass; see the illustration in Figure 1(c). The quadruplet contains the most dissimilar positive sample pair in the batch, (î, ĵ) = argmin_{(i,j)∈P̂} Si,j, whose similarity score is the most likely to cross the "safe margin" towards the negative similarity distribution in this local range. Next, we build a similarity neighborhood graph that links the chosen positive pair with their respective negative neighbors in the batch, and choose the hard negatives as the other two quadruplet members: k̂ = argmax_{(î,k)∈N̂} Sî,k and l̂ = argmax_{(ĵ,l)∈N̂} Sĵ,l. Using this hard quadruplet (î, ĵ, k̂, l̂), we can locally approximate the inter-distribution margin as min(Sî,ĵ − Sî,k̂, Sî,ĵ − Sĵ,l̂) in a robust manner. This leads immediately to a double-header hinge loss Em that discriminates the target similarity distributions under the large-margin criterion:

min Em = Σ_{î,ĵ} (εî,ĵ + τî,ĵ),
s.t. ∀(î, ĵ): max(0, α + Sî,k̂ − Sî,ĵ) ≤ εî,ĵ,   max(0, α + Sĵ,l̂ − Sî,ĵ) ≤ τî,ĵ,
(î, ĵ) = argmin_{(i,j)∈P̂} Si,j,   k̂ = argmax_{(î,k)∈N̂} Sî,k,   l̂ = argmax_{(ĵ,l)∈N̂} Sĵ,l,   εî,ĵ ≥ 0,   τî,ĵ ≥ 0,   (3)

where εî,ĵ, τî,ĵ are slack variables and α is the enforced margin.

This discriminative loss has four main benefits: 1) The discrimination of the similarity distributions is assumption-free. 2) Hard samples are found simultaneously during the loss minimization. 3) The loss function incurs a low computational cost and encourages faster convergence. Specifically, the search cost for the hard positive pair (î, ĵ) is very small, since the positive pair set P̂ is usually much smaller than the negative pair set N̂ in an m-sized mini-batch, while the hard negative mining only incurs O(m) complexity. 4) Eqs. (2, 3) can be easily optimized through standard stochastic gradient descent to adjust the CNN parameters.

3.2 Joint metric and embedding optimization

Given the learned PDDM and the mined hard samples in a mini-batch, we can use them to simultaneously solve for a better, local similarity-aware feature embedding. For computational efficiency, we reuse the hard quadruplet's features from the metric optimization of Eq. (3) in the same forward pass. What follows is to use the double-header hinge loss again, but this time to constrain the deep features, see Figure 2. The objective is to ensure that the Euclidean distance between hard negative features (Dî,k̂ or Dĵ,l̂) is larger than that between hard positive features Dî,ĵ by a large margin. Combining the embedding loss Ee and the metric loss Em (Eq. (3)) gives our final joint loss function:

min Em + λEe + γ‖W̃‖²,   where Ee = Σ_{î,ĵ} (oî,ĵ + ρî,ĵ),
s.t. ∀(î, ĵ): max(0, β + Dî,ĵ − Dî,k̂) ≤ oî,ĵ,   max(0, β + Dî,ĵ − Dĵ,l̂) ≤ ρî,ĵ,
Dî,ĵ = ‖f(xî) − f(xĵ)‖2,   oî,ĵ ≥ 0,   ρî,ĵ ≥ 0,   (4)

where W̃ denotes the CNN parameters for both the PDDM and the feature embedding, oî,ĵ, ρî,ĵ and β are the slack variables and enforced margin for Ee, and λ, γ are regularization parameters. Since all features are ℓ2-normalized (see Figure 2), we have β + Dî,ĵ − Dî,k̂ = β − 2f(xî)ᵀf(xĵ) + 2f(xî)ᵀf(xk̂), and can conveniently derive the gradients as in triplet-based methods [27, 33].

This joint objective provides effective supervision in two mutually informed domains, at the score level and at the feature level. Although the score-level supervision by Em alone is already capable of optimizing both our metric and the feature embedding, we will show the benefit of adding the feature-level supervision by Ee in the experiments. Note that we can still enforce the large-margin relations of the quadruplet features in Ee using the simple Euclidean metric, because the quadruplet features are selected by our PDDM, which is itself learned in the local Euclidean space.

Figure 3: Illustrative comparison of different feature embeddings. Pairwise similarities in classes 1-3 are effortlessly distinguishable in a heterogeneous feature space because there is always a relative safe margin between any two involved classes w.r.t. their class bounds. However, this is not the case for class 4. 
The contrastive [1], triplet [27, 33] and lifted structured [29] embeddings select hard samples\nby the Euclidean distance that is not adaptive to the local feature structure. They may thus select\ninappropriate hard samples and the negative pairs get misled towards the wrong gradient direction\n(red arrow). In contrast, our local similarity-aware embedding is correctly updated by the genuinely\nhard examples in class 4.\n\nFigure 3 compares our local similarity-aware feature embedding with existing works. Contrastive [1]\nembedding is trained on pairwise data {(xi, xj, yi,j)}, and tries to minimize the distance between the\npositive feature pair and penalize the distance between negative feature pair for being smaller than a\nmargin \u03b1. Triplet embedding [27, 33] samples the triplet data {(xa, xp, xn)} where xa is an anchor\npoint and xp, xn are from the same and different class, respectively. The objective is to separate\nthe intraclass distance between (f (xa), f (xp)) and interclass distance between (f (xa), f (xn)) by\nmargin \u03b1. While lifted structured embedding [29] considers all the positive feature pairs (e.g.,\n(f (xi), f (xj)) in Figure 3) and all their linked negative pairs (e.g., (f (xi), f (xk)), (f (xj), f (xl))\nand so on), and enforces a margin \u03b1 between positive and negative distances.\nThe common drawback of the above-mentioned embedding methods is that they sample pairwise\nor triplet (i.e., anchor) data randomly and rely on simplistic Euclidean metric. They are thus very\nlikely to update from inappropriate hard samples and push the negative pairs towards the already\nwell-separated embedding space (see the red arrow in Figure 3). While our method can use PDDM\nto \ufb01nd the genuinely hard feature quadruplet (f (x\u02c6i), f (x\u02c6j), f (x\u02c6k), f (x\u02c6l)), thus can update feature\nembedding in the correct direction. 
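For illustration, the hard-quadruplet mining that feeds Eq. (3) can be sketched in NumPy (all names, the toy data, and the margin value are illustrative; S stands for a mini-batch's symmetric PDDM score matrix). At the optimum, the slack-variable constraints of Eq. (3) reduce to the two hinge terms computed here:

```python
import numpy as np

def mine_hard_quadruplet(S, labels, alpha=0.5):
    """Pick one hard quadruplet (i, j, k, l) per batch, as in Eq. (3).

    S: (m, m) symmetric similarity-score matrix from PDDM.
    labels: (m,) integer class labels of the batch samples.
    Returns the quadruplet indices and the metric-loss value for it.
    """
    m = len(labels)
    same = labels[:, None] == labels[None, :]
    # Most dissimilar positive pair in the batch (i < j excludes self-pairs).
    positives = [(i, j) for i in range(m) for j in range(i + 1, m) if same[i, j]]
    i, j = min(positives, key=lambda p: S[p])
    # Hardest (most similar) negative neighbor of each chosen positive sample.
    k = max((q for q in range(m) if not same[i, q]), key=lambda q: S[i, q])
    l = max((q for q in range(m) if not same[j, q]), key=lambda q: S[j, q])
    # Double-header hinge: both negative scores should trail S[i, j] by alpha.
    loss = max(0.0, alpha + S[i, k] - S[i, j]) + max(0.0, alpha + S[j, l] - S[i, j])
    return (i, j, k, l), loss

labels = np.array([0, 0, 1, 1])
S = np.array([[1.0, 0.4, 0.6, 0.1],
              [0.4, 1.0, 0.2, 0.7],
              [0.6, 0.2, 1.0, 0.5],
              [0.1, 0.7, 0.5, 1.0]])
quad, loss = mine_hard_quadruplet(S, labels)  # -> (0, 1, 2, 3), loss 1.5
```

In the full method, the four selected samples are then fed through the shared-parameter CNNs, and both the metric loss Em and the embedding loss Ee of Eq. (4) are applied to them.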
Also, our method is more efficient than the lifted structured embedding [29], which requires computing dense pairwise distances within a mini-batch.

3.3 Implementation details

We use GoogLeNet [31] (feature dimension d = 128) and CaffeNet [16] (d = 4096) as our base network architectures for the retrieval and transfer learning tasks, respectively. They are initialized with their pretrained parameters on ImageNet classification. The fully connected layers of our PDDM unit are initialized with random weights and followed by dropout [30] with p = 0.5. For all experiments, we choose by grid search the mini-batch size m = 64, initial learning rate 1 × 10^-4, momentum 0.9, margin parameters α = 0.5, β = 1 in Eqs. (3, 4), and regularization parameters λ = 0.5, γ = 5 × 10^-4 (λ balances the metric loss Em against the embedding loss Ee). To find meaningful hard positives in our hard quadruplets, we ensure that every class in a mini-batch has at least 4 samples, and we always scale Sî,ĵ into the range [0, 1] by the similarity graph in the batch. The entire network is trained for a maximum of 400 epochs until convergence.

4 Results

Image retrieval. The task of image retrieval is a perfect testbed for our method, where both the learned PDDM and the feature embedding (under the Euclidean feature distance) can be used to find similar images for a query. Ideally, a good similarity metric should be query-adapted (i.e., position-dependent), and both the metric and the features should generalize. We test these properties of our method on two popular fine-grained datasets with complex feature distributions. We deliberately make the evaluation more challenging by preparing training and testing sets that are disjoint in terms of class labels. 
Specifically, we use the CUB-200-2011 [32] dataset with 200 bird classes and 11,788 images. We employ the first 100 classes (5,864 images) for training, and the remaining 100 classes (5,924 images) for testing. The other dataset used is CARS196 [15], with 196 car classes and 16,185 images. The first 98 classes (8,054 images) are used for training, and the other 98 classes (8,131 images) are retained for testing. 

Figure 4: Top: a comparison of the training convergence curves of our method with Euclidean- and PDDM-based hard quadruplet mining on the test sets of the CUB-200-2011 [32] and CARS196 [15] datasets. Bottom: top 8 images retrieved by PDDM (similarity score and feature distance shown underneath) and the corresponding feature embeddings (black dots) on CUB-200-2011.

Table 1: Recall@K (%) on the test sets of CUB-200-2011 [32] and CARS196 [15] datasets.

                   |          CUB-200-2011             |             CARS196
K                  |  1     2     4     8    16    32  |  1     2     4     8    16    32
Contrastive [1]    | 26.4  37.7  49.8  62.3  76.4  85.3 | 21.7  32.3  46.1  58.9  72.2  83.4
Triplet [27, 33]   | 36.1  48.6  59.3  70.0  80.2  88.4 | 39.1  50.4  63.3  74.5  84.1  89.8
LiftedStruct [29]  | 47.2  58.9  70.2  80.2  89.3  93.2 | 49.0  60.3  72.1  81.5  89.2  92.8
LMDM score         | 49.5  61.1  72.1  81.8  90.5  94.1 | 50.9  61.9  73.5  82.5  89.8  93.1
PDDM score         | 55.0  67.1  77.4  86.9  92.2  95.0 | 55.2  66.5  78.0  88.2  91.5  94.3
PDDM+Triplet       | 50.9  62.1  73.2  82.5  91.1  94.4 | 46.4  58.2  70.3  80.1  88.6  92.6
PDDM+Quadruplet    | 58.3  69.2  79.0  88.4  93.1  95.7 | 57.4  68.6  80.1  89.4  92.3  94.9

We use the standard Recall@K as the retrieval evaluation metric.\nFigure 4-(top) shows that the proposed PDDM leads to 2\u00d7 faster convergence in 200 epochs (28 hours\non a Titan X GPU) and lower converged loss than the regular Euclidean metric, when both are used\nto mine hard quadruplets for embedding learning. Note the two resulting approaches both incur lower\ncomputational costs than [29], with a near linear rather than quadratic [29] complexity in mini-batches.\nAs observed from the retrieval results and their feature distributions in Figure 4-(bottom), our PDDM\ncopes comfortably with large intraclass variations, and generates stable similarity scores for those\ndifferently scattered features positioned around a particular query. These results also demonstrate the\nsuccessful generalization of PDDM on a test set with disjoint class labels.\nTable 1 quanti\ufb01es the advantages of both of our similarity metric (PDDM) and similarity-aware feature\nembedding (dubbed \u2018PDDM+Quadruplet\u2019 for short). In the middle rows, we compare the results\nfrom using the metrics of Large Margin Deep Metric (LMDM) and our PDDM, both jointly trained\nwith our quadruplet embedding. The LMDM is implemented by deeply regressing the similarity\nscore from the feature difference only, without using the absolute feature position. Although it is also\noptimized under the large margin rule, it performs worse than our PDDM due to the lack of position\ninformation for metric adaptation. In the bottom rows, we test using the learned features under the\nEuclidean distance. We observed PDDM signi\ufb01cantly improves the performance of both triplet and\nquadruplet embeddings. In particular, our full \u2018PDDM+Quadruplet\u2019 method yields large gains (8%+\nRecall@K=1) over previous works [1, 27, 29, 33] all using the Euclidean distance for hard sample\nmining. 
Indeed, as visualized in Figure 4, our learned features are typically well-clustered, with sharp boundaries and large margins between many classes.

Table 2: The flat top-1 accuracy (%) of transfer learning on ImageNet-10K [5] and the flat top-5 accuracy (%) of zero-shot learning on ImageNet 2010.

Transfer learning on ImageNet-10K:
  [5]: 6.4   [26]: 16.7   [23]: 18.1   [18]: 19.2   [21]: 21.9   Ours: 28.4
Zero-shot learning on ImageNet 2010:
  ConSE [22]: 28.5   DeViSE [8]: 31.8   PST [24]: 34.0   [25]: 34.8   [21]: 35.7   AMP [10]: 41.0   Ours: 48.2

Discussion. We previously mentioned that our PDDM and feature embedding can be learned by optimizing the metric loss Em in Eq. (3) alone. Here we experimentally verify the necessity of the extra supervision from the embedding loss Ee in Eq. (4). Without it, the Recall@K=1 of image retrieval by our 'PDDM score' and 'PDDM+Quadruplet' methods drops by more than 3.4% and 6.5%, respectively. Another important parameter is the batch size m. When we set it smaller than 64, say 32, Recall@K=1 on CUB-200-2011 drops to 55.7%, and it degrades further with even smaller m, because hard quadruplets chosen from a small batch are barely informative for learning. When we use a large m=132, we obtain marginal gains but need many more epochs (beyond 400) to exploit enough training quadruplets.

Transfer learning. Given the good performance of our fully learned features, we now evaluate their generalization ability under the scenarios of transfer learning and zero-shot learning.
Transfer learning aims to transfer knowledge from the source classes to new ones. Existing methods have explored sharing part detectors [6] or attribute classifiers [17] across classes. Zero-shot learning is an extreme case of transfer learning, differing in that a new class comes with only a description rather than labeled training samples. The description can be in terms of attributes [17], the WordNet hierarchy [21, 25], a semantic class label graph [10, 24], or text data [8, 22]. These learning scenarios are also related to the open set scenario [2], where new classes grow continuously.

For transfer learning, we follow [21] to train our feature embeddings and a Nearest Class Mean (NCM) classifier [21] on the large-scale ImageNet 2010¹ dataset, which contains 1,000 classes and more than 1.2 million images. We then apply the NCM classifier to the larger ImageNet-10K [5] dataset with 10,000 classes, thus using no auxiliary knowledge such as parts or attributes. We use the standard flat top-1 accuracy as the classification evaluation metric. Table 2 shows that our features outperform state-of-the-art methods by a large margin, including the deep feature-based ones [18, 21]. We attribute this advantage to our end-to-end feature embedding learning and its large-margin nature, which directly translates to good generalization ability.

For zero-shot learning, we follow the standard settings in [21, 25] on ImageNet 2010: we learn our feature embeddings on 800 classes and test on the remaining 200 classes. For simplicity, we also use the ImageNet hierarchy to estimate the means of new test classes from the means of related training classes. The flat top-5 accuracy is used as the classification evaluation metric. As can be seen from Table 2, our features again achieve top results among many competing deep CNN-based methods.
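The appeal of the NCM classifier [21] in these settings is that adding a new class costs only one mean-vector computation over its features (or, for zero-shot, an estimate of that mean from related classes). A minimal Euclidean NCM sketch (our own illustration, without the learned metric of [21]):

```python
import numpy as np

class NearestClassMean:
    """Nearest Class Mean classifier: new classes are added at near-zero
    cost by computing one mean feature vector per class."""
    def __init__(self):
        self.means, self.classes = None, None

    def fit(self, features, labels):
        # One mean embedding per class, in sorted class order.
        self.classes = np.unique(labels)
        self.means = np.stack([features[labels == c].mean(axis=0)
                               for c in self.classes])
        return self

    def predict(self, features):
        # Assign each sample to the class whose mean is closest.
        dists = np.linalg.norm(features[:, None, :] - self.means[None], axis=2)
        return self.classes[np.argmin(dists, axis=1)]
```

For the zero-shot setting described above, `fit` would be fed estimated means for the unseen classes (e.g. averages of the means of their ImageNet-hierarchy neighbors) instead of means computed from labeled samples.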
Since our PDDM and local similarity-aware feature embedding are both learned with safe margins between classes, they are naturally resistant to class boundary confusion between known and unseen classes in this zero-shot task.

5 Conclusion

In this paper, we developed a method for learning local similarity-aware deep feature embeddings in an end-to-end manner. The proposed PDDM adaptively measures the local feature similarity in a heterogeneous space, which makes it valuable for high-quality online hard sample mining that can better guide the embedding learning. The double-header hinge loss on both the similarity metric and the feature embedding is optimized under the large-margin criterion. Experiments show the efficacy of our learned feature embedding in challenging image retrieval tasks, and point to its potential for generalizing to new classes in large and open set scenarios such as transfer learning and zero-shot learning. In future work, it would be interesting to study the generalization performance when using shared attributes or a visual-semantic embedding instead of the ImageNet hierarchy for zero-shot learning.

Acknowledgments. This work is supported by SenseTime Group Limited and the Hong Kong Innovation and Technology Support Programme.

¹See http://www.image-net.org/challenges/LSVRC/2010/index.

References
[1] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. TOG, 34(4):98:1-98:10, 2015.
[2] A. Bendale and T. Boult. Towards open world recognition. In CVPR, 2015.
[3] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, 2016.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us?
In ECCV, 2010.
[6] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594-611, 2006.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627-1645, 2010.
[8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[9] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In ICCV, 2007.
[10] Z. Fu, T. A. Xiang, E. Kodirov, and S. Gong. Zero-shot object recognition by semantic manifold distance. In CVPR, 2015.
[11] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2005.
[12] S. C. H. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In CVPR, 2006.
[13] C. Huang, Y. Li, C. C. Loy, and X. Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
[14] C. Huang, C. C. Loy, and X. Tang. Unsupervised learning of discriminative attributes and visual representations. In CVPR, 2016.
[15] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCVW, 2013.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453-465, 2014.
[18] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[19] L. Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579-2605, 2008.
[20] A. M. Martinez and A. C. Kak. PCA versus LDA. TPAMI, 23(2):228-233, 2001.
[21] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. TPAMI, 35(11):2624-2637, 2013.
[22] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[23] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, 2012.
[24] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In NIPS, 2013.
[25] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
[26] J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.
[27] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[28] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, and F. Moreno-Noguer. Fracking deep convolutional image descriptors. arXiv preprint arXiv:1412.6537v2, 2015.
[29] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929-1958, 2014.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie.
The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[33] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[34] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[35] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207-244, 2009.
[36] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2003.
[37] C. Xiong, D. Johnson, R. Xu, and J. J. Corso. Random forests for metric learning with implicit pairwise position dependence. In SIGKDD, 2012.