{"title": "Deep Supervised Discrete Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 2482, "page_last": 2491, "abstract": "With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within one stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithm. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.", "full_text": "Deep Supervised Discrete Hashing\n\nQi Li\n\nZhenan Sun\n\nRan He\n\nTieniu Tan\n\nCenter for Research on Intelligent Perception and Computing\n\nNational Laboratory of Pattern Recognition\n\nCAS Center for Excellence in Brain Science and Intelligence Technology\n\nInstitute of Automation, Chinese Academy of Sciences\n\n{qli,znsun,rhe,tnt}@nlpr.ia.ac.cn\n\nAbstract\n\nWith the rapid growth of image and video data on the web, hashing has been\nextensively studied for image or video search in recent years. Bene\ufb01ting from\nrecent advances in deep learning, deep hashing methods have achieved promising\nresults for image retrieval. However, there are some limitations of previous deep\nhashing methods (e.g., the semantic information is not fully exploited). 
In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within one stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.

1 Introduction

Hashing has attracted much attention in recent years because of the rapid growth of image and video data on the web. It is one of the most popular techniques for image or video search due to its low computational cost and storage efficiency. Generally speaking, hashing is used to encode high dimensional data into a set of binary codes while preserving the similarity of images or videos. Existing hashing methods can be roughly grouped into two categories: data independent methods and data dependent methods.

Data independent methods rely on random projections to construct hash functions. Locality Sensitive Hashing (LSH) [3] is one of the representative methods, which uses random linear projections to map nearby data into similar binary codes. LSH is widely used for large scale image retrieval. In order to generalize LSH to accommodate arbitrary kernel functions, Kernelized Locality Sensitive Hashing (KLSH) [7] was proposed to deal with high-dimensional kernelized data. Other variants of LSH have also been proposed in recent years, such as super-bit LSH [5] and non-metric LSH [14].
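The random-projection idea behind LSH can be sketched in a few lines; the following is a toy illustration with Gaussian hyperplanes, not the exact construction of the cited papers:

```python
import numpy as np

def lsh_hash(X, num_bits, seed=0):
    """Hash rows of X into num_bits-bit codes via random hyperplanes.

    A minimal sketch of the random-projection idea behind LSH; the
    constructions in the cited papers differ in detail.
    """
    rng = np.random.default_rng(seed)
    # One random Gaussian hyperplane per bit; the sign of the projection
    # onto each hyperplane gives one bit of the code.
    planes = rng.standard_normal((X.shape[1], num_bits))
    return (X @ planes >= 0).astype(np.int8)  # codes in {0, 1}

# Nearby points receive identical or nearly identical codes.
rng = np.random.default_rng(1)
x = rng.standard_normal(64)
codes = lsh_hash(np.stack([x, x + 0.01 * rng.standard_normal(64)]), num_bits=16)
hamming = int((codes[0] != codes[1]).sum())  # small for nearby inputs
```

Because nearby inputs agree on most bits, short binary signatures can stand in for expensive high-dimensional comparisons, which is what makes this family of methods attractive for large scale retrieval.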
However, data independent hashing methods have some limitations: they make no use of training data, their learning efficiency is low, and they require longer hash codes to attain high accuracy. Due to these limitations, recent hashing methods try to exploit various machine learning techniques to learn more effective hash functions from a given dataset.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Data dependent methods use training data to learn the hash functions. They can be further categorized into supervised and unsupervised methods. Unsupervised methods retrieve the neighbors under some kind of distance metric. Iterative Quantization (ITQ) [4] is one of the representative unsupervised hashing methods, in which the projection matrix is optimized by iterative projection and thresholding according to the given training samples. In order to utilize the semantic labels of data samples, supervised hashing methods have been proposed. Supervised Hashing with Kernels (KSH) [13] is a well-known method of this kind, which learns the hash codes by minimizing the Hamming distances between similar pairs and, at the same time, maximizing the Hamming distances between dissimilar pairs. Binary Reconstruction Embedding (BRE) [6] learns the hash functions by explicitly minimizing the reconstruction error between the original distances and the reconstructed distances in Hamming space. Order Preserving Hashing (OPH) [17] learns the hash codes by preserving the supervised ranking list information, which is calculated based on the semantic labels.
Supervised Discrete Hashing (SDH) [15] aims to directly optimize the binary hash codes using the discrete cyclic coordinate descent method.

Recently, deep learning based hashing methods have been proposed to simultaneously learn the image representation and the hash coding, and they have shown superior performance over traditional hashing methods. Convolutional Neural Network Hashing (CNNH) [20] is one of the early works to incorporate deep neural networks into hash coding; it consists of two stages to learn the image representations and the hash codes. One drawback of CNNH is that the learned image representation cannot give feedback for learning better hash codes. To overcome this shortcoming, Network In Network Hashing (NINH) [8] presents a triplet ranking loss to capture the relative similarities of images, so that image representation learning and hash coding can benefit each other within a one-stage framework. Deep Semantic Ranking Hashing (DSRH) [26] learns the hash functions by preserving semantic similarity between multi-label images. Other ranking-based deep hashing methods have also been proposed in recent years [18, 22]. Besides the triplet ranking based methods, some pairwise label based deep hashing methods have also been exploited [9, 27]. A novel and efficient training algorithm inspired by the alternating direction method of multipliers (ADMM) is proposed in [25] to train very deep neural networks for supervised hashing; the classification information is used to learn the hash codes. [25] relaxes the binary constraint to be continuous, then thresholds the obtained continuous variables to obtain binary codes.

Although deep learning based methods have achieved great progress in image retrieval, previous deep hashing methods still have some limitations (e.g., the semantic information is not fully exploited). Recent works try to divide the whole learning process into two streams under the multi-task learning framework [11, 21, 22].
The hash stream is used to learn the hash function, while the classification stream is utilized to mine the semantic information. Although the two stream framework can improve the retrieval performance, the classification stream is only employed to learn the image representations, which does not have a direct impact on the hash function. In this paper, we use a CNN to learn the image representation and the hash function simultaneously. The last layer of the CNN outputs the binary codes directly, based on the pairwise label information and the classification information.

The contributions of this work are summarized as follows. 1) The last layer of our method is constrained to output the binary codes directly. The binary codes are learned to preserve the similarity relationship and keep the labels consistent simultaneously. To the best of our knowledge, this is the first deep hashing method that uses both pairwise label information and classification information to learn the hash codes under one stream framework. 2) In order to reduce the quantization error, we keep the discrete nature of the hash codes during the optimization process. An alternating minimization method is proposed to optimize the objective function by using the discrete cyclic coordinate descent method. 3) Extensive experiments have shown that our method outperforms current state-of-the-art methods on benchmark datasets for image retrieval, which demonstrates the effectiveness of the proposed method.

2 Deep supervised discrete hashing

2.1 Problem definition

Given $N$ image samples $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{d \times N}$, hash coding learns a collection of $K$-bit binary codes $B \in \{-1,1\}^{K \times N}$, where the $i$-th column $b_i \in \{-1,1\}^{K}$ denotes the binary codes for the $i$-th sample $x_i$. The binary codes are generated by the hash function $h(\cdot)$, which can be rewritten as $[h_1(\cdot), ..., h_K(\cdot)]$.
For image sample $x_i$, its hash codes can be represented as $b_i = h(x_i) = [h_1(x_i), ..., h_K(x_i)]$. Generally speaking, hashing learns a hash function to project image samples to a set of binary codes.

2.2 Similarity measure

In supervised hashing, the label information is given as $Y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{c \times N}$, where $y_i \in \{0,1\}^{c}$ corresponds to the sample $x_i$ and $c$ is the number of categories. Note that one sample may belong to multiple categories. Given the semantic label information, the pairwise label information is derived as $S = \{s_{ij}\}$, $s_{ij} \in \{0,1\}$, where $s_{ij} = 1$ when $x_i$ and $x_j$ are semantically similar and $s_{ij} = 0$ when $x_i$ and $x_j$ are semantically dissimilar. For two binary codes $b_i$ and $b_j$, the relationship between their Hamming distance $dist_H(\cdot,\cdot)$ and their inner product $\langle \cdot,\cdot \rangle$ is $dist_H(b_i, b_j) = \frac{1}{2}\left(K - \langle b_i, b_j \rangle\right)$. If the inner product of two binary codes is small, their Hamming distance will be large, and vice versa. Therefore the inner product of different hash codes can be used to quantify their similarity.

Given the pairwise similarity relationship $S = \{s_{ij}\}$, the Maximum a Posteriori (MAP) estimation of the hash codes can be represented as:

$$p(B|S) \propto p(S|B)\,p(B) = \prod_{s_{ij} \in S} p(s_{ij}|B)\,p(B), \qquad (1)$$

where $p(S|B)$ denotes the likelihood function and $p(B)$ is the prior distribution. For each pair of images, $p(s_{ij}|B)$ is the conditional probability of $s_{ij}$ given their hash codes $B$, which is defined as follows:

$$p(s_{ij}|B) = \begin{cases} \sigma(\Phi_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\Phi_{ij}), & s_{ij} = 0 \end{cases} \qquad (2)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $\Phi_{ij} = \frac{1}{2}\langle b_i, b_j \rangle = \frac{1}{2} b_i^T b_j$. From Equation 2 we can see that the larger the inner product $\langle b_i, b_j \rangle$ is, the larger $p(1|b_i, b_j)$ will be, which implies that $b_i$ and $b_j$ should be classified as similar, and vice versa.
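The identity $dist_H(b_i, b_j) = \frac{1}{2}(K - \langle b_i, b_j \rangle)$ and the behaviour of the likelihood in Equation 2 can be checked numerically with toy codes (a minimal NumPy sketch):

```python
import numpy as np

# Toy +/-1 codes; K = 8 bits.
b_i = np.array([1, -1,  1, 1, -1, 1, -1, 1])
b_j = np.array([1, -1, -1, 1, -1, 1,  1, 1])
K = len(b_i)

# Hamming distance counted directly ...
dist_direct = int((b_i != b_j).sum())
# ... equals (K - <b_i, b_j>) / 2.
dist_from_inner = (K - int(b_i @ b_j)) / 2

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# Likelihood of "similar" under Equation 2 grows with the inner product:
p_similar = sigmoid(0.5 * (b_i @ b_j))     # two bits differ
p_identical = sigmoid(0.5 * (b_i @ b_i))   # identical codes score higher
```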
Therefore Equation 2 is a reasonable similarity measure for hash codes.

2.3 Loss function

In recent years, deep learning based methods have shown superior performance over traditional handcrafted features on object detection, image classification, image segmentation, etc. In this section, we take advantage of recent advances in CNNs to learn the hash function. In order to have a fair comparison with other deep hashing methods, we choose the CNN-F network architecture [2] as the basic component of our algorithm; this architecture is widely used to learn the hash function in recent works [9, 18]. Specifically, there are two separate CNNs that share the same weights, and the pairwise samples are used as the input for these two CNNs. The CNN model consists of 5 convolutional layers and 2 fully connected layers, and the number of neurons in the last fully connected layer is equal to the number of hash bits.

Considering the similarity measure, the following loss function is used to learn the hash codes:

$$J = -\log p(S|B) = -\sum_{s_{ij} \in S} \log p(s_{ij}|B) = -\sum_{s_{ij} \in S} \left( s_{ij}\Phi_{ij} - \log\left(1 + e^{\Phi_{ij}}\right) \right). \qquad (3)$$

Equation 3 is the negative log likelihood function, which makes the Hamming distance of two similar points as small as possible and, at the same time, makes the Hamming distance of two dissimilar points as large as possible.

Although pairwise label information is used to learn the hash function in Equation 3, the label information is not fully exploited. Most of the previous works make use of the label information under a two stream multi-task learning framework [21, 22]. The classification stream is used to measure the classification error, while the hash stream is employed to learn the hash function.
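As a minimal sketch on toy codes, the loss of Equation 3 can be computed as follows, using the numerically stable identity log(1 + e^x) = max(x, 0) + log(1 + e^(-|x|)):

```python
import numpy as np

def pairwise_loss(B, S):
    """Negative log likelihood of Equation 3 for codes B (K x N).

    S is a list of (i, j, s_ij) pairs with s_ij in {0, 1}.
    """
    J = 0.0
    for i, j, s in S:
        phi = 0.5 * float(B[:, i] @ B[:, j])
        # Stable evaluation of -(s * phi - log(1 + e^phi)).
        J += -(s * phi) + max(phi, 0.0) + np.log1p(np.exp(-abs(phi)))
    return J

# Codes that agree with the labels give a lower loss than codes that do not:
B_good = np.array([[1, 1, -1], [1, 1, 1], [-1, -1, 1]])   # columns 0,1 similar
B_bad  = np.array([[1, -1, 1], [1, -1, 1], [-1, 1, -1]])  # columns 0,1 dissimilar
S = [(0, 1, 1), (0, 2, 0)]
```

Here `B_good` places the similar pair (0, 1) close in Hamming space and the dissimilar pair (0, 2) far apart, so its loss is smaller than that of `B_bad`, which does the opposite.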
One basic assumption of our algorithm is that the learned binary codes should be ideal for classification. In order to take advantage of the label information directly, we expect the learned binary codes to be optimal for the jointly learned linear classifier.

We use a simple linear classifier to model the relationship between the learned binary codes and the label information:

$$Y = W^T B, \qquad (4)$$

where $W = [w_1, w_2, ..., w_C]$ is the classifier weight and $Y = [y_1, y_2, ..., y_N]$ is the ground-truth label matrix. The loss function can be calculated as:

$$Q = L\left(Y, W^T B\right) + \lambda \|W\|_F^2 = \sum_{i=1}^{N} L\left(y_i, W^T b_i\right) + \lambda \|W\|_F^2, \qquad (5)$$

where $L(\cdot)$ is the loss function, $\lambda$ is the regularization parameter, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Combining Equation 5 and Equation 3, we have the following formulation:

$$F = J + \mu Q = -\sum_{s_{ij} \in S} \left( s_{ij}\Phi_{ij} - \log\left(1 + e^{\Phi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} L\left(y_i, W^T b_i\right) + \nu \|W\|_F^2, \qquad (6)$$

where $\mu$ is the trade-off parameter and $\nu = \lambda\mu$. Suppose that we choose the $l_2$ loss for the linear classifier; Equation 6 is then rewritten as follows:

$$F = -\sum_{s_{ij} \in S} \left( s_{ij}\Phi_{ij} - \log\left(1 + e^{\Phi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \nu \|W\|_F^2, \qquad (7)$$

where $\|\cdot\|_2$ is the $l_2$ norm of a vector. The hypothesis behind Equation 7 is that the learned binary codes should make the pairwise label likelihood as large as possible and, at the same time, should be optimal for the jointly learned linear classifier.

2.4 Optimization

The minimization of Equation 7 is a discrete optimization problem, which is difficult to optimize directly.
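Concretely, the combined objective of Equation 7 can be evaluated on toy data as follows (a sketch only; the `mu` and `nu` defaults here are illustrative placeholders, not tuned values):

```python
import numpy as np

def objective(B, W, Y, pairs, mu=1.0, nu=0.1):
    """Equation 7: pairwise NLL + mu * l2 classification loss + nu * ||W||_F^2.

    B: K x N codes in {-1, +1}; W: K x C classifier weights; Y: C x N labels;
    pairs: (i, j, s_ij) triples.  mu and nu are illustrative defaults.
    """
    J = 0.0
    for i, j, s in pairs:
        phi = 0.5 * float(B[:, i] @ B[:, j])
        # Stable form of -(s * phi - log(1 + e^phi)).
        J += -(s * phi) + max(phi, 0.0) + np.log1p(np.exp(-abs(phi)))
    cls = np.sum((Y - W.T @ B) ** 2)   # sum_i ||y_i - W^T b_i||_2^2
    return J + mu * cls + nu * np.sum(W ** 2)

# Two 2-bit codes from different classes, with an untrained classifier W = 0.
B = np.array([[1, -1],
              [1, -1]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
val = objective(B, np.zeros((2, 2)), Y, pairs=[(0, 1, 0)])
```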
There are several ways to solve this problem. (1) In the training stage, a sigmoid or tanh activation function is used in place of the ReLU after the last fully connected layer, and the continuous outputs are used as a relaxation of the hash codes; in the testing stage, the hash codes are obtained by applying a thresholding function to the continuous outputs. One limitation of this approach is that the convergence of the algorithm is slow; besides, there will be a large quantization error. (2) The sign function is applied directly to the outputs of the last fully connected layer, which constrains the outputs to be strictly binary. However, the sign function is non-differentiable, which makes it difficult to back-propagate the gradient of the loss function.

Because of the discrepancy between the Euclidean space and the Hamming space, totally ignoring the binary constraints would result in suboptimal hash codes. We emphasize that it is essential to keep the discrete nature of the binary codes. Note that in our formulation we constrain the outputs of the last layer to be binary codes directly, thus Equation 7 is difficult to optimize directly. Similar to [9, 18, 22], we solve this problem by introducing an auxiliary variable. Then we approximate Equation 7 as:

$$F = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \nu \|W\|_F^2, \quad \text{s.t.} \; b_i = \mathrm{sgn}(h_i), \; h_i \in \mathbb{R}^{K \times 1}, \; (i = 1, ..., N), \qquad (8)$$

where $\Psi_{ij} = \frac{1}{2} h_i^T h_j$.
$h_i$ $(i = 1, ..., N)$ can be seen as the output of the last fully connected layer, which is represented as:

$$h_i = M^T \Theta(x_i; \theta) + n, \qquad (9)$$

where $\theta$ denotes the parameters of the previous layers before the last fully connected layer, $M \in \mathbb{R}^{4096 \times K}$ represents the weight matrix, and $n \in \mathbb{R}^{K \times 1}$ is the bias term.

According to the Lagrange multipliers method, Equation 8 can be reformulated as:

$$F = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \nu \|W\|_F^2 + \eta \sum_{i=1}^{N} \left\| b_i - \mathrm{sgn}(h_i) \right\|_2^2, \quad \text{s.t.} \; b_i \in \{-1,1\}^{K}, \; (i = 1, ..., N), \qquad (10)$$

where $\eta$ is the Lagrange multiplier. Equation 10 can be further relaxed as:

$$F = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \nu \|W\|_F^2 + \eta \sum_{i=1}^{N} \left\| b_i - h_i \right\|_2^2, \quad \text{s.t.} \; b_i \in \{-1,1\}^{K}, \; (i = 1, ..., N). \qquad (11)$$

The last term measures the constraint violation caused by the outputs of the last fully connected layer. If the parameter $\eta$ is set sufficiently large, the constraint violation is penalized severely.
Therefore the outputs of the last fully connected layer are forced closer to the binary codes, which are employed for classification directly.

The benefit of introducing an auxiliary variable is that we can decompose Equation 11 into two sub-problems, which can be solved iteratively by the alternating minimization method. First, when fixing $b_i$ and $W$, we have:

$$\frac{\partial F}{\partial h_i} = -\frac{1}{2} \sum_{j: s_{ij} \in S} \left( s_{ij} - \frac{e^{\Psi_{ij}}}{1 + e^{\Psi_{ij}}} \right) h_j - \frac{1}{2} \sum_{j: s_{ji} \in S} \left( s_{ji} - \frac{e^{\Psi_{ji}}}{1 + e^{\Psi_{ji}}} \right) h_j - 2\eta \left( b_i - h_i \right). \qquad (12)$$

Then we update the parameters $M$, $n$ and $\Theta$ as follows:

$$\frac{\partial F}{\partial M} = \Theta(x_i; \theta) \left( \frac{\partial F}{\partial h_i} \right)^T, \quad \frac{\partial F}{\partial n} = \frac{\partial F}{\partial h_i}, \quad \frac{\partial F}{\partial \Theta(x_i; \theta)} = M \frac{\partial F}{\partial h_i}. \qquad (13)$$

The gradient propagates to the previous layers via the Back Propagation (BP) algorithm.

Second, when fixing $M$, $n$, $\Theta$ and $b_i$, we solve for $W$:

$$F = \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \nu \|W\|_F^2. \qquad (14)$$

Equation 14 is a least squares problem, which has a closed form solution:

$$W = \left( B B^T + \frac{\nu}{\mu} I \right)^{-1} B Y^T, \qquad (15)$$

where $B = \{b_i\}_{i=1}^{N} \in \{-1,1\}^{K \times N}$ and $Y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{C \times N}$.

Finally, when fixing $M$, $n$, $\Theta$ and $W$, Equation 11 becomes:

$$F = \mu \sum_{i=1}^{N} \left\| y_i - W^T b_i \right\|_2^2 + \eta \sum_{i=1}^{N} \left\| b_i - h_i \right\|_2^2, \quad \text{s.t.} \; b_i \in \{-1,1\}^{K}, \; (i = 1, ..., N). \qquad (16)$$

In this paper, we use the discrete cyclic coordinate descent method to iteratively solve $B$ row by row:
$$\min_{B} \; \left\| W^T B \right\|_F^2 - 2\,\mathrm{Tr}\left( B^T P \right), \quad \text{s.t.} \; B \in \{-1,1\}^{K \times N}, \qquad (17)$$

where $P = W Y + \frac{\eta}{\mu} H$ and $H = [h_1, ..., h_N]$. Let $x^T$ be the $k$th row of $B$ $(k = 1, ..., K)$ and $B_1$ the matrix of $B$ excluding $x^T$; let $p^T$ be the $k$th row of $P$ and $P_1$ the matrix of $P$ excluding $p^T$; let $w^T$ be the $k$th row of $W$ and $W_1$ the matrix of $W$ excluding $w^T$. Then we can derive:

$$x = \mathrm{sgn}\left( p - B_1^T W_1 w \right). \qquad (18)$$

It is easy to see that each bit of the hash codes is computed based on the pre-learned $K - 1$ bits $B_1$. We iteratively update each bit until the algorithm converges.

3 Experiments

3.1 Experimental settings

We conduct extensive experiments on two public benchmark datasets: CIFAR-10 and NUS-WIDE. CIFAR-10 is a dataset containing 60,000 color images in 10 classes; each class contains 6,000 images with a resolution of 32x32. Different from CIFAR-10, NUS-WIDE is a public multi-label image dataset. There are 269,648 color images in total with 5,018 unique tags, and each image is annotated with one or multiple class labels from the 5,018 tags. Similar to [8, 12, 20, 24], we use a subset of 195,834 images which are associated with the 21 most frequent concepts, each of which has at least 5,000 color images in this dataset.

We follow the previous experimental settings in [8, 9, 18]. In CIFAR-10, we randomly select 100 images per class (1,000 images in total) as the test query set, and 500 images per class (5,000 images in
In CIFAR-10, 1,000 images per class (10,000 images in total) are selected as the test query\nset, the remaining 50,000 images are used as the training set. In NUS-WIDE, 100 images per class\n(2,100 images in total) are randomly sampled as the test query images, the remaining images (193,734\nimages in total) are used as the training set.\nAs for the comparison methods, we roughly divide them into two groups: traditional hashing methods\nand deep hashing methods. The compared traditional hashing methods consist of unsupervised\nand supervised methods. Unsupervised hashing methods include SH [19], ITQ [4]. Supervised\nhashing methods include SPLH [16], KSH [13], FastH [10], LFH [23], and SDH [15]. Both the\nhand-crafted features and the features extracted by CNN-F network architecture are used as the input\nfor the traditional hashing methods. Similar to previous works, the handcrafted features include a\n512-dimensional GIST descriptor to represent images of CIFAR-10 dataset, and a 1134-dimensional\nfeature vector to represent images of NUS-WIDE dataset. The deep hashing methods include\nDQN [1], DHN [27], CNNH [20], NINH [8], DSRH [26], DSCH [24], DRCSH [24], DPSH [9],\nDTSH [18] and VDSH [25]. Note that DPSH, DTSH and DSDH are based on the CNN-F network\narchitecture, while DQN, DHN, DSRH are based on AlexNet architecture. Both the CNN-F network\narchitecture and AlexNet architecture consist of \ufb01ve convolutional layers and two fully connected\nlayers. In order to have a fair comparison, most of the results are directly reported from previous\nworks. Following [25], the pre-trained CNN-F model is used to extract CNN features on CIFAR-10,\nwhile a 500 dimensional bag-of-words feature vector is used to represent each image on NUS-WIDE\nfor VDSH. Then we re-run the source code provided by the authors to obtain the retrieval performance.\nThe parameters of our algorithm are set based on the standard cross-validation procedure. 
$\mu$, $\nu$ and $\eta$ in Equation 11 are set to 1, 0.1 and 55, respectively.

Similar to [8], we adopt four widely used evaluation metrics to evaluate the image retrieval quality: Mean Average Precision (MAP) for different numbers of bits, precision curves within Hamming distance 2, precision curves for different numbers of top returned samples, and precision-recall curves. When computing MAP for the NUS-WIDE dataset under the first experimental setting, we only consider the top 5,000 returned neighbors; under the second experimental setting, we consider the top 50,000 returned neighbors.

3.2 Empirical analysis

Figure 1: The results of DSDH-A, DSDH-B, DSDH-C and DSDH on the CIFAR-10 dataset: (a) precision curves within Hamming radius 2; (b) precision curves with respect to different numbers of top returned images; (c) precision-recall curves of Hamming ranking with 48 bits.

In order to verify the effectiveness of our method, several variants of our method (DSDH) are also proposed. First, we only consider the pairwise label information while neglecting the linear classification information in Equation 7, which is named DSDH-A (similar to [9]). Then we design a two-stream deep hashing algorithm to learn the hash codes: one stream is designed based on the pairwise label information in Equation 3, and the other stream is constructed based on the classification information. The two streams share the same image representations except for the last

Table 1: MAP for different methods under the first experimental setting. The MAP for NUS-WIDE dataset is calculated based on the top 5,000 returned neighbors.
DPSH∗ denotes re-running the code provided by the authors of DPSH.

CIFAR-10:

| Method | 12 bits | 24 bits | 32 bits | 48 bits |
|--------|---------|---------|---------|---------|
| Ours   | 0.740   | 0.786   | 0.801   | 0.820   |
| DQN    | 0.554   | 0.558   | 0.564   | 0.580   |
| DPSH   | 0.713   | 0.727   | 0.744   | 0.757   |
| DHN    | 0.555   | 0.594   | 0.603   | 0.621   |
| DTSH   | 0.710   | 0.750   | 0.765   | 0.774   |
| NINH   | 0.552   | 0.566   | 0.558   | 0.581   |
| CNNH   | 0.439   | 0.511   | 0.509   | 0.522   |
| FastH  | 0.305   | 0.349   | 0.369   | 0.384   |
| SDH    | 0.285   | 0.329   | 0.341   | 0.356   |
| KSH    | 0.303   | 0.337   | 0.346   | 0.356   |
| LFH    | 0.176   | 0.231   | 0.211   | 0.253   |
| SPLH   | 0.171   | 0.173   | 0.178   | 0.184   |
| ITQ    | 0.162   | 0.169   | 0.172   | 0.175   |
| SH     | 0.127   | 0.128   | 0.126   | 0.129   |

NUS-WIDE:

| Method | 12 bits | 24 bits | 32 bits | 48 bits |
|--------|---------|---------|---------|---------|
| Ours   | 0.776   | 0.808   | 0.820   | 0.829   |
| DQN    | 0.768   | 0.776   | 0.783   | 0.792   |
| DPSH∗  | 0.752   | 0.790   | 0.794   | 0.812   |
| DHN    | 0.708   | 0.735   | 0.748   | 0.758   |
| DTSH   | 0.773   | 0.808   | 0.812   | 0.824   |
| NINH   | 0.674   | 0.697   | 0.713   | 0.715   |
| CNNH   | 0.611   | 0.618   | 0.625   | 0.608   |
| FastH  | 0.621   | 0.650   | 0.665   | 0.687   |
| SDH    | 0.568   | 0.600   | 0.608   | 0.637   |
| KSH    | 0.556   | 0.572   | 0.581   | 0.588   |
| LFH    | 0.571   | 0.568   | 0.568   | 0.585   |
| SPLH   | 0.568   | 0.589   | 0.597   | 0.601   |
| ITQ    | 0.452   | 0.468   | 0.472   | 0.477   |
| SH     | 0.454   | 0.406   | 0.405   | 0.400   |

fully connected layer. We denote this method as DSDH-B. Besides, we also design another approach that directly applies the sign function after the outputs of the last fully connected layer in Equation 7, which is denoted as DSDH-C. The loss function of DSDH-C can be represented as:

$$F = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \mu \sum_{i=1}^{N} \left\| y_i - W^T h_i \right\|_2^2 + \nu \|W\|_F^2 + \eta \sum_{i=1}^{N} \left\| b_i - \mathrm{sgn}(h_i) \right\|_2^2, \quad \text{s.t.} \; h_i \in \mathbb{R}^{K \times 1}, \; (i = 1, ..., N). \qquad (19)$$

Then we use the alternating minimization method to optimize DSDH-C. The results of the different methods on CIFAR-10 under the first experimental setting are shown in Figure 1.
From Figure 1 we can see that: (1) The performance of DSDH-C is better than that of DSDH-A, and DSDH-B is better than DSDH-A in terms of precision within Hamming radius 2 and precision-recall curves. More information is exploited in DSDH-C than in DSDH-A, which demonstrates that the classification information is helpful for learning the hash codes. (2) The improvement of DSDH-C over DSDH-A is marginal. The reason is that the classification information in DSDH-C is only used to learn the image representations, which is not fully exploited; moreover, due to violating the discrete nature of the hash codes, DSDH-C suffers a large quantization loss. Note that our method further beats DSDH-B and DSDH-C by a large margin.

3.3 Results under the first experimental setting

The MAP results of all methods on CIFAR-10 and NUS-WIDE under the first experimental setting are listed in Table 1. From Table 1 we can see that the proposed method substantially outperforms the traditional hashing methods on the CIFAR-10 dataset. The MAP result of our method is more than twice as high as those of SDH, FastH and ITQ. Besides, most of the deep hashing methods perform better than the traditional hashing methods. In particular, DTSH achieves the best performance among all methods except DSDH on the CIFAR-10 dataset. Compared with DTSH, our method further improves the performance by 3 to 7 percent. These results verify that learning the hash function and the classifier within one stream framework can boost the retrieval performance.

The gap between the deep hashing methods and the traditional hashing methods is not as large on the NUS-WIDE dataset as on CIFAR-10. For example, the average MAP result of SDH is 0.603, while the average MAP result of DTSH is 0.804. The proposed method is slightly superior to DTSH in terms of the MAP results on the NUS-WIDE dataset.
The main reasons are that there exist more categories in NUS-WIDE than in CIFAR-10, and each image contains multiple labels. Compared with CIFAR-10, there are only 500 images per class for training, which may not be enough for DSDH to learn the multi-label classifier. Thus the second term in Equation 7 plays a limited role in learning a better hash function. In Section 3.4, we will show that our method will achieve

Table 2: MAP for different methods under the second experimental setting. The MAP for the NUS-WIDE dataset is calculated based on the top 50,000 returned neighbors. DPSH∗ denotes re-running the code provided by the authors of DPSH.

CIFAR-10:

| Method | 16 bits | 24 bits | 32 bits | 48 bits |
|--------|---------|---------|---------|---------|
| Ours   | 0.935   | 0.940   | 0.939   | 0.939   |
| DTSH   | 0.915   | 0.923   | 0.925   | 0.926   |
| DPSH   | 0.763   | 0.781   | 0.795   | 0.807   |
| VDSH   | 0.845   | 0.848   | 0.844   | 0.845   |
| DRSCH  | 0.615   | 0.622   | 0.629   | 0.631   |
| DSCH   | 0.609   | 0.613   | 0.617   | 0.620   |
| DSRH   | 0.608   | 0.611   | 0.617   | 0.618   |
| DPSH∗  | 0.903   | 0.885   | 0.915   | 0.911   |

NUS-WIDE:

| Method | 16 bits | 24 bits | 32 bits | 48 bits |
|--------|---------|---------|---------|---------|
| Ours   | 0.815   | 0.814   | 0.820   | 0.821   |
| DTSH   | 0.756   | 0.776   | 0.785   | 0.799   |
| DPSH   | 0.715   | 0.722   | 0.736   | 0.741   |
| VDSH   | 0.545   | 0.564   | 0.557   | 0.570   |
| DRSCH  | 0.618   | 0.622   | 0.623   | 0.628   |
| DSCH   | 0.592   | 0.597   | 0.611   | 0.609   |
| DSRH   | 0.609   | 0.618   | 0.621   | 0.631   |
| DPSH∗  | N/A     | N/A     | N/A     | N/A     |

Table 3: MAP for different methods under the first experimental setting.
The MAP for the NUS-WIDE dataset is calculated based on the top 5,000 returned neighbors.

CIFAR-10:

| Method    | 12 bits | 24 bits | 32 bits | 48 bits |
|-----------|---------|---------|---------|---------|
| Ours      | 0.740   | 0.786   | 0.801   | 0.820   |
| FastH+CNN | 0.553   | 0.607   | 0.619   | 0.636   |
| SDH+CNN   | 0.478   | 0.557   | 0.584   | 0.592   |
| KSH+CNN   | 0.488   | 0.539   | 0.548   | 0.563   |
| LFH+CNN   | 0.208   | 0.242   | 0.266   | 0.339   |
| SPLH+CNN  | 0.299   | 0.330   | 0.335   | 0.330   |
| ITQ+CNN   | 0.237   | 0.246   | 0.255   | 0.261   |
| SH+CNN    | 0.183   | 0.164   | 0.161   | 0.161   |

NUS-WIDE:

| Method    | 12 bits | 24 bits | 32 bits | 48 bits |
|-----------|---------|---------|---------|---------|
| Ours      | 0.776   | 0.808   | 0.820   | 0.829   |
| FastH+CNN | 0.779   | 0.807   | 0.816   | 0.825   |
| SDH+CNN   | 0.780   | 0.804   | 0.815   | 0.824   |
| KSH+CNN   | 0.768   | 0.786   | 0.790   | 0.799   |
| LFH+CNN   | 0.695   | 0.734   | 0.739   | 0.759   |
| SPLH+CNN  | 0.753   | 0.775   | 0.783   | 0.786   |
| ITQ+CNN   | 0.719   | 0.739   | 0.747   | 0.756   |
| SH+CNN    | 0.621   | 0.616   | 0.615   | 0.612   |

a better performance than other deep hashing methods with more training images per class for the multi-label dataset.

3.4 Results under the second experimental setting

Deep hashing methods usually need many training images to learn the hash function. In this section, we compare with other deep hashing methods under the second experimental setting, which contains more training images. Table 2 lists the MAP results for the different methods under the second experimental setting. As shown in Table 2, with more training images, most of the deep hashing methods perform better than in Section 3.3. For the CIFAR-10 dataset, the average MAP result of DRSCH is 0.624, and the average MAP results of DPSH, DTSH and VDSH are 0.787, 0.922 and 0.846, respectively. The average MAP result of our method is 0.938 on the CIFAR-10 dataset. DTSH, DPSH and VDSH have a significant advantage over the other deep hashing methods. Our method further outperforms DTSH, DPSH and VDSH by about 2 to 3 percent. For the NUS-WIDE dataset, our method still achieves the best performance in terms of MAP.
The performance of VDSH on the NUS-WIDE dataset drops severely. The possible reason is that VDSH uses the provided bag-of-words features instead of the learned features.

3.5 Comparison with traditional hashing methods using deep learned features

In order to have a fair comparison, we also compare with traditional hashing methods using deep learned features extracted by the CNN-F network under the first experimental setting. The MAP results of different methods are listed in Table 3. As shown in Table 3, most of the traditional hashing methods obtain a better retrieval performance using deep learned features. The average MAP results of FastH+CNN and SDH+CNN on the CIFAR-10 dataset are 0.604 and 0.553, respectively, and the average MAP result of our method on the CIFAR-10 dataset is 0.787, which outperforms the traditional hashing methods with deep learned features. Besides, the proposed algorithm achieves a comparable performance with the best traditional hashing methods on the NUS-WIDE dataset under the first experimental setting.

4 Conclusion

In this paper, we have proposed a novel deep supervised discrete hashing algorithm. We constrain the outputs of the last layer to be binary codes directly. Both the pairwise label information and the classification information are used for learning the hash codes within one stream framework. Because of the discrete nature of the hash codes, we derive an alternating minimization method to optimize the loss function. Extensive experiments have shown that our method outperforms state-of-the-art methods on benchmark image retrieval datasets.

5 Acknowledgements

This work was partially supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001000) and the Natural Science Foundation of China (Grant No. 61622310).

References

[1] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen.
Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.

[2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.

[3] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.

[4] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE TPAMI, 35(12):2916–2929, 2013.

[5] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian. Super-bit locality-sensitive hashing. In NIPS, pages 108–116, 2012.

[6] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.

[7] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.

[8] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.

[9] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.

[10] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, pages 1963–1970, 2014.

[11] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In CVPRW, pages 27–35, 2015.

[12] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.

[13] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.

[14] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In AAAI, pages 539–544, 2010.

[15] F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.

[16] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, pages 1127–1134, 2010.

[17] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In ACM MM, pages 133–142, 2013.

[18] X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In ACCV, pages 70–84, 2016.

[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.

[20] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, pages 2156–2162, 2014.

[21] H. F. Yang, K. Lin, and C. S. Chen. Supervised learning of semantics-preserving hash via deep convolutional neural networks. IEEE TPAMI, (99):1–1, 2017.

[22] T. Yao, F. Long, T. Mei, and Y. Rui. Deep semantic-preserving and ranking-based hashing for image retrieval. In IJCAI, pages 3931–3937, 2016.

[23] P. Zhang, W. Zhang, W.-J. Li, and M. Guo. Supervised hashing with latent factor models. In SIGIR, pages 173–182, 2014.

[24] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE TIP, 24(12):4766–4779, 2015.

[25] Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In CVPR, pages 1487–1495, 2016.

[26] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.

[27] H. Zhu, M. Long, J. Wang, and Y. Cao. Deep hashing network for efficient similarity retrieval. In AAAI, pages 2415–2421, 2016.