{"title": "Isotropic Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 1646, "page_last": 1654, "abstract": "Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances.", "full_text": "Isotropic Hashing\n\nShanghai Key Laboratory of Scalable Computing and Systems\n\nWeihao Kong, Wu-Jun Li\n\nDepartment of Computer Science and Engineering, Shanghai Jiao Tong University, China\n\n{kongweihao,liwujun}@cs.sjtu.edu.cn\n\nAbstract\n\nMost existing hashing methods adopt some projection functions to project the o-\nriginal data into several dimensions of real values, and then each of these projected\ndimensions is quantized into one bit (zero or one) by thresholding. 
Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances.\n\n1 Introduction\n\nDue to its fast query speed and low storage cost, hashing [1, 5] has been successfully used for approximate nearest neighbor (ANN) search [28]. The basic idea of hashing is to learn similarity-preserving binary codes for data representation. More specifically, each data point will be hashed into a compact binary string, and similar points in the original feature space should be hashed into close points in the hashcode space. Compared with the original feature representation, hashing has two advantages. One is the reduced storage cost, and the other is the constant or sub-linear query time complexity [28]. 
These two advantages make hashing a promising choice for efficient ANN search in massive data sets [1, 5, 6, 9, 10, 14, 15, 17, 20, 21, 23, 26, 29, 30, 31, 32, 33, 34].\n\nMost existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Locality-sensitive hashing (LSH) [1, 5] and its extensions [4, 18, 19, 22, 25] use simple random projections for hash functions. These methods are called data-independent methods because the projection functions are independent of training data. Another class of methods are called data-dependent methods, whose projection functions are learned from training data. Representative data-dependent methods include spectral hashing (SH) [31], anchor graph hashing (AGH) [21], sequential projection learning (SPL) [29], principal component analysis [13] based hashing (PCAH) [7], and iterative quantization (ITQ) [7, 8]. SH learns the hashing functions based on spectral graph partitioning. AGH adopts anchor graphs to speed up the computation of graph Laplacian eigenvectors, based on which the Nyström method is used to compute projection functions. SPL learns the projection functions in a sequential way in which each function is designed to correct the errors caused by the previous one. PCAH adopts principal component analysis (PCA) to learn the projection functions. ITQ tries to learn an orthogonal rotation matrix to refine the initial projection matrix learned by PCA so that the quantization error of mapping the data to the vertices of the binary hypercube is minimized. Compared to the data-dependent methods, the data-independent methods need longer codes to achieve satisfactory performance [7].\n\nFor most existing projection functions such as those mentioned above, the variances of different projected dimensions are different. 
Many researchers [7, 12, 21] have argued that using the same number of bits for different projected dimensions with unequal variances is unreasonable because larger-variance dimensions will carry more information. Some methods [7, 12] apply an orthogonal transformation to the PCA-projected data with the expectation of balancing the variances of different PCA dimensions, and achieve better performance than the original PCA based hashing. However, to the best of our knowledge, there exist no methods which can guarantee to learn a projection with equal variances for different dimensions. Hence, the viewpoint that using the same number of bits for different projected dimensions is unreasonable has still not been verified by either theory or experiment.\n\nIn this paper, a novel hashing method, called isotropic hashing (IsoHash), is proposed to learn a projection function which can produce projected dimensions with isotropic variances (equal variances). To the best of our knowledge, this is the first work which can learn projections with isotropic variances for hashing. Experimental results on real data sets show that IsoHash can outperform its counterpart with anisotropic variances for different dimensions, which verifies the intuitive viewpoint that projections with isotropic variances will be better than those with anisotropic variances. Furthermore, the performance of IsoHash is also comparable, if not superior, to the state-of-the-art methods.\n\n2 Isotropic Hashing\n\n2.1 Problem Statement\n\nAssume we are given n data points {x1, x2, ..., xn} with xi ∈ R^d, which form the columns of the data matrix X ∈ R^{d×n}. Without loss of generality, in this paper the data are assumed to be zero centered, which means sum_{i=1}^n xi = 0. The basic idea of hashing is to map each point xi into a binary string yi ∈ {0, 1}^m with m denoting the code size. 
Furthermore, close points in the original space R^d should be hashed into similar binary codes in the code space {0, 1}^m to preserve the similarity structure in the original space. In general, we compute the binary code of xi as yi = [h1(xi), h2(xi), ..., hm(xi)]^T with m binary hash functions {hk(·)}_{k=1}^m.\n\nBecause it is NP hard to directly compute the best binary functions hk(·) for a given data set [31], most hashing methods adopt a two-stage strategy to learn hk(·). In the projection stage, m real-valued projection functions {fk(x)}_{k=1}^m are learned and each function can generate one real value. Hence, we have m projected dimensions, each of which corresponds to one projection function. In the quantization stage, the real values are quantized into a binary string by thresholding.\n\nCurrently, most methods use one bit to quantize each projected dimension. More specifically, hk(xi) = sgn(fk(xi)), where sgn(x) = 1 if x ≥ 0 and 0 otherwise. The only exceptions among the quantization methods are AGH [21], DBQ [14] and MH [15], which use two bits to quantize each dimension. In sum, all of these methods adopt the same number (either one or two) of bits for different projected dimensions. However, the variances of different projected dimensions are unequal, and larger-variance dimensions typically carry more information. Hence, using the same number of bits for different projected dimensions with unequal variances is unreasonable, which has also been argued by many researchers [7, 12, 21]. Unfortunately, there exist no methods which can learn projection functions with equal variances for different dimensions. 
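As an illustration of this two-stage strategy (not the paper's method — the projection here is a plain random one in the spirit of LSH, and the helper name hash_codes is our own), a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 16, 100, 8                    # feature dim, number of points, code size

X = rng.standard_normal((d, n))         # columns are the data points x_i
W = rng.standard_normal((d, m))         # projection stage: m random directions

def hash_codes(X, W):
    # f_k(x) = w_k^T x (projection stage), then h_k(x) = sgn(f_k(x))
    # (quantization stage), with sgn(x) = 1 if x >= 0 and 0 otherwise
    return (W.T @ X >= 0).astype(np.uint8)   # m x n matrix of bits

Y = hash_codes(X, W)
```

Each column of Y is the m-bit code of one data point; a learned method would replace the random W by a trained projection matrix.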
In the remainder of this section, we present a novel model to learn projections with isotropic variances.\n\n2.2 Model Formulation\n\nThe idea of our IsoHash method is to learn an orthogonal matrix to rotate the PCA projection matrix. To generate a code of m bits, PCAH performs PCA on X, and then uses the top m eigenvectors of the covariance matrix XX^T as columns of the projection matrix W ∈ R^{d×m}. Here, the top m eigenvectors are those corresponding to the m largest eigenvalues {λk}_{k=1}^m, generally arranged in non-increasing order λ1 ≥ λ2 ≥ ... ≥ λm. Hence, the projection functions of PCAH are defined as follows: fk(x) = wk^T x, where wk is the kth column of W.\n\nLet λ = [λ1, λ2, ..., λm]^T and Λ = diag(λ), where diag(λ) denotes the diagonal matrix whose diagonal entries are formed from the vector λ. It is easy to prove that W^T XX^T W = Λ. Hence, the variance of the values {fk(xi)}_{i=1}^n on the kth projected dimension, which corresponds to the kth row of W^T X, is λk. Obviously, the variances for different PCA dimensions are anisotropic.\n\nTo get isotropic projection functions, the idea of our IsoHash method is to learn an orthogonal matrix Q ∈ R^{m×m} which makes Q^T W^T XX^T W Q become a matrix with equal diagonal values, i.e., [Q^T W^T XX^T W Q]_{11} = [Q^T W^T XX^T W Q]_{22} = ... = [Q^T W^T XX^T W Q]_{mm}. Here, A_{ii} denotes the ith diagonal entry of a square matrix A, and a matrix Q is said to be orthogonal if Q^T Q = I, where I is an identity matrix whose dimensionality depends on the context. The effect of the orthogonal matrix Q is to rotate the coordinate axes while keeping the Euclidean distances between any two points unchanged. It is easy to prove that the new projection functions of IsoHash are fk(x) = (WQ)_k^T x, which all have the same (isotropic) variance. 
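The fact that the kth PCA dimension has variance λk, and that these variances are anisotropic, can be verified numerically. A small sketch (our own illustration, using numpy.linalg.eigh on XX^T for the PCA step):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 4

# zero-centered data with deliberately anisotropic scales per feature
X = (rng.standard_normal((n, d)) * np.linspace(5.0, 1.0, d)).T   # d x n
X = X - X.mean(axis=1, keepdims=True)

# top-m eigenvectors of XX^T, eigenvalues arranged in non-increasing order
evals, evecs = np.linalg.eigh(X @ X.T)
order = np.argsort(evals)[::-1][:m]
lam, W = evals[order], evecs[:, order]

P = W.T @ X                         # the m projected dimensions of PCAH
row_sq = (P ** 2).sum(axis=1)       # (unnormalized) variance of each dimension
```

Since P P^T = W^T XX^T W = Λ, the sum of squares of the kth row equals λk exactly, and the λk differ across dimensions.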
Here (WQ)_k denotes the kth column of WQ.\n\nIf we use tr(A) to denote the trace of a symmetric matrix A, we have the following Lemma 1.\n\nLemma 1. If Q^T Q = I, then tr(Q^T AQ) = tr(A).\n\nBased on Lemma 1, we have tr(Q^T W^T XX^T W Q) = tr(W^T XX^T W) = tr(Λ) = sum_{i=1}^m λi if Q^T Q = I. Hence, to make Q^T W^T XX^T W Q become a matrix with equal diagonal values, we should set this diagonal value to a = (sum_{i=1}^m λi)/m. Let\n\na = [a1, a2, ..., am] with ai = a = (sum_{i=1}^m λi)/m,   (1)\n\nand\n\nT(z) = {T ∈ R^{m×m} | diag(T) = diag(z)},\n\nwhere z is a vector of length m, and diag(T) is overloaded to denote the diagonal matrix with the same diagonal entries as the matrix T.\n\nBased on our motivation for IsoHash, we can define the problem of IsoHash as follows:\n\nProblem 1. The problem of IsoHash is to find an orthogonal matrix Q making Q^T W^T XX^T W Q ∈ T(a), where a is defined in (1).\n\nThen, we have the following Theorem 1:\n\nTheorem 1. Assume Q^T Q = I and T ∈ T(a). If Q^T ΛQ = T, Q will be a solution to the problem of IsoHash.\n\nProof. Because W^T XX^T W = Λ, we have Q^T ΛQ = Q^T [W^T XX^T W] Q. It is obvious that Q will be a solution to the problem of IsoHash.\n\nAs in [2], we define\n\nM(Λ) = {Q^T ΛQ | Q ∈ O(m)},   (2)\n\nwhere O(m) is the set of all orthogonal matrices in R^{m×m}, i.e., those with Q^T Q = I.\n\nAccording to Theorem 1, the problem of IsoHash is equivalent to finding an orthogonal matrix Q satisfying the following equation [2]:\n\n||T − Z||_F = 0,   (3)\n\nwhere T ∈ T(a), Z ∈ M(Λ), and || · ||_F denotes the Frobenius norm. Please note that for ease of understanding, we use the same notations as those in [2].\n\nIn the following content, we will use the Schur-Horn lemma [11] to prove that we can always find a solution to problem (3).\n\nLemma 2. 
[Schur-Horn Lemma] Let c = {ci} ∈ R^m and b = {bi} ∈ R^m be real vectors in non-increasing order¹, i.e., c1 ≥ c2 ≥ ... ≥ cm and b1 ≥ b2 ≥ ... ≥ bm. There exists a Hermitian matrix H with eigenvalues c and diagonal values b if and only if\n\nsum_{i=1}^k bi ≤ sum_{i=1}^k ci, for any k = 1, 2, ..., m,\n\nsum_{i=1}^m bi = sum_{i=1}^m ci.\n\nProof. Please refer to Horn's article [11].\n\nBased on Lemma 2, we have the following Theorem 2.\n\nTheorem 2. There exists a solution to the IsoHash problem in (3), and this solution is in the intersection of T(a) and M(Λ).\n\nProof. Because λ1 ≥ λ2 ≥ ... ≥ λm and a1 = a2 = ... = am = (sum_{i=1}^m λi)/m, it is easy to prove that (sum_{i=1}^k λi)/k ≥ (sum_{i=1}^m λi)/m for any k. Hence, sum_{i=1}^k λi ≥ (k × sum_{i=1}^m λi)/m = sum_{i=1}^k ai. Furthermore, sum_{i=1}^m λi = sum_{i=1}^m ai. According to Lemma 2, there exists a Hermitian matrix H with eigenvalues λ and diagonal values a. Moreover, we can prove that H is in the intersection of T(a) and M(Λ), i.e., H ∈ T(a) and H ∈ M(Λ).\n\nAccording to Theorem 2, finding a Q solving the problem in (3) is equivalent to finding the intersection point of T(a) and M(Λ), which is just an inverse eigenvalue problem, called SHIEP in [2].\n\n2.3 Learning\n\nThe problem in (3) can be reformulated as the following optimization problem:\n\nargmin_{Q: T∈T(a), Z∈M(Λ)} ||T − Z||_F.   (4)\n\nAs in [2], we propose two algorithms to learn Q: one is called lift and projection (LP), and the other is called gradient flow (GF). 
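The feasibility argument above (Lemma 2 plus Theorem 2) is easy to check numerically. A small sketch with a hypothetical helper schur_horn_ok (our own name and code, not from the paper):

```python
import numpy as np

def schur_horn_ok(c, b):
    """Horn's condition: a Hermitian matrix with eigenvalues c and diagonal b
    exists iff every prefix sum of b (sorted non-increasingly) is bounded by
    the corresponding prefix sum of c, with equality for the full sums."""
    c = np.sort(np.asarray(c, float))[::-1]
    b = np.sort(np.asarray(b, float))[::-1]
    return bool(np.all(np.cumsum(b) <= np.cumsum(c) + 1e-12)
                and np.isclose(b.sum(), c.sum()))

lam = np.array([4.0, 2.5, 1.0, 0.5])     # PCA eigenvalues, non-increasing
a = np.full_like(lam, lam.mean())        # IsoHash target: equal diagonal values
```

For the isotropic target a, the prefix-sum inequalities hold because prefix averages of a non-increasing sequence never fall below the overall mean, which is exactly the argument in the proof of Theorem 2.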
For ease of understanding, we use the same notations as those in [2], and some proofs of theorems are omitted. The readers can refer to [2] for the details.\n\n2.3.1 Lift and Projection\n\nThe main idea of the lift and projection (LP) algorithm is to alternate between the following two steps:\n\n• Lift step: Given a T^(k) ∈ T(a), we find the point Z^(k) ∈ M(Λ) such that ||T^(k) − Z^(k)||_F = dist(T^(k), M(Λ)), where dist(T^(k), M(Λ)) denotes the minimum distance between T^(k) and the points in M(Λ).\n\n• Projection step: Given a Z^(k), we find T^(k+1) ∈ T(a) such that ||T^(k+1) − Z^(k)||_F = dist(T(a), Z^(k)), where dist(T(a), Z^(k)) denotes the minimum distance between Z^(k) and the points in T(a).\n\n¹Please note that in [2], the values are in increasing order. It is easy to prove that our presentation of the Schur-Horn lemma is equivalent to that in [2]. The non-increasing order is chosen here just because it facilitates our following presentation due to the non-increasing order of the eigenvalues in Λ.\n\nWe call Z^(k) a lift of T^(k) onto M(Λ) and T^(k+1) a projection of Z^(k) onto T(a). The projection operation is easy to complete. Suppose Z^(k) = [zij]; then T^(k+1) = [tij] must be given by\n\ntij = zij if i ≠ j, and tij = ai if i = j.   (5)\n\nFor the lift operation, we have the following Theorem 3.\n\nTheorem 3. Suppose T = Q^T DQ is an eigen-decomposition of T, where D = diag(d) with d = [d1, d2, ..., dm]^T being T's eigenvalues, ordered as d1 ≥ d2 ≥ ... ≥ dm. Then the nearest neighbor of T in M(Λ) is given by\n\nZ = Q^T ΛQ.   (6)\n\nProof. 
See Theorem 4.1 in [3].\n\nSince in each step we minimize the distance between T and Z, we have\n\n||T^(k) − Z^(k)||_F ≥ ||T^(k+1) − Z^(k)||_F ≥ ||T^(k+1) − Z^(k+1)||_F.\n\nIt is easy to see that (T^(k), Z^(k)) will converge to a stationary point. The whole IsoHash algorithm based on LP, abbreviated as IsoHash-LP, is briefly summarized in Algorithm 1.\n\nAlgorithm 1 Lift and projection based IsoHash (IsoHash-LP)\nInput: X ∈ R^{d×n}, m ∈ N+, t ∈ N+\n• [Λ, W] = PCA(X, m), as stated in Section 2.2.\n• Generate a random orthogonal matrix Q0 ∈ R^{m×m}.\n• Z^(0) ← Q0^T ΛQ0.\n• for k = 1 → t do\n    Calculate T^(k) from Z^(k−1) by equation (5).\n    Perform eigen-decomposition of T^(k) to get Qk^T D Qk = T^(k).\n    Calculate Z^(k) from Qk and Λ by equation (6).\n• end for\n• Y = sgn(Qt^T W^T X).\nOutput: Y\n\nBecause M(Λ) is not a convex set, the stationary point we find is not necessarily inside the intersection of T(a) and M(Λ). For example, if we set Z^(0) = Λ, the lift and projection learning algorithm would make no progress because Z and T are already at a stationary point. To avoid such degenerate solutions, we initialize Z as Λ transformed by some random orthogonal matrix Q0, as illustrated in Algorithm 1.\n\n2.3.2 Gradient Flow\n\nAnother learning algorithm is a continuous one based on the construction of a gradient flow (GF) on the surface M(Λ) that moves towards the desired intersection point. Because there always exists a solution for the problem in (3) according to Theorem 2, the objective function in (4) can be reformulated as follows [2]:\n\nmin_{Q∈O(m)} F(Q) = (1/2) ||diag(Q^T ΛQ) − diag(a)||_F^2.   (7)\n\nThe details about how to optimize (7) can be found in [2]. 
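Algorithm 1 can be prototyped in a few lines. The sketch below is our own NumPy rendition (the name isohash_lp_rotation is ours, and it operates directly on a given eigenvalue vector λ rather than on X), alternating the projection step (5) and the lift step (6):

```python
import numpy as np

def isohash_lp_rotation(lam, t=200, seed=0):
    """Alternate projection onto T(a) (eq. 5) and lift onto M(Lam) (eq. 6),
    starting from a randomly rotated Lam to avoid the degenerate point Z = Lam."""
    lam = np.sort(np.asarray(lam, float))[::-1]      # non-increasing, as in the paper
    m = len(lam)
    a = np.full(m, lam.mean())
    Q0, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((m, m)))
    Z = Q0.T @ np.diag(lam) @ Q0
    for _ in range(t):
        T = Z.copy()
        np.fill_diagonal(T, a)                       # projection step, eq. (5)
        w, V = np.linalg.eigh(T)                     # eigh returns ascending eigenvalues
        V = V[:, ::-1]                               # pair largest d_i with largest lam_i
        Z = V @ np.diag(lam) @ V.T                   # lift step, eq. (6)
    return Z

lam = np.array([5.0, 3.0, 1.5, 0.5])
Z = isohash_lp_rotation(lam)
```

The returned Z keeps the spectrum λ by construction while its diagonal is driven towards the isotropic value; in the full algorithm the final rotation is read off from the last eigen-decomposition and applied to W^T X before taking signs.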
We just show some key steps of the learning algorithm in the following content.\n\nThe gradient ∇F at Q can be calculated as\n\n∇F(Q) = 2ΛQβ(Q),   (8)\n\nwhere β(Q) = diag(Q^T ΛQ) − diag(a). Once we have computed the gradient of F, it can be projected onto the manifold O(m) according to the following Theorem 4.\n\nTheorem 4. The projection of ∇F(Q) onto O(m) is given by\n\ng(Q) = Q[Q^T ΛQ, β(Q)],   (9)\n\nwhere [A, B] = AB − BA is the Lie bracket.\n\nProof. See the formulas (20), (21) and (22) in [3].\n\nThe vector field dQ/dt = −g(Q) defines a steepest descent flow on the manifold O(m) for the function F(Q). Letting Z = Q^T ΛQ and α(Z) = β(Q), we get\n\ndZ/dt = [Z, [α(Z), Z]],   (10)\n\nwhich is an isospectral flow that moves to reduce the objective function F(Q).\n\nAs stated by Theorems 3.3 and 3.4 in [2], a stable equilibrium point of (10) must satisfy β(Q) = 0, which means that F(Q) has decreased to zero. Hence, the gradient flow method can always find an intersection point as the solution. The whole IsoHash algorithm based on GF, abbreviated as IsoHash-GF, is briefly summarized in Algorithm 2.\n\nAlgorithm 2 Gradient flow based IsoHash (IsoHash-GF)\nInput: X ∈ R^{d×n}, m ∈ N+\n• [Λ, W] = PCA(X, m), as stated in Section 2.2.\n• Generate a random orthogonal matrix Q0 ∈ R^{m×m}.\n• Z^(0) ← Q0^T ΛQ0.\n• Start integration from Z = Z^(0) with the gradient computed from equation (10).\n• Stop integration when reaching a stable equilibrium point.\n• Perform eigen-decomposition of Z to get Q^T ΛQ = Z.\n• Y = sgn(Q^T W^T X).\nOutput: Y\n\nWe now discuss some implementation details of IsoHash-GF. Since all diagonal matrices in M(Λ) result in dZ/dt = 0, one should not use Λ as the starting point. 
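Equation (10) can be integrated numerically even without a full ODE suite. The following rough sketch is our own code: it uses fixed-step Euler in place of the adaptive solver the paper uses, and the name isohash_gf_flow is made up:

```python
import numpy as np

def isohash_gf_flow(lam, step=1e-4, iters=20000, seed=0):
    """Integrate the isospectral flow dZ/dt = [Z, [alpha(Z), Z]] (eq. 10) with
    plain fixed-step Euler; this is only a sketch of the continuous dynamics."""
    lam = np.asarray(lam, float)
    m = len(lam)
    a = lam.mean()
    Q0, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((m, m)))
    Z = Q0.T @ np.diag(lam) @ Q0                 # random start, never Z = Lam itself
    F_hist = []
    for _ in range(iters):
        beta = np.diag(Z) - a                    # diagonal deviation from the target
        F_hist.append(0.5 * float(beta @ beta))  # objective F along the flow
        alpha = np.diag(beta)                    # alpha(Z) = diag(Z) - diag(a)
        inner = alpha @ Z - Z @ alpha            # Lie bracket [alpha(Z), Z]
        Z = Z + step * (Z @ inner - inner @ Z)   # Euler step on eq. (10)
    return Z, F_hist

lam = np.array([5.0, 3.0, 1.5, 0.5])
Z, F_hist = isohash_gf_flow(lam)
```

The double-bracket update preserves symmetry and trace exactly and decreases F along the flow; a production version would use an adaptive integrator with error control, as described next.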
In our implementation, we use the same method as that in IsoHash-LP to avoid this degenerate case, i.e., a random orthogonal transformation matrix Q0 is used to rotate Λ. To integrate Z with the gradient in (10), we use the Adams-Bashforth-Moulton PECE solver in [27], where the parameter RelTol is set to 10^{-3}. The relative error of the algorithm is computed by comparing the diagonal entries of Z to the target diag(a). The whole integration process is terminated when this relative error falls below 10^{-7}.\n\n2.4 Complexity Analysis\n\nThe learning of our IsoHash method contains two phases: the first phase is PCA and the second phase is LP or GF. The time complexity of PCA is O(min(n^2 d, nd^2)). The time complexity of LP after PCA is O(m^3 t), and that of GF after PCA is O(m^3). In our experiments, t is set to 100 because good performance can be achieved at this setting. Because m is typically set to a very small number like 64 or 128, the main time complexity of IsoHash is from the PCA phase. In general, the training of IsoHash-GF is faster than that of IsoHash-LP in our experiments.\n\nOne promising property of both LP and GF is that the time complexity after PCA is independent of the number of training data, which makes them scalable to large-scale data sets.\n\n3 Relation to Existing Works\n\nThe method most closely related to IsoHash is ITQ [7], because both ITQ and IsoHash learn an orthogonal matrix. 
However, IsoHash is different from ITQ in many aspects: firstly, the goal of IsoHash is to learn a projection with isotropic variances, but the results of ITQ cannot necessarily guarantee isotropic variances; secondly, IsoHash directly learns the orthogonal matrix from the eigenvalues and eigenvectors of PCA, but ITQ first quantizes the PCA results to get some binary codes, and then learns the orthogonal matrix based on the resulting binary codes; thirdly, IsoHash has an explicit objective function to optimize, but ITQ uses a two-step heuristic strategy whose goal cannot be formulated as a single objective function; fourthly, to learn the orthogonal matrix, IsoHash uses lift and projection or gradient flow, but ITQ uses the Procrustes method, which is much slower than IsoHash. From the experimental results which will be presented in the next section, we can find that IsoHash can achieve accuracy comparable to ITQ with much faster training speed.\n\n4 Experiment\n\n4.1 Data Sets\n\nWe evaluate our methods on two widely used data sets, CIFAR [16] and LabelMe [28].\n\nThe first data set is CIFAR-10 [16], which consists of 60,000 images. These images are manually labeled into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The size of each image is 32×32 pixels. We represent them with 256-dimensional gray-scale GIST descriptors [24].\n\nThe second data set is 22K LabelMe used in [23, 28], which contains 22,019 images sampled from the large LabelMe data set. As in [28], the images are scaled to 32×32 pixels and then represented by 512-dimensional GIST descriptors [24].\n\n4.2 Evaluation Protocols and Baselines\n\nFollowing the protocols widely used in recent papers [7, 23, 25, 31], Euclidean neighbors in the original space are considered as ground truth. 
More specifically, a threshold of the average distance to the 50th nearest neighbor is used to define whether a point is a true positive or not. Based on the Euclidean ground truth, we compute the precision-recall curve and mean average precision (mAP) [7, 21]. For all experiments, we randomly select 1000 points as queries, and leave the rest as the training set to learn the hash functions. All the experimental results are averaged over 10 random training/test partitions.\n\nAlthough a lot of hashing methods have been proposed, some of them are either supervised [23] or semi-supervised [29]. Our IsoHash method is essentially an unsupervised one. Hence, for fair comparison, we select the most representative unsupervised methods for evaluation, namely PCAH [7], ITQ [7], SH [31], LSH [1], and SIKH [25]. Among these methods, PCAH, ITQ and SH are data-dependent methods, while SIKH and LSH are data-independent methods.\n\nAll experiments are conducted on our workstation with an Intel(R) Xeon(R) CPU X7560@2.27GHz and 64G memory.\n\n4.3 Accuracy\n\nTable 1 shows the Hamming ranking performance measured by mAP on LabelMe and CIFAR. It is clear that our IsoHash methods, including both IsoHash-GF and IsoHash-LP, achieve far better performance than PCAH. The main difference between IsoHash and PCAH is that the PCAH dimensions have anisotropic variances while the IsoHash dimensions have isotropic variances. Hence, the intuitive viewpoint that using the same number of bits for different projected dimensions with anisotropic variances is unreasonable has been successfully verified by our experiments. Furthermore, the performance of IsoHash is also comparable, if not superior, to state-of-the-art methods such as ITQ.\n\nFigure 1 illustrates the precision-recall curves on the LabelMe data set with different code sizes. The relative performance in the precision-recall curves on CIFAR is similar to that on LabelMe. 
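The mAP measure used in this protocol can be sketched in a few lines (our own helper average_precision; the 0/1 relevance labels would come from the Euclidean-threshold ground truth described above):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance holds 0/1 relevance labels of the
    database items, sorted by increasing Hamming distance to the query."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)[rel == 1]          # number of relevant items found at each hit
    ranks = np.nonzero(rel)[0] + 1           # 1-based ranks of the hits
    return float((hits / ranks).mean())      # mean precision over the hit positions

# mAP is simply the mean of AP over all queries
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
mAP = float(np.mean([average_precision(q) for q in queries]))
```

For the first toy query, the relevant items sit at ranks 1 and 3, so AP = (1/1 + 2/3)/2.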
We omit the results on CIFAR due to space limitation. Once again, we can find that our IsoHash methods achieve performance which is far better than PCAH and comparable to the state-of-the-art.\n\nTable 1: mAP on LabelMe and CIFAR data sets.\n\nLabelMe, # bits = 32 / 64 / 96 / 128 / 256\nIsoHash-GF: 0.2580 / 0.3269 / 0.3528 / 0.3662 / 0.3889\nIsoHash-LP: 0.2534 / 0.3223 / 0.3577 / 0.3826 / 0.4274\nPCAH: 0.0516 / 0.0401 / 0.0341 / 0.0307 / 0.0232\nITQ: 0.2786 / 0.3328 / 0.3504 / 0.3615 / 0.3728\nSH: 0.0826 / 0.1034 / 0.1447 / 0.1653 / 0.2080\nSIKH: 0.0590 / 0.1482 / 0.2074 / 0.2526 / 0.4488\nLSH: 0.1549 / 0.2574 / 0.3147 / 0.3375 / 0.4034\n\nCIFAR, # bits = 32 / 64 / 96 / 128 / 256\nIsoHash-GF: 0.2249 / 0.2969 / 0.3256 / 0.3357 / 0.3600\nIsoHash-LP: 0.1907 / 0.2624 / 0.3027 / 0.3223 / 0.3651\nPCAH: 0.0319 / 0.0274 / 0.0241 / 0.0216 / 0.0168\nITQ: 0.2490 / 0.3051 / 0.3238 / 0.3319 / 0.3436\nSH: 0.0510 / 0.0589 / 0.0802 / 0.1121 / 0.1535\nSIKH: 0.0353 / 0.0902 / 0.1245 / 0.1909 / 0.3614\nLSH: 0.1052 / 0.1907 / 0.2396 / 0.2776 / 0.3432\n\nFigure 1: Precision-recall curves on the LabelMe data set. Panels: (a) 32 bits, (b) 64 bits, (c) 96 bits, (d) 256 bits.\n\n4.4 Computational Cost\n\nTable 2 shows the training time on CIFAR. We can see that our IsoHash methods are much faster than ITQ. The time complexity of ITQ also contains two parts: the first part is PCA, which is the same as that in IsoHash, and the second part is an iterative algorithm to rotate the original PCA matrix with time complexity O(nm^2), where n is the number of training points and m is the number of bits in the binary code. Hence, as the number of training data increases, the second-part time complexity of ITQ will increase linearly with the number of training points. But the time complexity of IsoHash after PCA is independent of the number of training points. Hence, IsoHash will be much faster than ITQ, particularly in the case with a large number of training points. 
This is clearly shown in Figure 2, which illustrates the training time as the number of training data is varied.\n\nTable 2: Training time (in seconds) on CIFAR.\n\n# bits = 32 / 64 / 96 / 128 / 256\nIsoHash-GF: 2.48 / 2.45 / 2.70 / 3.00 / 5.55\nIsoHash-LP: 2.14 / 2.43 / 2.94 / 3.47 / 8.83\nPCAH: 1.84 / 2.14 / 2.23 / 2.36 / 2.92\nITQ: 4.35 / 6.33 / 9.73 / 12.40 / 29.25\nSH: 1.60 / 3.41 / 8.37 / 13.66 / 49.44\nSIKH: 1.30 / 1.44 / 1.57 / 1.55 / 2.20\nLSH: 0.05 / 0.08 / 0.11 / 0.19 / 0.31\n\nFigure 2: Training time on CIFAR.\n\n5 Conclusion\n\nAlthough many researchers have intuitively argued that using the same number of bits for different projected dimensions with anisotropic variances is unreasonable, this viewpoint has still not been verified by either theory or experiment because no methods have been proposed to find projection functions with isotropic variances for different dimensions. The proposed IsoHash method in this paper is the first work to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets have successfully verified the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. Furthermore, IsoHash can achieve accuracy comparable to the state-of-the-art methods with faster training speed.\n\n6 Acknowledgments\n\nThis work is supported by the NSFC (No. 61100125), the 863 Program of China (No. 2011AA01A202, No. 
2012AA011003), and the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT).\n\nReferences\n\n[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.\n[2] M.T. Chu. Constructing a Hermitian matrix from its diagonal entries and eigenvalues. SIAM Journal on Matrix Analysis and Applications, 16(1):207–217, 1995.\n[3] M.T. Chu and K.R. Driessel. The projected gradient method for least squares matrix approximations with spectral constraints. SIAM Journal on Numerical Analysis, pages 1050–1060, 1990.\n[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry, 2004.\n[5] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.\n[6] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization based binary codes for fast similarity search. In NIPS, 2012.\n[7] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In CVPR, 2011.\n[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 2012.\n[9] J. He, W. Liu, and S.-F. Chang. 
Scalable similarity search with optimized kernel hashing. In KDD, 2010.\n[10] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR, 2012.\n[11] A. Horn. Doubly stochastic matrices and the diagonal of a rotation matrix. American Journal of Mathematics, 76(3):620–630, 1954.\n[12] H. Jegou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.\n[13] I. Jolliffe. Principal Component Analysis. Springer, 2002.\n[14] W. Kong and W.-J. Li. Double-bit quantization for hashing. In AAAI, 2012.\n[15] W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large-scale image retrieval. In SIGIR, 2012.\n[16] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, University of Toronto, 2009.\n[17] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.\n[18] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.\n[19] B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2143–2157, 2009.\n[20] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.\n[21] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, 2011.\n[22] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In AAAI, 2010.\n[23] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.\n[24] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.\n[25] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.\n[26] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. 
Reasoning, 50(7):969–978, 2009.\n[27] L.F. Shampine and M.K. Gordon. Computer solution of ordinary differential equations: the initial value problem. Freeman, San Francisco, California, 1975.\n[28] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.\n[29] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.\n[30] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.\n[31] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.\n[32] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Complementary hashing for approximate nearest neighbor search. In ICCV, 2011.\n[33] D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In SIGIR, 2011.\n[34] Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In KDD, 2012.", "award": [], "sourceid": 776, "authors": [{"given_name": "Weihao", "family_name": "Kong", "institution": null}, {"given_name": "Wu-jun", "family_name": "Li", "institution": null}]}