{"title": "An ensemble diversity approach to supervised binary hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 765, "abstract": "Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval.", "full_text": "An Ensemble Diversity Approach\n\nto Supervised Binary Hashing\n\nMiguel \u00b4A. Carreira-Perpi \u02dcn\u00b4an\n\nRamin Raziperchikolaei\n\nEECS, University of California, Merced\nmcarreira-perpinan@ucmerced.edu\n\nEECS, University of California, Merced\nrraziperchikolaei@ucmerced.edu\n\nAbstract\n\nBinary hashing is a well-known approach for fast approximate nearest-neighbor\nsearch in information retrieval. Much work has focused on af\ufb01nity-based objective\nfunctions involving the hash functions or binary codes. These objective functions\nencode neighborhood information between data points and are often inspired by\nmanifold learning algorithms. 
They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval.

Information retrieval tasks such as searching for a query image or document in a database are essentially a nearest-neighbor search [33]. When the dimensionality of the query and the size of the database are large, approximate search is necessary. We focus on binary hashing [17], where the query and database are mapped onto low-dimensional binary vectors, in which the search is then performed. This has two speedups: computing Hamming distances (with hardware support) is much faster than computing distances between high-dimensional floating-point vectors; and the entire database becomes much smaller, so it may reside in fast memory rather than disk (for example, a database of 1 billion real vectors of dimension 500 takes 2 TB in floating point but 8 GB as 64-bit codes).

Constructing hash functions that do well in retrieval measures such as precision and recall is usually done by optimizing an affinity-based objective function that relates Hamming distances to supervised neighborhood information in a training set. Many such objective functions have the form of a sum of pairwise terms that indicate whether the training points xn and xm are neighbors:

min_h L(h) = Σ_{n,m=1}^N L(zn, zm; ynm), where zn = h(xn), zm = h(xm).

Here, X = (x1, . . .
, xN ) is the dataset of high-dimensional feature vectors (e.g., SIFT features of an image), h: R^D → {−1, +1}^b are b binary hash functions and z = h(x) is the b-bit code vector for input x ∈ R^D, min_h means minimizing over the parameters of the hash function h (e.g., over the weights of a linear SVM), and L(·) is a loss function that compares the codes for two images (often through their Hamming distance, related to ||zn − zm||) with the ground-truth value ynm that measures the affinity in the original space between the two images xn and xm (distance, similarity or other measure of neighborhood). The sum is often restricted to a subset of image pairs (n, m) (for example, within the k nearest neighbors of each other in the original space), to keep the runtime low. The output of the algorithm is the hash function h and the binary codes Z = (z1, . . . , zN ) for the training points, where zn = h(xn) for n = 1, . . . , N. Examples of these objective functions are Supervised Hashing with Kernels (KSH) [28], Binary Reconstructive Embeddings (BRE) [21] and the binary Laplacian loss (an extension of the Laplacian Eigenmaps objective; [2]), where L(zn, zm; ynm) is:

KSH: (zn^T zm − b ynm)^2    LAP: ynm ||zn − zm||^2    BRE: ((1/b) ||zn − zm||^2 − ynm)^2    (1)

where for KSH ynm is 1 if xn, xm are similar and −1 if they are dissimilar; for BRE ynm = (1/2) ||xn − xm||^2 (where the dataset X is normalized so the Euclidean distances are in [0, 1]); and for the Laplacian loss ynm > 0 if xn, xm are similar and < 0 if they are dissimilar ("positive" and "negative" neighbors). 

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
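For concreteness, the three pairwise losses in (1) are straightforward to evaluate on a pair of ±1 code vectors; a minimal NumPy sketch (the function names and the toy codes are ours, not from the paper):

```python
import numpy as np

# The three pairwise losses of Eq. (1) for b-bit codes zn, zm in {-1,+1}^b.
def ksh_loss(zn, zm, y, b):
    # KSH: (zn^T zm - b*y)^2
    return float((zn @ zm - b * y) ** 2)

def lap_loss(zn, zm, y):
    # Laplacian: y * ||zn - zm||^2
    return float(y * np.sum((zn - zm) ** 2))

def bre_loss(zn, zm, y, b):
    # BRE: ((1/b) ||zn - zm||^2 - y)^2; for ±1 codes, ||zn - zm||^2 is
    # proportional to the Hamming distance.
    return float((np.sum((zn - zm) ** 2) / b - y) ** 2)

zn = np.array([1, -1, 1, 1])   # b = 4 bits
zm = np.array([1, 1, -1, 1])
print(ksh_loss(zn, zm, 1, b=4))   # 16.0: a similar pair with disagreeing bits is penalized
```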
Other examples of these objectives include models developed for dimension reduction, be they spectral such as Locally Linear Embedding [32] or Anchor Graphs [27], or nonlinear such as the Elastic Embedding [7] or t-SNE; as well as objectives designed specifically for binary hashing, such as Semi-supervised sequential Projection Learning Hashing (SPLH) [34]. They all can produce good hash functions. We will focus on the Laplacian loss in this paper.

In designing these objective functions, one needs to eliminate two types of trivial solutions. 1) In the Laplacian loss, mapping all points to the same code, i.e., z1 = · · · = zN, is the global optimum of the positive neighbors term (this also arises if the codes zn are real-valued, as in Laplacian eigenmaps). This can be avoided by having negative neighbors. 2) Having all hash functions (all b bits of each vector) being identical to each other, i.e., zn1 = · · · = znb for each n = 1, . . . , N. This can be avoided by introducing constraints, penalty terms or other mathematical devices that couple the b bits. For example, in the Laplacian loss (1) we can encourage codes to be orthogonal through a constraint Z^T Z = N I [35] or a penalty term ||Z^T Z − N I||^2 (with a hyperparameter that controls the weight of the penalty) [14], although this generates dense N × N matrices. In the KSH or BRE losses (1), squaring the dot product or Hamming distance between the codes couples the b bits.

An important downside of these approaches is the difficulty of their optimization. This is due to the fact that the objective function is nonsmooth (implicitly discrete) because of the binary output of the hash function. There is a large number of such binary variables (bN), a larger number of pairwise interactions (O(N^2), less if using sparse neighborhoods), and the variables are coupled by the said constraints or penalty terms. 
The optimization is approximated in different ways. Most\npapers ignore the binary nature of the Z codes and optimize over them as real values, then binarize\nthem by truncation (possibly with an optimal rotation; [16]), and \ufb01nally \ufb01t a classi\ufb01er (e.g. linear\nSVM) to each of the b bits separately. For example, for the Laplacian loss with constraints this\ninvolves solving an eigenproblem on Z as in Laplacian eigenmaps [2, 35, 36], or approximated using\nlandmarks [27]. This is fast, but relaxing the codes in the optimization is generally far from optimal.\nSome recent papers try to respect the binary nature of the codes during their optimization, using\ntechniques such as alternating optimization, min-cut and GraphCut [4, 14, 26] or others [25], and\nthen \ufb01t the classi\ufb01ers, or use alternating optimization directly on the hash function parameters [28].\nEven more recently, one can optimize jointly over the binary codes and hash functions [8, 14, 31].\nMost of these approaches are slow and limited to small datasets (a few thousand points) because of\nthe quadratic number of pairwise terms in the objective.\n\nWe propose a different, much simpler approach. Rather than coupling the b hash functions into\na single objective function, we train each hash function independently from each other and using\na single-bit objective function of the same form. We show that we can avoid trivial solutions by\ninjecting diversity into each hash function\u2019s training using techniques inspired from classi\ufb01er en-\nsemble learning. 
Section 1 discusses relevant ideas from the ensemble learning literature, section 2\ndescribes our independent Laplacian hashing algorithm, section 3 gives evidence with image re-\ntrieval datasets that this simple approach indeed works very well, and section 4 further discusses the\nconnection between hashing and ensembles.\n\n1 Ideas from learning classi\ufb01er ensembles\n\nAt \ufb01rst sight, optimizing Laplacian loss without constraints does not seem like a good idea: since\nkzn \u2212 zmk2 separates over the b bits, we obtain b independent identical objectives, one over each\nhash function, and so they all have the same global optimum. And, if all hash functions are equal,\nthey are equivalent to using just one of them, which will give a much lower precision/recall. In\nfact, the very same issue arises when training an ensemble of classi\ufb01ers [10, 22]. Here, we have\na training set of input vectors and output class labels, and want to train several classi\ufb01ers whose\noutputs are then combined (usually by majority vote). If the classi\ufb01ers are all equal, we gain nothing\nover a single classi\ufb01er. Hence, it is necessary to introduce diversity among the classi\ufb01ers so that they\ndisagree in their predictions. The ensemble learning literature has identi\ufb01ed several mechanisms to\ninject diversity. The most important ones that apply to our binary hashing setting are as follows:\n\nUsing different data for each classi\ufb01er This can be done by: 1) Using different feature subsets for\neach classi\ufb01er. This works best if the features are somewhat redundant. 2) Using different\n\n2\n\n\ftraining sets for each classi\ufb01er. This works best for unstable algorithms (whose resulting\nclassi\ufb01er is sensitive to small changes in the training data), such as decision trees or neural\nnets, and unlike linear or nearest neighbor classi\ufb01ers. 
A prominent example is bagging [6], which generates bootstrap datasets and trains a model on each.

Injecting randomness in the training algorithm This is only possible if local optima exist (as for neural nets) or if the algorithm is randomized (as for decision trees). This can be done by using different initializations, adding noise to the updates or using different choices in the randomized operations (e.g. the choice of split in decision trees, as in random forests; [5]).

Using different classifier models For example, different parameters (e.g. the number of neighbors in a nearest-neighbor classifier), different architectures (e.g. neural nets with different number of layers or hidden units), or different types of classifiers altogether.

2 Independent Laplacian Hashing (ILH) with diversity

The connection of binary hashing with ensemble learning offers many possible options, in terms of the choice of type of hash function ("base learner"), binary hashing (single-bit) objective function, optimization algorithm, and diversity mechanism. In this paper we focus on the following choices. We use linear and kernel SVMs as hash functions. Without loss of generality (see later), we use the Laplacian objective (1), which for a single bit takes the form

E(z) = Σ_{n,m=1}^N ynm (zn − zm)^2,  zn = h(xn) ∈ {−1, 1}, n = 1, . . . , N.    (2)

To optimize it, we use a two-step approach, where we first optimize (2) over the N bits and then learn the hash function by fitting to it a binary classifier. (It is also possible to optimize over the hash function directly with the method of auxiliary coordinates [8, 31], which essentially iterates over optimizing (2) and fitting the classifier.) The Laplacian objective (2) is NP-complete if we have negative neighbors (i.e., some ynm < 0). 
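A tiny NumPy sketch of the single-bit objective (2), with a naive coordinate-descent (bit-flip) optimizer standing in for the min-cut machinery used in the paper (the names and the toy affinity matrix are ours; this is an illustration, not the paper's algorithm):

```python
import numpy as np

def laplacian_energy(z, Y):
    # E(z) = sum_{n,m} y_nm (z_n - z_m)^2 for a single bit z in {-1,+1}^N.
    diff = z[:, None] - z[None, :]
    return float(np.sum(Y * diff ** 2))

def greedy_flip(z, Y, max_sweeps=20):
    # Coordinate descent: flip any single bit that lowers E(z).
    # Illustrative stand-in only; the paper instead uses alternating
    # min-cut on submodular blocks, which scales much better.
    z = z.copy()
    for _ in range(max_sweeps):
        improved = False
        for n in range(len(z)):
            e_old = laplacian_energy(z, Y)
            z[n] = -z[n]
            if laplacian_energy(z, Y) < e_old:
                improved = True
            else:
                z[n] = -z[n]  # undo the flip
        if not improved:
            break
    return z

# (0,1) and (2,3) are similar pairs (y = +1), all cross pairs dissimilar (y = -1).
Y = np.array([[ 0,  1, -1, -1],
              [ 1,  0, -1, -1],
              [-1, -1,  0,  1],
              [-1, -1,  1,  0]], dtype=float)
z = greedy_flip(np.ones(4, dtype=int), Y)  # similar points end up with equal bits
```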
We approximately optimize it using a min-cut algorithm\n(as implemented in [4]) applied in alternating fashion to submodular blocks as described in Lin et al.\n[24]. This \ufb01rst partitions the N points into disjoint groups containing only nonnegative weights.\nEach group de\ufb01nes a submodular function (speci\ufb01cally, quadratic with nonpositive coef\ufb01cients)\nwhose global minimum can be found in polynomial time using min-cut. The order in which the\ngroups are optimized over is randomized at each iteration (this improves over using a \ufb01xed order).\nThe approximate optimizer found depends on the initial z \u2208 {\u22121, 1}N.\n\nFinally, we consider three types of diversity mechanism (as well as their combination):\nDifferent initializations (ILHi) Each hash function is initialized from a random N -bit vector z.\nDifferent training sets (ILHt) Each hash function uses a training set of N points that is different\nand (if possible) disjoint from that of other hash functions. We can afford to do this because\nin binary hashing the training sets are potentially very large, and the computational cost\nof the optimization limits the training sets to a few thousand points. Later we show this\noutperforms using bootstrapped training sets.\n\nDifferent feature subsets (ILHf) Each hash function is trained on a random subset of 1 \u2264 d \u2264 D\nfeatures sampled without replacement (so the d features are distinct). The subsets corre-\nsponding to different hash functions may overlap.\n\nThese mechanisms are applicable to other objective functions beyond (2). We could also use the\nsame training set but construct differently the weight matrix in (2) (e.g. using different numbers of\npositive and negative neighbors).\n\nEquivalence of objective functions in the single-bit case Several binary hashing objectives that\ndiffer in the general case of b > 1 bits become essentially identical in the b = 1 case. 
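This is easy to confirm by brute force before any algebra: for a single bit there are only four (zn, zm) combinations, so one can check that each loss in (1) (with b = 1) differs from a multiple of zn zm only by a constant. A short Python check (function names ours):

```python
# For 1-bit codes zn, zm in {-1,+1} (so b = 1), the KSH, BRE and Laplacian
# losses of Eq. (1) all reduce to coef*zn*zm + constant.
def ksh(zn, zm, y): return (zn * zm - y) ** 2
def bre(zn, zm, y): return ((zn - zm) ** 2 - y) ** 2
def lap(zn, zm, y): return y * (zn - zm) ** 2

def affine_in_product(loss, coef, y):
    # loss(zn, zm) - coef*zn*zm must be the same constant in all 4 cases.
    consts = {loss(zn, zm, y) - coef * zn * zm
              for zn in (-1, 1) for zm in (-1, 1)}
    return len(consts) == 1

y = 1.0
assert affine_in_product(ksh, -2 * y, y)        # KSH: -2*y*zn*zm + const
assert affine_in_product(bre, -4 * (2 - y), y)  # BRE: -4*(2-y)*zn*zm + const
assert affine_in_product(lap, -2 * y, y)        # LAP: -2*y*zn*zm + const
```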
For example, expanding the pairwise terms in (1) (noting that zn^2 = 1 if zn ∈ {−1, +1}) gives L(zn, zm; ynm) as

KSH: −2 ynm zn zm + constant    BRE: −4 (2 − ynm) zn zm + constant    LAP: −2 ynm zn zm + constant.

So all three objectives are in fact identical and can be written in the form of a binary quadratic function without linear term (or a Markov random field with quadratic potentials only):

min_z E(z) = z^T A z with z ∈ {−1, +1}^N    (3)

with an appropriate, data-dependent symmetric N × N neighborhood matrix A. This problem is NP-complete in general [3, 13, 18], when A has both positive and negative elements, as well as zeros. It is submodular if A has only nonpositive elements, in which case it is equivalent to a min-cut/max-flow problem and it can be solved in polynomial time [3].

More generally, any function of a binary vector z that has the form E(z) = Σ_{n,m=1}^N fnm(zn, zm) and which only depends on Hamming distances between bits zn, zm can be written with fnm(zn, zm) = anm zn zm + bnm. Even more, an arbitrary function of 3 binary variables that depends only on their Hamming distances can be written as a quadratic function of the 3 variables. However, for 4 variables or more this is not generally true (see supplementary material).

Computational advantages Training the hash functions independently has some important advantages. First, training the b functions can be parallelized perfectly. This is a speedup of one to two orders of magnitude for typical values of b (32 to 200 in our experiments). Coupled objective functions such as KSH do not exhibit obvious parallelism, because they are trained with alternating optimization, which is inherently sequential.

Second, even in a single processor, b binary optimizations over N variables each are generally easier than one binary optimization over bN variables. 
This is so because the search spaces contain b · 2^N and 2^(bN) states, respectively, so enumeration is much faster in the independent case (even though it is still impractical). If using an approximate polynomial-time algorithm, the independent case is also faster if the runtime is superlinear in the number of variables: the asymptotic runtimes will be O(b N^α) and O((bN)^α) with α > 1, respectively. This is the case for the best practical GraphCut [4] and max-flow/min-cut algorithms [9].

Third, the solution exhibits "nesting", that is, to get the solution for b + 1 bits we just need to take a solution with b bits and add one more bit (as happens with PCA). This is unlike most methods based on a coupled objective function (such as KSH), where the solution for b + 1 bits cannot be obtained by adding one more bit; we have to solve for b + 1 bits from scratch.

For ILHf, both the training and test time are lower than if using all D features for each hash function. The test runtime for a query is d/D times smaller.

Model selection for the number of bits b Selecting the number of bits (hash functions) to use has not received much attention in the binary hashing literature. The most obvious way to do this would be to maximize the precision on a test set over b (cross-validation) subject to b not exceeding a preset limit (so applying the hash function is fast with test queries). The nesting property of ILH makes this computationally easy: we simply keep adding bits until the test precision stabilizes or decreases, or until we reach the maximum b. We can still benefit from parallel processing: if P processors are available, we train P hash functions in parallel and evaluate their precision, also in parallel. 
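The incremental selection loop this suggests can be sketched as follows (a toy illustration; `select_num_bits` and the saturating precision curve are ours, with `precision_at` standing in for a per-bit validation-precision evaluation):

```python
def select_num_bits(precision_at, max_bits, tol=1e-3):
    # Nesting: the (b+1)-bit solution extends the b-bit one, so we can just
    # keep adding bits until validation precision stops improving.
    # `precision_at(b)` is a user-supplied callable (hypothetical here).
    best_b, best_p = 1, precision_at(1)
    for b in range(2, max_bits + 1):
        p = precision_at(b)
        if p > best_p + tol:
            best_b, best_p = b, p
        else:
            break
    return best_b, best_p

# Toy saturating precision curve, for illustration only:
curve = lambda b: 0.5 * (1 - 0.5 ** b)
```

With P processors, `precision_at` would be evaluated for P candidate bits at a time rather than one by one.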
If we\nstill need to increase b, we train P more hash functions, etc.\n\n3 Experiments\n\nWe use the following labeled datasets (all using the Euclidean distance in feature space): (1) CIFAR\n[19] contains 60 000 images in 10 classes. We use D = 320 GIST features [30] from each image.\nWe use 58 000 images for training and 2 000 for test. (2) In\ufb01nite MNIST [29]. We generated, using\nelastic deformations of the original MNIST handwritten digit dataset, 1 000 000 images for training\nand 2 000 for test, in 10 classes. We represent each image by a D = 784 vector of raw pixels. The\nsupplementary material contains experiments on additional datasets.\n\nBecause of the computational cost of af\ufb01nity-based methods, previous work has used training sets\nlimited to a few thousand points [14, 21, 25, 28]. Unless otherwise indicated, we train the hash\nfunctions in a subset of 5 000 points of the training set, and report precision and recall by searching\nfor a test query on the entire dataset (the base set). As hash functions (for each bit), we use linear\nSVMs (trained with LIBLINEAR; [12]) and kernel SVMs (with 500 basis functions centered at a\nrandom subset of training points). We report precision and recall for the test set queries using as\nground truth (set of true neighbors in original space) all the training points with the same label as the\nquery. The retrieved set contains the k nearest neighbors of the query point in the Hamming space.\nWe report precision for different values of k to test the robustness of different algorithms.\n\nDiversity mechanisms with ILH To understand the effect of diversity, we evaluate the 3 mecha-\nnisms ILHi, ILHt and ILHf, and their combination ILHitf, over a range of number of bits b (32 to\n128) and training set size N (2 000 to 20 000). 
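The precision measure used throughout (the fraction of the k retrieved points that share the query's label, with Hamming distances computed from ±1 codes) can be sketched as follows (names and toy data are ours):

```python
import numpy as np

def precision_at_k(query_code, query_label, Z, labels, k):
    # For ±1 codes, Hamming distance = (b - z_q . z_n) / 2.
    b = len(query_code)
    dist = (b - Z @ query_code) / 2
    nearest = np.argsort(dist, kind="stable")[:k]
    return float(np.mean(labels[nearest] == query_label))

# Tiny example: 4 database points with 2-bit codes.
Z = np.array([[1, 1], [1, 1], [-1, 1], [-1, -1]])
labels = np.array([0, 0, 1, 1])
p = precision_at_k(np.array([1, 1]), 0, Z, labels, k=2)  # retrieves the two label-0 points
```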
As baseline coupled objective, we use KSH [28] but using the same two-step training as ILH: first we find the codes using the alternating min-cut method described earlier (initialized from an all-ones code, and running one iteration of alternating min-cut) and then we fit the classifiers. This is faster and generally finds better optima than the original KSH optimization [26]. We denote it as KSHcut.

Figure 1: Diversity mechanisms vs baseline (KSHcut). Precision on CIFAR dataset, as a function of the training set size N (2 000 to 20 000) and number of bits b (32 to 128). Ground truth: all points with the same label as the query. Retrieved set: k = 500 nearest neighbors of the query. Errorbars shown only for ILHt (over 5 random training sets) to avoid clutter. Top to bottom: linear and kernel hash functions. Left to right: diversity mechanisms, their combination, and the baseline KSHcut.

Fig. 1 shows the results. The clearly best diversity mechanism is ILHt, which works better than the other mechanisms, even when combined with them, and significantly better than KSHcut. We explain this as follows. 
Although all 3 mechanisms introduce diversity, ILHt has a distinct advantage\n(also over KSHcut): it effectively uses b times as much training data, because each hash function has\nits own disjoint dataset. Using bN training points in KSHcut would be orders of magnitude slower.\nILHt is equal or even better than the combined ILHitf because 1) since there is already enough\ndiversity in ILHt, the extra diversity from ILHi and ILHf does not help; 2) ILHf uses less data (it\ndiscards features), which can hurt the precision; this is also seen in \ufb01g. 2 (panel 2). The precision of\nall methods saturates as N increases; with b = 128 bits, ILHt achieves nearly maximum precision\nwith only 5 000 points. In fact, if we continued to increase the per-bit training set size N in ILHt,\neventually all bits would use the same training set (containing all available data), diversity would\ndisappear and the precision would drop drastically to the precision of using a single bit (\u2248 12%).\nPractical image retrieval datasets are so large that this is unlikely to occur unless N is very large\n(which would make the optimization too slow anyway).\n\nLinear SVMs are very stable classi\ufb01ers known to bene\ufb01t less from ensembles than less stable classi-\n\ufb01ers such as decision trees or neural nets [22]. Remarkably, they strongly bene\ufb01t from the ensemble\nin our case. This is because each hash function is solving a different classi\ufb01cation problem (different\noutput labels), so the resulting SVMs are in fact quite different from each other. The conclusions\nfor kernel hash functions are similar. In \ufb01g. 1, the kernel functions are using the same, common 500\ncenters for the radial basis functions. Nonlinear classi\ufb01ers are less stable than linear ones. In our\ncase they do not bene\ufb01t much more than linear SVMs from the diversity. They do achieve higher\nprecision since they are more powerful models. 
See supplementary material for more results.\n\nFig. 2 shows the results on in\ufb01nite MNIST dataset (see supp. mat for the results on CIFAR). Panel 1\nshows the results in ILHf of varying the number of features 1 \u2264 d \u2264 D used by each hash function.\nIntuitively, very low d is bad because each classi\ufb01er receives too little information and will make\nnear-random codes. Indeed, for low d the precision is comparable to that of LSH (random projec-\ntions) in panel 4. Very high d will also work badly because it would eliminate the diversity and drop\nto the precision of a single bit for d = D. This does not happen because there is an additional source\nof diversity: the randomization in the alternating min-cut iterations. This has an effect similar to that\nof ILHi, and indeed a comparable precision. The highest precision is achieved with a proportion\nd/D \u2248 30% for ILHf, indicating some redundancy in the features. When combined with the other\ndiversity mechanisms (ILHitf, panel 2), the highest precision occurs for d = D, because diversity is\nalready provided by the other mechanisms, and using more data is better.\n\nFig. 2 (panel 3) shows the results of constructing the b training sets for ILHt as a random sample\nfrom the base set such that they are \u201cbootstrapped\u201d (sampled with replacement), \u201cdisjoint\u201d (sampled\nwithout replacement) or \u201crandom\u201d (sampled without replacement but reset for each bit, so the train-\ning sets may overlap). 
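The three sampling schemes can be sketched as follows (a hypothetical helper, not code from the paper):

```python
import numpy as np

def sample_training_sets(num_points, b, n, scheme, rng=None):
    # Per-bit training sets for ILHt: "disjoint", "random" or "bootstrap".
    # Assumes num_points >= b * n for the disjoint scheme.
    rng = np.random.default_rng(rng)
    if scheme == "disjoint":    # without replacement, one shared permutation
        perm = rng.permutation(num_points)
        return [perm[i * n:(i + 1) * n] for i in range(b)]
    if scheme == "random":      # without replacement, reset for each bit
        return [rng.choice(num_points, size=n, replace=False) for _ in range(b)]
    if scheme == "bootstrap":   # with replacement
        return [rng.choice(num_points, size=n, replace=True) for _ in range(b)]
    raise ValueError(scheme)
```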
As expected, "disjoint" (closely followed by "random") is consistently and notably better than "bootstrap" because it introduces more independence between the hash functions and learns from more data overall (since each hash function uses the same training set size).

Figure 2: Panels 1-2: effect of the proportion of features d/D used in ILHf and ILHitf. Panel 3: bootstrap vs random vs disjoint training sets in ILHt. Panel 4: precision as a function of the number of hash functions b for different methods. All results show precision using a training set of N = 5 000 points of Infinite MNIST dataset. Errorbars over 5 random training sets. Ground truth: all points with the same label as the query. Retrieved set: k = 10 000 nearest neighbors of the query.

Precision as a function of b Fig. 2 (panel 4) shows the precision (in the test set) as a function of the number of bits b for ILHt, where the solution for b + 1 bits is obtained by adding a new bit to the solution for b. Since the hash functions obtained depend on the order in which we add the bits, we show 5 such orders (red curves). 
Remarkably, the precision increases nearly monotonically and\ncontinues increasing beyond b = 200 bits (note the prediction error in bagging ensembles typically\nlevels off after around 25\u201350 decision trees; [22, p. 186]). This is (at least partly) because the\neffective training set size is proportional to b. The variance in the precision decreases as b increases.\nIn contrast, for KSHcut the variance is larger and the precision barely increases after b = 80. The\nhigher variance for KSHcut is due to the fact that each b value involves training from scratch and we\ncan converge to a relatively different local optimum. As with ILHt, adding LSH random projections\n(again 5 curves for different orders) increases precision monotonically, but can only reach a low\nprecision at best, since it lacks supervision. We also show the curve for thresholded PCA (tPCA),\nwhose precision tops at around b = 30 and decreases thereafter. A likely explanation is that high-\norder principal components essentially capture noise rather than signal, i.e., random variation in\nthe data, and this produces random codes for those bits, which destroy neighborhood information.\nBagging tPCA (here, using ensembles where each member has 16 principal components, i.e., 16\nbits) [23] does make tPCA improve monotonically with b, but the result is still far from competitive.\nThe reason is the low diversity among the ensemble members, because the top principal components\ncan be accurately estimated even from small samples.\n\nIs the precision gap between KSH and ILHt due to an incomplete optimization of the KSH objective,\nor to bad local optima? We veri\ufb01ed that 1) random perturbations of the KSHcut optimum lower\nthe precision; 2) optimizing KSHcut using the ILHt codes as initialization (\u201cKSHcut-ILHt\u201d curve)\nincreases the precision but it still remains far from that of ILHt. 
This confirms that the optimization algorithm is doing its job, and that the ILHt diversity mechanism is superior to coupling the hash functions in a joint objective.

Are the codes orthogonal? The result of learning binary hashing is b functions, represented by a b × D matrix W of real weights for linear SVMs, and an N × b matrix Z of binary (−1, +1) codes for the entire dataset. We define a measure of code orthogonality as follows. Define b × b matrices CZ = (1/N) Z^T Z for the codes and CW = W W^T for the weights (assuming normalized SVM weights). Each C matrix has entries in [−1, 1], equal to a normalized dot product of codes or weight vectors, and diagonal entries equal to 1. (Note that any matrix SCS where S is diagonal with ±1 entries is equivalent, since reverting a hash function's output does not alter the Hamming distances.) Perfect orthogonality happens when C = I, and is encouraged by many binary hashing methods.

Fig. 3 shows this for ILHt in CIFAR (N = 58 000 training points of dim. D = 320). It plots CZ as an image, as well as the histogram of the entries of CZ and CW. The histograms also contain, as a control, the histogram corresponding to normalized dot products of random vectors (of dimension N or D, respectively), which is known to tend to a delta function at 0 as the dimension grows. 
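The code-orthogonality measure CZ = (1/N) Z^T Z is simple to compute; a NumPy sketch (function name ours) that also illustrates the random-vector control, since random ±1 codes give near-zero off-diagonal entries:

```python
import numpy as np

def code_orthogonality(Z):
    # C_Z = Z^T Z / N: normalized dot products between the b code columns.
    # Off-diagonal entries near 0 mean nearly orthogonal bits.
    N = Z.shape[0]
    C = (Z.T @ Z) / N
    off = C[~np.eye(C.shape[0], dtype=bool)]  # the b*(b-1) off-diagonal entries
    return C, off

rng = np.random.default_rng(0)
Z = rng.choice([-1, 1], size=(10000, 8))   # random ±1 codes: near-orthogonal
C, off = code_orthogonality(Z)
```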
Although CW has some tendency to orthogonality as the number of bits b increases, it is clear that, for both codes and weight vectors, the distribution of dot products is wide, far from strict orthogonality. Hence, enforcing orthogonality does not seem necessary to achieve good hash functions and codes.

Figure 3: Orthogonality of codes (b × b images and left histogram) and of hash function weight vectors (right histogram) in CIFAR.

Comparison with other binary hashing methods We compare with both the original KSH [28] and its min-cut optimization KSHcut [26], and a representative subset of affinity-based and unsupervised hashing methods: Supervised Binary Reconstructive Embeddings (BRE) [21], Supervised Self-Taught Hashing (STH) [36], Spectral Hashing (SH) [35], Iterative Quantization (ITQ) [16], Binary Autoencoder (BA) [8], thresholded PCA (tPCA), and Locality-Sensitive Hashing (LSH) [1]. We create affinities ynm for all the affinity-based methods using the dataset labels. For each training point xn, we use as similar neighbors 100 points with the same labels as xn; and as dissimilar neighbors 100 points chosen randomly among the points whose labels are different from that of xn. For all datasets, all the methods are trained using a subset of 5 000 points. 
Given that KSHcut already performs well [26] and that ILHt consistently outperforms it both in precision and runtime, we expect ILHt to be competitive with the state of the art. Fig. 4 shows this is generally the case, particularly as the number of bits b increases, when ILHt beats all other methods, which are not able to increase precision as much as ILHt does.

Runtime Training a single ILHt hash function (in a single processor) for the CIFAR dataset with N = 2 000, 5 000 and 20 000 takes 1.2, 2.8 and 22.5 seconds, respectively. This is much faster than other affinity-based hashing methods (for example, for 128 bits with 5 000 points, BRE did not converge after 12 hours). KSHcut is among the faster methods. Its runtime per min-cut pass over a single bit is comparable to ours, but it needs b sequential passes to complete just one alternating optimization iteration, while our b functions can be trained in parallel.

Summary ILHt achieves a remarkably high precision compared to a coupled KSH objective using the same optimization algorithm but introducing diversity by feeding different data to independent hash functions rather than by jointly optimizing over them. It also compares well with state-of-the-art methods in precision/recall, being competitive if few bits are used and the clear winner as more bits are used, and is very fast and embarrassingly parallel.

4 Discussion

We have revealed for the first time a connection between supervised binary hashing and ensemble learning that could open the door to many new hashing algorithms. Although we have focused on a specific objective and identified a specific diversity mechanism (disjoint training sets) as particularly successful with it, other choices may be better depending on the application. The core idea we propose is the independent training of the hash functions via the introduction of diversity by means other than coupling terms in the objective or constraints.
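The disjoint-subset diversity mechanism can be sketched as follows. This is a minimal sketch: fit_random_hyperplane is a hypothetical stand-in per-bit learner, not the paper's single-bit Laplacian optimization with a linear-SVM fit, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_independent_bits(X, b, fit_bit):
    """Train b single-bit hash functions independently, each on its own
    disjoint subset of X (the diversity mechanism); trivially parallel."""
    idx = rng.permutation(len(X))
    subsets = np.array_split(idx, b)   # disjoint training sets, one per bit
    # Each per-bit learner sees only its subset. In the paper, fit_bit would
    # optimize a single-bit Laplacian objective and fit a linear SVM to the
    # resulting codes; here it is a pluggable stand-in.
    return [fit_bit(X[s]) for s in subsets]

def hash_codes(X, functions):
    """Apply the b hash functions and concatenate their bits into -1/+1 codes."""
    return np.stack([h(X) for h in functions], axis=1)

def fit_random_hyperplane(Xs):
    """Hypothetical per-bit learner: a random hyperplane thresholded at the
    subset mean -- NOT the paper's objective, just a placeholder."""
    w = rng.standard_normal(Xs.shape[1])
    t = Xs.mean(axis=0) @ w
    return lambda X: np.where(X @ w - t >= 0, 1, -1)

X = rng.standard_normal((600, 32))
funcs = train_independent_bits(X, b=8, fit_bit=fit_random_hyperplane)
Z = hash_codes(X, funcs)
print(Z.shape)                          # (600, 8)
```

Because each bit depends only on its own subset, the b calls to fit_bit can run in parallel with no synchronization, and adding or removing bits never requires retraining the others.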
This may come as a surprise in the area of learning binary hashing, where most work has focused on proposing complex objective functions that couple all b hash functions and on developing sophisticated optimization algorithms for them.

Another surprise is that orthogonality of the codes or hash functions seems unnecessary. ILHt creates codes and hash functions that do differ from each other but are far from being orthogonal, yet they achieve good precision that keeps growing as we add bits. Thus, introducing diversity through different training data seems a better mechanism to make hash functions differ than coupling the codes through an orthogonality constraint or otherwise. It is also far simpler and faster to train independent single-bit hash functions.

A final surprise is that the wide variety of affinity-based objective functions in the b-bit case reduces to a binary quadratic problem in the 1-bit case, regardless of the form of the b-bit objective (as long as it depends on Hamming distances only). In this sense, there is a unique objective in the 1-bit case.

There has been a prior attempt to use bagging (bootstrapped samples) with truncated PCA [23]. Our experiments show that, while this improves truncated PCA, it performs poorly in supervised hashing. This is because PCA is unsupervised and does not use the user-provided similarity information, which may disagree with Euclidean distances in image space; and because estimating principal components from samples has low diversity.
Also, PCA is computationally simple and there is little gain in bagging it, unlike the far more difficult optimization of supervised binary hashing.

Some supervised binary hashing work [28, 34] has proposed to learn the b hash functions sequentially, where the ith function has an orthogonality-like constraint to force it to differ from the previous functions. Hence, this does not learn the functions independently and can be seen as a greedy optimization of a joint objective over all b functions.

Figure 4: Comparison with binary hashing methods in precision and precision/recall, using linear SVMs as hash functions and different numbers of bits b, for CIFAR and Inf. MNIST.

Binary hashing does differ from ensemble learning in one important point: the predictions of the b classifiers (= b hash functions) are not combined into a single prediction, but are instead concatenated into a binary vector (which can take 2^b possible values).
The \u201clabels\u201d (the binary codes) for the\n\u201cclassi\ufb01ers\u201d (the hash functions) are unknown, and are implicitly or explicitly learned together with\nthe hash functions themselves. This means that well-known error decompositions such as the error-\nambiguity decomposition [20] and the bias-variance decomposition [15] do not apply. Also, the real\ngoal of binary hashing is to do well in information retrieval measures such as precision and recall,\nbut hash functions do not directly optimize this. A theoretical understanding of why diversity helps\nin learning binary hashing is an important topic of future work.\n\nIn this respect, there is also a relation with error-correcting output codes (ECOC) [11], an approach\nfor multiclass classi\ufb01cation. In ECOC, we represent each of the K classes with a b-bit binary vector,\nensuring that b is large enough for the vectors to be suf\ufb01ciently separated in Hamming distance. Each\nbit corresponds to partitioning the K classes into two groups. We then train b binary classi\ufb01ers, such\nas decision trees. Given a test pattern, we output as class label the one closest in Hamming distance\nto the b-bit output of the b classi\ufb01ers. The redundant error-correcting codes allow for small errors in\nthe individual classi\ufb01ers and can improve performance. An ECOC can also be seen as an ensemble\nof classi\ufb01ers where we manipulate the output targets (rather than the input features or training set)\nto obtain each classi\ufb01er, and we apply majority vote on the \ufb01nal result (if the test output in classi\ufb01er\ni is 1, then all classes associated with 1 get a vote). The main bene\ufb01t of ECOC seems to be in\nvariance reduction, as in other ensemble methods. Binary hashing can be seen as an ECOC with\nN classes, one per training point, with the ECOC prediction for a test pattern (query) being the\nnearest-neighbor class codes in Hamming distance. 
However, unlike in ECOC, in binary hashing the codes are learned so that they preserve neighborhood relations between training points. Also, while ideally all N codes should be different (since a collision makes two originally different patterns indistinguishable, which will degrade some searches), this is not guaranteed in binary hashing.

5 Conclusion

Much work in supervised binary hashing has focused on designing sophisticated objectives of the hash functions that force them to compete with each other while trying to preserve neighborhood information. We have shown, surprisingly, that training hash functions independently is not just simpler, faster and parallel, but can also achieve better retrieval quality, as long as diversity is introduced into each hash function's objective function. This establishes a connection with ensemble learning and allows one to borrow techniques from it. We showed that having each hash function optimize a Laplacian objective on a disjoint subset of the data works well, and facilitates selecting the number of bits to use. Although our evidence is mostly empirical, the intuition behind it is sound and in agreement with the many results (also mostly empirical) showing the power of ensemble classifiers. The ensemble learning perspective suggests many ideas for future work, such as pruning a large ensemble or using other diversity techniques. It may also be possible to characterize theoretically the performance in precision of binary hashing depending on the diversity of the hash functions.

Acknowledgments
Work supported by NSF award IIS–1423515.

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Comm. ACM, 51(1):117–122, Jan. 2008.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

[3] E.
Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Math., Nov. 15 2002.

[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 2001.

[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.

[6] L. J. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.

[7] M. Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. ICML, 2010.

[8] M. Á. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with binary autoencoders. CVPR, 2015.

[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.

[10] T. G. Dietterich. Ensemble methods in machine learning. Springer-Verlag, 2000.

[11] T. G. Dietterich and G. Bakiri. Solving multi-class learning problems via error-correcting output codes. J. Artificial Intelligence Research, 2:253–286, 1995.

[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, Aug. 2008.

[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[14] T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. ECCV, 2014.

[15] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, Jan. 1992.

[16] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. PAMI, 2013.

[17] K. Grauman and R. Fergus. Learning binary hash codes for large-scale image search. In Machine Learning for Computer Vision, pages 49–87. Springer-Verlag, 2013.

[18] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts?
PAMI, 2003.

[19] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Apr. 8 2009.

[20] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. NIPS, 1995.

[21] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. NIPS, 2009.

[22] L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2014.

[23] C. Leng, J. Cheng, T. Yuan, X. Bai, and H. Lu. Learning binary codes with bagging PCA. ECML, 2014.

[24] B. Lin, J. Yang, X. He, and J. Ye. Geodesic distance function learning via heat flows on vector fields. ICML, 2014.

[25] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. ICCV, 2013.

[26] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. CVPR, 2014.

[27] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. ICML, 2011.

[28] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. CVPR, 2012.

[29] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, Neural Information Processing Series, pages 301–320. MIT Press, 2007.

[30] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Computer Vision, 42(3):145–175, May 2001.

[31] R. Raziperchikolaei and M. Á. Carreira-Perpiñán. Optimizing affinity-based binary hashing using auxiliary coordinates. NIPS, 2016.

[32] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 22 2000.

[33] G. Shakhnarovich, P. Indyk, and T. Darrell, editors.
Nearest-Neighbor Methods in Learning and Vision. Neural Information Processing Series. MIT Press, Cambridge, MA, 2006.

[34] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large scale search. PAMI, 2012.

[35] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2009.

[36] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. SIGIR, 2010.